reproducibility | UCSC OSPO

Reconfigurable and Placement-Aware Replication for Edge Systems

Sat, 31 Jan 2026 00:00:00 +0000

Project Description

Topics: Distributed systems
Skills: Rust, Java, Go, Python, Bash scripting, Linux, Docker.
Difficulty: Hard
Size: Large (350 hours)
Mentors: Fadhil I. Kurnia

Modern replicated systems are typically evaluated under static configurations with fixed replica placement. However, real-world edge deployments are highly dynamic: workloads shift geographically, edge nodes join or fail, and latency conditions change over time. Our existing testbed provides reproducible evaluation for replicated systems but lacks support for dynamic reconfiguration and adaptive edge placement policies.

This project extends the existing open testbed to support:

Dynamic Replica Reconfiguration
- Membership changes (add/remove replicas)
- Leader migration and shard movement
- Online reconfiguration cost measurement (latency spikes, recovery overhead, state transfer cost)
Edge-Aware Placement Policies
- Demand-aware placement based on geographic workload skew
- Latency-aware and bandwidth-aware replica selection
- Comparison of static vs. adaptive placement strategies
- Evaluation under real-world latency matrices (e.g., US metro-level or cloud region traces)
What-if Simulation Framework
- Replay workload traces with time-varying demand
- Simulate hundreds of edge sites with realistic network conditions
- Quantify trade-offs between consistency, availability, reconfiguration overhead, and cost
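
As one illustration of the "online reconfiguration cost measurement" item above, the sketch below derives a latency spike and a recovery time from a latency trace. The function name, the trace format (timestamp in seconds, latency in milliseconds), and the 1.5x tolerance are assumptions made for this example, not part of the existing testbed.

```python
from statistics import median


def reconfiguration_cost(samples, reconfig_time, tolerance=1.5):
    """Summarize the latency impact of one reconfiguration event.

    samples: list of (timestamp_s, latency_ms) tuples sorted by time.
    Returns (baseline_ms, peak_ms, recovery_s); recovery_s is None if the
    latency never settles back below tolerance * baseline.
    """
    before = [lat for t, lat in samples if t < reconfig_time]
    after = [(t, lat) for t, lat in samples if t >= reconfig_time]
    if not before or not after:
        raise ValueError("need samples on both sides of the reconfiguration")

    baseline = median(before)               # steady-state latency
    peak = max(lat for _, lat in after)     # height of the spike
    threshold = tolerance * baseline

    # Walk backwards to find the start of the final "settled" stretch.
    recovered_at = None
    for t, lat in reversed(after):
        if lat > threshold:
            break
        recovered_at = t
    recovery = None if recovered_at is None else recovered_at - reconfig_time
    return baseline, peak, recovery


trace = [(0, 10), (1, 10), (2, 10), (3, 50), (4, 30), (5, 11), (6, 10)]
print(reconfiguration_cost(trace, reconfig_time=3))
```

A median baseline is used here so that a single outlier before the event does not distort the comparison; other summaries (for example a tail percentile) may suit some experiments better.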

The outcome will be an open-source framework that enables researchers to evaluate not only steady-state replication performance but also how systems behave under churn, scaling events, and demand shifts, which are central challenges in real edge environments.

Expected Deliverables

Reconfiguration abstraction layer (API for membership & placement changes)
Placement policy plugin framework (k-means, facility-location heuristics, latency-minimizing, cost-aware)
Trace-driven dynamic workload engine
Public benchmark scenarios and reproducible experiment scripts
Artifact-ready documentation and evaluation report
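
To make the "placement policy plugin framework" deliverable more concrete, here is a minimal sketch of what a policy registry with one latency-minimizing policy might look like. Everything in it is hypothetical: the `register` decorator, the `POLICIES` table, and the data layout are illustrative assumptions, and the policy is a simple greedy heuristic rather than the k-means or facility-location approaches listed above.

```python
from typing import Callable, Dict, List

# latency[client_region][site] = round-trip latency in ms (hypothetical layout)
Latency = Dict[str, Dict[str, float]]
Policy = Callable[[Latency, Dict[str, float], int], List[str]]

POLICIES: Dict[str, Policy] = {}


def register(name: str):
    """Decorator that adds a placement policy to the plugin table."""
    def wrap(fn: Policy) -> Policy:
        POLICIES[name] = fn
        return fn
    return wrap


def mean_latency(latency: Latency, demand: Dict[str, float], sites: List[str]) -> float:
    """Demand-weighted latency when each region uses its nearest replica."""
    total = sum(demand.values())
    return sum(w * min(latency[r][s] for s in sites) for r, w in demand.items()) / total


@register("latency-greedy")
def latency_greedy(latency: Latency, demand: Dict[str, float], k: int) -> List[str]:
    """Pick k sites one at a time, each time adding the site that lowers
    the demand-weighted mean latency the most."""
    candidates = sorted({s for row in latency.values() for s in row})
    chosen: List[str] = []
    while len(chosen) < k and len(chosen) < len(candidates):
        best = min((s for s in candidates if s not in chosen),
                   key=lambda s: mean_latency(latency, demand, chosen + [s]))
        chosen.append(best)
    return chosen


latency = {"west": {"sfo": 5, "nyc": 70}, "east": {"sfo": 70, "nyc": 5}}
demand = {"west": 3, "east": 1}  # geographic workload skew
print(POLICIES["latency-greedy"](latency, demand, 1))
```

With a registry like this, static and adaptive strategies could be compared by calling different entries of `POLICIES` on the same latency matrix and demand trace.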

Reproducible CXL Emulation

Fri, 30 Jan 2026 00:00:00 +0000

Compute Express Link (CXL) is an emerging memory interconnect standard that enables shared, coherent memory across CPUs, accelerators, and multiple hosts, unlocking new possibilities in hyperscale, HPC, and disaggregated systems. However, because access to real multi-host CXL hardware is limited, it is difficult for researchers and students to experiment with, evaluate, and reproduce results on advanced CXL topologies. OCEAN (Open-source CXL Emulation At Hyperscale) [https://github.com/cxl-emu/OCEAN] is a full-stack CXL emulation platform built on QEMU that enables detailed emulation of CXL 3.0 memory systems, including multi-host shared memory pools, coherent fabric topologies, and latency modeling. This project will create reproducible experiment pipelines, automated deployment workflows, and user-friendly tutorials so that others can reliably run and extend CXL emulation experiments without requiring specialized hardware.

Reproducible CXL Emulation for Multi-Host Memory Systems

Streamline multi-host CXL emulation without specialized hardware.

Topics: CXL emulation, Memory Systems, Reproducibility
Skills: C/C++, Virtualization (QEMU), Scripting, Performance Modeling
Difficulty: Medium
Size: Large (350 hours)
Mentors: Mujahid Al Rafi, Luanzheng "Lenny" Guo.

Tasks:

Create automated deployment scripts and configuration templates for OCEAN-based CXL emulation topologies (single-host and multi-host).
Develop a standardized experiment harness for running memory performance benchmarks (e.g., OSU micro-benchmarks, STREAM-style tests) in emulated CXL environments.
Build reproducible experiment pipelines that others can run to evaluate latency, bandwidth, and scaling properties of CXL memory systems.
Produce tutorials, documentation, and reproducibility artifacts to guide new users through setup, execution, and analysis.
Package and contribute all scripts, configurations, and documentation back to the OCEAN open-source repository.

Exploring Security and Isolation in CXL-Based Memory Systems

Investigate security and isolation properties of CXL-based memory systems using software emulation.

Topics: CXL, Systems Security, Memory Isolation, Side Channel, Emulation
Skills: C/C++, Virtualization (QEMU), Scripting, Computer Architecture, Security
Difficulty: Medium
Size: Large (350 hours)
Mentors: Mujahid Al Rafi, Luanzheng "Lenny" Guo.

Tasks:

Study the CXL memory model and fabric architecture to identify potential security and isolation risks in multi-host shared memory environments (e.g., contention, timing variation, and resource interference).
Set up multi-host or multi-VM CXL emulation environments using OCEAN that mimic realistic multi-tenant deployments.
Design and implement reproducible micro-benchmarks to measure timing, bandwidth contention, or observable interference through shared CXL memory pools.
Analyze how fabric configuration choices (e.g., topology, latency injection, memory partitioning, or allocation policies) affect isolation and leakage behavior.
Explore and prototype mitigation strategies—such as memory partitioning, throttling, or policy-driven allocation—and evaluate their effectiveness using the emulation platform.

StatWrap

Thu, 29 Jan 2026 00:00:00 +0000

StatWrap is a free and open-source assistive, non-invasive discovery and inventory tool to document research projects. It inventories project assets (e.g., code files, data files, manuscripts, documentation) and organizes information without additional input from the user. It also provides structure for users to add searchable and filterable notes connected to files to help communicate metadata about intent and analysis steps.

At its core, StatWrap helps investigators identify and track changes in a research project as it evolves - which may affect reproducibility. For example: (1) people on the project can change over time, so processes may not be consistently executed due to transitions in employment; (2) data changes over time, due to accruing additional cases, adding new variables, or correcting mistakes in existing data; (3) software (e.g. used for data preparation and statistical analysis) evolves as it is edited, improved, and optimized; and (4) software can break or produce different results due to changes ‘under the hood’ such as updates to statistical packages, compilers, or interpreters. StatWrap passively and actively documents these changes to support reproducibility.

Group and Individual Customizations

Topics: configuration, user interface
Skills: JavaScript, React
Difficulty: Medium
Size: Large (350 hours)
Mentors: Luke Rasmussen, Eric Whitley

The goal of this project is to expand the existing capabilities of StatWrap to provide more flexibility to individual users and groups. Currently, features within StatWrap such as the directory template for creating new projects and the reproducibility checklist are static, meaning everyone who downloads StatWrap has the same configuration. However, each user and team works differently and should be able to configure StatWrap to support their needs.

When a user creates a new project, StatWrap provides a collection of project templates. These create a directory hierarchy, along with some seed files (e.g., a README.md file in the project root). Different groups have their own conventions for creating project directories. While StatWrap can be released with additional project templates defined, there are many situations in which users would want to keep their project template local. StatWrap should allow a user to create a project template configuration, either from scratch or seeded from the contents of an existing project. A user should then be able to export this configuration and share it with others, and other users should be able to import the configuration into their instance of StatWrap.

Similarly, StatWrap provides a reproducibility checklist that includes six existing checklist items. However, individual users and groups may have their own checklists, including institution-specific steps. Similar to the project template, a user should be able to configure additional items for the checklist. A user should be able to create a “checklist template” that can be used and applied in multiple projects. A specific project’s template should also be modifiable once the checklist has been created.

The specific tasks of the project include:

Develop a configuration scheme for New Project templates
Provide a way for a user to import/export a template for New Projects
Develop a configuration scheme for Reproducibility Checklist questions
Provide a way for a user to import/export a template for the Reproducibility Checklist
Develop a configuration scheme for asset (file) attributes
Develop unit tests and conduct system testing

Final Report for Smart Environments

Wed, 05 Nov 2025 00:00:00 +0000

Introduction

The process of creating the necessary software environment for code to run is a significant challenge in software development. Given a piece of open-source software intended for research, setting up the environmental dependencies to run the software could take significant manual effort. Existing automation methods struggle due to the complexity of managing diverse languages, dependencies, and hardware. In Smart Environments, I have created ENVAGENT, a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.

To assess this capability, a new benchmark, ENVBENCH, was created, containing 54 popular projects across seven languages. Results show ENVAGENT dramatically improves environment construction compared to current agents (+16.2%). Furthermore, the system shows initial promise in dynamically adjusting cloud-based hardware resources based on the code’s needs.

Method

EnvAgent

The EnvAgent I created during my time at OSRE utilizes a multi-agent workflow to automatically build software execution environments. The process is structured into three phases: preparation, construction, and refinement.

Phase 1 (Preparation): Specialized agents collect information about the software repository – its structure, relevant files, and the host system’s hardware specifications (CPU, memory, etc.). This data is then used by a planning agent to generate a detailed, step-by-step instruction set for creating a functional Dockerfile.

Phase 2 (Construction): Two agents work in tandem: one generates or modifies the Dockerfile based on the plan, while the other executes the Dockerfile within an isolated container, capturing any errors.

Phase 3 (Refinement): A final agent analyzes the container execution data, identifying areas for improvement in the Dockerfile. This process repeats until a stable, executable environment is achieved.

To improve efficiency, EnvAgent incorporates rule-based tools for predictable tasks like directory setup and log management, reducing the need for complex agent reasoning. This combination of intelligent agents and automated routines (“scaffolding”) ensures a robust and adaptive system.
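
The construction and refinement phases described above amount to a generate, execute, refine loop. The sketch below shows that control flow only; the three callables stand in for the agents, and their names, signatures, and the `max_rounds` cap are assumptions made for illustration rather than EnvAgent's actual interfaces.

```python
def build_environment(plan, generate, execute, refine, max_rounds=5):
    """Iterate until the Dockerfile builds and runs, or the round cap is hit.

    generate(plan, feedback) -> Dockerfile text          (construction agent)
    execute(dockerfile)      -> (ok: bool, log: str)     (execution agent)
    refine(dockerfile, log)  -> feedback for next round  (refinement agent)
    Returns (dockerfile, rounds_used) on success, (None, max_rounds) otherwise.
    """
    dockerfile = generate(plan, feedback=None)
    for round_no in range(1, max_rounds + 1):
        ok, log = execute(dockerfile)
        if ok:
            return dockerfile, round_no
        feedback = refine(dockerfile, log)
        dockerfile = generate(plan, feedback=feedback)
    return None, max_rounds


# Toy stand-ins for the agents, just to exercise the loop.
def fake_generate(plan, feedback=None):
    extra = "RUN pip install numpy\n" if feedback else ""
    return "FROM python:3.12\n" + extra


def fake_execute(dockerfile):
    return ("numpy" in dockerfile, "ModuleNotFoundError: numpy")


def fake_refine(dockerfile, log):
    return "install the missing module named in the log"


print(build_environment("plan", fake_generate, fake_execute, fake_refine))
```

Capping the number of rounds is a design choice of this sketch; it keeps a build that cannot be repaired from looping indefinitely.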

EnvEval Benchmark

In addition to the agent, one significant contribution is the manual curation of a benchmark that measures the quality of generated environments. EnvEval is a benchmark specifically designed to assess environment setup quality across 54 carefully curated open-source repositories. The repositories are drawn from both Chameleon reproducible artifacts and the Multi-SWE-bench dataset. EnvEval contains JSON rubrics that can be used to automatically determine the quality of constructed environments.

Each rubric is divided into three parts, corresponding to three major objectives that a successfully constructed environment should have:

Structure: Checks for basic directory structure, file presence, and environment variables.
Configuration: Asks the question “Is this configured?” by checking whether dependencies have been correctly configured.
Functionality: Asks the question “Is this usable?” by running actual tests to see whether the expected functionality is present.

There are many tests in each category, and their weights are adjusted based on their importance.
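
A weighted rubric of this kind could be scored as in the sketch below. The JSON layout, the check identifiers, and the weights are invented for illustration and do not reflect the actual EnvEval rubric schema.

```python
import json

# Hypothetical rubric: three categories, each a list of weighted checks.
RUBRIC = json.loads("""
{
  "structure":     [{"id": "workdir_exists",  "weight": 1}],
  "configuration": [{"id": "deps_installed",  "weight": 2}],
  "functionality": [{"id": "unit_tests_pass", "weight": 3},
                    {"id": "cli_runs",        "weight": 2}]
}
""")


def score(rubric, passed):
    """Return (per-category fraction earned, overall score out of 100).

    passed: set of check ids that succeeded in the built environment.
    """
    per_category = {}
    earned_total = possible_total = 0
    for category, checks in rubric.items():
        possible = sum(c["weight"] for c in checks)
        earned = sum(c["weight"] for c in checks if c["id"] in passed)
        per_category[category] = earned / possible
        earned_total += earned
        possible_total += possible
    return per_category, 100 * earned_total / possible_total


per_category, overall = score(RUBRIC, {"workdir_exists", "deps_installed", "cli_runs"})
print(per_category, overall)
```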

Evaluation

Baseline Systems:

The study compared EnvAgent to two established automated code generation systems: one utilizing Anthropic’s advanced reasoning models and the other employing OpenAI’s code-focused models. These systems were chosen for their strong performance in creating software code and their prevalence in automated engineering processes. Both baselines were given full access to the target software repositories and complete details about the host system’s hardware.

Evaluation Metrics:

The performance of EnvAgent was assessed using three key metrics: the ability to create working environments, the quality of those environments, and a single combined score. Results showed EnvAgent significantly outperformed the baselines, achieving a 33.91% improvement in the final overall score and reaching 74.01, compared with the best baseline score of 30.10. This suggests EnvAgent both produced more functional environments and ensured greater accuracy through extensive testing.

Conclusion

The process of creating the necessary software environments for code agents is a major hurdle in scaling up research and development. Currently, this task relies heavily on manual labor. To address this, a new system, ENVAGENT, was created to automatically build these environments using intelligent agents and by understanding dependencies. A new benchmark, ENVBENCH, was also developed to assess this system’s effectiveness. Preliminary results demonstrate a significant improvement – ENVAGENT achieved a 33.91% increase in success rates compared to existing automated agents, representing a substantial step towards more efficient and reproducible research.

Thank you!

; 20251105-Sam_Huang

Final Report: Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows

Thu, 18 Sep 2025 00:00:00 +0000

Final Report: Applying MLOps to Overcome Reproducibility Barriers in ML

Background

Hello! I’m Ahmed Alghali, and this is my final report for the project Applying MLOps to Overcome Reproducibility Barriers in ML, under the mentorship of Professor Fraida Fund and Mohamed Saeed.

This project aims to address the reproducibility problem in machine learning—both in core ML research and in applications to other areas of science.

The focus is on making large-scale ML experiments reproducible on Chameleon Cloud. To do this, we developed ReproGen, a template generator that produces ready-to-use, reproducible ML training workflows. The goal is to make it easy for researchers to set up experiments in the cloud without worrying about the complexity of stitching everything together.

Progress Since Mid-Report

Migration from Cookiecutter to Copier

We initially used Cookiecutter as the templating engine, but it lacked features we were interested in (e.g., conditional questions). We switched to Copier, which provides more flexibility and better matches our use case.

Support for Multiple Setup Modes

We now offer two setup modes, designed to serve both beginners and users who want advanced options/customization:

Basic Mode – minimal prompts (project name, repository link, framework).
Advanced Mode – detailed control (compute site, GPU type, CUDA version, storage site, etc.).

This ensures accessibility for new users while still enabling fine-grained control for advanced users.

Automated Credential Generation

Previously, users had to manually generate application credentials (via the Horizon OpenStack UI). Now, we provide scripts that can generate two types of credentials programmatically (Swift and EC2) using Chameleon JupyterHub credentials with python-chi and the openstack-sdk client.

Automatic README.md Generation

Each generated project includes a customized README.md, containing setup guidance and commands tailored to the user’s configuration.

Bug Fixes and UX Enhancements

Alongside major features, we implemented numerous smaller changes and fixes to improve the reliability and user experience of the tool.

Deliverables

ReproGen GitHub Repository: source code for the template generator.
mlflow-replay branch: explore a past experiment, artifacts, and logged insights.
LLM-Demo branch: hands-on demo to track fine-tuning of an LLM using infrastructure generated by ReproGen.

Next Steps

Compatibility Matrix
- Both the tool and the generated setup depend on software whose compatibility requires attention at every level: hardware, OS, drivers, computing platforms, and core and third-party libraries. Writing documentation for this is a first step toward easier debugging in the future and toward adding new pieces without breaking what is already there.
Maintain Docker Images

So far we have CPU and GPU Docker images for the most frequently used frameworks:
- CPU-based image: for data science workloads (Scikit-Learn)
- GPU-NVIDIA variant: for deep learning workloads on NVIDIA machines (PyTorch, Lightning, TensorFlow)
- GPU-AMD variant: for deep learning workloads on AMD machines (PyTorch, Lightning, TensorFlow)

Adding more variants for more frameworks and enhancing the experience of the existing images is recommended.

Reflection

When I first joined SoR 2025, I had trouble crystallizing how I could practically achieve reproducibility and package a tool that would maximize the chance of reproducing experiments built with it. Throughout the journey, my mentors took me under their wings and helped me understand the reproducibility challenges in ML. My mentor Professor Fraida Fund wrote materials that saved me a lot of time in familiarizing myself with the testbed and with important Linux tools and commands, and even gave me hands-on practice with how large-model training with an MLflow tracking server is done in the cloud. Mohamed Saeed took the time to review my presentation, pushing me to do my best. I’m forever thankful for the way they shaped the project and my personal growth. This hands-on experience helped me view MLOps, cloud APIs, and workflow design through a different lens, and I’m proud to have contributed a tool that can help simplify reproducible research for others.

Final Report: A Systematic Investigation into the Reproducibility of RAG Systems

Fri, 05 Sep 2025 00:00:00 +0000

I’m Baiqiang, and this is the final report for the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project, mentored by Luanzheng “Lenny” Guo and Dongfang Zhao. This project successfully developed a novel framework to quantitatively measure reproducibility in AI systems, yielding several surprising and impactful results.

The Challenge: The Need for Systematic Measurement

Retrieval-Augmented Generation (RAG) is a cornerstone of AI for science, but its reliability is often compromised by non-determinism. While this issue was a known concern, a fundamental challenge was the lack of standardized tools and methodologies to systematically measure and quantify the sources of this inconsistency. Without a rigorous way to analyze the problem, it was difficult to move beyond ad-hoc tests and establish the true root causes, hindering the development of truly trustworthy AI systems for science.

Our Contribution: The ReproRAG Framework

To address this gap, the central contribution of this project is ReproRAG, a comprehensive, open-source benchmarking framework. ReproRAG is designed to systematically investigate sources of uncertainty across the entire RAG pipeline by:

Isolating Variables: It allows for controlled experiments on embedding models, numerical precision, retrieval algorithms, hardware configurations (CPU/GPU), and distributed execution environments.
Quantifying Uncertainty: It employs a suite of metrics—including Exact Match Rate, Jaccard Similarity, and Kendall’s Tau—to precisely measure the impact of each variable on the final retrieved results.
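
For readers unfamiliar with the three metrics named above, the sketch below shows one common way to compute them for ranked lists of retrieved document IDs. These are textbook definitions written for illustration; ReproRAG's own implementations and exact conventions (for example, how documents missing from one list are handled) may differ.

```python
from itertools import combinations


def exact_match_rate(runs_a, runs_b):
    """Fraction of queries whose ranked result lists are identical."""
    matches = sum(1 for a, b in zip(runs_a, runs_b) if list(a) == list(b))
    return matches / len(runs_a)


def jaccard(a, b):
    """Set overlap of two result lists, ignoring rank order."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def kendall_tau(a, b):
    """Kendall's tau (no ties) over the documents present in both rankings."""
    rank_b = {doc: i for i, doc in enumerate(b)}
    common = [doc for doc in a if doc in rank_b]   # kept in a's order
    pairs = list(combinations(common, 2))          # x precedes y in a
    if not pairs:
        return 1.0
    concordant = sum(1 for x, y in pairs if rank_b[x] < rank_b[y])
    return (2 * concordant - len(pairs)) / len(pairs)


run1 = ["d1", "d2", "d3", "d4"]
run2 = ["d1", "d3", "d2", "d5"]
print(exact_match_rate([run1, run1], [run1, run2]))
print(jaccard(run1, run2))
print(kendall_tau(run1, run2))
```

Two identical runs give 1.0 on all three functions, which corresponds to the "perfect run-to-run reproducibility" case discussed in the findings.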

Key Findings: A New Hierarchy of Uncertainty

Our large-scale empirical study using ReproRAG challenged common assumptions and established a clear hierarchy of what actually impacts reproducibility.

Core Algorithms Are Not the Problem: Our most surprising finding is that modern retrieval libraries like FAISS are perfectly reproducible out-of-the-box. Across all tested index types (including approximate ones like HNSW and IVF) and execution environments (single-node CPU/GPU and multi-node distributed systems), we achieved perfect run-to-run reproducibility (1.000 scores on all metrics) when environmental factors like random seeds were controlled. This falsifies the common hypothesis that approximate nearest neighbor algorithms are a primary source of randomness.
Embedding Model Choice is a Dominant Source of Variation: We found that the choice of the embedding model is a dominant factor driving result variation. When comparing outputs from different state-of-the-art models (BGE, E5, Qwen) for the same query, the agreement was very low (e.g., Overlap Coefficient of ~0.43-0.54). This means a scientific conclusion drawn with one model may not be reproducible with another, as they are fundamentally “seeing” different evidence.
Environmental Factors Introduce Measurable “Drift”:
- Numerical Precision: Changing floating-point precision (e.g., FP32 vs. FP16) was a guaranteed source of variation, but it caused a small and quantifiable “embedding drift” rather than chaotic changes.
- Data Insertion: Incrementally adding new data to an index caused a predictable “displacement” of old results, not a re-shuffling. The relative ranking of the remaining original documents was perfectly stable (Kendall’s Tau of 1.000).
Common Determinism Flags Can Be Ineffective: Our tests showed that popular software-level controls, like cudnn.deterministic flags in PyTorch, had no observable effect on the output of modern transformer-based embedding models. This underscores the necessity of empirical validation over assuming that framework settings work as advertised.
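
Two of the measurements mentioned in these findings are easy to state in code. The sketch below uses the standard overlap coefficient, |A ∩ B| / min(|A|, |B|), and a simple check for the "displacement, not re-shuffling" behavior reported for data insertion. The function names and example document IDs are illustrative assumptions, not ReproRAG code.

```python
def overlap_coefficient(a, b):
    """Shared documents divided by the size of the smaller result set."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / min(len(sa), len(sb))


def is_pure_displacement(before, after):
    """True if documents that appear in both result lists keep their
    relative order, i.e. new documents only push old ones down or out."""
    before_set, after_set = set(before), set(after)
    survivors = [doc for doc in after if doc in before_set]
    expected = [doc for doc in before if doc in after_set]
    return survivors == expected


# Results from two different embedding models for the same query.
model_a = ["d1", "d2", "d3", "d4"]
model_b = ["d3", "d4", "d5", "d6"]
print(overlap_coefficient(model_a, model_b))

# Top-4 results before and after inserting a new document "n1".
print(is_pure_displacement(["d1", "d2", "d3", "d4"], ["d1", "n1", "d2", "d3"]))
```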

Conclusion

This project successfully shifted the focus of the RAG reproducibility problem. The key challenge is not to fix supposedly “random” algorithms, but to rigorously control the entire experimental environment. We delivered ReproRAG, a framework that empowers researchers to do just that. Our findings provide actionable insights for the community: efforts to improve reproducibility should focus less on the retrieval algorithms themselves and more on disciplined management of embedding models, data versioning, and numerical precision.

Final Report: MPI Appliance for HPC Research on Chameleon

Mon, 01 Sep 2025 00:00:00 +0000

Hi everyone, this is my final report for the project I completed as a Summer of Reproducibility (SOR) student. The project, titled “MPI Appliance for HPC Research on Chameleon,” was undertaken in collaboration with Argonne National Laboratory and the Chameleon Cloud community, mentored by Ken Raffenetti, and completed over the summer. This blog details the work and outcomes of the project.

Background

Message Passing Interface (MPI) is the backbone of high-performance computing (HPC), enabling efficient scaling across thousands of processing cores. However, reproducing MPI-based experiments remains challenging due to dependencies on specific library versions, network configurations, and multi-node setups.

To address this, we introduce a reproducibility initiative that provides standardized MPI environments on the Chameleon testbed. This is set up as a master–worker MPI cluster. The master node manages tasks and communication, while the worker nodes do the computations. All nodes have the same MPI libraries, software, and network settings, making experiments easier to scale and reproduce.

Objectives

The aim of this project is to create an MPI cluster that is reproducible, easily deployable, and efficiently configurable.

The key objectives of this project were:

Pre-built MPI Images: Create ready-to-use images with MPI and all dependencies installed.
Automated Cluster Configuration: Develop Ansible playbooks to configure master–worker communication, including host setup, SSH key distribution, and MPI configuration across nodes.
Cluster Orchestration: Develop an orchestration template to provision resources and invoke Ansible playbooks for automated cluster setup.

Implementation Strategy and Deliverables

OpenStack Image Creation

The first step was to create a standardized pre-built image, which serves as the base image for all nodes in the cluster.

Some important features of the image include:

Built on Ubuntu 22.04 for a stable base environment.
Spack + Lmod integration:
- Spack handles reproducible, version-controlled installations of software packages.
- Lmod (Lua Modules) provides a user-friendly way to load/unload software environments dynamically.
- Together, they allow users to easily switch between MPI versions, libraries, and GPU toolkits.
MPICH and OpenMPI are pre-installed for standard MPI support and can be loaded or unloaded as needed.
Three image variants for various HPC workloads: CPU-only, NVIDIA GPU (CUDA 12.8), and AMD GPU (ROCm 6.4.2).

These images have been published and are available in the Chameleon Cloud Appliance Catalog:

MPI and Spack for HPC (Ubuntu 22.04) - CPU Only
MPI and Spack for HPC (Ubuntu 22.04 - CUDA) - NVIDIA GPU (CUDA 12.8)
MPI and Spack for HPC (Ubuntu 22.04 - ROCm) - AMD GPU (ROCm 6.4.2)

Cluster Configuration using Ansible

The next step is to create scripts/playbooks to configure these nodes and set up an HPC cluster. We assigned specific roles to different nodes in the cluster and combined them into a single playbook to configure the entire cluster automatically.

Some key steps the playbook performs:

Configure /etc/hosts entries for all nodes.
Mount Manila NFS shares on each node.
Generate an SSH key pair on the master node and add the master’s public key to the workers’ authorized_keys.
Scan worker node keys and update known_hosts on the master.
(Optional) Manage software:
- Install new compilers with Spack
- Add new Spack packages
- Update environment modules to recognize them
Create a hostfile at /etc/mpi/hostfile.
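
To show what two of these steps produce, the sketch below renders /etc/hosts entries and an MPI hostfile from a mapping of node names to IP addresses. It is a stand-alone illustration in Python, not the project's Ansible playbook; the node names and addresses are made up, and the hostfile uses the simplest form (one host name per line), which MPICH and Open MPI launchers both accept.

```python
def render_hosts_entries(nodes):
    """Lines to append to /etc/hosts so nodes can resolve each other by name."""
    return "\n".join(f"{ip}\t{name}" for name, ip in nodes.items()) + "\n"


def render_hostfile(nodes):
    """Contents for the MPI hostfile (e.g. /etc/mpi/hostfile): one host per line."""
    return "\n".join(nodes) + "\n"


# Hypothetical cluster: one master and two workers.
nodes = {
    "master": "10.0.0.10",
    "worker-1": "10.0.0.11",
    "worker-2": "10.0.0.12",
}
print(render_hosts_entries(nodes), end="")
print(render_hostfile(nodes), end="")
```

Generating both files from a single node mapping keeps name resolution and the MPI launcher's view of the cluster consistent, which is the property the playbook is meant to guarantee.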

The code is publicly available and can be found on the GitHub repository: https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact

Orchestration

With the image now created and deployed, and the Ansible scripts ready for cluster configuration, we put everything together to orchestrate the cluster deployment.

This can be done in two primary ways:

Python-CHI (Jupyter) + Ansible

Python-CHI is a Python library designed to facilitate interaction with the Chameleon testbed; it is often used within environments like Jupyter notebooks.

The setup proceeds as follows:

Create leases, launch instances, and set up shared storage using python-chi commands.
Automatically generate inventory.ini for Ansible based on launched instances.
Run Ansible playbook programmatically using ansible_runner.
Outcome: fully configured, ready-to-use HPC cluster; SSH into master to run examples.
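
The inventory-generation step above can be sketched as follows. The group names (`master`, `workers`), the instance dictionaries, and the addresses are assumptions made for this example; `ansible_host` and `ansible_user` are standard Ansible inventory variables, and `cc` is used here as the login user on the assumption that the nodes run Chameleon-provided images.

```python
def render_inventory(master, workers, user="cc"):
    """Build the text of an INI-style Ansible inventory.

    master: dict with "name" and "ip"; workers: list of such dicts.
    """
    def host_line(node):
        return f"{node['name']} ansible_host={node['ip']} ansible_user={user}"

    lines = ["[master]", host_line(master), "", "[workers]"]
    lines += [host_line(w) for w in workers]
    return "\n".join(lines) + "\n"


# Hypothetical instances, standing in for what the launch step would return.
master = {"name": "mpi-master", "ip": "192.0.2.10"}
workers = [
    {"name": "mpi-worker-1", "ip": "192.0.2.11"},
    {"name": "mpi-worker-2", "ip": "192.0.2.12"},
]

with open("inventory.ini", "w") as fh:
    fh.write(render_inventory(master, workers))
print(render_inventory(master, workers), end="")
```

The resulting file can then be handed to the playbook run, for instance through ansible_runner as described in the steps above.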

If you would like to see a working example, you can view it in the Trovi example.

Heat Orchestration Template

A Heat Orchestration Template (HOT) is a YAML-based configuration file. Its purpose is to define a stack that automates the deployment and configuration of OpenStack cloud resources.

Challenges

We faced some challenges while working with Heat templates and stacks, particularly on Chameleon Cloud:

OS::Nova::Keypair (new version): In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.
OS::Heat::SoftwareConfig: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.

To tackle these challenges, we designed an approach that is both easy to implement and reproducible. First, we launch instances by provisioning master and worker nodes using the HOT template in OpenStack. Next, we set up a bootstrap node, install Git and Ansible, and run an Ansible playbook from the bootstrap node to configure the master and worker nodes, including SSH, host communication, and MPI setup. The outcome is a fully configured, ready-to-use HPC cluster, where users can simply SSH into the master node to run examples.

Users can view and use the template published in the Appliance Catalog: MPI+Spack Bare Metal Cluster. An example demonstrating how to pass parameters is available on Trovi.

Conclusion

In conclusion, this work demonstrates a reproducible approach to building and configuring MPI clusters on the Chameleon testbed. By using standardized images, Ansible automation, and Orchestration Templates, we ensure that every node is consistently set up, reducing manual effort and errors. The artifact, published on Trovi, makes the entire process transparent, reusable, and easy to implement, enabling users/researchers to reliably recreate and extend the cluster environment for their own experiments.

Future Work

Maintaining these images and possibly creating a script to reproduce MPI and Spack on a different image base environment.

Final Update (Mid-Term -> Final): MPI Appliance for HPC Research on Chameleon

Sun, 31 Aug 2025 00:00:00 +0000

Hi everyone! This is my final update, covering the progress made every two weeks from the midterm to the end of the project MPI Appliance for HPC Research on Chameleon, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community. This blog follows up on my earlier post, which you can find here.

🔧 July 29 – August 11, 2025

With the CUDA- and MPI-Spack–based appliances published, we considered releasing another image variant (ROCm-based) for AMD GPUs. This will be primarily used in CHI@TACC, which provides AMD GPUs. We have successfully published a new image on Chameleon titled MPI and Spack for HPC (Ubuntu 22.04 - ROCm), and we also added an example to demonstrate its usage.

🔧 August 12 – August 25, 2025

With the examples now available on Trovi for creating an MPI cluster using Ansible and Python-CHI, my next step was to experiment with stack orchestration using Heat Orchestration Templates (HOT) on OpenStack Chameleon Cloud. This turned out to be more challenging due to a few restrictions:

OS::Nova::Keypair (new version): In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.
OS::Heat::SoftwareConfig: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.

To address these issues, we adopted a new strategy for configuring and creating the MPI cluster: using a temporary bootstrap node.

In simple terms, the workflow of the Heat template is:

Provision master and worker nodes via the HOT template on OpenStack.
Launch a bootstrap node, install Git and Ansible on it, and then run an Ansible playbook from the bootstrap node to configure the master and worker nodes. This includes setting up SSH, host communication, and the MPI environment.

This provides an alternative method for creating an MPI cluster.

We presented this work on August 26, 2025, to the Chameleon Team and the Argonne MPICH Team. The project was very well received.

Stay tuned for my final report on this work, which I’ll be sharing in my next blog post.

[Final Blog] Distrobench: Distributed Protocol Benchmark

Sat, 30 Aug 2025 00:00:00 +0000

Introduction

This is the final blog for our contribution to the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges project under the mentorship of Fadhil Kurnia for the OSRE program.

Distrobench is a framework to evaluate the performance of replication/coordination protocols for distributed systems. This framework standardizes benchmarking by allowing different protocols to be tested under an identical workload, and supports both local and remote deployment of the protocols. The protocols tested are restricted to implementations that expose a key-value store application, and are categorized by consistency model, programming language, and persistency (whether the implementation stores its data in memory or on disk).

All the benchmark results are stored in a data.json file which can be viewed through a webpage we have provided. A user can clone the git repository, benchmark different protocols on their own machine or in a cluster of remote machines, then view the results locally. We also provided a webpage that shows our own benchmark results which ran on 3 Amazon EC2 t2.micro instances.

How to run a benchmark on Distrobench

Before running a benchmark using Distrobench, the protocol that will be benchmarked must first be built. This is to allow the script to initialize the protocol instance for local benchmark or to send the binaries into the remote machine. The remote machine running the protocol does not need to store the code for the protocol implementations, but does require dependencies for running that specific protocol such as Java, Docker, rsync, etc. The following are commands used to build the ailidani/paxi project which does not need any additional dependency to be run inside of a remote machine:

# Clone the Distrobench repository 
git clone git@github.com:fadhilkurnia/distro.git

# Clone the Paxi repository and build the binary 
cd distro/sut/ailidani.paxi
git clone git@github.com:ailidani/paxi.git
cd paxi/bin/
./build.sh

# Go back to the Distrobench root directory & run python script 
cd ../../../..
python main.py

By default, the script will start 3 local instances of a Paxi protocol implementation that the user chose through the CLI. The user can modify the number of running instances and whether or not it is deployed locally or in a remote machine by changing the contents of the .env file inside the root directory. The following is the contents of the default .env file:

NUM_OF_NODES=3

SSH_KEY=ssh-key.pem
REMOTE_USERNAME=ubuntu

PUBLIC_IP1=127.0.0.1
PUBLIC_IP2=127.0.0.1
PUBLIC_IP3=127.0.0.1

PRIVATE_IP1=127.0.0.1
PRIVATE_IP2=127.0.0.1
PRIVATE_IP3=127.0.0.1

CLIENT_IP=127.0.0.1

OUTPUT=data.json

When running a remote benchmark, an SSH key must also be added to the root directory so that the Python script can use ssh and rsync. All machines must also allow TCP connections on ports 2000-2300 and 3000-3300, because these ranges are used for communication between the running instances as well as for the YCSB benchmark. Running the benchmark requires at least 3 nodes, the minimum that most protocols support (5 nodes are recommended).
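As a rough illustration of how the .env values above could be read and checked, here is a minimal sketch. It is not Distrobench's actual loader: the helper names `parse_env`, `load_nodes`, and `is_local` are hypothetical, and only the keys and the 3-node minimum come from the post.

```python
MIN_NODES = 3  # minimum node count stated in the post

# Default .env contents as listed above.
DEFAULT_ENV = """\
NUM_OF_NODES=3
SSH_KEY=ssh-key.pem
REMOTE_USERNAME=ubuntu
PUBLIC_IP1=127.0.0.1
PUBLIC_IP2=127.0.0.1
PUBLIC_IP3=127.0.0.1
PRIVATE_IP1=127.0.0.1
PRIVATE_IP2=127.0.0.1
PRIVATE_IP3=127.0.0.1
CLIENT_IP=127.0.0.1
OUTPUT=data.json
"""


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def load_nodes(env: dict[str, str]) -> list[dict[str, str]]:
    """Collect the PUBLIC_IPn / PRIVATE_IPn pair for each of NUM_OF_NODES nodes."""
    count = int(env["NUM_OF_NODES"])
    if count < MIN_NODES:
        raise ValueError(f"need at least {MIN_NODES} nodes, got {count}")
    return [
        {"public_ip": env[f"PUBLIC_IP{i}"], "private_ip": env[f"PRIVATE_IP{i}"]}
        for i in range(1, count + 1)
    ]


def is_local(nodes: list[dict[str, str]]) -> bool:
    """Treat the run as local when every node points at 127.0.0.1."""
    return all(node["public_ip"] == "127.0.0.1" for node in nodes)


env = parse_env(DEFAULT_ENV)
nodes = load_nodes(env)
print(len(nodes), "nodes, local run:", is_local(nodes))
```

Raising an error when a PUBLIC_IPn key is missing (a KeyError here) surfaces a mismatch between NUM_OF_NODES and the listed addresses before any instance is started.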

To view the benchmark result in the web page locally, move data.json into the docs/ directory and run python -m http.server 8000. The page is then accessible through http://localhost:8000.

Deep dive on how Distrobench works

The following is the project structure of the Distrobench repository:

distro/
├── main.py // Main python script for running benchmark
├── data.json // Output file for main.py
├── README.md
├── .env // Config for running the benchmark
├── docs/
│   ├── index.html // Web page to show benchmark results
│   ├── data.json // Output file displayed by web page
│   └── README.md
├── src/
│   ├── utils/
│   └── ycsb/ // Submodule for YCSB
└── sut/ // Systems under test
    ├── ailidani.paxi/
    │   └── run.py // Protocol-specific benchmark script called by main.py
    ├── apache.zookeeper/
    ├── etcd-io.etcd/
    ├── fadhilkurnia.xdn/
    ├── holipaxos-artifect.holipaxos/
    ├── otoolep.hraftd/
    └── tikv.tikv/

main.py will automatically detect directories inside sut/ and will call the main function inside run.py. The following is the structure of run.py written in pseudocode style:

FUNCTION main(run_ycsb: Function, nodes: List of Nodes, ssh: Dictionary)
    node_data = map_ip_port(nodes)

    SWITCH user_input
        CASE 0:
            start()
            RETURN
        CASE 1:
            stop()
            RETURN
        CASE 2:
            client_data = []
            FOR EACH item IN node_data
                ADD item.client_addr TO client_data
            END FOR
            run_ycsb(client_data)
            RETURN
    END SWITCH
END FUNCTION

FUNCTION start()
    // Start the protocol instance (local or remote)
END FUNCTION

FUNCTION stop()
    // Stop the protocol instance (local or remote)
END FUNCTION

FUNCTION map_ip_port(nodes: List of Nodes) -> List of Dictionary
    // Generate port numbers based on the protocol requirements
END FUNCTION
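The pseudocode above could translate into Python roughly as follows. This is a sketch under assumptions, not the code of any existing run.py: nodes are represented as dictionaries with `public_ip` and `private_ip` keys, the port scheme in `map_ip_port` is invented (it only stays inside the 2000-2300 and 3000-3300 ranges mentioned earlier), and the extra `choice` parameter exists only so the menu can be driven without keyboard input.

```python
def map_ip_port(nodes, peer_base=2000, client_base=3000):
    """Give every node a peer address and a client address (hypothetical port scheme)."""
    return [
        {
            "peer_addr": f"{node['private_ip']}:{peer_base + index}",
            "client_addr": f"{node['public_ip']}:{client_base + index}",
        }
        for index, node in enumerate(nodes)
    ]


def start(node_data, ssh):
    """Placeholder: write configs, copy binaries if remote, launch the instances."""
    print(f"would start {len(node_data)} instances")


def stop(node_data, ssh):
    """Placeholder: kill the protocol processes and optionally clean up binaries."""
    print(f"would stop {len(node_data)} instances")


def main(run_ycsb, nodes, ssh, choice=None):
    """Entry point called by main.py; `choice` is an added, optional test hook."""
    node_data = map_ip_port(nodes)
    if choice is None:
        choice = int(input("0 = start, 1 = stop, 2 = benchmark: "))
    if choice == 0:
        start(node_data, ssh)
    elif choice == 1:
        stop(node_data, ssh)
    elif choice == 2:
        # Hand only the client-facing addresses to the YCSB runner.
        run_ycsb([item["client_addr"] for item in node_data])
    else:
        raise ValueError(f"unknown menu option: {choice}")


# Demo: three local nodes, with a stand-in for run_ycsb that records its input.
demo_nodes = [{"public_ip": "127.0.0.1", "private_ip": "127.0.0.1"} for _ in range(3)]
seen = []
main(seen.extend, demo_nodes, ssh={}, choice=2)
print(seen)
```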

The .env file provides both public and private IP addresses to add versatility when running a remote benchmark. Private IPs are used for communication between remote machines when they are in the same network group. In our own benchmark, four t2.micro EC2 instances are deployed in the same network group: three of them run the protocol and the fourth acts as the YCSB client. You can use your local machine as the YCSB client instead of another remote machine by setting CLIENT_IP in the .env file to 127.0.0.1. We chose a remote machine as the YCSB client to minimize the impact of network latency between the client and the protocol servers.

The main tasks of the start() function can be broken down into the following:

Generate custom configuration files for each remote machine instance (this may differ between implementations: some do not require a config file because they support flag parameters out of the box, while others require multiple configuration files for each instance)
rsync binaries into the remote machine (If running a remote benchmark)
Start the instances

The stop() function is a lot simpler since it only kills the process running the protocol and optionally removes the copied binary files in the remote machine. The run_ycsb() function passed onto run.py is defined in main.py and currently supports two types of workload:

Read-heavy: A single-client workload with 95% read and 5% update (write) operations
Update-heavy: A single-client workload with 50% read and 50% update (write) operations

A new workload can be added inside the src/ycsb/workloads directory. Both workloads above only run 1000 operations for the benchmark which may not be enough operations to properly evaluate the performance of the protocols. It should also be noted that while YCSB does support a scan operation, it is never used for our benchmark because none of our tested protocols implement this operation.
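For reference, the two workloads can be expressed as YCSB CoreWorkload property files. The sketch below renders such a file from Python. The property names are standard YCSB core properties; the record count and the zipfian request distribution are assumptions borrowed from YCSB's stock workloads, not values confirmed by the post, and `render_workload` is a hypothetical helper.

```python
def render_workload(read: float, update: float,
                    operations: int = 1000, records: int = 1000) -> str:
    """Render a YCSB CoreWorkload properties file for a read/update mix."""
    if abs(read + update - 1.0) > 1e-9:
        raise ValueError("read and update proportions must sum to 1")
    props = {
        "workload": "site.ycsb.workloads.CoreWorkload",
        "recordcount": records,          # assumed value
        "operationcount": operations,    # 1000 operations, as stated in the post
        "readproportion": read,
        "updateproportion": update,
        "scanproportion": 0,             # scans are never used in this benchmark
        "insertproportion": 0,
        "requestdistribution": "zipfian",  # assumed value
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


read_heavy = render_workload(read=0.95, update=0.05)
update_heavy = render_workload(read=0.5, update=0.5)
print(read_heavy)
```

Raising `operations` is the simplest way to address the concern that 1000 operations may be too few for a stable measurement.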

How to implement a new protocol in Distrobench

Adding a new protocol to Distrobench requires implementing two main components: a Python integration script (run.py) and a YCSB database binding for benchmarking.

Create the protocol directory structure
- Create a new directory under sut/ using format yourrepo.yourprotocol/.
Write run.py integration
- Put script inside yourrepo.yourprotocol/ directory
- Must have the main(run_ycsb, nodes, ssh) function.
- Add start/stop/benchmark menu options
- Handle local (127.0.0.1) and remote deployment
Create YCSB client
- Make Java class extending YCSB’s DB class
- Put inside src/ycsb/yourprotocol/src/main/java/site/ycsb/yourprotocol
- Implement read(), insert(), update(), delete() methods
Register your client
- Register your client to src/pom.xml, src/ycsb/bin/binding.properties, and src/ycsb/bin/ycsb.
Build and test
- Run cd src/ycsb && mvn clean package
- Run python main.py
- Select your protocol and test it
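To show why a new directory under sut/ is enough for the protocol to appear in the menu, here is a sketch of how directory discovery could work. It is an assumption about the mechanism, not a copy of main.py: `discover_suts` and `load_main` are hypothetical names. Because directory names such as yourrepo.yourprotocol contain dots, they cannot be imported as ordinary packages, so the sketch loads run.py by file path.

```python
import importlib.util
import re
import tempfile
from pathlib import Path


def discover_suts(sut_root: Path) -> dict[str, Path]:
    """Map each sut/<name>/ directory that contains run.py to that script."""
    found = {}
    for child in sorted(sut_root.iterdir()):
        script = child / "run.py"
        if child.is_dir() and script.is_file():
            found[child.name] = script
    return found


def load_main(script: Path):
    """Import run.py from its file path and return its main function."""
    module_name = "sut_" + re.sub(r"\W", "_", script.parent.name)
    spec = importlib.util.spec_from_file_location(module_name, script)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.main


# Demo with a throwaway directory tree standing in for sut/.
with tempfile.TemporaryDirectory() as tmp:
    sut_dir = Path(tmp) / "yourrepo.yourprotocol"
    sut_dir.mkdir()
    (sut_dir / "run.py").write_text(
        "def main(run_ycsb, nodes, ssh):\n"
        "    return f'configured {len(nodes)} nodes'\n"
    )
    suts = discover_suts(Path(tmp))
    entry = load_main(suts["yourrepo.yourprotocol"])
    result = entry(None, ["n1", "n2", "n3"], {})

print(sorted(suts), result)
```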

Protocols which have been tested

Distrobench has tested 20 different distributed consensus protocols across 7 different implementation projects.

ailidani/paxi
- Programming Language : Go
- Persistency : On-Disk
- Consistency Model : Linearizability, Eventual
- Protocol : Paxos, EPaxos, SDpaxos, WPaxos, ABD, chain, VPaxos, WanKeeper, KPaxos, Paxos_groups, Dynamo, Blockchain, M2Paxos, HPaxos.
apache/zookeeper
- Programming Language : Java
- Persistency : On-Disk
- Consistency Model : Linearizability + Primary Integrity
- Protocol : ZAB (ZooKeeper Atomic Broadcast)
etcd-io/etcd
- Programming Language : Go
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Raft
fadhilkurnia/xdn
- Programming Language : Java, Rust
- Persistency : On-Disk
- Consistency Model : Linearizability, Linearizability + Primary Integrity
- Protocol : Gigapaxos
Zhiying12/holipaxos-artifect
- Programming Language : Go, Rust
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Holipaxos, Omnipaxos, Multipaxos
otoolep/hraftd
- Programming Language : Go
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Raft
tikv/tikv
- Programming Language : Rust
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Raft

Challenges

When attempting to benchmark HoliPaxos, the main challenge was handling versions that rely on persistent storage with RocksDB. Since some implementations are written in Go, it was necessary to find compatible versions of RocksDB and gRocksDB (for example, RocksDB 10.5.1 works with gRocksDB 1.10.2). Another difficulty was that RocksDB is resource-intensive to compile, and in our project we did not have sufficient CPU capacity on the remote machine to build RocksDB and run remote benchmarks.
Some projects did not compile successfully at first and required minor modifications to run.

Conclusion and future improvements

The current benchmark results show the performance of all the mentioned protocols in terms of throughput and benchmark runtime. The results are subject to revision because they may not reflect the best achievable performance of each protocol, given that the deployment scripts are not yet optimized. We are also planning to switch to a more powerful EC2 instance type because t2.micro does not have enough resources to support RocksDB or TiKV.

In the near future, additional features will be added to Distrobench such as:

Multi-Client Support: The YCSB client will start multiple clients which will send requests in parallel to different servers in the group.
Commit Versioning: Allows the labelling of all benchmark results with the commit hash of the protocol’s repository version. This allows comparing different versions of the same project.
Adding more Primary-Backup, Sequential, Causal, and Eventual consistency protocols: Implementations that support a consistency model other than linearizability and also ship with a key-value store application are difficult to find.
Benchmark on node failure
Benchmark on the addition of a new node

Final Blog: Improving Usability and Performance in cc-snapshot

Sun, 24 Aug 2025 00:00:00 +0000

My name is Zahra Temori, and I’m thrilled to collaborate with mentor Paul Marshall during this summer on the cc-snapshot project.

Introduction

Reproducibility is an important concept in high performance computing and research. It ensures that experiments can be repeated, validated, and extended with confidence. Achieving a reproducible environment requires identical software stacks, with the exact same dependencies, and configuration. The Chameleon Cloud testbed provides the cc-snapshot tool to support reproducibility by capturing the complete state of a running system. This allows researchers to rerun experiments exactly as before, share setups among each other, and avoid potential environmental issues such as missing dependencies or version mismatches. In this work, we explore how to enhance snapshotting as a reproducible method and make it an effective strategy for HPC research.

Key Achievements

The project was divided into two phases. The first phase focused on usability, reorganizing the tool, and expanding its capabilities. The second phase was benchmarking to evaluate alternative image formats and compression methods to improve snapshotting performance.

Usability Enhancements: The original snapshotting tool had challenges including a limited command line, tightly coupled logic, and minimal testing support, which made it difficult for users to interact with and developers to maintain. To enhance the command line interface, we added a flag to disable automatic updates, giving users more control over when to pull the latest version. We also added a dry-run flag to simulate actions before running a snapshot, allowing developers to test and run safely. Moreover, we implemented support for a custom source path, enabling snapshots of specific directories. This helps developers test smaller directories rather than full snapshots, which can be more complicated when testing functionalities. To improve maintainability, we refactored the codebase into five modular functions, allowing developers to make future changes more easily. In addition, we added automated tests with GitHub Actions to validate new and existing features and ensure that changes work as expected.
Performance Optimization: The default snapshot format and compression was QCOW2 with zlib, which often resulted in long snapshot creation times. To address this performance issue, we benchmarked alternatives such as QCOW2 with zstd compression and RAW with no compression. We used three images of varying sizes: small (4.47 GiB), medium (7.62 GiB), and large (12.7 GiB). The medium-sized image was user-created, to demonstrate that snapshotting and compression work for both Chameleon-supported images and user-created images.

Results: We ran each image with the different compression methods and recorded four key metrics: creation time, upload time, boot time, and final image size. To evaluate which method performed better, we calculated the overall time of each compression method across the experiments on the three image sizes. The results revealed that zstd compression reduced the creation time by around 80.6% across the three image sizes. The upload time for zstd was nearly equal to that of zlib, while RAW images, being uncompressed and larger, uploaded much more slowly than images compressed with zlib or zstd. The boot time was nearly the same for the zlib and zstd images, confirming that the two take about the same time to decompress, while RAW images take longer to boot due to their larger size. Our work suggests that QCOW2 with zstd compression should be used instead of QCOW2 with zlib compression when creating a snapshot. This enables researchers to generate and share reproducible environments faster.
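To make the compared configurations concrete, the sketch below builds the qemu-img command line for each format and shows how a percentage reduction in creation time is computed. It assumes the image is produced with qemu-img convert, which the post does not state, and that the installed QEMU supports the qcow2 compression_type option; `convert_command` and `percent_reduction` are hypothetical helpers, and nothing here is taken from the cc-snapshot source.

```python
def convert_command(src: str, dst: str, fmt: str = "qcow2", compression="zstd"):
    """Build a qemu-img convert command; compression applies to qcow2 only."""
    cmd = ["qemu-img", "convert", "-O", fmt]
    if fmt == "qcow2" and compression is not None:
        # -c enables compression, compression_type selects zlib or zstd.
        cmd += ["-c", "-o", f"compression_type={compression}"]
    return cmd + [src, dst]


def percent_reduction(baseline_seconds: float, candidate_seconds: float) -> float:
    """Relative time saved by the candidate, as a percentage of the baseline."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# The three configurations discussed in the post.
for fmt, compression in (("qcow2", "zlib"), ("qcow2", "zstd"), ("raw", None)):
    print(" ".join(convert_command("disk.img", f"snapshot.{fmt}", fmt, compression)))
```

The commands are only printed, not executed, so the sketch can be inspected on a machine without QEMU installed.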

Conclusion and Future Work

Snapshotting is a practical way to support reproducibility in HPC, but to be effective, it should be easy to use and fast enough for real research workflows. Our results show that zstd compression can cut snapshot creation time by over 80% compared to the default zlib compression, without affecting upload or boot performance. Looking ahead, we plan to integrate zstd, try it on more workloads and image types, and explore ways to improve snapshotting for even greater speedups and reliable results.

Deliverables

Repository: All comprehensive analysis code and source code can be found in the CC-SNAPSHOT GitHub Repository.

End-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing

Sat, 23 Aug 2025 00:00:00 +0000

Introduction

Hello everyone!
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the StatWrap: Cross-Project Searching and Classification using Local Indexing project, my proposal, under the mentorship of Luke Rasmussen, focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.

About the Project

As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for researchers to discover relevant projects, notes, and assets across both current and archived work, using information that is either user-entered or passively collected by StatWrap.

Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:

Evaluating open-source search libraries suitable for local indexing and retrieval
Building the full-text search functionality directly into the StatWrap UI to allow seamless querying across projects
Ensuring reliability through the development of unit tests and comprehensive system testing
Implementing a classification system to label projects as “Active,” “Pinned,” or “Past” within the user interface

This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.

Deliverables

The project has reached the end of its scope after 12 weeks of work. Here’s a breakdown:

1. Descriptive Comparison of Open-Source Libraries

Compared various open-source search libraries based on evaluation criteria such as indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation, and developer experience. We decided on a weight for each of these features and used the weighted comparison to identify the best library to use. According to the weights we assigned, FlexSearch was the library we built on.

These results were obtained after tuning the hyperparameters to give the best set of results. For huge data, FlexSearch has the lowest memory usage, followed by MiniSearch. Because the examples we used were limited, MiniSearch showed the better memory usage in our tests. Along with the research and evaluation, I consulted the Performance Benchmark of Full-Text-Search Libraries (Stress Test), available here.

The benchmark was measured in terms per second; higher values are better (except for the “Memory” test). The memory value refers to the amount of memory that was additionally allocated during search.

FlexSearch reports query performance up to 1,000,000 times faster than other libraries, while also providing powerful search capabilities such as multi-field search (document search), phonetic transformations, partial matching, tag search, result highlighting, and suggestions. Bigger workloads can be scaled through workers, which perform updates or queries to the index in parallel on dedicated, balanced threads.

2. The Search User Interface

3. Complete Search Execution Pipeline

4. FlexSearch Features

1. Persistent Indexing with Automatic Loading

Index persistence: Search index automatically saves to disk and loads on startup
Fast restoration: Rebuilds FlexSearch indices from saved document store without re-scanning files
Incremental updates: Detects project changes and updates only modified content
Background processing: Index updates happen asynchronously without blocking the User Interface.

2. Multi-Document Type Support

Unified search: Single search interface for projects, files, people, notes, and assets
Type-specific indices: Separate FlexSearch indices optimized for each document type
Cross-reference capabilities: Documents can reference and link to each other
Flexible schema: Each document type has tailored fields for optimal search performance

3. Intelligent File Content Indexing

Configurable file size limits: Admin-controlled maximum file size for content indexing
Smart file detection: Automatically identifies text files by extension and filename patterns
Content extraction: Full-text indexing with snippet generation for search results
Performance optimization: Skips binary files and respects size constraints to maintain speed

4. Advanced Query Processing

Multi-strategy search: Combines exact matches, fuzzy search, partial matches, and contextual search
Query preprocessing: Removes stop words and applies linguistic filters
Relevance scoring: Custom scoring algorithm considering multiple factors:
- Exact phrase matches (highest weight)
- Individual word matches
- Term frequency with logarithmic capping
- Position-based scoring (earlier matches rank higher)
- Proximity bonuses for terms appearing near each other
- Completeness penalties for missing query terms
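The scoring factors listed above can be combined in many ways. The sketch below shows one possible combination in Python, purely for illustration: the weights, the tokenizer, and the proximity window are invented, and this is not the scoring code used in StatWrap.

```python
import math
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, document: str) -> float:
    """Combine the listed factors into one relevance score (illustrative weights)."""
    q_terms = tokenize(query)
    d_terms = tokenize(document)
    if not q_terms or not d_terms:
        return 0.0

    total = 0.0
    # Exact phrase match carries the highest weight; padding avoids partial-word hits.
    if f" {' '.join(q_terms)} " in f" {' '.join(d_terms)} ":
        total += 10.0

    first_positions = {}
    for term in set(q_terms):
        hits = [i for i, t in enumerate(d_terms) if t == term]
        if not hits:
            continue
        first_positions[term] = hits[0]
        total += 2.0                              # individual word match
        total += min(math.log1p(len(hits)), 2.0)  # term frequency, log-scaled and capped
        total += 1.0 / (1 + hits[0])              # earlier matches rank higher

    # Proximity bonus: neighbouring matched terms within 3 tokens of each other.
    ordered = sorted(first_positions.values())
    for left, right in zip(ordered, ordered[1:]):
        if right - left <= 3:
            total += 1.0

    # Completeness penalty for query terms that never appear.
    total -= 1.5 * (len(set(q_terms)) - len(first_positions))
    return total


docs = ["the search index is fast", "index of fast search tools", "nothing relevant"]
ranked = sorted(docs, key=lambda d: score("search index", d), reverse=True)
print(ranked)
```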

5. Real-Time Search Suggestions

Autocomplete support: Dynamic suggestions based on indexed document titles
Search history: Maintains recent searches for quick re-execution
Debounced input: Prevents excessive API calls during typing
Contextual suggestions: Suggestions adapt based on current filters and context

6. Comprehensive Filtering System

Type filtering: Filter by document type (projects, files, people, etc.)
Project scoping: Limit searches to specific projects
File type filtering: Filter files by extension
Advanced search panel: Collapsible interface for power users
Filter persistence: Maintains filter state across searches

7. Performance Monitoring & Analytics

Real-time metrics: Track search times, cache hit rates, and index statistics
Performance dashboard: Visual indicators for system health
Cache management: LRU cache with configurable size and TTL
Search analytics: Historical data on search patterns and performance
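As an illustration of the cache management item above, here is a minimal LRU cache with a configurable size and TTL that also tracks its hit rate. It is written in Python for consistency with the other examples in this feed and is not the StatWrap implementation; the class name and the injectable `clock` parameter are assumptions made for the sketch.

```python
import time
from collections import OrderedDict


class LRUCacheTTL:
    """Least-recently-used cache whose entries also expire after a fixed TTL."""

    def __init__(self, max_size=100, ttl_seconds=300.0, clock=time.monotonic):
        self.max_size = max_size
        self.ttl = ttl_seconds
        self.clock = clock              # injectable so expiry can be tested
        self._entries = OrderedDict()   # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self.clock():
            # Missing or expired: drop any stale entry and count a miss.
            self._entries.pop(key, None)
            self.misses += 1
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return entry[1]

    def put(self, key, value):
        self._entries[key] = (self.clock() + self.ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict the least recently used

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = LRUCacheTTL(max_size=2, ttl_seconds=60.0)
cache.put("query:strace", ["result-1"])
print(cache.get("query:strace"), cache.hit_rate())
```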

8. Index Management Tools

Export/Import functionality: Backup and restore search indices
Full reindexing: Complete index rebuild with progress tracking
Index deletion: Clean slate functionality for troubleshooting
File size adjustment: Modify indexing constraints and rebuild affected content
Index statistics: Detailed breakdown of indexed content by type and project

9. Robust Error Handling & Resilience

Graceful degradation: System continues operating even with partial index corruption
File system error handling: Handles missing files, permission issues, and path changes
Memory management: Prevents memory leaks during large indexing operations
Recovery mechanisms: Automatic fallback to basic search if advanced features fail

10. User Experience Enhancements

Keyboard shortcuts: Ctrl+K to focus search, Escape to clear
Result highlighting: Visual emphasis on matching terms in results
Expandable results: Drill down into detailed information for each result
Loading states: Clear feedback during indexing and search operations
Responsive tabs: Organized results by type with badge counts

5. Classification of Active and Past Projects

A classification system is added within the User Interface, similar to the “Add to Favorites” option. By default, a newly added project goes to the “Active” section unless it is explicitly marked as “Past”. Similarly, when a project is unpinned from Favorites, it goes to the “Active” section.

Conclusion and future Scope

Building a comprehensive search system requires careful attention to performance, user experience, and maintainability. FlexSearch provided the foundation, but the real value came from thoughtful implementation of persistent indexing, advanced scoring, and robust error handling. The result is a search system that feels instant to users while handling complex queries across diverse document types.

The key to success was treating search not as a single feature, but as a complete subsystem with its own data management, performance monitoring, and user interface considerations. By investing in these supporting systems, the search functionality became a central, reliable part of the application that users can depend on.

The future scope would include:

Using a database (for example, SQLite) instead of JSON files, which suits this use case better because it offers more efficient queries and atomic create, read, update, and delete operations.
Integrating any suggestions from my mentors, as well as improvements we feel are necessary.
Developing unit tests for further functionalities and improvements.

[Final] Reproducibility of Interactive Notebooks in Distributed Environments

Wed, 20 Aug 2025 00:00:00 +0000

I am sharing an overview of my project Reproducibility of Interactive Notebooks in Distributed Environments and the work that I did this summer.

Project Overview

This project aims at improving the reproducibility of interactive notebooks that are executed in a distributed environment. Notebooks, such as those in the Jupyter environment, have become increasingly popular and are widely used in the scientific community due to their ease of use and portability. Reproducing these notebooks is a challenging task, especially in a distributed cluster environment.

In the distributed environments we consider, the notebook code is divided into manager and worker code. The manager code is the main entry point of the program, which divides the task at hand into one or more pieces of worker code that run in a parallel, distributed fashion. We utilize several open source tools to package and containerize the application code so that it can be reproduced across different machines and environments. They include Sciunit, FLINC, and TaskVine. These are the high-level goals of this project:

Generate execution logs for a notebook program.
Generate code and data dependencies for notebook programs in an automated manner.
Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.
Audit and package the notebook code running in a distributed environment.
Overall, support efficient reproducibility of notebook programs.

Progress Highlights

Here are the details of the work that I did during this summer.

Generation of Execution Logs

We generate execution logs for the notebook programs in a distributed environment using the Linux utility strace, which records every system call made by the notebook, including all files accessed during its execution. We collect separate logs for the manager and the worker code since they are executed on different machines and have different dependencies. By recording the entire notebook execution, we capture all libraries, packages, and data files referenced during notebook execution in the form of execution logs. These logs are then utilized for further analyses.

Extracting Software Dependencies

When a library such as the Python package NumPy is used by the notebook program, an entry is made in the execution log which has the complete path of the accessed library file(s) along with additional information. We analyze the execution logs for both manager and workers to find and list all dependencies. So far, we are limited to Python packages, though this methodology is general and can be used to find dependencies for any programming language. For Python packages, version numbers are also obtained by querying package managers such as pip or Conda on the local system.
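A simplified version of this analysis can be sketched in Python. The sketch assumes strace's default text output for open and openat calls, and it infers package names from site-packages or dist-packages path components; the function names and the sample log lines are made up for illustration and do not come from the project's code.

```python
import re
from importlib import metadata

# Matches e.g.: openat(AT_FDCWD, "/path/file.py", O_RDONLY|O_CLOEXEC) = 3
OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:AT_FDCWD, )?"(?P<path>[^"]+)".*\)\s+=\s+(?P<ret>-?\d+)'
)


def accessed_files(log_text: str) -> list[str]:
    """Return paths of files that were opened successfully (return value >= 0)."""
    paths = []
    for line in log_text.splitlines():
        match = OPEN_RE.search(line)
        if match and int(match.group("ret")) >= 0:
            paths.append(match.group("path"))
    return paths


def python_packages(paths: list[str]) -> list[str]:
    """Infer top-level package names from site-packages / dist-packages paths."""
    packages = set()
    for path in paths:
        parts = path.split("/")
        for marker in ("site-packages", "dist-packages"):
            if marker in parts and parts.index(marker) + 1 < len(parts):
                name = parts[parts.index(marker) + 1]
                if name == "__pycache__" or name.endswith((".dist-info", ".egg-info")):
                    continue
                packages.add(name.split(".")[0])  # handles single-file modules
    return sorted(packages)


def installed_version(name: str):
    """Look up a version locally; import names may differ from distribution names."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


SAMPLE_LOG = '''\
openat(AT_FDCWD, "/usr/lib/python3/dist-packages/numpy/__init__.py", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/home/u/venv/lib/python3.10/site-packages/pandas/core/frame.py", O_RDONLY|O_CLOEXEC) = 4
openat(AT_FDCWD, "/home/u/venv/lib/python3.10/site-packages/scipy/__init__.py", O_RDONLY) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/data/input.csv", O_RDONLY) = 5
'''

files = accessed_files(SAMPLE_LOG)
packages = python_packages(files)
print(packages, {name: installed_version(name) for name in packages})
```

Failed opens (such as the ENOENT line) are skipped, since a file that could not be opened was not actually used by the run.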

Extracting Data Dependencies

We utilize similar execution logs to identify which data files were used by the notebook program. The list of logged files also contains various configuration or settings files used by certain packages and libraries. These files are removed from the list of data dependencies in a post-processing step that analyzes file paths.

Testing the Pipeline

We have conducted our experiments on three use cases obtained from different domains using between 5 and 10 workers. They include distributed image convolution, climate trend analysis, and high energy physics experiment analysis. The results so far are promising with good accuracy and with a slight running time overhead.

Processing at Cell-level

We perform the same steps of log generation and data and software dependency extraction at the level of individual cells in a notebook instead of once for the whole notebook. As a result, we generate software and data dependencies at the level of individual notebook cells. This is achieved by interrupting control flow before and after execution of each cell to write special instructions to the execution log for marking boundaries of cell execution. We then analyze the intervals between these instructions to identify which files and Python packages are accessed by each specific cell. We use this information to generate the list of software dependencies used by that cell only.

We also capture data dependencies by overriding the open function used to access files and analyzing the execution logs that the override generates.
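One way to picture the open-based capture at cell granularity is the sketch below, which temporarily replaces Python's built-in open and records which files each cell touches. It is an illustration under assumptions, not the project's implementation: cells are modelled as plain callables, and `record_opens` and `run_cells` are hypothetical names.

```python
import builtins
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def record_opens():
    """Temporarily wrap builtins.open and collect (path, mode) for every call."""
    accessed = []
    original_open = builtins.open

    def tracking_open(file, mode="r", *args, **kwargs):
        accessed.append((str(file), mode))
        return original_open(file, mode, *args, **kwargs)

    builtins.open = tracking_open
    try:
        yield accessed
    finally:
        builtins.open = original_open  # always restore the real open


def run_cells(cells):
    """Run each cell (a zero-argument callable) and return {cell index: files opened}."""
    per_cell = {}
    for index, cell in enumerate(cells):
        with record_opens() as accessed:
            cell()
        per_cell[index] = list(accessed)
    return per_cell


with tempfile.TemporaryDirectory() as tmp:
    data_file = str(Path(tmp) / "input.csv")
    Path(data_file).write_text("a,b\n1,2\n")

    def cell_0():
        with open(data_file) as handle:  # this cell reads the data file
            handle.read()

    def cell_1():
        pass  # this cell touches no files

    deps = run_cells([cell_0, cell_1])

print(deps)
```

A wrapper like this only sees calls that go through the built-in open name; code that calls io.open or os.open directly bypasses it, which is one reason system-call logs remain useful alongside it.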

Distributed Notebook Auditing

In order to execute and audit workloads in parallel, we use Sciunit Parallel, which uses GNU Parallel for efficient parallel execution of tasks. The user specifies the number of tasks or machines, and the work is then distributed across them. Once execution completes, the resulting containerized executions need to be gathered at the host location.

Efficient Reproducibility with Checkpointing

An important challenge with Jupyter notebooks is that re-executing them can be unnecessarily time-consuming and resource-intensive, especially when most cells remain unchanged. We worked on NBRewind, a lightweight tool that accelerates notebook re-execution by avoiding redundant computation. It integrates checkpointing, application virtualization, and content-based deduplication. It enables two kinds of checkpoints: incremental and full-state. In incremental checkpoints, notebook states and dependencies across multiple cells are stored once, and afterwards only their deltas are stored. In full-state checkpoints, the complete state is stored after each cell. During its restore process, NBRewind restores outputs for unchanged cells and thus enables efficient re-execution. Our empirical evaluation demonstrates that NBRewind can significantly reduce both notebook audit and repeat times with incremental checkpoints.

I am very happy about the experience I have had in this project and I would encourage other students to join this program in the future.

Midterm Report: Simulation, Comparison, and Conclusion of Cache Eviction

Wed, 06 Aug 2025 00:00:00 +0000

Project Overview

CacheBench is a benchmarking suite designed for comprehensive cache performance evaluation, with a particular focus on analyzing the miss ratios of various cache eviction algorithms.

At the core of CacheBench lie two key components: the high-performance cache simulator, libCacheSim, and the extensive open-source cache datasets, which collectively contain over 8,000 traces from diverse applications. This ensures broad coverage across a range of realistic workloads.

Our primary goal is to evaluate all major and widely-used cache eviction algorithms on thousands of traces, in order to gain insights into their behaviors and design trade-offs. Additionally, we aim to identify and distill representative workloads, making benchmarking more efficient and comprehensive for future cache research.

Progress and Pain Points

We began by benchmarking prevalent eviction algorithms, including FIFO, LRU, CLOCK, LFU, Random, Belady (BeladySize), CAR, ARC, LIRS, LHD, Hyperbolic, GDSF, W-TinyLFU, 2Q, SLRU, S3-FIFO, SIEVE, and LeCaR. As we developed the suite, we made progressive improvements to both the simulator and dataset infrastructure. Our progress can be summarized as follows:

Collected miss ratio results for all listed algorithms across 8,000+ traces.
Identified best- and worst-performing traces for each algorithm, and conducted feature analysis of these traces.
Developed Python bindings: To increase accessibility, we provided a Python package that allows users to easily download traces and run simulation analyses using libCacheSim and the cache datasets.

However, analysis remains challenging because there is no universally accepted metric or baseline for objectively comparing cache eviction algorithms’ performance across all workloads.
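To make the miss-ratio metric concrete, here is a small self-contained sketch that replays a request trace through two of the listed policies, LRU and FIFO, and reports the fraction of requests that miss. It is plain Python written for illustration and does not use libCacheSim or its Python bindings; the trace is a toy example, not one of the dataset traces.

```python
from collections import OrderedDict, deque


def miss_ratio_lru(trace, capacity: int) -> float:
    """Replay a trace through an LRU cache and return misses / requests."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # refresh recency on a hit
        else:
            misses += 1
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return misses / len(trace)


def miss_ratio_fifo(trace, capacity: int) -> float:
    """Replay a trace through a FIFO cache and return misses / requests."""
    cache = set()
    order = deque()
    misses = 0
    for key in trace:
        if key not in cache:
            misses += 1
            cache.add(key)
            order.append(key)
            if len(cache) > capacity:
                cache.discard(order.popleft())  # evict the oldest inserted key
    return misses / len(trace)


toy_trace = [1, 2, 3, 1, 4, 1]
lru = miss_ratio_lru(toy_trace, capacity=3)
fifo = miss_ratio_fifo(toy_trace, capacity=3)
print(f"LRU miss ratio: {lru:.3f}, FIFO miss ratio: {fifo:.3f}")
```

On this toy trace LRU keeps key 1 because it was recently reused, while FIFO evicts it as the oldest insertion, which is the kind of per-trace difference the analysis tries to explain at scale.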

Next Steps

For the second half of the project, my focus will shift to:

Evaluating More Complex Eviction Algorithms: Having concentrated mainly on static eviction policies so far (which are generally more deterministic and understandable), I will now investigate learning-based eviction algorithms such as LRB and 3L-Cache. These models incorporate learning components and incur additional computational overhead, making simulations slower and more complex.
Detailed Trace Analysis: Since eviction algorithms can have highly variable performance on the same trace, I plan to analyze why certain algorithms excel on specific traces while others do not. Understanding these factors is crucial to characterizing both the algorithms and the workload traces.
Constructing Representative Workload Sets: Based on ongoing simulations and trace analyses, I aim to identify a minimal but representative subset of traces that can serve as a basic evaluation suite, simplifying testing and improving accessibility.

Reflection

This project has truly been the highlight of my summer. By evaluating a wide range of cache eviction algorithms, I’ve significantly deepened my understanding of cache design and its underlying principles.

I’m especially grateful to my mentors for their constant support, patience, and guidance throughout this journey. It’s been a privilege to learn from you!

I’m excited to see the final results of CacheBench!

Mid-Term Update: MPI Appliance for HPC Research on Chameleon

Sun, 03 Aug 2025 00:00:00 +0000

Hi everyone! This is my mid-term blog update for the project MPI Appliance for HPC Research on Chameleon, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community. This blog follows up on my earlier post, which you can find here.

🔧 June 15 – June 29, 2025

Worked on creating and configuring images on Chameleon Cloud for the following three sites: CHI@UC, CHI@TACC, and KVM@TACC.

Key features of the images:

Spack: Pre-installed and configured for easy package management of HPC software.
Lua Modules (LMod): Installed and configured for environment module management.
MPI Support: Both MPICH and Open MPI are pre-installed, enabling users to run distributed applications out-of-the-box.

These images are now publicly available and can be seen directly on the Chameleon Appliance Catalog, titled MPI and Spack for HPC (Ubuntu 22.04).

I also worked on some example Jupyter notebooks on how to get started using these images.

🔧 June 30 – July 13, 2025

With the MPI Appliance now published on Chameleon Cloud, the next step was to automate the setup of an MPI-Spack cluster.

To achieve this, I developed a set of Ansible playbooks that:

Configure both master and worker nodes with site-specific settings
Set up seamless access to Chameleon NFS shares
Allow users to easily install Spack packages, compilers, and dependencies across all nodes

These playbooks aim to simplify the deployment of reproducible HPC environments and reduce the time required to get a working cluster up and running.

🔧 July 14 – July 28, 2025

This period began with me fixing some issues in python-chi, the official Python client for the Chameleon testbed. We also discussed adding support for CUDA-based packages, which would make it easier to work with NVIDIA GPUs. We successfully published a new image on Chameleon, titled MPI and Spack for HPC (Ubuntu 22.04 - CUDA), and added an example to demonstrate its usage.

We compiled the artifact containing the Jupyter notebooks and Ansible playbooks and published it on Chameleon Trovi. Feel free to check it out here. The documentation still needs some work.

📌 That’s it for now! I’m currently working on the documentation, a ROCm-based image for AMD GPUs, and some container-based examples. Stay tuned for more updates in the next blog.

Mid-Term Report: Uncovering the True Sources of Non-Reproducibility in AI for Science

Fri, 01 Aug 2025 00:00:00 +0000

Hello, I’m Baiqiang. I’m excited to share a mid-term update from the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project. This journey, mentored by Luanzheng “Lenny” Guo and Dongfang Zhao, has taken a fascinating and unexpected turn, leading to a much deeper understanding of what it takes to build truly reliable AI for science.

The Search for an Invisible Bug

As a quick recap, our project tackles the critical problem of non-determinism in Retrieval-Augmented Generation (RAG) systems. For science to be trustworthy, it must be repeatable. If an AI system gives different answers to the same question, it fails this fundamental test. Our initial goal, outlined in my proposal, was to find and fix the sources of this inconsistency, which we believed lay within the retrieval algorithms themselves.

To do this, we built a comprehensive testing framework capable of running thousands of controlled experiments. We designed it to meticulously measure the consistency of retrieval results while varying everything from the indexing algorithm to the underlying hardware.
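
As a rough illustration of what "measuring consistency" can mean, the sketch below compares the ranked results of repeated runs of one query. The two metrics (exact-match rate and mean Jaccard overlap) and all names are assumptions chosen for this example, not the project's actual framework.

```python
from itertools import combinations


def jaccard(a, b):
    """Set overlap of two result lists, ignoring rank order."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def consistency_report(runs):
    """Compare every pair of runs; `runs` is a list of ranked doc-id lists."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    pairs = list(combinations(runs, 2))
    exact = sum(1 for a, b in pairs if a == b) / len(pairs)
    overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    return {"exact_match_rate": exact, "mean_jaccard": overlap}


# Hypothetical results of running the same query four times.
runs = [
    ["d1", "d2", "d3"],
    ["d1", "d2", "d3"],
    ["d1", "d3", "d2"],  # same documents, different ranking
    ["d1", "d2", "d4"],  # different document set
]
report = consistency_report(runs)
print(report)
```

Exact match is sensitive to any reordering, while Jaccard overlap only drops when the retrieved set itself changes, so reporting both separates ranking drift from set drift.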

A Surprising Discovery: The Usual Suspect is Innocent

The common wisdom in the community is that high-performance, approximate search libraries like FAISS are a major source of randomness. We put this to the test, running repeated queries against various index types, including complex ones like HNSW and IndexIVF.

Our results were clear and surprising: FAISS is remarkably reproducible out of the box. When run on a consistent hardware and software stack, it returns the exact same results, every single time. The library appears to have robust internal seed management that ensures deterministic behavior.

This finding was a pivotal moment. The non-reproducibility that researchers observe in practice is real, but it doesn’t come from where we expected. The problem isn’t the algorithm itself, but the environment it runs in. Our investigation immediately shifted to find the real culprits.

Pinpointing the True Sources of Non-Determinism

Our framework quickly helped us identify the true sources of inconsistency:

Hardware-Induced Variation (CPU vs. GPU): This is the most significant factor. Running the exact same retrieval code can produce different document rankings and even different document sets when executed on a CPU versus a GPU. This is likely due to subtle differences in floating-point arithmetic and library optimizations in the hardware stack.
The Impact of Numerical Precision: We also confirmed that changing the floating-point precision of the data (e.g., from FP32 to FP16) can introduce small numerical variations that are just large enough to reorder the results, potentially changing the evidence the LLM receives.
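
The precision effect can be reproduced in miniature with the Python standard library alone. The sketch below is a toy, not our benchmark: the vectors are invented, only the stored values are rounded to FP32 or FP16 (the accumulation itself runs in double precision), and ties are broken by document id as an assumed policy.

```python
import struct


def quantize(x, fmt):
    """Round a float to an IEEE 754 format: '<f' is FP32, '<e' is FP16."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def rank(query, docs, fmt):
    """Rank documents by inner product after rounding all stored values."""
    scores = {}
    for doc_id, vec in docs.items():
        scores[doc_id] = sum(
            quantize(q, fmt) * quantize(v, fmt) for q, v in zip(query, vec)
        )
    # Highest score first; equal scores fall back to doc id order.
    return sorted(scores, key=lambda d: (-scores[d], d))


query = [1.0, 1.0]
docs = {"doc_a": [0.5001, 0.25], "doc_b": [0.5002, 0.25]}

fp32_ranking = rank(query, docs, "<f")
fp16_ranking = rank(query, docs, "<e")
print("FP32:", fp32_ranking)
print("FP16:", fp16_ranking)
```

In FP32 the two documents keep distinct scores and doc_b ranks first. In FP16 both 0.5001 and 0.5002 round to 0.5, the scores tie, and the tie-break puts doc_a first, so the ranking changes without any change to the algorithm.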

Our Mission Refined: Building Tools for Environmental Control

This discovery has sharpened our project’s mission. The challenge is not to “fix” a supposedly random algorithm, but to develop the tools and best practices to control for the entire experimental environment. Our focus for the second half of the project is to:

Develop a Hardware-Aware Configuration Tracker: We are building a tool that goes beyond logging software versions. It will capture the critical details of the hardware environment—CPU/GPU model, CUDA version, etc.—and link them directly to an experiment’s results.
Create a Cross-Environment Validation Suite: Our open-source benchmarking suite will empower researchers to test their own pipelines. Crucially, it will help them identify and diagnose inconsistencies when moving workflows between different machines, such as from a local laptop to a cloud-based GPU.
Establish New Best Practices: We will distill our findings into clear, actionable guidance. The key recommendation is no longer just about choosing the right algorithm, but ensuring a consistent and well-documented hardware and software environment to guarantee reproducible outcomes.
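
A minimal sketch of the tracker idea, assuming a plain dictionary record and a short hash as the environment identifier. The field names are hypothetical, and GPU or CUDA details would have to be supplied by the caller because the standard library cannot detect them.

```python
import hashlib
import json
import platform


def capture_environment(extra=None):
    """Collect basic platform facts; `extra` carries caller-supplied details."""
    env = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
    }
    env.update(extra or {})
    return env


def fingerprint(env):
    """Stable short id: identical environments map to the same string."""
    canonical = json.dumps(env, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# GPU and CUDA values are placeholders supplied by hand in this sketch.
env = capture_environment({"gpu": "unknown", "cuda": "unknown"})
record = {
    "env_id": fingerprint(env),
    "env": env,
    "results": ["d1", "d2", "d3"],  # hypothetical retrieval output
}
print(json.dumps(record, indent=2))
```

Storing the identifier next to each result makes it cheap to group runs by environment and to spot which environment change coincides with a change in output.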

By following the evidence, we’ve uncovered the root cause of a critical problem in AI-driven research. We are now developing the solutions needed to manage it, paving the way for a future where scientific discoveries powered by AI are built on a foundation of verifiable trust.

Midterm Report: Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows

Wed, 30 Jul 2025 00:00:00 +0000

Refresher about the Project

Hi everyone! For the last month I have been working with my mentors, Professor Fraida Fund and Mohamed Saeed, on our project, Applying MLOps to overcome reproducibility barriers in machine learning research. As a refresher, our goal is to build a template generator for reproducible machine learning training workflows on the Chameleon testbed. We want to provide our users with the necessary environment configuration in a handy way, so they won’t be overwhelmed by the intricate details of setting up the environment. This will allow for validation and further development of their setup.

What we have done so far

The current workflow begins in JupyterHub, where the user provides basic details such as project name, site, and node type. The notebooks handle key setup tasks, like creating storage buckets, provisioning and configuring a server with GPU support, and mounting buckets locally via rclone. Once the host environment is ready, the user connects to that machine over SSH, generates the necessary variables via a script, and launches a containerized virtual lab that integrates Jupyter and MLflow. Inside the container, users authenticate with GitHub, connect or initialize their repositories, and can immediately begin training models, with all metrics, artifacts, and environment details logged for reproducibility.

The progress on the project so far is as follows:

We finalized the selection of frameworks and storage options.

Artifacts are now logged directly from the MLflow server to the Chameleon object store, without relying on a database backend or an intermediate MinIO S3 layer.

Different JupyterLab images for each framework.

We’ve started with the top ML frameworks — PyTorch Lightning, Keras/TensorFlow, and Scikit-Learn. Each framework now has its own image, which will later be tailored to the user’s selection.

GitHub CLI and Hugging Face integration inside the container.

The Jupyter container now integrates both the GitHub CLI and Hugging Face authentication. Users can manage their code repositories via GitHub CLI commands and authenticate with Hugging Face tokens to download/upload models and datasets. This eliminates the need for manual credential setup and streamlines ML experimentation within the environment.

Custom Logging Utility

To ensure robust tracking of code versioning and environment details, we added a custom logging utility.
These logs are stored alongside metrics and model artifacts in MLflow, ensuring every experiment is fully documented and reproducible. A summary of its functions:

`log_git()` — Captures Code Versioning

Uses Git commands (via subprocess) to log:

Current branch name
Commit hash
Repository status (clean or dirty)

Example Output:

commit: a7c3e9d
branch: main
status: dirty (1 file modified)
# and git diff output

`log_python()` - Tracks the Python Environment

Platform information + Python environment info (version)
Exports a full pip freeze list to a .txt file
Saved as an MLflow artifact to guarantee exact package version reproducibility

Example Output (pip freeze extract):

numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.2
torch==2.2.0

`log_gpu()` - Records GPU Information

Detects available GPU devices
Collects details using NVIDIA’s pynvml or AMD’s ROCm tools
Logs:
GPU name
Driver version
CUDA/ROCm version
Captures nvidia-smi or rocm-smi output (depending on the GPU type) for deeper inspection

These utilities ensure that each run can be traced back with:

The exact code version
The full Python environment
The hardware details used

Initial customizable template

We’ve prototyped an initial customizable template using Cookiecutter. It provides an interactive CLI in which users supply key project details (e.g., project name, frameworks, GPU type, and integrations, if any). Cookiecutter then generates a ready-to-use project structure with pre-configured integrations, reducing manual setup and ensuring consistency across environments.

The user will have notebooks to communicate with Chameleon testbed resources, a containerized environment, and custom training scripts into which they can plug their code.

What’s Next

Template Generation via Config + interactive widgets
We are exploring different ways to generate experiment templates using configuration files and interactive widgets in Jupyter notebooks. This would let users quickly customize logging setups, and we expect it to be more user-friendly.
AMD-Compatible Images
Extend support by building and testing Docker images optimized for AMD GPUs. Up to now, our development efforts have focused on NVIDIA GPUs using CUDA-based images.
End-to-End Lifecycle Example
Provide a larger example demonstrating the entire ML workflow:
- Data preparation
- Training with GPU logging
- Tracking metrics, artifacts, and environment info in MLflow
- Model evaluation and logging
- Reproducing results on different hardware backends

Working on this project so far has been both challenging and eye-opening. I’ve seen how many moving parts need to come together for a smooth workflow. The support from my mentors has been key in helping me turn challenges into real progress.

Thank you for following along — I’m looking forward to sharing more concrete results soon.

Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits

Tue, 29 Jul 2025 00:00:00 +0000

Hello! I’m Siva Sathyaseelan D N, a pre-final year B.Tech + M.Tech Engineering student at IIT BHU, Varanasi, India. With a deep-rooted passion for software development and scientific computing, I thrive at the intersection of code and real-world problem-solving. For two years, I’ve engaged in open-source work across scientific simulation, blockchain, and cloud-native technologies through hobby projects, hackathons, internships, and an LFX mentorship. I’m contributing to Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits under the mentorship of Ben Greenman. My proposal can be viewed here!

Project Overview

Gradual typing enhances untyped languages like JavaScript and Python with static type checkers in systems like TypeScript, Flow, Mypy, Pyright, and Typed Racket. These systems use type narrowing to refine types via runtime checks (e.g., typeof item["price"] === "number"). Designs vary: TypeScript permits unverified predicates, Flow ensures soundness, and Typed Racket tracks types compositionally. This variety prompted the If-T benchmark (ift-benchmark), which evaluates narrowing across five languages. However, it omits tools like Sorbet, Hack, Luau, Pyre, Cinder/Static Python, Typed Clojure, and Elixir, and the risks of unsound narrowings remain unclear.

Objectives

Extend the If-T benchmark to Sorbet, Hack, Luau, Pyre, Cinder/Static Python, Typed Clojure, and potentially Elixir.
Analyze their type narrowing precision, expressiveness, and soundness.
Conduct a corpus study of TypeScript or Python code using GitHub or Software Heritage APIs.
Assess the prevalence and exploit potential of unsound narrowings.
Link corpus findings to benchmark results for broader insights.

Progress So Far

During the first half of the SoR 2025 period, I focused on extending the If-T benchmark to Sorbet, Pyre, Cinder/Static Python, and Typed Clojure. These are the PRs that extend the If-T benchmark:

Sorbet -> https://github.com/utahplt/ifT-benchmark/pull/20
Pyre -> https://github.com/utahplt/ifT-benchmark/pull/26
Typed Clojure -> https://github.com/utahplt/ifT-benchmark/pull/27
Cinder -> https://github.com/utahplt/ifT-benchmark/pull/28

What’s Next

Next, I will conduct a corpus study of TypeScript or Python code using GitHub or Software Heritage APIs, and assess the prevalence and exploit potential of unsound narrowings. I will also link the corpus findings to the benchmark results for broader insights (TGUsage).

Final Thoughts

Working on Type Narrowing has been incredibly rewarding. It’s more than just code: it’s studying the type systems of different programming languages, which is very important for large-scale software systems and software security, and I’m honored to be a part of that.

Big thanks to my mentor, Ben Greenman, for the support and thoughtful feedback throughout. I’ve learned a ton already, and I can’t wait to keep building.

Mid-term Blog: Building a Simulator for Benchmarking Replicated Systems

Fri, 25 Jul 2025 00:00:00 +0000

Introduction

Hello there, I’m Michael. In this report, I’ll be sharing my progress as part of the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges project under the mentorship of Fadhil Kurnia.

About the Project

The goal of the project is to build a language-agnostic interface that enables communication between clients and any consensus protocol such as MultiPaxos, Raft, Zookeeper Atomic Broadcast (ZAB), and others. Currently, many of these protocols implement their own custom mechanisms for the client to communicate with the group of peers in the network. An implementation of MultiPaxos from the MultiPaxos Made Complete paper, for example, uses a custom Protobuf definition for the packets that clients send to the MultiPaxos system. With a generalized interface, different consensus protocols can be tested under the same workload to compare their performance objectively.

Progress

Literature Study: Reviewed papers and implementations of various protocols including GigaPaxos, Raft, Viewstamped Replication (VSR), and ZAB. Analysis focused on their log replication strategies, fault handling, and performance implications.
Development of Custom Protocol: Two custom protocols are currently under development and will serve as initial test subjects for the testbed:
- A modified GigaPaxos protocol
- A Primary-Backup Replication protocol with strict log ordering similar to ZAB (logs are ordered based on the sequence proposed by the primary)
Most of my time has been spent working on the two protocols, particularly on snapshotting and state transfer functionality in the Primary-Backup protocol. Ideally, the testbed should be able to evaluate protocol performance in scenarios involving node failure or a new node being added. In these scenarios, different protocol implementations often vary in their decision of whether to take periodic snapshots or to roll forward whenever possible and generate a snapshot only when necessary.
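
The trade-off between periodic snapshots and rolling forward can be sketched with a toy key-value replica. This is a single-process illustration under strong assumptions (no network, no failures, whole-state snapshots), and the class and method names are hypothetical rather than taken from the protocols under development.

```python
class Replica:
    """Toy key-value replica that applies an ordered log of writes."""

    def __init__(self):
        self.state = {}
        self.applied = 0  # number of log entries applied so far

    def apply(self, entries):
        for key, value in entries:
            self.state[key] = value
            self.applied += 1


class Primary(Replica):
    def __init__(self, snapshot_every=None):
        super().__init__()
        self.log = []            # entries written since the last snapshot
        self.snapshot = ({}, 0)  # (state copy, entries covered by it)
        self.snapshot_every = snapshot_every

    def write(self, key, value):
        self.log.append((key, value))
        self.apply([(key, value)])
        if self.snapshot_every and len(self.log) >= self.snapshot_every:
            self.take_snapshot()

    def take_snapshot(self):
        self.snapshot = (dict(self.state), self.applied)
        self.log.clear()  # entries covered by the snapshot can be dropped

    def bootstrap(self, replica):
        """State transfer: ship the snapshot, then replay the log suffix."""
        state, covered = self.snapshot
        replica.state, replica.applied = dict(state), covered
        replica.apply(self.log)
        return len(state), len(self.log)  # (snapshot keys, entries replayed)


periodic, roll_forward = Primary(snapshot_every=4), Primary()
for i in range(10):
    for primary in (periodic, roll_forward):
        primary.write(f"k{i % 5}", i)

new_a, new_b = Replica(), Replica()
print("periodic snapshots:", periodic.bootstrap(new_a))
print("roll forward only: ", roll_forward.bootstrap(new_b))
```

With periodic snapshots the new node receives a compact state plus a short log suffix. Without them it replays the whole log, which costs nothing during normal operation but grows with history. Both new replicas end with the same state as their primary.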

Challenges

Early in the project, the initial goal was to benchmark different consensus protocols using arbitrary full-stack web applications as their workload. Different protocols would replicate a full-stack application running inside Docker containers across multiple nodes, and the testbed would send requests for them to coordinate between those nodes. In fact, the two custom protocols being worked on are specifically made to fit these constraints.

Developing a custom protocol that supports the replication of a Docker container is in itself already a difficult task. Abstracting away the functionality that allows communicating with the Docker containers, as well as handling log entries and snapshotting the state, is an order of magnitude more complicated.

As mentioned in the first blog, applications fall into two categories: deterministic and non-deterministic. The coordination of these two types is handled in very different ways. Most consensus protocols support only deterministic systems, such as key-value stores, and can’t easily handle coordination of complex services or external side effects. Supporting non-deterministic applications would require abstracting over protocol-specific log structures. This effectively restricts the interface to only support protocols that conform to the abstraction, defeating the goal of making the interface broadly usable and protocol-agnostic.

Furthermore, enabling any existing protocol to run something as complex as a stateful Docker container, without the protocol itself being aware of it, adds another layer of complexity to the system.

Future Goals

Given these challenges, I decided to pivot to using only key-value stores as the benchmark application. This aligns with most existing protocol implementations, which typically use key-value stores. The main focus is now to implement an interface that supports HTTP requests from clients to arbitrary protocols.

Midterm Blog: Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges

Fri, 25 Jul 2025 00:00:00 +0000

Hello! I’m Panji Sri Kuncara Wisma and I want to share my midterm progress on the “Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges” project under the mentorship of Fadhil I. Kurnia.

Project Overview

The goal of our project is to create an open testbed that enables fair, reproducible evaluation of different consensus protocols (Paxos variants, EPaxos, Raft, etc.) when deployed at network edges. Currently, researchers struggle to compare these systems because they lack standardized evaluation environments and often rely on mock implementations of proprietary systems.

XDN (eXtensible Distributed Network) is one of the important consensus systems we plan to evaluate in our benchmarking testbed. Built on GigaPaxos, it allows deployment of replicated stateful services across edge locations. As part of preparing our benchmarking framework, we need to ensure that the systems we evaluate, including XDN, are robust for fair comparison.

Progress

As part of preparing our benchmarking tool, I have been working on refactoring XDN’s FUSE filesystem from C++ to Rust. This work is essential for creating a stable and reliable XDN platform.

The diagram above illustrates how the FUSE filesystem integrates with XDN’s distributed architecture. On the left, we see the standard FUSE setup where applications interact with the filesystem through the kernel’s VFS layer. On the right, the distributed replication flow is shown: Node 1 runs fuselog_core which captures filesystem operations and generates statediffs, while Nodes 2 and 3 run fuselog_apply to receive and apply these statediffs, maintaining replica consistency across the distributed system.

This FUSE component is critical for XDN’s operation as it enables transparent state capture and replication across edge nodes. By refactoring this core component from C++ to Rust, we’re hopefully strengthening the foundation for fair benchmarking comparisons in our testbed.

Core Work: C++ to Rust FUSE Filesystem Migration

XDN relies on a FUSE (Filesystem in Userspace) component to capture filesystem operations and generate “statediffs” - records of changes that get replicated across edge nodes. The original C++ implementation worked but had memory safety concerns and limited optimization capabilities.

I worked on refactoring from C++ to Rust, implementing several improvements:

New Features Added:

Zstd Compression: Reduces statediff payload sizes
Adaptive Compression: Intelligently chooses compression strategies
Advanced Pruning: Removes redundant operations (duplicate chmod/chown, created-then-deleted files)
Bincode Serialization: Helps avoid manual serialization code and reduces the risk of related bugs
Extended Operations: Added support for additional filesystem operations (mkdir, symlink, hardlinks, etc.)

Architectural Improvements:

Memory Safety: Rust’s ownership system helps prevent common memory management issues
Type Safety: Using Rust enums instead of integer constants for better type checking

Findings

The optimizations performed as expected:

Statediff Size Reductions:

MySQL workload: 572MB → 29.6MB (95% reduction)
PostgreSQL workload: 76MB → 11.9MB (84% reduction)
SQLite workload: 4MB → 29KB (99% reduction)

The combination of write coalescing, pruning, and compression proves especially effective for database workloads, where many operations involve small changes to large files.
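
To illustrate why these three steps compound, here is a simplified Python sketch of pruning and coalescing a batch of filesystem operations before compressing it. It is not the fuselog code: the operation format is invented, only create, write, chmod, and unlink are modelled, JSON stands in for Bincode, and zlib from the standard library stands in for Zstd.

```python
import json
import zlib


def prune(ops):
    """Drop operations that cannot affect the replica's final state."""
    kept = []        # (sequence number, op) pairs that survive so far
    created_at = {}  # path -> sequence number of its create in this batch
    for seq, op in enumerate(ops):
        kind, path = op["op"], op["path"]
        if kind == "create":
            created_at[path] = seq
        elif kind == "unlink" and path in created_at:
            # Created and deleted within the batch: drop it entirely.
            born = created_at.pop(path)
            kept = [(s, k) for s, k in kept
                    if not (k["path"] == path and s >= born)]
            continue
        elif kind == "chmod":
            # Only the most recent mode change matters.
            kept = [(s, k) for s, k in kept
                    if not (k["path"] == path and k["op"] == "chmod")]
        elif kind == "write":
            # Coalesce: a later write to the same range replaces the earlier.
            kept = [(s, k) for s, k in kept
                    if not (k["path"] == path and k["op"] == "write"
                            and k["offset"] == op["offset"]
                            and len(k["data"]) == len(op["data"]))]
        kept.append((seq, op))
    return [k for _, k in kept]


def encode(ops):
    """Serialize and compress a batch (JSON + zlib as stand-ins)."""
    return zlib.compress(json.dumps(ops).encode("utf-8"))


batch = [
    {"op": "create", "path": "/db/tmp.journal"},
    {"op": "write", "path": "/db/tmp.journal", "offset": 0, "data": "x" * 512},
    {"op": "write", "path": "/db/main.db", "offset": 4096, "data": "a" * 256},
    {"op": "chmod", "path": "/db/main.db", "mode": 0o600},
    {"op": "write", "path": "/db/main.db", "offset": 4096, "data": "b" * 256},
    {"op": "chmod", "path": "/db/main.db", "mode": 0o644},
    {"op": "unlink", "path": "/db/tmp.journal"},
]
pruned = prune(batch)
raw_size = len(json.dumps(batch).encode("utf-8"))
print(f"{len(batch)} ops, {raw_size} bytes raw")
print(f"{len(pruned)} ops, {len(encode(pruned))} bytes pruned + compressed")
```

Database workloads produce many batches like this one, with temporary journal files and repeated writes to the same pages, so pruning removes whole operations before compression shrinks what remains.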

Performance Comparison: Remarkably, the Rust implementation matches or exceeds C++ performance:

POST operations: 30% faster (10.5ms vs 15ms)
DELETE operations: 33% faster (10ms vs 15ms)
Overall latency: Consistently better (9ms vs 11ms)

Current Challenges

While the core implementation is complete and functional, I’m currently debugging occasional latency spikes that occur under specific workload patterns. These edge cases need to be resolved before moving on to the benchmarking phase, as inconsistent performance could compromise the reliability of the evaluation.

Next Steps

With the FUSE filesystem foundation nearly complete, next steps include:

Resolve latency spike issues and complete XDN stabilization
Build benchmarking framework - a comparison tool that can systematically evaluate different consensus protocols with standardized metrics.
Run systematic evaluation across protocols

The optimized filesystem will hopefully provide a stable base for reproducible performance comparisons between distributed consensus protocols.

Midterm for Smart Environments

Thu, 24 Jul 2025 00:00:00 +0000

What is EnvGym?

EnvGym is a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.

EnvGym addresses this gap by leveraging LLM-powered agents to analyze project instructions, resolve dependencies, configure execution environments, and validate results—thereby reducing human overhead and improving reproducibility at scale.

Progress

New Tools

Initially, our agent had access to only one tool: the command line. This constrained the agent’s ability to decompose complex tasks and respond flexibly to failures. Over the last few weeks, we introduced a modular tool system, enabling the agent to handle specific subtasks more effectively.

The new toolset includes:

dockerrun: Executes Dockerfiles.
hardware_checking, hardware_adjustment: Tailor builds to available resources.
history_manager, stats: Tracks historical data for improvement and reproducibility.
planning: Generates high-level execution plans.
summarize: Interprets build results to adjust subsequent iterations.
writing_docker_initial, writing_docker_revision: Generate and refine Dockerfiles.

While some of these tools, such as dockerrun, run programmatic scripts, others, such as planning, are more complex and use LLMs themselves.

Agent Re-Architecture: Moving Beyond Codex

We transitioned away from OpenAI’s Codex agent implementation. While powerful, Codex’s framework was overly reliant on its CLI frontend, which added unnecessary complexity and limited customizability for our research context.

We implemented our own lightweight, customizable agent pipeline that integrates LLM-based planning with iterative execution. Conceptually, the agent executes the following loop:

Repo Scanning
Hardware Check
Planning & Initial Dockerfile Generation
Docker Execution
Progress Summarization & Adjustment
Iterative Dockerfile Refinement (up to 20 rounds)
Success Check & Logging

This new agent design is easier to control, extend, and debug—aligning better with the needs of reproducibility research.
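
The loop above can be sketched in a few lines of Python. The tool names follow the list in this post, but their signatures, the `scan_repo` entry, and the stub implementations are assumptions made so the example runs without Docker or an LLM. It is not EnvGym's actual code.

```python
MAX_ROUNDS = 20  # refinement budget mentioned in the post


def run_agent(repo, tools, max_rounds=MAX_ROUNDS):
    """Drive one repository through scan, plan, build, and refine steps."""
    facts = tools["scan_repo"](repo)
    hardware = tools["hardware_checking"]()
    plan = tools["planning"](facts, hardware)
    dockerfile = tools["writing_docker_initial"](plan)
    history = []
    for round_no in range(1, max_rounds + 1):
        ok, log = tools["dockerrun"](dockerfile)
        summary = tools["summarize"](log)
        history.append({"round": round_no, "ok": ok, "summary": summary})
        if ok:
            break
        dockerfile = tools["writing_docker_revision"](dockerfile, summary)
    success = bool(history) and history[-1]["ok"]
    return {"success": success, "rounds": len(history), "dockerfile": dockerfile}


# Stub tools: the fake build passes once both packages are installed.
REQUIRED = ("git", "cmake")


def fake_dockerrun(dockerfile):
    missing = [pkg for pkg in REQUIRED if pkg not in dockerfile]
    if missing:
        return False, "missing: " + ",".join(missing)
    return True, "build ok"


def fake_revision(dockerfile, summary):
    first_missing = summary.split(": ")[1].split(",")[0]
    return dockerfile + f"\nRUN apt-get install -y {first_missing}"


stub_tools = {
    "scan_repo": lambda repo: {"repo": repo},
    "hardware_checking": lambda: {"gpu": False},
    "planning": lambda facts, hardware: "start from a plain Ubuntu image",
    "writing_docker_initial": lambda plan: "FROM ubuntu:22.04",
    "dockerrun": fake_dockerrun,
    "summarize": lambda log: log,
    "writing_docker_revision": fake_revision,
}

outcome = run_agent("example/repo", stub_tools)
print(outcome["success"], outcome["rounds"])
```

Passing the tools in as a dictionary keeps the loop itself free of LLM or Docker dependencies, which makes it easy to test the control flow with stubs like these.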

Prompt Engineering

For each tool that requires LLMs to function, we created a set of custom prompts that outline the task and break down the goals. For instance, the prompt used in summarize differs from the one in planning, allowing us to optimize the behavior of LLM agents per context.

Performance Gains

With these improvements, EnvGym now successfully replicates 9 repositories, surpassing our baseline Codex agent which struggled with the same set. We’ve observed more reliable planning, better handling of edge-case dependencies, and faster convergence in iterative Dockerfile revisions.

Next Steps

Granular Evaluation Metric

We plan to adopt a tree-structured rubric-based evaluation, inspired by PaperBench. Instead of binary success/failure, each repo will be assigned a reproducibility score from 0–100.

Key tasks include:

Rubric Design: Define a hierarchical rubric with criteria like dependency resolution, test success rate, runtime match, etc.
Manual Annotation: Build a dataset of ground-truth rubrics for a subset of repos to calibrate our automatic judge.
Judge Implementation: Develop an LLM-based judge function that takes (i) rubric and (ii) environment state, and returns a reproducibility score.

Source: Starace, Giulio, et al. “PaperBench: Evaluating AI’s Ability to Replicate AI Research.” arXiv preprint arXiv:2504.01848 (2025).

This will make EnvGym suitable for benchmarking. We will run our new method and obtain a score to compare with baseline methods!
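
One simple way to turn a tree-structured rubric into a 0–100 score is a weighted average that propagates from the leaves to the root. The sketch below assumes that aggregation rule. The rubric contents, the weights, and the leaf scores (which a judge would normally produce) are invented for illustration.

```python
def score(node):
    """Return a 0-100 score for a rubric node.

    A leaf carries a judged `score` in [0, 1]; an inner node is the
    weighted average of its children's scores.
    """
    children = node.get("children")
    if not children:
        return 100.0 * node["score"]
    total_weight = sum(child["weight"] for child in children)
    weighted = sum(child["weight"] * score(child) for child in children)
    return weighted / total_weight


# Hypothetical rubric for one repository.
rubric = {
    "name": "reproducibility",
    "children": [
        {
            "name": "dependency resolution",
            "weight": 2,
            "children": [
                {"name": "system packages", "weight": 1, "score": 1.0},
                {"name": "python packages", "weight": 1, "score": 0.5},
            ],
        },
        {"name": "test success rate", "weight": 2, "score": 0.5},
        {"name": "runtime match", "weight": 1, "score": 0.0},
    ],
}

print(f"reproducibility score: {score(rubric):.1f}")
```

Here the dependency subtree scores 75, the tests 50, and the runtime check 0, giving (2 × 75 + 2 × 50 + 1 × 0) / 5 = 50 overall. A partially successful build therefore earns partial credit, which a binary success flag cannot express.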

Conclusion

EnvGym has made strong progress toward automating reproducibility in computational research. Through modularization, agentic design, and prompt optimizations, we’ve surpassed existing baselines and laid the groundwork for even more improvement.

The upcoming focus on metrics and benchmarking will elevate EnvGym from a functional prototype to a standardized reproducibility benchmark tool and also quantitatively prove that our new agentic method is better than existing tools such as Codex. Excited for what’s to come!

; 20250724-Sam_Huang

Mid-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing

Tue, 15 Jul 2025 00:00:00 +0000

Introduction

About the Project

As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for investigators to discover relevant projects, notes, and assets—across both current and archived work—using information that is either user-entered or passively collected by StatWrap.

Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:

Evaluating open-source search libraries suitable for local indexing and retrieval
Building the full-text search functionality directly into the StatWrap UI to allow seamless querying across projects
Ensuring reliability through the development of unit tests and comprehensive system testing
Implementing a classification system to label projects as “Active,” “Pinned,” or “Past” within the user interface

Progress

It has been more than six weeks since the project began, and significant progress has been made. Here’s a breakdown:

1. Descriptive Comparison of Open-Source Libraries

2. The Libraries

Lunr.js
A small, client-side full-text search engine that mimics Solr capabilities.
- Field-based search, boosting
- Supports TF-IDF, inverted index
- No built-in fuzzy search (only basic wildcards)
- Can serialize/deserialize index
- Not designed for large datasets
- Moderate memory usage and indexing speed
- Good documentation
- Best for: Static websites or SPAs needing simple in-browser search
ElasticLunr.js
A lightweight, more flexible alternative to Lunr.js.
- Dynamic index (add/remove docs)
- Field-based and weighted search
- No advanced fuzzy matching
- Faster and more customizable than Lunr
- Smaller footprint
- Easy to use and maintain
- Best for: Developers wanting Lunr-like features with simpler customization
Fuse.js
A fuzzy search library ideal for small to medium datasets.
- Fuzzy search with typo tolerance
- Deep key/path searching
- No need to build index
- Highly configurable (threshold, distance, etc.)
- Linear scan = slower on large datasets
- Not full-text search (scoring-based match)
- Extremely easy to set up and use
- Best for: Fuzzy search in small in-memory arrays (e.g., auto-suggest, dropdown filters)
FlexSearch
A blazing-fast, modular search engine with advanced indexing options.
- Extremely fast search and indexing
- Supports phonetic, typo-tolerant, and partial matching
- Asynchronous support
- Multi-language + Unicode-friendly
- Low memory footprint
- Configuration can be complex for beginners
- Best for: High-performance search in large/multilingual datasets
MiniSearch
A small, full-text search engine with balanced performance and simplicity.
- Fast indexing and searching
- Fuzzy search, stemming, stop words
- Field boosting and prefix search
- Compact, can serialize index
- Clean and modern API
- Lightweight and easy to maintain
- Best for: Balanced, in-browser full-text search for moderate datasets
Search-Index
A persistent, full-featured search engine for Node.js and browsers.
- Persistent storage with LevelDB
- Real-time indexing
- Fielded queries, faceting, filtering
- Advanced queries (Boolean, range, etc.)
- Slightly heavier setup
- Good for offline/local-first apps
- Browser usage more complex than others
- Best for: Node.js apps (not directly compatible with the Electron + React environment of StatWrap)

3. Developer Experience and Maintenance

We analyzed the download trends of the search libraries using npm trends, and also reviewed their maintenance statistics to assess how frequently they are updated.

4. Comparative Analysis After Testing

Each search library was benchmarked against a predefined set of queries based on the same evaluation criteria.
We have yet to finalize the weights for each criterion; this will be done during the end-term evaluation.

5. The User Interface

The user interface includes options to search using three search modes (Basic, Advanced, Boolean operators) with configurable parameters. Results are sorted based on relevance score (highest first), and also grouped by category.

6. Overall Functioning

Indexing Workflow
- Projects are processed sequentially
- Metadata, files, people, and notes are indexed (larger files are queued for later)
- Uses a “brute-force” recursive approach to walk through project directories
  - Skips directories like node_modules, .git, .statwrap
  - Identifies eligible text files for indexing
  - Logs progress every 10 files
Document Creation Logic
- Reads file content as UTF-8 text
- Builds searchable documents with filename, content, and metadata
- Auto-generates tags based on content and file type
- Adds documents to the search index and document store
- Handles errors gracefully with debug logging
Search Functionality
- Uses field-weighted search
- Enriches results with document metadata
- Supports filtering by type or project
- Groups results by category (files, projects, people, etc.)
- Implements caching for improved performance
- Search statistics are generated to monitor performance
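
The indexing workflow and document creation steps above can be sketched roughly as follows. This is a minimal Python illustration, not StatWrap's actual code (StatWrap runs on Electron + React): the skipped directory names and the "log every 10 files" rule come from the post, while `TEXT_EXTENSIONS`, `collect_text_files`, and `build_document` are hypothetical names chosen for the example.

```python
import os

# Directories the post says are skipped during the walk.
SKIP_DIRS = {"node_modules", ".git", ".statwrap"}
# Assumption: a small allow-list decides which files count as "eligible text files".
TEXT_EXTENSIONS = {".txt", ".md", ".csv", ".json", ".py", ".r"}


def collect_text_files(root, log_every=10):
    """Walk a project directory and return the paths of eligible text files."""
    eligible = []
    for current_dir, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for name in sorted(filenames):
            extension = os.path.splitext(name)[1].lower()
            if extension in TEXT_EXTENSIONS:
                eligible.append(os.path.join(current_dir, name))
                if len(eligible) % log_every == 0:
                    print(f"Found {len(eligible)} files so far")
    return eligible


def build_document(path, project):
    """Read a file as UTF-8 text and wrap it in a searchable document."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        content = handle.read()
    return {
        "id": path,
        "project": project,
        "filename": os.path.basename(path),
        "type": os.path.splitext(path)[1].lstrip("."),
        "content": content,
    }
```

Each returned document could then be handed to whichever search library is eventually selected; the sketch deliberately stops before that step.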

Challenges and End-Term Goals

In-Memory Indexing and Metadata Storage
Most JavaScript search libraries (like Fuse.js, Lunr, MiniSearch) store indexes entirely in memory, which can become problematic for large-scale datasets. A key challenge is designing a scalable solution that allows for disk persistence or lazy loading to prevent memory overflows.
Deciding the Weights Accordingly
An important challenge is tuning the relevance scoring by assigning appropriate weights to different aspects of the search, such as exact word matches, prefix matches, and typo tolerance. For instance, we prefer exact matches to be ranked higher than fuzzy or partial matches.
Implementing the Selected Library
Once a library is selected (based on speed, features, and compatibility with Electron + React), the next challenge is integrating it into StatWrap efficiently—ensuring local indexing, accurate search results, and smooth performance even with large projects.
Classifying Active and Past Projects in the User Interface
To improve navigation and search scoping, we plan to introduce three project sections in the interface: Pinned, Active, and Past projects. This classification will help users prioritize relevant content while enabling smarter indexing strategies.
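
To make the weighting challenge under "Deciding the Weights Accordingly" concrete, here is a small Python sketch of a scorer that ranks exact matches above prefix matches, and prefix matches above fuzzy ones. The weight values, the 0.8 similarity threshold, and all function names are assumptions for illustration only; the final weights have not been decided, and the selected library will have its own scoring model.

```python
from difflib import SequenceMatcher

# Illustrative weights only: exact > prefix > fuzzy.
WEIGHTS = {"exact": 3.0, "prefix": 2.0, "fuzzy": 1.0}


def match_kind(term, token, fuzzy_threshold=0.8):
    """Classify how a query term matches a document token, if at all."""
    if token == term:
        return "exact"
    if token.startswith(term):
        return "prefix"
    if SequenceMatcher(None, term, token).ratio() >= fuzzy_threshold:
        return "fuzzy"
    return None


def score(query, text):
    """Sum, over query terms, the best match weight found in the text."""
    tokens = text.lower().split()
    total = 0.0
    for term in query.lower().split():
        kinds = (match_kind(term, token) for token in tokens)
        total += max((WEIGHTS[kind] for kind in kinds if kind), default=0.0)
    return total


def rank(query, documents):
    """Return ids of matching documents, highest score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for value, doc_id in ordered if value > 0]
```

With this scheme, a document containing "survival" outranks one containing only "survivalist", which in turn outranks one containing the misspelling "survivel".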

Stay tuned for the next blog!

Enhancing Reproducibility in RAG Frameworks for Scientific Workflows

Wed, 25 Jun 2025 00:00:00 +0000

Hello, I’m Baiqiang. As part of the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project, I am excited to introduce my work on a crucial challenge in modern computational science. My proposal under the mentorship of Luanzheng “Lenny” Guo at Pacific Northwest National Laboratory and Dongfang Zhao at the University of Washington aims to enhance the reproducibility of AI-driven scientific workflows.

The Problem: A Crisis of Confidence in AI for Science

Large Language Models (LLMs) are transforming scientific research, from accelerating literature reviews to generating novel hypotheses. However, their power is matched by their pitfalls: a tendency to “hallucinate” facts and a lack of transparency. Retrieval-Augmented Generation (RAG) was developed as a powerful solution, grounding LLM outputs in factual evidence retrieved from a specific knowledge base (like a database of scientific papers).

But a hidden problem lurks within RAG: non-determinism. The very first step of a RAG system—the similarity search that finds relevant documents—can produce different results even when asked the same question. Variations in indexing algorithms, data updates, or even the underlying software can change which documents are retrieved. For science, this is a critical flaw. If an experiment cannot be repeated with the same results, its conclusions cannot be trusted. This project tackles that challenge head-on.

Our Mission: Forging a Path to Reproducible RAG

This project proposes a comprehensive solution to systematically identify, measure, and mitigate non-determinism in RAG frameworks. Our goal is to empower researchers to build and use AI tools with confidence.

Our approach is built on four key pillars:

Systematic Analysis: We will conduct a deep dive into popular RAG components (like FAISS, ScaNN, and HNSW) to pinpoint the exact sources of randomness and variability.
Rigorous Benchmarking: We will develop a public, open-source benchmarking suite using standardized scientific datasets (from PubMed, arXiv, etc.). This will allow anyone to quantitatively measure the reproducibility of their own RAG pipeline using clear metrics like retrieval overlap and rank correlation.
Targeted Enhancements: Based on our findings, we will implement practical solutions, including:
- Promoting deterministic algorithms and configurations.
- Building robust data versioning and provenance tracking tools (inspired by DVC and Git LFS).
- Creating tools for precise configuration management to capture the entire experimental setup.
Practical Guidance and Open Source Tools: We will distill our insights into comprehensive documentation, reusable code examples, and best practices. All tools and findings will be contributed back to the open-source community.
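
As a rough illustration of the two metrics named above, the following Python sketch compares the ranked document ids returned by two runs of the same query. It assumes each run is a list of unique ids; Jaccard similarity is used for retrieval overlap and Kendall's tau over the shared documents for rank correlation. These are common choices rather than the project's confirmed definitions, and the function names are hypothetical.

```python
def retrieval_overlap(run_a, run_b):
    """Jaccard overlap between the sets of documents retrieved by two runs."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def kendall_tau(run_a, run_b):
    """Kendall's tau over documents present in both runs (None if fewer than 2)."""
    in_b = set(run_b)
    shared = [doc for doc in run_a if doc in in_b]
    if len(shared) < 2:
        return None
    position_b = {doc: index for index, doc in enumerate(run_b)}
    concordant = discordant = 0
    # shared is already in run_a order, so only run_b's order needs checking.
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            if position_b[shared[i]] < position_b[shared[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

Two identical runs give an overlap of 1.0 and a tau of 1.0; lower values indicate that the retrieval step returned different documents or ordered them differently.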

From Friction to Flow: Why I'm Building Widgets for Reproducible Research

Tue, 24 Jun 2025 00:00:00 +0000

This summer, I’m building Jupyter Widgets to reduce friction in reproducible workflows on Chameleon. Along the way, I’m reflecting on what usability teaches us about the real meaning of reproducibility.

Supercomputing Competition: Reproducibility Reality Check

My first reproducibility experience threw me into the deep end—trying to recreate a tsunami simulation with a GitHub repository, a scientific paper, and a lot of assumptions. I was part of a student cluster competition at the Supercomputing Conference, where one of our challenges was to reproduce the results of a prior-year paper. I assumed “reproduce” meant something like “re-run the code and get the same numbers.” But what we actually had to do was rebuild the entire computing environment from scratch—on different hardware, with different software versions, and vague documentation. I remember thinking: If all these conditions are so different, what are we really trying to learn by conducting reproducibility experiments? That experience left me with more questions than answers, and those questions have stayed with me. In fact, they’ve become central to my PhD research.

Summer of Reproducibility: Lessons from 100+ Experiments on Chameleon

I’m currently a PhD student and research software engineer exploring questions around what computational reproducibility really means, and when and why it matters. I also participated in the Summer of Reproducibility 2024, where I helped assess over 100 public experiments on the Chameleon platform. Our analysis revealed key friction points—especially around usability—that don’t necessarily prevent reproducibility in the strictest sense, but introduce barriers in terms of time, effort, and clarity. These issues may not stop an expert from reproducing an experiment, but they can easily deter others from even trying. This summer’s project is about reducing that friction—some of which I experienced firsthand—by improving the interface between researchers and the infrastructure they rely on.

From Psychology Labs to Jupyter Notebooks: Usability is Central to Reproducibility

My thinking shifted further when I was working as a research software engineer at Purdue, supporting a psychology lab that relied on a complex statistical package. For most researchers in the lab, using the tool meant wrestling with cryptic scripts and opaque parameters. So I built a simple Jupyter-based interface to help them visualize input matrices, validate settings, and run analyses without writing code. The difference was immediate: suddenly, people could actually use the tool. It wasn’t just more convenient—it made the research process more transparent and repeatable. That experience was a turning point for me. I realized that usability isn’t a nice-to-have; it’s critical for reproducibility.

Since that first experience, I’ve leaned into building better interfaces for research workflows—especially using Jupyter Widgets. Over the past few years, I’ve developed and taught tutorials on how to turn scientific notebooks into interactive web apps, including at the SciPy conference in 2023 and 2024. These tutorials go beyond the basics: I focus on building real, multi-tab applications that reflect the complexity of actual research tools. Teaching others how to do this has deepened my own knowledge of the widget ecosystem and reinforced my belief that good interfaces can dramatically reduce the effort it takes to reproduce and reuse scientific code. That’s exactly the kind of usability work I’m continuing this summer—this time by improving the interface between researchers and the Chameleon platform itself.

Making Chameleon Even More Reproducible with Widgets

This summer, I’m returning to Chameleon with a more focused goal: reducing some of the friction I encountered during last year’s reproducibility project. One of Chameleon’s standout features is its Jupyter-based interface, which already goes a long way toward making reproducibility more achievable. My work builds on that strong foundation by improving and extending interactive widgets in the Python-chi library — making tasks like provisioning resources, managing leases, and tracking experiment progress on Chameleon even more intuitive. For example, instead of manually digging through IDs to find an existing lease, a widget could present your current leases in a dropdown or table, making it easier to pick up where you left off and avoid unintentionally reserving unnecessary resources. It’s a small feature, but smoothing out this kind of interaction can make the difference between someone giving up or trying again. That’s what this project is about.
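
The lease-picker idea could start from something like the sketch below, which turns a list of lease records into `(label, value)` pairs of the kind a dropdown widget accepts as options. It is only an illustration: the record fields (`id`, `name`, `status`, `end_date`) and the function name are assumptions, and no python-chi call is shown because the widget API is still being designed.

```python
from datetime import datetime


def lease_dropdown_options(leases, now=None):
    """Build (label, lease_id) pairs for active leases, sorted by label."""
    now = now or datetime.now()
    options = []
    for lease in leases:
        # Hypothetical status value; only leases that can still be used are listed.
        if lease["status"] != "ACTIVE":
            continue
        end = datetime.fromisoformat(lease["end_date"])
        hours_left = (end - now).total_seconds() / 3600
        label = f'{lease["name"]} ({hours_left:.0f} h left)'
        options.append((label, lease["id"]))
    return sorted(options)
```

Showing the remaining time next to each name is one way to help someone pick up an existing lease instead of reserving new resources.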

Looking Ahead: Building for People, Not Just Platforms

I’m excited to spend the next few weeks digging into these questions—not just about what we can build, but how small improvements in usability can ripple outward to support more reproducible, maintainable, and accessible research. Reproducibility isn’t just about rerunning code; it’s about supporting the people who do the work. I’ll be sharing updates as the project progresses, and I’m looking forward to learning (and building) along the way. I’m incredibly grateful to once again take part in this paid experience, made possible by the 2025 Open Source Research Experience team and my mentors.

Applying MLOps to overcome reproducibility barriers in machine learning research

Sun, 22 Jun 2025 00:00:00 +0000

About the Project

Hello! I’m Ahmed, an undergraduate Computer Science student at the University of Khartoum. I’m working on making machine learning research more reproducible on open-access research facilities like the Chameleon testbed, as part of the project Applying MLOps to overcome reproducibility barriers in machine learning research, mentored by Prof. Fraida Fund and Mohamed Saeed. My proposal aims to build a template generator that produces repositories for reproducible model training on the Chameleon testbed.

Reproducibility

We argue that unless reproducing research becomes as vital and mainstream part of scientific exploration as reading papers is today, reproducibility will be hard to sustain in the long term because the incentives to make research results reproducible won’t outweigh the still considerable costs

— Three Pillars of Practical Reproducibility Paper

By reproducibility in science, we refer to the ability to obtain consistent results using the same methods and conditions as a previous study. In simple words, if I use the same data and methodology that were used before, I should obtain the same results. This principle applies to almost every scientific field, including both core machine learning and machine learning applied to other sciences.

Challenges in Reproducibility

Just as the well-known paper on the reproducibility crisis in science was published in 2016, similar discussions have appeared in the machine learning research setting. The paper State of the Art: Reproducibility in Artificial Intelligence analyzed 400 papers from top AI conferences and found that around 6% shared code and approximately 33% shared test data. In contrast, 54% shared only pseudocode (a summary of the algorithm).

The lack of software dependency management, proper version control, log tracking, and effective artifact sharing makes it very difficult to reproduce machine learning research.

Reproducibility in machine learning is largely supported by MLOps practices. This is the case in industry, where most researchers are backed by software engineers who set up experimental environments or develop tools that streamline the workflow. In academic settings, however, reproducibility remains a great challenge: researchers prefer to focus on coding and worry little about the complexities involved in configuring their experimental environment. As a result, the adoption and standardization of MLOps practices in academia progresses slowly. The best way to ensure a seamless experience with MLOps is to make these capabilities easily accessible within the researchers’ workflow, by developing a tool that streamlines resource provisioning, environment setup, model training, and artifact tracking, and so helps ensure reproducible results.

Proposed Solution

We want researchers to spin up ML research instances or bare-metal nodes on the Chameleon testbed while the technical complexity of configuring and stitching everything together stays abstracted away. Users simply answer a few questions about their project info, frameworks, tools, features, and any integrations, and get a fully generated, reproducible project. The project contains a provisioning/infrastructure configuration layer for provisioning resources on the cloud; a Dockerfile to spin up services and persistent storage for data; and an ML tracking server that uses MLflow to log artifacts, metadata, environment configuration, system specifications (such as GPU type), and Git status, backed by PostgreSQL for storing metadata and a MinIO S3 bucket for storing artifacts. The ML code at its core runs in a containerized training environment backed by persistent storage for the datasets and the artifacts generated by the experiment; containerizing all of these components helps ensure reproducibility. We also aim to make the cloud experience easier by handling the configuration needed to set up an environment that uses third-party frameworks, so that benchmarking datasets and other necessary components from services such as Hugging Face and GitHub are easily accessible from the container. For more technical details about the solution, you can read my proposal here.
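
To give a feel for the kind of information the tracking layer records, here is a minimal standard-library sketch that gathers environment details and the current Git commit into one dictionary. It is an illustration under stated assumptions, not the generated template's code: the function names and dictionary keys are hypothetical, GPU detection is omitted, and in the real project a record like this would go to the MLflow tracking server rather than being built by hand.

```python
import platform
import subprocess


def git_commit():
    """Return the current Git commit hash, or None outside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def collect_run_metadata(params):
    """Bundle hyperparameters with the environment they were run in."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "git_commit": git_commit(),
        "params": dict(params),
    }
```

Storing this record next to each experiment's artifacts makes it possible to tell later which code version and environment produced a given result.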

By addressing these challenges, we can accelerate scientific discovery. This benefits not only those who conduct the research but also those who build on it in the future. I look forward to sharing more updates as the project progresses, and I welcome feedback from others interested in advancing reproducibility in ML research.

Building a Benchmarking Suite for Cache Performance Evaluation

Sat, 21 Jun 2025 00:00:00 +0000

Hi! I’m Haocheng Xia, a Computer Science student at the University of Illinois Urbana-Champaign, passionate about the intersection of machine learning and storage systems. Specifically, I’m keen on workload analysis and KV cache management for large language models.

This summer, I’m happy to be a part of SoR 2025 and OSRE 2025. I’m contributing to the CacheBench project. My initiative, ‘Building a Benchmarking Suite for Cache Performance Evaluation,’ will create a robust platform. This involves extensive simulation of existing eviction algorithms using libCacheSim, developing microbenchmarks, and building a user-friendly platform for researchers to effortlessly evaluate novel cache designs. The ultimate goal is to establish a competitive leaderboard.

My contributions will include a comprehensive dataset detailing simulated miss ratios and throughput of current cache eviction algorithms, an extension to libCacheSim for executing microbenchmarks both locally and on our online platform, and the creation and ongoing maintenance of a public web leaderboard. I’m grateful to be mentored by Juncheng Yang and Yazhuo Zhang.
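
For readers new to cache simulation, the sketch below shows how a miss ratio can be computed by replaying a request trace against a fixed-size cache. It is a toy Python illustration and does not use libCacheSim's API; the function name and the `refresh_on_hit` switch (True behaves like LRU, False like FIFO) are assumptions made for the example.

```python
from collections import OrderedDict


def miss_ratio(trace, capacity, refresh_on_hit):
    """Replay a trace of keys and return misses / requests.

    refresh_on_hit=True moves a hit key to the most-recent end (LRU);
    refresh_on_hit=False leaves insertion order untouched (FIFO).
    """
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            if refresh_on_hit:
                cache.move_to_end(key)
            continue
        misses += 1
        cache[key] = True
        if len(cache) > capacity:
            # Evict from the oldest end of the ordering.
            cache.popitem(last=False)
    return misses / len(trace) if trace else 0.0
```

Running both policies on the same trace and cache size is the basic comparison a leaderboard would repeat across many traces and algorithms.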

I’m thrilled to be part of building tools that empower users and advance the vision of a more decentralized web. Looking forward to a productive summer!

EnvGym – An AI System for Reproducible Custom Computing Environments

Mon, 16 Jun 2025 00:00:00 +0000

Hello, my name is Yiming Cheng. I am a pre-doc researcher in Computer Science at the University of Chicago. I’m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community as a project leader. My project, EnvGym, focuses on developing an AI-driven system to automatically generate and configure reproducible computing environments based on natural language descriptions from artifact descriptions, Trovi artifacts, and research papers.

The complexity of environment setup often hinders reproducibility in scientific computing. My project aims to bridge the knowledge gap between experiment authors and reviewers by translating natural language requirements into actionable, reproducible configurations using AI and NLP techniques.

Project Overview

EnvGym addresses fundamental reproducibility barriers by:

Using AI to translate natural language environment requirements into actionable configurations
Automatically generating machine images deployable on bare metal and VM instances
Bridging the knowledge gap between experiment authors and reviewers
Standardizing environment creation across different hardware platforms

June 10 – June 16, 2025

Getting started with the project setup and initial development:

I began designing the NLP pipeline architecture to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into structured environment “recipes”
I set up the initial project repository and development environment
I met with my mentor Prof. Kexin Pei to discuss the project roadmap and technical approach
I started researching existing artifact descriptions from conferences and Trovi to understand common patterns in environment requirements
I began prototyping the backend environment builder logic that will convert parsed requirements into machine-image definitions
I explored Chameleon’s APIs for provisioning servers and automated configuration
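
As a simple illustration of what a structured environment "recipe" might look like, the sketch below parses the example sentence from the list above with a regular expression. This is a deliberately naive stand-in, assuming comma- or "and"-separated `name version` items; the actual pipeline is intended to use AI and NLP techniques, and `parse_requirements` and the recipe format are hypothetical.

```python
import re

# A package name, optionally followed by a dotted version number.
REQUIREMENT = re.compile(r"([A-Za-z][A-Za-z0-9_\-]*)(?:\s+(\d+(?:\.\d+)*))?")


def parse_requirements(description):
    """Map each requested package name (lowercased) to a version or None."""
    _, found, tail = description.partition("need")
    text = tail if found else description
    recipe = {}
    for item in re.split(r",|\band\b", text):
        item = item.strip(" .")
        if not item:
            continue
        match = REQUIREMENT.fullmatch(item)
        if match:
            name, version = match.groups()
            recipe[name.lower()] = version
    return recipe
```

For the sentence "I need Python 3.9, CUDA 11, and scikit-learn", this yields a mapping with versions for `python` and `cuda` and no pinned version for `scikit-learn`, which a backend builder could then turn into install steps.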

Next Steps

Continue developing the NLP component for requirement parsing
Implement the core backend logic for environment generation
Begin integration with Chameleon Cloud APIs
Start building the user interface for environment specification

This is an exciting and challenging project that combines my interests in AI systems and reproducible research. I’m looking forward to building a system that will help researchers focus on their science rather than struggling with environment setup issues.

Thanks for reading, I will keep you updated as I make progress on EnvGym!

Smart Environments – An AI System for Reproducible Custom Computing Environments

Mon, 16 Jun 2025 00:00:00 +0000

Hi everyone, I’m Sam! I’m excited to be working with the Argonne National Laboratory and SoR this summer on Smart Environments. Have you ever encountered a great open-source project and wanted to run it or use it locally, only to find that it’s such a headache to set up all the dependencies? Maybe your system version wasn’t correct, or a piece of software was outdated, or the dependencies were incompatible with something you already had on your machine?

In comes EnvGym to save the day! We want EnvGym to be an agent that helps reproduce open-source projects by automatically setting up the environmental dependencies required to get them running. That’s what I will be working on for the rest of the summer! To make EnvGym work, we will be leveraging LLM agents to tackle the problem. We will use EnvGym to read documentation, understand code structures, run commands to set up environments, and reflectively react to any errors and warnings.

To build EnvGym, I have the following to-do’s in mind:

Building a dataset that includes repos to be reproduced
Establishing a baseline using current methods
Implementing the actual EnvGym algorithm
Testing EnvGym against baseline performance and iteratively improving it
Deploying EnvGym to real-world use cases and gathering feedback

Here is the repo that we are working on: https://github.com/EaminC/EnvGym/tree/main

More updates to come, thanks for reading!

Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments

Sun, 15 Jun 2025 00:00:00 +0000

Hello, my name is Zahra Temori. I am a rising senior in Computer Science at the University of Delaware. I’m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community. My project, cc-snapshot, focuses on enhancing features that help researchers capture and share reproducible experimental environments within the Chameleon Cloud testbed.

You can find detailed information about my project and my plans for the summer in my proposal.

June 10 – June 14, 2025

Getting started with the first milestone and beginning to explore the Chameleon Cloud and the project:

I began familiarizing myself with the Chameleon Cloud platform. I created an account and successfully accessed a project.
I learned how to launch an instance and create a lease for using computing resources.
I met with my mentor to discuss the project goals and outline the next steps.
I experimented with the environment and captured a snapshot to understand the process.

It has been less than a week, and I have learned a lot, especially about the Chameleon Cloud and how it differs from other clouds like AWS. I am excited to learn more and make progress.

Thanks for reading, I will keep you updated as I work :)

Developing an Open Testbed for Edge Replication System Evaluation

Sun, 15 Jun 2025 00:00:00 +0000

Hi, I’m Panji. I’m currently contributing to the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges under the mentorship of Fadhil I. Kurnia. You can find more details on the project proposal here.

The primary challenge we’re addressing is the current difficulty in fairly comparing different edge replication systems. To fix this, we’re trying to build a testing platform with four key parts. We’re collecting real data about how people actually use edge services, creating a tool that can simulate realistic user traffic across many locations, building a system that mimics network delays between hundreds of edge servers, and packaging everything into an open-source toolkit.

This will let researchers test different coordination methods like EPaxos, Raft, and others using the same data and conditions. We hope this will help provide researchers with a more standardized way to evaluate their systems. We’re working with multiple programming languages and focusing on making complex edge computing scenarios accessible to everyone in the research community.

One of the most interesting aspects of this project is tackling the challenge of creating realistic simulations that accurately reflect the performance characteristics different coordination protocols would exhibit in actual edge deployments. The end goal is to provide the research community with a standardized, reproducible environment for edge replication.

Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits

Sun, 15 Jun 2025 00:00:00 +0000

Hello! I’m Siva Sathyaseelan D N, a pre-final year B.Tech + M.Tech Engineering student at IIT BHU, Varanasi, India. With a deep-rooted passion for software development and scientific computing, I thrive at the intersection of code and real-world problem-solving. For two years, I’ve engaged in open-source work across scientific simulation, blockchain, and cloud-native technologies through hobby projects, hackathons, internships, and an LFX mentorship. I will be working on Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits under the mentorship of Ben Greenman. My proposal can be viewed here!

Building a Simulator for Benchmarking Replicated Systems

Sat, 14 Jun 2025 00:00:00 +0000

Hi, I’m Michael. I’m currently contributing to the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges under the mentorship of Fadhil Kurnia. You can find more details on the project proposal here.

What we are trying to achieve is a system to test and evaluate the performance of different consensus protocols and consistency models under the same application and workload. Both are tested on various replicated black-box applications. Essentially, the testbed can deploy any arbitrary stateful application on multiple machines (nodes) as long as it is packaged as a Docker image. The consensus protocol synchronizes the stateful part of the application (in most cases, the database) across nodes. The goal is that, by the end of this project, the testbed will provide the functionality and abstractions needed to create new consensus protocols and run tests on them.

One major challenge in implementing this is handling replication for the running Docker containers. Generally, the services that can be deployed in this system are of two types:

A Deterministic Application (an application that always returns the same output when given the same input, e.g., a simple CRUD app)
A Non-Deterministic Application (an application that may return different outputs when given the same input, e.g., an LLM, which may return different responses to the same prompt)

These two application types require different implementations of consensus protocols. For a deterministic application, since every request always yields the same response (and the same changes inside the application’s database), the replication protocol can replicate the request itself to all nodes. For a non-deterministic application, on the other hand, the replication protocol synchronizes the state of the database directly, since the same request may return a different response.
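
The difference between the two strategies can be seen in a toy Python model. This is only a sketch under simplifying assumptions: it ignores request ordering, failures, and the consensus protocol itself, and the `Replica` class and function names are hypothetical rather than part of the testbed. Each replica's private random generator stands in for application-internal non-determinism.

```python
import random


class Replica:
    """A toy stateful application instance with its own source of randomness."""

    def __init__(self, seed):
        self.state = {}
        self._rng = random.Random(seed)

    def execute(self, key, deterministic):
        # Deterministic apps derive the value from the request alone.
        value = len(key) if deterministic else self._rng.random()
        self.state[key] = value
        return value


def replicate_request(replicas, key, deterministic):
    """Request replication: every replica executes the same request."""
    for replica in replicas:
        replica.execute(key, deterministic)


def replicate_state(replicas, key, deterministic):
    """State replication: one replica executes, the others copy its state."""
    primary, backups = replicas[0], replicas[1:]
    primary.execute(key, deterministic)
    for backup in backups:
        backup.state = dict(primary.state)


def converged(replicas):
    """True when all replicas hold identical state."""
    return all(replica.state == replicas[0].state for replica in replicas)
```

Replicating requests keeps replicas identical only when execution is deterministic; with a non-deterministic application the replicas diverge unless the resulting state is copied instead.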

MPI Appliance for HPC Research on Chameleon

Sat, 14 Jun 2025 00:00:00 +0000

Hi Everyone,

I’m Rohan Babbar from Delhi, India. This summer, I’m excited to be working with the Argonne National Laboratory and the Chameleon Cloud community. My project focuses on developing an MPI Appliance to support reproducible High-Performance Computing (HPC) research on the Chameleon testbed.

For more details about the project and the planned work for the summer, you can read my proposal here.

👥 Community Bonding Period

Although the project officially started on June 2, 2025, I made good use of the community bonding period beforehand.

I began by getting access to the Chameleon testbed, familiarizing myself with its features and tools.
I experimented with different configurations to understand the ecosystem.
My mentor, Ken Raffenetti, and I had regular check-ins to align our vision and finalize our milestones, many of which were laid out in my proposal.

🔧 June 2 – June 14, 2025

Our first milestone was to build a base image with MPI pre-installed. For this:

We decided to use Spack, a flexible package manager tailored for HPC environments.
The image includes multiple MPI implementations, allowing users to choose the one that best suits their needs and switch between them using simple Lua Module commands.

📌 That’s all for now! Stay tuned for more updates in the next blog.

Thanks for reading!

StatWrap: Cross-Project Searching and Classification using Local Indexing

Sat, 14 Jun 2025 00:00:00 +0000

Hello👋! I am Debangi Ghosh, currently pursuing a degree in Mathematics and Computing at IIT (BHU) Varanasi, India. This summer, I will be working on the StatWrap: Cross-Project Searching and Classification using Local Indexing project under the mentorship of Luke Rasmussen. You can view my project proposal for more details.

My project aims to address the challenges in project navigation and discoverability by integrating a robust full-text search capability within the user interface. Instead of relying on basic keyword-based search—where remembering exact terms can be difficult—we plan to implement a natural language-based full-text search. This approach involves two main stages: indexing, which functions like creating a searchable map of the content, and searching, which retrieves relevant information from that map. We will evaluate and compare available open-source libraries to choose and implement the most effective one. In addition, my project aims to enhance project organization by introducing a new classification system that clearly distinguishes between “Active” and “Past” projects in the user interface. This will improve clarity, reduce clutter, and provide a more streamlined experience as the number of projects grows.

Stay tuned for updates on my progress in the coming weeks! 🚀

Applying MLOps to overcome reproducibility barriers in machine learning research

Sat, 01 Mar 2025 00:00:00 +0000

Topics: machine learning, MLOps, reproducibility
Skills: Python, machine learning, GitOps, systems, Linux, data, Docker
Difficulty: Hard
Size: Large (350 hours)
Mentors: Fraida Fund and Mohamed Saeed

Project Idea Description

Reproducibility remains a significant problem in machine learning research, both in core ML and in the application of ML to other areas of science. In many cases, due to inadequate experiment tracking, dependency capturing, source code versioning, data versioning, and artifact sharing, even the authors of a paper may find it challenging to reproduce their own study several years later. This makes it difficult to validate and build on previous work, and raises concerns about its trustworthiness.

In contrast, outside of academic research, MLOps tools and frameworks have been identified as a key enabler of reliable, reproducible, and trustworthy machine learning systems in production. A good reference on this topic is:

Firas Bayram and Bestoun S. Ahmed. 2025. Towards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach. ACM Comput. Surv. 57, 5, Article 121 (May 2025), 35 pages. https://doi.org/10.1145/3708497

This project seeks to bridge the gap between widely adopted practices in industry and academic research:

by making it easier for researchers and scientists to use MLOps tools to support reproducibility. To achieve this, we will develop starter templates and recipes for research in computer vision, NLP, and ML for science, that have reproducibility “baked in” thanks to the integration of MLOps tools and frameworks. Researchers will launch these templates on open access research facilities like Chameleon.
and, by developing complementary education and training materials to emphasize the importance of reproducibility in ML, and how the tools and frameworks used in the starter templates can support this goal.

Writing a successful proposal for this project

A good proposal for this project should -

demonstrate a good understanding of the current barriers to reproducibility in machine learning research (specific examples are welcome),
describe a “base” starter template, including the platforms and tools that will be integrated, as well as specific adaptations of this template for computer vision, NLP, and ML for science,
explain the “user flow” - how a researcher would use the template to conduct an experiment or series of experiments, what the lifecycle of that experiment would look like, and how it would be made reproducible,
include the contributor’s own ideas about how to make the starter templates more usable, and how to make the education and training materials relatable and useful,
and show that the contributor has the necessary technical background and soft skills to contribute to this project. In particular, the contributor will need to create education and training materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor’s writing abilities.

GitHub link

There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the Teaching on Testbeds organization, and the project materials will “live” there.

FairFace

Fri, 28 Feb 2025 00:00:00 +0000

FairFace: Reproducible Bias Evaluation in Facial AI Models via Controlled Skin Tone Manipulation

Bias in facial AI models remains a persistent issue, particularly concerning skin tone disparities. Many studies report that AI models perform differently on lighter vs. darker skin tones, but these findings are often difficult to reproduce due to variations in datasets, model architectures, and evaluation settings. The goal of this project is to investigate bias in facial AI models by manipulating skin tone and related properties in a controlled, reproducible manner. By leveraging BioSkin, we will adjust melanin levels and other skin properties on existing human datasets to assess whether face-based AI models (e.g., classification and vision-language models) exhibit biased behavior toward specific skin tones.

Topics: Fairness & Bias in AI, Face Recognition & Vision-Language Models, Dataset Augmentation for Reproducibility
Skills: Machine Learning & Computer Vision, Deep Learning (PyTorch/TensorFlow), Data Augmentation & Image Processing, Reproducibility & Documentation (GitHub, Jupyter Notebooks).
Difficulty: Moderate
Size: Medium or Large (can be completed in either 175 or 350 hours, depending on the depth of analysis and number of models tested)
Mentors: James Davis, Alex Pang

Key Research Questions

Do AI models perform differently based on skin tone?
- How do classification accuracy, confidence scores, and error rates change when skin tone is altered systematically?
What are the underlying causes of bias?
- Is bias solely dependent on skin tone, or do other skin-related properties (e.g., texture, reflectance) contribute to model predictions?
- Is bias driven by dataset imbalances (e.g., underrepresentation of certain skin tones)?
- Do facial features beyond skin tone (e.g., structure, expression, pose) contribute to biased predictions?
Are bias trends reproducible?
- Can we replicate bias patterns across different datasets, model architectures, and experimental setups?
- How consistent are the findings when varying image sources and preprocessing methods?

Specific Tasks:

Dataset Selection & Preprocessing
- Choose appropriate face/human datasets (e.g., FairFace, CelebA, COCO-Human).
- Preprocess images to ensure consistent lighting, pose, and resolution before applying transformations.
Skin Tone Manipulation with BioSkin
- Systematically modify melanin levels while keeping facial features unchanged.
- Generate multiple variations per image (lighter to darker skin tones).
Model Evaluation & Bias Analysis
- Test face classification models (e.g., ResNet, FaceNet) and vision-language models (e.g., BLIP, LLaVA) on the modified images.
- Compute fairness metrics (e.g., demographic parity, equalized odds).
Investigate Underlying Causes of Bias
- Compare model behavior across different feature sets.
- Test whether bias persists across multiple datasets and model architectures.
Ensure Reproducibility
- Develop an open-source pipeline for others to replicate bias evaluations.
- Provide codebase and detailed documentation for reproducibility.
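The two fairness metrics named above can be sketched in plain Python. This is a minimal illustration under stated assumptions: binary labels and predictions, and two hypothetical group names ("lighter" and "darker") standing in for skin-tone variants of the same images. It is not the project's evaluation code, and the data is invented.

```python
def rate(values):
    """Fraction of positive (1) entries; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0

def demographic_parity_difference(y_pred, groups):
    """Gap between the highest and lowest positive-prediction rate across groups."""
    by_group = {}
    for pred, group in zip(y_pred, groups):
        by_group.setdefault(group, []).append(pred)
    rates = [rate(preds) for preds in by_group.values()]
    return max(rates) - min(rates)

def equalized_odds_difference(y_true, y_pred, groups):
    """Largest gap across groups in true-positive rate or false-positive rate."""
    tpr, fpr = {}, {}
    for group in set(groups):
        rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
        tpr[group] = rate([p for t, p in rows if t == 1])
        fpr[group] = rate([p for t, p in rows if t == 0])
    tpr_gap = max(tpr.values()) - min(tpr.values())
    fpr_gap = max(fpr.values()) - min(fpr.values())
    return max(tpr_gap, fpr_gap)

# Invented toy data: four images per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["lighter"] * 4 + ["darker"] * 4

dp_gap = demographic_parity_difference(y_pred, groups)
eo_gap = equalized_odds_difference(y_true, y_pred, groups)
print(dp_gap, eo_gap)
```

In this toy data both groups receive positive predictions at the same rate, so the demographic parity gap is 0.0, yet the equalized odds gap is 0.5 because the errors fall unevenly across groups. This is one reason to report more than one metric.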

Enhancing Reproducibility in Distributed AI Training: Leveraging Checkpointing and Metadata Analytics

Fri, 21 Feb 2025 09:00:00 -0700

Reproducibility in distributed AI training is a crucial challenge due to several sources of uncertainty, including stragglers, data variability, and inherent randomness. Stragglers (slower processing nodes in a distributed system) can introduce timing discrepancies that affect the synchronization of model updates, leading to inconsistent states across training runs. Data variability, stemming from non-deterministic data shuffling and differing data partitions across nodes, can also lead to variations in model performance. Additionally, inherent randomness, such as random weight initialization and stochastic processes like dropout, further compounds these challenges. Reproducibility in AI is pivotal for ensuring the credibility of AI-driven scientific findings, akin to how reproducibility underpins traditional scientific research.

To enhance AI reproducibility, leveraging metadata analytics and visualization along with saved checkpoints offers a promising solution. Checkpointing in AI training is a pivotal technique that involves saving snapshots of a model and its parameters at regular intervals throughout the training process. This practice is essential for maintaining progress in the face of potential interruptions, such as hardware failures, and enables the resumption of training without having to restart from scratch. In the context of distributed AI training, checkpointing also provides a framework for analyzing and ensuring reproducibility, offering a means to systematically capture and review the training trajectory of models. Analyzing checkpoints can specifically help identify issues like stragglers, which are slower computing nodes in a distributed system that can impede synchronized progress. For example, by examining the time stamps and resource utilization data associated with each checkpoint, anomalies in processing time can be detected, revealing nodes that consistently lag behind others. This analysis enables teams to diagnose performance bottlenecks and optimize resource allocation across the distributed system, ensuring smoother and more consistent training runs. By combining checkpointing with metadata analytics, it becomes possible to pinpoint the exact training iterations where delays occur, thereby facilitating targeted investigations and solutions to improve overall system reproducibility and efficiency.
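The timestamp analysis described above can be sketched as follows. This is a minimal illustration in Python, assuming a hypothetical metadata layout in which each checkpoint records how many seconds every node took to reach it. The `threshold` and `min_fraction` parameters are arbitrary illustrative choices, not values from the project.

```python
from statistics import median

def find_stragglers(checkpoint_times, threshold=1.5, min_fraction=0.5):
    """Return nodes that are slow in at least `min_fraction` of checkpoints.

    checkpoint_times: {iteration: {node_name: seconds_to_reach_checkpoint}}
    A node counts as slow at a checkpoint when its time exceeds
    `threshold` times the median time of all nodes at that checkpoint.
    """
    slow_counts = {}
    for per_node in checkpoint_times.values():
        typical = median(per_node.values())
        for node, seconds in per_node.items():
            slow_counts.setdefault(node, 0)
            if seconds > threshold * typical:
                slow_counts[node] += 1
    total = len(checkpoint_times)
    return sorted(
        node for node, count in slow_counts.items()
        if total and count / total >= min_fraction
    )

# Invented timings for three checkpoints on a three-node job.
times = {
    100: {"node0": 10.0, "node1": 10.5, "node2": 19.0},
    200: {"node0": 10.2, "node1": 10.1, "node2": 21.0},
    300: {"node0": 17.0, "node1": 10.3, "node2": 20.5},
}
stragglers = find_stragglers(times)
print(stragglers)
```

Comparing against the per-checkpoint median rather than a fixed time limit keeps the check meaningful when the whole job slows down, for example during a larger evaluation step.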

Workplan

The proposed work will include: 1) Setting up a checkpointing system within the distributed AI training framework to periodically save model states and metadata; 2) Designing a metadata analysis schema for populating model and system statistics from the saved checkpoints; 3) Conducting exploratory data analysis to identify patterns, anomalies, and sources of variability in the training process; 4) Creating visualization tools to represent metadata insights with collected statistics and patterns; 5) Using insights from metadata analytics and visualization to optimize resource distribution across the distributed system and mitigate straggler effects; and 6) Disseminating results and methodologies through academic papers, workshops, and open-source contributions.

Topics: Reproducibility, AI, distributed AI, checkpointing, metadata analysis
Skills: C/C++, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Luanzheng "Lenny" Guo

Enhancing Reproducibility in RAG Frameworks for Scientific Workflows

Thu, 20 Feb 2025 09:00:00 -0700

Retrieval-Augmented Generation (RAG) frameworks, which merge the capabilities of retrieval systems and generative models, significantly enhance the relevance and accuracy of responses produced by large language models (LLMs). These frameworks retrieve relevant documents from a large corpus and use these documents to inform the generative process, thereby improving the contextuality and precision of the generated content. Ensuring reproducibility in data queries using similarity search within these RAG frameworks is critical for maintaining the reliability and consistency of scientific workflows. Reproducibility ensures that the same input query consistently yields the same output, which is vital for scientific tasks that rely on precise and repeatable results. Inconsistencies can arise from various sources, affecting the trustworthiness of scientific outcomes. Differences in retrieval algorithms can lead to variable sets of documents being retrieved for the same query. Variations in data indexing methods can cause inconsistencies in how documents are ranked and accessed. The stochastic nature of LLM operations introduces an element of randomness in the generative process. Updates in datasets can also alter the baseline against which queries are processed and interpreted, leading to different results over time.

This proposal aims to address these reproducibility challenges in similarity searches within RAG frameworks. This work involves analyzing the root causes of non-determinism, benchmarking and validating the consistency of query results, implementing enhancements to minimize variability, and developing tools and best practices to ensure reproducibility. Reproducibility in data queries can be influenced by several factors, including updates in datasets, differences in retrieval algorithms, varying data indexing methods, and the stochastic nature of LLM operations. Each of these factors can cause variability in the documents retrieved and in the generated responses. Ensuring consistency in query results across different runs is crucial for maintaining the integrity of LLM-driven scientific research, allowing researchers to confidently build upon prior work and achieve reliable, trustworthy outcomes.

Workplan

The proposed work will include: (1) Identifying sources of non-determinism and variability, such as algorithmic differences and indexing methods, in RAG; (2) Utilizing standardized scientific datasets to benchmark the reproducibility of similarity search results across different RAG frameworks; (3) Establishing protocols for handling dataset updates to ensure that such changes do not impact the reproducibility of similarity search results; and (4) Implementing mechanisms to track and document updates to datasets, ensuring that changes are reflected consistently across all instances of the RAG framework. By addressing these areas, the proposed work aims to mitigate challenges related to reproducibility in similarity search queries within RAG frameworks, ultimately enhancing the reliability and trustworthiness of scientific research outcomes.
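Two of the ideas above, making result ordering repeatable and tracking dataset changes, can be sketched in a few lines of Python. The example uses a brute-force exact cosine search over a tiny in-memory corpus, which is an assumption made for clarity; production RAG frameworks typically rely on approximate indexes. The function names and document IDs are hypothetical.

```python
import hashlib
import json
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query, corpus, k):
    """Return the k most similar document IDs, breaking score ties by ID."""
    scored = [(round(cosine(query, vec), 12), doc_id) for doc_id, vec in corpus.items()]
    # Sort by descending score, then ascending ID, so the order does not
    # depend on how the corpus happened to be stored or iterated.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:k]]

def fingerprint(obj):
    """Stable SHA-256 digest of a JSON-serializable object."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

corpus = {"doc-b": [1.0, 0.0], "doc-a": [2.0, 0.0], "doc-c": [0.0, 1.0]}
query = [1.0, 0.0]
results = top_k(query, corpus, 2)
print(results, fingerprint(corpus)[:12])
```

Storing the corpus fingerprint next to each result set makes it possible to tell whether two differing answers came from a dataset update or from non-determinism in the retrieval step.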

Topics: Reproducibility, LLM, RAG, Scientific Workflows
Skills: C/C++, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Luanzheng "Lenny" Guo

Exploration of I/O Reproducibility with HDF5

Wed, 19 Feb 2025 09:00:00 -0700

Parallel I/O is a critical component in high-performance computing (HPC), allowing multiple processes to read and write data concurrently from a shared storage system. HDF5—a widely adopted data model and library for managing complex scientific data—supports parallel I/O but introduces challenges in I/O reproducibility, where repeated executions do not always produce identical results. This lack of reproducibility can stem from non-deterministic execution orders, variations in collective buffering strategies, and race conditions in metadata and dataset chunking operations within HDF5’s parallel I/O hierarchy. Moreover, many HDF5 operations that leverage MPI I/O require collective communication; that is, all processes within a communicator must participate in operations such as metadata creation, chunk allocation, and data aggregation. These collective calls ensure that the file structure and data layout remain consistent across processes, but they also introduce additional synchronization complexity that can impact reproducibility if not properly managed. In HPC scientific workflows, consistent I/O reproducibility is essential for accurate debugging, validation, and benchmarking, ensuring that scientific results are both verifiable and trustworthy. Tools such as h5bench—a suite of I/O kernels designed to exercise HDF5 I/O on parallel file systems—play an important role in identifying these reproducibility challenges, tuning performance, and ultimately supporting the overall robustness of large-scale scientific applications.

Workplan

The proposed work will include (1) analyzing and characterizing parallel I/O operations in HDF5 with h5bench miniapps, (2) exploring and validating potential reproducibility challenges within the parallel I/O hierarchy (e.g., MPI I/O), and (3) implementing solutions to address parallel I/O reproducibility.

Topics: Parallel I/O, MPI-I/O, Reproducibility, HPC, HDF5
Skills: C/C++, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Luanzheng "Lenny" Guo and Wei Zhang

Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments

Tue, 18 Feb 2025 00:00:00 +0000

Overview

A critical challenge in computer systems research reproducibility is establishing and sharing experimental environments. While open testbeds like Chameleon provide access to hardware resources, researchers still face significant barriers when attempting to recreate the precise software configurations, dependencies, and system states needed for reproducible experiments. Environment snapshotting tools offer a solution, but face technical challenges in consistently capturing running systems without introducing distortions or requiring disruptive system modifications. This project addresses these fundamental reproducibility barriers by enhancing CC-Snapshot, a tool that captures the experimental environment configured by the user on bare metal images, to create more reliable and consistent system captures that can be shared and redeployed without loss of fidelity.

CC-Snapshot is a tool on the Chameleon testbed that enables users to package their customized environments as complex images or appliances. By allowing researchers to share these environments easily, CC-Snapshot offers a powerful mechanism for reproducibility, ensuring that experiments can be replicated and extended by others.

In this project, you will review existing CC-Snapshot workflows, research the latest snapshotting technologies, and develop enhancements that improve the tool’s usability and reliability. This includes ensuring snapshots are created consistently (even when the OS is actively running), preserving the integrity of user systems, and exploring advanced features such as out-of-band snapshotting and API-based triggers.

Key Outcomes

Improved Snapshot Consistency: New methods to capture the full state of a disk without risking corruption or data inconsistency.
Enhanced Reproducibility: A refined workflow that allows researchers to reliably share custom environments, facilitating collaborative and repeatable experiments.
User-Friendly Tooling: Streamlined processes that reduce disruption to running systems—so installing dependencies or rebooting into special environments is less burdensome.
Exploratory Features (Stretch Goals): Advanced mechanisms to stream disk data in real time during snapshotting and to initiate snapshots via an API call (for parity with VM snapshots).

Topics: Cloud Computing, Systems & Infrastructure, Reproducibility, Operating System Internals

Skills: Linux / OS Concepts, Cloud Tools, Systems Programming / Scripting, DevOps / CI

Difficulty: Moderate

Size: Medium

Mentors: Michael Sherman, Mark Powers

Tasks:

Ensure Snapshot Consistency
- Reboot into a ramdisk and copy the offline disk.
- Use kexec to switch to/from a ramdisk environment without a full reboot.
- Change images to use snapshot-capable storage (e.g., LVM volumes) for safer live snapshots.
- Investigate additional methods (e.g., blog.benjojo.co.uk) for safely imaging live disks.
Prevent System Modifications During Snapshot
- Currently, CC-Snapshot installs dependencies (e.g., qemu-img) on the running system, affecting its state.
- In-Band Fix: Download and run tools in a temp directory with static linking, avoiding system-level changes.
- Out-of-Band Approach: Snapshots done via ramdisk or kexec do not require altering the running system.
API-Triggered Snapshots
- Extend or integrate with the Nova “snapshot instance” API to support the same workflow for bare metal.
- Leverage Ironic’s new “service steps” feature for an automated snapshot pipeline.
(Stretch Goal) Streaming Snapshots
- Modify the workflow to stream data directly to storage, rather than making a full local copy first.
- Explore incremental or differential snapshot techniques to reduce bandwidth usage and storage overhead.
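The incremental snapshot idea in the stretch goal can be illustrated with block-level hashing. The sketch below is a simplified Python model that treats a disk image as a byte string and uses a tiny block size for readability. It is not how CC-Snapshot works today, and the function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # Tiny on purpose; real tools use much larger blocks.

def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of a base image."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

def changed_blocks(base_hashes, current, block_size=BLOCK_SIZE):
    """Return {block_index: block_bytes} for blocks that differ from the base."""
    delta = {}
    for index, offset in enumerate(range(0, len(current), block_size)):
        block = current[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(base_hashes) or base_hashes[index] != digest:
            delta[index] = block
    return delta

def apply_delta(base, delta, new_length, block_size=BLOCK_SIZE):
    """Rebuild the current image from the base image plus the changed blocks."""
    image = bytearray(base[:new_length].ljust(new_length, b"\0"))
    for index, block in delta.items():
        start = index * block_size
        image[start:start + len(block)] = block
    return bytes(image)

base = b"AAAABBBBCCCC"
current = b"AAAAXXXXCCCCDD"
delta = changed_blocks(block_hashes(base), current)
print(sorted(delta))
```

Only the blocks in `delta` would need to be transferred, and `apply_delta` shows that the full image can be reconstructed from the base plus those blocks.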

Chameleon Trovi Support for Complex Experiment Appliances

Tue, 18 Feb 2025 00:00:00 +0000

Overview

The discoverability and accessibility of research artifacts remains a significant barrier to reproducibility in computer science research. While digital libraries index research papers, they rarely provide direct access to the artifacts needed to reproduce experiments, especially complex multi-node systems. Additionally, when artifacts are available, they often lack standardized metadata, versioning, and deployment mechanisms that would enable researchers to easily find and reuse them. This project addresses these challenges by extending Trovi, a repository of experimental artifacts executable on open platforms, to support complex, multi-node appliances, making sophisticated experimental environments discoverable, shareable, and deployable through a standardized interface - ultimately lowering the barriers to reproducing complex systems experiments.

Chameleon has historically enabled researchers to orchestrate complex appliances (large, multi-node clusters configured via OpenStack Heat) to conduct advanced experiments. Meanwhile, the Chameleon team introduced Trovi as a repository for open platforms (beyond Chameleon) that pioneers mechanisms for artifact and platform integration, leading to immediate execution for practical reproducibility. This project aims to bridge the two by adding support in Trovi for importing, discovering, and launching complex appliances. By integrating these capabilities, researchers will be able to one-click deploy complex appliances directly from the Trovi dashboard, archive them for future reference, and reproduce experiments on demand.

Key Outcomes

Extended Trovi API: Enable the import and management of complex appliances as artifacts.
Streamlined One-Click Launch: Integrate with Chameleon’s existing provisioning workflows so users can launch multi-node clusters directly from Trovi.
Enhanced Dashboard Experience: Provide UI assistance for discovering, reviewing, and customizing complex appliance artifacts.
Improved Artifact Reproducibility: Automate the process of exporting CC-Snapshot images and other resources to ensure everything is preserved across sites (UC, TACC), highlighting any parameters that need user attention for cross-site portability.

Topics: Reproducible Research, Cloud Computing & Orchestration, OpenStack Heat, UI/UX & Web Development

Skills: Python, APIs, Cloud (OpenStack), DevOps & Automation, Frontend

Difficulty: Hard

Size: Large

Mentors: Mark Powers

Tasks:

Extensions to the Trovi API
- Add support for importing complex appliances as artifacts (including Heat templates, metadata, and associated disk images).
- Develop methods for tagging, versioning, and categorizing these appliances, making them easier to discover.
One-Click Launch of Complex Appliances
- Integrate with Chameleon’s orchestration engine, enabling single-click cluster deployments from the Trovi UI.
- Validate correct configuration and resource availability through automated checks.
Trovi Dashboard Enhancements
- Update the front-end to provide intuitive controls for customizing or parameterizing complex appliances before launching.
- Offer a clear workflow for reviewing dependencies, resource requirements, and usage instructions.
Automated Export & Multi-Site Testing
- Streamline the export of snapshots or images into Trovi as part of the appliance import process.
- Optionally re-run the imported appliances at multiple sites (UC, TACC), detecting any unparameterized settings or missing dependencies.

Contextualization – Extending Chameleon’s Orchestration for One-Click Experiment Deployment

Tue, 18 Feb 2025 00:00:00 +0000

Overview

Reproducibility in computer systems research is often hindered by the quality and completeness of artifact descriptions and the complexity of establishing experimental environments. When experiments involve multiple interconnected components, researchers struggle with hardcoded configurations, inadequate documentation of setup processes, and missing validation steps that would verify correct environment establishment. This project addresses these challenges by extending orchestration capabilities beyond basic hardware provisioning to include comprehensive contextualization—making complex, multi-component experimental environments deployable via parameterized templates with clear validation points, standardized metadata, and minimal user intervention—thus significantly reducing the barriers to reproducing complex distributed systems experiments.

Chameleon already provides powerful capabilities to orchestrate and configure resources through Heat templates (similar to Terraform) and the python-chi library. However, these focus primarily on provisioning (i.e., allocating and configuring hardware resources). This project goes a step further by addressing contextualization—the process of creating complete, ready-to-use experimental environments that incorporate everything from network layout to instance-level configuration and discovery—with additional features such as parameterized templates, experiment-level metadata, and output reporting.

Key Outcomes

Template-Based One-Click Launch: Users can deploy multi-resource experiments (VMs, networks, storage, etc.) via a single click or a minimal set of input parameters.
Enhanced Experiment Contextualization: Each launched resource can gain access to global “experiment-level” metadata (e.g., IP-to-hostname mappings for cluster authentication) and outputs that summarize important details.
Streamlined User Experience: An asynchronous deployment workflow that provides notifications and uses “outputs” to highlight critical connection information (e.g., bastion host IP, final results).
Optional Advanced Features: Partial reconfiguration to avoid full rebuilds when changes are minor, an “export” function to capture existing deployments into a new template, and potential publishing to Trovi for reproducibility and archiving.

Topics: Cloud Computing & Orchestration, Infrastructure as Code, DevOps & Automation, Reproducible Research Environments

Skills:

OpenStack & Heat Templates: Familiarity with provisioning resources on Chameleon using Heat or Terraform-like workflows.
Python & Scripting: For enhancing or extending the python-chi library.
Systems / Network Knowledge: Understanding multi-VM topologies, cluster configurations, and network-level interactions.
CI/CD & DevOps: Experience building or integrating asynchronous deployment and notifications.

Difficulty: Hard

Size: Large (suitable for a semester-long project or a summer internship)

Mentors: Paul Marshall

Tasks:

One-Click Template Launch
- Design a template (in Heat or similar) specifying multiple cloud resources (images, networks, disk images, SSH keys, etc.).
- Ensure the template author can define input parameters with defaults.
- Allow the user to launch the template quickly with default values or adjust parameters before deployment.
Asynchronous Provisioning & Notifications
- Implement a long-running process that deploys resources step-by-step.
- Provide status updates to the user (e.g., via UI notifications, email, or logs) when deployments complete or fail.
Experiment-Level Metadata
- Inject metadata such as IP-to-hostname mappings to each instance for easy cluster authentication.
- Allow the template to define “outputs” (like a public IP of a bastion or location of final results).
Partial Reconfiguration (Optional)
- Enable partial updates if only one of several servers changes, saving time and resources.
- Improve fault tolerance by avoiding full redeploys in the event of partial failures.
Export Running Configurations into a New Template (Optional)
- Build a web-interface or script to detect existing user-owned resources (servers, networks, etc.).
- Generate a proposed template from those resources, suggesting parameters (e.g., flavor, disk image, or SSH key).
- Extend or modify existing templates by adding discovered resources.
Integration with Trovi / Multi-Site Testing (Optional)
- Provide a method to archive or publish the final template (and associated disk images, data sets) in Trovi.
- Attempt to re-run the template at multiple Chameleon sites (e.g., UC, TACC) to identify parameters or modifications needed for cross-site reproducibility.
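Two of the tasks above, launching with default parameters and injecting IP-to-hostname mappings, can be sketched independently of any orchestration engine. The Python below is a minimal illustration. The parameter names, image name, and addresses are placeholders, and nothing here calls Heat or python-chi.

```python
def merge_parameters(defaults, overrides):
    """Combine template defaults with user overrides, rejecting unknown names."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**defaults, **overrides}

def render_hosts(instances):
    """Render /etc/hosts-style lines from a {hostname: ip} mapping.

    Entries are sorted by hostname so every instance receives identical text.
    """
    lines = [f"{ip}\t{name}" for name, ip in sorted(instances.items())]
    return "\n".join(lines) + "\n"

# Placeholder template defaults; a user overrides only the node count.
defaults = {"node_count": 2, "image": "example-image", "key_name": "example-key"}
params = merge_parameters(defaults, {"node_count": 4})

# Placeholder addresses that an orchestration engine would report after launch.
instances = {"worker-1": "10.0.0.12", "bastion": "10.0.0.10", "worker-0": "10.0.0.11"}
hosts_text = render_hosts(instances)
print(params)
print(hosts_text, end="")
```

Rejecting unknown parameter names early gives the user a clear error before any resources are allocated.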

MPI Appliance for HPC Research on Chameleon

Tue, 18 Feb 2025 00:00:00 +0000

Overview

Message Passing Interface (MPI) is the dominant programming model for high-performance computing (HPC), enabling applications to scale efficiently across thousands of processing cores. In reproducibility initiatives for HPC research, MPI implementations are critical as they manage the complex communications that underpin parallel scientific applications. However, reproducing MPI-based experiments remains challenging due to the need for specific library versions, network configurations, and multi-node setups that must be precisely orchestrated.

Because an “MPI cluster” is a base layer for many HPC results, the SC24 reproducibility chair specifically requested an MPI template and appliance to support the conference’s reproducibility initiative, providing researchers with standardized environments for validating results. By extending the work begun for SC24, this project aims to create higher-quality, ready-to-use, and maintainable MPI environments for the Chameleon testbed that abstract away complex configuration details while ensuring consistent performance across experiments, thus making HPC experiments more accessible and reproducible for the broader research community.

You will lead efforts to configure disk images with the necessary MPI dependencies and provide orchestration templates that set up networking and instances automatically. The resulting appliance will allow researchers to quickly and consistently deploy distributed computing environments with MPI. The goal is to facilitate reproducible and scalable computational experiments for a wide range of scientific and engineering applications.

Key Outcomes

Ready-to-Use MPI Disk Images: Create one or more images pre-configured with the correct versions of MPI and dependencies, ensuring a consistent environment.
Simple Cluster Configuration Scripts: Provide scripts or playbooks that efficiently bring up a fully functional MPI cluster on Chameleon, abstracting away manual setup steps.
Orchestration Template: An automated workflow that sets up networks, instances, and additional resources needed to run large-scale MPI workloads.

Topics: High-Performance Computing (HPC), Cloud Computing, MPI & Distributed Systems, DevOps & Automation

Skills:

MPI & Parallel Programming: Understanding of MPI libraries, cluster configuration, and typical HPC workflows.
Cloud Orchestration: Familiarity with OpenStack Heat or other Infrastructure-as-Code (IaC) tools for provisioning resources.
Linux System Administration: Experience configuring and troubleshooting packages, network settings, and performance optimizations.
Scripting & Automation: Ability to write scripts (e.g., Bash, Python) to automate setup and deployment steps.

Difficulty: Moderate to Hard

Size: Medium

Mentor: Ken Raffenetti

Tasks

Disk Images with MPI Dependencies
- Build base images with the correct versions of MPI (e.g., MPICH, OpenMPI) and any required libraries (e.g., GCC, network libraries).
- Ensure all packages are up to date and tested for compatibility with Chameleon’s bare metal and/or VM environments.
Cluster Setup Scripts
- Develop lightweight scripts or Ansible playbooks that join new instances into an MPI cluster, configuring hostnames, SSH keys, and MPI runtime settings.
- Validate cluster functionality by running simple distributed “Hello World” tests and more advanced benchmarks (e.g., Intel MPI Benchmarks).
Orchestration Template
- Provide a Heat template (or similar) specifying the network configuration, instance counts, and environment variables for MPI.
- Enable easy parameterization of cluster size, disk images, and other variables so users can customize their setups on the fly.
Integration & Testing
- Document best practices for launching and using the MPI images in Chameleon.
- Demonstrate reproducibility with multiple cluster sizes and workloads to ensure reliability.
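One small piece of the cluster setup scripts, generating the hostfile that tells the MPI launcher where to place processes, can be sketched as below. The Python example assumes Open MPI-style (`host slots=N`) and MPICH/Hydra-style (`host:N`) hostfile lines. The hostnames and slot counts are placeholders, and the function name is hypothetical.

```python
def render_hostfile(nodes, flavor="openmpi"):
    """Render an MPI hostfile from (hostname, slots) pairs.

    flavor="openmpi" writes "host slots=N" lines;
    flavor="mpich" writes "host:N" lines.
    """
    lines = []
    for hostname, slots in nodes:
        if slots < 1:
            raise ValueError(f"{hostname}: slots must be positive")
        if flavor == "openmpi":
            lines.append(f"{hostname} slots={slots}")
        elif flavor == "mpich":
            lines.append(f"{hostname}:{slots}")
        else:
            raise ValueError(f"unsupported flavor: {flavor}")
    return "\n".join(lines) + "\n"

# Placeholder cluster: two nodes with 48 slots each.
nodes = [("mpi-node-0", 48), ("mpi-node-1", 48)]
openmpi_text = render_hostfile(nodes, "openmpi")
mpich_text = render_hostfile(nodes, "mpich")
print(openmpi_text, end="")
print(mpich_text, end="")
```

Generating the file from the same node list that the orchestration template produces keeps the launcher configuration in step with the cluster size parameter.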

Smart Environments – An AI System for Reproducible Custom Computing Environments

Tue, 18 Feb 2025 00:00:00 +0000

Overview

The complexity of environment setup and the expertise required to configure specialized software stacks can often hinder efforts to reproduce important scientific achievements in HPC and systems studies. Researchers often struggle with incomplete or ambiguous artifact descriptions that make assumptions about “common knowledge” that is actually specific domain expertise. When trying to reproduce experiments, reviewers may spend excessive time debugging environment inconsistencies rather than evaluating the actual research. These challenges are compounded when experiments need to run on different hardware configurations.

This project seeks to address these fundamental reproducibility barriers by using AI to translate natural language environment requirements often used in papers or artifact descriptions into actionable, reproducible configurations—bridging the knowledge gap between experiment authors and reviewers while standardizing environment creation across different hardware platforms. We will develop an AI-driven system that automatically generates and configures reproducible computing environments based on artifact descriptions from conferences, Trovi artifacts on the Chameleon testbed, and other reliable sources for scientific experiment code and associated documentation. Leveraging Natural Language Processing (NLP), the system will allow researchers to describe desired environments in plain English, then map those descriptions onto predefined configuration templates. By simplifying environment creation and ensuring reproducibility, the system promises to eliminate duplicate setup efforts, accelerate research workflows, and promote consistent experimentation practices across diverse hardware.
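The mapping from a plain-English request to a structured recipe can be illustrated with a deliberately simple, rule-based sketch. The Python below uses a regular expression and a small hypothetical package catalog in place of the NLP component the project would actually build. It only shows the shape of the input and output.

```python
import re

# Hypothetical catalog of names the parser recognizes.
KNOWN_PACKAGES = ("python", "cuda", "scikit-learn", "openmpi")

def parse_requirements(text):
    """Extract {package: version or None} from a plain-English request."""
    recipe = {}
    lowered = text.lower()
    for name in KNOWN_PACKAGES:
        # Match the package name as a whole word, optionally followed by
        # a dotted version number such as "3.9" or "11".
        pattern = rf"(?<![\w-]){re.escape(name)}(?![\w-])(?:\s+(\d+(?:\.\d+)*))?"
        match = re.search(pattern, lowered)
        if match:
            recipe[name] = match.group(1)
    return recipe

recipe = parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn")
print(recipe)
```

A package mentioned without a version maps to `None`, which a later step could resolve to a default or flag for the user to confirm.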

Key Outcomes

Working Prototype: A system that automatically generates machine images deployable on bare metal and VM instances, based on user-provided requirements.
Comprehensive Documentation: Detailed user manuals, guides, and best practices tailored to researchers, ensuring a smooth adoption process.
Live Demo: A demonstration environment (e.g., a web app or Jupyter notebook) that shows how to request, configure, and launch reproducible cloud environments on both hardware profiles.
Long-Term Impact: Building blocks for future AI-driven automation of cloud infrastructure, reducing human error and enabling fast, repeatable research pipelines.

Topics: Reproducibility, AI & NLP, Cloud Computing, DevOps and Automation

Skills:

Machine Learning / AI: Familiarity with NLP methods to interpret user requirements.
Python: Primary language for backend services and cloud interactions.
Cloud API Integration: Experience with OpenStack or similar APIs to provision and configure images on both bare metal and virtual machines.
DevOps: Automated environment configuration, CI/CD workflows, and containerization.

Difficulty: Hard

Size: Large

Mentors: Paul Marshall

Tasks:

Requirement Gathering & NLP Design
- Research the specific needs of researchers building experimental setups.
- Design an NLP pipeline to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into environment “recipes.”
Backend Environment Builder
- Implement logic that converts parsed user requirements into machine-image definitions for bare metal and VM instances.
- Integrate with Chameleon’s APIs to provision servers, install software, and run configuration validation automatically.
Front-End & User Experience
- Develop an intuitive web or CLI interface that researchers can use to capture experiment environment requirements.
- Provide real-time status updates during environment setup, along with meaningful error messages and quick-start templates.
Testing & Validation
- Conduct end-to-end tests using diverse software stacks (e.g., HPC libraries, machine learning frameworks) on bare metal and VM instances.
- Ensure reproducibility by re-creating the same environment multiple times and comparing configurations.
Documentation & Demonstration
- Produce user-facing documentation, including tutorials and best practices for researchers who frequently run experiments on Chameleon Cloud.
- Create a short live demo or screencast showcasing how to configure an environment for a specific research workflow.
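
As a rough illustration of the "recipe" idea in the tasks above, the sketch below turns the example sentence into a package-to-version mapping with a regular expression. Everything here is an assumption made for illustration: the names `KNOWN_PACKAGES`, `parse_requirements`, and `render_recipe` are hypothetical, and a real pipeline would likely use a proper NLP model rather than a fixed vocabulary.

```python
import re

# Hypothetical vocabulary; a real system would need a much larger, curated map.
KNOWN_PACKAGES = {"python", "cuda", "scikit-learn", "numpy", "openmpi"}

# A name, optionally followed by a version such as "3.9" or "11".
TOKEN = re.compile(r"([A-Za-z][\w\-]*)(?:\s+v?(\d+(?:\.\d+)*))?")


def parse_requirements(text):
    """Extract {package: version or None} from a plain-English request."""
    recipe = {}
    for name, version in TOKEN.findall(text):
        key = name.lower()
        if key in KNOWN_PACKAGES:
            recipe[key] = version or None
    return recipe


def render_recipe(recipe):
    """Render the recipe in sorted order so two builds can be compared as text."""
    lines = []
    for name, version in sorted(recipe.items()):
        lines.append(f"{name}=={version}" if version else name)
    return "\n".join(lines)


recipe = parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn")
print(render_recipe(recipe))
```

Rendering the recipe in a fixed order is one simple way to support the "re-create and compare" check listed under Testing & Validation, since identical requirements then always produce identical text.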

Widgets for Python-chi in Jupyter

Tue, 18 Feb 2025 00:00:00 +0000

Overview

Reproducibility challenges in research extend beyond code and environments to the experimental workflow itself. When experiments involve dynamic resource allocation, monitoring, and reconfiguration, researchers often struggle to document these interactive steps in a way that others can precisely follow. The lack of structured workflow documentation and real-time feedback creates barriers for reviewers attempting to reproduce experiments, as they cannot easily verify whether their resource configurations match the original experiment’s state. This project addresses these challenges by developing interactive Jupyter widgets that make experiment resource management more visual, intuitive, and self-documenting—transforming ad-hoc command sequences into reproducible workflows that automatically log interactions and configuration changes while providing immediate visual feedback on experiment topology and resource states.

As cloud researchers often work with Jupyter Notebooks for interactive data analysis and experimentation, the python-chi library offers a powerful way to automate and control resources on Chameleon Cloud. This project will extend python-chi by adding interactive widgets specifically designed for use in Jupyter, empowering users to launch, monitor, and manage their experiments without leaving the notebook environment. By bringing visual and intuitive controls directly into the user’s workflow, we aim to improve both reproducibility and usability for complex resource management tasks.

Key Outcomes

User-Friendly Jupyter Widgets: Develop a suite of widgets to visualize reserved resources, hardware availability, and experiment topologies in real time.
Integrated Experiment Management: Enable researchers to orchestrate experiments (launch, configure, monitor) within a single, notebook-centric workflow.
Enhanced Feedback & Usability: Provide clear, asynchronous status updates and resource reconfiguration progress, reducing confusion and user error.
Improved Reproducibility: By automating and logging widget interactions, experiments become more traceable and easier to replicate.

Topics: Interactive Data Tools, Cloud Resource Management, DevOps & Automation, User Experience (UX)

Skills:

Python & Jupyter: Experience creating custom Jupyter widgets, using ipywidgets or similar frameworks.
Cloud Automation: Familiarity with how resources are provisioned, monitored, and deprovisioned on Chameleon.
Frontend / GUI Development: Basic understanding of web technologies (HTML/CSS/JavaScript) can be helpful for widget design.
Software Engineering & CI: Ability to version-control, test, and deploy Python packages.

Difficulty: Moderate

Size: Medium

Mentors: Michael Sherman, Mark Powers

Tasks:

Resource Visualization Widgets
- Build custom widgets that show reserved resources (nodes, networks, storage) in Jupyter.
- Provide an interactive topology view for experiments, indicating node statuses and connections.
Experiment Setup & Execution
- Add controls for launching and managing experiments directly from notebooks.
- Show feedback (e.g., progress bars, status messages) as resources are being allocated or reconfigured.
Hardware Availability & Status Tracking
- Implement a widget that provides real-time data on Chameleon’s hardware availability (bare metal, VMs, GPU nodes, etc.).
- Allow users to filter or select specific resources based on current hardware states.
Usability & Feedback Loop
- Gather user feedback on the widget designs and workflows.
- Refine the interface to minimize clicks, improve clarity, and reduce friction for common tasks.

Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges

Sat, 15 Feb 2025 00:00:00 +0000

Project Description

Topics: Distributed systems
Skills: Java, Go, Python, Bash scripting, Linux, Docker.
Difficulty: Hard
Size: Large (350 hours)
Mentors: Fadhil I. Kurnia

Replication is commonly employed to improve system availability and reduce latency. By maintaining multiple copies, the system can continue operating even if some replicas fail, thereby ensuring consistent availability. Placing replicas closer to users further decreases latency by minimizing the distance data must travel. A typical illustration of these advantages is a Content Delivery Network (CDN), where distributing content to edge servers can yield latencies of under 10 milliseconds when users and content are in the same city.

Recently, numerous edge datastores have emerged, allowing dynamic data to be served directly from network-edge replicas. Each of these replicated systems may employ different coordination protocols to synchronize replicas, leading to varied performance and consistency characteristics. For instance, Workers KV relies on a push-based coordination mechanism that provides eventual consistency, whereas Cloudflare Durable Objects and Turso deliver stronger consistency guarantees. Additionally, researchers have introduced various coordination protocols, such as SwiftPaxos, EPaxos, OPaxos, WPaxos, Raft, PANDO, and QuePaxa, each exhibiting its own performance profile, especially in geo-distributed deployments.

This project aims to develop an open testbed for evaluating replicated systems and their coordination protocols under edge deployment. Currently, researchers face challenges in fairly comparing different replicated systems, as they often lack control over replica placement. Many previous studies on coordination protocols and replicated systems relied on mock implementations, particularly for well-known systems like Dynamo and Spanner, which are not open source. An open testbed would provide a standardized environment where researchers can compare various replicated systems, classes of coordination protocols, and specific protocol implementations using common benchmarks. Since the performance of replicated systems and coordination protocols varies depending on the application, workload, and replica placement, this testbed would offer a more systematic and fair evaluation framework. Furthermore, by enabling easier testing and validation, the testbed could accelerate the adoption of research prototypes in the industry.

Project Deliverables

Compilation of traces and applications from various open traces and open benchmarks.
Distributed workload generator to run the traces and applications.
Test framework that simulates the latency of hundreds of edge servers for measurement.
Open artifact of the traces, applications, workload generator, and test framework, published on GitHub.
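
To make the latency-simulation deliverable more concrete, here is a minimal sketch of one way to build a latency matrix for a set of edge sites and pick the closest replica for a client. It is an assumption-laden illustration, not the testbed's design: the site list and all function names are hypothetical, and the round-trip time is only a lower bound derived from great-circle distance and the roughly 200 km per millisecond that light covers in optical fiber. Measured latency traces would be more realistic.

```python
import math

# Illustrative site coordinates (latitude, longitude in degrees); not from the project.
SITES = {
    "boston": (42.36, -71.06),
    "chicago": (41.88, -87.63),
    "san_francisco": (37.77, -122.42),
}

EARTH_RADIUS_KM = 6371.0
FIBER_KM_PER_MS = 200.0  # light covers roughly 200 km per millisecond in fiber


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def rtt_matrix(sites):
    """Lower-bound round-trip time in milliseconds for every ordered site pair."""
    return {
        (s, t): 2 * great_circle_km(sites[s], sites[t]) / FIBER_KM_PER_MS
        for s in sites
        for t in sites
    }


def nearest_replica(client, replicas, rtt):
    """Choose the replica with the smallest round-trip time from the client."""
    return min(replicas, key=lambda r: rtt[(client, r)])


rtt = rtt_matrix(SITES)
print(nearest_replica("boston", ["chicago", "san_francisco"], rtt))
```

A matrix of this shape can then drive an emulator that delays messages between replicas, which is where placement decisions start to show up in measured request latency.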

StatWrap

Sun, 09 Feb 2025 00:00:00 +0000

Additional information:

Project Search

Topics: search, user interface, indexing
Skills: JavaScript, React
Difficulty: Medium
Size: Large (350 hours)
Mentors: Luke Rasmussen, Eric Whitley

The goal of this project is to leverage the information entered by users and passively discovered by StatWrap to facilitate cross-project searching. This functionality will allow investigators to search across projects (current and past) to find relevant projects, assets, and notes. Given the potentially sensitive nature of data included in projects, the indexing of content for searching must be done locally.

The specific tasks of the project include:

Identify and evaluate open-source projects to index content for searching
Add a new classification for projects of “Active” and “Past” in the user interface
Implement the search capability within the user interface
Develop unit tests and conduct system testing

Final Blog: SS_Bench - Benchmarking SciStream

Fri, 31 Jan 2025 00:00:00 +0000

Introduction

Hello! My name is Acheme, and I’m thrilled to have collaborated with my mentors Joaquin Chung and Flavio Castro under the SciStream project. This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.

In the first half of the project, I focused on describing scientific streaming profiles based on use cases experienced at Argonne National Lab. I developed the Python scripts needed to generate bursty and constant-rate streaming traffic profiles.

In the second half, I built upon this foundation by conducting experiments with the traffic profiles and measuring performance in terms of latency, jitter, and throughput. These experiments were conducted with different message sizes across LAN and WAN network topologies.
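
The sketch below shows what constant-rate and bursty send schedules, and the jitter and throughput metrics, can look like in code. It is not taken from the SciStream-bench scripts: the function names and parameters are hypothetical, and jitter is computed here as the mean absolute difference between consecutive latencies, which is one common simple definition and may differ from the one used in the experiments.

```python
def constant_rate_schedule(rate_msgs_per_s, duration_s):
    """Send times (seconds) for evenly spaced messages."""
    interval = 1.0 / rate_msgs_per_s
    count = int(duration_s * rate_msgs_per_s)
    return [i * interval for i in range(count)]


def bursty_schedule(burst_size, burst_rate_msgs_per_s, period_s, duration_s):
    """Send times for bursts of burst_size messages, one burst every period_s."""
    gap = 1.0 / burst_rate_msgs_per_s
    times = []
    start = 0.0
    while start < duration_s:
        times.extend(start + i * gap for i in range(burst_size))
        start += period_s
    return times


def mean_jitter(latencies):
    """Mean absolute difference between consecutive latency samples."""
    diffs = [abs(b - a) for a, b in zip(latencies, latencies[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


def throughput_bps(message_bytes, n_messages, elapsed_s):
    """Achieved throughput in bits per second."""
    return 8 * message_bytes * n_messages / elapsed_s


constant = constant_rate_schedule(rate_msgs_per_s=10, duration_s=2)
bursty = bursty_schedule(burst_size=5, burst_rate_msgs_per_s=100,
                         period_s=1.0, duration_s=3.0)
print(len(constant), len(bursty))
```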

Key Achievements

Streaming Traffic Profile:
- Developed scripts to generate streaming traffic profiles with configurable parameters.
Created an Artifact:
- I created an artifact using a Jupyter notebook that documents an easy-to-follow integration of SciStream with the FABRIC testbed for future experimenters.

Conclusion and Future Work

The work demonstrated that SciStream adds tolerable overhead for secure data streaming and that experimentation with this middlebox is possible on publicly available testbeds like FABRIC. Future work could compare the performance of SciStream with and without hardware acceleration or offloading.

Deliverables

SciStream on FABRIC Demo: A demo of how to integrate SciStream on the FABRIC testbed: SciStream on FABRIC.
Jupyter Notebook: An artifact on the FABRIC portal: FABRIC Artifact.

Final Report: Deriving Realistic Performance Benchmarks for Python Interpreters

Tue, 12 Nov 2024 00:00:00 +0000

Hi, I am Mrigank. As a Summer of Reproducibility 2024 fellow, I have been working on deriving realistic performance benchmarks for Python interpreters with Ben Greenman from the University of Utah. In particular, we want to benchmark Meta’s Static Python interpreter (which is a part of their Cinder project) and compare its performance with CPython on different levels of typing. In this post, I will share updates on my work since my last update. This post forms my final report for the Summer of Reproducibility 2024.

Since Last Time: Typing Django Files

Based on the profiling results from load testing a Wagtail blog site, I identified three modules in Django that were performance bottlenecks and added shallow types to them. These are available on our GitHub repository.

I also wrote a script to mix untyped, shallow-typed, and advanced-typed versions of a Python module and create a series of such gradually typed versions.
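
A minimal sketch of the mixing idea follows, assuming each function in a module exists in three variants and that one variant per function is combined into a new module. The variant sources, the `VARIANTS` table, and `gradual_versions` are hypothetical examples written for this illustration; the actual script in the repository may work at a different granularity.

```python
from itertools import product

LEVELS = ("untyped", "shallow", "advanced")

# Hypothetical module with two functions, each available at three typing levels.
VARIANTS = {
    "parse": {
        "untyped": "def parse(s):\n    return int(s)\n",
        "shallow": "def parse(s: str) -> int:\n    return int(s)\n",
        "advanced": "def parse(s: str) -> int:\n    n: int = int(s)\n    return n\n",
    },
    "total": {
        "untyped": "def total(xs):\n    return sum(xs)\n",
        "shallow": "def total(xs: list) -> int:\n    return sum(xs)\n",
        "advanced": "def total(xs: list[int]) -> int:\n    return sum(xs)\n",
    },
}


def gradual_versions(variants):
    """Yield (configuration, module source) for every mix of typing levels."""
    names = sorted(variants)
    for choice in product(LEVELS, repeat=len(names)):
        config = dict(zip(names, choice))
        source = "\n".join(variants[name][config[name]] for name in names)
        yield config, source


versions = list(gradual_versions(VARIANTS))
print(len(versions))  # 3 levels for each of 2 functions
```

The number of mixes grows as 3 to the power of the number of units, so for larger modules it is common to sample configurations rather than enumerate all of them.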

Summary of Experience and Contributions

I tried to set up different versions of Zulip to make them work with Static Python. My setup scripts are available in our repository. Unfortunately, Zulip’s Zerver did not run with Static Python due to the incompatibility of some Django modules. A few non-Django modules also initially threw errors when run with Static Python due to a bug in Cinder, but I was able to work around it with a hack (which I have described in the linked GitHub issue I opened on Cinder’s repository).
I created a Locust version of the small Django-related benchmarks available in pyperformance and skybison. This helped me confirm that Django is by itself compatible with Static Python, and helped me get started with Locust. This too is available in our repository.
As described in the midterm report, I created a complete pipeline with Locust to simulate real-world load on a Wagtail blog site. The instructions and scripts for running these load tests as well as profiling the Django codebase are available (like everything else!) in our repository.
We added shallow types to the three Django modules mentioned above, and I created scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python module to create a series of gradually typed versions to be tested for performance. We found that advanced-typed code may often be structurally incompatible with shallow-typed code and are looking for a solution for this. We are tracking some examples of this in a GitHub issue.

Going Forward

I had a great time exploring Static Python, typing in Python, load testing, and all other aspects of this project. I was also fortunate to have a helpful mentor along with other amazing team members in the group. During this project, we hit several roadblocks like the challenges in setting up real-world applications with Static Python and the difficulty in adding advanced types – but are managing to work around them. I will be continuing to work on this project until we have a complete set of benchmarks and a comprehensive report on the performance of Static Python.

Our work will continue to be open-sourced and available on our GitHub repository for anyone interested in following along or contributing.

[Final Report] Automated Reproducibility Checklist support within StatWrap

Sat, 02 Nov 2024 00:00:00 +0000

Namaste🙏🏻! I’m Adi Akhilesh Singh, and I’m excited to share my final updates on the Reproducibility Checklists project by StatWrap, under the mentorship of Luke Rasmussen.

Project Overview

This project introduces customizable reproducibility checklists in StatWrap, enabling metadata-driven and user-guided generation of checklists. The goal is to enhance the reproducibility of research projects by providing researchers with a structured and comprehensive checklist to ensure their work is reproducible.

Project Links

Explore the StatWrap project repository and my contributions during GSoC ‘24:

Progress And Achievements

During the timeline of this project, I worked on designing the interface for the checklist page and the data structure to support the project needs.

The interface was designed with user needs in mind, featuring components such as:

URLs component to manage external links or file URIs, attached to the project.
Images component to display project image files.
Checklist Notes component to manage user-added notes.

All these assets (Files, URLs, Images) can be added to each checklist statement using the existing assets and external resources (URLs) present in the project.

Additionally, for each checklist item, StatWrap runs relevant scans to provide meaningful data based on its requirements. For example, for the item, “All the software dependencies for the project are documented,” StatWrap scans project files to list the languages and dependencies detected. For each checklist statement supported in StatWrap, we implement methods to retrieve specific information by scanning project data. StatWrap currently supports six such checklist statements identified as foundational for ensuring research reproducibility. Additionally, the checklist can be exported as a PDF summary, generated by StatWrap using the checklist data, with options to include notes.

Future Prospects

As the project concludes, several areas for growth have emerged:

Expanding language support within StatWrap. While StatWrap already includes key languages used in research, there is always scope to extend compatibility to cover even more technologies.
Options to export a data-extensive report that includes checklist items and their associated scan results.

These and other enhancements, like adding new checklist statements with their scanning methods, will extend StatWrap’s impact on reproducibility in research.

Earlier Blogs

If you’re interested in seeing the project’s evolution, check out my earlier posts:

Thank you for reading!

ML-Powered Problem Detection in Chameleon

Fri, 18 Oct 2024 00:00:00 +0000

Hello! My name is Syed Mohammad Qasim, a PhD candidate at the Department of Electrical and Computer Engineering, Boston University. This summer I worked on the project ML-Powered Problem Detection in Chameleon as part of the Summer of Reproducibility (SoR) program with the mentorship of Ayse Coskun and Michael Sherman.

Chameleon is an open testbed that has supported over 5,000 users working on more than 500 projects. It provides access to over 538 bare metal nodes across various sites, offering approximately 15,000 CPU cores and 5 petabytes of storage. Each site runs independent OpenStack services to deliver its offerings. Currently, Chameleon Cloud comprehensively monitors the sites at the Texas Advanced Computing Center (TACC) and the University of Chicago. Metrics are collected using Prometheus at each site and fed into a central Mimir cluster. All logs are sent to a central Loki, with Grafana used for visualization and alerting. Chameleon currently collects around 3,000 metrics. Manually reviewing and setting alerts for them is time-consuming and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service to augment the existing alerting framework.

Over the summer, we focused on analyzing the data and identified 33 key metrics, after discussions with Chameleon operators, from the Prometheus Node Exporter that serve as leading indicators of resource usage on the nodes. For example:

CPU usage: Metrics like node_load1, node_load5, and node_load15.
Memory usage: Including buffer utilization.
Disk usage: Metrics for I/O time, and read/write byte rates.
Network activity: Rate of bytes received and transmitted.
Filesystem metrics: Such as inode_utilization_ratio and node_procs_blocked.
System-level metrics: Including node forks, context switches, and interrupts.

Collected every 5 minutes, these metrics provide a comprehensive view of node performance and resource consumption. After finalizing the metrics we wanted to monitor, we selected the following four anomaly detection methods, primarily due to their popularity in academia and recent publication in high-impact conferences such as SIGKDD and SC.

OmniAnomaly, [KDD 2019] [without POT selection as it requires labels.]
USAD, [KDD 2020]
TranAD, [KDD 2022]
Prodigy, [SC 2023] [Only the VAE, not using their feature selection as it requires labels.]

We collected 75 days of healthy data from Chameleon, and after applying min-max scaling, we trained the models. We then used these models to run inference on the metrics collected during outages, as marked by Chameleon operators. The goal was to determine whether the outage data revealed something interesting or anomalous. We can verify our approach by manually reviewing the results generated by these four anomaly detection methods. Below are the results from the four methods on different outages, followed by an example of how these methods identified the root cause of an anomaly.
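
The preprocessing and evaluation flow described above can be sketched as follows: fit min-max bounds on healthy data only, apply them to outage data, and report the fraction of outage samples flagged. The anomaly score used here is a deliberately simple stand-in (how far a scaled metric falls outside the healthy range) and is not OmniAnomaly, USAD, TranAD, or Prodigy. The data, threshold, and function names are made up for illustration.

```python
def fit_min_max(rows):
    """Per-metric minimum and maximum, computed on healthy data only."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def scale(rows, lo, hi):
    """Apply min-max scaling using bounds learned from healthy data."""
    return [
        [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def anomaly_score(row):
    """Stand-in score: largest excursion of any metric outside [0, 1]."""
    return max(max(0.0, x - 1.0, -x) for x in row)


def fraction_flagged(rows, threshold):
    """Share of samples whose score exceeds the threshold."""
    flags = [anomaly_score(row) > threshold for row in rows]
    return sum(flags) / len(flags)


# Made-up samples with two metrics each (for example, load and bytes received).
healthy = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
outage = [[2.0, 150.0], [6.0, 250.0], [2.5, 900.0], [1.5, 120.0]]

lo, hi = fit_min_max(healthy)
scaled_outage = scale(outage, lo, hi)
print(fraction_flagged(scaled_outage, threshold=0.1))
```

Fitting the bounds on healthy data alone matters: if outage samples were included when computing the minimum and maximum, the unusual values would be squeezed back into the normal range and become harder to detect.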

The above figure shows the percentage of outage data that was flagged as anomalous by different models.

The above two plots show two examples of the top 5 metrics that contributed to the anomaly score for each anomaly detection model.

Although the methods seem to indicate anomalies during outages, they are not able to pinpoint the affected service or the exact cause. For example, the first partial authentication outage was due to a DNS error, which can manifest in various ways, such as reduced CPU, memory, or network usage. This work is still in progress, and we are conducting the same analysis on container-level metrics for each service, allowing us to narrow the scope to the affected service and more effectively identify the root cause of anomalies. We will share the next set of results soon.

Thanks for your time, please feel free to reach out to me for any details or questions.

Data Leakage in Applied ML: model uses features that are not legitimate

Tue, 24 Sep 2024 00:00:00 +0000

Hello everyone!

I have been working on reproducing the results from Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches. This study aimed to distinguish COVID-19 cases from normal and pneumonia cases using chest X-ray images. Since my last blog post, we have successfully reproduced the results using the VGG19 model, achieving a 92% accuracy on the test set. However, a significant demographic inconsistency exists: normal and pneumonia chest X-ray images were from pediatric patients, while COVID-19 chest X-ray images were from adults. This allowed the model to achieve high accuracy by learning features that were not clinically relevant.

In Reproducing “Identification of COVID-19 samples from chest X-Ray images using deep learning: A comparison of transfer learning approaches” without Data Leakage, we followed the methodology outlined in the paper, but with a key change: we used datasets containing adult chest X-ray images. This time, the model achieved an accuracy of 51%, a drop of 41 percentage points from the earlier results, confirming that the metrics reported in the paper were overly optimistic due to data leakage, where the model learned illegitimate features.

To further illustrate this issue, we created a toy example demonstrating how a model can learn illegitimate features. Using a small dataset of wolf and husky images, the model achieved an accuracy of 90%. We then revealed that this performance was due to a data leakage issue: all wolf images had snowy backgrounds, while husky images had grassy backgrounds. When we trained the model on a dataset where both wolf and husky images had white backgrounds, the accuracy dropped to 70%. This shows that the accuracy obtained earlier was an overly optimistic measure due to data leakage.
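
The same effect can be reproduced on a tiny synthetic dataset. In the sketch below, each sample has a background value and an animal feature, and a decision stump (a one-feature threshold classifier) stands in for the image model. All numbers and names are invented for illustration and are unrelated to the 90% and 70% figures above.

```python
def predict(stump, sample):
    feature, threshold, above_label = stump
    return above_label if sample[feature] >= threshold else 1 - above_label


def accuracy(stump, samples, labels):
    hits = sum(predict(stump, s) == y for s, y in zip(samples, labels))
    return hits / len(labels)


def train_stump(samples, labels):
    """Pick the single feature and threshold with the best training accuracy."""
    candidates = [
        (f, t, above)
        for f in range(len(samples[0]))
        for t in sorted({s[f] for s in samples})
        for above in (0, 1)
    ]
    return max(candidates, key=lambda c: accuracy(c, samples, labels))


# Samples are (background, animal_feature); label 1 = wolf, 0 = husky.
wolf_feature = [0.6, 0.7, 0.4, 0.8, 0.55]
husky_feature = [0.5, 0.3, 0.6, 0.45, 0.35]
train_labels = [1] * 5 + [0] * 5

# Leaky training set: every wolf on snow (0.9), every husky on grass (0.2).
leaky_train = [(0.9, x) for x in wolf_feature] + [(0.2, x) for x in husky_feature]
# Clean training set: identical backgrounds, so only the animal feature helps.
clean_train = [(0.9, x) for x in wolf_feature + husky_feature]

# Test set where the background carries no information about the label.
test = [(0.9, x) for x in [0.65, 0.75, 0.5, 0.7, 0.4, 0.3, 0.55, 0.45]]
test_labels = [1] * 4 + [0] * 4

leaky_stump = train_stump(leaky_train, train_labels)
clean_stump = train_stump(clean_train, train_labels)

print("leaky model, train:", accuracy(leaky_stump, leaky_train, train_labels))
print("leaky model, test: ", accuracy(leaky_stump, test, test_labels))
print("clean model, test: ", accuracy(clean_stump, test, test_labels))
```

The model trained on the leaky set latches onto the background because it separates the training labels perfectly, then falls to chance once the background stops varying with the class. The model trained without the shortcut scores lower on its own training data but generalizes better.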

You can explore our work on the COVID-19 paper here.

Lastly, I would like to thank Fraida Fund and Mohamed Saeed for their support and guidance throughout my SoR journey.

Towards Scalable Performance Benchmarking of Genomics Workflows

Thu, 19 Sep 2024 00:00:00 +0000

Project Background

Optimizing the execution of genomics workflows on a large-scale, heterogeneous cluster requires an in-depth understanding of the resource requirements and utilization patterns of each application in the workflows. Such information can be obtained with a benchmarking tool. However, the performance data generated by such a tool should represent the scale of its target system, lest the design decisions based on it be misguided. My project aims to build GenScale, the first benchmarking tool that can rapidly generate genomics workload performance data at a scale representative of production systems.

As Summer of Reproducibility (SoR) 2024 comes to an end, I took the time to reflect on my time working on GenScale, the challenges I faced, and the future work and impact I hope GenScale will create for our community.

Milestones & Challenges

The time I spent working on GenScale during SoR can be classified into three phases:

1. Per-Application Container & Input Creation.

Containerization is the current de facto standard for genomics workflow execution, so I designed GenScale to execute applications as containers. This required me to package each application included in the benchmark as a container. I used state-of-the-art DNA-Seq & RNA-Seq alignment workflows as references for the list of applications & workflow structure. The container images & source files I created are publicly available on GitHub (Deliverables #1).

I also prepared sample inputs for each application to ease the burden on users who are not sufficiently familiar with genomics applications. The effort is not trivial, because in a workflow, the inputs for a certain step depend on the outputs of the previous step(s). Simply put, to prepare inputs for the last application in a workflow, we need the outputs of the applications executed before it, which in turn require the outputs of another set of applications, and so on until we arrive at the beginning of the workflow. This translates into significant manual labor: carefully tracing & collecting intermediate files from each step of the reference workflows.
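
The dependency tracing described above amounts to walking a workflow graph backwards from the target step and then running the required steps in topological order. The sketch below shows that idea using Python's standard `graphlib` module (Python 3.9+). The `DEPENDS_ON` graph is purely illustrative and does not claim to match the reference workflows, and the helper names are hypothetical.

```python
from graphlib import TopologicalSorter

# Illustrative graph: each step maps to the steps whose outputs it consumes.
DEPENDS_ON = {
    "FastQC": set(),
    "Trimmomatic": set(),
    "BWA": {"Trimmomatic"},
    "Picard": {"BWA"},
    "GATK": {"Picard"},
}


def steps_needed_for(target, depends_on):
    """All upstream steps whose outputs are (transitively) required by target."""
    needed, stack = set(), [target]
    while stack:
        step = stack.pop()
        for dep in depends_on[step]:
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed


def run_order_for(target, depends_on):
    """Order in which steps must run to produce the inputs of target."""
    needed = steps_needed_for(target, depends_on) | {target}
    subgraph = {step: depends_on[step] for step in needed}
    return list(TopologicalSorter(subgraph).static_order())


print(run_order_for("GATK", DEPENDS_ON))
```

Collecting the intermediate files after each step in this order is what yields a ready-made input set for every application, so that users can benchmark a late step without running the whole workflow first.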

All inputs are hosted in a public Google Drive and ChameleonCloud object store (Deliverables #2). In total, I prepared containers and inputs for 7 popular genomics applications: BWA, FastQC, Fastq Cleaner, GATK, Picard, STAR, and Trimmomatic.


Figure 1. Production-grade software used in GenScale: Kubernetes for task orchestration, and Prometheus + Grafana for real-time resource monitoring.

2. Components Development.

In this phase, GenScale’s main components were developed. GenScale consists of three components: (a) Workflow Manager, (b) Task Orchestrator, and (c) Resource Monitor. The Workflow Manager is built from scratch to allow a high degree of freedom when scheduling workflows. I use industry-grade solutions for the other components, namely Kubernetes for orchestrating tasks / containers, and Prometheus + Grafana for real-time resource monitoring. My deliverables include semi-automatic installation scripts & easy-to-follow instructions to set up all three components. (Deliverables #3)

3. Performance Data Generation.

The last phase was to use the GenScale prototype to generate performance data for each application. I focused on collecting data for three types of resources: compute (CPU utilization), memory (resident set size), and I/O (read & write operations over time). GenScale exports this information into a single CSV file to facilitate easy analysis. My deliverables include performance data for DNA-Seq and RNA-Seq workflows. I also provide a sample Python Notebook which analyzes the CPU utilization pattern of each application in the DNA-Seq workflow. (Deliverables #4)


Figure 2. CPU utilization pattern of 9 applications in DNA-Seq Alignment workflow collected by GenScale. y-axis: (num. cores) x 100%, x-axis: time elapsed in seconds.

Deliverables

This project’s deliverables can be found in the following GitHub repo: https://github.com/martinluttap/sor24-genscale/tree/main. In summary, the deliverables include:

Container Images
Input Dataset
Source Code
Performance Data & Sample Analysis Notebook

Future Works, Broader Impacts

Understanding workload characteristics is a crucial step for designing efficient scheduling policies & resource management techniques. GenScale and the performance data it can generate might be a starting point for such efforts. Furthermore, I hope GenScale will catalyze meaningful engagement between the computer systems community and the bioinformatics community. I believe state-of-the-art systems techniques can greatly aid computing efforts in the bioinformatics community. Similarly, domain-specific knowledge & problems within bioinformatics provide unique grounds for the systems community to further advance their field.

[Final] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Wed, 18 Sep 2024 00:00:00 +0000

Hello everyone,

I’m excited to share the final progress and insights from ScaleRep, my SoR 2024 project under the mentorship of Bogdan "Bo" Stoica and Yang Wang. The project aimed to tackle the reproducibility challenges posed by scalability bugs in large-scale distributed systems. Below is a detailed summary of our investigations and findings.

Project Overview

As you may recall, our project, ScaleRep, aimed to tackle the challenge of scalability bugs—those insidious issues that often arise in large-scale distributed systems under heavy workloads. These bugs, when triggered, can lead to significant system issues such as downtime, performance bottlenecks, and even data loss. They are particularly difficult to catch using traditional testing methods.

Our primary focus was on reproducing these bugs, documenting the challenges involved, and providing insights into how these bugs manifest under various conditions. This documentation will help researchers identify, benchmark, and resolve similar issues in the future.

Progress

Since the midterm update, several Apache Ignite bugs have been investigated, some of which have been successfully reproduced and uploaded to Trovi for the research community to access and reuse. Below is the progress on the bugs investigated:

Bugs Investigated

Key Insights & Challenges

Complexity of Scalability Bugs Many scalability bugs involve subtle and complex interactions that are not easily detected in standard testing environments. For instance, IGNITE-20602 only manifested under certain high-load conditions and required a specific workload and environment to reliably trigger the issue. This highlights the importance of large-scale testing when investigating scalability issues.
Dependency and Documentation Gaps We encountered significant challenges with outdated dependencies and incomplete documentation, particularly in older bugs like IGNITE-16072. In these cases, reproducing the bug required extensive modifications or wasn’t feasible without investing disproportionate effort in updating dependencies.
Effectiveness of Trovi and Chameleon Packaging and sharing our reproducible investigations through Trovi and Chameleon have proven highly effective. By providing researchers with pre-configured environments and detailed documentation, we’ve laid the groundwork for future collaboration and further research on these bugs. We expect this to greatly benefit others attempting to reproduce similar issues.
Impact of Speed-Based Throttling Our investigation into IGNITE-16600 revealed several important insights into speed-based throttling and its impact on system performance under high-load conditions. By analyzing the checkpoint starvation and thread throttling mechanisms, we were able to identify areas for improvement in the latest Ignite releases.

Next Steps

Expanding Collaboration: The packaged bugs and replayable Trovi experiments will be made available to the broader research community, encouraging further investigation and enhancements to large-scale distributed systems.

The ScaleRep project has been an exciting journey into the world of scalability bugs, pushing the boundaries of what’s possible in terms of reproducibility and benchmarking. Through this project, we’ve demonstrated the importance of rigorous testing and comprehensive documentation in improving the reliability of distributed systems.

Final Blog: Enhancing User Experience Reproducibility through TROVI Redesign

Wed, 18 Sep 2024 00:00:00 +0000

Hello! My name is Alicia Esquivel Morel, and I’m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I worked on a project to improve user experience reproducibility through a redesign of TROVI, as part of the Summer of Reproducibility (SoR) program.

Even before starting this project, as a rising researcher, I saw reproducibility as one of the biggest challenges in research. What I always wanted was reproducibility in practice: being able to consistently replicate experiments and share them in a way that others can follow.

TROVI is a platform designed to help with this. However, when I joined the project, I knew it had room for improvement, not only in the user interface but also in the ease of integrating code and data.

This project aimed to address these challenges by redesigning TROVI to streamline experiment replication, making the platform more intuitive and accessible. The goal was simple: create a user-friendly experience that eliminates confusion and frustration, allowing researchers to focus on their work instead of the technical aspects of running a research experiment.

Our goals at the beginning of the summer:

We wanted to simplify TROVI’s interface for intuitive navigation, inspired by platforms like Google Colab.
We wanted to make uploading and sharing code and data easier, with seamless integration with tools like GitHub.
We wanted to create a mechanism for users to provide feedback, allowing TROVI to evolve based on real user needs.

Progress and what we have achieved

I started by conducting thorough UX research and a literature review on reproducibility platforms, establishing a solid foundation for the redesign. With user feedback guiding the process, I created wireframes and low-fidelity prototypes, focusing on making the platform more intuitive.

As the project progressed, I built a higher-fidelity prototype that connected various components of the platform, ensuring a seamless user journey. I then tackled the back-end integration, which tied together the front-end flows with TROVI’s API.

Throughout this project, I received valuable support and guidance from my mentors. Mark Powers walked me through TROVI’s architecture and helped me understand exactly what was needed for a successful redesign. Thanks to his mentorship, I not only completed the project but learned a great deal along the way. Thanks Mark Powers!!

Through iterations and feedback from initial user testing, and with the help of Kate Keahey, I refined the design to ensure it met the needs of the research community. By the end of the program, TROVI had evolved into a cohesive, user-friendly platform that leads to enhanced experiment reproducibility.

Accomplishments

A simplified interface that makes navigating, uploading, and collaborating much easier.
GitHub integration that streamlines the process of sharing code and data with collaborators.
A built-in feedback loop that enables TROVI to grow with its users, adapting to their needs as they arise.

The platform is also getting ready to move into production and will soon be available for the research community.

What’s Next?

While the core objectives have been successfully met, future improvements could further enhance the platform’s capabilities, such as additional integrations and more advanced collaboration features. User testing will continue to provide insights for ongoing development.

I’m grateful for this opportunity! Thank you for following along!

[MidTerm] StatWrap: Automated Reproducibility Checklists Generation

Mon, 16 Sep 2024 00:00:00 +0000

Namaste🙏🏻! I’m Adi Akhilesh Singh, and I’m excited to share progress updates on the Reproducibility Checklists project by StatWrap, under the mentorship of Luke Rasmussen.

Project Overview

The project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.

Progress

Over the past few months, my mentors and I have worked on developing the interface for the checklists page and designed key components to support our project goals. We’ve implemented logic that iterates over each checklist item, displaying its statement along with Boolean controls (Yes/No buttons) for user interaction.

We’ve also developed components to display attached images and URLs linked to each checklist item. Additionally, we’ve integrated a notes feature that allows users to add, edit, and view project-related notes. Currently, we are writing methods to integrate real-time project data into the checklists. For example, one method we’ve implemented scans project files (assets) to detect the languages used.
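To give a feel for the language-detection scan mentioned above, here is a minimal sketch that maps file extensions to languages. It is written in Python purely for illustration and is not StatWrap's implementation; the extension table and the `detect_languages` function are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical extension table; a real tool would cover many more languages.
EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".r": "R",
    ".do": "Stata",
    ".sas": "SAS",
    ".js": "JavaScript",
}

def detect_languages(asset_paths):
    """Count assets per language, ignoring files with unknown extensions."""
    counts = Counter()
    for path in asset_paths:
        suffix = PurePosixPath(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(suffix)
        if language is not None:
            counts[language] += 1
    return dict(counts)

assets = ["analysis/model.R", "analysis/clean.py", "plots/fig1.py", "README.md"]
print(detect_languages(assets))
```

A result like this could pre-fill a checklist item such as "list the programming languages used", leaving the user to confirm or correct it.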

What’s Next?

As we move closer to the final evaluation phase, our focus will be on the following objectives:

Implement methods for each checklist item, integrating real-time data from the project data to auto-populate checklist answers.
Enhance the Attached Images component to allow users to select and attach existing image assets from the project.
Display the results of the scans for each checklist item, providing users with detailed outputs based on the automated analysis.

Stay tuned for further updates as we continue developing this feature set! 🚀

Final Post: Enhancing Reproducibility and Portability in Network Experiments

Thu, 05 Sep 2024 00:00:00 +0000

Introduction

As my project with the Summer of Reproducibility (SoR) 2024 comes to a close, I’d like to reflect on the journey and the outcomes achieved. My project focused on enhancing the reproducibility and portability of network experiments by integrating the RO-Crate standard into the TUM-internal testbed pos (plain orchestrating service), and deploying this testbed on the Chameleon cloud infrastructure. The aim was to ensure that experiments conducted on one platform could be seamlessly reproduced on another, adhering to the FAIR principles (Findable, Accessible, Interoperable, Reusable) for research data.

Project Recap

The core goal was to make the experiments reproducible and portable between different testbeds like TUM’s pos and Chameleon. To achieve this, I integrated the RO-Crate standard, which ensures that all experiment data is automatically documented and stored with metadata, making it easier for others and especially for machines to understand, replicate, and build on the results. Additionally, deploying a lightweight version of pos on the Chameleon testbed enabled cross-testbed execution, allowing experiments to be replicated across both environments without significant modifications.

Key Achievements

Over the course of the project, several key milestones were achieved:

RO-Crate Integration: The first step was restructuring the results folder and automating the generation of metadata using RO-Crate. This ensured that all experiment data was documented with details like author information, hardware configurations, and experiment scripts, resulting in a comprehensive ro-crate-metadata.json file as an important part of each result folder.
Improved Data Management: The integration of RO-Crate greatly simplified the process of organizing and retrieving experiment data and metadata with information about the experiment and the result files. All metadata was automatically generated, making it easier to share and document the experiments for other researchers to replicate.
Automatic Upload to Zenodo: Another crucial achievement was the implementation of automatic uploading of pos experiment result folders to Zenodo, an open-access repository. This step significantly improved the reproducibility and sharing of experiment results, making them easily accessible to the broader scientific community. By utilizing Zenodo, we ensured that experiment results, along with their RO-Crate metadata, could be archived and referenced, fostering greater transparency and collaboration in scientific research.
Chameleon Deployment: Deploying the pos testbed within the Chameleon environment required managing various complexities, particularly related to Chameleon’s OpenStack API, networking setup, and hardware configurations. Coordinating the network components and infrastructure to support pos functionality in this testbed environment demanded significant adjustments to ensure smooth integration and operation.
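For readers unfamiliar with RO-Crate, the sketch below builds a minimal ro-crate-metadata.json document by hand. It follows the general shape of RO-Crate 1.1 (a metadata descriptor pointing at a root dataset) but is not the pos integration itself: the function name, the example author, and the file names are hypothetical, and fields the specification recommends for the root dataset (such as description, datePublished, and license) are left out for brevity.

```python
import json

def build_ro_crate_metadata(name, author, files):
    """Return a minimal RO-Crate 1.1 style metadata document as a dict."""
    author_id = "#" + author.lower().replace(" ", "-")
    graph = [
        {
            # Metadata descriptor: identifies this file and the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset: the result folder itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "author": {"@id": author_id},
            "hasPart": [{"@id": path} for path in files],
        },
        {"@id": author_id, "@type": "Person", "name": author},
    ]
    graph += [{"@id": path, "@type": "File"} for path in files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}

metadata = build_ro_crate_metadata(
    "Example experiment", "Jane Doe", ["experiment.sh", "results/run0.csv"]
)
print(json.dumps(metadata, indent=2))
```

Because the document is plain JSON-LD, both people and tools can read which files belong to an experiment and who produced them.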

Challenges

Like any project, this one came with its own set of challenges:

Balancing Automation and Flexibility: While automating the generation of RO-Crate metadata, it was crucial to ensure that the flexibility required by researchers for customizing their documentation was not compromised. Finding this balance required in-depth adjustments to the testbed infrastructure.
Complexity of Testbed Systems: Integrating RO-Crate into a complex system like pos, and ensuring it works seamlessly with Chameleon, involved understanding and adapting to the complexities of both testbeds.

Future Directions

As I move forward with my master’s thesis working on these challenges, we plan to expand on this work by:

Extending the Chameleon Deployment: We aim to deploy the full version of pos on Chameleon, supporting more complex and larger-scale experiments.
Supporting Complex Experiment Workflows: Future work will focus on handling more intricate and larger datasets, ensuring reproducibility for complex workflows. Only by executing more complex experiments will we be able to thoroughly analyze and compare the differences between executions in pos and the pos deployed on Chameleon, helping us better understand the impact of different testbed environments on experiment outcomes.
Automation: The ultimate goal is to fully automate the process of experiment execution, result documentation, and sharing across testbeds, reducing manual intervention and further enhancing reproducibility.

Reflections

By integrating the RO-Crate standard and deploying pos on the Chameleon testbed, we have made significant steps toward enhancing the reproducibility, accessibility, and portability of network experiments across research platforms. These efforts contribute to more shareable and replicable research processes in the scientific community.

I am excited about the future work ahead and am grateful for the mentorship and support I received during this project.

Deliverables and Availability

Due to the current non-public status of the pos framework, the code and deliverables are not publicly available at the moment.

Previous Blogs

Make sure to check out my other blogs to see how I started this project and the challenges I faced along the way:

Servus!

Understanding Data Leakage in Machine Learning: A Focus on TF-IDF

Thu, 05 Sep 2024 00:00:00 +0000

Hello again!

This is my final blog post, and I will be discussing the second material I created for the 2024 Summer of Reproducibility Fellowship. As you may recall from my first post, I am working on the Exploring Data Leakage in Applied ML: Reproducing Examples of Irreproducibility project with Fraida Fund and Mohamed Saeed as my mentors.

This blog post will explore how data leakage can occur during feature extraction, particularly with the commonly used TF-IDF vectorizer, and its impact on model generalization.

Introduction

In machine learning, data leakage is a critical issue that can severely impact model performance. It occurs when information from outside the training dataset is improperly used to create the model, leading to overly optimistic performance during evaluation. One common source of leakage comes from how features, such as those extracted using TF-IDF (Term Frequency-Inverse Document Frequency), are handled. In this post, we’ll explore how data leakage can happen during feature extraction with TF-IDF and how it affects model accuracy.

What is TF-IDF?

TF-IDF is a method used to evaluate how important a word is in a document relative to a collection of documents. It consists of two components:

Term Frequency (TF): Measures how frequently a term appears in a document.
Inverse Document Frequency (IDF): Reduces the importance of terms that appear frequently across many documents.

Together, they provide a weighted value for each word, reflecting its importance relative to the dataset.

How Data Leakage Occurs with TF-IDF

Data leakage with TF-IDF happens when the inverse document frequency (IDF) is calculated using the entire dataset (including the test set) before splitting it into training and test sets. This means the model has access to information from the test set during training, leading to artificially inflated results. This is a subtle form of data leakage, as it often goes unnoticed.

For example, when calculating the TF-IDF score, if the word “banana” appears more frequently in the test set but is considered during training, the model downplays its significance. As a result, the model may fail to predict correctly when “banana” is important in the test data.
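The effect on the IDF value can be checked with a few lines of plain Python. The sketch below uses the smoothed formula ln((1 + n) / (1 + df)) + 1, which is what scikit-learn's TfidfVectorizer applies by default; the tiny corpus and the `idf` helper are made up for illustration.

```python
import math

def idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log((1 + n) / (1 + df)) + 1

# "banana" is rare in the training split but common in the test split.
train_docs = ["I like apples", "banana bread recipe", "oranges are sweet", "grapes and pears"]
test_docs = ["banana smoothie", "ripe banana", "banana split", "green apples"]

idf_train_only = idf("banana", train_docs)
idf_leaky = idf("banana", train_docs + test_docs)
print(f"train only: {idf_train_only:.3f}, with test set: {idf_leaky:.3f}")
```

Including the test documents lowers the IDF of "banana", so the weight the model sees during training already reflects how common the word is in data it was never supposed to see.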

Why Does This Matter?

If the test data is included when calculating the IDF, the model gains unintended insight into the test set’s word distribution. In real-world scenarios, the test data is supposed to be unseen during training. By allowing the model to see this information, you’re essentially reducing the uncertainty that the model should have about future data.

Impact of Data Leakage on Model Performance

Let’s consider two cases to understand the impact of data leakage in detail:

When a word is rare in the training set but common in the test set: The model will underestimate the importance of this word during training, leading to poor performance when the word is critical in test documents.
When a word is common in the training set but rare in the test set: The model will overemphasize the word during training, leading to poor predictions when the word doesn’t appear as often in unseen data.

Case Study: Data Leakage in TF-IDF

To see this effect in action, consider a small toy dataset where the presence of the word “banana” determines the label. If the word “banana” appears in a sentence, the label is 1; otherwise, the label is 0. Using TF-IDF to vectorize the text, we train a machine learning model to predict this label.

In the first scenario, we calculate the TF-IDF using the entire dataset before splitting it into training and testing sets. This causes data leakage since the model now knows the distribution of words across both sets. For instance, if “banana” is more common in the test set than the training set, the IDF score for “banana” will be lower across the entire dataset, leading the model to downplay its importance.

In the second scenario, we calculate TF-IDF only on the training set, ensuring that the test set remains unseen. This preserves the integrity of the test set, giving us a more realistic evaluation of the model’s performance.

Between the two scenarios, the model’s accuracy differs drastically. When leakage is present, performance is artificially high during training but poor when tested on unseen data. Without leakage, the model generalizes better, as it is evaluated on truly unseen data.

Avoiding Data Leakage

Avoiding data leakage is essential for building reliable machine learning models that generalize well to new data. Here are a few guidelines to help prevent leakage:

Split the dataset before feature extraction: Always divide your data into training and test sets before applying any feature engineering techniques.
Ensure proper cross-validation: When using cross-validation, ensure that the training and test splits do not overlap in any way that can leak information between them.
Be cautious with time-series data: In time-series models, avoid using future data to predict past events, as this can lead to leakage.
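The first guideline can be sketched with a toy vectorizer that separates fitting from transforming. `TinyTfidf` is a hypothetical stand-in written for this post, not a library class; with a real library the pattern is the same: fit on the training split only, then transform both splits.

```python
import math
from collections import Counter

class TinyTfidf:
    """Toy TF-IDF vectorizer; its statistics come only from the data passed to fit()."""

    def fit(self, documents):
        n = len(documents)
        df = Counter(term for doc in documents for term in set(doc.lower().split()))
        self.vocabulary = sorted(df)
        self.idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in self.vocabulary}
        return self

    def transform(self, documents):
        rows = []
        for doc in documents:
            tf = Counter(doc.lower().split())
            # Terms never seen during fit() are dropped rather than learned.
            rows.append([tf[t] * self.idf[t] for t in self.vocabulary])
        return rows

documents = ["banana bread", "apple pie", "banana split", "cherry tart"]

# 1) Split first. 2) Fit on the training split only. 3) Transform both splits.
train_docs, test_docs = documents[:2], documents[2:]
vectorizer = TinyTfidf().fit(train_docs)
X_train = vectorizer.transform(train_docs)
X_test = vectorizer.transform(test_docs)
print(vectorizer.vocabulary)
```

Words that occur only in the test split, such as "split" or "cherry" here, never enter the vocabulary or the IDF table, which is exactly the isolation the guideline asks for.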

Conclusion

Avoiding data leakage is crucial for building robust machine learning models. In the case of TF-IDF, ensuring that feature extraction is done only on the training set and not on the entire dataset is key to preventing leakage. Properly addressing this issue leads to better generalization and more reliable models in real-world applications.

This blog post provided a case study on how TF-IDF can introduce data leakage and why it’s important to carefully handle your dataset before feature extraction. By splitting your data properly and ensuring that no test data “leaks” into the training process, you can build models that truly reflect real-world performance.

Thanks for reading!

AutoAppendix: Towards One-Click reproducibility of high-performance computing experiments

Wed, 04 Sep 2024 00:00:00 +0000

Hi everyone,

I’m excited to wrap up the AutoAppendix project with our final findings and insights. Over the course of this initiative, we’ve worked to assess the reproducibility of artifacts submitted to the SC24 conference and create guidelines that aim to improve the standard for reproducible experiments in the future. Here’s a summary of the project’s final phase and what we’ve learned.

Project Goals and Progress

The goal of AutoAppendix was to evaluate the computational artifacts provided by SC24 paper submissions, focusing on reproducibility. These artifacts accompany papers applying for the “Artifact Replicable” badge in the conference’s reproducibility initiative. Volunteer members of this initiative assess 1-2 paper appendices each. In this project, we analyzed a larger portion of artifacts to gain a broader perspective on potential improvements to the reproducibility process.

We selected 18 out of 45 submissions, focusing on experiments that could be easily replicated on Chameleon Cloud. Our evaluation criteria were based on simplicity (single-node setups) and availability of resources. The final analysis expanded on the earlier midterm findings, shedding light on various challenges and best practices related to artifact reproducibility.

Artifact Evaluation Process

During the evaluation process, we focused on examining the completeness and clarity of the provided artifacts, looking closely at documentation, setup instructions, and the degree of automation.

Our first step was to replicate the environments used in the original experiments as closely as possible using the resources from Chameleon. Many papers included instructions for creating the necessary software environments, but the clarity of these instructions varied significantly across submissions. In some cases, we even encountered challenges in reproducing results due to unclear instructions or missing dependencies, which reinforced the need for standardized, clear documentation as part of the artifact submission process.

We observed that containerization and semi-automated setups (with scripts that break down the experiment into smaller steps) were particularly effective in enhancing the reproducibility of the artifacts. One artifact particularly caught our attention due to its usage of the Chameleon JupyterHub platform, making it reproducible with a single click. This highlighted the potential for streamlining the reproducibility process and showcased that, with sufficient effort and the right tools, experiments can indeed be made replicable by anyone.

Results

Throughout the evaluation, we observed that reproducibility could vary widely based on the clarity and completeness of the documentation and the automation of setup procedures. Artifacts that were structured with clear, detailed steps for installation and execution tended to perform well in terms of replicability.

From our evaluation, we derived a set of guidelines (intended as must-haves) and best practices (recommended) for artifact reproducibility, which can be found below.

Due to our fascination with the potential of the Chameleon JupyterHub platform and its adjacent Trovi artifact repository, we decided to create several templates that can be used as a starting point for authors to make integration of their artifacts with the platform easier. In the design of these templates, we made sure that artifacts structured according to our guidelines are particularly easy to integrate.

Guidelines

Clear Documentation: Provide clear and detailed documentation for the artifact in the corresponding appendix, such that the artifact can be replicated without the need for additional information. For third-party software, it is acceptable to refer to the official documentation.
Software Setup: Clearly specify the versions of all (necessary) software components used in the creation of the artifact. This includes the operating system, libraries, and tools. In particular, state all software setup steps needed to replicate the software environment.
Hardware Specifications: Specify the hardware the experiment was conducted on. Importantly, state the architecture the experiments are intended to run on, and ensure that provided software (e.g. Docker images) is compatible with commonly available architectures.
Expected Results: Always provide the expected outputs of the experiment, especially when run on different hardware, to make it easier for reviewers to assess the success of the replication.
Public Data: Publish the experiment data to a public repository, and make sure the data is available for download to reviewers and readers, especially during the evaluation period. Zenodo is a recommended repository for this purpose.
Automated Reproducibility: For long-running experiments, provide progress output to the reviewer to ensure the experiment is running as expected. Give an idea in the documentation of

how much time long-running steps in the reproduction will take
what the progress output looks like or how frequently it is emitted

Sample Execution: Conduct a sample evaluation with hardware and software as similar as possible to the intended reproduction environment.
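As one way to follow the Automated Reproducibility guideline, the sketch below prints a progress line after each step of a long-running experiment. The function names and the line format are hypothetical, and the remaining-time estimate assumes that all steps take roughly the same time.

```python
import time

def format_progress(done, total, elapsed_s):
    """Build one progress line with percent complete and a naive time estimate."""
    percent = 100.0 * done / total
    # Linear extrapolation; assumes every step takes roughly the same time.
    remaining_s = elapsed_s / done * (total - done) if done else float("nan")
    return (
        f"[{done}/{total}] {percent:5.1f}% "
        f"elapsed={elapsed_s:.0f}s remaining~{remaining_s:.0f}s"
    )

def run_steps(steps, clock=time.monotonic):
    """Run callables in order, printing a progress line after each one."""
    start = clock()
    for index, step in enumerate(steps, start=1):
        step()
        # flush=True so the line shows up immediately, even when output is piped.
        print(format_progress(index, len(steps), clock() - start), flush=True)

# Stand-in workload: three short steps.
run_steps([lambda: time.sleep(0.01)] * 3)
```

Documenting a sample of these lines in the appendix tells reviewers what healthy output looks like and how often to expect it.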

Best Practices

Reproducible Environment: Use a reproducible environment for the artifact. This can come in several forms:

Containerization: Provide instructions for building the environment, or, ideally, provide a ready-to-use image. For example, Docker, Singularity or VirtualBox images can be used for this purpose.
Reproducible Builds: Package managers like Nix or Guix have recently spiked in popularity and allow their users to create reproducible environments, matching the exact software versions across different systems.

Partial Automation: It often makes sense to break an experiment down into smaller, more manageable steps. For Linux-based systems, bash scripts are particularly viable for this purpose. We recommend prefixing the scripts for each step with a number, such that the order of execution is clear.
X11 Availability: Usually, reviewers will not have access to a graphical user interface on the system where the artifact is evaluated. If the artifact requires a graphical user interface, provide a way to run the artifact without it. For example, save matplotlib plots to disk instead of showing them with plt.show().
Experiment output: Do not provide output files of the experiment in your artifact, unless explicitly intended. If provided output files are intended for comparison, they should be marked as such (e.g. in their filename). Similarly, any output logs or interactive outputs in Jupyter notebooks should not be part of the artifact, but should rather be generated during the artifact evaluation.
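The numbered-script convention from the Partial Automation practice is easy to process mechanically. The sketch below, with hypothetical file names and a hypothetical `ordered_steps` helper, sorts step scripts by their numeric prefix so a driver can run them in the intended order.

```python
import re

# Matches names like "1_setup.sh" or "02-build.sh"; the pattern is an assumption.
STEP_PATTERN = re.compile(r"^(\d+)[_-].+\.sh$")

def ordered_steps(filenames):
    """Return numbered step scripts sorted by their numeric prefix."""
    steps = []
    for name in filenames:
        match = STEP_PATTERN.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return [name for _, name in sorted(steps)]

files = ["10_plot.sh", "2_build.sh", "1_setup.sh", "README.md", "3_run.sh"]
print(ordered_steps(files))
```

Sorting on the parsed number avoids the pitfall of plain alphabetical order, where `10_plot.sh` would come before `2_build.sh`; zero-padding the prefixes (`01_`, `02_`, ...) sidesteps the same problem for anyone listing the directory by hand.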

Trovi Templates

Our templates share a common base that features a central configuration file for modifying the Chameleon experiment parameters (such as node type). Building on this base, we provide three templates with sample experiments that each use different environments:

Docker template: This template is designed for containerized experiments and supports NVIDIA GPUs via the nvidia-container-toolkit integration.
Nix template: Sets up the Nix package manager with a shell.nix file that can be used to configure the environment.
Guix template: Installs the Guix package manager and executes a sample experiment from an existing reproducible paper that hinges on the reproducibility of the software environment.

Conclusion

In summary, the AutoAppendix project has been an insightful journey into the complexities of artifact reproducibility. Our evaluations highlight both the challenges and potential solutions for future reproducibility initiatives. By following these essential guidelines and implementing best practices, we aim for the research community to achieve higher standards of transparency and reliability in scientific research and help to ensure that the results of experiments can be replicated by others.

Thanks for following along with our progress! We’re excited to see the positive impact these findings will have on the research community.

If you are interested in the full project report, you can find it here, together with the Trovi templates.

Reflecting on the ScaleRep Project: Achievements and Insights

Mon, 02 Sep 2024 00:00:00 +0000

Hello everyone,

As we reach the conclusion of our ScaleRep project, I want to take a moment to reflect on the journey we’ve undertaken and the significant milestones we’ve achieved. Throughout this project, our primary focus was on identifying, reproducing, and analyzing scalability bugs in cloud systems such as Cassandra, HDFS, and Hadoop. Under the mentorship of Professor Yang Wang and Bogdan “Bo” Stoica, we have gained valuable insights into the complexities of scalability issues and their impact on large-scale distributed systems.

Key Accomplishments

Over the course of the project, we delved into various aspects of scalability bugs, reproducing some of the most challenging issues faced by cloud systems. One of our notable accomplishments was the successful reproduction and validation of developer fixes for several critical bugs in HDFS. These included:

1. Throttling Bugs in HDFS:

We investigated HDFS-17087, where the absence of a throttler in DataXceiver#readBlock led to unregulated data reads, causing potential performance degradation. By reproducing the bug and applying the developer’s patch, we were able to observe significant improvements in system stability.

2. Reducing DataNode Load:

HDFS-16386 was another crucial bug we worked on, which involved reducing the load on DataNodes when FsDatasetAsyncDiskService was working. By analyzing the effects of high CPU and memory usage, we proposed and validated a solution that reduced the number of concurrent threads, ultimately improving the DataNode’s performance.

3. Improving Log Throttling:

In HDFS-16872, we addressed excessive logging caused by unshared instances of LogThrottlingHelper. By making LogThrottlingHelper a static member, we were able to share throttling across instances, reducing unnecessary log entries and improving system efficiency.
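To illustrate why sharing the throttling state matters, here is a small Python sketch. It is not Hadoop's code and does not mirror the real LogThrottlingHelper API; `LogThrottler` and the two worker classes are hypothetical and only contrast per-instance state with class-level (static) state.

```python
class LogThrottler:
    """Allow at most one log line per `min_interval_s` seconds."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s
        self.last_logged_at = None

    def should_log(self, now_s):
        if self.last_logged_at is None or now_s - self.last_logged_at >= self.min_interval_s:
            self.last_logged_at = now_s
            return True
        return False

class PerInstanceWorker:
    def __init__(self):
        # One throttler per object: every worker gets its own budget.
        self.throttler = LogThrottler(min_interval_s=5.0)

class SharedWorker:
    # Class-level (static) throttler: all workers share a single budget.
    throttler = LogThrottler(min_interval_s=5.0)

def count_logged(workers, now_s):
    """Count how many workers would emit a log line at time `now_s`."""
    return sum(1 for worker in workers if worker.throttler.should_log(now_s))

per_instance = count_logged([PerInstanceWorker() for _ in range(10)], now_s=0.0)
shared = count_logged([SharedWorker() for _ in range(10)], now_s=0.0)
print(per_instance, shared)
```

With ten objects hitting the same condition at once, the per-instance variant lets all ten messages through while the shared variant emits one, which is the kind of reduction the fix aims for.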

Insights and Learnings

1. Systematic Bug Reproduction:

One of the most critical aspects of our work was developing a systematic approach to bug reproduction. This involved carefully setting up the environment, applying patches, and validating results through detailed monitoring and analysis. Our reproducible artifacts and investigation scripts will serve as a resource for future researchers and developers.

2. Impact of Throttling Mechanisms:

Our exploration of throttling bugs highlighted the importance of accurate throttling mechanisms in maintaining system performance and stability. Small issues, such as incorrect data rate calculations, can have significant ripple effects on system behavior, emphasizing the need for precise and effective solutions.

3. Collaboration and Open Source Contribution:

Working on an open-source project like ScaleRep underscored the importance of collaboration within the community. The bugs we analyzed and fixed not only improved the systems we worked on but also contributed to the broader effort of enhancing the reliability of cloud systems.

Conclusion

As we wrap up the ScaleRep project, I am proud of the progress we have made and the contributions we have delivered to the open-source community. The knowledge and experience gained from this project will undoubtedly shape our future endeavors in the field of distributed systems and cloud computing. I am grateful for the guidance and support provided by Professor Yang Wang and Bogdan “Bo” Stoica throughout this journey.

Thank you for following along, and I look forward to continuing to explore the future of scalable and reliable cloud systems!

Static and Interactive Visualization Capture

Fri, 30 Aug 2024 00:00:00 +0000

Introduction

Hello! My name is Arya Sarkar, a machine learning engineer and researcher based out of Kolkata, a city in Eastern India dubbed the City of Joy. During the summer of 2024, I worked closely with Professor David Koop on the project titled Reproducibility in Data Visualization. We explored multiple existing solutions, tested different strategies, and made great progress in the capture of visualizations using a relatively little-used method: embedding visualization meta-information into the final resultant visualization's jpg as a JSON object.

Progress and Challenges

Static Visualization Capture

We successfully developed a method to capture static visualizations as .png files along with embedded metadata in a JSON format. This approach enables seamless reproducibility of the visualization by storing all necessary metadata within the image file itself. Our method supports both Matplotlib and Bokeh libraries and demonstrated near-perfect reproducibility, with only a minimal 1-2% pixel difference in cases where jitter (randomness) was involved.
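The post does not say which mechanism stores the JSON inside the image, so the following is only a minimal sketch under one assumption: the metadata is written into a PNG tEXt chunk. The keyword vis_metadata, the function names, and the example metadata fields are all hypothetical, and only the Python standard library is used.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (type, data) for every chunk after the 8-byte signature."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_metadata(png: bytes, metadata: dict, keyword: str = "vis_metadata") -> bytes:
    """Insert a tEXt chunk holding JSON right after the IHDR chunk."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    # json.dumps escapes non-ASCII by default, so the text is Latin-1 safe.
    payload = keyword.encode("latin-1") + b"\x00" + json.dumps(metadata).encode("ascii")
    ihdr_end = len(PNG_SIGNATURE) + 12 + 13  # IHDR data is always 13 bytes
    return png[:ihdr_end] + make_chunk(b"tEXt", payload) + png[ihdr_end:]


def read_metadata(png: bytes, keyword: str = "vis_metadata"):
    """Return the embedded JSON object, or None if no matching chunk exists."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            if key.decode("latin-1") == keyword:
                return json.loads(value.decode("latin-1"))
    return None


def tiny_png() -> bytes:
    """A 1x1 grayscale PNG, used here only as a stand-in for a real plot."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


example = {"library": "matplotlib", "kind": "scatter", "seed": 42}
stamped = embed_metadata(tiny_png(), example)
print(read_metadata(stamped))
```

Because the chunk is ancillary, image viewers ignore it and the picture renders unchanged, while a reproduction script can read the JSON back and rebuild the plot from it.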

Interactive Visualization Capture

For interactive visualizations, our focus shifted to capturing state changes in Plotly visualizations on the web. We developed a script that tracks user interactions (e.g., zoom, box, lasso, slider) using event listeners and automatically captures the visualization state as both image and metadata files. This script also maintains a history of interactions to ensure reproducibility of all interaction states.

The challenge of capturing web-based visualizations from platforms like ObservableHq remains, as iframe restrictions prevent direct access to SVG elements. Further exploration is needed to create a more robust capture method for these environments.

Future Work

We aim to package our interactive capture script into a Google Chrome extension.

Temporarily store interaction session files in the browser’s local storage.

Enable users to download captured files as a zip archive, using base64 encoding for images.

Conclusion

Over the summer, we made significant strides in enhancing data visualization reproducibility. Our innovative approach to embedding metadata directly into visualization files offers a streamlined method for recreating static visualizations. The progress in capturing interactive visualization states opens new possibilities for tackling a long-standing challenge in the field of reproducibility.

Final Blog: BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking

Thu, 29 Aug 2024 00:00:00 +0000

Hello! I’m Qianru! I have been contributing to the BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking project under the mentorship of Ziheng Duan. My project aims to provide a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics.

Motivation and Overview

The “BenchmarkST” project was driven by the need to address a critical challenge in spatial transcriptomics: the impact of sparse data on downstream tasks, such as spatial domain identification. Sparse data can significantly degrade the performance of these tasks. For example, in a 10X Visium dataset of human brain Dorsolateral Prefrontal Cortex (DLPFC), using the complete dataset with GraphST (a state-of-the-art clustering method) for clustering resulted in an ARI (Adjusted Rand Index) of 0.6347. However, when using only 20% of the data—a common scenario—the performance dropped dramatically to 0.1880. This stark difference highlights the importance of effective gene imputation, which can help restore the lost information and improve the accuracy of downstream analyses.
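For readers unfamiliar with the metric, here is a small standard-library sketch of how an Adjusted Rand Index can be computed from two label assignments. It follows the usual pair-counting definition and is not taken from the project's code; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI: 1.0 for identical partitions, about 0 for random ones."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0
    # Contingency table cells and its row/column sums.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_cells = sum(comb(n, 2) for n in cells.values())
    sum_rows = sum(comb(n, 2) for n in rows.values())
    sum_cols = sum(comb(n, 2) for n in cols.values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial in the same way
    return (sum_cells - expected) / (max_index - expected)


# Toy example: one spot out of four is assigned to an extra cluster.
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

The score is invariant to how clusters are named, which is why it suits comparing a clustering against annotated spatial domains.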

To tackle this issue, the BenchmarkST project led to the creation of the Impeller package. This package provides a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics, offering preprocessed datasets, reproducible evaluation methods, and flexible inference interfaces. It spans across different platforms, species, and organs, aiming to enhance the integrity and usability of spatial transcriptomics data.

What Was Accomplished

Development of the Impeller Package

Data Aggregation and Preprocessing:

We aggregated and preprocessed spatial transcriptomic datasets from multiple platforms (10X Visium, StereoSeq, SlideSeqV2), species (human, mouse), and organs (Dorsolateral Prefrontal Cortex, olfactory bulb). These datasets are readily available for download within the package.

Unified Evaluation Framework:

A reproducible framework was developed, integrating methods such as K-Nearest Neighbors (KNN) and the deep learning-based Impeller method, enabling users to easily evaluate the performance of different gene imputation techniques.
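To make the KNN baseline concrete, the sketch below imputes a missing expression value for one gene as the mean over the k spatially nearest spots where that gene was observed. This is an assumption about what a KNN imputer could look like, not the Impeller package's implementation; knn_impute, the coordinates, and the values are hypothetical.

```python
import math


def knn_impute(coords, values, k=3):
    """Fill None entries with the mean of the k nearest observed spots.

    coords: list of (x, y) spot positions.
    values: expression of one gene per spot, None where it is missing.
    """
    observed = [(c, v) for c, v in zip(coords, values) if v is not None]
    if not observed:
        raise ValueError("at least one observed value is required")
    imputed = []
    for coord, value in zip(coords, values):
        if value is not None:
            imputed.append(value)  # keep measured values untouched
            continue
        nearest = sorted(observed, key=lambda cv: math.dist(coord, cv[0]))[:k]
        imputed.append(sum(v for _, v in nearest) / len(nearest))
    return imputed


spots = [(0, 0), (1, 0), (0, 1), (10, 10)]
gene = [None, 2.0, 4.0, 100.0]
print(knn_impute(spots, gene, k=2))
```

With k=2 the far-away spot at (10, 10) is ignored, so the missing value is filled from its two immediate neighbours.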

Inference Interfaces:

We provided interfaces that allow users to apply gene imputation on custom datasets, offering the flexibility to predict any gene in any cell, maximizing the utility for diverse research needs.

Code Contributions and Documentation

Repository:

All code related to the Impeller package has been committed to the Impeller repository.

Link to Versions:

Here you can find all the versions made during the project, with detailed descriptions of each change.

README.md:

Detailed documentation on how to use the Impeller package, including installation instructions, usage examples, and explanations of the key components.

Final Blog: ML in Detecting and Addressing System Drift

Thu, 29 Aug 2024 00:00:00 +0000

Hello! I’m Joanna! I have been contributing to the ML in Detecting and Addressing System Drift project under the mentorship of Ray Andrew Sinurat and Sandeep Madireddy. My project aims to design a pipeline to evaluate drift detection algorithms on system traces.

Methodology

Here is some background on my project: Model drift, or the degradation of model performance, is typically caused by data drift, which is a shift in the input distribution, and concept drift, which is a change in the relationship between input and output. This project focuses specifically on data drift, aiming to design a pipeline for evaluating drift detection algorithms on system traces. The goal is to benchmark different drift detection algorithms and have a better understanding of the features of system traces. The project is divided into two main parts: dataset construction and algorithm benchmarking.

PART 1: Dataset Construction

To benchmark drift detection algorithms in system data, it’s important to recognize that system trace data is inherently different from other data types, often containing more noise, which can complicate detection efforts. Therefore, constructing a labeled dataset specific to system data is crucial. In our case, we use the Tencent I/O block trace data as the dataset. This raw data was processed to extract timestamps along with various features such as IOPS, write size ratio, and read/write ratio, which were then used to create a data drift dataset.

I constructed this dataset by labeling segments of the trace data as either exhibiting drift or not. To identify where the drift occurs and to help construct the dataset, I employed several offline drift detection algorithms, including Kolmogorov-Smirnov, Cramer-von Mises, KL-Divergence, and Jensen-Shannon Distance.
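As a rough illustration of two of the offline detectors named above, the sketch below computes a two-sample Kolmogorov-Smirnov statistic and a Jensen-Shannon distance between a reference window and a current window of one feature. The thresholds, bin edges, function names, and sample values are assumptions for the example, not values used in the project.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def histogram(sample, edges):
    """Normalized histogram; values outside the edges go to the end bins."""
    counts = [0] * (len(edges) - 1)
    for x in sample:
        idx = min(max(bisect_right(edges, x) - 1, 0), len(counts) - 1)
        counts[idx] += 1
    return [c / len(sample) for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance (base 2), between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def score_segment(reference, current, edges, ks_threshold=0.3, js_threshold=0.3):
    """Score one pair of windows and record each detector's verdict."""
    ks = ks_statistic(reference, current)
    js = js_distance(histogram(reference, edges), histogram(current, edges))
    return {"ks": ks, "js": js, "votes": [ks > ks_threshold, js > js_threshold]}


# Hypothetical IOPS windows: the second one has clearly shifted upward.
print(score_segment([100, 110, 120, 130, 140], [300, 310, 320, 330, 340], [0, 200, 400]))
```

Keeping each detector's verdict separate makes it straightforward to combine them later when deciding on the final label for a segment.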

To enhance the accuracy of the drift detection, especially in the presence of noise common in trace data, I applied additional preprocessing steps such as Fourier transform and moving average. These techniques help to smooth the data, making it easier to detect true drift signals. Finally, a voting strategy was used in combination with post-processing methods to build and refine the final datasets.
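The smoothing and voting steps can be pictured with the short sketch below. The post does not specify the voting rule, so a simple strict majority is assumed here; the window size, function names, and the sample series are hypothetical.

```python
def moving_average(series, window):
    """Smooth a series with a sliding mean; output has len(series) - window + 1 points."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    running = sum(series[:window])
    smoothed = [running / window]
    for i in range(window, len(series)):
        running += series[i] - series[i - window]  # slide the window by one
        smoothed.append(running / window)
    return smoothed


def majority_vote(votes):
    """Label a segment as drift only if more than half of the detectors agree."""
    return sum(votes) * 2 > len(votes)


noisy_iops = [1, 2, 3, 4, 5]
print(moving_average(noisy_iops, 3))
print(majority_vote([True, True, False]))
```

Smoothing before detection reduces the chance that a single noisy spike is mistaken for drift, and requiring agreement between detectors adds a second guard against false labels.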

The first figure below illustrates the segments of IOPS where drift has been detected. The second figure shows the segments of data where no drift occurs.

PART 2: Benchmark Drift Detection Algorithms

This part focuses on benchmarking the Jensen-Shannon and Wasserstein drift detection methods using system trace data. The evaluation metrics are categorized into three main areas:

Detection Accuracy Metrics

True Positive Rate (Recall)
True Negative Rate (Specificity)
Precision
F1-Score

Detection Overhead Metrics

Time Taken: The computational time required to detect drifts.

Stability Metrics

False Positive Rate
False Negative Rate

(Additional) Comparative Analysis:

Accuracy Across Different Features: How well the detection algorithms perform when applied to various features within the system trace data.
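The accuracy and stability metrics listed above all derive from the same four confusion counts. A minimal sketch, assuming each segment has a ground-truth drift label and a detector prediction encoded as 1 (drift) or 0 (no drift); the function name and sample labels are illustrative.

```python
def detection_metrics(y_true, y_pred):
    """Compute recall, specificity, precision, F1, FPR and FNR from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,                      # true positive rate
        "specificity": ratio(tn, tn + fp),     # true negative rate
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),             # 1 - specificity
        "fnr": ratio(fn, fn + tp),             # 1 - recall
    }


labels = [1, 1, 0, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0]
print(detection_metrics(labels, predictions))
```

Note that the false positive and false negative rates are the complements of specificity and recall, so the stability metrics carry the same information viewed from the error side.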

Discussion

The results clearly demonstrate that the Jensen-Shannon distance method outperforms the Wasserstein distance method in detecting drift. Additionally, the write size ratio proves to be a more effective feature for representing the variations in the data, offering a more nuanced understanding of the underlying changes.

Conclusion and Next Steps

In conclusion, this project establishes a pipeline that encompasses data labeling, data processing, and the benchmarking of drift detection algorithms. This is only a first step in detecting drift in system data.

There is significant potential for further improvement. Future work should focus on enhancing dataset construction by incorporating large language models (LLMs) and other advanced techniques to further clean and refine the datasets. Additionally, the evaluation of drift detection methods should be expanded beyond the current benchmarks, which only include two statistical methods. Incorporating additional statistical methods, as well as machine learning (ML) and deep learning (DL) approaches, could provide a more comprehensive analysis. Furthermore, exploring a broader range of evaluation metrics will ensure a more robust and accurate assessment of drift detection performance. These steps will help to advance the accuracy and reliability of drift detection in system trace data.

Deliverables

The following are the deliverables of this project:

Trovi Artifact
Github Repository: This repository contains the code for generating drift datasets with labels and notebooks with benchmarking results

Final Blogpost: Reproducibility in Data Visualization

Wed, 28 Aug 2024 00:00:00 +0000

Hello everyone!

I’m Triveni, a Master’s student in Computer Science at Northern Illinois University (NIU). I’m excited to share my progress on the OSRE 2024 project Categorize Differences in Reproduced Visualizations focusing on data visualization reproducibility. Working under the mentorship of David Koop, I’ve made some significant strides and faced some interesting challenges.

Reproducibility in data visualization

Reproducibility is crucial in data visualization, ensuring that two visualizations accurately convey the same data. This is essential for maintaining transparency and trust in data-driven decision-making. When comparing two visualizations, the challenge is not just spotting differences but determining which differences are meaningful. Tools like OpenCV are often used for image comparison, but they may detect all differences, including those that do not impact the data’s interpretation. For example, slight shifts in labels might be flagged as differences even if the underlying data remains unchanged, making it challenging to assess whether the visualizations genuinely differ in terms of the information they convey.

A Breakthrough with ChartDetective

Among various tools like ChartOCR and ChartReader, ChartDetective proved to be the most effective. This tool enabled me to extract data from a range of visualizations, including bar charts, line charts, box plots, and scatter plots. To enhance its capabilities, I modified the codebase to capture pixel values alongside the extracted data and store both in a CSV file. This enhancement allowed for a direct comparison of data values and their corresponding pixel coordinates between two visualizations, focusing on meaningful differences that truly impact data interpretation.

Example: Comparing Two Bar Plots with ChartDetective

Consider two bar plots that visually appear similar but have slight differences in their data values. Using ChartDetective, I extracted the data and pixel coordinates from both plots and stored this information in a CSV file. The tool then compared these values to identify any discrepancies.

For instance, in one bar plot, the height of a specific bar was slightly increased. By comparing the CSV files generated by ChartDetective, I was able to pinpoint these differences precisely. The final step involved highlighting these differences on one of the plots using OpenCV, making it clear where the visualizations diverged. This approach ensures that only meaningful differences (those that reflect changes in the data) are considered when assessing reproducibility.

- ChartDetective: An SVG or PDF file of the visualization is uploaded to extract data.

- Data Extraction: Data values along with pixel details are stored in the CSV files.

- Highlighting the differences: Differences are highlighted on one of the plots using OpenCV.
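The comparison step between the two extracted CSV files could look roughly like the sketch below. The column names (label, value, x_px, y_px), the tolerance, and the function names are assumptions made for illustration; the actual layout of the files produced by the modified ChartDetective is not described in this post.

```python
import csv
import io


def load_rows(csv_text):
    """Index extracted chart rows by their label."""
    return {row["label"]: row for row in csv.DictReader(io.StringIO(csv_text))}


def compare_charts(csv_a, csv_b, value_tol=1e-6):
    """Separate real data differences from pixel-only shifts."""
    a, b = load_rows(csv_a), load_rows(csv_b)
    data_diffs, pixel_only = [], []
    for label in sorted(a.keys() & b.keys()):
        value_a, value_b = float(a[label]["value"]), float(b[label]["value"])
        pos_a = (float(a[label]["x_px"]), float(a[label]["y_px"]))
        pos_b = (float(b[label]["x_px"]), float(b[label]["y_px"]))
        if abs(value_a - value_b) > value_tol:
            data_diffs.append((label, value_a, value_b))  # meaningful change
        elif pos_a != pos_b:
            pixel_only.append(label)  # same data, drawn at a different position
    unmatched = sorted(a.keys() ^ b.keys())  # marks present in only one chart
    return data_diffs, pixel_only, unmatched


original = "label,value,x_px,y_px\nA,10,50,200\nB,20,100,150\nC,30,150,100\n"
reproduced = "label,value,x_px,y_px\nA,10,50,200\nB,22,100,140\nC,30,152,100\n"
print(compare_charts(original, reproduced))
```

Only the entries in the first list would need highlighting on the plot; the pixel-only list captures shifts that leave the underlying data unchanged.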

Understanding User Perspectives on Reproducibility

To complement the technical analysis, I created a pilot survey to understand how users perceive reproducibility in data visualizations. The survey evaluates user interpretations of two visualizations and explores which visual parameters impact their decision-making. This user-centered approach is crucial because even minor differences in visual representation can significantly affect how data is interpreted and used.

Pilot Survey Example:

Pixel Differences: In one scenario, the height of two bars was altered slightly, introducing a noticeable yet subtle change.

Label Swapping: In another scenario, the labels of two bars were swapped without changing their positions or heights.

Participants will be asked to evaluate the reproducibility of these visualizations, considering whether the differences impact their interpretation of the data. The goal is to determine which visual parameters (such as bar height or label positioning) users find most critical when assessing the similarity of visualizations.

Future Work and Conclusion

Going forward, I plan to develop a proof of concept based on these findings and implement an extensive survey to further explore the impact of visual parameters on users’ perceptions of reproducibility. Understanding this will help refine tools and methods for comparing visualizations, ensuring they not only look similar but also accurately represent the same underlying data.

Final Blogpost: Drift Management Strategies Benchmark

Sat, 24 Aug 2024 00:00:00 +0000

Background

Hello there! I’m William, and this is the final blog for my proposal “Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches”, mentored by Ray Andrew Sinurat and Sandeep Madireddy as part of the LAST project.

If you’re not familiar with it, this project aims to address the issue of model aging, where machine learning (ML) models experience a decline in effectiveness over time due to environmental changes, known as drift. My goal is to design an extensible pipeline that evaluates and benchmarks the robustness of state-of-the-art algorithms in addressing these drifts.

Deliverables

You can find my list of deliverables here:

Final report, this blog is a summarized version of my final report, so do take a look if you’d like to know more!
Github repository, contains code as well as the raw experiment results.
Trovi artifact

Evaluation

Here are some of the graphs that show the performance of every algorithm on the created datasets. For more graphs and figures, you can check out my final report:

CIRCLE: AUE demonstrates stability, maintaining a high accuracy even as the data drifts, which may be due to its ensemble nature. It is even more stable than the baseline retraining algorithms. Matchmaker is also able to recover quickly upon experiencing drift, recovering faster than RetrainWin, which may again be due to its ranking of the highest-performing models for inference. On the other hand, DriftSurf experiences several random drops in accuracy, indicating that it can be somewhat unstable.
SINE: Similar to CIRCLE, AUE demonstrates stability throughout the dataset, maintaining a high accuracy even as the data drifts. Matchmaker, however, struggled to adapt as quickly when encountering such a sudden drift, as it needed some time (several windows) to recover from the drop. DriftSurf’s performance was notably better than the baselines: unlike them, it was able to recover fairly quickly upon experiencing drift.
CovCon: In CovCon, Matchmaker achieved the best accuracy, as it is able to select the models most relevant to each incoming batch (the models trained on the most similar features), performing comparably to the retrain window baseline. Most of the other algorithms suffered on this dataset, particularly AUE, whose performance here is only comparable to the rest of the algorithms and the baselines.
IOAdmission: Performance on this dataset was led by AUE, which was able to maintain impressive stability amongst all of the algorithms used. This is followed closely by Matchmaker. The other algorithms used undergo a lot of fluctuations in accuracy.

Findings / Discussion

From the experiments conducted, the findings are as follows:

Matchmaker performed particularly well on the CovCon dataset. This may be due to its ability to choose the most relevant trained model from its ensemble at inference time. Its training time is also the best compared to the other algorithms, especially considering that it keeps data for training an additional random forest model for ranking the models. However, its inference time was the longest of all the algorithms. This may be because, at inference time, it needs to traverse all of the leaf nodes of the random forest used for ranking (computing covariate shift).
AUE performed particularly well on the CIRCLE and IOAdmission datasets, and it is quite competitive on the other datasets too. Its weighting function, which rewards highly relevant models and evicts less relevant ones, may be key. Its inference time is decent compared to the other algorithms, being slower than most baselines and DriftSurf, but faster than Matchmaker. However, its training time was the longest of all competitors, as it has an expensive weighting function to weight, evict, or retrain models on every retraining.
DriftSurf performed very similarly to the RetrainWindow baseline on almost all datasets, except for IOAdmission and SINE, where it did better. This may be because it maintains at most 2 models in every iteration, and as such, its performance was not competitive against the multi-model approach used in Matchmaker and AUE. On the plus side, its inference time is comparable to the single-model baseline, with almost no inference overhead compared to most of the competitors. Another plausible explanation for the lack of performance is the lack of tuning of parameters such as the number of windows retained, the length of its reactive period, and its reactivity sensitivity threshold. Better performance could be achieved if these parameters were tuned further.

Next Steps

These are some of the potential extensions for this project:

Optimize Matchmaker’s inference time: improving Matchmaker’s efficiency, especially in covariate shift ranking, can reduce inference time. Simplifying the random forest traversal could make Matchmaker faster without impacting performance.
Extend the work to include other frameworks like TensorFlow or PyTorch, as the pipeline currently supports only scikit-learn base models.

Thank you for reading!

Reproducing and addressing Data Leakage issue : Duplicates in dataset

Fri, 23 Aug 2024 00:00:00 +0000

Hello!

In this blog post, I will explore a common issue in machine learning called data leakage, using an example from the paper:

Benedetti, P., Perri, D., Simonetti, M., Gervasi, O., Reali, G., Femminella, M. (2020). Skin Cancer Classification Using Inception Network and Transfer Learning. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science, vol 12249. Springer, Cham. https://doi.org/10.1007/978-3-030-58799-4_39 arXiv

Overview of the Paper

In this paper, the authors use transfer learning on a pretrained convolutional neural network (CNN) to classify skin lesions in dermatoscopic images from the HAM10000 (Human Against Machine with 10,000 training images) dataset. The paper reports a final accuracy of 78.9% on the validation set.

While this reported result appears to be impressive, there are concerns regarding the validity of this performance metric due to data leakage. Data leakage occurs when the model is trained or evaluated on data that it would not have access to during real-world deployment, leading to an overestimation of the model’s true performance.

Identifying Data Leakage in the Original Paper

Upon closer inspection, it appears that the original experiment suffers from data leakage in two significant ways:

Duplicate Images in Training and Validation Sets:

The HAM10000 dataset contains near-duplicate images of the same lesions in both the training and validation sets. This results in the model seeing very similar images during training and then again during validation. Consequently, the model’s performance is artificially inflated because it has already been “trained” on images similar to those in the validation set, making the task easier than it should be.

Using the Validation Set for Early Stopping and Final Evaluation:

Another critical issue is the use of the validation set for both early stopping and final model evaluation. Early stopping is a technique where training is halted when the model’s performance on a validation set no longer improves, preventing overfitting. However, if this same validation set is later used to evaluate the model’s final performance, it can lead to overfitting on the validation data itself, resulting in an overly optimistic estimate of model accuracy.

Our Reproduction and Results

To demonstrate the impact of these data leakage issues, we reproduced the experiment with corrected methodologies:

Corrected Data Split: We ensured that there were no duplicate images between the training and validation sets. This setup is crucial to simulate a realistic scenario where the model encounters completely unseen data during validation.
Separate Validation and Test Sets: We introduced a distinct test set to evaluate the final model performance, independent of the data used for early stopping.
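One way to obtain such a split is to assign whole lesions, rather than individual images, to each partition, so that near-duplicate photos of the same lesion can never straddle two sets. The sketch below assumes each metadata row carries a lesion_id and an image_id; the function name, fractions, and seed are hypothetical, and this is not necessarily how the reproduction performed its split.

```python
import random
from collections import defaultdict


def split_by_lesion(records, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Split image ids into train/val/test so no lesion appears in two sets."""
    images_by_lesion = defaultdict(list)
    for record in records:
        images_by_lesion[record["lesion_id"]].append(record["image_id"])

    lesion_ids = sorted(images_by_lesion)
    random.Random(seed).shuffle(lesion_ids)  # reproducible shuffle

    n_val = round(len(lesion_ids) * val_fraction)
    n_test = round(len(lesion_ids) * test_fraction)
    groups = {
        "val": lesion_ids[:n_val],
        "test": lesion_ids[n_val:n_val + n_test],
        "train": lesion_ids[n_val + n_test:],
    }
    return {
        name: [img for lesion in lesions for img in images_by_lesion[lesion]]
        for name, lesions in groups.items()
    }


# Toy metadata: 20 lesions with two images each.
rows = [{"lesion_id": f"L{i // 2}", "image_id": f"I{i}"} for i in range(40)]
splits = split_by_lesion(rows, val_fraction=0.2, test_fraction=0.2)
print({name: len(images) for name, images in splits.items()})
```

Early stopping would then monitor the val split only, while the test split is touched once for the final reported accuracy.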

Results Comparison

	Original results	Our results
Accuracy	78.9%	78.6%
Number of epochs	Approx. 42 epochs	40 epochs
Training size	Unknown	7000 samples
Validation size	478 samples	478 samples
Confusion matrix

Analysis of the Results

While our reproduced accuracy of 78.6% is close to the original reported accuracy, it is based on a properly separated training and validation set, avoiding the data leakage pitfalls of the original paper. The slight drop in accuracy further highlights the overestimation of the original model’s performance due to data leakage.

Moreover, using a separate test set for final evaluation provides a more reliable measure of the model’s ability to generalize to new, unseen data. The confusion matrices show that our model’s performance is consistent across different lesion classes, confirming the robustness of the evaluation.

Conclusion

Data leakage is a common and often overlooked problem in applied machine learning, leading to misleading performance metrics and irreproducible results. By carefully examining and correcting these issues in our reproduction, we hope to provide a clearer understanding of the importance of proper data handling and validation practices.

It is crucial for researchers and practitioners to be vigilant about data leakage and ensure that their models are trained, validated, and tested under realistic conditions. This not only ensures the credibility of their results but also enhances the real-world applicability of their models.

Thank you for reading, and stay tuned for more insights on machine learning reproducibility!

Final blog: Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon

Thu, 22 Aug 2024 00:00:00 +0000

Introduction

Hello everyone,

I’m Archit from India, an undergraduate student at the Indian Institute of Technology, Banaras Hindu University (IIT BHU), Varanasi. As part of the Automatic Reproducibility of COMPSs Experiments through the Integration of RO-Crate in Chameleon project, my proposal, under the mentorship of Raül Sirvent, aims to develop a service that facilitates the automated replication of COMPSs experiments within the Chameleon infrastructure.

About the Project

The project proposes to create a service that can take a COMPSs crate (an artifact adhering to the RO-Crate specification) and, through analysis of the provided metadata, construct a Chameleon-compatible image for replicating the experiment on the testbed.

Final Product

The basic workflow of the COMPSs Reproducibility Service can be explained as follows:

The service takes the workflow path or link as the first argument from the user.
The program shifts the execution to a separate sub-directory, reproducibility_service_{timestamp}, to store the results from the reproducibility process.
Two main flags are required:
- Provenance flag: If you want to generate the provenance of the workflow via the runcompss runtime.
- New Dataset flag: If you want to reproduce the experiment with a new dataset instead of the one originally used.
If there are any remote datasets, they are fetched into the sub-directory.
The main work begins with parsing the metadata from ro-crate-metadata.json and verifying the files present inside the dataset, as well as any files downloaded as remote datasets. This step generates a status table for the user to check if any files are missing or have modified sizes.

The final step is to transform the compss-command-line.txt and all the paths specified inside it to match the local environment where the experiment will be reproduced. This includes:
- Mapping the paths from the old machine to new paths inside the RO-Crate.
- Changing the runtime to runcompss or enqueue_compss, depending on whether the environment is a SLURM cluster.
- Detecting if the paths specified in the command line are for results, and redirecting them to new results inside the reproducibility_service_{timestamp}\Results directory.
After this, the service prompts the user to add any additional flags to the final command. Upon final verification, the command is executed via Python’s subprocess pipe.
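The command transformation described above can be sketched as follows. This is an illustration under stated assumptions, not the service's actual code: the function names, the example paths, and the idea of detecting SLURM by looking for the sbatch executable are all hypothetical.

```python
import shlex
import shutil


def choose_runtime(on_slurm=None):
    """Pick the launcher; detecting SLURM via `sbatch` on PATH is an assumption."""
    if on_slurm is None:
        on_slurm = shutil.which("sbatch") is not None
    return "enqueue_compss" if on_slurm else "runcompss"


def rewrite_command(command_line, path_map, runtime):
    """Swap the launcher and remap old-machine paths to local ones.

    path_map maps path prefixes recorded on the original machine to their
    new locations; longer prefixes are tried first so nested paths win.
    """
    tokens = shlex.split(command_line)
    if not tokens:
        raise ValueError("empty command line")
    rewritten = [runtime]
    prefixes = sorted(path_map, key=len, reverse=True)
    for token in tokens[1:]:
        for old in prefixes:
            if old in token:
                token = token.replace(old, path_map[old])
                break  # remap each token at most once
        rewritten.append(token)
    return rewritten


recorded = "runcompss --lang=python /home/alice/app/main.py /home/alice/data/in.csv"
mapping = {
    "/home/alice/app": "crate/application_sources",
    "/home/alice/data": "crate/dataset",
}
print(rewrite_command(recorded, mapping, choose_runtime(on_slurm=False)))
```

Returning the command as a list of arguments means it can be handed to Python's subprocess module without going through a shell, which avoids quoting problems with paths that contain spaces.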

Logging System: All logs related to the Reproducibility Service are stored inside the reproducibility_service_{timestamp}\log.

You can view the basic pseudocode of the service.
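The metadata verification step that produces the status table can be illustrated with the short sketch below. It assumes that local files are described in the @graph of ro-crate-metadata.json as File entities with a contentSize; the function names and status strings are hypothetical and are not taken from the service.

```python
import json
import os
import tempfile


def verify_crate_files(crate_dir):
    """Return (file id, status) rows for every local File entity with a size."""
    metadata_path = os.path.join(crate_dir, "ro-crate-metadata.json")
    with open(metadata_path, encoding="utf-8") as handle:
        graph = json.load(handle)["@graph"]

    status = []
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        file_id = entity.get("@id", "")
        # Skip non-files, remote files (URLs), and entries without a size.
        if "File" not in types or "://" in file_id or "contentSize" not in entity:
            continue
        path = os.path.join(crate_dir, file_id)
        if not os.path.isfile(path):
            status.append((file_id, "missing"))
        elif os.path.getsize(path) != int(entity["contentSize"]):
            status.append((file_id, "size modified"))
        else:
            status.append((file_id, "ok"))
    return status


def demo():
    """Build a throwaway crate with one intact, one changed and one missing file."""
    with tempfile.TemporaryDirectory() as crate_dir:
        with open(os.path.join(crate_dir, "input.txt"), "w") as handle:
            handle.write("12345")
        with open(os.path.join(crate_dir, "changed.txt"), "w") as handle:
            handle.write("abc")
        graph = [
            {"@id": "./", "@type": "Dataset"},
            {"@id": "input.txt", "@type": "File", "contentSize": "5"},
            {"@id": "changed.txt", "@type": "File", "contentSize": 10},
            {"@id": "gone.txt", "@type": "File", "contentSize": 1},
        ]
        with open(os.path.join(crate_dir, "ro-crate-metadata.json"), "w") as handle:
            json.dump({"@graph": graph}, handle)
        return verify_crate_files(crate_dir)


for file_id, state in demo():
    print(f"{file_id:12} {state}")
```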

Conclusion and Future Work

It’s been a long journey since I started this project, and now it’s finally coming to an end. I have learned a lot from this experience, from weekly meetings with my mentor to working towards long-term goals—it has all been thrilling. I would like to thank the OSRE community and my mentor for providing me with this learning opportunity.

This is only version 1.0.0 of the Reproducibility Service. If I have time from my coursework, I would like to fix any bugs or improve the service further to meet user needs.

However, the following issues still exist with the service and can be improved upon:

Third-party software dependencies: Automatic detection and loading of these dependencies on a SLURM cluster are not yet implemented. Currently, these must be handled manually by the user.
Support for workflows with data_persistence = False: There is no support for workflows where all datasets are remote files.

Deliverables

Reproducibility Service Repository: This repository contains the main service along with guidelines on how to use it. The service will be integrated with the COMPSs official distribution in its next release.
Chameleon Appliance: This is a single-node appliance with COMPSs 3.3.1 installed, so that anyone with access to Chameleon can reproduce experiments.

Previous Blogs

Make sure to check out my other blogs to see how I started this project and the challenges I faced along the way:

Thank you for reading the blog, have a nice day!!

Final Blogpost: HDEval's LLM Benchmarking for HDL Design

Wed, 21 Aug 2024 00:00:00 +0000

Introduction

Hello everyone! I’m Ashwin Bardhwaj, an undergraduate student studying at UC Berkeley. As part of Micro Architecture Santa Cruz (MASC), my proposal, under the mentorship of Jose Renau and Sakshi Garg, looks to create a suite of benchmark programs for HDEval.

The goal of this project is to create large-scale Verilog programs in order to benchmark the capability of LLMs to develop HDL code. Throughout this project, I have created 3 large Verilog testbenches: 3-Stage-RISC_V processor, Gameboy Emulator, and Sorts. The benchmark programs will lose their effectiveness if LLMs such as ChatGPT scrape GitHub repositories and learn from them. As a result, the code itself cannot be made public, and this post instead covers the test report for all 3 of these projects.

3 Stage RISC V Processor

This is a pipelined RISC processor developed to handle RV32I instructions. A 3-stage processor will typically contain a Fetch, Decode, and Execute cycle. As a result, every instruction will take exactly 3 clock cycles. For this processor, instructions can be formatted into R, I (Load), S (Store), B (Cond), and J (Jump and Link) type instructions. Once a 32-bit instruction is fetched from the location in memory specified by the pc (Program Counter) register, it is sent to be decoded by the “decode unit”. By decoding an instruction, we can determine the exact operation code, the register locations of the 2 operands (rs1 and rs2), and the destination register (rd) to which the calculated result is written. After decoding, an activation flag is sent to the execution stage, which accesses the register file at addresses rs1 and rs2 to get the correct operand data. The data and operation are then sent to the ALU to compute the result based on the opcode. The result is then written back into the register file at the rd address, the program counter is incremented, and the next instruction is fetched.
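Since the Verilog itself is not public, here is a small Python model of the decode and execute steps for R-type instructions, intended only to illustrate the field extraction described above. The bit positions follow the standard RV32I R-type layout; the function names and the restriction to ADD and SUB are choices made for this sketch.

```python
MASK32 = 0xFFFFFFFF
OPCODE_OP = 0x33  # register-register (R-type) arithmetic


def decode(instr):
    """Slice a 32-bit instruction word into its R-type fields."""
    return {
        "opcode": instr & 0x7F,          # bits 6:0
        "rd": (instr >> 7) & 0x1F,       # bits 11:7
        "funct3": (instr >> 12) & 0x7,   # bits 14:12
        "rs1": (instr >> 15) & 0x1F,     # bits 19:15
        "rs2": (instr >> 20) & 0x1F,     # bits 24:20
        "funct7": (instr >> 25) & 0x7F,  # bits 31:25
    }


def execute_rtype(fields, regs, pc):
    """Read rs1/rs2, run the ALU, write rd, and return the next pc."""
    if fields["opcode"] != OPCODE_OP:
        raise NotImplementedError("only R-type ops are modeled here")
    a, b = regs[fields["rs1"]], regs[fields["rs2"]]
    operation = (fields["funct3"], fields["funct7"])
    if operation == (0x0, 0x00):    # ADD
        result = (a + b) & MASK32
    elif operation == (0x0, 0x20):  # SUB
        result = (a - b) & MASK32
    else:
        raise NotImplementedError("only ADD and SUB are modeled here")
    if fields["rd"] != 0:           # x0 is hard-wired to zero
        regs[fields["rd"]] = result
    return pc + 4


registers = [0] * 32
registers[1], registers[2] = 5, 7
next_pc = execute_rtype(decode(0x002081B3), registers, pc=0)  # add x3, x1, x2
print(registers[3], next_pc)
```

A software model like this is also a convenient reference when writing test cases for an LLM-generated decode or ALU module.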

The prompts for each module in this processor have been generated and tested against the GPT 3 turbo and GPT 4o models as an example. In the RISC V tab of my test report, I have provided the exact prompts and the results after running them on MASC’s HDLAgent tool, which can access the APIs of many LLMs.

Gameboy Emulator

The Gameboy Emulator is a Verilog implementation of the classic GameBoy console that was widely popular in the 1990s. The main aspects of the GameBoy that were focused on in this project were the Z-80 like CPU, memory objects like RAM, VRAM, and ROM, the PPU (Picture Processing Unit), and other peripherals. The instructions are given to the CISC (variable-length instructions) CPU, where they are decoded and executed based on the details and expectations of that specific instruction. In some cases, timing becomes a concern, and significant effort is made to ensure that instructions can be parsed and run predictably and effectively. Instructions from the ROM may take between 1 and 4 clock cycles to run depending on the requirements. For example, the instruction “LD B, HL”, which loads the data found at the 16-bit address given by registers H and L into register B, is a 2-cycle instruction. The first cycle decodes the HL address and fetches the data at that location, while the second cycle takes the new input data and writes it into register B. This requires accurate timing control between different aspects of the GameBoy.
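The two-cycle load described above can be pictured with a tiny Python model. It is a sketch of the timing idea only, not of the Verilog design: the class name, the data latch, and the cycle counter are hypothetical.

```python
class TinyCpu:
    """Minimal model of a two-cycle load from the address in HL into B."""

    def __init__(self):
        self.memory = bytearray(0x10000)  # 64 KiB address space
        self.regs = {"B": 0, "H": 0, "L": 0}
        self.latch = 0    # holds the byte fetched in the first cycle
        self.cycles = 0

    def load_b_from_hl(self):
        # Cycle 1: form the 16-bit address from H and L and fetch the byte.
        address = (self.regs["H"] << 8) | self.regs["L"]
        self.latch = self.memory[address]
        self.cycles += 1
        # Cycle 2: write the latched byte into register B.
        self.regs["B"] = self.latch
        self.cycles += 1


cpu = TinyCpu()
cpu.memory[0xC123] = 0x5A
cpu.regs["H"], cpu.regs["L"] = 0xC1, 0x23
cpu.load_b_from_hl()
print(hex(cpu.regs["B"]), cpu.cycles)
```

Splitting the fetch and the register write into separate steps mirrors why the hardware needs a holding register between the memory read and the write-back.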

The Picture Processing Unit is also an integral feature of the GameBoy. Three layers called Background, Window, and Sprite are combined into the classic GameBoy screens we know today. While the Background and Window data are consistently read from the VRAM after certain clock cycle times, the Sprite and sprite attributes are accessed using DMA (Direct Memory Access) from OAM (Object Attribute Memory). This reduces the CPU load and speeds up access to sprite data.

Deliverables

HDEval Test Report: The HDEval Test Report contains the module prompts for each testbench, the results after testing on GPT 3 turbo and 4o, and test cases to ensure code correctness and reliability.
HDEval Repo: HDEval contains the encrypted version of the yaml files that encapsulate the code, prompts, and additional data.

Next Steps

Given these benchmarks, it is important to track the ability of LLMs to generate HDL code. Therefore, beyond GPT 3-turbo and 4o, I would like these benchmarks to be applied to more models so that we can track their growth and stay informed about their effectiveness in HDL and hardware.

Previous Blogs

Please feel free to check out my previous blogs!

Thank you for reading!

Deriving Realistic Performance Benchmarks for Python Interpreters

Sat, 17 Aug 2024 00:00:00 +0000

Hi, I am Mrigank. I am one of the Summer of Reproducibility fellows for 2024, and I will be working on deriving realistic performance benchmarks for Python interpreters with Ben Greenman from the University of Utah.

Background and Motivation

Recent work by Meta on Static Python, a statically typed variant of Python, has shown immense promise in moving towards gradually typed languages without compromising on performance due to lack of complete soundness. Lu et al.¹ provide an evaluation of Static Python and conclude that the enhancement in performance reported by Meta on their web servers for Instagram is reasonable and is not just the result of refactoring. In fact, the study notes that very little refactoring is typically required for converting existing Python programs to Static Python. However, this study depends on a limited model of the language and does not represent real-world software applications.

In our project, we aim to create a realistic performance benchmark to reproduce performance improvements reported by Meta and to evaluate the performance of Static Python in real-world software applications. In addition, we will analyze partially-typed code to understand the performance implications of gradual typing in Python.

Key Objectives

We will use widely-used open-sourced applications to derive realistic performance benchmarks for evaluating Static Python. In particular, we will focus on projects that utilize the Python framework Django, which is also known to power the backend of Instagram. We plan to begin with Wagtail, a popular CMS built on Django. We have also identified other potential projects like Zulip, Plane and LibrePhotos. These are all actively maintained projects with significantly large codebases.

Further, we will analyze the performance of partially-typed code. This will be of value to the Python community as it will provide confidence in gradually moving towards Static Python for improving performance. We will make our benchmarks publicly available for the community to use, reproduce, and extend.

Methodology

Load Testing

For each project that we derive benchmarks from, we will design user pipelines that simulate real-world usage and implement them to create load tests using the open-sourced Locust framework. This will allow us to evaluate the performance of Static Python in real-world loads and scenarios. Locust can spawn thousands of users, each of which independently bombards the system with HTTP requests for a range of tasks that are defined in their user pipeline. We will host each project on a server (local or cloud) to run these load tests.

We will profile each project to ensure that our tests cover different parts of the codebase and to identify performance bottlenecks. We can then focus on these bottlenecks while gradually typing the codebase.

Gradual Typing

For typing the code in these projects, we will create two typed versions of each project: one with the so-called “shallow” type annotations and another with “advanced” type annotations. The former is relatively easy to implement, and we can use tools like MonkeyType to generate stubs that can be quickly verified manually. The latter is quite non-trivial and will require manual effort. We will then mix-and-match the three versions of each project (untyped, shallow, and advanced) to create different combinations of typed and untyped code. Note that this mix-and-match can be done at the module level and also at the function or class level.
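As a rough sketch of what module-level mix-and-match means, the snippet below enumerates every assignment of a typing level to each module. The module paths are hypothetical placeholders, and `enumerate_configurations` is a name invented for this illustration rather than part of our tooling.

```python
from itertools import product

# The three versions of each module that can be mixed.
LEVELS = ("untyped", "shallow", "advanced")


def enumerate_configurations(modules):
    """Yield every assignment of a typing level to each module."""
    for levels in product(LEVELS, repeat=len(modules)):
        yield dict(zip(modules, levels))


# Hypothetical module paths, used only for illustration.
modules = ["blog/models.py", "blog/views.py"]
configs = list(enumerate_configurations(modules))
print(len(configs))
for config in configs[:3]:
    print(config)
```

The number of configurations grows as 3 to the power of the number of modules, so for a large codebase only a sample of configurations can realistically be measured.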

Conclusion

This is my first time working on performance-benchmarking and I am excited to pick up new skills in the process. I am also looking forward to interacting with people from the Python community, people from Meta’s Static Python team, and also with the maintainers of the projects we will be working on. I will be posting more updates on this project as we make progress. Stay tuned!

Kuang-Chen Lu, Ben Greenman, Carl Meyer, Dino Viehland, Aniket Panse, and Shriram Krishnamurthi. Gradual soundness: Lessons from static python. The Art, Science, and Engineering of Programming. ↩︎

Midterm Report: Deriving Realistic Performance Benchmarks for Python Interpreters

Sat, 17 Aug 2024 00:00:00 +0000

Hi, I am Mrigank. As a Summer of Reproducibility 2024 fellow, I am working on deriving realistic performance benchmarks for Python interpreters with Ben Greenman from the University of Utah. In this post, I will provide an update on the progress we have made so far.

Creating a Performance Benchmark

We are currently focusing on applications built on top of Django, a widely used Python web framework. For our first benchmark, we chose Wagtail, a popular content management system. We created a pipeline with Locust to simulate real-world load on the application. All of our work is open-sourced and available on our GitHub repository.

This load-testing pipeline creates hundreds of users who independently create many blog posts on a Wagtail blog site. At the same time, thousands of users are spawned to view these blog posts. Wagtail does not have a built-in API and so it took some initial effort to figure out the endpoints to hit, which I did by inspecting the network logs in the browser while interacting with the Wagtail admin interface.

A snapshot from a run of the load test with Locust is shown in the featured image above. This snapshot was generated by spawning users from 24 parallel Locust processes. This was done on a local server, and we plan to perform the same experiments on CloudLab soon.

Profiling

On running the load tests with a profiler, we found that the bottlenecks in the performance arose not from the Wagtail codebase but from the Django codebase. In particular, we identified three modules in Django that consumed the most time during the load tests: django.db.backends.sqlite3._functions, django.utils.functional, and django.views.debug. Dibri, a graduate student in Ben’s lab, is helping us add types to these modules.
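For readers who want to try this kind of analysis, here is a minimal sketch that uses the standard library's `cProfile` and `pstats` to total the time spent per source file. It profiles a stand-in function (`busy_work`, invented for this example) instead of a Django server under load, so it only illustrates the aggregation idea, not our actual profiling setup.

```python
import cProfile
import pstats
from collections import defaultdict


def time_by_file(func, *args, **kwargs):
    """Profile one call to func and return (filename, own_time) pairs, slowest first."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()

    totals = defaultdict(float)
    stats = pstats.Stats(profiler)
    # stats.stats maps (filename, line, function name) to
    # (primitive calls, total calls, own time, cumulative time, callers).
    for (filename, _line, _name), (_cc, _nc, tottime, _ct, _callers) in stats.stats.items():
        totals[filename] += tottime
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def busy_work(n=20000):
    """Stand-in workload; the real target would be the application under load."""
    return sum(i * i for i in range(n))


ranking = time_by_file(busy_work)
for filename, seconds in ranking[:5]:
    print(f"{seconds:.6f}s  {filename}")
```

Grouping by file rather than by function gives a coarse view that is closer to the module-level ranking described above.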

Next Steps

Based on these findings, we are now working on typing these modules to see if we can improve the performance of the application by using Static Python. Typing Django is a non-trivial task, and while there have been some efforts to do so, previous attempts like django-stubs are incomplete for our purpose.

We are also writing scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python file, and run each mixed version several times to obtain a narrow confidence interval for the performance of each version.
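As an illustration of the repeated-runs idea, the following sketch computes a mean and an approximate 95% confidence interval from a handful of run times. It assumes a normal approximation with z = 1.96, which is only one of several reasonable choices, and the numbers in `runtimes` are made up for the example.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Uses the normal approximation; with few runs a t-distribution would be wider.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, z * standard_error


# Made-up run times in seconds for one mixed configuration.
runtimes = [12.1, 11.8, 12.4, 12.0, 11.9]
mean, half_width = mean_confidence_interval(runtimes)
print(f"{mean:.2f} s +/- {half_width:.2f} s")
```

Because the half-width shrinks with the square root of the number of runs, halving the interval requires roughly four times as many runs.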

We will be posting more updates as we make progress. Thank you for reading!

Final Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Fri, 16 Aug 2024 00:00:00 +0000

Background

Hello, I’m Lihaowen (Jayce) Zhu, a 2024 SoR contributor for the FEP-bench project, under the mentorship of Yuyang (Roy) Huang. Before we start, let’s recap the goal of our project and our progress up to the midterm. The FEP-Bench project proposes to address the significant bottlenecks encountered during the feature engineering and preprocessing phase of machine learning, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. To address these challenges, we have collected the basic information of various common datasets for different machine learning tasks, along with the corresponding preprocessing pipelines.

Methodology

Our goal is to improve the efficiency of the machine learning preprocessing pipeline and keep the training process of the Deep Learning model busy. This means that we need to increase the preprocessing throughput, which is the feed rate from the preprocessing stage to the training stage. Following some previous works, we look at Deep Learning preprocessing pipelines in a new way. The preprocessing pipeline can be split into 2 parts. The first part contains the steps that are run once (S1-Sm). We call it the “offline” part. The second part includes all the remaining steps, which are run at every iteration of training. We call it the “online” part. After the offline preprocessing steps, the output data is written back to disk. The online preprocessing steps then need to load that data from storage before applying the remaining operations. We can split the pipeline at any step, and each split is a preprocessing strategy. With this method, some specific strategies can achieve a much higher final preprocessing throughput. Our project adopts this method to profile the performance of different strategies. Our goal is to maximize the final preprocessing throughput into training for a specific pipeline. We want to make this an automatic process, rather than ask for extra user instructions or parameters.
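The set of strategies for a pipeline can be enumerated with a short sketch like the one below. The step names are placeholders and `split_strategies` is a name made up for this illustration; it shows only how the split points are generated, not how we measure throughput.

```python
def split_strategies(steps):
    """Return every (offline, online) split of an ordered list of steps.

    Split point k puts the first k steps offline (run once, results stored on disk)
    and the remaining steps online (run at every training iteration).
    """
    return [(steps[:k], steps[k:]) for k in range(len(steps) + 1)]


# Placeholder step names for a four-step pipeline.
pipeline = ["S1", "S2", "S3", "S4"]
strategies = split_strategies(pipeline)
for offline, online in strategies:
    print("offline:", offline, "| online:", online)
```

A pipeline with n steps has n + 1 strategies, ranging from fully online (nothing precomputed) to fully offline (everything precomputed and stored).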

Experiment

Next, we did the data preprocessing strategy experiment on the LibriSpeech dataset, which is an audio dataset for ML tasks like Auto Speech Recognition. The dataset size is 6.3 GB with almost 30000 samples. Each audio file is in a binary format FLAC. As a result, the first step of the preprocessing pipeline we use is decoding, which converts the binary data into arrays of floats. Then we applied some typical audio preprocessing steps of transformation (normalization, padding, extract loudest section) and augmentation (random cut, random shift audio, random mask, random add noise) to audio data. Finally, the audio data is converted to Log-Mel Spectrogram signal, which is commonly used in audio tasks like Speech Recognition and Speaker identification.

We have benchmarked the throughput performance and storage overhead of all possible strategy split points, and have seen some trade-offs between them. Both storage overhead and throughput speed-up use the fully online method as the baseline. What we’ve observed from our results is that the speed-up keeps increasing when we put operations into the offline part, and the storage consumption is very low for the strategies after audio decoding. Also, we analysed the performance of individual methods of transformation and augmentation steps. We find that the speed-up performance is quite stable between 1.0 and 1.2 across these methods, but some methods can have a high storage overhead, like normalization and random noise.

Another thing we observed during our experiments is that the dataset size can influence the preprocessing pipeline throughput. We found that the throughput speed-up for 10000 samples is almost double the speed-up for 5000 samples. It seems that a larger dataset size may lead to a higher speed-up. So we asked whether every operation follows this pattern, or whether only certain operations gain throughput with increasing dataset size, and ran experiments measuring the throughput speed-ups on different dataset sizes for all operations in the audio preprocessing pipeline. The results showed that only the audio decoding step shows a large increase in speed-up for larger dataset sizes. For transformation, augmentation and LMS, the throughputs always stay at a steady level. This indicates that only the audio decoding step becomes faster and faster as the dataset size grows.

Conclusion

In our work, we have built up a collection of common datasets and their preprocessing pipelines for different machine-learning tasks. For the audio dataset LibriSpeech, we have done experiments about the trade-offs between throughput speed-ups and storage overhead, and dataset sizes. We have found that speed-ups keep increasing when more and more operations are divided into the offline part. Only the audio decoding step can become faster and faster when the dataset size grows.

Future works

In the near future, we still want to find the optimal preprocessing strategy by profiling only a small part of the original enormous dataset. The second thing is that besides the audio dataset, we must expand the range of our experiments on other datasets and ML tasks. Finally, we need to implement our goal of building an automatic system that decides the optimal strategy of a preprocessing pipeline.

Final Blog: FSA - Benchmarking Fail-Slow Algorithms

Wed, 14 Aug 2024 00:00:00 +0000

Introduction

Hello! I hope you’re enjoying the summer as much as I am. I’m excited to join the SOR community as a 2024 contributor. My name is Xikang Song, and I’m thrilled to collaborate with mentors Ruidan Li and Kexin Pei on the FSA-Benchmark project. This project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. Throughout this journey, we tested a broad range of algorithms, from traditional approaches to state-of-the-art techniques, using a robust evaluation system to compare their effectiveness.

In the first half of the project, I focused on implementing and testing different machine learning models for detecting disks at high risk of fail-slow anomalies. This involved setting up initial models such as the Cost-Sensitive Ranking Model and Multi-Prediction Models, and beginning to explore LSTM networks for analyzing input disk data.

In the second half, I built upon this foundation by refining the evaluation processes, exploring advanced models like PatchTST, and investigating the potential of large language models (LLMs) for detecting subtle fail-slow conditions in storage systems. This blog post will summarize the key achievements, findings, and comparisons with baseline models from this phase.

Key Achievements

Comprehensive Benchmarking and Evaluation:
- I extended the benchmarking framework to evaluate multiple algorithms across 25 different data clusters on PERSEUS. This process involved generating and analyzing heatmaps that visualized the precision and recall of each model under various settings, providing a clear understanding of each approach’s strengths and limitations.
Exploration of Advanced Machine Learning Models:
- LSTM Model: I implemented the Long Short-Term Memory (LSTM) model, specifically designed for sequential data, to capture temporal dependencies in disk performance metrics. This model was used to predict potential fail-slow anomalies by analyzing historical data. Using Mean Squared Error (MSE) as a risk indicator, the LSTM model outperformed baseline approaches like the Cost-Sensitive Ranking Model and Multi-Prediction Models, especially in clusters where latency patterns between faulty and normal disks were distinct, such as in Cluster_P. This resulted in higher precision and fewer false positives. However, in clusters with more complex and overlapping data distributions, like Cluster_L, the LSTM model’s performance diminished, similar to that of the baseline models.
- PatchTST Model: I also introduced and evaluated the PatchTST model, which is built on a transformer-based architecture known for its ability to handle sequential data by capturing long-range dependencies and intricate temporal patterns. Unlike traditional models, PatchTST processes time series data in segments or “patches,” enhancing its ability to predict disk behavior over extended periods. Like the LSTM model, PatchTST uses outlier MSE values to assess disk risk. In clusters with a clear separation between faulty and normal disks, PatchTST outperformed baseline models by effectively identifying faulty patterns. However, similar to the LSTM model, PatchTST encountered difficulties in clusters with significant data overlap, such as Cluster_L.
Investigation into Large Language Models (LLMs):
- I explored the use of GPT-4o-mini for fail-slow detection. While large language models (LLMs) showed potential, particularly in reducing false positives and improving precision over baseline models, they did not consistently outperform specialized models like LSTM and PatchTST in this context. LLMs struggled with recall, especially as thresholds increased, revealing the challenges of adapting LLMs to time series data. This limitation arises because LLMs are primarily trained for natural language generation tasks, not for analyzing time series data. As a result, their ability to fully capture anomalies is limited. To improve their effectiveness, we need to develop methods that help LLMs better understand time series data. For example, incorporating statistical information about each disk’s performance could enhance LLMs’ understanding, leading to better precision in fail-slow detection.
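To show how an MSE-based risk score and the precision and recall metrics fit together, here is a small self-contained sketch. It is not the benchmark's code: the disk names, latency values, and the threshold are invented, and a fixed threshold stands in for whatever outlier rule a real model would use.

```python
def mse(predicted, actual):
    """Mean squared error between a predicted and an observed series."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def flag_disks(mse_by_disk, threshold):
    """Flag disks whose prediction error exceeds the threshold."""
    return {disk for disk, score in mse_by_disk.items() if score > threshold}


def precision_recall(flagged, truly_faulty):
    """Precision and recall of the flagged set against the labelled faulty set."""
    true_positives = len(flagged & truly_faulty)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_faulty) if truly_faulty else 0.0
    return precision, recall


# Invented latency series: the model predicts 1.0 everywhere.
predicted = [1.0, 1.0, 1.0]
observed = {
    "disk_a": [1.0, 1.1, 0.9],  # close to the prediction
    "disk_b": [3.0, 4.0, 5.0],  # far from the prediction, labelled faulty
    "disk_c": [2.0, 2.0, 2.0],  # moderately off, but labelled healthy
}
scores = {disk: mse(predicted, series) for disk, series in observed.items()}
flagged = flag_disks(scores, threshold=0.5)
precision, recall = precision_recall(flagged, truly_faulty={"disk_b"})
print(sorted(flagged), precision, recall)
```

In this toy data the healthy but noisy disk_c is flagged as well, which is the kind of overlap between faulty and normal behavior that lowers precision in clusters like Cluster_L.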

Conclusion and Future Work

The work in this project demonstrated that while advanced machine learning models like LSTM and PatchTST offer significant potential for detecting fail-slow conditions, challenges remain in ensuring consistent performance across diverse clusters. Compared to baseline models, these advanced approaches generally provided better precision and recall, especially in clusters with distinct data patterns between faulty and normal disk performance time series. However, the persistent difficulties in more complex clusters indicate the need for further refinement.

Moving forward, future work will focus on refining these models, particularly in improving their performance in challenging clusters like Cluster_L. Additionally, I plan to further explore techniques such as prompt engineering for LLMs to better tailor them for time series analysis and fail-slow detection tasks.

Deliverables

Repository: All comprehensive analysis code and source code can be found in the FSA_BENCHMARK GitHub Repository.
Jupyter Notebook: A notebook to reproduce the experiments and benchmarks on Chameleon: Chameleon Experiment Notebook.
Final Report: Comprehensive algorithm performance evaluation for all methods in FSA-Benchmarking Final Report.

Data Leakage in Applied ML

Tue, 13 Aug 2024 00:00:00 +0000

Hello everyone!

I have been working on reproducing the results from Characterization of Term and Preterm Deliveries using Electrohysterograms Signatures. This paper aims to predict preterm birth using a Support Vector Machine with an RBF kernel. However, there is a major flaw in the methodology: preprocessing on the training and test sets together. This happens when preprocessing is performed on the entire dataset before splitting it into training and test sets.

Reproducing the published results came with its own challenges, including updating EHG-Oversampling to extract meaningful features from EHG signals and finding optimal hyperparameters for the model. Through our work on reproducing the published results and creating toy example notebooks, we have been able to demonstrate that data leakage leads to overly optimistic measures of model performance and that models trained with data leakage fail to generalize to real-world data. In such cases, performance on the test set doesn’t translate to performance in the real world.
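Here is a toy illustration of how preprocessing before the split can leak information. It assumes the preprocessing step is a naive oversampling that duplicates minority-class rows, which is a simplification chosen for this example and not the paper's exact pipeline; all function names and data are made up, and the split is a fixed stride instead of a random shuffle so that the result is deterministic.

```python
def oversample(rows, minority_label, factor):
    """Naive oversampling: repeat each minority-class row `factor` times."""
    result = []
    for features, label in rows:
        copies = factor if label == minority_label else 1
        result.extend([(features, label)] * copies)
    return result


def stride_split(rows, stride=4):
    """Deterministic split: every `stride`-th row goes to the test set."""
    train = [row for i, row in enumerate(rows) if i % stride != stride - 1]
    test = [row for i, row in enumerate(rows) if i % stride == stride - 1]
    return train, test


def leaked_test_rows(train, test):
    """Count test rows that also appear verbatim in the training set."""
    train_rows = set(train)
    return sum(1 for row in test if row in train_rows)


# 40 majority rows and 5 minority rows, all with distinct features.
data = [((i,), "term") for i in range(40)] + [((100 + i,), "preterm") for i in range(5)]

# Leaky order: oversample the whole dataset, then split.
leaky_train, leaky_test = stride_split(oversample(data, "preterm", 8))
leaky_count = leaked_test_rows(leaky_train, leaky_test)

# Correct order: split first, then oversample only the training set.
clean_train, clean_test = stride_split(data)
clean_train = oversample(clean_train, "preterm", 8)
clean_count = leaked_test_rows(clean_train, clean_test)

print(leaky_count, clean_count)
```

In the leaky order, copies of the same minority sample end up on both sides of the split, so the model is tested on rows it has effectively already seen. Splitting first keeps the test set untouched by anything derived from the training data.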

Next, I’ll be reproducing the results published in Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches.

You can follow my work on the EHG paper here.

Stay tuned for more insights on data leakage and updates on our progress!

Midterm Check-In: Progress on the AutoAppendix Project

Sat, 03 Aug 2024 00:00:00 +0000

Hi all,

I’m happy to share a quick update on the AutoAppendix project as we’re about halfway through. We’ve made some steady progress on evaluating artifacts from SC24 papers, and we’re starting to think about how we can use what we’ve learned to improve the artifact evaluation process in the future.

What We’ve Been Up To

As a quick reminder, the goal of our project is to develop a set of guidelines that researchers can use to improve the reproducibility of their work. We’re focusing on papers from the Supercomputing Conference 2024 that applied for an “Artifact Replicable” badge, and we’re evaluating their artifacts to see how well the experiments can be replicated. As it was difficult to make assumptions about the exact outcomes of the project besides detailed experiment recreation, our main goal of this midterm check-in is to share what insights we have gathered so far and to set the stage for the final outcomes.

Our main task so far has been making a selection of submissions with experiments designed for Chameleon Cloud, or those that could be easily adapted to run on Chameleon. As there were 45 submissions that applied for an “Artifact Replicable” badge, it was not easy to choose which ones to evaluate, but we managed to narrow it down to 18 papers that we thought would be a good fit for our project.

We’ve chosen to focus on papers that do not require special hardware (like a specific supercomputer) or complex network setups, as it would be difficult to generalize the insights from these kinds of experiments. Instead, we’ve been looking at those that require only a single computation node, and could theoretically be run with the available hardware on Chameleon.

Observations and Learning Points

At the moment, we’re about halfway through the evaluation process. So far, we’ve noticed a range of approaches to documenting and setting up computational experiments. Even without looking at the appendices in detail, it’s clear that there’s a lot of room for standardization of the documentation format and software setup, which could make life easier for everyone involved. This particularly applies to software setups, which are often daunting to replicate, especially when there are specific version requirements, version incompatibilities or outright missing dependencies. Since the main goal of this project is to develop a set of guidelines that researchers can use to improve the reproducibility of their work, suggesting a way to deal with software versions and dependencies will be a key part of our results.

We’ve observed that submissions with well-structured and detailed appendices tend to fare better in reproducibility checks. This includes those that utilized containerization solutions like Docker, which encapsulate the computing environment needed to run the experiments and thus eliminate the need for installing specific software packages. It’s these kinds of practices that we think could be encouraged more broadly.

Looking Ahead

The next steps are pretty exciting! We’re planning to use what we’ve learned to draft some guidelines that could help future SC conference submissions be more consistent. This might include templates or checklists that ensure all the necessary details are covered.

Additionally, we’re thinking about ways to automate some parts of the artifact evaluation process. The goal here is to make it less labor-intensive and more objective. A particularly nice route to reproducible artifact evaluation is Chameleon’s JupyterHub interface, which in combination with the Trovi artifact sharing platform makes it easy to share artifacts and lets interested parties reproduce the experiments with minimal effort. We are thus looking into ways to utilize and contribute to these tools in a way that could benefit the broader research community.

Wrapping Up

That’s it for now! We are working towards getting as many insights as possible from the rest of the artifact evaluations, and hopefully, by the end of this project, we’ll have some solid recommendations and tools to show for it. Thanks for keeping up with our progress, and I’ll be back with more updates as we move into the final stages of our work.

[MidTerm] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Thu, 01 Aug 2024 00:00:00 +0000

Hey there, scalability enthusiasts and fellow researchers! I’m excited to share my progress on the ScaleRep project for SoR 2024 under the mentorship of Bogdan "Bo" Stoica and Yang Wang. Here’s a glimpse into how we’re tackling scalability bugs in large-scale distributed systems.

Project Overview

Large-scale distributed systems are the backbone of modern computing, powering various applications and services. However, these systems often face challenges related to reliability and performance, particularly scalability bugs. These bugs manifest in large-scale deployments, causing issues such as system downtime, reduced responsiveness, and data loss. Traditional bug-finding methods fall short in detecting these bugs, which are triggered by factors like component count, system load, workload size, recovery protocol reliability, and intermediate failure magnitude.

Our project, ScaleRep, aims to address these challenges by analyzing recent scalability issues from ten popular open-source large-scale systems. We are providing detailed accounts of bug reproduction experiences, identifying common challenges, and developing protocols for triggering and quantifying the impact of scalability bugs.

Progress Highlights

So far, I have been working on the following bugs and have successfully uploaded some of them to Trovi. Here’s a brief overview of my progress:

Bugs Worked On:

IGNITE-20614: Uploaded to Trovi Trovi Link
IGNITE-17407: Uploaded to Trovi Trovi Link
IGNITE-20692
IGNITE-16600
IGNITE-16072

What is Chameleon and Trovi?

Chameleon is a configurable experimental environment for large-scale cloud research. It provides a platform for running and testing distributed systems at scale, allowing researchers to reproduce and study scalability issues in a controlled setting.

Trovi is a platform that facilitates the sharing of reproducible artifacts. By uploading our bug reproduction artifacts to Trovi, we enable other researchers to easily reproduce scalability bugs, fostering collaboration and advancing the field of distributed systems research.

Short Description of the Bugs

IGNITE-20614 This bug refers to an issue where the Ignite service grid experiences degradation or hangs under specific conditions related to service deployment and node restarts.

Root Causes: The root cause is a race condition during the deployment and undeployment of services in the service grid, particularly when nodes are restarted or when there is a significant amount of concurrent service deployment and undeployment activity.

Impact: The impact of this bug includes potential service grid hangs, degraded performance, and possible inability to deploy or undeploy services as expected, which can disrupt the overall operation of the Ignite cluster.

Fix: The fix involves adding proper synchronization mechanisms to handle concurrent service deployment and undeployment operations more gracefully, ensuring that race conditions are avoided.

IGNITE-17407 This issue pertains to the incorrect behavior of the Ignite thin client protocol, particularly when dealing with binary objects and schema changes.

Root Causes: The root cause lies in the way the thin client handles binary object schema changes. The thin client was not correctly updating the schema cache, leading to inconsistencies and incorrect behavior when deserializing binary objects.

Impact: Users of the thin client may experience issues with binary object deserialization, leading to potential data corruption, incorrect query results, and overall application instability.

Fix: The fix involves updating the thin client protocol to properly handle schema changes by ensuring that the schema cache is correctly updated and synchronized with the server.

IGNITE-20692 This bug is related to the performance degradation observed in the Ignite SQL engine when executing certain complex queries.

Root Causes: The root cause is identified as inefficient query planning and execution strategies for specific types of complex SQL queries, leading to excessive resource consumption and slow query performance.

Impact: Users running complex SQL queries may experience significant performance degradation, leading to slower response times, increased CPU and memory usage, and potentially impacting the overall performance of the Ignite cluster.

Fix: The fix involves optimizing the SQL query planner and executor to handle complex queries more efficiently, including better indexing strategies, improved query plan caching, and more effective resource management during query execution.

IGNITE-16600 This bug involves an issue with speed-based throttling in the checkpoint process, leading to possible starvation of the checkpoint thread under heavy load.

Root Causes: The root cause is the absence of proper mechanisms to wake up throttled threads when they no longer need to be throttled, resulting in unnecessary waiting and potential starvation of the checkpoint thread.

Impact: Under heavy load, the checkpoint process can be significantly delayed, leading to slower checkpoint completion times, increased risk of data loss, and overall degraded performance of the Ignite cluster.

Fix: The fix includes implementing methods to wake up throttled threads when they no longer need to be throttled (tryWakeupThrottledThreads and shouldThrottle), ensuring that the checkpoint process can proceed without unnecessary delays.

IGNITE-16072 This issue pertains to the incorrect handling of SQL queries involving NULL values in the Ignite SQL engine, leading to unexpected query results.

Root Causes: The root cause is an incorrect implementation of SQL semantics for handling NULL values in certain query conditions, particularly in the presence of complex joins and subqueries.

Impact: Users may experience incorrect query results when NULL values are involved, leading to potential data inconsistencies and incorrect application behavior.

Fix: The fix involves correcting the SQL engine’s implementation to properly handle NULL values according to the SQL standard, ensuring that queries involving NULL values produce the expected results.

What’s Next?

Continued Bug Reproduction:

Focus on reproducing more scalability bugs

Documentation of Challenges:

Break down specific challenges encountered during attempts to reproduce scalability bugs.
Categorize challenges, including technical complexities, environmental dependencies, and lack of documentation in bug reports.

Finalizing Project Deliverables:

Package artifacts using Jupyter notebook scripts for convenient replay of investigation steps.
Upload the package to Trovi for replayable artifacts, enabling other researchers to easily reproduce scalability bugs for our benchmark applications.

Conclusion

The ScaleRep project has made significant strides in reproducing and benchmarking scalability bugs in large-scale distributed systems. By successfully reproducing and documenting scalability bugs, we are contributing valuable insights to the research community, aiding in the development of more robust distributed systems. The protocols and methodologies devised in this project will serve as valuable tools for researchers exploring similar issues.

Stay tuned for more updates as we continue to tackle scalability bugs and improve the reliability and performance of large-scale distributed systems.

Mid-term Blog: Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon

Mon, 29 Jul 2024 00:00:00 +0000

Introduction

Hello everyone, I am Archit from India, an undergraduate student at the Indian Institute of Technology, Banaras Hindu University, IIT (BHU), Varanasi. As part of the Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon project, my proposal, under the mentorship of Raül Sirvent, aims to develop a service that facilitates the automated replication of COMPSs experiments within the Chameleon infrastructure.

About the project:

The project proposes to create a service that can take a COMPSs crate (an artifact adhering to the RO-Crate specification) and, through analysis of the provided metadata, construct a Chameleon-compatible image for replicating the experiment on the testbed.

Progress

It has been more than six weeks since the ReproducibilityService project began, and significant progress has been made. You can test the actual service from my GitHub repository: ReproducibilityService. Let’s break down what the ReproducibilityService is capable of doing now:

Support for Reproducing Basic COMPSs Experiments: The RS program is now fully capable of reproducing basic COMPSs experiments with no third-party dependencies on any device with the COMPSs Runtime installed. Here’s how it works:
- Getting the Crate: The RS program can accept the COMPSs workflow from the user either as a path to the crate or as a link from WorkflowHub. In either case, it creates a sub-directory for further execution named reproducibility_service_{timestamp} and stores the workflow as reproducibility_service_{timestamp}/Workflow.
- Address Mapping: The ro-crate contains compss_submission_command_line.txt, which is the command originally used to execute the experiment. This command may include many paths such as runcompss flag1 flag2 ... flagn <main_workflow_file.py> input1 input2 ... inputn output. The RS program maps all the paths for <main_workflow_file.py> input1 input2 ... inputn output to paths inside the machine where we want to reproduce the experiment. The flags are dropped as they may be device-specific, and the service asks the user for any new flags they want to add to the COMPSs runtime.
- Verifying Files: Before reproducing an experiment, it’s crucial to check whether the inputs or outputs have been tampered with. The RS program cross-verifies the contentSize from the ro-crate-metadata.json and generates warnings in case of any abnormalities.
- Error Logging: In case of any problems during execution, the std_out and std_err are stored inside reproducibility_service_{timestamp}/log.
- Results: If the experiment generates any results, the RS program stores them inside reproducibility_service_{timestamp}/Results. If provenance of the workflow is also requested, the newly generated ro-crate is stored there as well.
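The file verification step above can be sketched as follows. This is a minimal illustration, not the ReproducibilityService code: it assumes each file entity in ro-crate-metadata.json sits in the top-level "@graph" list with an "@id" relative to the crate root and a contentSize in bytes, and the function name find_size_mismatches is hypothetical.

```python
import json
from pathlib import Path


def find_size_mismatches(crate_dir):
    """Return (path, expected, actual) tuples for files whose size differs.

    `actual` is None when a file listed in the metadata is missing on disk.
    """
    crate = Path(crate_dir)
    metadata = json.loads((crate / "ro-crate-metadata.json").read_text())
    mismatches = []
    for entity in metadata.get("@graph", []):
        if "contentSize" not in entity:
            continue  # entities without a recorded size are not checked
        if "://" in entity["@id"]:
            continue  # remote entities are skipped in this sketch
        expected = int(entity["contentSize"])
        target = crate / entity["@id"]
        actual = target.stat().st_size if target.is_file() else None
        if actual != expected:
            mismatches.append((entity["@id"], expected, actual))
    return mismatches
```

A caller could print one warning per returned tuple and still let the user decide whether to continue with the run.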

Support for Reproducing Remote Datasets: If a remote dataset is specified inside the metadata file, the RS program fetches the dataset from the specified link using wget, stores the remote dataset inside the crate, and updates the path in the new command line it generates.

Challenges and End-Term Goals

Support for DATA_PERSISTENCE_FALSE: The RS program still needs to support crates with dataPersistence set to false. After weeks of brainstorming how to implement this, we recently concluded that, since the majority of DATA_PERSISTENCE_FALSE crates are run on SLURM clusters and the dataset to be fetched in such cases resides inside the cluster, the RS program will support this case for such clusters. Currently, I am working with the Nord3v2 cluster to further enhance the functionality of ReproducibilityService.
Chameleon Cluster Setup: I have made some progress towards creating a new COMPSs 3.3 Appliance on Chameleon to test the service. However, creating the cluster setup script needed for the service to run on a COMPSs 3.3.1 cluster to execute large experiments has been challenging.
Integrating with COMPSs Repository: After completing the support for dataPersistence false cases, we aim to launch this service as a tool inside the COMPSs repository. This will be a significant milestone in my developer journey as it will be the first real-world project I have worked on, and I hope everything goes smoothly.

Stay tuned for the next blog!!

Final Blog: FetchPipe: Data Science Pipeline for ML-based Prefetching

Sat, 27 Jul 2024 00:00:00 +0000

Introduction

Hello, I’m Peiran Qin, a CS student at the University of Chicago. This summer I worked on the project FetchPipe: Data Science Pipeline for ML-based Prefetching under the mentorship of Prof. Haryadi S. Gunawi. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction policies and ML-based prefetchers under this simulator. Through this project, we made the following contributions and gained several insights that we can share with the community:

We built a simulator to evaluate various prefetchers under a unified framework, using production-level traces from Alibaba, Microsoft Research, and Tencent.
Through the evaluation, we discovered several downsides that existing heuristic-based prefetchers encounter.
We drew several insights that can guide future prefetcher design.

Methodology

In the first half of the SoR project, I mainly focused on building the I/O prefetcher simulator. The simulator should mimic real OS-level prefetching as closely as possible. First, we developed a mechanism that mimics users sending I/O requests to the underlying system. Then, we simulated page division and memory management inside the system. Finally, we designed a sleep-based mechanism to mimic the I/O latency of backend storage. The resulting system simulates the data path of I/O requests and prefetching in real systems and collects crucial metrics such as hit rate, total prefetched data, bandwidth usage, prefetch accuracy, and total cache evictions.
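To make the data path described above concrete, here is a toy sketch of such a simulator. It is not the FetchPipe code: the class name PrefetchSimulator, the choice of LRU eviction, and the metric names are assumptions made for illustration, and the sleep-based latency defaults to zero so the example runs instantly.

```python
import time
from collections import OrderedDict


class PrefetchSimulator:
    """Toy page cache with LRU eviction and a pluggable prefetcher."""

    def __init__(self, capacity, prefetcher, backend_latency_s=0.0):
        self.capacity = capacity
        self.prefetcher = prefetcher          # callable: page -> iterable of pages
        self.backend_latency_s = backend_latency_s
        self.cache = OrderedDict()            # page -> True if prefetched, not yet used
        self.hits = self.misses = 0
        self.prefetched = self.useful_prefetches = self.evictions = 0

    def _fetch(self, page, prefetched):
        time.sleep(self.backend_latency_s)    # sleep stands in for storage latency
        self.cache[page] = prefetched
        self.cache.move_to_end(page)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used page
            self.evictions += 1

    def access(self, page):
        if page in self.cache:
            self.hits += 1
            if self.cache[page]:              # first use of a prefetched page
                self.useful_prefetches += 1
                self.cache[page] = False
            self.cache.move_to_end(page)
        else:
            self.misses += 1
            self._fetch(page, prefetched=False)
        for candidate in self.prefetcher(page):
            if candidate not in self.cache:
                self.prefetched += 1
                self._fetch(candidate, prefetched=True)

    def metrics(self):
        total = self.hits + self.misses
        return {
            "hit_rate": self.hits / total if total else 0.0,
            "prefetch_accuracy": (self.useful_prefetches / self.prefetched
                                  if self.prefetched else 0.0),
            "total_prefetched": self.prefetched,
            "evictions": self.evictions,
        }
```

For example, passing `lambda page: [page + 1]` as the prefetcher gives a next-page policy; replaying a trace is then a loop over `sim.access(page)` followed by `sim.metrics()`.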

In the second half of the SoR project, I concentrated on evaluating existing prefetchers. First, we surveyed the existing state-of-the-art prefetchers and divided them into two categories: (1) heuristic-based prefetchers and (2) ML-based prefetchers. Next, for each category, we picked several representative prefetchers and implemented them within our simulator. Then, we evaluated those prefetchers using over 600 production-level traces from Alibaba, Tencent, and Microsoft Research. Finally, we analyzed the performance of those prefetchers and discovered some interesting insights that might guide future prefetcher design.

Finally, building on the achievements of the SoR project, I will continue working on this project with Prof. Haryadi S. Gunawi. We are leveraging the insights gained so far to build an I/O prefetcher that mitigates the downsides of existing prefetchers.

Insights

Based on our experiments on the existing prefetchers, we would like to share the following insights:

Heuristic-based prefetchers, including Linux Readahead and the Stride prefetcher, rely on strict pre-defined rules and detect straightforward access patterns. However, those prefetchers are too conservative to recognize increasingly complex access patterns. In real-world applications in particular, sequential accesses are interleaved with random accesses, adding a level of complexity that Linux Readahead and Stride prefetchers struggle to recognize.
Offline learning-based prefetchers learn the access patterns by training machine learning models on pre-collected historical access patterns. Thanks to the representational power of machine learning, these prefetchers excel at recognizing complex access patterns. However, their effectiveness is constrained by their dependence on the patterns encountered during offline training, making them less adaptable to previously unseen patterns in online scenarios. Moreover, because they do not rely on pre-defined prefetching rules, offline learning-based prefetchers are more prone to prefetching useless data, which causes cache pollution and extra pressure on backend storage.
We argue that a good prefetcher for today’s complex and changing workloads should have three properties: (1) Complexity recognition: the prefetcher should be able to recognize the complex access patterns of a complex workload. (2) Reliability: the prefetcher should reduce the likelihood of prefetching useless data and causing cache pollution. (3) Adaptability: the prefetcher should adapt itself to the changing workload.
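The first insight can be illustrated with a small stride detector. This is a simplified sketch and not the Stride prefetcher evaluated in the project: it assumes a single global stream and issues a prediction only after seeing the same non-zero stride twice in a row. The names StridePrefetcher and count_predictions are hypothetical.

```python
class StridePrefetcher:
    """Predicts the next address once the same stride is seen twice in a row."""

    def __init__(self, degree=1):
        self.degree = degree        # how many strides ahead to prefetch
        self.last_addr = None
        self.last_stride = None

    def observe(self, addr):
        """Record an access and return the list of addresses to prefetch."""
        predictions = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                predictions = [addr + stride * i
                               for i in range(1, self.degree + 1)]
            self.last_stride = stride
        self.last_addr = addr
        return predictions


def count_predictions(trace):
    """Count how many prefetch candidates a fresh detector issues for a trace."""
    prefetcher = StridePrefetcher()
    return sum(len(prefetcher.observe(addr)) for addr in trace)
```

On a purely sequential trace such as 0, 4, 8, 12, 16 the detector locks on after two accesses. Once unrelated addresses are interleaved between the same sequential accesses, no two consecutive strides match and it issues nothing, which is the conservative behavior described above.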

Future Work

Based on the above insights, we are now designing our own prefetchers that can mitigate the downsides of existing prefetchers. We will make our code public after we finalize our design.

Conclusion

Through the SoR project, I delved into the research area of I/O prefetching by reproducing the related works, characterizing their performance, and designing our own prefetcher. We contribute to the community a comprehensive simulator, evaluation results for related prefetchers, and insights that can guide future prefetcher design. In the future, I will continue working on prefetching research and keep making contributions.

Mid Term Blog: FetchPipe: Data Science Pipeline for ML-based Prefetching

Sat, 27 Jul 2024 00:00:00 +0000

Introduction

Hello, I’m Peiran Qin, a CS student at the University of Chicago, currently working on the project FetchPipe: Data Science Pipeline for ML-based Prefetching under the mentorship of Prof. Haryadi S. Gunawi. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction algorithms and ML-based prefetchers under this simulator.

Motivation

Existing prefetching algorithms can be categorized into (a) heuristic-based methods such as the Linux readahead prefetcher and (b) machine learning-based methods like Long Short-Term Memory (LSTM) models. However, there is a research gap in comprehensively comparing all existing ML solutions, such as Leap and LSTM Prefetcher, under a consistent evaluation setup. To ensure the fairness of evaluations, it is essential to integrate all baselines and our prefetcher into a homogeneous evaluation environment. Additionally, there is a need to evaluate cache eviction algorithms under prefetching scenarios.

Therefore, in this project, we aim to build a fair simulator, deploy state-of-the-art prefetchers and cache eviction algorithms onto this platform, and then evaluate them using comprehensive metrics. The state-of-the-art prefetchers we consider include Pythia (MICRO'21), SGDP (arXiv), and the Markov-Chain prefetcher. For cache eviction algorithms, we consider S3FIFO (SOSP'23) and SIEVE (NSDI'24). Our focus is on implementing these algorithms on our simulator and evaluating their performance using block storage datasets from Alibaba, Tencent, and MSR. Besides evaluating the prefetchers and eviction algorithms individually, we also aim to combine prefetchers with cache eviction algorithms to test overall performance.

Current Progress

In the past one and a half months, I have focused on (1) implementing our Python simulator and (2) deploying state-of-the-art prefetchers and cache eviction algorithms on this simulator. The implementation phase is now complete. The detailed progress is as follows:

The Python simulator for evaluating ML-based and heuristic-based prefetchers and cache eviction algorithms is done.
Collection of evaluation metrics, such as hit rate, total prefetched data, prefetch overhead, and prefetch accuracy, is implemented in the simulator.
Two ML-based prefetchers, SGDP and Pythia, and the Markov-Chain prefetcher are deployed on the simulator. SGDP is a graph neural network based prefetcher, and Pythia is a reinforcement learning based prefetcher.
State-of-the-art heuristic based eviction algorithms are implemented in the simulator, including S3FIFO and SIEVE.
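As a rough illustration of the simplest of the prefetchers listed above, a first-order Markov-chain prefetcher can be sketched as below. This is a textbook-style version under my own assumptions (one transition table keyed by block address, predicting the single most frequent successor), not the implementation deployed in the simulator.

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """First-order Markov model over block addresses."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # block -> Counter of successors
        self.previous = None

    def observe(self, block):
        """Update transition counts, then return the predicted next block or None."""
        if self.previous is not None:
            self.transitions[self.previous][block] += 1
        self.previous = block
        successors = self.transitions.get(block)
        if not successors:
            return None
        return successors.most_common(1)[0][0]
```

The model predicts nothing until it has seen a block at least once before, so on a repeating pattern it starts producing useful candidates from the second pass onward.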

With the simulator and state-of-the-art ML-based prefetchers and eviction algorithms in place, the next steps are to (1) organize a large-scale dataset (including over 600 traces from real storage servers) for testing performance and (2) evaluate the implemented prefetchers and eviction algorithms on this dataset. Finally, I will analyze the evaluation results and provide insights from the experimental outcomes. For the ML-based prefetchers, I will analyze both ML-related metrics such as accuracy and F1-score, and system metrics such as hit rate and various overheads.

Challenges

The biggest challenge is implementing existing prefetchers correctly and fairly. Since some state-of-the-art prefetchers are designed for DRAM prefetching, adapting them for SSD prefetching in the simulator is challenging. Additionally, the lack of source code for some works makes it difficult to reproduce their algorithms accurately based solely on their paper descriptions.

Improving Usability and Performance in cc-snapshot: My Midterm Update

Wed, 24 Jul 2024 00:00:00 +0000

Hi! I’m Zahra Temori, a rising junior studying Computer Science at the University of Delaware. This summer, I’ve had the exciting opportunity to participate in the Chameleon Summer Reproducibility Program, where I’ve been working under the mentorship of Paul Marshall. In this blog post, I’d love to share a midterm update on my project cc-snapshot and highlight what I’ve accomplished so far, what I’ve learned, and what’s coming next. It’s been a challenging but rewarding experience diving into real-world research and contributing to tools that help make science more reproducible!

Project Overview

CC-Snapshot is a powerful tool on the Chameleon testbed that enables users to package their customized environments for reproducibility and experiment replication. In research, reproducibility is essential. It allows scientists to run experiments consistently, share complete setups with others, and avoid environment-related errors. However, the current snapshotting mechanism has limitations that make it unreliable and inefficient, particularly in terms of usability and performance. These issues can slow down workflows and create barriers for users trying to reproduce results. Our goal is to improve both the usability and performance of the cc-snapshot tool. A more user-friendly and optimized system means that users can create and restore snapshots more quickly and easily, without needing to manually rebuild environments, ultimately saving time and improving reliability in scientific computing.

Progress So Far

To structure the work, we divided the project into two main phases:

Improving usability, and
Optimizing performance.

I’ve nearly completed the first phase and have just started working on the second.

Phase One – Usability Improvements

The original version of the cc-snapshot tool had several usability challenges that made it difficult for users to interact with and for developers to maintain. These issues included a rigid interface, lack of flexibility, and limited testing support, all of which made the tool harder to use and extend. To address these, I worked on the following improvements:

Problem: The command-line interface was limited and inflexible. Users couldn’t easily control features or customize behavior, which limited their ability to create snapshots in different scenarios.

Solution: I enhanced the CLI by adding:

A flag to disable automatic updates, giving users more control.
A --dry-run flag to simulate actions before actually running them, which is useful for testing and safety.
Support for a custom source path, allowing snapshots of specific directories. This makes the tool much more useful for testing smaller environments.

Problem: The code lacked automated tests. Without tests, developers have to manually verify everything, which is time-consuming and error-prone.

Solution: I implemented a basic test suite and integrated it with GitHub Actions, so the tool is automatically tested on every pull request.

Problem: The tool didn’t follow a modular design. The logic was tightly coupled, making it hard to isolate or extend parts of the code.

Solution: I refactored the code by extracting key functions. This makes the code cleaner, easier to understand, and more maintainable in the long term.

Next Steps – Phase Two: Performance Optimization

After improving the usability of the cc-snapshot tool, the next phase of the project focuses on addressing key performance bottlenecks. Currently, the snapshotting process can be slow and resource-intensive, which makes it less practical for frequent use, especially with large environments.

Problem 1: Slow Image Compression. The current implementation uses the qcow2 image format with zlib compression, which is single-threaded and often inefficient for large disk images. This leads to long snapshot creation times and high CPU usage.

Solution: I will benchmark and compare different compression strategies, specifically:

qcow2 with no compression
qcow2 with zstd compression, which is faster and multi-threaded
raw image format, which has no compression but may benefit from simpler processing

These tests will help determine which method provides the best tradeoff between speed, size, and resource usage.
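One way to set up this comparison is to generate the conversion command for each variant and then time them separately. The sketch below only builds argument lists and runs nothing. It assumes the image is converted with qemu-img convert (the `-c` flag and the qcow2 `compression_type` option exist in recent QEMU releases), which may not match how cc-snapshot builds its images. The function name, labels, and output file names are hypothetical, and a zlib variant is included only as a baseline for the current behavior.

```python
def convert_commands(src, out_prefix):
    """Return {label: argv} for the image variants to benchmark."""
    return {
        # Baseline: qcow2 with the default (zlib) compression.
        "qcow2-zlib": ["qemu-img", "convert", "-O", "qcow2", "-c",
                       src, f"{out_prefix}-zlib.qcow2"],
        # qcow2 without compression.
        "qcow2-none": ["qemu-img", "convert", "-O", "qcow2",
                       src, f"{out_prefix}-none.qcow2"],
        # qcow2 with zstd compression.
        "qcow2-zstd": ["qemu-img", "convert", "-O", "qcow2", "-c",
                       "-o", "compression_type=zstd",
                       src, f"{out_prefix}-zstd.qcow2"],
        # Raw image, no compression.
        "raw": ["qemu-img", "convert", "-O", "raw",
                src, f"{out_prefix}.raw"],
    }
```

Each argument list can be handed to `subprocess.run` inside whatever timing harness is used, so all variants are measured the same way.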

Problem 2: Suboptimal Storage Backend. Snapshots are currently uploaded to Glance, which can be slow and unreliable. Uploading large images can take several minutes, and this slows down the user workflow.

Solution: I will compare Glance with a potentially faster alternative, the Object Store. Smaller, compressed images may upload significantly faster to the Object Store (for example, 30 seconds versus 2 minutes). By measuring upload speeds and reliability, I can recommend a better default or optional backend for users.

How I Will Measure Performance

To understand the impact of different strategies, I will try to collect detailed metrics across three stages:

Image creation: How long it takes to build the image, depending on compression and format
Image upload: How quickly the snapshot can be transferred to Glance or Object Store
Instance boot time: How fast a new instance can start from that image (compressed formats must be decompressed)

I will run multiple tests for each scenario and record performance metrics like CPU usage, memory usage, disk throughput, and total time for each step. This will help identify the most efficient and practical configuration for real-world use.
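A minimal sketch of the repeated-run timing described above is shown below. It covers only wall-clock time. CPU usage, memory usage, and disk throughput would need an external monitor and are outside this example. The function name benchmark and the result keys are assumptions for illustration.

```python
import statistics
import time


def benchmark(step, repeats=3, clock=time.perf_counter):
    """Run `step()` `repeats` times and summarize the wall-clock durations."""
    durations = []
    for _ in range(repeats):
        start = clock()
        step()
        durations.append(clock() - start)
    return {
        "runs": repeats,
        "mean_s": statistics.mean(durations),
        "stdev_s": statistics.stdev(durations) if repeats > 1 else 0.0,
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

Here `step` would wrap one stage, such as image creation, upload, or instance boot. Reporting the spread alongside the mean helps show whether a difference between two configurations is larger than the run-to-run noise.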

Conclusion

Addressing the current usability and performance issues in cc-snapshot is essential to improving the overall user experience. By making the tool easier to use, faster, and more flexible, we can support researchers and developers who depend on reproducible computing for their work. So far, I’ve worked on enhancing the tool’s interface, adding testing support, and refactoring the codebase for better maintainability. In the next phase, I’ll be focusing on benchmarking different compression methods, image formats, and storage backends to improve speed and efficiency. These improvements will help make cc-snapshot a more powerful and user-friendly tool for the scientific community.

Stay tuned for the next update and thank you for following my journey!

Halfway Blog: FSA: Benchmarking Fail-Slow Algorithms

Tue, 23 Jul 2024 00:00:00 +0000

Introduction

Hi, I’m Xikang Song, a 2024 SoR contributor to the project, working with mentors Ruidan Li and Kexin Pei. Our FSA-Benchmark project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. We will benchmark a range of machine learning algorithms, from traditional to advanced methods, and compare the results using a comprehensive evaluation system. This will provide a clear view of how machine learning impacts critical error detection in RAID systems.

Motivation

Fail-slow issues in storage systems, where a disk operates at a significantly reduced speed without completely failing, are subtle and can manifest as consistently higher latency compared to peer disks or recurrent abnormal latency spikes. These issues are challenging to detect but can significantly degrade overall system performance over time. Fixed thresholds are ineffective because latency distributions vary across different clusters, leading to thresholds that are either too low or too high, resulting in numerous false alerts. Therefore, we are enthusiastic about using machine learning models to analyze disk performance data. Machine learning algorithms can deeply learn the trends in the data, providing better detection capabilities.
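The weakness of a fixed threshold can be shown with a small comparison. The code below is only an illustration and is not one of the algorithms benchmarked in this project: the two function names, the use of per-disk median latency, and the factor of 2 are all assumptions.

```python
import statistics


def fixed_threshold_alerts(latencies_ms, threshold_ms):
    """Flag disks whose median latency exceeds one global threshold."""
    return sorted(disk for disk, samples in latencies_ms.items()
                  if statistics.median(samples) > threshold_ms)


def peer_relative_alerts(latencies_ms, factor=2.0):
    """Flag disks whose median latency exceeds `factor` times the cluster median."""
    medians = {disk: statistics.median(s) for disk, s in latencies_ms.items()}
    cluster_median = statistics.median(medians.values())
    return sorted(disk for disk, m in medians.items()
                  if m > factor * cluster_median)
```

With a 10 ms threshold, a fast cluster whose healthy disks sit near 1 ms would never flag a disk running at 5 ms, while a uniformly slower cluster at around 20 ms would flag every disk. Comparing each disk to its peers avoids both failure modes in this toy case, which is the intuition behind learning per-cluster behavior from data.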

Current Progress and Challenges

Algorithm Implementation:

Cost-Sensitive Ranking Model: Inspired by the paper “Improving Service Availability of Cloud Systems by Predicting Disk Error” presented at the USENIX ATC ‘18 conference, this model ranks disks based on fail-slow risk.
Multi-Prediction Models: Drawing from “Improving Storage System Reliability with Proactive Error Prediction” presented at the USENIX ATC ‘17 conference, this approach uses multiple traditional machine learning models to evaluate disk health using diverse features. Various models were tested, with the Random Forest classifier proving most effective.
LSTM Model: This model employs Long Short-Term Memory (LSTM) networks, trained on the first day’s data for each cluster and evaluated on data spanning all days. It captures temporal dependencies to accurately predict fail-slow anomalies over time.

Comprehensive Evaluation:

Collected outputs from all algorithms on Chameleon for Perseus data A to Y (25 clusters).
Parsed the outputs through a comprehensive evaluation system, recording the true/false positives/negatives.
Plotted heat maps to show precision and recall with different look-back days and alert threshold settings.
Compared the performance across different clusters to draw conclusions.
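The precision and recall values behind such heat maps can be computed as in the following sketch. It assumes each algorithm produces a per-disk risk score and that the set of truly fail-slow disks is known. The function names and the score-threshold interpretation of the alert threshold are assumptions, not the project's evaluation code.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall from sets of flagged and truly fail-slow disks."""
    true_pos = len(predicted & actual)
    false_pos = len(predicted - actual)
    false_neg = len(actual - predicted)
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if actual else 0.0
    return precision, recall


def sweep(scores, actual, thresholds):
    """Map each alert threshold to (precision, recall) for per-disk risk scores."""
    return {t: precision_recall({d for d, s in scores.items() if s >= t}, actual)
            for t in thresholds}
```

Repeating the sweep once per look-back window would yield the two-dimensional grid that a heat map visualizes.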

Packaging Code:

Packaged all the code into a Trovi Jupyter notebook, including the Chameleon server setup, to provide clear steps for running the code and reproducing the experiments. All algorithm testing and result parsing can be easily done here.

Challenges

Initially, I was unsure how to evaluate the performance of different algorithms. Ruidan Li provided comprehensive guidance on collecting all the results uniformly and parsing them to gather true/false positives/negatives. This approach enabled us to derive meaningful metrics and plot heatmaps for precision and recall. I learned the scientific method of benchmarking performance, and I am grateful for the guidance.

Future Steps

Further Investigation of Advanced Algorithms

We plan to explore advanced algorithms such as PatchTST. This will involve systematically collecting outputs and conducting comprehensive benchmarking to assess their performance in identifying fail-slow anomalies.

Transition to Large Language Models (LLMs)

Recognizing the limitations of traditional machine learning methods, we intend to transition to utilizing Large Language Models (LLMs). LLMs have demonstrated superior capabilities in understanding complex patterns and making accurate predictions. We anticipate that incorporating LLMs into our analysis will enhance our ability to detect and predict fail-slow anomalies more accurately, leading to better overall system reliability.

Exploring Throttling Bugs in HDFS: Reproducing Developer Fixes

Mon, 22 Jul 2024 00:00:00 +0000

Scalability is a critical concern for large-scale distributed systems like the Hadoop Distributed File System (HDFS). Throttling bugs, which affect the system’s ability to manage data transfer rates effectively, can lead to performance issues and system instability. In my recent work, I focused on reproducing the effects of two specific throttling bugs in HDFS, which were fixed by developers. This blog provides an overview of these bugs and the process of reproducing their effects to validate the fixes.

HDFS-17087: Missing Throttler in DataXceiver#readBlock

One of the throttling bugs I explored was HDFS-17087. The DataXceiver#readBlock function in HDFS lacked a throttler, resulting in unregulated data reads. This absence could lead to potential performance degradation under heavy loads. The developer fixed this issue by adding a throttler to regulate the data transfer rate. In my work, I reproduced the bug and observed the system’s behavior both before and after applying the developer’s patch. The results showed a significant improvement in stability and performance post-fix.
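What a throttler contributes on a read path can be sketched with a single function. This is a generic rate-limiting calculation and not HDFS's throttler implementation. The function name and parameters are hypothetical.

```python
def required_delay_s(bytes_sent, elapsed_s, limit_bytes_per_s):
    """Return how long a sender should pause to stay at or below the limit."""
    # Earliest time at which this many bytes is allowed to have been sent.
    earliest_allowed_s = bytes_sent / limit_bytes_per_s
    return max(0.0, earliest_allowed_s - elapsed_s)
```

A code path that never performs a check like this after sending data sends as fast as the disk and network allow, which is the unregulated behavior described above.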

HDFS-17216: Incorrect Data Rate Calculation

Another crucial bug was HDFS-17216. The issue stemmed from the use of integer division in the getBytesPerSec function, which caused incorrect speed calculations and failed to trigger the throttle, resulting in overspeed. The developer addressed this by switching from integer to float for calculating the elapsed time, ensuring accurate speed measurements. I reproduced the conditions that highlighted the bug’s effects and compared the system’s performance with and without the fix. The post-fix results confirmed that the throttling mechanism worked correctly, effectively preventing overspeed.
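The effect of integer division on a rate calculation can be reproduced in a few lines. The code below is a simplified Python analogue and not the HDFS source. It assumes the elapsed time is converted from milliseconds to whole seconds, and that a zero elapsed time falls back to returning the raw byte count. Both details are assumptions made for illustration.

```python
def bytes_per_sec_truncating(bytes_read, elapsed_ms):
    """Rate using whole seconds only, as with integer division."""
    elapsed_s = elapsed_ms // 1000          # anything under 1000 ms becomes 0 s
    if elapsed_s == 0:
        return bytes_read                   # fallback hides the true rate
    return bytes_read // elapsed_s


def bytes_per_sec_fractional(bytes_read, elapsed_ms):
    """Rate using fractional seconds, in the spirit of the float-based fix."""
    elapsed_s = elapsed_ms / 1000
    if elapsed_s == 0:
        return bytes_read
    return int(bytes_read / elapsed_s)
```

For a transfer of 10,000,000 bytes in 250 ms against a limit of 20,000,000 bytes per second, the truncating version reports 10,000,000 and never triggers the throttle, while the fractional version reports 40,000,000 and does.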

Conclusion

Reproducing these throttling bugs and validating the developer fixes was a vital step in understanding their impact on HDFS’s scalability. The improvements observed in system stability and performance underscore the importance of accurate throttling mechanisms. This work contributes to the broader effort of maintaining robust and scalable distributed systems, ensuring they can handle increasing loads efficiently.

Trovi redesign process and low fidelity prototype in Figma

Mon, 22 Jul 2024 00:00:00 +0000

Hello! My name is Alicia Esquivel Morel, and I’m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I’m working on a project to improve user experience reproducibility through a redesign of TROVI, as part of the Summer of Reproducibility (SoR) program. I’m excited to be working with two fabulous mentors, Kate Keahey and Mark Powers.

Research Reproducibility with a TROVI Redesign

As researchers, we constantly face challenges replicating experiments due to limitations in current tools. TROVI, a platform designed to facilitate experiment replication, can be hindered by hard-to-follow interfaces and difficulties integrating code and data. This leads to confusion and frustration.

My SoR project tackles these issues by redesigning TROVI to enhance user experience reproducibility. Imagine a user-friendly platform where uploading code, sharing data, and collaborating with colleagues becomes easy and straightforward.

The Redesign’s Goals

Enhanced User Experience: Inspired by user-friendly platforms like Google Colab, we’ll simplify TROVI’s interface for intuitive navigation and ease of use.
Uploads and Sharing: Uploading code and data, as well as collaborating with researchers, are key goals. Integration with platforms like GitHub will further streamline collaboration.
Continuous Improvement: A built-in feedback loop will allow users to provide input and suggestions, ensuring TROVI constantly evolves based on user needs.

Progress I have made so far

The first stage of my project began with conducting User Experience (UX) research and identifying user requirements for TROVI. I then conducted a literature review on reproducibility platforms to learn about efficient methodologies and platforms for reproducibility. This helped establish a clearer project scope. Additionally, I analyzed TROVI end-user feedback to understand redesign needs.

In summary, during the first weeks of the project, I focused on research and requirements gathering, including the literature review on state-of-the-art reproducibility platforms. Before the midterm assessment, my work also involved the redesign process, prioritizing improved usability and user experience. I designed wireframes following the requirements and user feedback and later translated them into a low-fidelity prototype. Front-end and back-end considerations were made, such as selecting a front-end framework (Vue.js) and a collaborative design tool (Figma).

What do I plan to do over the next weeks?

During the next two weeks, I will address challenges encountered in the design process and make the necessary adjustments to ensure the success of the next steps of the project. A higher-fidelity prototype will be completed, including connections between the different objects and frames. This will facilitate the creation of a front-end with multiple flows in the prototype. Additionally, this will provide a preview of the end-user experience through the design process, without requiring the back-end to be functional or connected yet. I’m also investigating design tool API integrations to access TROVI’s APIs. This will give us the ability to access and isolate the properties associated with any TROVI artifact.

I’m halfway through the redesign process. Next steps will include the integration of both the backend and frontend components to create a cohesive and functional system. We will also facilitate initial user interactions and testing to gather valuable feedback and ensure that the system meets the needs and expectations of end users.
In addition, as I progress, my focus will shift towards enhancing the user experience and refining the final product based on the feedback received. The final two weeks of the program will be dedicated to this critical phase, where I will apply user experience techniques and conduct thorough testing to polish the product. This period will involve close analysis and iteration to address any issues and optimize functionality.
By the end of the program, I aim to deliver a functional and user-friendly product that not only meets the initial project goals but also exceeds user expectations.

Stay tuned to see how TROVI is built for reproducible research!!

Data Engineering and Automated Evaluation for OpenROAD's Chat Assistant: Midterm Update

Sun, 21 Jul 2024 00:00:00 +0000

Hello everyone! We’ve reached the halfway point of our Google Summer of Code 2024 journey, and it’s time for an update on our project to build a conversational chat assistant for OpenROAD. Under the guidance of our mentors, Indira Iyer and Jack Luar, we’re making significant strides in enhancing OpenROAD’s user support capabilities.

Project Focus

My project focuses on two crucial aspects of our chat assistant:

Data Engineering: Ensuring our assistant has access to comprehensive and relevant information.
Evaluation: Developing robust methods to assess and improve the assistant’s performance.

The ultimate goal is to create a more responsive and accurate chat assistant capable of aiding users with troubleshooting, installation, and general queries about OpenROAD. I’m working in tandem with Palaniappan R, who is developing the RAG architecture for our assistant.

Progress

Since our initial deployment, I’ve been concentrating on implementing automated evaluation systems for our RAG architecture. We’ve developed two primary evaluation methods:

Basic Abbreviation Evaluation

This method assesses the model’s ability to accurately identify and explain common abbreviations used within the OpenROAD community. It ensures that our assistant can effectively communicate using domain-specific terminology.

LLM Judge-Based Evaluation

For this more comprehensive evaluation, we:

Prepared a dataset of question-answer pairs relevant to OpenROAD.
Queried our model with these questions to generate answers.
Employed LLMs (including GPT-4o and Gemini 1.5 Flash) to act as judges.
Evaluated our model’s responses against ground truth answers.
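The four steps above amount to a small evaluation loop, sketched below. It is not the project's evaluation code: the model and the judge are passed in as plain callables so that no particular LLM API is assumed, and the name judge_accuracy and the boolean verdict format are hypothetical.

```python
def judge_accuracy(qa_pairs, answer_fn, judge_fn):
    """Return the fraction of model answers the judge marks as correct.

    qa_pairs: list of (question, ground_truth) tuples.
    answer_fn: callable question -> model answer (the system under test).
    judge_fn: callable (question, ground_truth, answer) -> truthy verdict.
    """
    if not qa_pairs:
        return 0.0
    verdicts = []
    for question, truth in qa_pairs:
        answer = answer_fn(question)
        verdicts.append(bool(judge_fn(question, truth, answer)))
    return sum(verdicts) / len(verdicts)
```

In practice `judge_fn` would wrap a prompt to a judge model and parse its verdict. Running the same dataset through more than one judge makes it possible to check how much the scores depend on the judge itself.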

Here’s a glimpse of our early benchmark results:

Exploratory Data Analysis (EDA) on GitHub OpenROAD issues

To gather more data, I performed Exploratory Data Analysis (EDA) on GitHub OpenROAD issues using GitHub’s GraphQL API. This allowed us to:

Filter data based on parameters such as:
- Minimum number of comments
- Date range
- Mentioned PRs
- Open or closed status
Structure the data, focusing on issues tagged with Build, Query, Installation, and Runtime.
Process the data into JSONL format with key fields including:
- url: URL of the GitHub issue
- id: Unique issue number
- title: Issue title
- author: Username of the issue creator
- description: Initial issue description
- content: Array of messages related to the issue
- category: General category of the issue
- subcategory: More specific category of the issue
- tool: Relevant tools or components
- date: Issue creation timestamp
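To make the record layout concrete, here is a minimal sketch of one JSONL line with the fields listed above. The field names come from the list; every value, and the types chosen for each field, are placeholders assumed for illustration.

```python
import json

# Field names follow the list above; all values are made-up placeholders.
record = {
    "url": "https://github.com/example-org/example-repo/issues/1",
    "id": 1,
    "title": "Example build failure",
    "author": "example-user",
    "description": "Build fails during configuration.",
    "content": ["First comment.", "Second comment."],
    "category": "Build",
    "subcategory": "Configuration",
    "tool": ["example-tool"],
    "date": "2024-01-01T00:00:00Z",
}


def to_jsonl(records):
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSONL text back into a list of dictionaries."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = to_jsonl([record])
restored = from_jsonl(jsonl_text)
```

Keeping one issue per line means the file can be streamed and filtered without loading everything into memory.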

After curating this dataset, I was able to run an analysis of OpenROAD GitHub issues, breaking them down by category in a pie chart.

Looking Ahead

As we move into the second half of the GSOC period, our plans include:

Incorporating GitHub Discussions data into our knowledge base.
Utilizing this expanded dataset to enhance our RAG architecture.
Continually refining and improving our model’s performance based on evaluation results.

We’re excited about the progress we’ve made and look forward to delivering an even more capable and helpful chat assistant for the OpenROAD community. Stay tuned for more updates as we continue this exciting journey!

Halfway Through SoR24: Building a Scalable Performance Benchmarking Tool for Genomics Workflows

Sun, 21 Jul 2024 00:00:00 +0000

Project Overview

Hi! I’m Martin Putra, and I’m working on the “Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster” project under the supervision of In Kee Kim. We are building GenScale, a scalable benchmarking tool for genomics workloads that leverages an industrial-grade cluster manager and monitoring system. GenScale will allow us to generate performance data under a setup that is representative of large-scale production settings. Ultimately, we hope GenScale and the datasets it produces will catalyze engagement between the computer systems and bioinformatics communities, thus accelerating the pace of discovery in both fields.

Progress and Challenges

We have built a prototype using Kubernetes as the cluster manager and Prometheus as the monitoring system. In its current state, the prototype can support an arbitrary number of compute nodes, owing to Kubernetes’ notable scaling capability. This provides a suitable environment for small- to mid-scale experiments. We leverage ChameleonCloud to provide the necessary computational and reproducibility infrastructure. The monitoring system supports cluster-level, node-level, and container-level metrics collection and failure detection. We integrated Grafana dashboards for visualizations.

The prototype also supports the execution of user-defined workflows. During the design process, we considered integrating one of the existing workflow execution systems, such as cwltool, Nextflow, or Cromwell. Each system has its own pros and cons when placed within the context of how we envision GenScale. However, we ultimately decided to build our own workflow execution system in order to provide maximum flexibility for the capabilities we plan to add in the future. For example, we believe it will be interesting to study how hardware heterogeneity affects the performance of each application in the workflow (a well-known workflow scheduling problem). Studying this problem requires the capability to schedule execution on specific machines. In addition, if we want to study contention, we may also need to execute on machines that are currently running specific workflows. While there are ways to do both with a stack that combines an existing workflow execution system and Kubernetes, we believe building our own workflow execution system will make this much simpler.


Figure 1. Proportion of execution time for DNA Alignment applications, executed on Chameleon’s cascadelake_r node with 1500MB paired-end input. y-axis: proportion of application’s exec. time out of the whole workflow’s exec. time, x-axis: top 10 applications accounting for 97% exec. time, sorted by proportion. Other applications are aggregated.

We confirmed GenScale’s capability to produce useful data by executing a DNA alignment workflow and capturing its runtime resource usage. We use the Genomic Data Commons’ (GDC) DNA alignment workflow as a reference. It has a total of 27 applications, covering quality checks, read trimming, the actual alignment, indexing, and various metrics collection. We wrote our own simplified version of the workflow by first analyzing the execution time and resource usage of each application, then choosing the 10 applications that represent 97% of the workflow execution time. We took into account that containerization is the de facto standard for workflow execution in the bioinformatics community. Thus, we packaged each application as its own separate container, then hosted their Dockerfiles and containers in a private GitHub Container Registry (GHCR). We plan to make them public in the future. Our monitoring system is able to show resource usage in real time. We also built sidecar containers that use the pidstat utility to generate a CSV of core, memory, and storage utilization throughout each workflow’s execution. This will allow easier analysis and data sharing for GenScale’s users.
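The selection step mentioned above (keeping the applications that account for most of the workflow’s execution time) can be sketched as follows. The function name `top_apps_by_coverage`, the application names, and the timings are all hypothetical; they are not GenScale code or measurements.

```python
def top_apps_by_coverage(exec_seconds, target=0.97):
    """Pick the longest-running apps until their share of total runtime reaches target."""
    total = sum(exec_seconds.values())
    chosen, covered = [], 0.0
    ranked = sorted(exec_seconds.items(), key=lambda kv: kv[1], reverse=True)
    for app, secs in ranked:
        if covered / total >= target:
            break
        chosen.append((app, secs / total))
        covered += secs
    return chosen


# Hypothetical per-application timings in seconds.
timings = {"align": 600.0, "sort": 200.0, "metrics": 150.0, "index": 40.0, "qc": 10.0}
selected = top_apps_by_coverage(timings)
```

With these made-up numbers, four of the five applications are kept and the shortest one is aggregated into "other", mirroring how Figure 1 groups the remaining applications.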


Figure 2. CPU utilization pattern of BWA, Picard’s CollectWGSMetrics, and Picard’s ValidateSamFile collected by GenScale. y-axis: (num. cores) x 100%, x-axis: time elapsed in seconds.

One technical challenge is automating the creation of a Kubernetes cluster and keeping it alive. We believe GenScale’s users would be interested in the performance of workflows under dynamic cluster sizes, due to either intentional scaling or machine failures. While the current prototype supports creating a cluster with an arbitrary number of nodes, some steps still require a reboot when adding nodes, so cluster creation and horizontal scaling are not yet fully automated. Keeping a cluster alive is also expensive. Since we use ChameleonCloud as our testbed, we can either keep the cluster alive at the cost of significant service unit (SU) usage, or save SUs by terminating our leases at the cost of rebuilding the cluster from scratch later. We chose a middle ground by keeping only Kubernetes’ control plane alive. The approach has worked well so far.

Next Steps

For the remaining weeks, we plan to work on the second workflow, namely RNA Alignment. We would also like to add simple user interfaces if time permits. Finally, we plan to package GenScale’s source code, container images, and sample benchmark results for the open-source community. We look forward to the second half of Summer of Reproducibility!

Midterm Blog: ML in Detecting and Addressing System Drift

Sun, 21 Jul 2024 00:00:00 +0000

Hello! I’m Joanna! Over the past month, I have been contributing to the ML in Detecting and Addressing System Drift project under the mentorship of Ray Andrew Sinurat and Sandeep Madireddy. My project aims to design a pipeline to evaluate drift detection algorithms on system traces. The goal is to characterize different drifts, understand how they affect model performance, and evaluate the performance of state-of-the-art (SOTA) drift detection algorithms.

Progress

Over the past month, I’ve primarily been constructing a data drift dataset from the Tencent I/O block trace, which includes both drift and non-drift data. By combining offline drift detection algorithms such as Maximum Mean Discrepancy, Cramér-von Mises, and Kolmogorov-Smirnov, I am developing a dataset that contains segments with and without drifts for features such as IOPS (Input/Output Operations Per Second), read/write size ratio, write size, and other relevant performance metrics. The diagrams below illustrate the data segments identified with and without drifts, respectively.

In addition to constructing the datasets, I have begun evaluating some online drift detection algorithms and designing metrics to assess their performance. I have tested the performance of online drift detection algorithms such as Online Maximum Mean Discrepancy and Online Cramér-von Mises under various settings, including different window lengths and sensitivity levels. The following diagrams illustrate the drift points detected for the IOPS feature under these different settings.
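As a rough illustration of window-based detection, here is a minimal sketch using the two-sample Kolmogorov-Smirnov test, one of the methods named above. It is not the project’s pipeline: the window sizes and synthetic IOPS values are made up, and the threshold uses the standard large-sample approximation with coefficient 1.358 (roughly a 0.05 significance level).

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_drift(reference, window, c_alpha=1.358):
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(window)
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, window) > threshold


# Synthetic IOPS-like values, not taken from the Tencent trace.
rng = random.Random(0)
reference = [rng.gauss(100, 10) for _ in range(500)]
shifted = [rng.gauss(130, 10) for _ in range(500)]
drift_found = ks_drift(reference, shifted)
```

Shorter windows react faster but make the threshold looser, which is one reason window length and sensitivity need to be tuned together.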

Next Steps

Here are my plans for the next month:

Complete the experiments on data drift and generate improved visualizations to summarize the performance of these online drift detection algorithms, including their overhead and accuracy over time.
Characterize drifts by identifying the types of drifts that lead to model performance degradation.
Evaluate drift detection algorithms in the context of concept drifts.

Stay tuned for my future updates on this project!

Enabling VAA Execution: Environment and VAA Preparation and/or Reproducibility for Dynamic Bandwidth Allocation (CONCIERGE)

Sat, 20 Jul 2024 00:00:00 +0000

Hi there!

I am Rafael Sinjunatha Wulangsih, a Telecommunication Engineering graduate from the Bandung Institute of Technology (ITB), Bandung, Indonesia. I’m currently contributing to the “EdgeRep: Reproducing and benchmarking edge analytic systems” project under the mentorship of Yuyang (Roy) Huang and Prof. Junchen Jiang. You can find more details about the project proposal here.

This project addresses the challenges posed by the massive deployment of edge devices, such as traffic or security cameras, in smart cities and other environments. In the previous Edgebench project, the team proposed a solution to dynamically allocate bandwidth and compute resources to video analytic applications (VAAs) running on edge devices. However, that project was limited to a single VAA, which may not represent the diverse applications running on edge devices. Therefore, the main goal of this project, “EdgeRep,” is to diversify the VAAs running on edge devices while utilizing a solution similar to that of the Edgebench project. EdgeRep aims to reproduce state-of-the-art self-adaptive VAAs (with seven candidates) and maintain self-adaptation in these video analytics pipelines. We will implement it ourselves if the video analytics applications do not support self-adaptation.

Halfway Through GSOC: Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Sat, 20 Jul 2024 00:00:00 +0000

Hello, I’m Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University. I’m currently working on the AIIO / Graph Neural Network project under the guidance of Bin Dong and Suren Byna. Our project focuses on enhancing the AIIO framework to automatically diagnose I/O performance bottlenecks in high-performance computing (HPC) systems using Graph Neural Networks (GNNs).

Project Overview

Our primary goal is to tackle the persistent issue of I/O bottlenecks in HPC applications. Identifying these bottlenecks manually is often labor-intensive and prone to errors. By integrating GNNs into the AIIO framework, we aim to create an automated solution that can diagnose these bottlenecks with high accuracy, ultimately improving the efficiency and reliability of HPC systems.

Progress and Challenges

Over the past few weeks, my work has been centered on developing a robust data pre-processing pipeline. This pipeline is crucial for converting raw I/O log data into a graph format suitable for GNN analysis. The data pre-processing involves extracting relevant features from Darshan I/O logs, which include job-related information and performance metrics. One of the main challenges has been dealing with the heterogeneity and sparsity of the data, which can affect the accuracy of our models. To address this, we’ve focused on using correlation analysis to identify and select the most relevant features, ensuring that the dataset is well-structured and informative for GNN processing.

We’ve also started constructing the GNN model. The model is designed to capture the complex relationships between different I/O operations and their impact on system performance. This involves defining nodes and edges in the graph that represent job IDs, counter types, and their values. We explored different graph structures, including those that focus on counter types and those that incorporate more detailed information. While more detailed graphs offer better accuracy, they also require more computational resources.
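One way to picture such a graph is sketched below. This is an assumption about the layout, not the AIIO implementation: jobs and counter types become two kinds of nodes, and each non-zero counter value becomes a weighted edge between them. The function `build_graph`, the job IDs, and the counter values are illustrative; the counter names are ordinary Darshan POSIX counters.

```python
def build_graph(jobs):
    """jobs maps a job id to {counter name: value}; returns (nodes, edges)."""
    nodes = {}   # node key -> node type
    edges = []   # (job node, counter node, value)
    for job_id, counters in jobs.items():
        job_node = ("job", job_id)
        nodes[job_node] = "job"
        for name, value in counters.items():
            if value == 0:
                continue  # design choice in this sketch: drop zero counters
            counter_node = ("counter", name)
            nodes[counter_node] = "counter"
            edges.append((job_node, counter_node, float(value)))
    return nodes, edges


# Made-up counter values for two hypothetical jobs.
jobs = {
    "job-1": {"POSIX_READS": 120, "POSIX_WRITES": 0, "POSIX_BYTES_READ": 4096},
    "job-2": {"POSIX_READS": 10, "POSIX_WRITES": 50, "POSIX_BYTES_READ": 0},
}
nodes, edges = build_graph(jobs)
```

Dropping zero-valued counters is one simple response to sparsity; a richer graph would keep more detail at a higher computational cost, which is the trade-off described above.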

Current Achievements

Data Pre-processing Pipeline: We have successfully developed and tested the pipeline to transform Darshan I/O logs into graph-structured data. This was a significant milestone, as it sets the foundation for all subsequent GNN modeling efforts.
GNN Model Construction: The initial version of our GNN model has been implemented. This model is now capable of learning from the graph data and making predictions about I/O performance bottlenecks.
Correlation Analysis for Graph Structure Design: We have used correlation analysis on the dataset to understand the relationships between I/O counters. This analysis has been instrumental in designing a more effective graph structure, helping to better capture the dependencies and interactions critical for accurate performance diagnosis.

Training for Different Graph Structures: We are currently training our model using various graph structures to determine the most effective configuration for accurate I/O performance diagnosis. This ongoing process aims to refine our approach and improve the model’s predictive accuracy.

Next Steps

Looking ahead, we plan to focus on several key areas:

Refinement and Testing: We’ll continue refining the GNN model, focusing on improving its accuracy and efficiency. This includes experimenting with different graph structures and training techniques.
SHAP Analysis: To enhance the interpretability of our model, we’ll incorporate SHAP (SHapley Additive exPlanations) values. This will help us understand the contribution of each feature to the model’s predictions, making it easier to identify critical factors in I/O performance.
Documentation and Community Engagement: As we make progress, we’ll document our methods and findings, sharing them with the broader community. This includes contributing to open-source repositories and engaging with other researchers in the field.

This journey has been both challenging and rewarding, and I am grateful for the support and guidance from my mentors and the community. I look forward to sharing more updates as we continue to advance this exciting project.

Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream

Sat, 20 Jul 2024 00:00:00 +0000

Hello there! I’m Acheme and I’m thrilled to share the progress on my project, “Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream” under the mentorship of Joaquin Chung and Flavio Castro under the SciStream project.

Project Overview

This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.

Progress

One of the first ports of call in the project was a consultation with SciStream team members at Argonne to identify use cases in scientific streaming applications and the typical traffic profiles they represent. The goal was to simulate these profiles using traffic generator tools and by configuring network resources on the FABRIC/Chameleon testbed. The following traffic profiles were identified to cover many use cases, including one of ESnet’s broad categories in integrated research workflows, “The Time-Sensitive Pattern”:

Throughput intensive startup
Intermittent burst of traffic for a duration of time
Constant rate traffic
Latency sensitive

Since data streaming applications have some unique requirements for optimum performance, the following metrics were selected as important for testing streaming performance.

Latency
Jitter
Packet loss / message loss
Throughput

Subsequently, about seventeen open-source traffic generator applications were identified and compared to find a few that can generate our defined traffic profiles and expose the desired performance metrics. We ultimately settled on iperf3 and pvaPy (a scientific streaming application developed at Argonne National Lab).

So far, the first set of benchmarking tools has been developed, using iperf3 as the traffic generator with the constant-rate and intermittent-burst profiles. The tools generate traffic, collect the metrics that iperf3 exposes (including throughput, jitter, and datagram loss), and save them to a CSV file for further analysis. A Jupyter notebook is used to set up a FABRIC slice and configure a four-node experiment suitable for benchmarking the SciStream base architecture. After running the experiments on the FABRIC nodes and collecting the results in a CSV file, cells in the Jupyter notebook analyze the data. The analysis includes the average, minimum, maximum, and standard deviation of each metric.
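The analysis step can be sketched with the standard library alone. The column names and sample rows below are hypothetical, chosen only to mirror the metrics mentioned above; they are not the actual CSV layout or measured results.

```python
import csv
import io
import statistics


def summarize(csv_text, metrics):
    """Compute mean, min, max, and sample standard deviation per metric column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for name in metrics:
        values = [float(row[name]) for row in rows]
        summary[name] = {
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
            "stdev": statistics.stdev(values),
        }
    return summary


# Made-up sample data with assumed column names.
sample = """interval,throughput_mbps,jitter_ms,lost_datagrams
1,940.0,0.020,0
2,950.0,0.030,1
3,960.0,0.040,2
"""
stats = summarize(sample, ["throughput_mbps", "jitter_ms", "lost_datagrams"])
```

Reading from a file instead only requires replacing `io.StringIO(csv_text)` with an opened file handle.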

Findings

From the experiments conducted so far, the findings are as follows:

We could not properly simulate some of the initially defined traffic profiles. For example, simulating a latency-sensitive traffic profile requires the ability to set timeouts in iperf3, which is not available at the moment.
It is not straightforward to implement SciStream on the Chameleon testbed at the moment.
iperf3 does not expose a latency metric, and its jitter computation is suspect.

Next Steps

Next, I will focus on pvaPy, following the same approach as the iperf3-based benchmarking and analysis tools:

Fully develop traffic generator and metric collection tools for pvaPy that cover the defined traffic profiles and expose the chosen metrics.
Perform an initial experiment, as was done for iperf3.
Repeat both the iperf3- and pvaPy-based benchmarks in multiple scenarios (LAN, METRO, WAN), compare performance, and explain the results.

Stay tuned for my final blog as I present deeper results and insights!

Reproducibility in Data Visualization

Fri, 19 Jul 2024 00:00:00 +0000

Hello everyone!

Initial Approach and Challenges

I began my work by comparing original visualizations with reproduced ones using OpenCV for pixel-level comparison. This method helped highlight structural differences but also brought to light some challenges. Different versions of libraries rendered visualizations slightly differently, causing minor positional changes that didn’t affect the overall message but were still flagged as discrepancies.

To address this, I experimented with machine learning models like VGG16, ResNet, and Detectron2. These models are excellent for general image recognition but fell short for our specific needs with charts and visualizations. The results were not as accurate as I had hoped, primarily because these models aren’t tailored to handle the unique characteristics of data visualizations.

Shifting Focus to Chart-Specific Models

Recognizing the limitations of general ML models, I shifted my focus to chart-specific models like ChartQA, ChartOCR, and ChartReader. These models are designed to understand and summarize chart data, making them more suitable for our goal of comparing visualizations based on the information they convey.

Generating Visualization Variations and Understanding Human Perception

Another exciting development in my work has been generating different versions of visualizations. This will allow me to create a survey to collect human categorization of visualizations. By understanding how people perceive differences, whether in outliers, shapes, data points, or colors, we can gain insights into which parameters affect human interpretation of visualizations.

Next Steps

Moving forward, I’ll continue to delve into chart-specific models to refine our comparison techniques. Additionally, the survey will provide valuable data on human perception, which can be used to improve our automated comparison methods. By combining these approaches, I hope to create a robust framework for reliable and reproducible data visualizations.

I’m thrilled about the progress made so far and eager to share more updates with you all. Stay tuned for more insights and developments on this exciting journey!

Reproducibility in Data Visualization

Thu, 18 Jul 2024 00:00:00 +0000

Introduction

Hello! My name is Arya Sarkar, a machine learning engineer and researcher based out of Kolkata, a city in Eastern India dubbed the City of Joy. For the last month and a half I have been working closely with Professor David Koop on the project titled Reproducibility in Data Visualization. I’m thrilled to be able to make my own little mark on this amazing project and aid in exploring solutions to capture visualizations in hopes of making reproducibility easier in this domain.

Progress and Challenges

The last month and a half have mostly been spent exploring the best possible solutions to facilitate the reproducibility of static visualizations from local sources and/or the web. We have taken inspiration from existing work in the domain and successfully captured the meta-information required to regenerate a visualization. The extracted metadata is saved into the generated .png figure of the visualization, so reproduction is possible as long as you have (a) the original dataset and (b) the generated .png of the visualization. All other information is stored inside the .png file as a JSON object and can be used to regenerate the original image with very high accuracy.
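To show how JSON can travel inside a .png, here is a minimal standard-library sketch that writes the metadata into a PNG `tEXt` chunk and reads it back. This is an assumption about the mechanism, not the project’s code: the keyword `plot-metadata`, the metadata fields, and the 1x1 stand-in image are all hypothetical.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def chunk(kind, data):
    """Build one PNG chunk: length, type, data, then CRC over type + data."""
    crc = zlib.crc32(kind + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + kind + data + struct.pack(">I", crc)


def tiny_png():
    """A 1x1 grayscale PNG standing in for a rendered figure."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
    return PNG_SIGNATURE + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + chunk(b"IEND", b"")


def iter_chunks(png):
    """Yield (type, data, offset) for every chunk in the file."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length], pos
        pos += 12 + length


def embed_metadata(png, keyword, metadata):
    """Insert a tEXt chunk holding JSON right before the IEND chunk."""
    payload = keyword.encode("latin-1") + b"\x00" + json.dumps(metadata).encode("latin-1")
    for kind, _, pos in iter_chunks(png):
        if kind == b"IEND":
            return png[:pos] + chunk(b"tEXt", payload) + png[pos:]
    raise ValueError("no IEND chunk found")


def read_metadata(png, keyword):
    """Return the JSON stored under keyword, or None if it is absent."""
    for kind, data, _ in iter_chunks(png):
        if kind == b"tEXt":
            key, _, text = data.partition(b"\x00")
            if key == keyword.encode("latin-1"):
                return json.loads(text.decode("latin-1"))
    return None


meta = {"plot": "hist", "column": "sepal_length", "bins": 10}  # hypothetical fields
png = embed_metadata(tiny_png(), "plot-metadata", meta)
```

Image viewers ignore the extra chunk, so the figure still opens normally while carrying what is needed to regenerate it.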

The problem, however, remains with visualizations that involve randomness, such as jitter. Capturing the randomness has not been 100% successful so far, and we are looking into options to ensure the capture of plots that contain randomness.

The following images highlight some results from our reproducibility experiments:

Original Histogram using Matplotlib on the iris dataset:

Reproduced Histogram using metainformation from the original:

The next steps

We have already started looking into ways to capture visualizations from the web, i.e., from platforms such as ObservableHq, and will use these experiments to transition into capturing interactive visualizations from the web.

Capturing user interactions and all states in an interactive visualization can prove very useful, as it is a well-known pain point in the reproducibility community and a challenge that still needs to be solved. My next steps involve finding a way to capture these interactive visualizations, especially those living on the web, and ensuring their reproducibility.

Halfway Through GSOC: My Experience and Learnings

Thu, 18 Jul 2024 00:00:00 +0000

Hello there! I’m Qianru, and this is my mid-term blog post for the 2024 Google Summer of Code. I am working on the BenchmarkST project, focusing on benchmarking gene imputation methods in spatial transcriptomics. My goal is to create a comprehensive, reproducible platform for evaluating these methods across various datasets and conditions.

In this post, I will share some of the progress I have made so far, the challenges I have faced, and how I overcame them. I will also highlight some specific accomplishments and what I plan to do next.

Achievements:

Developed the Python Package: I created the “Impeller” Python package, which includes tools for downloading example data, processing it, and training models. This package aims to standardize gene imputation tasks in spatial transcriptomics.
Example Data Integration: Successfully integrated various spatial transcriptomics datasets into the package for benchmarking purposes.
Benchmarking Framework: Established a framework for objective comparison of different gene imputation methodologies.

Python Package: Installation and Usage

You can install the package using pip:

pip install Impeller

Download Example Data

from Impeller import download_example_data
download_example_data()

Load and Process Data

from Impeller import load_and_process_example_data
data, val_mask, test_mask, x, original_x = load_and_process_example_data()

Train Model

from Impeller import create_args, train
args = create_args()
test_l1_distance, test_cosine_sim, test_rmse = train(args, data, val_mask, test_mask, x, original_x)

Challenges:

Reproducing the results of various gene imputation methods was not an easy task. I faced several challenges along the way:

Lack of Standardized Data: Some methods had incomplete or missing code, making it difficult to reproduce their results accurately.
Reproducibility Issues: Successfully integrated various spatial transcriptomics datasets into the package for benchmarking purposes.
Resource Limitations: Running large-scale experiments required significant computational resources, which posed constraints on the project timeline.

Future Work:

Moving forward, I plan to:

Extend the package’s functionalities to include more datasets and imputation methods.
Enhance the benchmarking framework for more comprehensive evaluations.
Collaborate with other researchers to validate and improve the package’s utility in the bioinformatics community.

I hope you found this update informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and support!

Mid Term Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Thu, 18 Jul 2024 00:00:00 +0000

Introduction

Hello, I’m Lihaowen (Jayce) Zhu, a 2024 SoR contributor for the FEP-Bench project, under the mentorship of Yuyang (Roy) Huang. The FEP-Bench project addresses the significant bottlenecks encountered during the feature engineering and preprocessing phase of machine learning, focusing in particular on data retrieval from data lakes and computational inefficiencies in data operations. By exploring caching, prefetching, and heuristic strategies, the project aims to optimize the preprocessing workflow, improving efficiency and reducing the resources ML projects require.

Motivation

Our research project is set in the context of Deep Neural Networks (DNNs). Training a DNN first requires a large amount of data. All raw data must pass through a data preprocessing pipeline, which is specific to each ML task. Typically, the pipeline loads the data from disk, converts it to the correct format, and transforms and augments it before feeding it into the training stage. In common ML training tasks and datasets, the data preprocessing stage can consume almost 65% of the total training time. However, while computing hardware such as GPUs and TPUs has developed rapidly, data preprocessing pipelines have not sped up nearly as much and cannot keep up with these hardware innovations, which creates a bottleneck in DNN training efficiency.

The bottlenecks fall into two categories: the data side and the computation side. The data-side bottleneck is mainly caused by data transfer in the system, including data fetching, I/O-bound operations, very large data sizes, and complex data formats. The computation-side bottleneck, in contrast, arises during data preprocessing operations and data shuffling. In distributed machine learning training systems, gathering the distributed data can also cause a computation-side bottleneck.

Current Progress

In order to improve the efficiency of the machine learning preprocessing pipeline, we first need to understand and document the preprocessing workflows commonly used in machine learning, including pipelines of Natural Language Processing, Computer Vision, and Audio datasets. As a result, for the past month, we have built up a collection of common datasets for different machine learning tasks. The dataset types include NLP, CV, Audio, Linear Regression, Video and LiDAR. The machine learning job types are collected based on the dataset types, such as sentiment analysis for NLP, and image classification for CV. The data has either a structured or unstructured format. In addition, our collection contains the following attributes:

Data/Sample size
Typical preprocessing operations
Preprocessing difficulty: hard/easy
Input splittable
Output reusable
CPU/GPU/IO Bound
Dataset and preprocessing links.

By collecting all this data, we can gain an overview of all common preprocessing pipelines in the current machine learning research field, and build up a solid basis for the next phase of our project, which requires hard work on benchmark profiling. For example, for the Audio datasets, we focus on the LibriSpeech dataset. It contains 1000 hours of speech sampled at 16kHz, making it one of the largest publicly available datasets for speech recognition tasks. The typical preprocessing steps of the LibriSpeech dataset include feature extraction, label to integer conversion, and padding.

Challenges

During the first phase of the project, I met a lot of challenges as I had not been exposed to topics similar to this project. The first big problem was that I needed to learn the concepts of some machine learning tasks from scratch, such as NLP, so that I could have a better understanding of the common datasets and pipelines. Also, I needed to deeply review a lot of different preprocessing pipelines for each machine learning task, to make the table more comprehensive.

Midterm Blogpost: Drift Management Strategies Benchmark

Thu, 18 Jul 2024 00:00:00 +0000

Hello there! I’m William and I’m thrilled to share the progress on my project, “Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches” under the mentorship of Ray Andrew Sinurat and Sandeep Madireddy under the LAST project.

Project Overview

Progress

So far, I’ve generated various synthetic datasets, which include:

CIRCLE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. Each data point is labeled as per the condition (x1 − c1)^2 + (x2 − c2)^2 <= r, where the center (c1, c2) and radius r of the circular decision boundary change gradually over a period of time, introducing (gradual) concept drift.
COVCON: This 2-dimensional dataset has covariate shift and concept drift. The decision boundary at each point is given by α ∗ sin(πx1) > x2. We use 10000 points (100 batches, 1000 points per batch). Covariate shift is introduced by changing the location of x1 and x2 (for batch t x1 and x2). Concept drift is introduced by alternating the value of α.
SINE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. In the first context, all points below the curve y = sin(x) are classified as positive. The class labels are flipped afterwards.
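As an illustration of how a gradually drifting stream like CIRCLE can be generated, here is a minimal sketch. The drift path of the center, the radius schedule, and the batch sizes are assumptions made for the example, and the membership check uses the geometric reading of the condition (squared distance against radius squared).

```python
import random


def circle_batch(rng, n, center, radius):
    """One batch of CIRCLE-style data: label 1 inside the circle, 0 outside."""
    c1, c2 = center
    batch = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        label = int((x1 - c1) ** 2 + (x2 - c2) ** 2 <= radius ** 2)
        batch.append((x1, x2, label))
    return batch


def circle_stream(n_batches=10, batch_size=100, seed=0):
    """Move the center and grow the radius linearly to mimic gradual concept drift."""
    rng = random.Random(seed)
    stream = []
    for t in range(n_batches):
        frac = t / max(n_batches - 1, 1)
        center = (0.2 + 0.6 * frac, 0.5)   # assumed drift path
        radius = 0.15 + 0.10 * frac        # assumed radius schedule
        stream.append(circle_batch(rng, batch_size, center, radius))
    return stream


stream = circle_stream()
```

An abrupt variant, closer in spirit to SINE, would switch the labeling rule at a single batch instead of interpolating it.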

Additionally, I’ve also curated drifting data from the Tencent I/O block trace. These datasets will be used to benchmark model performance under different drift conditions.

The pipeline can receive a base scikit-learn model and evaluate its performance on these datasets prequentially. Here are some of the initial results for model performance on these drifting datasets under two strategies: never retraining, and retraining using the past 1 or 7 windows. As you can see, model performance degrades upon encountering extreme drift.
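Prequential ("test-then-train") evaluation can be sketched as below. This is a simplified stand-in for the pipeline: `MajorityClassifier` is a toy replacement for a scikit-learn estimator with the same `fit`/`predict` shape, and the `past_windows` parameter is a hypothetical name for the retraining window count.

```python
from collections import Counter, deque


class MajorityClassifier:
    """Toy stand-in for an estimator exposing fit and predict."""

    def __init__(self):
        self.label = 0

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label for _ in X]


def prequential(batches, make_model, past_windows=None):
    """Score each batch before training on it.

    past_windows=None never retrains after the first batch;
    past_windows=k retrains on the k most recent batches.
    """
    history = deque(maxlen=past_windows) if past_windows else None
    X0, y0 = batches[0]
    model = make_model().fit(X0, y0)
    if history is not None:
        history.append(batches[0])
    accuracies = []
    for X, y in batches[1:]:
        preds = model.predict(X)
        accuracies.append(sum(p == t for p, t in zip(preds, y)) / len(y))
        if history is not None:
            history.append((X, y))
            Xs = [x for bx, _ in history for x in bx]
            ys = [t for _, by in history for t in by]
            model = make_model().fit(Xs, ys)
    return accuracies


# Three batches of class 0 followed by three of class 1 (an abrupt label flip).
batches = [([[0]] * 4, [0] * 4)] * 3 + [([[0]] * 4, [1] * 4)] * 3
never = prequential(batches, MajorityClassifier)
retrain1 = prequential(batches, MajorityClassifier, past_windows=1)
```

In this toy stream, the never-retrained model stays wrong after the flip, while retraining on the most recent window recovers after one batch.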

Findings

From the experiments conducted so far, the findings are as follows:

A model without retraining struggles to maintain performance when drift occurs.
Retraining on data from previous windows when the drift is abrupt (SINE) or gradual (CIRCLE) leads to poorer performance. This is especially evident in the retrain-window strategy that incorporates data from up to 7 prior windows.
However, retraining on previous data proves beneficial in cases of covariate shift (CovCon), allowing the model to better align with the evolving real-world feature distributions.

Next Steps

With the base template for the pipeline and the dataset curation done, my focus going forward will be on:

Implementing three advanced algorithms: AUE (Accuracy Updated Ensemble), MATCHMAKER, and Driftsurf, then integrating them into the pipeline.
Enhancing the benchmarking process by adding more metrics and plots, such as training time and inference time, to better evaluate the strategies.
Packaging the entire experiment into a Chameleon Trovi Artifact, ensuring ease of reproducibility and extension.

Stay tuned for my final blog as I delve deeper into this project!

Midterm Blogpost: HDEval's LLM Benchmarking for HDL Design

Thu, 18 Jul 2024 00:00:00 +0000

Introduction

Hello! My name is Ashwin Bardhwaj, an electrical engineering and computer science student based in San Diego, CA. For the past 6 weeks, I have been working closely with Professor Jose Renau on the HDEval project. The aim of this project is to create multiple project-sized HDL benchmarks to evaluate how well existing LLMs can generate Verilog/Chisel code. These benchmarks will include my own “golden” HDL implementation of each project as well as the corresponding English prompts to guide the LLM. I am excited to be able to work with these tools that have the potential to become a valuable resource for HDL design. So far, I have been successful in creating the first benchmark, a pipelined 3-stage RISC-V core, as well as working through my second project, a Gameboy emulator.

RISC-V Implementation

Over this past month and a half, I have successfully completed my first benchmark, which focuses on creating, modeling, and testing a pipelined 3-stage RISC-V core. The core uses the fetch, decode, and execute structure and is functional for most RV32I instructions. I synthesized and simulated my Verilog using Icarus Verilog and displayed the waveforms in GTKWave. After development, a good portion of time was spent creating and tuning the English explanation of each Verilog module. After running these benchmark files through several LLM APIs, we compared the existing “golden” modules with the generated ones and noticed that more recent LLMs such as GPT-4o and Claude 3 perform much better at creating syntactically correct and efficient code.

In addition, I have created a tool that parses the Verilog and instruction files into the JSON structure needed to test them on various models.
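The tool's actual schema is not described here, so the following is only a sketch of the general idea: split a Verilog source into modules and pair each with its English prompt. The function name and the JSON field names ("modules", "name", "prompt", "golden") are hypothetical, and the regular expression is deliberately simple (it would be confused by the word "module" inside a comment).

```python
import json
import re

# Matches one `module <name> ... endmodule` block (non-greedy, spans lines).
MODULE_RE = re.compile(r"\bmodule\s+(\w+)\b.*?\bendmodule\b", re.DOTALL)


def build_benchmark(verilog_source, prompts):
    """Pair each Verilog module with its English prompt.

    `prompts` maps module name -> instruction text. The field names below
    are hypothetical, not HDEval's actual schema.
    """
    entries = []
    for match in MODULE_RE.finditer(verilog_source):
        name = match.group(1)
        entries.append({
            "name": name,
            "prompt": prompts.get(name, ""),  # empty when no prompt was written
            "golden": match.group(0),         # full text of the reference module
        })
    return json.dumps({"modules": entries}, indent=2)
```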

Gameboy Emulator

I am also in the process of developing the second benchmark, which targets a Gameboy emulator. This will challenge the LLMs much more than the RISC-V project because apart from the custom CISC CPU, the model should also understand how to handle various other blocks of the hardware system including memory, picture processing unit (PPU), sound processing unit (SPU), various input/output systems like the buttons and cartridge, and interrupt handlers. As a result, it will challenge the model to understand the system as a whole when creating each individual module.

Next Steps

As we continue into the second half of the project, I will keep working on my Gameboy emulator. I have already fully developed and tested the Z80-esque CPU, DMA, and interrupt handler, but still need to work on the display and sound interfaces. I will also continue to run these tests over a wider range of LLMs to get a better picture of which models and versions are best suited for HDL design, as well as the direction these models are heading in.

Halfway Through OSRE24: My Experience and Learnings

Mon, 15 Jul 2024 00:00:00 +0000

Hello there! I’m Kilian Warmuth, a computer science student from Germany. This summer, I’m part of the 2024 Summer of Reproducibility (SoR) initiative. My project, “Reproducible Experiment Workflows in SLICES/pos,” aims to enhance reproducibility in scientific research, aligning with the FAIR principles (Findable, Accessible, Interoperable, Reusable).

Project Overview

The “Reproducible Experiment Workflows in SLICES/pos” project is part of the larger SLICES-RI initiative, designed to improve the reproducibility and reusability of large-scale experimental research. The project focuses on integrating the RO-Crate standard into the pos testbed to organize and document experiment results systematically. This integration will enhance the accessibility and comprehensibility of research findings, ensuring they adhere to the FAIR principles. Additionally, the project aims to improve the portability of pos experiments to the Chameleon testbed, facilitating collaboration and seamless execution across different research environments.

Progress and Challenges

The first half of the project is done, marked by significant progress and learnings. My initial focus was on familiarizing myself with the pos framework and the RO-Crate standard. This foundational knowledge was crucial for the subsequent steps of restructuring the results folder and integrating automated RO-Crate generation into the pos framework.

Key Achievements:

Restructured Results Folder: The structure of the results folder has been redesigned to streamline navigation and enable systematic storage of result data.
Automated RO-Crate Generation: Successfully integrated the basics of the RO-Crate standard into the pos framework, enabling the automated generation of comprehensive results documentation.
Metadata Documentation: Added comprehensive documentation to the results data, including essential metadata such as author details, user scripts, and hardware information, enhancing reproducibility and interpretability.
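For readers unfamiliar with RO-Crate, the sketch below builds a small `ro-crate-metadata.json` document following the RO-Crate 1.1 layout (a metadata file descriptor plus a root dataset). It is not the pos implementation: the function name, its parameters, and the way the author identifier is derived are assumptions for illustration, and a real crate would usually carry more properties, such as a license.

```python
import json


def make_ro_crate_metadata(name, author, files, date_published):
    """Build a small RO-Crate 1.1 metadata document for a results folder.

    `files` maps a relative path to a short description. How pos actually
    collects these values is not shown here.
    """
    author_id = "#" + author.lower().replace(" ", "-")
    graph = [
        {   # Metadata file descriptor: points at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: the results folder itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "datePublished": date_published,
            "author": {"@id": author_id},
            "hasPart": [{"@id": path} for path in files],
        },
        {"@id": author_id, "@type": "Person", "name": author},
    ]
    # One File entity per result file listed in hasPart.
    graph += [
        {"@id": path, "@type": "File", "description": desc}
        for path, desc in files.items()
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}
```

Writing the returned dictionary with `json.dump` into the top level of the results folder is what turns that folder into a crate.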

Challenges Encountered:

Balancing Automation with Flexibility: Ensuring that the automated generation of RO-Crates neither compromises the flexibility researchers need to customize their experiment documentation nor conflicts with the complex requirements of a testbed.
Complexity of Testbed Systems: Integrating the RO-Crate implementation into a complex system like a testbed has required deep dives into the testbed’s code base.

Despite these challenges, the progress made has been rewarding, laying a solid foundation for the next phase of the project.

Learnings and Skills Gained

Understanding the Complexity of Testbeds: One of the key learnings from this project has been the realization that testbeds are complex systems. Despite their complexity, the process became manageable thanks to well-documented software and the invaluable support of top mentors who provided detailed answers to in-depth questions. Their guidance was crucial in navigating the challenges of the project.

Open Source Development in an Educational Environment: My experience in open source development has been enriched by working within an educational context. This skill is particularly important when adapting and simplifying code to ensure that users can follow along and gain a deeper understanding of the experiments, improving the quality of research experiments.

Next Steps

As we move into the second half of this project, our primary focus will be on enhancing the portability of pos experiments to the Chameleon testbed. Key tasks include:

Fine-tune RO-Crate Implementation: Continue refining the RO-Crate integration to handle the complexities of testbed systems more effectively, including special edge cases.
Enhance Portability: Refine the integration with Trovi, ensuring seamless upload and retrieval of experiment results across testbeds.
Develop Introductory Examples: Create examples demonstrating the use of pos in various testbed environments to guide researchers.
Execute and Analyze Experiments: Design and execute a complex network experiment on both SLICES/pos and Chameleon, validating and refining portability features.

These steps are crucial to achieving our goal of making pos experiments more accessible and reproducible across different research environments.

Conclusion

Reflecting on the first half of my OSRE24 journey, I am incredibly grateful for the opportunity to work on the “Reproducible Experiment Workflows in SLICES/pos” project. The experience has been both challenging and rewarding, providing valuable insights into open-source development, machine learning techniques, and the creation of educational resources.

As we move forward, I am excited about the coming weeks. The completion of the portability enhancements and the execution of complex experiments lie ahead, marking significant milestones in our project. The skills and lessons I have acquired will guide me in future endeavors.

Data leakage in applied ML: reproducing examples from genomics, medicine and radiology

Mon, 01 Jul 2024 00:00:00 +0000

Hello everyone! I’m Shaivi Malik, a computer science and engineering student. I am thrilled to announce that I have been selected as a Summer of Reproducibility Fellow. I will be contributing to the Data leakage in applied ML: reproducing examples of irreproducibility project under the mentorship of Fraida Fund and Mohamed Saeed. You can find my proposal here.

This summer, we will reproduce studies from medicine, radiology and genomics. Through these studies, we’ll explore and demonstrate three types of data leakage:

Pre-processing on train and test sets together
Model uses features that are not legitimate
Feature selection on training and test sets

For each paper, we will replicate the published results with and without the data leakage error, and present performance metrics for comparison. We will also provide explanatory materials and example questions to test understanding. All these resources will be bundled together in a dedicated repository for each paper.
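As a small illustration of the first leakage type (pre-processing on train and test sets together), the sketch below standardizes a single feature either correctly or with the leak. The function and its `leaky` flag are hypothetical teaching aids, not code from any of the reproduced papers.

```python
import statistics


def standardize(train, test, leaky=False):
    """Scale one feature to zero mean and unit variance.

    leaky=True reproduces the mistake: the statistics are computed on train and
    test together, so information about the test set reaches the training data.
    """
    reference = train + test if leaky else train
    mean = statistics.fmean(reference)
    std = statistics.pstdev(reference)

    def scale(values):
        return [(v - mean) / std for v in values]

    return scale(train), scale(test)
```

With `leaky=False` the training values end up with zero mean, as intended; with `leaky=True` an unusual test value shifts the statistics, so the transformed training data already reflects what the test set looks like.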

This project aims to address the need for accessible educational material on data leakage. These materials will be designed to be readily adopted by instructors teaching machine learning in a wide variety of contexts. They will be presented in a clear and easy-to-follow manner, catering to a broad range of backgrounds and raising awareness about the consequences of data leakage.

Stay tuned for updates on my progress! You can follow me on GitHub and watch out for my upcoming blog posts.

FetchPipe: Data Science Pipeline for ML-based Prefetching

Tue, 25 Jun 2024 00:00:00 +0000

Hello, I’m Peiran Qin, a first-year Pre-Doctoral student in Computer Science at the University of Chicago. This summer I will focus on the project FetchPipe: Data Science Pipeline for ML-based Prefetching under the mentorship of Prof. Haryadi S. Gunawi. This is my proposal.

Caching and prefetching are integral components of modern storage systems, aimed at reducing I/O latency by using faster but less dense memory to store frequently accessed data. Traditional prefetching strategies, which rely primarily on heuristics, often fall short in performance, particularly in complex scenarios. To address such scenarios, machine learning solutions have emerged in recent years as a promising alternative, offering the ability to learn and predict complicated data access patterns. However, each existing ML prefetcher may be biased toward different scenarios and distinct evaluation metrics. There is still a need to evaluate state-of-the-art ML-based prefetchers from the literature comprehensively and fairly, under an aligned evaluation framework and with extensive performance metrics. That is my motivation for spending the summer on this interesting project!
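To give a feel for what an aligned evaluation could look like, here is a toy sketch that replays one block trace through an LRU cache with or without a prefetcher and reports the hit ratio. Everything in it (the function names, the LRU policy, the next-block heuristic used as a baseline) is an assumption for illustration rather than FetchPipe's design; an ML prefetcher would plug in as another function with the same signature.

```python
from collections import OrderedDict


def next_block_prefetcher(block):
    """Heuristic baseline: after an access to block b, prefetch b + 1."""
    return [block + 1]


def hit_ratio(trace, cache_size, prefetcher=None):
    """Replay a block trace through an LRU cache and return the hit ratio."""
    cache = OrderedDict()
    hits = 0

    def insert(block):
        cache[block] = True
        cache.move_to_end(block)       # mark as most recently used
        if len(cache) > cache_size:
            cache.popitem(last=False)  # evict the least recently used block

    for block in trace:
        if block in cache:
            hits += 1
        insert(block)
        if prefetcher is not None:
            for candidate in prefetcher(block):
                if candidate not in cache:
                    insert(candidate)
    return hits / len(trace)
```

Holding the trace, cache size, and metric fixed while swapping only the `prefetcher` argument is the point of the exercise: differences in the result can then be attributed to the prefetcher alone.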

Assessing the Computational Reproducibility of Jupyter Notebooks

Tue, 18 Jun 2024 00:00:00 +0000

Like so many authors before me, my first reproducibility study and very first academic publication started with the age-old platitude, “Reproducibility is a cornerstone of the scientific method.” My team and I participated in a competition to replicate the performance improvements promised by a paper presented at last year’s Supercomputing conference. We weren’t simply re-executing the same experiment with the same cluster; instead, we were trying to confirm that we got similar results on a different cluster with an entirely different architecture. From the very beginning, I struggled to wrap my mind around the many reasons for reproducing computational experiments, their significance, and how to prioritize them. All I knew was that there seemed to be a consensus that reproducibility is important to science and that the experience left me with more questions than answers.

Not long after that, I started a job as a research software engineer at Purdue University, where I worked heavily with Jupyter Notebooks. I used notebooks and interactive components called widgets to create a web application, which I turned into a reusable template. Our team was enthusiastic about using Jupyter Notebooks to quickly develop web applications because the tools were accessible to the laboratory researchers who ultimately needed to maintain them. I was fortunate to receive the Better Scientific Software Fellowship to develop tutorials to teach others how to use notebooks to turn their scientific workflows into web apps. I collected those and other resources and established the Jupyter4Science website, a knowledgebase and blog about Jupyter Notebooks in scientific contexts. That site aims to improve the accessibility of research data and software.

There seemed to be an important relationship between improved accessibility and reuse of research code and data and computational reproducibility, but I still had trouble articulating it. In pursuit of answers, I moved to sunny Arizona to pursue a History and Philosophy of Science degree. My research falls at the confluence of my prior experiences; I’m studying the reproducibility of scientific Jupyter Notebooks. I have learned that questions about reproducibility aren’t very meaningful without considering specific aspects such as who is doing the experiment and replication, the nature of the experimental artifacts, and the context in which the experiment takes place.

I was fortunate to have found a mentor for the Summer of Reproducibility, Tanu Malik, who shares the philosophy that the burden of reproducibility should not rest solely on domain researchers who must develop other expertise. She and her lab have developed FLINC, an application virtualization tool that improves the portability of computational notebooks. Her prior work demonstrated that FLINC provides efficient reproducibility of notebooks and takes significantly less time and space to execute and repeat notebook execution than Docker containers for the same notebooks. My work will expand the scope of this original experiment by adding more notebooks to FLINC’s test coverage and showing robustness across even more diverse computational tasks. We expect to show that infrastructural tools like FLINC improve the success rate of automated reproducibility.

I’m grateful to both the Summer of Reproducibility program managers and my research mentor for this incredible opportunity to further my dissertation research in the context of meaningful collaboration.

Exploring Reproducibility in High-Performance Computing Publications with the Chameleon Cloud

Sat, 15 Jun 2024 00:00:00 +0000

Hello everyone,

I’m Klaus Kraßnitzer and am currently finishing up my Master’s degree at the Technical University of Vienna. This summer, under the guidance of Sascha Hunold, I’m excited to dive into a project that aims to enhance reproducibility in high-performance computing research.

Our project, AutoAppendix, focuses on the rigorous evaluation and potential automation of Artifact Description (AD) and Artifact Evaluation (AE) appendices from publications submitted to this year’s Supercomputing Conference (SC). Because a sizeable share of SC publications utilize Chameleon Cloud, a platform known for its robust and scalable experiment setups, the project will focus on creating guidelines (and potentially software tools) that Chameleon Cloud users can apply to make their research more easily reproducible. You can learn more about the project and read the full proposal here.

My fascination with open-source development and research reproducibility was sparked during my undergraduate studies and further nurtured by my role as a teaching assistant. Hands-on projects and academic courses, like those in chemistry emphasizing precise experimental protocols, have deeply influenced my approach to computational science.

Project Objectives

Analyze and Automate: Assess current AE/AD appendices submitted for SC24, focusing on their potential for automation.
Develop Guidelines: Create comprehensive guidelines to aid future SC conferences in artifact submission and evaluation.
Build Tools (Conditionally): Develop automation tools to streamline the evaluation process.

The ultimate aim of the project is to work towards a more efficient, transparent, and reproducible research environment, and I’m committed to making it simpler for researchers to demonstrate and replicate scientific work. I look forward to sharing insights and progress as we move forward.

Thanks for reading, and stay tuned for more updates!

Reproducibility in Data Visualization

Fri, 14 Jun 2024 00:00:00 +0000

Hello! My name is Arya Sarkar and I will be contributing to the research project titled Reproducibility in Data Visualization, with a focus on investigating and coming up with novel solutions to capture both static and dynamic visualizations from different sources. My project is titled Investigate Solutions for Capturing Visualizations and I am mentored by Prof. David Koop.

Open source has always piqued my interest, but I often found it hard to get started as a junior in university. I spent a lot of time working with data visualizations but had never looked into the problem of reproducibility before this project. When I saw the plethora of unique and interesting projects during the contribution phase of OSRE 2024, I was confused at the beginning. However, the more I dived into this project and understood the significance of research in this domain to ensure reproducibility, the more I found myself drawn towards it. I am glad to be presented with this amazing opportunity to work in the open-source space as a researcher in reproducibility.

This project aims to investigate, augment, and/or develop solutions to capture visualizations that appear in formats including websites and Jupyter notebooks. We have a special interest in capturing the state of interactive visualizations and preserving the user interactions required to reach a certain visualization in an interactive environment to ensure reproducibility. My proposal can be viewed here!

Data leakage in applied ML: reproducing examples of irreproducibility

Fri, 14 Jun 2024 00:00:00 +0000

Hello,

I am Kyrillos Ishak, and I am happy to be part of SoR 2024. I am working on the Data leakage in applied ML: reproducing examples of irreproducibility project. My proposal was accepted.

I am excited to work with Fraida Fund and Mohamed Saeed as my mentors. The objective of the project is to develop educational resources that can be adjusted by professors/instructors to explain specific data leakage problems. This involves ensuring the reproducibility of certain research papers that contain data preprocessing issues, then fixing these issues to demonstrate how they can affect the results.

Data leakage is a problem caused when information from outside the training dataset is used to create the model. This issue can lead to overly optimistic performance estimates and, ultimately, models that do not perform well on new, unseen data.

Despite the importance of addressing data leakage, many people from fields not closely related to computer science are unfamiliar with it, even if they are aware of best practices for data preprocessing. Developing educational materials on this topic will greatly benefit them.

I am excited to dive into the topic of data leakage in machine learning. Throughout the summer, I will be sharing regular updates and insightful blog posts on this subject. Stay tuned for more information!

Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Fri, 14 Jun 2024 00:00:00 +0000

Hello, I am Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University, specializing in Artificial Intelligence. This summer, I will be working on the project AIIO / Graph Neural Network under the mentorship of Bin Dong and Suren Byna.

High-Performance Computing (HPC) applications often face performance issues due to I/O bottlenecks. Manually identifying these bottlenecks is time-consuming and error-prone. My project aims to enhance the AIIO framework by integrating a Graph Neural Network (GNN) model to automatically diagnose I/O performance bottlenecks at the job level. This involves developing a comprehensive data pre-processing pipeline, constructing and validating a tailored GNN model, and rigorously testing the model’s accuracy using test cases from the AIIO dataset.

Through this project, I seek to provide a sophisticated, AI-driven approach to understanding and improving I/O performance in HPC systems, ultimately contributing to more efficient and reliable HPC applications.

StatWrap: Automated Reproducibility Checklists Generation

Fri, 14 Jun 2024 00:00:00 +0000

Namaste🙏🏻! I am Adi Akhilesh Singh, currently pursuing a degree in Computer Science and Engineering at IIT(BHU). This summer, I will be working on the StatWrap: Automated Reproducibility Checklists Generation project under the mentorship of Luke Rasmussen. You can view my project proposal for more details.

My project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.

Stay tuned for updates on my progress in the coming weeks! 🚀

LLM Assistant for OpenROAD - Data Engineering and Testing

Thu, 13 Jun 2024 00:00:00 +0000

Hello! My name is Aviral Kaintura, and I will be contributing to OpenROAD, a groundbreaking open-source toolchain for digital integrated circuit automation (RTL to GDSII) during GSoC 2024.

My project, LLM Assistant for OpenROAD - Data Engineering and Testing, is jointly mentored by Indira Iyer and Jack Luar.

The aim of this project is to develop a chat assistant to improve the user experience with OpenROAD. My focus will be on developing a well-curated dataset from OpenROAD’s knowledge base. This dataset will be fundamental for another project led by Palaniappan R, which involves building the chatbot’s architecture. It will be used for training and validating the model and ensuring efficient context retrieval to generate accurate user responses, aiding in troubleshooting, installation, and other common issues to reduce the maintainers’ workload.

In addition to dataset creation, I will be working on testing and evaluation. This includes developing metrics for model evaluation, incorporating both human and automated techniques.

Our human evaluation framework will utilize chatbot feedback for valuable insights, enhancing the model and dataset. An automated batch testing application is also used to further enhance the evaluation process.

Here is an early build of the evaluation framework.

By leveraging advanced data engineering and testing methodologies, we aim to build an assistant that combines high accuracy with optimal response times. Additionally, we will collaborate with research teams at NYU and ASU to contribute to the research on AI-based chat assistants for electronic design automation.

I am thrilled to be part of this journey and look forward to making a meaningful impact on the OpenROAD project.

Stay tuned for more updates on the project!

Reproducibility in Data Visualization

Thu, 13 Jun 2024 00:00:00 +0000

Hello everyone!

I’m Triveni, a Master’s student in Computer Science at Northern Illinois University (NIU). When I came across the OSRE 2024 project Categorize Differences in Reproduced Visualizations, focusing on data visualization reproducibility, I was excited because it aligned with my interest in data visualization. While my initial interest was in geospatial data visualization, the project’s goal of ensuring reliable visualizations across all contexts really appealed to me. So, I actively worked on understanding the project’s key concepts and submitted my proposal (which can be viewed here) to join the project under the mentorship of David Koop.

Early Steps and Challenges:

I began working on the project on May 27th, three weeks ago. Setting up the local environment initially presented some challenges, but I persevered and successfully completed the setup process. The past few weeks have been spent exploring the complexities of reproducibility in visualizations, particularly focusing on capturing the discrepancies that arise when using different versions of libraries to generate visualizations. Working with Dr. David Koop as my mentor has been an incredible experience. Our weekly report meetings keep me accountable and focused. While exploring different algorithms and tools to compare visualizations can be challenging at times, it’s a fantastic opportunity to learn cutting-edge technologies and refine my problem-solving skills.

Looking Ahead:

I believe this project can make a valuable contribution to the field of reproducible data visualization. By combining automated comparison tools with a user-centric interface, we can empower researchers and data scientists to make informed decisions about the impact of visualization variations. In future blog posts, I’ll share more about the specific tools and techniques being used, and how this framework will contribute to a more reliable and trustworthy approach to data visualization reproducibility.

Stay tuned!

I’m excited to embark on this journey and share my progress with all of you.

Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon

Wed, 12 Jun 2024 00:00:00 +0000

About the project:

How it all started

This journey began amidst our college’s cultural fest, in which I was participating, just 15 days before the proposal submission deadline. Many of my friends had been working for months to get selected for GSoC. I didn’t think I could participate this year because I was late, so I thought, “Better luck next year.” But during the fest, I kept hearing about UC OSPO and that a senior had been selected within a month. So, I was in my room when my friend told me, “What’s the worst that can happen? Just apply,” and so I did. I chose this project and wrote my introduction in Slack without knowing much. After that, it’s history. I worked really hard for the next 10 days learning about the project, making the proposal, and got selected.

First few weeks:

I started the project a week early from June 24, and it’s been two weeks since. The start was a bit challenging since it required setting up a lot of things on my local machine. For the past few weeks, the majority of my time has been dedicated to learning about COMPSs, RO-Crate, and Chameleon, the three technologies this project revolves around. The interaction with my mentor has also been great. From the weekly report meetings to my daily bombardment of questions, he has been really helpful. It is my first time working with Chameleon or any cloud computing software, so it can be a bit overwhelming sometimes, but it is getting better with practice.

Stay tuned for progress in the next blog!!

FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I’m Lihaowen (Jayce) Zhu, currently pursuing my Master of Science in Computer Science at the University of Chicago. I will be spending my summer working on the project FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning under the mentorship of Yuyang (Roy) Huang and Swami Sundararaman. Here is my proposal.

The landscape of machine learning (ML) is profoundly impacted by the initial stages of feature engineering and data preprocessing. This phase, critical for the success of ML projects, is often the most time-consuming, representing about 80% of the effort in typical ML workflows. The FEP-Bench project proposes to address the significant bottlenecks encountered during this phase, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the required resources of ML projects.

First Steps in Enhancing User Experience Reproducibility through TROVI Redesign

Wed, 12 Jun 2024 00:00:00 +0000

Hello! My name is Alicia Esquivel Morel, and I’m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I’m working on a project to improve user experience reproducibility through a redesign of TROVI, as part of the Summer of Reproducibility (SoR) program. I’m excited to be working with two fabulous mentors: Kate Keahey and Mark Powers.

Research Reproducibility with a TROVI Redesign

Researchers constantly face challenges replicating experiments due to limitations in current tools. TROVI, a platform designed to facilitate experiment replication, can be hindered by hard-to-follow interfaces and difficulties integrating code and data. This leads to confusion and frustration.

The Redesign’s Goals

Enhanced User Experience: Inspired by user-friendly platforms like Google Colab, we’ll simplify TROVI’s interface for intuitive navigation and ease of use.
Uploads and Sharing: Uploading code and data, as well as collaborating with researchers are key goals. Integration with platforms like GitHub will further streamline collaboration.
Continuous Improvement: A built-in feedback loop will allow users to provide input and suggestions, ensuring TROVI constantly evolves based on user needs.

The Road Ahead

We’re at the beginning of the redesign process. In the next blog post, I’ll describe the project’s specific goals and the deliverables you can expect.

Stay tuned to see how TROVI is built for reproducible research!!

FSA: Benchmarking Fail-Slow Algorithms

Wed, 12 Jun 2024 00:00:00 +0000

Hi everyone! I’m Xikang, a master’s student in CS at UChicago. As part of the FSA Benchmarking project, I’m thrilled to be a contributor to OSRE 2024, collaborating with Kexin Pei, an Assistant Professor of Computer Science at UChicago, and Ruidan, a talented PhD student at UChicago.

This summer, I will focus on integrating advanced ML techniques into our RAID slowdown analysis. Our aim is to assess whether LLMs can effectively identify RAID slowdown issues and to benchmark their performance against our current machine learning algorithms. We will test and benchmark the algorithms on Chameleon Cloud.

Additionally, we will explore optimization techniques to enhance our pipeline and improve response quality. We hope this research will be a starting point for future work, utilizing LLMs to overcome the limitations of existing algorithms and provide a comprehensive analysis that enhances RAID and other storage system performance.

I’m excited to work with all of you and look forward to your suggestions. If you are interested, here is my proposal.

ML-Powered Problem Detection in Chameleon

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I am Syed Mohammad Qasim, a PhD candidate in Electrical and Computer Engineering at Boston University. I will be spending my summer working on the project ML-Powered Problem Detection in Chameleon under the mentorship of Ayse Coskun and Michael Sherman.

Currently, Chameleon Cloud monitors sites at the Texas Advanced Computing Center (TACC), University of Chicago, Northwestern University, and Argonne National Lab. They collect metrics using Prometheus at each site and feed them all to a central Mimir cluster. All the logs go to a central Loki, and Grafana is used to visualize and set alerts. Chameleon currently collects around 3000 metrics. Manually reviewing and setting alerts on them is time-consuming and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service that can augment the existing alerting framework.
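The post does not say which detection method the service will use. Purely as an illustration of what flagging an anomaly in a single metric can mean, the sketch below applies a rolling z-score to a series of samples; the function name, window size, and threshold are all assumptions, and a production service would likely use something more robust.

```python
from collections import deque
import statistics


def zscore_anomalies(samples, window=30, threshold=3.0):
    """Return the indices of samples that deviate strongly from the recent past.

    Each sample is compared with the mean and standard deviation of the
    previous `window` samples; both parameters are illustrative defaults.
    """
    recent = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(recent) == window:  # wait until a full window of history exists
            mean = statistics.fmean(recent)
            std = statistics.pstdev(recent)
            if std > 0 and abs(value - mean) / std > threshold:
                flagged.append(i)
        recent.append(value)
    return flagged
```

The appeal of an approach like this over hand-written alerts is that it needs no per-metric threshold, which matters when there are around 3000 metrics to cover.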

OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I’m Jiajun Mao, a BS/MS student at the University of Chicago studying Computer Science. I will be spending this summer working on the project OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS under the mentorship of Meng Wang and Anjus George. Here is my proposal.

How to increase data durability and reliability while decreasing storage cost has always been an interesting research topic. In recent years, erasure-coded storage systems have been seen as strong candidates to replace replication for colder storage tiers. In the paper “Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers”, the authors used theory and simulation to explore how a multi-level erasure-coded system can outperform systems using single-level erasure codes in areas such as encoding throughput and network bandwidth consumed for repair, addressing a few pain points in adopting erasure-coded storage systems. I will implement the design studied in this paper by building on top of HDFS and ZFS, and benchmark the resulting system’s performance.
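To make the storage-cost argument concrete, the sketch below compares the raw-to-usable storage ratio of replication with that of a two-level erasure code. It assumes the usual accounting in which a (k + p) code stores k data chunks plus p parity chunks, and in which stacking a network-level code on a local-level code multiplies the two ratios. The code parameters are hypothetical examples, not the configuration this project will use.

```python
from fractions import Fraction


def ec_overhead(k, p):
    """Raw-to-usable storage ratio of a single (k + p) erasure code."""
    return Fraction(k + p, k)


def mlec_overhead(network, local):
    """Overhead when a network-level code is stacked on a local-level code."""
    (kn, pn), (kl, pl) = network, local
    return ec_overhead(kn, pn) * ec_overhead(kl, pl)


# Hypothetical parameters: (10 + 2) across racks, (17 + 3) inside an enclosure.
network_code = (10, 2)
local_code = (17, 3)

print(float(ec_overhead(1, 2)))                         # 3-way replication: 3.0
print(float(mlec_overhead(network_code, local_code)))   # 24/17, about 1.41
```

With these example parameters the two-level code stores about 1.41 bytes per usable byte, against 3 bytes for three-way replication.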

The project aims to achieve the following:

HDFS understands the underlying characteristics of ZFS as the filesystem.
HDFS understands failure reports from ZFS and uses new, MLEC-specific repair logic to execute parity repair.
ZFS can accept repair data from HDFS to repair a suspended pool caused by catastrophic data corruption.

Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster

Wed, 12 Jun 2024 00:00:00 +0000

Hi! I’m Martin, and I will be working on Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster under the mentorship of In Kee Kim. Our work is driven by the scale of the computing systems that host data commons. We believe that performance characterization of genomics workloads should be done rapidly and at a scale similar to production settings. Feel free to check our proposal for more details!

We propose GenScale, a genomics workload benchmarking tool that can achieve both the scale and the speed necessary for characterizing performance in large-scale settings. GenScale will be built on top of an industrial-grade cluster manager (e.g. Kubernetes) and metrics collection and monitoring systems (e.g. Prometheus), and will support a comprehensive set of applications used in state-of-the-art genomics workflows. The initial version developed during this project will include DNA and RNA alignment workflows.

Finally, we believe that open access and reproducible research will greatly accelerate the pace of scientific discovery. We aim to package our artefacts and generated datasets in ways that make them easy to replicate, analyze, and build upon. I personally look forward to learning from and contributing to the open source community!

Developing a Pipeline to Benchmark Drift Management Strategies

Mon, 10 Jun 2024 00:00:00 +0000

With guidance from mentors Ray Andrew Sinurat and Sandeep Madireddy under the LAST project, I aim to develop a pipeline to benchmark the efficacy of various drift management algorithms.

Despite the abundance of literature on this subject, reproducibility remains a challenge due to the lack of available source code. By crafting this pipeline, I aim to create a standardized platform for researchers and practitioners to compare several state-of-the-art drift management approaches. Through rigorous testing and benchmarking, we seek to identify the most effective algorithms across a spectrum of drift scenarios, including gradual, sudden, and recurring drift.
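The three drift scenarios named above can be sketched as schedules that give, for each time step, the probability that a sample is drawn from the new concept rather than the old one. The function names, parameters, and the two Gaussian "concepts" below are assumptions chosen for illustration; the pipeline itself may model drift differently.

```python
import random


def new_concept_weight(t, kind, start=100, width=50, period=40):
    """Probability that the sample at time t comes from the new concept."""
    if t < start:
        return 0.0
    if kind == "sudden":
        return 1.0  # switch completely at `start`
    if kind == "gradual":
        return min(1.0, (t - start) / width)  # linear ramp over `width` steps
    if kind == "recurring":
        # Alternate between the new and the old concept every `period` steps.
        return 1.0 if ((t - start) // period) % 2 == 0 else 0.0
    raise ValueError(f"unknown drift kind: {kind}")


def sample_stream(kind, length=200, seed=0):
    """Yield (t, value); values follow N(0, 1) before drift and N(5, 1) after."""
    rng = random.Random(seed)
    for t in range(length):
        use_new = rng.random() < new_concept_weight(t, kind)
        yield t, rng.gauss(5.0 if use_new else 0.0, 1.0)


for t in (99, 100, 125, 140, 180):
    print(t, [new_concept_weight(t, k) for k in ("sudden", "gradual", "recurring")])
```

Keeping the schedule separate from the sampling step means the same benchmark data can be replayed under each scenario by changing only `kind`.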

The final deliverable of this pipeline will be packaged as a Chameleon Trovi artifact. The pipeline will also be made easily extensible to cater to additional datasets or any custom-made drift-mitigation methods. This is my proposal for the project.

See you around!

Reproducing and benchmarking scalability bugs hiding in cloud systems

Mon, 10 Jun 2024 00:00:00 +0000

Hello there!

I am Shuang Liang, a third-year student studying Computer and Information Science at The Ohio State University. My passion lies in cloud computing and high-performance computing, areas I have explored extensively during my academic journey. I have participated in various projects and competitions, which have honed my technical skills and deepened my interest in distributed systems.

As part of the ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems project, my proposal, under the mentorship of Professor Yang Wang and Bogdan "Bo" Stoica, aims to tackle the critical challenges posed by scalability bugs in systems like Cassandra, HDFS, and Hadoop. These bugs can lead to severe operational issues such as system downtime and data loss, particularly as systems scale up.

The project goals include systematically analyzing and documenting scalability bugs, developing protocols to effectively trigger and quantify the impact of these bugs, and creating reproducible artifacts and detailed investigation scripts to aid in bug analysis.

Our project will involve rigorous bug report analysis, reproduction of scalability bugs, and a comparative study of system behaviors before and after bug fixes. We aim to develop methodologies that enhance the reliability and performance of large-scale distributed systems, providing valuable insights and resources to the open-source community.

Stay tuned to explore the future of reliable and scalable distributed systems!

BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking

Sun, 09 Jun 2024 00:00:00 +0000

Hello! My name is Qianru, and I will be working on a project to improve spatial transcriptomics during Google Summer of Code 2024. My project, Benchmarking Gene Imputation Methods for Spatial Transcriptomics, is mentored by Ziheng Duan and Cormac Flanagan. The goal is to create a standard platform to evaluate methods for filling in missing gene data, which is a big challenge in spatial transcriptomics. My proposal can be viewed here!

Spatial transcriptomics lets us see where genes are active in tissues, giving us insight into how cells interact in their natural environment. However, current methods often miss some gene data, making it hard to get a complete picture. Gene imputation can help fill in these gaps.

My project will:

Create a benchmark dataset to standardize gene imputation tasks across different platforms, species, and organs.

Compare various gene imputation methods to see how well they work in different scenarios.

Develop a user-friendly Python package with tools for gene imputation to help researchers improve their data.

I’m excited to contribute to this project and help advance the field of spatial transcriptomics by making data analysis more accurate and comprehensive.

FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Mon, 03 Jun 2024 00:00:00 +0000

Project Idea Description

Topics: Storage systems, machine learning
Skills: Python, PyTorch, Bash scripting, Linux, Machine Learning modeling
Difficulty: Hard
Size: Large (350 hours)
Mentors: Yuyang (Roy) Huang (primary contact), Swami Sundararaman
Contributor(s): Lihaowen (Jayce) Zhu

In the realm of machine learning (ML), preprocessing of data is a critical yet often underappreciated phase, consuming approximately 80% of the time in common ML tasks. This extensive time consumption can be attributed to various challenges encountered from both data and computation perspectives.

From the data side, one significant challenge is the slow retrieval of data from data lakes, which are storage repositories that hold a vast amount of raw data in its native format. However, the process of extracting this data can be slow, causing computation cycles to wait for data arrival and leading to delays in the entire preprocessing phase. Furthermore, the size of the data often exceeds the memory capacity of standard computing systems. This is a frequent occurrence in ML, as datasets are typically large and complex. Handling such large datasets requires sophisticated memory management techniques to ensure efficient preprocessing without overwhelming the system’s memory.

On the computation side, a naive solution to data operations, especially aggregation, often leads to inefficiencies. These operations may require grouping a large chunk of data as a prerequisite before performing any actual computation. This grouping, without careful configuration and management, can trigger serious data shuffling, leading to extensive remote data movement when the data is distributed across various storage systems. Such data movement is not only time-consuming but also resource-intensive.
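One common way to limit the shuffling described above is to aggregate inside each partition before any data crosses the network, which works for associative and commutative operations such as sums and counts. The sketch below simulates this on in-memory lists and counts how many records would be moved; the function names and sample data are hypothetical, and no particular framework's API is used.

```python
from collections import defaultdict


def shuffle_naive(partitions):
    """Send every (key, value) record to the reducer; count records moved."""
    moved = sum(len(part) for part in partitions)
    totals = defaultdict(int)
    for part in partitions:
        for key, value in part:
            totals[key] += value
    return dict(totals), moved


def shuffle_with_partial_aggregation(partitions):
    """Sum per key inside each partition first, then send one record per key."""
    moved = 0
    totals = defaultdict(int)
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        moved += len(local)  # only the per-key subtotals leave the partition
        for key, subtotal in local.items():
            totals[key] += subtotal
    return dict(totals), moved


# Two hypothetical partitions living on different storage nodes.
partitions = [
    [("a", 1), ("a", 2), ("b", 5), ("a", 3)],
    [("b", 1), ("b", 1), ("b", 1), ("a", 4)],
]
print(shuffle_naive(partitions))                     # ({'a': 10, 'b': 8}, 8)
print(shuffle_with_partial_aggregation(partitions))  # ({'a': 10, 'b': 8}, 4)
```

Both variants produce the same totals, but the second moves one record per key per partition instead of every record, so the saving grows with the number of repeated keys.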

To mitigate these challenges, there is a pressing need to design better caching, prefetching, and heuristic strategies for data preprocessing. The team aims to significantly reduce the time and resources required for preprocessing by optimizing data retrieval and computational processes.

However, prior to the design and implementation of such a system, a systematic understanding of the preprocessing workflow is essential. Hence, throughout the program, the students will need to:

Understand the current system used to preprocess data for ML training, for example, Hadoop or Spark.
Collect the common datasets used for different types of ML models.
Collect the typical operations used for preprocessing these datasets.
Benchmark the performance of these operations in the existing frameworks under various experimental settings.
Package the benchmark such that the team can later use it for reproduction or evaluation.

Project Deliverable

A rolodex of commonly used datasets with their corresponding preprocessing operations and expected output formats/types
A Chameleon Trovi package that preprocesses the datasets with a single-machine preprocessing framework like pandas
A Chameleon Trovi package that preprocesses the datasets in an existing distributed computation framework like Hadoop or Spark

SLICES/pos: Reproducible Experiment Workflows

Fri, 17 May 2024 00:00:00 +0000

Servus everyone!

I’m Kilian Warmuth, currently pursuing my M.Sc. in Computer Science at the Technical University of Munich (TUM) after completing my B.Sc. in Computer Science at the same institution. Throughout my academic education, I have taken courses in Advanced Computer Networks, which have deepened my understanding and expertise in the field. I was involved in an interdisciplinary project where I created a testing toolchain for the packet generator MoonGen using the SLICES/pos testbed. This experience provided me with extensive hands-on exposure to pos, increasing my interest in reproducible testbeds and the enhancement of pos.

As part of the SLICES/pos: Reproducible Experiment Workflows project, my proposal, under the mentorship of Sebastian Gallenmüller, Kate Keahey, and Georg Carle, aims to address the challenges of managing experiment results within the pos framework.

The project leverages the RO-Crate open standard to organize result data systematically, enhancing accessibility and comprehensibility of research findings. We aim to improve experiment documentation for the pos testbed, providing clear setup and execution instructions to ensure reproducibility. To that end, we will simplify the dissemination of research findings by automating the creation of RO-Crates, allowing researchers to focus on experiment design without needing to be familiar with RO-Crate standards. Implementing these standards will enhance the sharing of results by automating publication processes for open repositories, promoting transparency and collaboration.
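To give a sense of what automating RO-Crate creation involves, the sketch below builds a minimal `ro-crate-metadata.json` document following the RO-Crate 1.1 layout: a metadata descriptor entity, a root dataset, and one entity per file. The helper name, the experiment name, and the file paths are hypothetical, and this is not how pos itself generates crates. A complete crate would normally also carry a license and richer provenance.

```python
import json


def build_ro_crate_metadata(name, description, date_published, files):
    """Return a minimal RO-Crate 1.1 metadata document as a dict."""
    graph = [
        {
            # Metadata descriptor: points at the root dataset of the crate.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset: describes the experiment results as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "hasPart": [{"@id": path} for path in files],
        },
    ]
    # One File entity per result file listed in hasPart.
    graph.extend({"@id": path, "@type": "File"} for path in files)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Hypothetical result files from one experiment run.
metadata = build_ro_crate_metadata(
    name="Example throughput experiment",
    description="Results of a hypothetical packet generator run.",
    date_published="2024-05-17",
    files=["results/run1.csv", "setup.sh"],
)
print(json.dumps(metadata, indent=2))
```

Writing the returned dict to `ro-crate-metadata.json` in the results directory is what turns that directory into a crate.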

We also aim to enhance the portability of experiments across different testbeds, with a particular focus on the Chameleon Testbed. We will develop introductory examples demonstrating how to use pos in various testbed environments. Additionally, we will design and execute a portable complex network experiment based on SLICES/pos. To validate the portability enhancements, we will perform experiments on the Chameleon testbed. Finally, we will refine the portability of pos experiments within Chameleon to ensure seamless execution.

Stay tuned to explore the future of reproducible testbeds!

HDEval: Benchmarking LLMs that Generate Verilog/Chisel Modules From Natural Language

Tue, 14 May 2024 00:00:00 +0000

Hi everyone!

I’m Ashwin Bardhwaj, currently pursuing a bachelor’s in Electrical Engineering and Computer Science at UC Berkeley. I was recently involved in a project to implement a secure hardware encryption enclave in Verilog. That’s why I was excited to work with the MASC group to evaluate how existing generalized LLMs (such as ChatGPT 4 or StarCoder) can generate accurate Verilog/Chisel code from English and assist in the hardware development process.

As part of Micro Architecture Santa Cruz (MASC), my proposal, under the mentorship of Jose Renau and Sakshi Garg, looks to create a suite of benchmark programs for HDEval.

The deliverable of this project is to create multiple large HDL benchmarks along with a respective set of prompts. Using yosys to implement Logic Equivalence Check, we are able to prove through formal verification that the generated code will exhibit the same behavior as the benchmark. In addition, we can also consider the performance and resource utilization of the generated code as a metric.
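The idea behind an equivalence check can be illustrated on a small combinational circuit by comparing a reference and a candidate on every possible input. The sketch below does this by brute force in Python for a full adder. It is only a toy stand-in: Yosys works on the HDL netlists themselves and uses formal methods rather than enumeration, and all function names here are hypothetical.

```python
from itertools import product


def equivalent(reference, candidate, num_inputs):
    """Exhaustively compare two combinational functions over all input bits."""
    for bits in product((0, 1), repeat=num_inputs):
        if reference(*bits) != candidate(*bits):
            return False, bits  # first counterexample found
    return True, None


def full_adder_reference(a, b, cin):
    """Behavioral benchmark: add the three bits, return (sum, carry)."""
    total = a + b + cin
    return total & 1, total >> 1


def full_adder_generated(a, b, cin):
    """Gate-level form, standing in for code produced from a prompt."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def full_adder_buggy(a, b, cin):
    """Faulty variant whose carry ignores the carry-in."""
    return a ^ b ^ cin, a & b


print(equivalent(full_adder_reference, full_adder_generated, 3))  # (True, None)
print(equivalent(full_adder_reference, full_adder_buggy, 3))      # (False, (0, 1, 1))
```

Enumeration scales as 2^n in the number of input bits, which is why large benchmarks need a formal tool instead.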

(Re)Evaluating Artifacts for Understanding Resource Artifacts

Wed, 20 Mar 2024 00:00:00 +0000

Project Idea Description

Topics: Virtualization, Containerization, Profiling, Reproducibility
Skills: C, Python, DevOps experience
Difficulty: Medium
Size: Large; 350 hours
Mentors: Tanu Malik

This project aims to characterize computer science-related artifacts that are either submitted to conferences or deposited in reproducibility hubs such as Chameleon. We aim to characterize experiments into different types and understand the reproducibility requirements of this rich data set, possibly leading to a benchmark. We will then study packaging requirements, especially those of distributed experiments, and aim to instrument a package archiver to reproduce a distributed experiment. Finally, we will use the learned experiment characteristics to develop a classifier that determines alternative resources where an experiment can be easily reproduced.

Project Deliverable

Specific tasks include:

A pipeline consisting of a set of scripts to characterize artifacts.
Packaged artifacts and an analysis report with open-sourced data about the best guidelines to package using Chameleon.
A classifier system based on artifact and resource characteristics.

Auto Appendix

Mon, 11 Mar 2024 14:48:10 +0000

The SC Conference Series, a leading forum on High Performance Computing (HPC), supports scientific rigor through enhanced reproducibility of accepted papers. To that end, all manuscripts submitted to the SC Technical Papers program must contain an Artifact Description (AD). Authors of accepted papers may request reproducibility badges, for which an Appendix describing the Artifact Evaluation (AE) is required.

In recent years, Chameleon has facilitated SC’s reproducibility initiative by enabling authors to develop and share computational, reproducible artifacts through the Chameleon cloud. The Chameleon platform helps authors and reviewers to easily share computational artifacts, which are included in the papers’ artifact appendices.

The proposed project aims to assess all AD/AE appendices submitted for reproducibility badge requests. This evaluation will focus on AD/AE appendices that utilized the Chameleon cloud as the execution platform, examining their potential for automation. Our aim is to evaluate the feasibility of fully automating various components of the appendices. Students will engage directly with the chairs of the SC24 Reproducibility Initiative in this effort.

Advancing SC Conference Artifact Reproducibility via Automation

Topics: Reproducibility, Reproducible Research, Artifact Evaluation, Open Science
Skills: HPC, Cloud computing, Chameleon, MPI, OpenMP, CUDA
Difficulty: Difficult
Size: Large
Mentors: Sascha Hunold
Tasks:
- Perform an analysis of the current limitations of AD/AE appendices submitted for Artifact Evaluation.
- Re-run the computational artifacts to identify areas for enhancement, with a primary objective of achieving full automation of Artifact Evaluation using the Chameleon cloud.
- Evaluate the existing automation capabilities of the Chameleon cloud.
- Develop a set of recommendations for structuring Computational Artifacts, aimed at benefiting future SC conferences.

ML-Powered Problem Detection in Chameleon

Wed, 06 Mar 2024 16:33:57 -0600

Today’s Continuous Integration/Continuous Delivery (CI/CD) trends encourage rapid design of software using a wide range of software components, followed by frequent updates that are immediately deployed on the cloud. The complexity of cloud systems, along with the component diversity and break-neck pace of development, amplifies the difficulty of identifying or fixing problems related to performance, resilience, and security. Furthermore, existing approaches that rely on human experts (e.g., methods involving manually written rules or scripts) have limited applicability to modern CI/CD processes, as they are fragile, costly, and often not scalable. Consequently, there is growing interest in applying machine learning (ML) based methods for identifying vulnerabilities in code, non-compliant or otherwise problematic software, and resilience problems in systems and networks. However, despite some success stories in applying AI for cloud operations (e.g., in resource management), much of cloud operations still relies on human-centric methods, which require updates as the cloud undergoes CI/CD cycles. The goal of this summer project is to explore methods of automation for the Chameleon Cloud to enable faster detection and diagnosis of problems. Overall, the project will contribute to an overarching vision of building an infrastructure that collects and synthesizes cross-layer data from large-scale cloud systems, applying ML-powered methods to automate cloud ops, and, further, making this data available to researchers through coherent APIs and analytics engines.

Currently, Chameleon uses runbooks as manual guides for operational tasks, including routine maintenance and troubleshooting. However, these traditional runbooks often fall short in dynamic and fast-paced CI/CD environments, as they lack the flexibility to adapt to changes in software versions, deployment configurations, and the unique challenges of emerging issues. To overcome these challenges, the project will leverage ML to automate anomaly detection based on telemetry data collected from Chameleon Cloud’s monitoring frameworks. This method will not only facilitate rapid identification of performance anomalies but also enable automated generation of runbooks. These runbooks can then offer operators actionable steps to resolve issues efficiently, thereby making the anomaly mitigation process more efficient. Furthermore, this approach supports the automatic creation of targeted runbooks for newly generated support tickets, enhancing response times and system reliability.

Time permitting, using a collection of automated runbooks (each targeting a specific problem), we will analyze support tickets, common problems, and their frequency to offer insights and suggestions that help Chameleon Cloud plan its roadmap and get the best return on investment from fixing problems.

A key aspect of this summer project is enhancing the reproducibility of experiments in the cloud and improving data accessibility. We plan to design infrastructures and APIs so that the telemetry data that is essential for anomaly detection and automated runbooks is systematically documented and made available. We also aim to collect and share insights and modules on applying ML for cloud operations, including ML pipelines, data labeling strategies, data preprocessing techniques, and feature engineering. By sharing these insights, we aim to promote best practices and support reproducible experiments on public clouds, thus fostering future ML-based practices within the Chameleon Cloud community and beyond. Time permitting, we will explore applying lightweight privacy-preserving approaches on telemetry data as well.

Topics: Machine Learning, Anomaly Detection, Automated Runbooks, Telemetry Data
Skills:
- Proficiency in Machine Learning: Understanding of ML algorithms for anomaly detection and automation.
- Cloud Computing Knowledge: Familiarity with CI/CD environments and cloud architectures.
- Programming Skills: Proficiency in languages such as Python, especially in cloud and ML contexts.
- Data Analysis: Ability to analyze telemetry data using data analytics tools and libraries.
Difficulty: Hard
Size: Large
Mentors: Michael Sherman

ReproNB: Reproducibility of Interactive Notebook Systems

Mon, 26 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: HPC, MPI, distributed systems
Skills: C++, Python
Difficulty: Difficult
Size: Large; 350 hours
Mentors: Tanu Malik

Notebooks have gained wide popularity in scientific computing. A notebook is both a web-based interactive front-end to program workflows and a lightweight container for sharing code and its output. Reproducing notebooks in different target environments, however, is a challenge. Notebooks do not share the computational environment in which they are executed. Consequently, despite being shareable, they are often not reproducible. We have developed FLINC (see also the eScience'22 paper) to address this problem. However, it currently does not support all forms of experiments, especially HPC experiments. In this project we will extend FLINC to HPC experiments. This will involve using record-and-replay mechanisms such as ReMPI and rr within FLINC.

Project Deliverable

The project deliverable will be a set of HPC experiments that are packaged with FLINC and available on Chameleon.

SciStream-Rep: An Artifact for Reproducible Benchmarks of Scientific Streaming Applications

Mon, 26 Feb 2024 00:00:00 +0000

SciStream is a framework and toolkit that attempts to tackle the problem of enabling high-speed (100+ Gbps), memory-to-memory data streaming in scientific environments. This task is particularly challenging because data producers (e.g., data acquisition applications on scientific instruments, simulations on supercomputers) and consumers (e.g., data analysis applications) may be in different security domains and thus require bridging of those domains. Furthermore, either producers, consumers, or both may lack external network connectivity and thus require traffic forwarding proxies. If you want to learn more, please take a look at our HPDC'22 paper.

SciStream-Rep: An Artifact for Reproducible Benchmarks of Scientific Streaming Applications

Project Idea Description:

Topics: Network Performance Testing, Benchmarking, Data Streaming, Reproducibility
Skills: Python, Scripting, Linux, Containers, Networking, benchmark tools
Difficulty: Medium
Size: Large (350 hours)
Mentors: Joaquin Chung, Flavio Castro

This project focuses on expanding the scope of testing SciStream’s architecture by incorporating a variety of traffic patterns based on real scientific applications. The goal is to understand how different traffic patterns influence the performance of memory-to-memory data streaming in scientific scenarios by creating artifacts for reproducible experiments. Additionally, the project will explore the use of different forwarding elements, such as Nginx and HAProxy, to assess their impact on data streaming efficiency and security.

Reproducibility is especially difficult in shared network environments such as the Chameleon and FABRIC testbeds. We can expect similar results from two identical experiments only when the network conditions (external to our traffic) are similar for both. By creating reproducible artifacts for Chameleon and FABRIC, we make it possible for other researchers to repeat the measurements many times and so build statistical confidence in the results.
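One simple way to express the statistical confidence gained from repetitions is a confidence interval around the mean of the repeated measurements. The sketch below uses a normal approximation from the Python standard library; the function name and the throughput numbers are hypothetical, and with only a handful of repetitions a Student's t interval would be the more careful choice.

```python
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, level=0.95):
    """Normal-approximation confidence interval for the mean of repeated runs."""
    if len(samples) < 2:
        raise ValueError("need at least two repetitions")
    center = mean(samples)
    # Two-sided critical value, e.g. about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * stdev(samples) / len(samples) ** 0.5
    return center - half_width, center + half_width


# Hypothetical throughput (Gbps) from repeated runs of the same experiment.
throughput_gbps = [92.1, 88.7, 95.3, 90.4, 91.8, 89.9, 93.2, 90.6]
low, high = confidence_interval(throughput_gbps)
print(f"mean={mean(throughput_gbps):.2f} Gbps, 95% CI=({low:.2f}, {high:.2f})")
```

The interval narrows in proportion to the square root of the number of repetitions, which is why runs contributed by other researchers are valuable.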

The Specific Tasks of the Project Include:

Developing a set of benchmarks to measure the performance of scientific streaming applications across a broader range of traffic patterns.
Creating a set of artifacts for generating traffic patterns typical of data streaming applications.
Deploying various forwarding elements within the SciStream architecture for the Chameleon and FABRIC testbeds.
Compiling a best practices document detailing the optimal configurations for SciStream.

SciStream-LB: A Dynamic Load Balancing Solution Using Programmable Network Devices

Project Idea Description:

Topics: Network Performance Testing, Data Streaming, Reproducibility, Programmable Data Planes
Skills: Python/Scripting, Linux, Docker/Containers, Networking fundamentals, Experience with OpenFlow/P4 programming
Difficulty: Difficult
Size: Large (350 hours)
Mentors: Joaquin Chung, Flavio Castro

The aim of this project is to create a specialized forwarding element using OpenFlow (OF) or the P4 programming language, tailored to enhance the SciStream data plane. This new development seeks to enable a more flexible and hardware-based (and therefore more efficient) alternative to conventional software-based forwarding mechanisms like NGINX or HAProxy, specifically designed to support the needs of high-performance data streaming environments for scientific applications. The OF/P4 forwarding elements will be packaged as artifacts for reproducibility experiments in the Chameleon and FABRIC testbeds. Reproducibility is especially difficult in shared network environments such as these testbeds. We can expect similar results from two identical experiments only when the network conditions (external to our traffic) are similar for both. By creating reproducible artifacts for Chameleon and FABRIC, we make it possible for other researchers to repeat the measurements many times and so build statistical confidence in the results.

Specific tasks of the project include:

Design and implementation of an OF/P4-based forwarding element that can be seamlessly integrated with the data plane of SciStream’s architecture.
Forwarding logic that supports efficient and secure memory-to-memory data streaming.
A set of benchmarks for evaluating the new forwarding element against traditional options, focusing on improvements in throughput, latency, and security.
An investigation on the potential advantages of programmable network elements for detailed control over data streaming paths and security configurations.
A package of the newly developed forwarding elements as artifacts for reproducibility experiments in Chameleon and FABRIC testbeds.

Chameleon Trovi Redesign

Wed, 21 Feb 2024 13:43:55 -0600

Trovi on Chameleon is an open-source service designed to significantly enhance the practical reproducibility of computer science research. By allowing Chameleon users to upload, share, and access packaged experiments and other research artifacts, Trovi aims to streamline the process of replicating and building upon existing studies. This capability is crucial in the scientific community, where the ability to accurately reproduce research results is as fundamental to validating, critiquing, and extending scientific findings as reading papers. The importance of Trovi lies in its potential to serve as a centralized hub that facilitates the exchange of valuable research outputs, promotes transparency, and fosters collaboration among researchers. By improving the ease with which experiments can be replicated and data can be shared, Trovi supports the advancement of knowledge and innovation in the field of computer science, making it an essential tool for researchers seeking to contribute to the development of reproducible and robust scientific research.

This project will focus on the evolution of Trovi. It will aim to enhance Trovi as a tool to advance practical reproducibility in CS research. Students will evaluate the most important use cases and enabling features necessary to enhance Trovi’s functionality and user experience. With these design insights, students will then create a robust interface that allows researchers to integrate experiment code and data easily as packaged artifacts, similar to the user-friendly design of Google Colab, and build off other users’ artifacts to create novel experiments, similar to the design of GitHub. Furthermore, students will create comprehensive documentation with valuable insights into what works well and what requires improvement, creating a dynamic feedback loop to guide the ongoing redesign process. Lastly, students will actively participate in designing webinars, creating and posting video tutorials, and organizing academic events at the University of Chicago to showcase the work on Trovi. This multifaceted project ensures a well-rounded experience and fosters a collaborative learning environment.

Each of the project ideas below focuses on a different aspect of the overall goal to enhance Trovi as a tool for advancing practical reproducibility in CS research. They are designed to offer a comprehensive approach, from technical development to community engagement, ensuring a well-rounded enhancement of the service.

Topics: User Interface Design, User Experience, Web Development
Skills: HTML/CSS, JavaScript, UX design principles
Difficulty: Moderate to Hard
Size: Medium to Large
Mentors: Mark Powers
Tasks:
- Conduct user research to understand the needs and pain points of current and potential Trovi users.
- Design wireframes and prototypes that incorporate user feedback and aim to simplify the process of uploading, sharing, and reusing research artifacts.
- Implement the frontend redesign using a modern web framework to ensure responsiveness and ease of use.

Packaged Artifacts Integration System

Topics: Cloud Computing, Data Management, Web APIs
Skills: Python, RESTful APIs, Docker, Git
Difficulty: Hard
Size: Large
Mentors: Mark Powers
Tasks:
- Develop a system that allows users to easily package and upload their experimental code and data to Trovi.
- Create a standardized format or set of guidelines for packaging experiments to ensure consistency and ease of use.
- Implement API endpoints that enable automated uploads, downloads, and integration with other tools like GitHub or Zenodo.
- Test the system with real-world experiments to ensure reliability and ease of integration.

Community Engagement and Educational Materials

Topics: Educational Technology, Community Building, Content Creation
Skills: Video Editing, Public Speaking, Event Planning
Difficulty: Moderate
Size: Medium
Mentors: Mark Powers
Tasks:
- Design and organize webinars that introduce Trovi and its new features to the research community.
- Create engaging video tutorials that guide users through the process of using Trovi for their research needs.
- Develop comprehensive documentation that covers both basic and advanced use cases, troubleshooting, and tips for effective collaboration using Trovi.
- Organize academic events, such as workshops or hackathons, that encourage the use of Trovi for collaborative research projects.

Feedback Loop and Continuous Improvement System

Topics: Software Engineering, Data Analysis, User Feedback
Skills: Python, SQL, Data Visualization, Web Development
Difficulty: Moderate
Size: Medium
Mentors: Mark Powers
Tasks:
- Implement a system within Trovi for collecting, storing, and analyzing user feedback and usage data.
- Develop dashboards that visualize feedback trends and identify areas for improvement.
- Create mechanisms for users to easily report bugs, request features, and offer suggestions for the platform.
- Use the collected data to prioritize development efforts and continuously update the platform based on user needs and feedback.

Data leakage in applied ML: reproducing examples of irreproducibility

Wed, 21 Feb 2024 00:00:00 +0000

Topics: applied machine learning, data leakage, reproducibility
Skills: Python, data analysis, machine learning
Difficulty: Medium
Size: Large (350 hours)
Mentors: Fraida Fund and Mohamed Saeed

Project Idea Description

Data leakage has been identified as a major cause of irreproducible findings in papers that apply machine learning techniques to problems in science. Data leakage includes errors such as:

pre-processing before splitting into training/test sets
feature selection before splitting into training/test sets
duplicated data points in both training and test sets
temporal leakage (e.g. shuffled K-fold cross validation with temporal data)
group leakage (e.g. shuffled K-fold cross validation with data that has group structure)

and leads to an overly optimistic evaluation of model performance, such that the finding may no longer hold once the error is corrected.
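As a minimal sketch of two of the errors listed above, the following toy example (made-up numbers, standard library only) shows how scaling a feature before the split lets test-set statistics shape the training data, and how a group-aware split keeps every record of a group on one side. The function names, the `records` data, and the patient identifiers are illustrative assumptions, not part of any published result.

```python
from statistics import mean, pstdev

def standardize(values, fit_on):
    """Scale `values` using the mean and std computed from `fit_on` only."""
    mu, sigma = mean(fit_on), pstdev(fit_on)
    return [(v - mu) / sigma for v in values]

# Toy feature column; the last two rows are the held-out test set.
feature = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0]
train, test = feature[:4], feature[4:]

# Leaky: statistics are computed on train + test before the split.
leaky_train = standardize(train, fit_on=feature)
# Correct: statistics come from the training rows only.
clean_train = standardize(train, fit_on=train)

def group_split(rows, group_of, test_groups):
    """Split so that every row of a group lands on the same side."""
    train_rows = [r for r in rows if group_of(r) not in test_groups]
    test_rows = [r for r in rows if group_of(r) in test_groups]
    return train_rows, test_rows

# Hypothetical records: (patient_id, measurement).
records = [("p1", 0.2), ("p1", 0.3), ("p2", 0.9), ("p3", 0.5), ("p3", 0.4)]
train_rows, test_rows = group_split(records, lambda r: r[0], {"p3"})
```

In the leaky variant every training value ends up below zero because the large test values pulled the mean upward, which is information the model should not have seen. The clean variant is centered on the training rows alone.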

Despite the seriousness of this problem, data leakage is often not covered in introductory machine learning courses, and many users of machine learning across varied science domains are unaware of it. Even those who have learned “rules” for avoiding data leakage (e.g. “never do feature selection on the test set”) may not understand the reasons for these “rules”, and how important they are for ensuring that the final result is valid and reproducible.

The goal of this project is to create learning materials demonstrating how instances of data leakage invalidate a result. These materials should be easily adoptable by instructors teaching machine learning in a wide variety of contexts, including those teaching a non-CS audience. To achieve this, the project proposes to re-implement published results that have been affected by data leakage, and package these implementations along with supporting material in a format suitable for use in classrooms and by independent learners. For each “irreproducible result”, the “package” should include -

a re-implementation of the original result
an explanation of the data leakage problem affecting the result, with an implementation of a “toy example” on synthetic data
a re-implementation of the result without the data analysis error, to show how the finding is affected
and examples of exam or homework questions that an instructor adopting this package may use to assess understanding.

Writing a successful proposal for this project

A good proposal for this project should include, for at least a few “types” of data leakage mentioned above -

a specific published result that could be used as an exemplar (you may find ideas among the review papers listed here)
a brief description of the details of the experiment that will reproduce that result (e.g. what data is used, what machine learning technique is used, what are the hyperparameters used for training)
and an explanation of why this result is suitable for this use (it uses a publicly available dataset, a machine learning technique that is familiar and accessible to students in an introductory course, the paper has sufficient detail to reproduce the result, etc.)

The contributor will need to create learning materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor’s writing abilities.

Github link

To get a sense of the type of code you would be writing, here is an example of a learning module related to data leakage (however, it is not in the format described above): Beauty in the Classroom

Project Deliverables

“Packages” of learning materials for teaching about common types of data leakage
Trovi artifacts for “playing back” each of the “packages”

Evaluating congestion controls past and future

Wed, 21 Feb 2024 00:00:00 +0000

Topics: computer networks, congestion control, reproducibility
Skills: Python, Bash scripting, Linux, computer network performance evaluation
Difficulty: Medium
Size: Large (350 hours)
Mentors: Fraida Fund and Ashutosh Srivastava

Project Idea Description

In computer networks, congestion control protocols play an outsize role in determining our experience with networked applications. New congestion control algorithms are regularly proposed by researchers to improve throughput and latency performance, adapt to new types of networks, and align more closely with the needs of new applications.

However, our understanding of the benefits of a new congestion control protocol depends to a large extent on the evaluation - the network topology, the network delay and throughput, the type of flow, the type of competing traffic - and there is no single standard way to evaluate a congestion control protocol. The Pantheon project (which is no longer supported) sought to fill this gap somewhat and address the problem of reproducibility of congestion control results, but their approach is not easily adapted to evaluation scenarios representative of new types of applications or networks. Nor is it capable of representing the evaluation scenarios in most published results related to congestion control.

The goal of this project, therefore, is to create an evaluation suite for congestion control protocols that can reproduce existing congestion control results in the academic literature, evaluate new protocols under similar conditions, and be easily extended to new scenarios. An “evaluation scenario” includes:

a Python notebook to realize the network topology on the FABRIC and/or Chameleon testbed, and configure the network characteristics,
scripts to generate the data flow(s) needed for the evaluation,
and scripts to capture data from the experiment and visualize the results.
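One way to picture an evaluation scenario is as a small record from which the shaping and flow-generation commands are derived. The sketch below only builds command strings and does not run them. The `Scenario` fields, interface name, and addresses are hypothetical, and using a single `netem` qdisc for both delay and rate is a simplifying assumption (real experiments often separate the two).

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One evaluation scenario; all field names are illustrative."""
    iface: str        # bottleneck interface on the router node
    delay_ms: int     # added one-way delay
    rate_mbit: int    # bottleneck rate
    cc: str           # congestion control name, e.g. "cubic" or "bbr"
    server: str       # receiver address
    duration_s: int   # flow duration

def netem_command(s):
    # tc/netem shapes the link: fixed delay plus a rate limit.
    return (f"tc qdisc replace dev {s.iface} root netem "
            f"delay {s.delay_ms}ms rate {s.rate_mbit}mbit")

def flow_command(s):
    # iperf3: -C selects the congestion control, -J emits JSON for later parsing.
    return f"iperf3 -c {s.server} -t {s.duration_s} -C {s.cc} -J"

s = Scenario(iface="eth1", delay_ms=20, rate_mbit=100,
             cc="bbr", server="10.0.0.2", duration_s=60)
print(netem_command(s))
print(flow_command(s))
```

Keeping the scenario as data makes it straightforward to sweep over delays, rates, or algorithms and to record exactly which settings produced each result.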

Writing a successful proposal for this project

To write a good proposal for this project, you should review the most influential papers on TCP congestion control, and especially those related to TCP protocols that are available in the Linux kernel.

Use your findings to explain what your proposed evaluation suite will include (what network topologies, what flow generators), and justify this with reference to the academic literature. Also indicate which specific results you expect to be able to reproduce using this suite (e.g. include figures from influential papers showing evaluation results! with citation, of course).

You can also take advantage of existing open source code that reproduces a congestion control result, e.g. Replication: When to Use and When Not to Use BBR, or Some of the Internet may be heading towards BBR dominance: an experimental study.

Github link

There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository for this project.

Project Deliverables

“Packages” of evaluation scenarios that can be used to evaluate a congestion control algorithm implemented in the Linux kernel
Trovi artifacts for realizing each evaluation scenario on Chameleon

Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon

Mon, 19 Feb 2024 00:00:00 +0000

Topics: Provenance, reproducibility, standards, image creation
Skills: Python, JSON, Bash scripting, Linux, image creation and deployment
Difficulty: Medium
Size: Large (350 hours)
Mentors: Raül Sirvent

Project Idea Description

The COMPSs programming model provides an interface for programming a sequential application that is transformed into a workflow, which the COMPSs runtime then schedules on the available computing resources. Programming is enabled for different languages through the use of bindings: Java, C/C++ and Python (named PyCOMPSs). COMPSs is able to generate Workflow Provenance information after the execution of an experiment. The generated artifact (code + data + recorded metadata) enables the sharing of results through tools such as the WorkflowHub portal, which can generate a DOI for the results so that they can be included as permanent references in scientific papers.

The format of the metadata generated in COMPSs experiments follows the RO-Crate specification, and, more specifically, two profiles: the Workflow and Workflow Run Crate profiles. This metadata enables not only the sharing of results, but also their reproducibility.

This project proposes the creation of a service that enables the automatic reproducibility of COMPSs experiments in the Chameleon infrastructure. The service will be able to get a COMPSs crate (an artifact that follows the RO-Crate specification) and, by parsing the available metadata, build a Chameleon-compatible image for reproducing the experiment in the testbed. Small modifications to the COMPSs RO-Crate are foreseen (i.e. the inclusion of third-party software required by the application).
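The metadata-parsing step might look roughly like the sketch below, which walks an RO-Crate `@graph` from the metadata descriptor to the root dataset and on to the main workflow entity. The crate content here is made up and heavily trimmed for illustration. A real COMPSs crate carries many more entities, and the file path and version string are assumptions.

```python
import json

# A made-up, heavily trimmed ro-crate-metadata.json for illustration only.
CRATE_JSON = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "mainEntity": {"@id": "application/matmul.py"}},
    {"@id": "application/matmul.py",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
     "programmingLanguage": {"@id": "#compss"}},
    {"@id": "#compss", "@type": "ComputerLanguage",
     "name": "COMPSs", "version": "3.3"}
  ]
}
"""

def main_workflow(crate):
    """Return (workflow_path, language_name, language_version)."""
    by_id = {entity["@id"]: entity for entity in crate["@graph"]}
    descriptor = by_id["ro-crate-metadata.json"]
    root = by_id[descriptor["about"]["@id"]]          # the root data entity
    workflow = by_id[root["mainEntity"]["@id"]]       # the main workflow file
    lang = by_id[workflow["programmingLanguage"]["@id"]]
    return workflow["@id"], lang["name"], lang.get("version")

path, name, version = main_workflow(json.loads(CRATE_JSON))
print(path, name, version)
```

A service along these lines could use the recovered runtime name and version to decide which software to install in the image before staging the workflow files.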

Project Deliverables

Study the different environments and specifications (COMPSs, RO-Crate, Chameleon, Trovi, …).
Design the most appropriate integration, considering all the elements involved.
Integrate PyCOMPSs basic experiments reproducibility in Chameleon.
Integrate PyCOMPSs complex experiments reproducibility in Chameleon (i.e. with third party software dependencies).

BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking

Sat, 17 Feb 2024 00:00:00 +0000

Topics: bioinformatics, spatial transcriptomics, gene imputation, benchmarking, cross-platform/species analysis
Skills:
- Programming Languages:
  - Proficient in Python and/or R, commonly used in bioinformatics.
- Data Analysis:
  - Experience with statistical data analysis and machine learning models.
- Bioinformatics Knowledge (not required but preferred):
  - Proficiency in bioinformatics and computational biology.
  - Familiarity with spatial transcriptomics datasets and platforms.
Difficulty: Advanced
Size: Large (350 hours). Given the scope of integrating multi-platform, multi-species datasets and the complexity of benchmarking gene imputation methods, this project is substantial. It requires extensive data preparation, analysis, and validation phases, making it suitable for a larger time investment.
Mentors: Ziheng Duan (contact person)

Project Idea Description

The orchestration of cellular life is profoundly influenced by the precise control of gene activation and silencing across different spatial and temporal contexts. Understanding these complex spatiotemporal gene expression patterns is vital for advancing our knowledge of biological processes, from development and disease progression to adaptation. While single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to profile gene expression across thousands of cells simultaneously, its requirement for cell dissociation strips away the critical spatial context, limiting our comprehension of cellular interactions within their native environments. Recent strides in spatial transcriptomics have started to bridge this gap by enabling spatially resolved gene expression measurements at single-cell or even sub-cellular resolutions. These advancements offer unparalleled opportunities to delineate the intricate tapestry of gene expression within tissues, shedding light on the dynamic interactions between cells and their surroundings.

Despite these technological advances, a significant challenge remains: the datasets generated by spatial transcriptomic technologies are often incomplete, marred by missing gene expression values due to various technical and biological constraints. This limitation severely impedes our ability to fully interpret these rich datasets and extract meaningful insights from them. Gene imputation emerges as a pivotal solution to this problem, aiming to fill in these missing data points, thereby enhancing the resolution, quality, and interpretability of spatial transcriptomic datasets.

Recognizing the critical importance of this task, there is a pressing need for a unified benchmarking platform that can facilitate the evaluation and comparison of gene imputation methods across a diverse array of samples, spanning multiple sampling platforms, species, and organs. Currently, the bioinformatics and spatial transcriptomics fields lack such a standardized framework, hindering progress and innovation. To address this gap, our project aims to establish a comprehensive gene imputation dataset that encompasses a wide range of conditions and parameters. We intend to reproduce known methods and assess their efficacy, providing a solid and reproducible foundation for future advancements in this domain.
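A common way to score an imputation method is to hide some known values, impute them, and measure the error on the hidden entries only. The sketch below illustrates that loop on a made-up spots-by-genes matrix. The column-mean imputer is a deliberately naive stand-in, not one of the methods the project will evaluate, and RMSE is just one possible metric.

```python
from math import sqrt

def mean_impute(matrix):
    """Fill None entries with the mean of the observed values of that gene."""
    n_genes = len(matrix[0])
    filled = [row[:] for row in matrix]
    for g in range(n_genes):
        observed = [row[g] for row in matrix if row[g] is not None]
        col_mean = sum(observed) / len(observed)
        for row in filled:
            if row[g] is None:
                row[g] = col_mean
    return filled

def rmse_on_mask(truth, imputed, mask):
    """RMSE over the held-out (spot, gene) positions only."""
    errs = [(truth[i][g] - imputed[i][g]) ** 2 for i, g in mask]
    return sqrt(sum(errs) / len(errs))

# Toy spots x genes matrix (made-up values).
truth = [[1.0, 4.0], [3.0, 6.0], [5.0, 8.0]]
mask = [(1, 0), (2, 1)]              # entries hidden from the method
observed = [row[:] for row in truth]
for i, g in mask:
    observed[i][g] = None

score = rmse_on_mask(truth, mean_impute(observed), mask)
print(round(score, 4))
```

Because every method sees the same masked matrix and is scored on the same hidden entries, results stay comparable across platforms, species, and organs.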

Project Deliverable

A comprehensive, preprocessed benchmark dataset that spans multiple sampling platforms, species, and organs, aimed at standardizing gene imputation tasks in spatial transcriptomics.
An objective comparison of state-of-the-art gene imputation methodologies, enhancing the understanding of their performance and applicability across diverse biological contexts.
A user-friendly Python package offering a suite of gene imputation tools, designed to fulfill the research needs of the spatial transcriptomics community by improving data completeness and reproducibility.

ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Sat, 10 Feb 2024 00:00:00 +0000

Topics: Distributed systems, Scalability, Bug analysis, Bug reproducibility
Skills: Java, Python, bash scripting, perf, Linux internals
Difficulty: Hard
Size: Large (350 hours)
Mentors: Bogdan "Bo" Stoica (contact person), Yang Wang

Project Idea Description

Large-scale distributed systems are integral to the infrastructure of a wide range of applications and services. The continuous evolution of these systems requires ongoing efforts to address inherent faults which span a variety of issues including availability, consistency, concurrency, configuration, durability, error-handling, integrity, performance, and security. Recent developments in the field and the rise of cloud computing have been marked by a notable increase in the scale at which such systems operate.

This increase in scale introduces specific challenges, particularly in terms of system reliability and performance. As distributed systems expand beyond single machines, addressing the growing demands for computation, memory and storage becomes more difficult. This underlying complexity leads to the emergence of scalability bugs — defects that surface in large-scale deployments, yet do not reveal themselves in a small-scale setting.

To better understand scalability bugs, we set out to investigate a set of scalability issues documented over the last 5 years from 10 popular open-source large-scale systems. These bugs have led to significant operational challenges, such as system downtime, reduced responsiveness, data loss, and data corruption. Moreover, addressing them required extensive collaboration and problem-solving efforts among engineers and bug reporters, with discussions often spanning a month or more.

We observed that traditional bug finding techniques are insufficient for detecting scalability bugs since these defects are triggered by a mixture of scale-related aspects not properly investigated by previous approaches. These characteristics include the number of components involved, the system load and workload size, the reliability of recovery protocols, and the magnitude of intermediate failures. Although previous research examined some of these aspects, it has typically done so either in isolation (individually), or without providing a comprehensive understanding of the fundamental bug patterns, symptoms, root causes, fixes, and, more importantly, how easily these bugs can be reproduced in-house.

Therefore, the main goal of this project is to systematically understand, characterize, and document the challenges associated with scalability bugs at large. Our approach is twofold: first, to analyze scalability bugs in terms of reproducibility, and second, to develop methodologies for triggering them and measuring their impact. Specifically, we aim to:

Provide detailed accounts of bug reproduction experiences for a diverse set of recently reported scalability bugs from our benchmark applications;
Identify specific challenges that prevent engineers from reproducing certain scalability bugs and investigate how prevalent these obstacles are;
Create a suite of protocols to effectively trigger and quantify the impact of scalability bugs, facilitating their investigation in smaller-scale environments.

Project Deliverable

A set of Trovi replayable artifacts enabling other researchers to easily reproduce scalability bugs for our benchmark applications;
A set of Jupyter notebooks that allow each step of our investigation to be conveniently replayed;
A detailed breakdown of the challenges faced when reproducing scalability bugs and how these obstacles differ from those related to more “traditional” types of bugs.

GPEC: An Open Emulation Platform to Evaluate GPU/ML Workloads on Erasure Coding Storage

Thu, 08 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Storage Systems, Machine Learning, Erasure Coding
Skills: C/C++, Python, PyTorch, Bash scripting, Linux, Erasure Coding, Machine Learning
Difficulty: Hard
Size: Large (350 hours)
Mentors: Meng Wang (primary contact), John Bent

Large-scale data centers store immense amounts of user data across a multitude of disks, necessitating redundancy strategies like erasure coding (EC) to safeguard against disk failures. Numerous research efforts have sought to assess the performance and durability of various erasure coding approaches, including single-level erasure coding, locally recoverable coding, and multi-level erasure coding.
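To make the redundancy trade-off concrete, the sketch below computes the raw-storage overhead of a code with k data and p parity chunks, and demonstrates recovery with a single XOR parity chunk. XOR parity is used here only because it fits in a few lines; production systems typically rely on stronger codes such as Reed-Solomon, and the chunk contents are made up.

```python
from functools import reduce

def storage_overhead(k, p):
    """Raw bytes stored per byte of user data for k data + p parity chunks."""
    return (k + p) / k

def xor_parity(chunks):
    """Byte-wise XOR of equal-length chunks (a single-parity code)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

data = [b"abcd", b"efgh", b"ijkl"]   # k = 3 data chunks
parity = xor_parity(data)            # p = 1 parity chunk

# Lose chunk 1, then rebuild it from the survivors plus the parity.
rebuilt = xor_parity([data[0], data[2], parity])

print(storage_overhead(10, 4))       # a 10+4 code
print(storage_overhead(1, 2))        # 3-way replication seen as 1 data + 2 copies
print(rebuilt == data[1])
```

The overhead numbers show why erasure coding is attractive on cost, while the rebuild step hints at the extra reads and computation that repair adds compared with copying a replica.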

Despite the widespread adoption of erasure coding, a significant research gap exists regarding the performance of large-scale erasure-coded storage systems when exposed to machine learning (ML) workloads. While conventional practice often leans towards replication for enhanced performance, this project seeks to explore whether cost-effective erasure coding can deliver comparable performance. In this context, several fundamental questions remain unanswered, including: Can a typical erasure-coded storage system deliver sufficient throughput for ML training tasks? Can an erasure-coded storage system maintain low-latency performance for ML training and inference workloads? How do disk failure and subsequent repair impact the throughput and latency of ML workloads? What influence do various erasure coding design choices, such as chunk placement strategies and repair methods, have on the aforementioned performance metrics?

To address these questions, the most straightforward approach would involve running ML workloads on large-scale erasure coded storage systems within HPC data centers. However, this presents challenges for researchers and students due to limited access to expensive GPUs and distributed storage systems, especially when dealing with large-scale evaluations. Consequently, there is a need for a cost-effective evaluation platform.

The objective of this project is to develop an open-source platform that facilitates cheap and reproducible evaluations of erasure-coded storage systems concerning ML workloads. This platform consists of two key components:

GPU Emulator: This emulator is designed to simulate GPU performance for ML workloads. Development of the GPU emulator is near completion.
EC Emulator: This emulator is designed to simulate the performance characteristics of erasure-coded storage systems. It is still in the exploratory phase and requires further development.

The student’s responsibilities will include documenting the GPU emulator, progressing the development of the EC emulator, and packaging the experiments to ensure easy reproducibility. It is anticipated that this platform will empower researchers and students to conduct cost-effective and reproducible evaluations of large-scale erasure-coded storage systems in the context of ML workloads.

Project Deliverable

Build an EC emulator to emulate the performance characteristics of large-scale erasure-coded storage systems
Incorporate the EC emulator into ML workloads and GPU emulator
Conduct reproducible experiments to evaluate the performance of erasure-coded storage systems in the context of ML workloads
Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository with open-source code

LAST: Let’s Adapt to System Drift

Wed, 07 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Computer systems, machine learning
Skills: Python, PyTorch, Bash scripting, Linux, Data Science and Machine Learning
Difficulty: Hard
Size: Large (350 hours)
Mentors: Ray Andrew Sinurat (primary contact), Sandeep Madireddy

The performance of computer systems is constantly evolving, a natural outcome of updating hardware, improving software, and encountering hardware quirks over time. At the same time, machine learning (ML) models are becoming increasingly popular. They are being used widely to address various challenges in computer systems, notably in speeding up decision-making. This speed is vital for a quick and flexible response, essential for meeting service-level agreements (SLAs). Yet, an interesting twist has emerged: like the computer systems they aid, ML models also experience a kind of “aging.” This results in a gradual decline in their effectiveness, a consequence of changes in their operating environment.

The phenomenon of model “aging” is a ubiquitous occurrence across various domains, not limited merely to computer systems. This process of aging can significantly impact the performance of a model, emphasizing the critical importance of early detection mechanisms to maintain optimal functionality. In light of this, numerous strategies have been formulated to mitigate the aging of models. However, the generalizability and effectiveness of these strategies across diverse domains, particularly in computer systems, remain largely unexplored. This research aims to bridge this gap by designing and implementing a comprehensive data analysis pipeline. The primary objective is to evaluate the efficacy of various strategies through a comparative analysis, focusing on their performance in detecting and addressing model aging. To achieve a better understanding of this issue, the research will address the following pivotal questions:

Data-Induced Model Aging: What specific variations within the data can precipitate the aging of a model? Understanding the nature and characteristics of data changes that lead to model deterioration is crucial for developing effective prevention and mitigation strategies.
Efficacy of Aging Detection Algorithms: How proficient are the current algorithms in identifying the signs of model aging? Assessing the accuracy and reliability of these algorithms will provide insights into their practical utility in real-world scenarios.
Failure Points in Detection: In what scenarios or under what data conditions do the aging detection mechanisms fail? Identifying the limitations and vulnerabilities of these algorithms is vital for refining their robustness and ensuring comprehensive coverage.
Scalability and Responsiveness: How do these algorithms perform in terms of robustness and speed, particularly when subjected to larger datasets? Evaluating the scalability and responsiveness of the algorithms will determine their feasibility and effectiveness in handling extensive and complex datasets, a common characteristic in computer systems.
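As a minimal illustration of what an aging (drift) detector does, the sketch below compares the mean of a sliding recent window against a fixed reference window and raises an alarm when the gap exceeds a multiple of the reference standard deviation. This windowed mean-shift test is a simple stand-in chosen for brevity, not one of the algorithms the project will evaluate. The trace, window size, and threshold are made up.

```python
from statistics import mean, pstdev

def detect_drift(stream, window, threshold):
    """Return the first index at which the recent window's mean departs from
    the reference window's mean by more than `threshold` reference standard
    deviations, or None if that never happens."""
    reference = stream[:window]
    ref_mean, ref_std = mean(reference), pstdev(reference)
    for end in range(2 * window, len(stream) + 1):
        recent = stream[end - window:end]
        if abs(mean(recent) - ref_mean) > threshold * ref_std:
            return end - 1
    return None

# Made-up latency trace: stable around 10, then shifts to around 20.
stable = [10, 11, 9, 10, 11, 9, 10, 11]
shifted = [20, 21, 19, 20]
alarm = detect_drift(stable + shifted, window=4, threshold=3.0)
print(alarm)
```

Even this toy version exposes the questions raised above: a larger window or threshold delays detection, while a smaller one reacts to noise, and gradual drift that never produces a sharp mean shift may go unnoticed.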

To better understand and prevent issues related to model performance, our approach involves analyzing various datasets, both system and non-system, that have shown notable changes over time. We aim to apply machine learning (ML) models to these datasets to assess the effects of these changes on model performance. Our goal is to leverage more advanced ML techniques to create new algorithms that address these challenges effectively. This effort is expected to contribute significantly to the community, enhancing the detection of model aging and improving model performance in computer systems.

Project Deliverable

Run the pipeline on several computer systems and non-computer systems datasets
A Trovi artifact for data preprocessing and model training shared on Chameleon Cloud
A GitHub repository containing the pipeline source code

Reproducibility in Data Visualization

Tue, 06 Feb 2024 15:00:00 -0500

At the heart of evaluating reproducibility is a judgment about whether two results are indeed the same. This can be complicated in the context of data visualization due to rapidly evolving technology and differences in how users perceive the results. First, due to the rapid evolution of libraries including web technologies, visualizations created in the past may look different when rendered in the future. Second, as the goal of data visualization is communicating data to people, different people may perceive visualizations in a different way. Thus, when a reproduced visualization does not exactly match the original, judging whether they are “similar enough” is complicated by these factors. For example, changes in a colormap may be deemed minor by a computer but could lead people to different understandings of the data. The goals of this research are to capture visualizations in a way that allows their reproducibility to be evaluated and to develop methods to categorize the differences when a reproduced visualization differs from the original.

Investigate Solutions for Capturing Visualizations

Topics: Reproducibility, Data Visualization
Skills: Python and/or JavaScript, Data Visualization Tools
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: David Koop

The goal of this project is to investigate, augment, and/or develop solutions to capture visualizations that appear in formats including websites and Jupyter notebooks. In past work, we implemented methods to capture thumbnails as users interacted with visualizations. Other solutions can be used to capture interactive visualizations. We wish to understand the feasibility of recording such visualizations and their utility in evaluating reproducibility in the future.

Specific tasks:

Evaluate tools for capturing static visualizations on the web
Investigate tools for capturing dynamic visualizations on the web
Investigate how data including code or metadata can be captured with visualizations
Augment or develop tools to aid in capturing reproducible visualizations

Categorize Differences in Reproduced Visualizations

Topics: Reproducibility, Data Visualization
Skills: Python and/or JavaScript, Data Visualization Tools
Difficulty: Moderate/Hard
Size: Medium or Large (175 or 350 hours)
Mentors: David Koop

The goal of this project is to organize types of differences in reproduced visualizations and create tools to detect them. Publications and computational notebooks record renderings of visualizations. When they also include the code to reproduce the visualization, we can regenerate them in order to compare them. Often, the reproduced visualization does not match the original (see examples in this manuscript). This project seeks to categorize the types of differences that can occur in order to start understanding how they impact judgments of reproducibility.

Specific tasks:

Evaluate and/or develop tools to compare two visualizations
Evaluate the utility of artificial intelligence solutions
Organize and categorize the detected differences
Develop tools to determine the types or categories of differences present in two visualizations
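A first cut at comparing two renderings could separate "the same marks drawn in different colors" from "different marks". The sketch below does this for tiny images represented as nested lists of RGB tuples. The category labels and the one-to-one recoloring test are illustrative assumptions, not the taxonomy this project will produce, and real renderings would also need handling for anti-aliasing and size differences.

```python
def pixel_diff_fraction(a, b):
    """Fraction of pixel positions whose colors differ (same-sized images)."""
    pairs = [(pa, pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb)]
    return sum(pa != pb for pa, pb in pairs) / len(pairs)

def is_color_remap(a, b):
    """True if b can be obtained from a by a one-to-one recoloring."""
    forward, backward = {}, {}
    for ra, rb in zip(a, b):
        for pa, pb in zip(ra, rb):
            if forward.setdefault(pa, pb) != pb:
                return False
            if backward.setdefault(pb, pa) != pa:
                return False
    return True

def categorize(a, b):
    if pixel_diff_fraction(a, b) == 0:
        return "identical"
    return "colormap change" if is_color_remap(a, b) else "content change"

RED, BLUE, GREEN, WHITE = (255, 0, 0), (0, 0, 255), (0, 128, 0), (255, 255, 255)
original = [[WHITE, RED], [BLUE, WHITE]]
recolored = [[WHITE, GREEN], [BLUE, WHITE]]   # RED drawn as GREEN everywhere
moved = [[RED, WHITE], [BLUE, WHITE]]         # the red mark changed position

print(categorize(original, recolored))
print(categorize(original, moved))
```

Both altered images differ from the original in a similar share of pixels, so a raw pixel count alone would not tell a recoloring apart from a change in the plotted data.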

FSA: Benchmarking Fail-Slow Algorithms

Tue, 06 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Storage systems, machine learning
Skills: Python, PyTorch, Bash scripting, Linux, Machine Learning modeling
Difficulty: Hard
Size: Large (350 hours)
Mentors: Ruidan Li (primary contact), Kexin Pei

In the realm of modern applications, achieving not only low but also predictable response times is a critical requirement. Performance instability, even when it amounts to just a few milliseconds of delay, can result in violations of Service Level Objectives (SLOs). Redundancy at the RAID group level provides a layer of protection; however, the early identification of potential slowdowns or failures is paramount in minimizing their impact on overall system latency.

Fail-Slow represents a unique type of fault within storage systems, characterized by the system’s ability to continue functioning while progressively deteriorating – its performance significantly drops below expected levels. Notably, fail-slow conditions are responsible for a considerable share of latency tails. Detecting fail-slow faults is particularly challenging, as they can be easily masked by the normal fluctuations in performance. Consequently, the identification of fail-slow faults is a critical area of research, demanding meticulous attention.

Several strategies have been developed to address the fail-slow issue, yet the question of their broad applicability remains. We plan to implement and assess various existing fail-slow detection algorithms, examining their strengths and weaknesses. Our analysis will concentrate on key questions:

How promptly can the algorithm identify a fail-slow symptom?
What methods does the algorithm employ to accurately distinguish fail-slow incidents, thereby minimizing false negatives?
Through what approach does the algorithm achieve the right sensitivity level to keep false positives in check?

This evaluation aims to shed light on the effectiveness of current methodologies in detecting fail-slow faults, crucial for enhancing system reliability and performance.
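To illustrate the trade-offs behind these questions, the sketch below flags a drive as fail-slow once its latency stays above a multiple of the median latency of all drives in the group for several consecutive windows. This peer-comparison rule is a simplified assumption for illustration, not one of the published algorithms to be benchmarked. The drive names, latencies, and parameters are made up.

```python
from statistics import median

def fail_slow_drives(latencies, ratio, min_windows):
    """latencies: {drive: [latency per time window]} with equal-length lists.
    Flag a drive once its latency exceeds `ratio` times the median across all
    drives for `min_windows` consecutive windows. Returns {drive: window}."""
    n_windows = len(next(iter(latencies.values())))
    streak = {d: 0 for d in latencies}
    flagged = {}
    for w in range(n_windows):
        peer_median = median(series[w] for series in latencies.values())
        for d, series in latencies.items():
            streak[d] = streak[d] + 1 if series[w] > ratio * peer_median else 0
            if streak[d] >= min_windows and d not in flagged:
                flagged[d] = w
    return flagged

# Made-up per-window latencies (ms) for four drives in one RAID group.
trace = {
    "d0": [5, 5, 6, 5, 5, 6],
    "d1": [5, 6, 5, 5, 6, 5],
    "d2": [5, 30, 5, 12, 14, 18],   # one transient spike, then a sustained slowdown
    "d3": [6, 5, 5, 6, 5, 5],
}
result = fail_slow_drives(trace, ratio=2.0, min_windows=3)
print(result)
```

Requiring consecutive slow windows suppresses the transient spike (fewer false positives) at the cost of a detection delay of `min_windows` windows, which is exactly the promptness versus sensitivity tension the evaluation needs to quantify.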

Building upon our evaluation of several fail-slow detection algorithms, our objective is to harness advanced machine learning (ML) models to develop a novel algorithm. This initiative seeks to address and potentially compensate for the identified weaknesses in existing methodologies. By focusing on the critical aspects of early detection, accurate differentiation, and optimal sensitivity, we aim to create a solution that reduces both false negatives and false positives, thereby enhancing overall system reliability. This approach represents a strategic effort to not only advance the current state of fail-slow detection but also to contribute significantly to the resilience and performance of storage systems.

Project Deliverable

A Trovi artifact for the existing Fail-Slow detection algorithms on Chameleon Cloud
A GitHub repository containing the full evaluation result
A Google Colab notebook for quick replay

OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS

Mon, 05 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Storage Systems, Erasure Coding
Skills: C/C++, Java, Bash scripting, Linux, HDFS, ZFS, Erasure Coding
Difficulty: Hard
Size: Large (350 hours)
Mentors: Meng Wang (Main contact person) and Anjus George

Multi-Level Erasure Coding (MLEC), which performs erasure coding at both network and local levels, has seen large deployments in practice. Our recent research work has shown that MLEC can provide high durability with higher encoding throughput and less repair network traffic compared to other erasure coding methods. This makes MLEC particularly appealing for large-scale data centers, especially high-performance computing (HPC) systems.

However, current MLEC systems often rely on straightforward design choices, such as Clustered/Clustered (C/C) chunk placement and the Repair-All (RALL) method for catastrophic local failures. Our recent simulations [1] have revealed the potential benefits of more complex chunk placement strategies like Clustered/Declustered (C/D), Declustered/Clustered (D/C), and Declustered/Declustered (D/D). Additionally, advanced repair methods such as Repair Failed Chunks Only (RFCO), Repair Hybrid (RHYB), and Repair Minimum (RMIN) have shown promise for improving durability and performance according to our simulations. Despite promising simulation results, these optimized design choices have not been implemented in real systems.

In this project, we propose to develop open-source MLEC implementations in real systems, offering a range of design choices from simple to complex. Our approach leverages ZFS for local-level erasure coding and HDFS for network-level erasure coding, supporting both clustered and declustered chunk placement at each level. The student’s responsibilities include setting up HDFS on top of ZFS, configuring various MLEC chunk placements (e.g., C/D, D/C, D/D), and implementing advanced repair methods within HDFS and ZFS. The project will culminate in reproducible experiments to evaluate the performance of MLEC systems under different design choices.

We will open-source our code and aim to provide valuable insights to the community on optimizing erasure-coded systems. Additionally, we will provide comprehensive documentation of our work and share Trovi artifacts on Chameleon Cloud to facilitate easy reproducibility of our experiments.

[1] Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, and Haryadi S. Gunawi. Design Considerations and Analysis of Multi-Level Erasure Coding in Large- Scale Data Centers. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’23), 2023.

Project Deliverable

Open-source MLEC implementations with a diverse range of design choices.
Configuration setup for HDFS on top of ZFS, supporting various MLEC chunk placements.
Implementation of advanced repair methods within HDFS and ZFS.
Reproducible experiments to assess the performance of MLEC systems across distinct design choices.
Comprehensive documentation of the project and the provision of shared Trovi artifacts on Chameleon Cloud for ease of reproducibility.

EdgeRep: Reproducing and benchmarking edge analytic systems

Fri, 02 Feb 2024 00:00:00 +0000

Topics: video analytics, machine learning
Skills: Python, PyTorch, Bash scripting, Linux, Machine Learning modeling
Difficulty: Medium
Size: Large (350 hours)
Mentors: Yuyang (Roy) Huang (contact person), Junchen Jiang

Project Idea Description

With the flourishing of ideas like smart cities and smart manufacturing, a massive number of edge devices (e.g., traffic or security cameras, thermometers, flood sensors, etc.) are deployed and connected to the network. These devices collect and analyze data across space and time, aiding stakeholders like city governments and manufacturers in optimizing their plans and operations. However, the sheer number of edge devices and the large amount of communication among the devices and central servers raise significant challenges in how to manage and schedule resources. These resources include the network bandwidth between the devices and the computing power on both edge devices and bare metal servers, all of which must be managed to keep running applications reliably in service.

Moreover, given the limited resources available to edge devices, there’s an emerging trend to reduce average compute and/or bandwidth usage. This is achieved by leveraging the uneven distribution of interesting events with respect to both time and space in the input data. This, in turn, introduces further challenges in provisioning and managing the amount of resources available to edge devices. The resource demands of running applications can greatly depend on the input data, which is both dynamic and unpredictable.

Keeping these challenges in mind, the team previously designed and implemented a dynamic resource manager capable of understanding the applications and making decisions based on this understanding at runtime. However, such a resource manager has only been tested with a limited number and types of video analytic applications. Thus, through the OSRE24 project, we aim to:

Collect a wide range of videos to form a comprehensive video dataset
Reproduce other state-of-the-art self-adaptive video analytic applications
Package the dataset as well as the applications and publish them on the Chameleon Trovi site

Project Deliverable

Collect a wide range of videos to form a comprehensive video dataset
Reproduce other state-of-the-art self-adaptive video analytic applications
Package the dataset as well as the applications and publish them on the Chameleon Trovi site

FEP-Bench: Benchmarks for understanding feature engineering and preprocessing bottlenecks

Fri, 02 Feb 2024 00:00:00 +0000

Topics: storage system, scheduling, distributed system, machine learning
Skills: Python, PyTorch, Bash scripting, Linux, Machine Learning modeling
Difficulty: Hard
Size: Large (350 hours)
Mentors: Yuyang (Roy) Huang (contact person), Swami Sundararaman

Project Idea Description

However, prior to the design and implementation of such a system, a systematic understanding of the preprocessing workflow is essential. Hence, throughout the program, the students will need to:

Understand the current system used to preprocess data for ML training, for example, Hadoop or Spark.
Collect the common datasets used for different types of ML models.
Collect the typical operations used for preprocessing these datasets.
Benchmark the performance of these operations in the existing frameworks under various experimental settings.
Package the benchmark such that the team can later use it for reproduction or evaluation.

Project Deliverable

Understand the current system used to preprocess data for ML training, for example, Hadoop or Spark.
Collect the common datasets used for different types of ML models.
Collect the typical operations used for preprocessing these datasets.
Benchmark the performance of these operations in the existing frameworks under various experimental settings.
Package the benchmark such that the team can later use it for reproduction or evaluation.

FetchPipe: Data Science Pipeline for ML-based Prefetching

Fri, 02 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: Storage systems, machine learning
Skills: C/C++, Python, PyTorch, Bash scripting, Linux, Machine Learning modeling
Difficulty: Hard
Size: Large (350 hours)
Mentors: Daniar H. Kurniawan (primary contact), Haryadi Gunawi

The contemporary landscape of high-performance servers, particularly those designed for data centers and AI/ML training, prominently features solid-state drives (SSDs) and spinning disks (HDDs) as primary storage devices. These components play a crucial role in shaping overall system performance, underscoring the importance of addressing and minimizing Input/Output (I/O) latency. This is particularly crucial given the widespread adoption of hybrid storage systems, where caching and prefetching strategies are instrumental in optimizing storage performance. Caching involves using faster but less dense memory to store frequently accessed data, while prefetching aims to reduce latency by fetching data from slower memory to cache before it is needed. Although both caching and prefetching present valid challenges, our primary emphasis is on the prefetching problem due to the inherent difficulty in predicting future access.

Traditional prefetchers, dating back 1-2 decades, rely heavily on predefined rules for prefetching based on LBA access sequences, which limits their adaptability to complex scenarios. For instance, the read-ahead prefetcher is confined to prefetching the next data item within a file for faster sequential access. To address this limitation, recent work has introduced learning-based methods, such as the Long Short-Term Memory (LSTM) techniques DeepPrefetcher and Delta LSTM, which model the LBA delta to cover a broader range of LBAs. However, they still struggle to achieve high accuracy when the workload pattern changes drastically. Some sophisticated prefetchers can learn complex I/O access patterns using graph structures, but their computational cost makes them difficult to deploy.
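To make the idea of modeling the LBA delta concrete, here is a minimal sketch that turns an LBA access sequence into deltas and predicts the next LBA from the most frequent recent delta. The frequency-based predictor is a toy stand-in for a learned model and is not how DeepPrefetcher or Delta LSTM work; the function names and the window size are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional


def lba_deltas(lbas: List[int]) -> List[int]:
    """Differences between consecutive logical block addresses."""
    return [b - a for a, b in zip(lbas, lbas[1:])]


def predict_next_lba(lbas: List[int], window: int = 8) -> Optional[int]:
    """Predict the next LBA as the last LBA plus the most common recent delta.

    Toy frequency-based predictor (an assumption of this sketch), standing in
    for a learned delta model.
    """
    deltas = lba_deltas(lbas)[-window:]
    if not deltas:
        return None  # fewer than two accesses: no delta to learn from
    most_common_delta, _ = Counter(deltas).most_common(1)[0]
    return lbas[-1] + most_common_delta


if __name__ == "__main__":
    # Synthetic sequential trace with a stride of 8 blocks.
    trace = [100, 108, 116, 124]
    print(lba_deltas(trace), predict_next_lba(trace))
```

A rule such as read-ahead corresponds to always assuming one fixed positive delta, whereas a delta-based model chooses the delta from the observed history.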

In this project, our goal is to provide an end-to-end data science pipeline to empower the research on ML-based prefetchers. We believe that this pipeline is crucial for fostering active collaboration between the ML community and storage systems researchers. This collaboration aims to optimize existing ML-based prefetching solutions. Specifically, we will provide the dataset for training/testing and some samples of ML-based models that can further be developed by the community. Furthermore, we will also provide a setup for evaluating the ML model when deployed in storage systems.

Project Deliverable

Compile I/O traces from various open traces and open systems.
Develop a pipeline for building ML-based prefetching solutions.
Build a setup to evaluate the model in a real hybrid storage system.
Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository

Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster

Fri, 02 Feb 2024 00:00:00 +0000

Project Idea description

We aim to characterize the performance of genomic workflows on HPC clusters by conducting two research activities using a broad set of state-of-the-art genomic applications and open-source datasets.

Performance Benchmarking and Characterizing Genomic Workflows:

Topics: High Performance Computing (HPC), Data Analysis, Scientific Workflows
Skills: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration, Genomics Applications (e.g. BWA, FastQC, Picard, GATK, STAR)
Difficulty: Medium
Size: Large (350 hours)
Mentor(s): In Kee Kim

In this activity, students will perform comprehensive performance measurements of genomic data processing on HPC clusters using state-of-the-art applications, workflows, and real-world datasets. They will collect and package datasets for I/O, memory, and compute utilization using industry-standard tools and best practices. Measurement will be done using Kubernetes container orchestration on a multi-node cluster to achieve scalability, with either a custom-made metrics collection system or integration of existing industry-standard tools (e.g. Prometheus).

Quantifying Performance Interference and Assessing Their Impact on Workflow Execution Time:

Topics: Machine Learning, Data Analysis, and Scientific Workflows and Computations
Skills: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration
Difficulty: Difficult
Size: Medium (175 hours)
Mentor(s): In Kee Kim

In this activity, students will measure the slowdown of various applications due to resource contention (e.g. CPU and I/O). Students will analyze whether an application is compute-bound, I/O-bound, or both, then analyze the correlation between resource utilization and execution time. Following that, students will assess the impact of per-application slowdown on the slowdown of a whole workflow. To the best of our knowledge, this will be the first study that systematically quantifies per-application interference when running genomics workflows on an HPC cluster.
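The quantities mentioned above can be sketched in a few lines. The example below defines slowdown as contended time divided by isolated time, computes a Pearson correlation between two measurement series, and derives a whole-workflow slowdown under the simplifying assumption that the workflow's applications run strictly one after another. All function names are hypothetical, and the numbers in the usage section are made up for illustration, not measurements.

```python
from math import sqrt
from typing import Sequence


def slowdown(contended_s: float, isolated_s: float) -> float:
    """Execution time under contention relative to an isolated run."""
    return contended_s / isolated_s


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


def workflow_slowdown(isolated_s: Sequence[float], contended_s: Sequence[float]) -> float:
    """Workflow slowdown, assuming applications execute sequentially."""
    return sum(contended_s) / sum(isolated_s)


if __name__ == "__main__":
    # Made-up numbers: two applications, only the first one is slowed down.
    print(slowdown(150.0, 100.0))
    print(workflow_slowdown([100.0, 50.0], [150.0, 50.0]))
```

Workflows with parallel branches would need a critical-path calculation instead of a simple sum.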

For both subprojects, all experiments will be conducted in a reproducible manner (e.g., as a Trovi package or Chameleon VM images), and all code will be open-sourced (e.g., shared on a public GitHub repo).

Project Deliverable:

A GitHub repository and/or Chameleon VM image containing source code for application executions & metrics collection. Jupyter notebooks and/or Trovi artifacts containing analysis and mathematical models for application resource utilization & the effects of data quality.

StatWrap

Wed, 24 Jan 2024 00:00:00 +0000

Additional information:

Reproducibility Checklists

Topics: reproducibility, user interface, checklists
Skills: JavaScript, React
Difficulty: Medium
Size: Large (350 hours)
Mentor: Luke Rasmussen

The goal of this project is to develop support within StatWrap to generate customizable reproducibility checklists. The developer will use the metadata and user input collected by StatWrap to automatically generate checklists. This functionality will allow investigators to automatically generate a document indicating what practices they’ve followed to support reproducibility. Part of the project will involve surveying proposed reproducibility checklists and considering what to implement in StatWrap. This work will take a systematic approach to documenting reproducibility, much like PRISMA checklists for systematic reviews or CONSORT checklists for clinical trials.

The specific tasks of the project include:

Identify candidate reproducibility checklists to use as guides
Create the data structure for configuring reproducibility checklists
Display the reproducibility checklist in the user interface
Store responses and comments to the checklist as provided by the user
Generate a reproducibility checklist report from StatWrap

StatTag: Connecting statistical software to Microsoft Word

Mon, 22 Jan 2024 00:00:00 +0000

StatTag is a free, open-source software plug-in for conducting reproducible research. It facilitates the creation of dynamic documents using Microsoft Word documents and statistical software, such as Stata, SAS, R, and Python. Users can use StatTag to embed statistical output (estimates, tables and figures) into a Word document and then with one click individually or collectively update output with a call to the statistical program.

What makes StatTag different from other tools for creating dynamic documents is that it allows for statistical code to be edited directly from Microsoft Word. Using StatTag means that modifications to a dataset or analysis no longer require transcribing or re-copying results into a manuscript or table.

StatTag works by interpreting specially formatted comments (“tags”) within a code file. StatTag then reads the code file, executes the code through the corresponding language interpreter, formats the results, and inserts them into the Word document as a field.

There are versions of StatTag for both Microsoft Windows and macOS. Proposed projects here are specific to the Microsoft Windows version, which is developed in the C# programming language.

Additional Information:

Support Additional Programming Languages

Topics: reproducibility, statistics
Skills: C# and one of: MATLAB, Octave, SQL, Julia
Difficulty: Medium
Size: Medium or large (175 or 350 hours)
Mentor: Luke Rasmussen

Following the same structure used for other language support in StatTag, develop support for a new programming language (suggested languages are provided, but applicants can propose others). This will include:

Creating a Parser class to support StatTag-specific interpretation of results (e.g., identifying a line of code that is writing to a CSV file, then loading that CSV file)
Creating an Automation class that manages communication with the supported programming language’s interpreter. Python support uses a Jupyter kernel, and both SAS and Stata support invoke DLLs directly.
Integrating the language into the UI (e.g., allowing it to be a valid code file, adding the icon for the code file to the UI)
Additional setup/configuration as needed (e.g., SQL support would require secure configuration for connecting to the database server).

Develop unit tests to demonstrate code is functioning. Create test scripts in the implemented language to exercise and demonstrate end-to-end execution.

Process Tags in Jupyter Notebooks

Topics: reproducibility, jupyter
Skills: C#, Jupyter Notebooks, Python
Difficulty: Medium
Size: Medium (175 hours)
Mentor: Luke Rasmussen

StatTag currently has support for Python, and utilizes the Jupyter kernel to interact with Python. However, we currently do not fully support processing StatTag “tags” in a Jupyter notebook.

Following the same structure used for RMarkdown integration in StatTag, develop support for Jupyter Notebooks in StatTag. StatTag should be able to:

Take as input one or more Jupyter Notebooks
Confirm that the Jupyter Notebook uses Python
Identify StatTag formatted tags within the notebook
Pass relevant code to the Python processor already implemented in StatTag
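The steps above can be sketched against the notebook file format, which is JSON with a top-level "metadata" object and a list of "cells". The sketch is in Python for brevity, although the Windows version of StatTag is written in C#. The tag delimiters used here are an assumption, as are all function names; the real comment syntax should be taken from StatTag's own tag parser rather than from this example.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumed tag delimiters, for illustration only.
TAG_OPEN = "##>>>ST:"
TAG_CLOSE = "##<<<"


def load_notebook(path: str) -> Dict[str, Any]:
    """Read a .ipynb file, which is plain JSON."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def notebook_language(nb: Dict[str, Any]) -> str:
    """Language recorded in the notebook metadata, lower-cased ('' if absent)."""
    meta = nb.get("metadata", {})
    lang = (meta.get("language_info", {}).get("name")
            or meta.get("kernelspec", {}).get("language")
            or "")
    return lang.lower()


def tagged_blocks(nb: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (tag header, enclosed code) pairs found in code cells."""
    blocks: List[Tuple[str, str]] = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # "source" may be a single string or a list of line strings.
        text = "".join(source) if isinstance(source, list) else source
        header, body = None, []
        for line in text.splitlines():
            stripped = line.strip()
            if stripped.startswith(TAG_OPEN):
                header, body = stripped[len(TAG_OPEN):], []
            elif stripped.startswith(TAG_CLOSE) and header is not None:
                blocks.append((header, "\n".join(body)))
                header = None
            elif header is not None:
                body.append(line)
    return blocks
```

The code in each returned pair is what would be handed to the existing Python processor once notebook_language confirms the notebook uses Python.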

In addition, develop unit tests to demonstrate code is functioning as intended. Create test Jupyter Notebooks to exercise and demonstrate end-to-end execution.

SLICES/pos: Reproducible Experiment Workflows

Sat, 06 Jan 2024 00:00:00 +0000

SLICES-RI is a European research initiative aiming to create a digital research infrastructure providing an experimental platform for the upcoming decades. One of the main goals of this initiative is the creation of fully reproducible experiments. The SLICES research infrastructure will consist of different experiment sites focusing on different research domains such as AI experiments, Cloud and HPC-driven experiments, or investigations on wireless networks.

To achieve reproducibility, the research group on network architectures and services of the Technical University of Munich develops the SLICES plain orchestrating service (SLICES/pos). This framework supports a fully automated structured experiment workflow. The structure of this workflow acts as a template for the design of experiments. Users that adhere to this template will create inherently reproducible experiments, a feature we call reproducible-by-design.

The SLICES/pos framework currently exists in two versions: (1) A fully-managed pos deployment, that uses the SLICES/pos framework to manage the entire testbed and (2) a hosted SLICES/pos deployment. The hosted SLICES/pos deployment is a temporary deployment that runs inside existing testbeds such as Chameleon or CloudLab.

Additional Information:

plain orchestrating service

Support Additional Programming Languages

Topics: reproducibility, statistics
Skills: Python
Difficulty: Medium
Size: Large (350 hours)
Mentor: Sebastian Gallenmüller, Georg Carle, and Kate Keahey

Design a set of basic examples that demonstrate the usage of pos that can be executed on the SLICES/pos testbed in Munich and the Chameleon testbed. This set of basic examples acts as a demonstration of pos’ capabilities and as a tutorial for new users. Based on these introductory examples, a more complex experiment shall be designed and executed, demonstrating the portability of the experiments between testbeds. This experiment involves the entire experiment workflow consisting of the setup and configuration of the testbed infrastructure, the collection of measurement results, and finally, their evaluation and publication. Multiple results of this experiment shall be created on different testbeds and hardware configurations. The results of the experiments will differ depending on the different hardware platforms on which the experiment was executed. These results shall be evaluated and analyzed to find a common connection between the different result sets of the experiments.

Create introductory examples demonstrating the usage of pos
Design and create a portable complex network experiment based on SLICES/pos
Execute the experiment on different testbeds (Chameleon, SLICES/pos testbed)
Analysis of reproduced experiment
Automated analysis of experimental results
Deduction of a model describing the fundamental connections between different experiment executions

Static Python Perf: Measuring the Cost of Sound Gradual Types

Sat, 06 Jan 2024 00:00:00 +0000

Gradual typing is a solution to the longstanding tension between typed and untyped languages: let programmers write code in any flexible language (such as Python), equip the language with a suitable type system that can describe invariants in part of a program, and use run-time checks to ensure soundness.

For now, though, the cost of run-time checks can be enormous. Order-of-magnitude slowdowns are common. This high cost is a main reason why TypeScript is unsound by design — its types are not trustworthy in order to avoid run-time costs.

Recently, a team at Meta built a gradually-typed variant of Python called (drumroll) Static Python. They report an incredible 4% increase in CPU efficiency at Instagram thanks to the sound types in Static Python. This kind of speedup is unprecedented.

Other languages may want to follow the Static Python approach to gradual types, but there are big reasons to doubt the Instagram numbers:

the experiment code is closed source, and
the experiment itself is not easily reproducible (even for Instagram!).

Static Python needs a rigorous, reproducible performance evaluation to test whether it is indeed a fundamental advance for gradual typing.

Related Work:

Gradual Soundness: Lessons from Static Python https://programming-journal.org/2023/7/2/
Producing Wrong Data Without Doing Anything Obviously Wrong! https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf
On the Cost of Type-Tag Soundness https://users.cs.utah.edu/~blg/resources/pdf/gm-pepm-2018.pdf

Design and Run an Experiment

Topics: performance, cluster computing, statistics
Skills: Python AST parsing, program generation, scripting, measuring performance
Difficulty: Medium
Size: Medium (175 hours)
Mentor: Ben Greenman

Design an experiment that covers the space of gradually-typed Static Python programs in a fair way. Since every variable in a program can have up to 3 different types, there are easily 3^20 possibilities in small programs — far too many to measure exhaustively.
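To give a sense of the scale involved, the sketch below computes the size of the configuration space and draws a uniform random sample of distinct configurations. Uniform sampling is only one possible design and is an assumption here, not the project's chosen method; the function names are hypothetical, and the three type options are represented by the placeholder integers 0, 1, and 2.

```python
import random
from typing import List, Sequence, Tuple


def configuration_count(num_variables: int, choices_per_variable: int = 3) -> int:
    """Number of ways to assign one of the type options to every variable."""
    return choices_per_variable ** num_variables


def sample_configurations(
    num_variables: int,
    num_samples: int,
    choices: Sequence[int] = (0, 1, 2),  # placeholder labels for the type options
    seed: int = 0,
) -> List[Tuple[int, ...]]:
    """Draw distinct configurations uniformly at random by rejection.

    Suitable when num_samples is far smaller than the whole space.
    """
    if num_samples > len(choices) ** num_variables:
        raise ValueError("cannot draw more distinct samples than configurations")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    seen = set()
    while len(seen) < num_samples:
        seen.add(tuple(rng.choice(choices) for _ in range(num_variables)))
    return sorted(seen)


if __name__ == "__main__":
    print(configuration_count(20))  # 3486784401
    print(len(sample_configurations(20, 50)))
```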

Run the experiment on an existing set of benchmarks using a cluster such as CloudLab. Manage the cluster machines across potentially dozens of reservations and combine the results into one comprehensive view of Static Python performance.

Derive Benchmarks from Python Applications

Topics: types, optimization, benchmark design
Skills: Python
Difficulty: Medium
Size: Small to Large
Mentor: Ben Greenman

Build or find realistic Python applications, equip them with rich types, and modify them to run a meaningful performance benchmark. Running a benchmark should produce timing information, and the timing should not be significantly influenced by random variables, I/O actions, or system events.

These 4 new features will change the way you use OpenROAD

Sun, 29 Oct 2023 00:00:00 +0000

Introduction

Welcome to the final blog post for my GSoC’23! Once again, my name is Jack and I am working under the open-source electronic design automation project - OpenROAD. We are a fast-growing, leading open-source foundational application for semiconductor digital design, as evidenced by our consistent star growth since inception. You may check us out at this link. Allow me to share the four significant contributions I made in this GSoC project.

1) Improving Ease of Installation

Firstly, OpenROAD is now able to support multiple operating systems. This is essential as one of our primary goals is to democratise chip implementation. And installation is often one of the hardest steps to get right, so that was one of our priorities. Today, we have provided options for different types of installation:

Prebuilt binaries: Local installations can often be riddled with incompatibilities or unexpected bugs, and compilation can take a long time. We sidestepped this by providing semi-regular updates to the OpenROAD binary, reducing the time to installation.
Docker: Echoing previous concerns, we also enabled Docker installation for 9 major operating systems. Docker is extremely flexible and runs on many operating systems (as long as it is supported by Docker).

With these changes, we have observed a 10% reduction in installation-related GitHub issues posted on a weekly basis.

Figure 1: Supported OS matrix

2) Filling Missing Documentation

Next, we have made considerable improvements to over 20 tool-specific documentations, introducing consistent formatting styles for each page. We introduce default values and datatypes to allow users to use the tools with greater ease.

Figure 2: Helpful documentation defaults and datatype

Rather than having all arguments for a function under a common table, we separated them into developer arguments and developer commands. This makes our documentation more beginner-friendly to read, while not alienating our technical userbase. We have also added sections for example scripts and regression tests, to help onboard newcomers to each tool of the flow.

Figure 3: Useful developer commands, example scripts, and regression test instructions

3) Extensible Documentation Framework

Thirdly, we have introduced extensible documentation frameworks. Now, what do we mean by extensible? It means we have created an infrastructure which is easy for developers to use and allows for greater maintainability. Our goal is to create something that requires minimal changes to add content for documentation.

So, how did we do this?

We introduced 4 initiatives. The first is the warning/error messages glossary. We noticed that people were searching for error and warning messages, but our documentation did not have them. So we added a page where all the error/warning messages, along with the relevant code line numbers, can be generated automatically. On top of that, developers can add useful debug information to help the end user.

Figure 4: Warning/Error messages glossary.

Next, we also introduced automatically generated Doxygen pages, which integrates nicely into our C++/Tcl source code framework. This automatic generation will make it much more convenient for developers to just insert comments into their source code, and allow Doxygen to generate documentation automatically.

Figure 5: Doxygen pages.

Next, we introduced cloud-based packaging. It is important that our framework is able to run in the cloud and in the ever-popular notebook format. Our Colab-based notebook was created with this in mind, and allows for easy transfer to other notebook providers with some modifications. Check out the notebooks here!

Figure 6: Google Colab can now run OpenROAD scripts.

Lastly, we have the changelog workflow, which can be triggered manually. For our open-source project, we have chosen not to do software releases. This means it can be difficult to track the changes between commit numbers. Adding this workflow helps newcomers track the changes more easily, by month.

Figure 7: Sample output of github changelog

4) OpenROAD Chatbot

Finally, we are also discussing the potential of creating a chatbot whose purpose is to answer user queries. Our thinking is that there is a lot of domain knowledge in Slack channels, GitHub repos, and so on, so why not create an LLM-based chatbot? Stay tuned for updates!

Personal Reflections

To me, my most valuable takeaway is with regards to code quality. Oftentimes, we as coders tend to opt for the best solution and “hack” something out quickly. Hacking is fine as a proof of concept, but not for long-term code development. Working in open-source projects like this, I have learnt to avoid creating unnecessary files, to shorten the code, and to optimise runtime. In doing our job, we also wish to make life easier, not harder, for future developers.

Final Words

I would like to express my gratitude to my mentors Indira and Vitor for their guidance and insight throughout the project, as well as the OpenROAD dev team for their assistance. I would also like to thank the Google Summer of Code organising committee and UCSC for creating such a wonderful program. Being able to contribute to real open-source projects with real needs is truly the best of both worlds for aspiring programmers.

Final Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings

Wed, 25 Oct 2023 00:00:00 +0000

In my final blog, I will first introduce the project, then describe the achievements after the midterm, and finally summarize our experiments. As part of the Measuring Research Prototypes under Unreported Settings project, my proposal, under the mentorship of Yang Wang and Miao YU, aims to understand the impact of missing settings in artifact evaluation.

In my midterm blog (/report/osre23/osu/missingsettings/20230802-ren.450/), I used three parameters from the PostgreSQL configuration to test performance under the TPC-C benchmark and obtained initial results on how each parameter separately affects throughput. After the midterm, I continued experimenting with four parameters (shared_buffers, min_wal_size, max_wal_size, and effective_cache_size), using more values and combining them to measure their joint effect on performance. These parameters relate to memory consumption, checkpoints, and planner cost in the database server. You can refer to my previous blog for details.

For the experiment, we continue to measure the benchmark's throughput with the scale factor set to 10 while incrementing the number of worker terminals. All database server settings keep their default values except the four parameters we chose to tune. For shared_buffers, we use 6 values, ranging from the initial 128MB to 8GB. For each shared_buffers setting, effective_cache_size takes three values, from the initial 4GB to 16GB. Next, for each effective_cache_size setting, we tune min_wal_size and max_wal_size as a tuple: min_wal_size has two values and max_wal_size has four, for 6 tuples in total. We run three rounds for each setting, record all three throughput numbers, and calculate their average.
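To see how quickly this grid grows, the sketch below enumerates the settings using only the counts stated above. It assumes the three dimensions are fully crossed, as the description suggests, and it uses index placeholders rather than the actual parameter values, which are not all listed here. The constant and function names are hypothetical.

```python
from itertools import product
from statistics import mean
from typing import List, Sequence, Tuple

# Counts taken from the description above; the concrete values are not
# reproduced here, so each dimension is represented by indices.
N_SHARED_BUFFERS = 6
N_EFFECTIVE_CACHE_SIZE = 3
N_WAL_TUPLES = 6  # (min_wal_size, max_wal_size) pairs
ROUNDS = 3


def settings_grid() -> List[Tuple[int, int, int]]:
    """All combinations, assuming the dimensions are fully crossed."""
    return list(product(range(N_SHARED_BUFFERS),
                        range(N_EFFECTIVE_CACHE_SIZE),
                        range(N_WAL_TUPLES)))


def average_throughput(rounds: Sequence[float]) -> float:
    """Average the throughput of the repeated rounds for one setting."""
    if len(rounds) != ROUNDS:
        raise ValueError(f"expected {ROUNDS} rounds, got {len(rounds)}")
    return mean(rounds)


if __name__ == "__main__":
    grid = settings_grid()
    print(len(grid), "settings,", len(grid) * ROUNDS, "benchmark runs")
```

Under these assumptions there are 108 settings and 324 benchmark runs for each number of worker terminals.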

Based on the results, our observations are consistent with the conclusion of the midterm blog. The throughput of the benchmark is affected by tuning shared_buffers and max_wal_size, whereas effective_cache_size and min_wal_size have no obvious effect for this benchmark. The improvement is limited once shared_buffers and max_wal_size reach a certain value.

In our experiment, we chose only three possible parameters for one benchmark. The experiment is expensive in terms of running time, and there are more values of the parameters above left to test. This suggests that we may need to sample a subset of settings whose observations match those of a full, extensive artifact evaluation.

Public Artifact and Data Visualization: A Journey to Empower

Tue, 24 Oct 2023 00:00:00 +0000

Hola Amigos! As we draw the curtains on our project titled Public Artifact and Data Visualization we’re thrilled to present the incredible advancements we’ve achieved since our mid-term update. Our mission has been to foster a deeper understanding of data and empower users to make informed decisions. Let’s delve into the remarkable evolution of our project.

Unveiling New Functionalities

Modular Architecture: Your Way, Your Choice

At the core of our project is a modular architecture designed to cater to your unique preferences. We firmly believe that choice empowers users. Thus, we’ve given you the option to select between a Graphical User Interface (GUI) and a Command-Line Interface (CLI). It’s about providing a platform that adapts to your specific requirements and style of interaction.

Real-time Backend Environment Monitoring: Data as it Happens

Real-time monitoring of backend environment data is at the heart of our project. It’s not just about collecting data; it’s about providing continuous insights into system performance. This feature empowers you to make real-time, data-driven decisions—an essential capability in today’s fast-paced computing landscape.

Visualizing Environment Variables: Clarity Amidst Complexity

We’ve placed a strong emphasis on user-friendly data visualization. Our enhancements enable you to navigate through detected variables effortlessly and compare iterations within different buckets. The result is a visual representation of complex data, making it easier to comprehend and analyze.

Predefined Monitoring Commands: Your Head Start

We understand that monitoring can be a daunting task. To simplify the process, we’ve introduced predefined monitoring commands such as mpstat and iostat. These templates serve as a launchpad for monitoring common system metrics, helping you get started quickly and efficiently.
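As an illustration of what such command templates might look like, here is a minimal sketch that builds mpstat and iostat invocations with a sampling interval and a report count, and runs them only if the tool is installed. The template names, the chosen flags, and the function names are assumptions made for this example; the platform's actual predefined commands may differ.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

# Hypothetical templates; both tools come from the sysstat package and accept
# "<interval> <count>" as trailing arguments.
PREDEFINED: Dict[str, List[str]] = {
    "cpu": ["mpstat", "-P", "ALL"],   # per-CPU utilization
    "disk": ["iostat", "-x"],         # extended device statistics
}


def build_command(name: str, interval_s: int = 1, count: int = 5) -> List[str]:
    """Expand a template into an argv list with interval and count appended."""
    return PREDEFINED[name] + [str(interval_s), str(count)]


def run_if_available(name: str, interval_s: int = 1, count: int = 1) -> Optional[str]:
    """Run the command and return its output, or None if the tool is missing."""
    argv = build_command(name, interval_s, count)
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    print(build_command("cpu", interval_s=2, count=3))
```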

Comprehensive Customization: Tailoring the Experience

Recognizing that every user has unique needs, our platform now offers extensive documentation. This documentation serves as a guide, enabling users to fine-tune their monitoring commands. It’s about tailoring the platform to match your specific requirements and preferences. The power to customize is firmly in your hands.

Import and Export Functionality: Seamless Collaboration

In an era where collaboration and data management are essential, we’ve introduced the capability to import and export environment data. This feature simplifies data management and supports collaborative efforts, making it easy to share monitoring data and conduct analysis across various environments.

Exploring Our Repositories

As mentioned earlier, we have completed the core functionalities of our platform, and we would love to have you try it out and provide us with valuable feedback. Here are the links to our repositories where you can explore and experiment with our platform:

GUI Repository and CLI Repository
- The journey begins with a choice. Our repositories cater to a diverse range of user preferences. Inside the README.md file of the GUI repository, you’ll find meticulous installation instructions to guide you through setting up the Graphical User Interface (GUI). It’s your portal to a user-friendly experience.
Sample Repository
- For those eager to embark on their monitoring journey, our Sample Repository is a valuable resource. It provides scripts that not only enable you to run our program but also serve as templates. These templates are designed to simplify the monitoring of your own programs, tailored to your unique requirements.

Project Demo

To provide you with a glimpse of what our project can do, here are some demo images showcasing the capabilities and features of “Public Artifact and Data Visualization.”

Thank You for Joining Us

We appreciate your support and participation in this journey of data visualization and empowerment. Our commitment to enhancing the world of data comprehension remains unwavering. As we mark the end of this chapter, we eagerly anticipate the exciting future that awaits in the realm of data visualization. The path doesn’t end here; it’s just the beginning of a new chapter in our collective exploration of data’s potential.

Final Blog on Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery

Fri, 20 Oct 2023 00:00:00 +0000

Hello Again!

I’m excited to present my final blog post summarizing the progress and achievements made over the 2023 Summer of Reproducibility Fellowship. I will be sharing the work I’ve created for the project Teaching Computer Networks with Reproducible Research: Developing a ‘classroom competition’ for adaptive video delivery.

Recap of the Journey

In my mid-term evaluation, I discussed the initial milestones and challenges I encountered during this program. At that point, I studied the key figures from the research paper ‘Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming’. My primary objectives were to ensure compatibility with both Python 2 and Python 3 and to incorporate an ‘Estimated Download Rate’ metric into the output file generated by the adaptive video client. Furthermore, I expanded the project to include two crucial visualizations: buffer occupancy vs. time and estimated download rate vs. time.

Final Project Progress

In the final weeks of my internship, I worked towards my ultimate goal, which was to reproduce existing work and create a clear guide for future students. I aimed to enable them to build upon and improve this work. To achieve this, I created a new experiment based on an existing one, which I titled “Compare Adaptive Video Policies”.

This experiment compares two policies: rate-based (basic) policy and buffer-based (Netflix) policy. In the experiment, I covered the following key aspects:

How Both Policies Work: I detailed the workings of both the rate-based and buffer-based policies, explaining how each policy selects the next bitrate, among other relevant information.

Instructions for Execution of Policies: After conducting several experiments with different settings, I determined the most appropriate settings for this experiment. These settings have been added to the instructions for executing both policies, with a focus on ensuring similar “high” network rates, “low” data rates, similar durations of the “high” data rate before the interruption, and similar durations of the “interruption.” This setup allows for an easy and clear comparison of the two policies.

Discussion Part: In the discussion section, I addressed the differences that students can observe after conducting the experiment and visualising the graphs and videos.
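To make the difference between the two policies concrete, here is a minimal sketch of how each one might choose the next bitrate. It is not the code used in the experiment: the bitrate ladder, the reservoir and cushion sizes, and the function names are all hypothetical, and the buffer-based function is a stateless simplification of the idea in the buffer-based paper (it omits the hysteresis that keeps the client from switching rates too often).

```python
# Hypothetical bitrate ladder in kbps; a real client reads this from the video manifest.
BITRATES_KBPS = [235, 375, 560, 750, 1050, 1750, 2350, 3000]


def rate_based(estimated_rate_kbps, bitrates=BITRATES_KBPS):
    """Pick the highest bitrate that does not exceed the estimated download rate."""
    candidates = [b for b in bitrates if b <= estimated_rate_kbps]
    return max(candidates) if candidates else min(bitrates)


def buffer_based(buffer_s, bitrates=BITRATES_KBPS, reservoir_s=5.0, cushion_s=10.0):
    """Pick a bitrate from buffer occupancy alone.

    Below the reservoir the lowest bitrate is used, above reservoir + cushion
    the highest, and in between the target grows linearly with the buffer level.
    The reservoir and cushion values here are illustrative assumptions.
    """
    lowest, highest = min(bitrates), max(bitrates)
    if buffer_s <= reservoir_s:
        return lowest
    if buffer_s >= reservoir_s + cushion_s:
        return highest
    target = lowest + (highest - lowest) * (buffer_s - reservoir_s) / cushion_s
    return max(b for b in bitrates if b <= target)


# The rate-based policy reacts to the throughput estimate ...
print(rate_based(1000))
# ... while the buffer-based policy reacts only to how much video is buffered.
print(buffer_based(10.0))
```

During an interruption, the rate-based choice drops as soon as the throughput estimate drops, whereas the buffer-based choice steps down gradually as the buffer drains. This is the kind of difference the experiment asks students to observe.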

In conclusion, I would like to thank my mentor, Fraida Fund, who has given me excellent guidance, and I would like to express my gratitude to OSRE23, where I have learned so much. This experience has been amazing for my personal and professional growth.

Final Blog on Using Reproducibility in Machine Learning Education

Wed, 18 Oct 2023 00:00:00 +0000

Welcome back!

In my final blog post for the 2023 Summer of Reproducibility Fellowship, I’ll be sharing my experiences and the materials I’ve created for the Using Reproducibility in Machine Learning Education project. As a quick reminder, my mentor Fraida Fund and I have been working on developing interactive open-source educational resources that teach reproducibility and reproducible research in machine learning. You can find my proposal here.

In this post, I’ll give you a rundown of my experience and share the materials I’ve created. If you haven’t checked out my previous blog posts, definitely take a look before diving into this one. Let’s get started!

Why is this project important 🤔

Reproducibility is an essential aspect of scientific research, and it’s becoming increasingly important in the field of computer science. However, most efforts to promote reproducibility in education focus on students who are actively involved in research, leaving a significant gap in the curriculum for introductory courses. Our project aims to address this issue by incorporating reproducibility experiences into machine learning education.

Why Reproducibility Matters in Education 🎓

There are two primary reasons why we believe reproducibility belongs in the computer science classroom. Firstly, it allows students to experience the process of reproducing research firsthand, giving them a deeper understanding of the scientific method and its importance in the field. This exposure can inspire students to adopt reproducible practices in their future careers, contributing to a more transparent and reliable scientific community.

Source: Fund, Fraida. “We Need More Reproducibility Content Across the Computer Science Curriculum.” Proceedings of the 2023 ACM Conference on Reproducibility and Replicability. 2023.

Secondly, as shown in the figure, involving students in reproducibility efforts can have a significant impact on the reproducibility ecosystem itself. Students can create reproducibility artifacts, such as replicable experiments or data analysis, that can be used by other researchers, including authors and graduate students. Additionally, students can consume reproducibility artifacts created by the research community, provide feedback, and suggest improvements. Authors appreciate this type of engagement, as it adds value to their work and promotes open science.

Focusing on Machine Learning 🧐

Given the growing interest in machine learning and its relevance to reproducibility, our project decided to focus on this area. Machine learning already has a strong culture of reproducibility, with initiatives like Papers with Code and the ML Reproducibility Challenge. These efforts encourage researchers to share their code and reproduce recent machine learning papers, validating their results. By leveraging these existing resources, we can create learning materials that utilize real-world examples and foster hands-on reproducibility experiences for students.

The Interactive Notebooks 📖

We have created two learning materials that focus on machine learning and reproducibility. The first material looks at a paper titled “On Warm Starting Neural Network Training” by Jordan T. Ash and Ryan P. Adams. This paper discusses the concept of warm-starting, which involves using weights from a model previously trained on a subset of the dataset to initialize the training of a new model. The authors compare the performance of warm-started models with randomly initialized models and find that the warm-started models perform worse, as shown in the figure below.

Our material takes students through the process of identifying the different claims made in the paper and finding the corresponding experiments that support them. They will also learn how to use open-source code and available data to reproduce these experiments and understand the computational complexity associated with reproducing each experiment. This material can be found on both GitHub and Chameleon, where you can run it on the required resources.
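To make the warm-starting setup concrete, here is a minimal sketch of the procedure on a toy linear model in plain Python. It is not taken from the paper or from our notebooks, and every name and value in it is an assumption chosen for illustration. A convex toy problem like this one will not reproduce the generalization gap the paper reports for neural networks; the sketch only shows what "warm-started" and "randomly initialized" mean in terms of where training begins.

```python
import random


def train(weights, data, epochs=200, lr=0.1):
    """Plain gradient descent on mean squared error for y ~ w0 + w1 * x."""
    w0, w1 = weights
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in data:
            err = (w0 + w1 * x) - y
            g0 += err
            g1 += err * x
        w0 -= lr * g0 / len(data)
        w1 -= lr * g1 / len(data)
    return (w0, w1)


def mse(weights, data):
    w0, w1 = weights
    return sum(((w0 + w1 * x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
# Toy dataset: noisy samples of y = 2x + 1 (purely illustrative).
full_data = [(i / 10, 2.0 * (i / 10) + 1.0 + rng.gauss(0, 0.1)) for i in range(20)]
first_half = full_data[: len(full_data) // 2]

# Step 1: train a model on the subset of data that is available first.
pretrained = train((0.0, 0.0), first_half)

# Step 2a (warm start): keep the pretrained weights and continue on all data.
warm = train(pretrained, full_data)

# Step 2b (baseline): discard them and start from a fresh random initialization.
cold_init = (rng.uniform(-1, 1), rng.uniform(-1, 1))
cold = train(cold_init, full_data)

print("warm-start MSE:", mse(warm, full_data))
print("fresh-init MSE:", mse(cold, full_data))
```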

The second material examines the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy et al., which introduces a novel way of applying the transformer architecture, which was originally designed for natural language processing, to image recognition tasks. The paper shows that transformers can achieve state-of-the-art results on several image classification benchmarks, such as ImageNet, when trained on large-scale datasets as shown in the following table.

Our material guides students through the process of understanding which claims can and cannot be validated based on the available datasets and how complex it can be to validate each claim. Additionally, they will learn how to use pre-trained models to replicate computationally expensive experiments. Again, this material can be found on both GitHub and Chameleon.

Both materials are designed to be easy to understand and interactive, allowing students to engage with the content and gain a deeper understanding of the concepts. Instructors can use these materials to assess their students’ understanding of machine learning and reproducibility.

Reflecting on the Journey

As we wrap up our journey of creating beginner-friendly learning materials for machine learning using reproducibility, it’s time to reflect on the rewarding experiences and valuable lessons learned along the way. Our deep dive into the world of machine learning and reproducibility not only enriched our knowledge but also provided us with an opportunity to contribute to the community at the UC Open Source Symposium 2023 at UCSC.

The symposium was a memorable event where we presented our work in a poster session. The diversity of the audience, ranging from professors and researchers to students, added depth to our understanding through their valuable feedback and insights. It was intriguing to see the potential applications of our work in various contexts and its capacity to benefit the broader community.

This project has been a personal journey of growth, teaching me much more than just machine learning and reproducibility. It honed my skills in collaboration, communication, and problem-solving. I learned to distill complex ideas into simple, accessible language and create engaging, interactive learning experiences. The most fulfilling part of this journey has been seeing our work come alive and realizing its potential to positively impact many people. The gratification that comes from creating something useful for others is unparalleled, and we are thrilled to share our materials with the world.

Your time and interest in our work are greatly appreciated! Hope you enjoyed this blog!

Learning Machine Learning by Reproducing Vision Transformers

Fri, 06 Oct 2023 00:00:00 +0000

Hello again!

In this blog post, I will be discussing the second material I created for the 2023 Summer of Reproducibility Fellowship. As you may recall from my first post, I am working on the Using Reproducibility in Machine Learning Education project with Fraida Fund as my mentor. My goal is to create interactive open-source educational resources that teach reproducibility and reproducible research in machine learning (ML), as outlined in my proposal.

In this post, I will share with you my second material, and how it can be helpful in a machine learning class to teach students about vision transformers and reproducibility at the same time. If you haven’t seen my first work, be sure to check out my previous blog post. Without further ado, let’s dive in!

Reproducing “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”

This material is a reproduction of Dosovitskiy et al.’s 2020 paper, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. This paper introduces the Vision Transformer (ViT), a novel architecture that applies the transformer model, originally designed for natural language processing tasks, to image recognition. The ViT model achieves state-of-the-art performance on several image classification benchmarks, demonstrating the potential of transformers for computer vision tasks.

The figure illustrates the key idea behind ViT, which is to treat an image as a sequence of patches, similar to how a transformer treats a sentence as a sequence of words. Each patch is flattened into a vector and fed into the transformer encoder, which learns to capture the complex relationships between these patches. The resulting representation is then fed into an MLP head, which produces a final prediction for the image. This approach allows ViT to handle large input images and capture both global context and fine-grained details. ViT models can also be pre-trained on large datasets and fine-tuned on smaller datasets for improved performance.
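The patch step described above can be sketched in a few lines of plain Python. This is only an illustration, not code from the paper or from my notebooks: the function name and the toy image are assumptions, and the sketch stops before the parts of ViT that follow patching (the learned linear projection, the class token and the position embeddings).

```python
def image_to_patches(image, patch_size):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches.

    Patches are returned in row-major order, each as a flat list of
    patch_size * patch_size * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches


# Toy 4x4 image with 3 channels, where each pixel stores [row, column, 0].
image = [[[r, c, 0] for c in range(4)] for r in range(4)]
patches = image_to_patches(image, patch_size=2)
print(len(patches), len(patches[0]))  # 4 patches, each of length 2 * 2 * 3 = 12
```

With the 16x16 patches from the paper's title, a 224x224 RGB image would become a sequence of 14 x 14 = 196 patches, each flattened to 16 x 16 x 3 = 768 values.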

To reproduce this paper, I followed a systematic approach to ensure reliable results:

Critically analyze the paper’s qualitative and quantitative claims.
Identify the necessary experiments to verify each claim.
Determine the required data, code, and hyperparameters for each experiment.
Utilize pre-trained models for validating claims that require high computational resources.
Investigate resources shared by the authors, such as code, data, and models.
Assess the feasibility of verifying different types of claims.
Design new experiments for validating qualitative claims when certain models or datasets are unavailable.

I utilized Chameleon as my platform for conducting and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental environment that supports computer science systems research. It enables users to create and share Jupyter notebooks capable of running Python code on Chameleon’s cloud servers. For this work, running the notebooks requires a GPU with 24GB or more of memory, which is available among the GPU types Chameleon offers.

I have set up a GitHub repository where you can access all of my reproduction work. The repository contains interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research. These notebooks provide a hands-on approach to understanding the concepts and techniques presented in my reproduction work.

Challenges

Reproducing a paper can be a challenging task, and I encountered several obstacles during the process, including:

The unavailability of pretraining datasets and pretrained models
Inexact or unspecified hyperparameters
The need for expensive resources for some hyperparameters
The use of different frameworks for baseline CNNs and Vision Transformers

These issues posed significant difficulties in replicating the following table, a key result from the Vision Transformer paper that demonstrates its superiority over prior state-of-the-art models.

To overcome these challenges, I used the same models mentioned in the paper but pretrained on different datasets, experimented with various hyperparameter combinations to achieve the best results, and wrote my own code to ensure that both the baseline and Vision Transformer were fine-tuned using the same framework. I also faced other challenges, which I discussed in my notebooks along with the solutions I applied.

How to use this material?

This material consists of a series of notebooks that guide you through the paper, its claims, experiments, and results. You will learn how to analyze, interpret, and validate the authors’ claims. To get started, I recommend briefly skimming the original paper to gain an understanding of the main ideas and public information. This will help you see how the authors could have been more transparent and clear in certain sections. The notebooks provide clear instructions and explanations, as well as details on how I addressed any missing components.

Conclusion

In this blog post, I’ve walked you through the contents of this material and the insights users can gain from it. This material is particularly intriguing as it replicates a paper that has significantly influenced the field of computer vision. The interactive nature of the material makes it not only educational but also engaging and enjoyable. I believe users will find this resource both fun and beneficial.

I hope you found this post informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for reading and stay tuned for more updates!

noWorkflow as an experiment management tool - Final Report

Thu, 14 Sep 2023 00:00:00 +0000

This post describes our final work status and the achievements we have made in our project proposal for noWorkflow.

For a friendlier introduction to our work, please refer to this tutorial.

Our final code to merge is available in this repository.

Different ways of managing experiments

From our starting point at the midterm, and from our initial aspirations for the SoR, we kept on track with the goal of adding features to noWorkflow related to managing DS/ML experimental setups focusing on reproducibility.

With the emergence of AI across multiple fields in industry and academia, the subject of reproducibility has become increasingly relevant. In [1] we have an interesting description of the sources of irreproducibility in Machine Learning. All these sources are present at different stages during the project's experimental phases and may even persist in production environments, leading to the accumulation of technical debt [2]. The problem of irreproducibility is also discussed in [3] and [4], which point out that speed of delivery usually comes at the expense of reproducibility, among other casualties.

The CRISP-DM process as reviewed in [5] demonstrates that Data Science experiments follow a typical path of execution. In the same manner, [3], [6] and [7] point out that Machine Learning pipelines are composed of well-defined layers (or stages) throughout their lifecycle. The emergence of AI in real-world applications put stress on the almost artisanal ways of creating and managing analytical experiments and reinforced that there is room to make things more efficient.

In the search for possible approaches to the problem, we came across several projects that aimed to address these issues. Not surprisingly, multiple authors pursued the same goal, for instance [9] and [10]. In these references, and as confirmed in our survey, we found everything from targeted solutions for specific modeling steps to services aiming for end-to-end AIOps management. Some are available as software packages, others as SaaS in cloud environments. In general terms, all of them end up offering features in different layers of the workflow (i.e. data, feature, scoring, and evaluation) or with different conceptualizations of reproducibility/replicability/repeatability, as noted by [11]. On one hand, this lack of standards makes any assessment difficult. On the other hand, it suggests a community still exploring a hot topic.

Specifically for this project, our focus is on the initial stages of computational scientific experiments. As studied in [8], in this phase, experiments are i) implemented by people as prototypes, ii) with minor focus on pipeline design and iii) in tools like Notebooks, which mix documentation, visualization and code with no required sequential structure. These three practices impact reproducibility and efficiency and are prone to creating technical debt. However, tools like noWorkflow show huge potential in such scenarios. They are promising because they i) demand minimal setup to be functional, ii) work well with almost nonexistent workflows, iii) require minimal additional intrusive code alongside the experimental code and iv) integrate well with Notebooks, which are the typical artifact in these experiments.

According to its core team, the primary goal of noWorkflow is to "...allow scientists to benefit from provenance data analysis even when they don't use a workflow system". Unlike other tools, "noWorkflow captures provenance from Python scripts without needing a version control system or any other environment". This is particularly interesting in the scenario described above, where we lack any structured system at the beginning of experiments. In fact, after going through the docs, we can verify that noWorkflow provides:

Command-line accessibility
Seamless integration with Jupyter Notebooks
Minimal setup requirements in your environment
Elimination of the need for virtual machines or containers in its setup
Workflow-free operation
Open source license
Framework-agnostic position

Finally, in our research, we confirmed that there is an open spot in the management of scientific experiments that needs to be filled by reproducibility. Provenance tools can help academic and industry groups toward this goal, and this summer we focused on adding relevant features to advance noWorkflow in this direction.

Different tools for different needs

In our research phase, we didn't find any taxonomy that fully accommodated our review of the different categories of tools providing reproducibility and experiment management. So, we describe the tools in the following categories (freely adapted from these online references [here] and [here]):

Data and Pipeline Versioning: Platforms dealing with ingestion, processing, and exposing of features for model training and inference. They enable collaboration and discoverability of already existing Feature Sets throughout the teams and organizations. Provide provenance and lineage for data in different levels of complexity.

Metadata Stores/Experiment Trackers: They are specifically built to store metadata about ML experiments and expose it to stakeholders. They help with debugging, comparing, and collaborating on experiments. It is possible to divide them into Experiment Trackers and a Model Registry. Moreover, there are projects offering reproducibility features like hyperparameter search, experiment versioning, etc. However, they demand more robust workflows and are better suited for projects in the production/monitoring phases.

Pipeline frameworks: They operate within the realm of production, similar to Data Engineering workflows. Their usual goal is to allow any ML/AI products to be served across a wide range of architectures, and integrate all the low-hanging fruits along the way. For instance, pipelines adding hyperparameter optimization tasks, experiment tracking integrations, boilerplate containerized deployment, etc.

Deployment and Observability: They focus on deploying models for real-time inference and monitoring model quality once they are deployed in production. Their aim is to facilitate post-deployment control tasks such as monitoring feature drifts, conducting A/B testing, facilitating fast model shifts, and more.

The most remarkable aspect of this survey is that there are different tools for different phases in the life cycle of AI products. Tools like DVC and Pachyderm are Metadata Stores, allowing Experiment Tracking with variable tagging, as well as Data and Pipeline tracking. They are the most similar tools to noWorkflow in functionality. However, DVC has a more complex framework for dealing with different 'types' of tags, and relies on command-line tools to extract and analyze tagged variables. It also depends strongly on Git and replicates Git's logic. Pachyderm requires a more sophisticated setup at the start, relying on containers and a server. This is an obstacle for small and lean prototypes, since it requires installing a Docker image and brings all the friction of managing it.

There are other tools, like MLflow and Neptune, that position themselves as Model Experiment Versioning tools with Monitoring and Deployment features. They also have elements of pipeline frameworks, offering boilerplate for seamless integration with cloud platforms.

Pipeline frameworks are a vast field that includes AWS SageMaker, Google Vertex, DataRobot and Weights & Biases, among others. All of them offer features in all categories, with a strong focus on exploring every automation that can be offered to the end user, such as automatic parameter tuning, model selection, retraining, data lineage, metadata storage, etc.

Finally, Deployment and Observability frameworks are in the deployment realm, which is another stage far removed from the prototypical phases of experiments. They come into the scene when all experimental and inferential processes are done, and there is an AI artifact that needs to be deployed and monitored. Tools such as Seldon, H2O and DataRobot do this job, again with some features for hyperparameter tuning, pipeline frameworks, and data and pipeline tracking.

In light of this, when considering the management and operation of experiments, we have a reduced sample of alternatives. Among them, Notebook integration and management are rare. Some rely on other tools like Git, or impose coding and setup overhead through reserved keywords, tags and managerial workflows that hinder the process.

At first sight, our "informal" taxonomy positions noWorkflow as a Data/Pipeline Versioning tool and a Metadata Store/Experiment Tracker. It is not a Pipeline Framework, which works like a building block, facilitating the integration of artifacts at production stages. Nor is it a Deployment and Observability framework, because those belong to the post-deployment realm, a stage far removed from the prototypical phases of experiments.

Desiderata

As mentioned earlier, a typical workflow in DS/ML projects is well described by CRISP-DM [5] and precedes the deployment and production phases in the whole lifecycle of DS/ML projects.

Fig 1: CRISP-DM example of trajectory through a data science project

Briefly speaking, a workflow starts when a user creates a Jupyter Notebook and starts writing code. Usually, they import or select data from a source, explore the features expected to have the highest inference potential, tune some parameters to set up training, then train the model and evaluate its predictive power through different metrics. At this final step, we have delineated a trial. This trial result can suggest further improvements and new hypotheses about data, features, model types and hyperparameters. Then, we have a new experiment in mind that will result in a new trial.

When this process repeats multiple times, a researcher may end up with different notebooks, each storing a different experiment. Each notebook has multiple hyperparameters, modeling choices and modeling hypotheses. Alternatively, the experimenter may have a single notebook where different experiments were executed, in a nonlinear order across the cells. This latter case is pointed out in [8], where Notebook flexibility makes it difficult to understand which execution order resulted in a specific output.

Ideally, any researcher or team would benefit most if they could:

a) In a running Notebook, retrieve all the operations that contributed to the result of a variable of interest. In this case, modifications applied to the inputs or to the order of operations would be easily detectable, as would any nonlinear execution that interferes with a control result.

b) Compare trials after different experiments. After experimenting with different hypotheses about hyperparameters, features or operation order, the user should easily compare the history of two trials and spot differences.

c) Retrieve a target variable among different trials that were executed in the context of an experiment. After proceeding with multiple experimental trials, users should be able to compare the results, whether or not they are stored in different Notebooks.

d) Be as "no workflow" as possible. All the former requirements should be achievable with minimal code intervention, tags, reserved words or any active coding effort.

With these goals in mind, we worked on our deliverables and used the experiment carried out by [12] as a guideline to validate the new noWorkflow features.

Deliverables

In this section, we describe what we implemented during this summer.

We started with tagging cells and variables and then navigating through their pre-dependencies, that is, all other variables and function calls that contributed to their final value. This was a fundamental step that allowed us to build features that are really useful in day-to-day practice.
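To illustrate what "navigating through pre-dependencies" means, here is a minimal sketch of a backward traversal over a dependency record. It is not how noWorkflow is implemented: the DEPENDS_ON dictionary, the variable names in it and the backward_dependencies function are all hypothetical, and noWorkflow collects this kind of information automatically from the executed code rather than from a hand-written table.

```python
from collections import deque

# Hypothetical record of which names each assignment read from, e.g.
# `accuracy = evaluate(model, test_set)` makes accuracy depend on model and test_set.
DEPENDS_ON = {
    "raw": [],
    "train_set": ["raw"],
    "test_set": ["raw"],
    "learning_rate": [],
    "model": ["train_set", "learning_rate"],
    "accuracy": ["model", "test_set"],
    "unrelated_plot": ["raw"],
}


def backward_dependencies(target, depends_on):
    """Return every name that directly or indirectly contributed to `target`."""
    found = []
    queue = deque([target])
    while queue:
        name = queue.popleft()
        for parent in depends_on.get(name, []):
            if parent not in found:
                found.append(parent)
                queue.append(parent)
    return found


# Everything that fed into the tagged variable; unrelated_plot is not included.
print(backward_dependencies("accuracy", DEPENDS_ON))
```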

From the features of tagging a cell and tagging a variable, we evolved to the following features (an interactive notebook is available here):

backwards_deps('var_name', glanularity_level) : returns a dictionary storing the operations/function calls, and their associated values, that contributed to the final value of the tagged variable. glanularity_level sets whether the internal operations of the functions are included.
global_backwards_deps('var_name', glanularity_level) : does the same as backwards_deps, but across all tagging and re-tagging events in the notebook. It retrieves the complete set of operations behind a tagged variable across all executed cells in the notebook.
store_operations(trial_id, dictionary_ops) : saves the current trial for later comparison with other experiments. The dictionaries aren't stored in .noworkflow/db.sqlite, but in a shelve object named *ops.db* in the current notebook's local folder.
resume_trials() : to support the management of experiments, lists the trial_ids of all experiments stored in ops.db and available for comparison/analysis.
trial_intersection_diff(trial_id1, trial_id2) : compares the scalar values of all variables/function calls that two experiments have in common.
trial_diff(trial_id1, trial_id2) : shows the values of variables and function calls in a diff file format, emphasizing the order of operations. The goal here is to show when the order of operations differed between the two experiments. Again, only scalar values are shown. More complex data structures (matrices, vectors, tensors, etc.) are only signaled as 'complex_type'.
var_tag_plot('var_name') : charts the evolution of a given variable across multiple trials in the database. In this case, all experiments stored in ops.db and tagged as *target_var* have their values plotted.
var_tag_values('var_name') : returns a pandas DataFrame with the var_name entries and their corresponding values across different trials.
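The idea behind storing trials in a shelve object and then comparing them can be sketched with the Python standard library alone. This is not the noWorkflow API or its implementation: the helper names (save_trial, list_trials, shared_value_diff, ordered_diff), the trial contents and the storage layout are assumptions made for illustration.

```python
import difflib
import os
import shelve
import tempfile


def save_trial(db_path, trial_id, operations):
    """Persist one trial as an ordered list of (name, scalar_value) pairs."""
    with shelve.open(db_path) as db:
        db[trial_id] = operations


def list_trials(db_path):
    """Return the ids of all stored trials."""
    with shelve.open(db_path) as db:
        return sorted(db.keys())


def shared_value_diff(db_path, id_a, id_b):
    """For names present in both trials, report the values that differ."""
    with shelve.open(db_path) as db:
        a, b = dict(db[id_a]), dict(db[id_b])
    return {n: (a[n], b[n]) for n in a if n in b and a[n] != b[n]}


def ordered_diff(db_path, id_a, id_b):
    """Diff the two trials line by line, so changes in operation order show up."""
    with shelve.open(db_path) as db:
        a = [f"{n} = {v}" for n, v in db[id_a]]
        b = [f"{n} = {v}" for n, v in db[id_b]]
    return list(difflib.unified_diff(a, b, fromfile=id_a, tofile=id_b, lineterm=""))


# Hypothetical trials: the second one changes a hyperparameter and the step order.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ops")
    save_trial(path, "trial_1", [("n_estimators", 100), ("scale", True), ("accuracy", 0.91)])
    save_trial(path, "trial_2", [("scale", True), ("n_estimators", 200), ("accuracy", 0.93)])
    trials = list_trials(path)
    changed = shared_value_diff(path, "trial_1", "trial_2")
    diff_lines = ordered_diff(path, "trial_1", "trial_2")

print(trials)
print(changed)
print("\n".join(diff_lines))
```

The value comparison ignores ordering and only answers "which shared names changed", while the line-based diff makes a reordering of operations visible. That split mirrors the two comparison features described above.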

Challenges

As expected, we ran into unexpected findings along the project. Below, we delve into the most significant challenges we faced:

Jupyter notebooks allow a nonlinear execution of small parts of code through cells. More than once, we had to align about how to create functionalities to attend different scenarios that were unexpected. One example was the backwards_deps() and global_backwards_deps() functions. The latter function was born to cover the case where the user wants all dependencies rather than the local cell dependencies.
Despite the high quality of the current version of the package, the project needs documentation, which slows down the analysis of any new development. In this project, the aid of mentors was crucial at some points where a deeper knowledge was needed.
What is the vocation of noWorkflow? At some points in the project, we discussed forcing some kind of workflow on the user, which would go against the philosophy of the project.
When working on comparing results, especially in DS/ML fields, complex types arise. Numerical vectors, matrices, and tensors from NumPy and other frameworks, as well as data frames, can't be properly manipulated based on our current approach.
The dilemma of focusing on graphic visual features versus more sophisticated APIs. More than once, we needed to choose between making a visual add-on to Jupyter or implementing a more complete API.
The current Jupyter support in noWorkflow doesn’t integrate well with Jupyter Lab. IPython itself also keeps releasing new versions, and noWorkflow needs to adapt to them.

Future Improvements

Given our current achievements and the insights gained along the project, we would highlight the following points as crucial future roadmap improvements:

Add complex type treatment for comparisons. Today, visualizing and navigating through matrices, data frames, and tensors isn't possible with noWorkflow, although users can do so by their own means.
Migrate the dictionaries that store sequences of operations from shelve objects to a more efficient storage and retrieval mechanism.
Make it easier for users to manage (store, retrieve, and navigate) through different trials.
Add graphical management instead of relying upon API calls only.
Evolve the feature of tagging cells.
When tagging a model, save its binary representation to be recovered in the future.
Add the capability of tracking local dataset reads. Currently, it is possible to track changes in the name/path of the dataset. However, modifications to the contents of a dataset are not traceable.

What I've learned

This was a great summer with two personal discoveries. The first was my first formal contact with the subject of reproducibility. The second was contributing fully to an open source project. In the research phase, I got in touch with the state of the art in reproducibility research and some of its nuances. In the open source experience, I was mentored by the core team of noWorkflow and exercised the skills required to build a high-quality software product.

Acknowledgments

I would like to thank the organizers of the Summer of Reproducibility for providing this wonderful opportunity for interested people to engage with open source software. Also, thanks to the core team of noWorkflow for supporting me in doing this work.

Bibliography

[1] [O. E. Gundersen, K. Coakley, C. Kirkpatrick, and Y. Gil, “Sources of irreproducibility in machine learning: A review,” arXiv preprint arXiv:2204.07610.]

[2] [D. Sculley et al., “Machine Learning: The High Interest Credit Card of Technical Debt,” in SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.]

[3] [P. Sugimura and F. Hartl, “Building a reproducible machine learning pipeline,” arXiv preprint arXiv:1810.04570, 2018.]

[4] [D. Sculley et al., “Hidden technical debt in machine learning systems,” Adv. Neural Inf. Process. Syst., vol. 28, 2015.]

[5] [F. Martínez-Plumed et al., “CRISP-DM twenty years later: From data mining processes to data science trajectories,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 8, pp. 3048–3061, 2019.]

[6] [N. A. Lynnerup, L. Nolling, R. Hasle, and J. Hallam, “A Survey on Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on Real-World Robots,” in Proceedings of the Conference on Robot Learning, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., in Proceedings of Machine Learning Research, vol. 100. PMLR, 30 Oct--01 Nov 2020, pp. 466–489.]

[7] [A. Masood, A. Hashmi, A. Masood, and A. Hashmi, “AIOps: predictive analytics & machine learning in operations,” Cognitive Computing Recipes: Artificial Intelligence Solutions Using Microsoft Cognitive Services and TensorFlow, pp. 359–382, 2019.]

[8] [J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire, “Understanding and improving the quality and reproducibility of Jupyter notebooks,” Empirical Software Engineering, vol. 26, no. 4, p. 65, 2021.]

[9] [D. Kreuzberger, N. Kühl, and S. Hirschl, “Machine Learning Operations (MLOps): Overview, Definition, and Architecture,” IEEE Access, vol. 11, pp. 31866–31879, 2023.]

[10] [N. Hewage and D. Meedeniya, “Machine learning operations: A survey on MLOps tool support,” arXiv preprint arXiv:2202.10169, 2022.]

[11] [H. E. Plesser, “Reproducibility vs. replicability: a brief history of a confused terminology,” Front. Neuroinform., vol. 11, p. 76, 2018.]

[12] [Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, “The effect of feature extraction and data sampling on credit card fraud detection,” Journal of Big Data, vol. 10, no. 1, pp. 1–17, 2023.]

Reproducible Evaluation of Multi-level Erasure Coding (Midterm)

Sat, 05 Aug 2023 00:00:00 +0000

Hi Everyone,

I hope everything goes well! This is my second blog post for my project Reproducible Evaluation of Multi-level Erasure Coding under the mentorship of John Bent, Anjus George, and Meng Wang. In summary, my project aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations. The details are in this proposal.

In the course of these few weeks, I’ve completed several tasks to achieve the aim of this project, including

Literature Review
Studying the Erasure Coding Simulator and Creating Reproducible Evaluations, with the following policies
- Clustered/Declustered Local-level SLEC
- Clustered/Declustered Network-level SLEC
- MLEC with C/C, C/D, D/C, D/D configuration

Literature Review

Prior to developing the simulator, my first step was to delve into the literature on distinct erasure coding policies. To understand a simulator for a complex erasure coding policy such as MLEC, I wanted to start from simpler EC policies and then extend my knowledge to more complex ones. Moreover, I also aimed to contrast the durability of MLEC with other comparable EC policies like LRC in my evaluations, making it vital to understand the implementation of these policies.

Over the first week, I read several papers on different chunk placement policies for erasure coding, including LRC (Local Reconstruction Codes), CL-LRC (Combined Locality for Local Reconstruction Codes), SODP (Single Overlap Declustered Parity), and MLEC (Multi-Level Erasure Coding). These papers offered a fundamental comprehension of each policy, their respective advantages and drawbacks, and their practical usage in production environments.

Simulator Reproduction

After gaining some understanding with the papers I read, I started to study the EC simulator by building the simulator myself. I got the MLEC simulator from the mentors. However, the simulator lacks documentation and guides, making it hard for others to reproduce evaluation results. The simulator is also complicated to understand, as it simulates various EC schemes, chunk placements, and rebuild policies, which results in 13,000 LOC. Therefore, my goal is to understand the design and implementation details of the simulator, after which I will create guides for reproducible evaluations.

In order to fully understand the simulator, the best way is to rebuild the simulator by myself. The simulator is designed to mimic disk failures over the span of a year under varying chunk placement policies. Once successfully rebuilt, the simulator will enable me to assess the durability of MLEC in relation to other widely-used chunk placement policies. I followed the given simulator and rewrote it on my own in Python.

Based on the skeleton of the given simulator, I first rebuilt a simple simulator that simulates SLEC (single-level erasure coding, in both local and network settings) with clustered parities. With the arguments given, the simulator can run an arbitrary number of iterations, each simulating disk failures over one year. The simulator then counts the iterations in which there is a data loss. The ratio of those failed iterations to the total number of executed iterations estimates the probability of data loss, and the durability of the erasure coding policy is one minus that ratio. This simulation allows us to evaluate the durability of SLEC, laying the foundations for later evaluation of MLEC.
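The iteration-counting idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the mentors' simulator: a single clustered stripe group, exponentially distributed failure times, a fixed repair time, and data loss whenever more than `parity` disks are down at once. All function and parameter names here are hypothetical.

```python
import random

HOURS_PER_YEAR = 365 * 24

def iteration_loses_data(n_disks, parity, afr, repair_hours, rng):
    """Simulate one year for a single clustered stripe group.

    Returns True if more than `parity` disks are ever down at the same time.
    `afr` is the annual failure rate of one disk (assumed exponential).
    """
    # Draw one failure time per disk; keep only failures inside the year.
    failures = sorted(
        t
        for t in (rng.expovariate(afr / HOURS_PER_YEAR) for _ in range(n_disks))
        if t < HOURS_PER_YEAR
    )
    for i, start in enumerate(failures):
        # Disks that failed less than `repair_hours` ago are still down.
        concurrent = sum(1 for t in failures[: i + 1] if start - t < repair_hours)
        if concurrent > parity:
            return True
    return False

def estimate_loss_probability(iterations, seed=0, **kwargs):
    """Fraction of simulated years in which data was lost."""
    rng = random.Random(seed)
    lost = sum(iteration_loses_data(rng=rng, **kwargs) for _ in range(iterations))
    return lost / iterations
```

For example, `1 - estimate_loss_probability(10_000, n_disks=10, parity=2, afr=0.02, repair_hours=48)` gives a rough durability estimate for this toy setting. Very high durabilities need far more iterations than this naive loop can afford.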

Next, I extended my simulator from local-level SLEC implementation by adding more policies. I began by introducing a network-level SLEC policy with clustered parities. This differs slightly from the local-level EC as it necessitates the consideration of factors like network bandwidth within the simulator.

In addition, I have delved deeper into simulating declustered parities and successfully discovered a method to simulate disk failures. Basically, the simulator generates failures within a one-year timeframe and subsequently repairs them using priority queues. The disks associated with stripes experiencing the most failures are given the highest repair priority. With this construction, the simulator is capable of simulating local-level declustered parities, with the ability to specify parameters.
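The repair-priority rule can be illustrated with Python's heapq module. This is a minimal sketch assuming each stripe is summarized only by its number of failed chunks; the function name and input format are hypothetical and not taken from the simulator.

```python
import heapq

def repair_order(stripe_failures):
    """Return stripe ids ordered so the most-damaged stripes are repaired first.

    `stripe_failures` maps a stripe id to its number of failed chunks.
    """
    # heapq is a min-heap, so negate the count to pop the largest first.
    heap = [
        (-count, stripe_id)
        for stripe_id, count in stripe_failures.items()
        if count > 0
    ]
    heapq.heapify(heap)
    order = []
    while heap:
        _, stripe_id = heapq.heappop(heap)
        order.append(stripe_id)
    return order
```

Stripes with equal failure counts come out in ascending id order here. A real simulator would also re-insert stripes as new failures arrive during the repair.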

Upon successfully simulating local-level declustered parities, the construction of the simulator for network level declustered parities was rather straightforward. I then validated it using the simulator and math models provided by the mentors. The results perfectly agree with each other, which proves the correctness of my understanding for the SLEC declustered placements. By implementing the simulator myself, I strengthened my understanding of erasure coding designs and the simulation techniques, which equipped me with a solid foundation to continue to reproduce MLEC simulations.

Based on the knowledge gained from implementing the SLEC simulators myself, I then reverse-engineered the MLEC simulator provided by the mentors from their MLEC paper. I chose to start from the simplest policy, which is clustered parities at both levels. After spending considerable time digging into the simulator source code, I was able to understand the simulation workflows, the different repair methods it implements, and the splitting method it uses to simulate high durabilities. I then revised my simulator based on my understanding. I also ran a few experiments using the same configuration setups as specified in the paper. The results agree well with those in the paper, which indicates that my reproduction was successful.

Technical Issues

In the process of building the MLEC simulator, I’ve encountered many issues, both conceptual and technical. The mentors have been super helpful and responsive throughout, so I was able to make steady progress.

Summary

Overall, I’ve rebuilt a Python simulator for various EC policies, and the simulator can successfully reproduce the results from the paper.

Next Steps

My next step is to package the simulator into a reproducible Trovi artifact, so others can reproduce evaluations of the performance and durability of various EC policies, in particular MLEC.

Mid Term Blog : Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions

Fri, 04 Aug 2023 00:00:00 +0000

Hey,

I am Shekhar and I am one of several students who are working on developing materials for reproducibility in machine learning education, under the mentorship of Fraida Fund. My Proposal aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. Our goal is to help students and researchers (1) understand some of the challenges they may face when trying to reproduce someone else’s published result, and (2) in their own publications, to specify the methodology so that the result will be more easily reproduced by others.

Motivation

My work is inspired by my participation in the 2022 Machine Learning Reproducibility Challenge, where I was reproducing a result related to bias in hate speech classifiers. The paper seemed at first to have complete methodology details. However, when I tried to implement their approach based on the description of the paper, I realized some important details were missing - for example, in the part where they replaced swear words in the text with other words having similar meaning. I wasn’t able to identify the exact list of swear words they used, or what approach they followed if the selected replacement was also a swear word. The choices I made when the authors’ approach was left ambiguous had a significant impact on the magnitude of the final result.

Milestones and Accomplishments

To inform researchers and students about this problem, I created a fictitious machine learning research paper and a sequence of accompanying Python notebooks that highlight various choices that can be made to fill in the gaps, and explore how these choices can impact the overall results of the research. Our “research paper” is about the impact of data augmentation on few-shot learning for intent classification. We implemented a basic data augmentation strategy with synonym replacement using the HWU64 dataset and a BERT classifier, and the results suggest that synonym replacement as a data augmentation technique leads to only minor improvement in accuracy.

In the fictitious paper, we left some of the methodology details ambiguous. When reproducing the results using the accompanying notebooks, the reader follows a “Choose Your Own Adventure” format, selecting a path through a tree in which each node represents an ambiguous methodology detail and branches out to the different choices that can be made at that point. The leaf nodes represent the final results, providing insight into the magnitude of the differences resulting from each selection. Some of the choices that the reader makes are:

what subset of the source dataset to use.
some of the details of data pre-processing.
some of the details of the synonym replacement data augmentation strategy.
some training hyperparameters and the details of the hyperparameter search.
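To get a feel for how quickly such a tree grows, the sketch below enumerates every root-to-leaf path for four choice points. The option names are placeholders invented for illustration; the actual notebooks define their own choices.

```python
from itertools import product

# Hypothetical choice points and options, one entry per ambiguous detail.
CHOICES = {
    "dataset_subset": ["all_intents", "single_domain"],
    "preprocessing": ["lowercase", "keep_case"],
    "augmentation": ["replace_one_word", "replace_many_words"],
    "hyperparameters": ["fixed", "grid_search"],
}

def enumerate_paths(choices):
    """Yield every root-to-leaf path as a dict of {choice point: option}."""
    names = list(choices)
    for combo in product(*(choices[name] for name in names)):
        yield dict(zip(names, combo))

paths = list(enumerate_paths(CHOICES))
print(len(paths), "distinct reproductions of the same paper")
```

Even with only two options per choice point, four ambiguous details already give 16 different "reproductions", each of which may report a different result.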

During the first phase of our project, we have implemented an initial draft of these notebooks, to explore various scenarios and see their impact on results. Next, we will further develop the interactive educational material around them.

Challenges

During the first half of the project, I faced two main challenges. First, I had to come up with a hypothetical research scenario that was realistic, yet easy for students without much expertise to understand. Attaining the right balance was essential to make it engaging and educational. The second challenge was to deliberately leave some details unclear in a realistic way while ensuring that the choices based on that ambiguity had a significant impact on the results. Fortunately, I had the guidance and support of my mentor, which allowed me to successfully tackle these challenges.

Throughout this project, I faced various challenges and obstacles, but it turned out to be an incredible learning experience. I had the opportunity to dive deep into the domains of few-shot learning and meta-learning, which were entirely new to me. Moreover, I was able to find ambiguous methodologies present in academic papers and explore diverse scenarios related to them. Looking ahead, I am eager to continue working on this project throughout the summer, as it promises further learning and personal growth.

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)

Thu, 03 Aug 2023 00:00:00 +0000

Introduction

As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, our goal is to characterize the tools in genomic workflows in terms of system metrics and data quality, in order to build machine learning models that predict the elapsed time of genomic workflows. While Shayantan (another contributor) did the analysis on data quality metrics, I contributed to the system metrics analysis. We are getting closer to that goal because we have managed to collect datasets and do some analysis.

Steps

In this project, we selected the DNA-Seq Pipeline as the workflow to be analyzed. This pipeline consists of four tools for processing single-end reads, namely BWA-mem, Samtool-view, Picard-SortSam, and Picard-MarkDuplicates. We executed each tool under various configurations and stored system metrics for each execution. This involves two steps:

Step 1: Building the tools execution environment.
Step 2: Developing a program to execute tools under a set of configurations and collect runtime parameters (e.g. CPU, RSS, VSZ, and IO) automatically.

Execution Environment

Tools are executed on Chameleon instances by submitting them using Slurm. The machine used for collecting system metrics is a Haswell instance of the Chameleon Texas server. This instance uses an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz with the following detailed specifications.

Number of CPUs	48
Number of threads per core	2
Number of cores per socket	12
Number of sockets	2

In this experiment, we use n+1 instances: n compute nodes and 1 master node. Each execution is done by submitting a job, which is a tool with a certain configuration, from the master node, and it is processed by one of the compute nodes. For the tool to be executed, we set up the master node as a common container shared over NFS. This common container stores the input files and the commands for executing tools so that all nodes can access them without having to download and install them.

Executing and Collecting System Metrics

Tools are executed in a range of configurations by varying parameters such as input size, number of allocated CPUs, allocated memory, and threads. For example, BWA-mem is run with 5 CPU allocations, 4 memory allocations, and 5 thread counts on 10 different input files, giving 5 x 4 x 5 x 10 = 1000 configuration combinations. Each configuration is executed 8 times, yielding 8000 data points. Configuration details are shown in the following table.

Tool	#repetitions	#files	#allocated CPU	#allocated memory	#threads	total
BWA-mem	8	10	2, 4, 8, 16, 32	8, 16, 32, 64	2, 4, 8, 16, 32	8000
Samtool-view	10	10	2, 4, 8, 16, 32	8, 16, 32, 64	-	2000
Picard-SortSam	10	10	2, 4, 8, 16, 32	8, 16, 32, 64	-	2000
Picard-MarkDuplicates	10	10	2, 4, 8, 16, 32	8, 16, 32, 64	-	2000
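The BWA-mem row of the table can be reproduced as a simple Cartesian product. This is a sketch of how the run list could be generated, not the actual submission script; the file names are placeholders for the 10 input files.

```python
from itertools import product

# Parameter values for BWA-mem, taken from the table above.
cpus = [2, 4, 8, 16, 32]
memory = [8, 16, 32, 64]
threads = [2, 4, 8, 16, 32]
files = [f"file_{i}" for i in range(10)]  # placeholder names for the inputs
repetitions = 8

# Every combination of CPU, memory, thread count, and input file.
configs = list(product(cpus, memory, threads, files))

# Each configuration is repeated, giving one tuple per job to submit.
runs = [(rep, *cfg) for rep in range(repetitions) for cfg in configs]

print(len(configs), "configurations,", len(runs), "runs")
```

The other three tools drop the thread dimension and use 10 repetitions, which gives 5 x 4 x 10 x 10 = 2000 runs each.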

Meanwhile, to run the tools, we use the following commands:

BWA-mem

$BWA mem -t $threads $REF_DIR/hg19.fa ${INPUT_DIR}/${sra_id}*.fastq > ${OUTPUT_DIR}/${sra_id}.sam

Samtool-view

$SAMTOOLS view $INPUT_DIR/${sra_id}.sam -Shb -o $OUTPUT_DIR/${sra_id}.bam

Picard-SortSam

java -jar $PICARD SortSam \
CREATE_INDEX=true \
INPUT=$INPUT_DIR/${sra_id}.bam \
OUTPUT=$OUTPUT_DIR/${sra_id}.bam \
SORT_ORDER=coordinate \
VALIDATION_STRINGENCY=STRICT

Picard-MarkDuplicates

java -jar $PICARD MarkDuplicates \
CREATE_INDEX=true \
INPUT=$INPUT_DIR/${sra_id}.bam \
OUTPUT=$OUTPUT_DIR/${sra_id}.bam \
METRICS_FILE=$OUTPUT_DIR/${sra_id}_rmd.txt \
VALIDATION_STRINGENCY=STRICT

In Slurm, each job has a job ID, and the scontrol listpids command shows the job ID to PID mapping. Using this, we can obtain system metrics for a job by reading the files under /proc/$PID. These expose the CPU usage, physical memory, virtual memory, read bytes, and write bytes of the process at a given moment. We record these features along with a timestamp at 1-second intervals throughout the execution.
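A minimal sketch of one such snapshot is shown below. It is not the project's collector: it assumes a Linux host, a PID already obtained (for example from scontrol listpids), and read permission on the process's /proc entries. The function names are hypothetical, while the file names and field names (VmRSS, VmSize, read_bytes, write_bytes, utime, stime) are those documented for the Linux proc filesystem.

```python
import time

def parse_key_values(text):
    """Parse 'Key: value' lines as found in /proc/<pid>/status and /proc/<pid>/io."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def parse_cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' before splitting the remaining fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])  # fields 14 (utime) and 15 (stime)

def read_text(path):
    with open(path) as handle:
        return handle.read()

def sample(pid):
    """Take one snapshot of CPU, memory, and I/O counters for a process."""
    status = parse_key_values(read_text(f"/proc/{pid}/status"))
    io = parse_key_values(read_text(f"/proc/{pid}/io"))
    return {
        "timestamp": time.time(),
        "cpu_ticks": parse_cpu_ticks(read_text(f"/proc/{pid}/stat")),
        "rss_kb": int(status["VmRSS"].split()[0]),
        "vsz_kb": int(status["VmSize"].split()[0]),
        "read_bytes": int(io["read_bytes"]),
        "write_bytes": int(io["write_bytes"]),
    }
```

Calling sample(pid) in a loop with time.sleep(1) until the process exits yields the per-second time series. CPU usage over an interval is the difference in cpu_ticks between two snapshots divided by the clock tick rate.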

Results

We have also calculated the correlation of each feature with the elapsed time. For BWA-mem, the features whose absolute correlation exceeds 0.5 are input size, average CPU usage, and output file size (in SAM format). For Samtool-view, they are input size, average CPU usage, and BAM output size. For Picard-SortSam, they are input size, write bytes, and BAM output size. For Picard-MarkDuplicates, they are input size and BAM output size.

Features\Tools	BWA-mem	Samtool-view	Picard-SortSam	Picard-MarkDuplicates
Allocated CPU	-0.145	-0.095	-0.179	-0.156
Allocated physical memory	-0.010	-0.038	-0.069	0.132
Input size	0.583	0.651	0.937	0.922
Threads	-0.072	-	-	-
Average CPU	-0.607	-0.567	-0.479	-0.480
Peak CPU	-0.175	0.174	-0.170	0.046
Average RSS	0.040	0.034	0.131	0.182
Peak RSS	0.068	0.046	0.314	0.175
Average VSZ	0.032	-0.349	-0.127	0.090
Peak VSZ	0.048	0.074	-0.130	0.088
Write bytes	0.037	0.190	0.735	0.244
Read bytes	-0.031	0.109	0.070	0.110
Output SAM size	0.589	-	-	-
Output BAM size	-	0.763	0.934	0.923
Output BAI size	-	-	0.400	0.399
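The 0.5 cutoff can be applied mechanically to any column of the table. The snippet below does so for the BWA-mem column, using the values exactly as reported above; the helper name is just for illustration.

```python
# BWA-mem column of the correlation table above.
bwa_mem_corr = {
    "Allocated CPU": -0.145,
    "Allocated physical memory": -0.010,
    "Input size": 0.583,
    "Threads": -0.072,
    "Average CPU": -0.607,
    "Peak CPU": -0.175,
    "Average RSS": 0.040,
    "Peak RSS": 0.068,
    "Average VSZ": 0.032,
    "Peak VSZ": 0.048,
    "Write bytes": 0.037,
    "Read bytes": -0.031,
    "Output SAM size": 0.589,
}

def strong_features(correlations, threshold=0.5):
    """Return the features whose absolute correlation exceeds the threshold."""
    return [name for name, r in correlations.items() if abs(r) > threshold]

print(strong_features(bwa_mem_corr))
```

Note that average CPU usage passes the cutoff with a negative sign: higher average CPU usage goes with shorter elapsed time.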

Future Works

For further work, we will analyze the correlation between elapsed time and the features whose absolute scores are below 0.5. These features may in fact be correlated with elapsed time but appear uncorrelated because the correlation was computed over the whole dataset at once. We therefore also need to calculate the feature correlations separately for the data grouped by input file. After that, we will build a machine learning model to predict elapsed time.
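The effect described here can be shown with a tiny synthetic example. The numbers below are made up purely for illustration and are not measurements from the experiments: within each input file, elapsed time falls as allocated CPU rises, but pooling a small and a large file hides most of that relationship.

```python
from collections import defaultdict
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def pearson_by_group(rows, feature, target="elapsed", group="input_file"):
    """Compute the feature/target correlation separately for each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group]].append((row[feature], row[target]))
    return {
        key: pearson([f for f, _ in pairs], [t for _, t in pairs])
        for key, pairs in grouped.items()
    }

# Synthetic rows: a small file "a" and a large file "b".
rows = [
    {"input_file": "a", "cpu": 2, "elapsed": 40},
    {"input_file": "a", "cpu": 4, "elapsed": 30},
    {"input_file": "a", "cpu": 8, "elapsed": 20},
    {"input_file": "b", "cpu": 2, "elapsed": 400},
    {"input_file": "b", "cpu": 4, "elapsed": 300},
    {"input_file": "b", "cpu": 8, "elapsed": 200},
]

overall = pearson([r["cpu"] for r in rows], [r["elapsed"] for r in rows])
per_file = pearson_by_group(rows, "cpu")
print(round(overall, 2), {k: round(v, 2) for k, v in per_file.items()})
```

In this toy data the pooled correlation stays below the 0.5 cutoff in absolute value, while the per-file correlations are strongly negative.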

[FLASHNET]: Leveraging ML-augmented I/O in Linux

Wed, 02 Aug 2023 00:00:00 +0000

Hello everyone,

This is my second blog post for SoR 2023. As you may recall from my initial blogpost, I am working on the Flashnet project under the mentorship of Haryadi S. Gunawi.

I’ve been assigned two major tasks under Flashnet:

Perform post-training quantization (PTQ) on existing Flashnet models
Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication

Task 1: Perform post-training quantization (PTQ) on existing Flashnet models

Since all of our models are currently built using the keras API, I decided to use the tensorflow-lite library, which supports direct conversion. Unfortunately, I encountered several persistent bugs while attempting to apply full-integer quantization on our binary neural network model:

Shape/dimension distortion:

Bug description: The quantized tflite model produces outputs of shape (8, 1), the same as the input shape, when the original model produces single-value outputs of shape (1, 1).

Status: Resolved

The original model has an input dimension of 8 for each input/x-value and there could be several inputs grouped in a single batch.
Input/batch size is also determined implicitly in the normalization layer of the original model
However, the “interpreter” in the quantized model runs inference one by one, and so batch size needs to be explicitly set to “1” i.e. a shape of single input, (1,8)
Doing so resolves the model distortion

Incorrect y-value range:

Bug description: There is no variation in the quantized model outputs (i.e. it produces the same value for each input row).

In the original model, each inference output is a floating point value between 0 and 1. Outputs also vary according to input. This output is rounded towards 0 or 1 using a 0.5 standard cutoff (i.e. x > 0.5 → x = 1). Since the quantized model condenses 32-bit floats into 8-bit integers, we should expect a similar variation in output values across an 8-bit integer range.

Printing the quantized model weights, I found signs that something like a weight burst (similar in effect to exploding or vanishing gradients) may occur during the quantization process, i.e. the weight values explode towards infinity or vanish to 0 and therefore cannot contribute anything meaningful. The likely consequence is that the inference output always equals the bias (since the Wx term in y = Wx + B gets zeroed out).
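To make the expected behaviour concrete, here is a small pure-Python sketch of affine 8-bit quantization and of what happens to a dense layer when its weights collapse to zero. It does not use tensorflow-lite or the Flashnet model; the scale and zero point are illustrative values for an output range of 0 to 1, and the layer is a hypothetical single-output layer with 8 inputs.

```python
def quantize(x, scale, zero_point):
    """Map a float to int8 with the affine scheme q = round(x / scale) + zero_point."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))  # clamp to the int8 range

def dequantize(q, scale, zero_point):
    """Recover the approximate float: x ≈ scale * (q - zero_point)."""
    return scale * (q - zero_point)

def dense(weights, bias, inputs):
    """Single-output dense layer y = W·x + b."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Illustrative parameters that spread [0, 1] over the full int8 range.
SCALE, ZERO_POINT = 1 / 255, -128

# A healthy model: different float outputs map to different int8 values.
healthy = [quantize(y, SCALE, ZERO_POINT) for y in (0.0, 0.2, 0.9, 1.0)]

# Collapsed weights: the W·x term vanishes and every input yields the bias.
collapsed = [dense([0.0] * 8, 0.3, [float(i)] * 8) for i in range(4)]
print(healthy, collapsed)
```

If the weights really do vanish during conversion, every row produces the same output regardless of the input, which matches the symptom described above.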

Status: Open

Multiple potential causes were considered, without any success:
- Improper quantization of inputs/outputs
- Insufficient training time/number of epochs
- Incompatible model type/structure
- Incompatible tensorflow-lite version
At this point, I concluded that tensorflow-lite is too bug-ridden for further attempts with the library to be worthwhile.

Task 2: Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication

rocksdb is an embedded database for key-value data. Our Flashnet team is currently implementing a Flashnet client in ceph, and so they have tasked me to explore an implementation in rocksdb as an alternative.

I’ve started on this segment of the project only recently, so my current work is still in its formative stages. As of writing, I’ve been primarily concerned with setting up the software (on a new Chameleon instance), running toy db examples, and educating myself on basic terminology and the rocksdb documentation.

Future work

I expect to continue working on Task 1 (do quantization from ground-up or use a different library) and Task 2 as detailed above. I also hope to implement a transformer-based model to supplement our existing suite of Flashnet models.

[Midterm] FlashNet: Towards Reproducible Continual Learning for Storage System

Wed, 02 Aug 2023 00:00:00 +0000

Mid-Term Report

As part of the FlashNet project, my proposal, under the mentorship of Haryadi S. Gunawi and Daniar Kurniawan, aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques. We focus on predicting I/O latency to decide whether or not an I/O should be failed over to another SSD. The following sections describe the work, the major milestones achieved, the accomplishments, and the challenges during the first half of the summer.

Work Description, Major Milestones Achieved, and Accomplishments

For the first half of the summer, I implemented the continual learning pipeline for the model and several drift detection algorithms. After that, I evaluated their effectiveness. Below is a detailed description of each subtask.

1. Continual Learning pipeline

Firstly, I designed the pipeline. As shown on the graph below, the pipeline contains 4 main modules, namely initial train, retrain, inference, and monitor.

The modules were first developed in Python using a linear regression model. It turned out that linear regression was not accurate enough. To overcome this problem, I introduced more models and learning tasks.

Hence, the final implementation has random forest and neural network models for both regression and classification tasks. These models outperform linear regression. The pipeline has also been optimized.

2. Drift detection algorithms

Sometimes, the model’s performance may degrade when recent I/Os have different characteristics from the data it was trained on, so a retraining process is needed and something has to trigger it. The trigger can be as simple as a periodic schedule, or it can use a technique called drift detection. Retraining too often incurs significant computational overhead, while retraining too seldom lets performance degrade. Hence, we should build a good and reliable drift detection algorithm that can sense the presence of concept and covariate drift in recent data.

To build a good algorithm, I first used heuristics derived from an understanding of how latency and throughput change over time. However, the results were not very good, so I have been relying on statistical tests as the drift detector. So far, the Kolmogorov-Smirnov test (commonly known as the KS test) is the best drift detector.
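As an illustration of how a KS-based trigger can work, the sketch below compares a reference window of a feature (for example, latencies seen at training time) with a recent window. It is a simplified stand-in, not the pipeline's implementation: the function names are hypothetical, and the decision uses the standard large-sample critical value rather than an exact p-value.

```python
from bisect import bisect_right
from math import log, sqrt

def ks_statistic(reference, recent):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, rec = sorted(reference), sorted(recent)
    gap = 0.0
    for value in ref + rec:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_rec = bisect_right(rec, value) / len(rec)
        gap = max(gap, abs(cdf_ref - cdf_rec))
    return gap

def drift_detected(reference, recent, alpha=0.05):
    """Flag drift when the statistic exceeds the large-sample critical value."""
    n, m = len(reference), len(recent)
    critical = sqrt(-log(alpha / 2) / 2) * sqrt((n + m) / (n * m))
    return ks_statistic(reference, recent) > critical
```

A monitor could call drift_detected on each new window and trigger retraining only when it returns True. In practice, scipy.stats.ks_2samp offers the same test with a proper p-value.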

3. Evaluation

The featured image in the headline of this blog, also shown below, is the result of the evaluation. I evaluated the models and drift detection algorithms using Cumulative Distribution Function (CDF) graph, to see if any tail cut is made.
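For readers unfamiliar with this kind of plot, the sketch below shows how the points of an empirical latency CDF and a tail percentile can be computed. The helper names are hypothetical and the nearest-rank percentile is one of several common definitions.

```python
def empirical_cdf(latencies):
    """Return (value, cumulative fraction) points of the empirical CDF."""
    ordered = sorted(latencies)
    n = len(ordered)
    return [(value, (i + 1) / n) for i, value in enumerate(ordered)]

def percentile(latencies, p):
    """Nearest-rank percentile for an integer p between 1 and 100.

    Returns the smallest value with at least p% of the samples at or below it.
    """
    ordered = sorted(latencies)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]
```

A tail cut shows up as a lower high percentile (for example the 99th) for the latencies measured with the model enabled than for the baseline latencies, while the bulk of the two CDFs stays close.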

Challenges

During the implementation, I encountered several challenges as follows,

1. Choice of Model

Since we want to integrate the pipeline into real storage systems, we had to be mindful of model choice. Classical machine learning models are lighter than deep learning models. However, deep learning models offer higher accuracy and are therefore preferable in that respect. Hence, I implemented both and examined their effectiveness.

2. Choice of Drift Detection Algorithm

The continual learning technique chosen for this task may require the model to be retrained, since the workload may change over time. The implication is that we need a condition that triggers retraining. As training a model is costly, we need to retrain it judiciously. Thus, we use a drift detection algorithm to detect whether or not retraining is needed.

There are two types of drift detection algorithms, namely statistical tests and model-based drift detection. To minimize overhead, we picked statistical tests. Various algorithms are available, and I picked 5 of them to implement and evaluate.

Plan

For the second half of the summer, I am going to study Riak and create a Chameleon Trovi artifact for deploying Riak in a cluster.

Introducing Levels of Reproduction and Replication in Machine Learning

Wed, 02 Aug 2023 00:00:00 +0000

Hello again,

I am Mohamed Saeed and this is my second blog post for the 2023 Summer of Reproducibility Fellowship. As you may recall from my previous post, I am working on the Using Reproducibility in Machine Learning Education project with Fraida Fund as my mentor. My goal is to create interactive open educational resources that teach reproducibility and reproducible research in machine learning (ML) as I proposed.

In this post, I will share with you some of the progress I have made so far, as well as some of the challenges I have faced and how I overcame them. I will also highlight some of the specific accomplishments that I am proud of and what I plan to do next.

Reproducing “On Warm Starting Neural Network Training”

This material is a reproduction of the paper “On Warm Starting Neural Network Training” by Jordan T. Ash and Ryan P. Adams (2020). This paper investigates the effect of warm-starting neural networks, which means using the weights of previous models trained on a subset of the data, to train on a new dataset that has more data.

The figure illustrates how the new model uses the weights from the previous model as its initial values. This allows the new model to train on both the “Original” data, which it has already seen, and the new data, which it has not encountered before. In contrast, the randomly initialized model treats the entire data as unfamiliar and starts from scratch.

The paper also shows that this method can lead to lower test accuracy than starting from scratch with random weights, even though the training loss is similar. The paper also proposes a simple way to improve the test accuracy of warm-starting by adding some noise to the previous weights.
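The noise-based remedy is known in the paper as "shrink and perturb": before training on the enlarged dataset, the previous weights are scaled towards zero and a small amount of Gaussian noise is added. The sketch below applies that update to a flat list of weights. The default shrink factor and noise level are illustrative values only, not the paper's settings, and a real implementation would operate on framework tensors.

```python
import random

def shrink_and_perturb(weights, shrink=0.5, noise_std=0.01, seed=None):
    """Scale each weight towards zero, then add zero-mean Gaussian noise."""
    rng = random.Random(seed)
    return [shrink * w + rng.gauss(0.0, noise_std) for w in weights]

previous = [0.8, -1.2, 0.05]
warm_start = shrink_and_perturb(previous, seed=0)
print(warm_start)
```

Setting shrink to 1 and noise_std to 0 recovers plain warm-starting, while shrink 0 amounts to a fresh random initialization at the scale of the noise.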

To reproduce this paper, I followed a systematic approach that ensured reliable results. This approach involved:

Reading the paper and its main claims carefully.
Finding out what resources the authors shared, such as code, data, and models.
Looking for additional materials online that could help me save time and fill in the gaps left by the authors.
Setting up the environment and dependencies needed to run the code smoothly.
Writing code and updating any outdated functions that might cause errors.
Running the code and verifying that it matched the results reported in the paper.
Analyzing and interpreting the results and comparing them with the paper’s findings.

I used Chameleon as my platform for running and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental platform that supports computer science systems research. It allows users to create and share Jupyter notebooks that can run Python code on Chameleon’s cloud servers.

I created a GitHub repository where you can find everything related to my reproduction work in the form of interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research.

Challenges

Reproducing a paper is not an easy task. I faced several challenges along the way. One of the biggest challenges was the lack of code and pretrained models from the authors. This is a common problem for many reproducibility projects. Fortunately, I found a previous reproducibility publication for this paper in the ReScience journal. I used some of their code and added some new functions and modifications to match the original paper’s descriptions. I also encountered other challenges, which I discuss in the notebooks along with the solutions I applied.

How to use this material?

This material is a series of notebooks that walk you through the paper and its claims, experiments, and results. You will learn how to analyze, explain, and validate the authors’ claims. To get started, I suggest you skim the original paper briefly to get the main idea and the public information. This will help you understand how the authors could have been more clear and transparent in some sections. I have given clear instructions and explanations in the notebooks, as well as how I dealt with the missing components. You can use this material for self-learning or as an assignment by hiding the final explanation notebook.

Conclusion and Future Work

In this blog post, I have shared with you some of my work on reproducing warm starting neural network training. I have learned a lot from this experience and gained a deeper understanding of reproducibility and reproducible research principles in ML.

I am very happy with what I have achieved so far, but I still have more work to do. I am working on reproducing the Vision Transformer: An Image is Worth 16x16 Words paper by Alexey Dosovitskiy et al. This time my approach is to use the available pretrained models provided by the authors to verify the claims made in the paper. However, there are some challenges that I face in reproducing the paper. For example, some of the datasets and code that the authors used are not publicly available, which makes it hard to replicate their experiments exactly. These challenges are common in reproducing research papers, especially in computer vision. Therefore, it is important to learn how to deal with them and find ways to validate some of the claims.

I hope you enjoyed reading this blog post and found it informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!

Midterm Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings

Wed, 02 Aug 2023 00:00:00 +0000

As part of the Measuring Research Prototypes under Unreported Settings project, my proposal, under the mentorship of Yang Wang and Miao YU, aims to understand the impact of missing settings in artifact evaluation.

Based on our project proposal, the first step is to test the benchmark application on the targeted systems. We picked the open-source database system PostgreSQL as the target system and ran the TPC-C benchmark on it under default settings. We measured benchmark throughput with the scale factor set to 10 while incrementing the number of worker terminals. All database server settings were left at their default values, and we take these results as the baseline. To test more parameters and system settings, we need to choose a combination of parameters that yields optimal throughput.

We use an online tool, PGTune, which suggests PostgreSQL configuration values based on the hardware. We selected shared_buffers, min_wal_size/max_wal_size, and effective_cache_size as the first set of parameters to measure. They are related to memory consumption, checkpoints, and planner cost in the database server. According to the official PostgreSQL documentation, shared_buffers sets the amount of memory the database server uses for shared memory buffers. max_wal_size sets the maximum size to let the WAL grow during automatic checkpoints. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time. effective_cache_size sets the planner’s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.

We conducted the experiments by increasing each parameter in increments and comparing the resulting throughput across settings and against the baseline. Based on the results, the throughput of the benchmark with larger shared_buffers and max_wal_size is up to 1.5x that under default settings. The improvement from tuning max_wal_size is larger than that from tuning shared_buffers. Increasing effective_cache_size beyond its default value has no effect for this benchmark workload.
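A parameter sweep of this kind can be sketched as a small grid generator. The candidate values below are assumptions for illustration (the post does not list the values that were tested), and the helper names `sweep` and `to_sql` are hypothetical. The sketch only renders `ALTER SYSTEM SET` statements; applying them, restarting the server, and running TPC-C are left to the experiment harness.

```python
from itertools import product

# Hypothetical candidate values; the first entry of each list is the
# PostgreSQL default, the rest are made up for this example.
GRID = {
    "shared_buffers": ["128MB", "1GB", "4GB"],
    "max_wal_size": ["1GB", "4GB", "16GB"],
}

def sweep(grid):
    """Yield one {parameter: value} dict per combination in the grid."""
    names = list(grid)
    for values in product(*(grid[n] for n in names)):
        yield dict(zip(names, values))

def to_sql(config):
    """Render a configuration as ALTER SYSTEM statements."""
    return [f"ALTER SYSTEM SET {name} = '{value}';" for name, value in config.items()]

for config in sweep(GRID):
    print("\n".join(to_sql(config)))
    print("-- restart the server, rerun the benchmark, record throughput")
```

Sweeping the full grid, rather than one parameter at a time, helps because shared_buffers and max_wal_size interact, as noted above.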

There are more values of the above-mentioned parameters to test. Next, I will test those parameters over further increments of their values. Furthermore, we need to choose a combination of more parameters to get optimal throughput. Also, according to its description, the tuning tool may not generate optimal values for very high memory systems. This requires that we test more parameters and values to find better performance.

ScaleBugs: Reproducible Scalability Bugs

Wed, 02 Aug 2023 00:00:00 +0000

Introduction

As part of the Scalebugs Project, we have worked on building a dataset of reproducible scalability bugs. To achieve this, we go through existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. Workloads are designed to reproduce these scalability bugs by triggering some functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance.

So far we have worked on packaging the buggy and fixed versions of scalability systems, a runtime environment that ensures reproducibility, and the workloads used to trigger the symptoms of the bug inside docker containers. By packaging these versions together, we are simplifying the process of deployment and testing. This enables us to switch between different versions efficiently, aiding in the identification and comparison of the bug’s behavior. For each scalability system, we have carefully built a runtime environment that is consistent and reproducible. This approach ensures that each time we run tests or investigations, the conditions remain identical.

New Terms

In order to make sense of the various bug reports, we had to learn some terminology associated with scalable systems:

Clusters: Clusters are groups of related or connected items, often found in various fields such as computer science, data analysis, or even social sciences. For example, in data analysis, clusters might represent groups of data points with similar characteristics, making it easier to understand patterns or trends in the data. In distributed systems such as the ones we study, a cluster is a group of nodes (machines or processes) that work together as a single system.

Cluster Membership: Cluster membership refers to the process of determining which items or entities belong to a particular cluster. This task can be done based on various criteria, such as similarity in attributes, spatial proximity, or shared characteristics. In a distributed system, it means keeping track of which nodes are currently part of the cluster as nodes join, leave, or fail.

Locks: In computer programming, locks are mechanisms used to manage access to shared resources, such as files, data structures, or hardware devices. When multiple processes or threads need to access a shared resource simultaneously, locks ensure that only one process or thread can access it at a time, preventing data corruption or conflicts.

Lock Contentions: Lock contention occurs when multiple processes or threads attempt to acquire the same lock simultaneously. When this happens, one process or thread must wait until the lock becomes available, leading to potential delays and reduced performance.
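The two definitions above can be illustrated with a small sketch. It is a generic example and is not taken from any of the systems studied here: four threads increment one shared counter, and each increment has to take the same lock, so the threads spend part of their time waiting for one another.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    """Increment the shared counter n times, taking the lock each time."""
    global counter
    for _ in range(n):
        with lock:  # other threads block here while one thread holds the lock
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

The lock keeps the final count correct, but it also serializes the threads. The more work is done while the lock is held, the longer every other thread waits, which is the pattern behind several of the bugs described below.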

Critical Paths: In project management or process analysis, a critical path is the longest chain of dependent tasks that determines the overall duration of the project or process. Any delay in tasks along the critical path will directly impact the project’s completion time.

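Tokens: Tokens can have various meanings depending on the context. In computer programming, tokens are the smallest units of source code recognized by a compiler or interpreter. In cryptography, tokens can represent digital certificates or authentication data used for secure communication. In Cassandra, which several of the bugs below concern, a token is a position on the hash ring that determines which range of data a node is responsible for.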

Nodes: In the context of network theory or graph theory, nodes are individual points or entities that form a network or graph. In a computer network, nodes can be devices like computers or routers, and in a social network, nodes can represent individuals or entities.

Peers: Peers are entities within a network that have the same status or capabilities. In peer-to-peer networks, each node can act as both a client and a server, enabling direct communication between nodes without relying on a central server.

Gossipers, Gossip Protocol: In distributed systems, gossipers are nodes that share information with each other using the gossip protocol. The gossip protocol involves randomly selecting peers and exchanging information in a decentralized manner, allowing information to spread quickly across the network.
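A toy simulation shows how quickly gossip spreads information. It rests on simplifying assumptions that are not taken from any real system: node 0 starts with the information, every informed node pushes it to one uniformly random peer per round, and there are no failures or delays. The function name `gossip_rounds` is hypothetical.

```python
import random

def gossip_rounds(n_nodes, seed=0, max_rounds=1000):
    """Simulate push gossip; return (rounds used, set of informed nodes)."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the new information
    rounds = 0
    while len(informed) < n_nodes and rounds < max_rounds:
        # Every informed node pushes the information to one random peer.
        targets = {rng.randrange(n_nodes) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds, informed

rounds, informed = gossip_rounds(32)
print(f"{len(informed)} of 32 nodes informed after {rounds} rounds")
```

Because the informed set can at most double each round, 32 nodes need at least 5 rounds. In practice the number of rounds grows roughly logarithmically with cluster size, which is why gossip scales well as long as each message is processed quickly.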

Threads: Threads are the smallest units of execution within a process in computer programming. Multiple threads can run concurrently within a single process, enabling multitasking and parallel processing. Threads can share the same resources within the process, making them more lightweight than separate processes. However, proper synchronization is essential to prevent data corruption or conflicts when multiple threads access shared resources.

Flush and Writes Contention: This refers to a situation where simultaneous operations involving data flushing (saving data to a storage medium) and data writing (updating or adding data) are causing conflicts or delays. This contention can arise when multiple processes or threads attempt to perform these operations concurrently, leading to performance bottlenecks or potential data integrity issues.

Accomplishments

We have been able to build docker containers for the following scalability bugs:

IGNITE 12087

This bug stems from the resolution of the IGNITE-5227 issue (another bug), which led to a significant decline in the performance of a particular operation. Before IGNITE-5227 was addressed, inserting 30,000 entries completed in roughly 1 second. After the resolution, the same insertion of 30,000 entries took approximately 130 seconds, a slowdown of more than 100 times.

CASSANDRA 14660

This bug is related to how clusters work together and how a lock is causing conflicts with the critical path. The issue arises from a method call that uses O(Peers * Tokens) resources while contending for a lock, which is causing problems in the write path. The lock is used to protect cached tokens that are essential for determining the correct replicas. The lock is implemented as a synchronized block in the TokenMetadata class.

How was this fixed?

It was fixed by reducing the complexity of the operation to O(Peers) taking advantage of some properties of the token list and the data structure.

CASSANDRA 12281

This bug is also related to how clusters work together and a lock conflict. The issue arises when a specific method performs a large amount of work (O(Tokens^2)) while contending for a read lock. As reported, a cluster with around 300 nodes has around 300 * 256 tokens (assuming the default number of tokens), and a new member reportedly takes more than 30 minutes to join. Because of this long execution time, the lock delays every gossip message, so the node never becomes active.
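To get a feel for the scale reported above, the arithmetic can be worked through directly. This is a back-of-the-envelope sketch only: it assumes one unit of work per pair of tokens, which the bug report does not state.

```python
nodes = 300
tokens_per_node = 256  # default number of tokens assumed in the bug report
total_tokens = nodes * tokens_per_node
quadratic_steps = total_tokens ** 2  # one step per token pair (assumption)

print(total_tokens)      # 76800
print(quadratic_steps)   # 5898240000
```

Under that assumption, a quadratic method does several billion steps per call on a 300-node cluster, all of it while other threads wait on the same lock.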

How was this fixed?

The granularity of the lock is decreased, meaning that the expensive function calls now do not take the problematic read lock and simply use a synchronized block, synchronizing on a specific field, that does the job much better.

HA16850

This is a bug related to obtaining thread information in the JvmMetrics package. The original buggy version used MXBeans to obtain thread information. The call uses an underlying native implementation that holds a lock on threads, preventing thread termination or creation. This means that the more threads we have to obtain information for, the longer the function call holds the lock. The result is that the execution time scales with the number of active threads, O(threads).

How was this fixed?

Developers utilized a ThreadGroup to keep track of obtaining metrics for threads. The result is that there is no lock held for every thread.

CA13923

This issue revolves around conflicts between the “flush” and “writes” processes. The main problem is that during the “flush” process, a resource-intensive function called “getAddressRanges” is invoked. This function has a high computational cost and its complexity is O(Tokens^2). In other words, the time it takes to complete this function grows quickly as the number of “tokens” increases. This situation is causing challenges and delays in the overall process.

How was this fixed?

This function call affected many paths, and the developers made sure that getAddressRanges is no longer called in critical paths.

Challenges

Demanding Memory Requirements: Running certain builds consumes a significant amount of memory. This places a strain on system resources and can impact the overall performance and stability of the process.

Little Issues Impacting Execution: Often, seemingly minor details can obstruct the successful execution of a build. Resolving such issues requires thorough investigation and extensive research into similar problems faced by others in the past.

Complexities of Scalability Bugs: Identifying the underlying causes of scalability-related bugs is intricate. These bugs exhibit unique characteristics that can complicate the process of pinpointing and comprehending their root origins.

What is Docker? (For those who don’t know about it)

Docker is a platform that facilitates the containerization of applications, leading to consistent and efficient deployment across diverse environments. Its benefits include portability, resource efficiency, isolation, and rapid development cycles. DockerHub complements Docker by providing a centralized hub for sharing and accessing container images, fostering collaboration and ease of use within the Docker ecosystem.

More about Docker: https://docs.docker.com/get-started/overview/

Mid-term blog post for Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery

Tue, 01 Aug 2023 00:00:00 +0000

Hello!

I am Srishti Jaiswal and this is my second blog post for the 2023 Summer of Reproducibility Fellowship.

Introduction

As I reach the halfway mark of my internship journey, I have had the incredible opportunity to work on a project that revolves around reproducing an adaptive video research result using cloud-based experimentation. This blog post delves into my exciting work so far, the significant milestones achieved, specific accomplishments to celebrate, and the challenges overcome. Utilizing CloudLab and FABRIC, I embarked on a journey to reproduce essential figures from the research paper Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming, ensure Python2 and Python3 compatibility and incorporate an Estimated Download Rate column in the log file produced by the video client. Let’s explore the details of this captivating internship experience.

Major Milestones Reached

Here are the milestones we have reached so far:

Familiar with CloudLab and Fabric Testbeds: I learned how to run an adaptive video experiment, which is the jumping-off point for my project, on the CloudLab and FABRIC platforms.
Python2 and Python3 Compatibility: My first task was to port an existing open-source code base developed for Python2 (which is no longer supported) so that it can run in Python3. The code now runs successfully in both versions for all the policies in the existing code base, i.e. Basic, Netflix and Sara. This fixed issue #1.
Estimated Download Rate for Basic Policy: To make it easier for users to understand and visualize how the adaptive video policy works, I added an additional metric, “Estimated Download Rate”, to the output file produced by the adaptive video client.
Graphing Buffer Occupancy and Estimated Download Rate: I extended the existing experiment to show two additional visualizations that are important for understanding how the adaptive video client works: buffer occupancy vs time and estimated download rate vs time.
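The post does not say how the “Estimated Download Rate” column is computed, so the following is only a sketch of one plausible definition: the size of a downloaded segment divided by the time it took to download. The function name, the units, and the sample values are all assumptions for illustration, not the video client's actual code.

```python
def estimated_download_rate_kbps(segment_bytes, download_seconds):
    """Rate of one segment download in kilobits per second."""
    if download_seconds <= 0:
        raise ValueError("download_seconds must be positive")
    return segment_bytes * 8 / 1000 / download_seconds

# Hypothetical log entries: (segment size in bytes, download time in seconds).
segments = [(500_000, 2.0), (750_000, 1.5), (250_000, 4.0)]
for size, seconds in segments:
    print(f"{estimated_download_rate_kbps(size, seconds):.1f} kbps")
```

Plotting a value like this against time, next to buffer occupancy, makes it easier to see why the client switches between video bitrates.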

Overcoming Challenges

I encountered several challenges throughout this project, especially as it was my first time working independently on a research paper as a third-year engineering student. However, with my mentor’s guidance and support, I persevered and learned to tackle each obstacle with determination.

One significant challenge was porting the entire code from Python2 to Python3. This transition resulted in numerous errors, and I often found it challenging to pinpoint where the mistakes occurred. To overcome this, I adopted a step-by-step approach, fixing errors one by one and verifying them using Python2 for comparison.

Understanding the complex codebase was another hurdle that led to moments of feeling stuck in an infinite loop. But every time I faced such situations, I sought my mentor’s advice, and together, we made strategic changes to overcome these challenges.

I am immensely grateful for my mentor’s expertise and support throughout this internship. Her guidance played a crucial role in helping me navigate through the challenges and grow both professionally and personally. I eagerly look forward to the rest of the journey, knowing that I can continue making meaningful contributions to this research project with her inspiring mentorship.

Future Prospects

As the second half of my internship approaches, I am eager to further refine and expand our experimentation. Our main aim is to reproduce the existing work and provide a clear guide for other students to do the same. For this, I have to create a framework that helps them improve and build upon this work.

I hope you enjoyed reading this blog post. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)

Tue, 01 Aug 2023 00:00:00 +0000

We are currently midway into the OSRE 2023 program and the following post lists the progress that I have made on the project so far. As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, our overall goal was to enumerate the effect of sequence data quality on execution times. Towards that end, we decided to first identify suitable datasets from the two commonly available -omics data modalities: transcriptomics and genomics. Albrecht et al. [1] developed seqQscorer to automate the quality control step of NGS data analysis through predictive modeling. They have also published the list of ENCODE datasets used for training the models. The quality label is 0 for released files and 1 for revoked files. Based on the guidelines set forth by ENCODE’s Data Coordination Centre (DCC), scientists manually annotated the data comprehensively, and the resulting quality variable “status” was published to serve as an indication of the quality of the data. The following steps outline the process of generating the data table for building the machine learning models.

Step 1: Programmatically accessed 86 (34 released; 34 revoked) RNA-seq files from the ENCODE database. All the FASTQ files were single-end.
Step 2: Programmatically accessed 288 (144 released; 144 revoked) DNA-seq files from the ENCODE database. All the FASTQ files were paired-end.
Step 3: Implemented the alignment step using the STAR aligner for RNA-seq and the BWA aligner for DNA-seq. The resulting outputs contained the alignment times for both the “revoked” and “released” files.
Step 4: Ran statistical tests to determine whether there are any significant differences in the runtimes of the two types of files.
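Step 4 can be sketched with a simple permutation test on the difference of mean runtimes. The post does not name the statistical test that was used, so this is a stand-in chosen because it needs only the standard library. The runtimes below are made up for illustration and are not results from the project.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)

# Made-up alignment times in seconds, for illustration only.
released = [10.0, 11.0, 12.0, 10.0, 11.0]
revoked = [20.0, 21.0, 22.0, 20.0, 21.0]
print(permutation_test(released, revoked))
```

A small p-value suggests that the runtime difference between the two groups is unlikely to be due to chance. With real data, a rank-based test may be a better fit if the runtimes are heavily skewed.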

Currently I am running the FastQC tool to extract data quality metrics for the same set of files as discussed above. Once I have collected those metrics, I can start building regression models to determine whether there is any significant impact of data quality on execution time. The first step toward the execution of a typical genomic analysis workflow is quality control of the raw data, a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Through our analysis we aim to develop a reproducible ML model that will give the user an estimate of the runtime based on the raw FASTQ file as input.

References

[1] Albrecht, S., Sprang, M., Andrade-Navarro, M.A. et al. seqQscorer: automated quality control of next-generation sequencing data using machine learning. Genome Biol 22, 75 (2021). https://doi.org/10.1186/s13059-021-02294-2

[Mid-term] Capturing provenance into Data Science/Machine Learning workflows

Mon, 31 Jul 2023 00:00:00 +0000

This post describes our midterm work status and some of the achievements we have made so far in the project for the noWorkflow package.

The initial weeks

I started doing a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms. It was a new subject to me, and I aimed to build a more robust theoretical background in the field. Meanwhile, I took notes in this series of posts.

Then, as planned, I integrated with the current noWorkflow supporters in order to get a broader view of the project and their contributions. Additionally, Juliana Freire, João Felipe Pimentel, and I set up a weekly one-hour schedule to keep track of my activities.

Brainstormed opportunities

At the beginning of June, we also met with other project supporters to brainstorm about our initial proposal. From this meeting, we came up with a plan for how to technically approach a new noWorkflow feature for Data Science and Machine Learning experiment management.

In this brainstorm, we agreed that Jupyter Notebooks are, by far, the most frequent setup in DS/ML computational experiments. They established themselves as the fundamental artifact by embedding code and text and enabling execution and visualization. Entire experiments are created and kept in Jupyter notebooks until they are sent to production. The opportunity at hand is to integrate noWorkflow with Jupyter Notebooks. Our mid-term goal was therefore adapted from the original plan of only selecting and executing a prototypical ML experiment: we added the goal of paving the way for a tagging feature for Notebook cells.

More specifically, DS/ML experimental workflows usually have well-defined stages composed of data reading, feature engineering, model scoring, and metrics evaluation. In our dream space, the user would tag a cell in their experiment, enabling the capture of the tagged metadata into a database. This step integrates the ultimate goal of facilitating comparisons, management, and even causal inference across different trials of a DS/ML experiment.

Current deliverables

So, based on our plans, we created a separate table to store the metadata from cell tagging. This table stores the cell hash codes and information to match the code executed within a cell. As a result, we can store tags and the activation ids of the cells, enabling us to identify a cell containing a given stage in a DS/ML experiment.

The second feature implemented was tagging a specific variable. In the same way as for a cell, it is now possible to stamp a given variable with a tag, keeping its name, id, and received value in this separate table.
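The idea of a separate tag table can be sketched with an in-memory SQLite database. The schema below is hypothetical and is not noWorkflow's actual table definition: the table name, the column names, and the sample rows are all assumptions made to show how cell tags and variable tags could live side by side and be queried across trials.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; not noWorkflow's actual table definition.
conn.execute(
    """
    CREATE TABLE tag (
        id INTEGER PRIMARY KEY,
        trial_id TEXT NOT NULL,
        kind TEXT NOT NULL CHECK (kind IN ('cell', 'variable')),
        tag TEXT NOT NULL,
        cell_hash TEXT,
        activation_id INTEGER,
        variable_name TEXT,
        variable_value TEXT
    )
    """
)
# Made-up rows: one tagged cell and one tagged variable per trial.
rows = [
    ("trial-1", "cell", "feature_engineering", "a1b2c3", 12, None, None),
    ("trial-1", "variable", "metric", None, 30, "rmse", "0.42"),
    ("trial-2", "variable", "metric", None, 31, "rmse", "0.39"),
]
conn.executemany(
    "INSERT INTO tag (trial_id, kind, tag, cell_hash, activation_id,"
    " variable_name, variable_value) VALUES (?, ?, ?, ?, ?, ?, ?)",
    rows,
)
# Compare the same tagged variable across trials.
metrics = conn.execute(
    "SELECT trial_id, variable_value FROM tag"
    " WHERE kind = 'variable' AND tag = 'metric' ORDER BY trial_id"
).fetchall()
print(metrics)
```

Keeping tags in their own table means the comparison query does not need to touch the rest of the provenance data.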

Finally, we worked on displaying the dependencies of a given variable. In this case, by tagging a given variable, we can display the other variables, values, and cells activated in its construction. Then, we can visualize the dependencies that contributed to its final value.

For an overview of current developments, please refer to my fork of the main project.

Challenges

During this period, we had to make choices along the way. For instance, capturing the provenance of cells through tags is a different solution than tagging code chunks in scripts. In this case, we decided to stick with tagging Notebook cells at this moment. We also opted to start storing the metadata to enable comparisons between trials rather than focus on a sophisticated graphic and user-friendly cell tagging system. We also opted to keep this metadata info stored in a separate table in the database.

Next steps

In the second half of the summer, our goal is to integrate these features in order to proceed with comparisons among experiments. Such comparisons would use the tagged variables as the hyperparameters of DS/ML experiments or key variables to assess the experiments, such as errors or scores. As a result, we will be able to compare the results of two trials more accurately and in an easily reproducible way.

Improving Video Applications' Accuracy by Enabling The Use of Concierge

Mon, 31 Jul 2023 00:00:00 +0000

Introduction

Hello, it’s me again, Faishal, a SoR project contributor for the edgebench project. For the past two months, my mentors and I have been working on improving the performance of our system. In this report, I would like to share with you what we have been working on.

Motivation

Edgebench is a project that focuses on how to efficiently distribute resources (bandwidth and CPU usage) across several video applications. Today’s video applications process their video data on a server, an approach known as edge computing. Bandwidth or compute may therefore be the greatest concern when edge computing runs over a WAN, because it is strictly limited.

Consider the following case. Suppose we have 3 video applications running, located in several areas across a city, and the total bandwidth allocated to those 3 video applications is fixed. Naively, we may divide the bandwidth evenly among the cameras in the system. We may then have the following graph of the allocated bandwidth over time.

The allocations are fixed and won’t change. However, every video application has its own characteristics that determine how good a result, or f1-score, it delivers. Our task is to maintain a high average f1-score. Therefore, we need to implement a new solution that is accuracy-oriented. This is where the accuracy-gradient [1] comes in.

System Design

In our current design, we need a resource allocator, namely the concierge. The concierge determines how much bandwidth is needed for every video application (vap) in the system, and it performs the allocation at a predetermined time interval. This process is called profiling. During profiling, the concierge first asks every vap to calculate its f1-score on a certain video segment when its bandwidth is increased by profile_delta. The default f1-score is then subtracted from this f1-score, and the difference is named f1_diff_high. After that, the concierge asks the vap to reduce its bandwidth by profile_delta and repeat the same process; this result is named f1_diff_low. Those two results are sent to the concierge for the next step. The concierge then calculates a sensitivity for each vap from these two results.

The sensitivity tells us which video application will give us the best f1-score improvement if we add more bandwidth to one vap while reducing another’s bandwidth. Based on this, the concierge gives bandwidth to the app with the highest sensitivity and takes bandwidth from the app with the lowest sensitivity.
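The reallocation step can be sketched as follows. This is a simplified illustration, not the concierge's actual code: the sensitivities are passed in as already-computed numbers (the sensitivity formula itself is not reproduced here), and the function name `reallocate`, the app names, and the step size are assumptions for the example.

```python
def reallocate(bandwidth, sensitivity, step):
    """Move `step` kbps from the least to the most sensitive app.

    `bandwidth` and `sensitivity` map app name -> value. The total
    bandwidth is unchanged. Returns a new allocation dict.
    """
    high = max(sensitivity, key=sensitivity.get)
    low = min(sensitivity, key=sensitivity.get)
    updated = dict(bandwidth)
    # Only move bandwidth if the donor app has enough left to give.
    if high != low and updated[low] >= step:
        updated[low] -= step
        updated[high] += step
    return updated

# Hypothetical values: an even split of 1200 kbps and made-up sensitivities.
bandwidth = {"app-a": 400, "app-b": 400, "app-c": 400}
sensitivity = {"app-a": 0.004, "app-b": 0.0005, "app-c": 0.001}
print(reallocate(bandwidth, sensitivity, step=80))
```

Repeating this at every profiling interval shifts bandwidth toward the apps whose accuracy benefits most from it, while the total stays fixed.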

Results

As mentioned above, our main objective is to improve the accuracy. However, two parameters are taken into account: the improvement and the overhead of that improvement. We first choose 3 dds apps [2] that we think will be our ideal case. The following graphs show the profile of our ideal case.

We can see that two of them have high sensitivity especially on lower bandwidth and one of them has low sensitivity. This is a perfect scenario since we may sacrifice one’s bandwidth and give it to the app that has the highest sensitivity at that iteration. We will do the experiment under the following setup

DATASETS=("" "uav-1" "coldwater" "roppongi")
MAX_BW=1200
PROFILING_DELTA=80
MI=5

That setup block tells us that we will use a total bandwidth of 1200 kbps, which means that at first the bandwidth is distributed evenly (400 kbps per app). The profiling_delta will be 80 kbps and the profiling interval (MI) will be 5 seconds.

Mode	DDS (uav-1)	DDS (coldwater)	DDS (roppongi)	Average
Baseline	0.042	0.913	0.551	0.502
Concierge	0.542	0.854	0.495	0.63 (+25.5%)

From the result, we managed to improve the average f1-score by about 0.13, or 25.5%. This is a very good result. There are a total of 10 videos in our dataset. For the next experiment, we first generate 6 combinations of dds apps. Note that in each combination, one video is uav-1, since we know that it has the highest sensitivity. We run the experiment with 4 bandwidth scenarios (1200, 1500, 1800, and 2100 kbps).
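The averages in the table above can be rechecked from the per-video scores. The snippet below only redoes that arithmetic; the variable names are chosen for this example.

```python
# Per-video f1-scores copied from the results table.
baseline = {"uav-1": 0.042, "coldwater": 0.913, "roppongi": 0.551}
concierge = {"uav-1": 0.542, "coldwater": 0.854, "roppongi": 0.495}

def average(scores):
    """Mean of the f1-scores in a {video: score} dict."""
    return sum(scores.values()) / len(scores)

base_avg = average(baseline)
conc_avg = average(concierge)
absolute_gain = conc_avg - base_avg
relative_gain = absolute_gain / base_avg

print(f"baseline {base_avg:.3f}, concierge {conc_avg:.3f}")
print(f"gain {absolute_gain:.3f} ({relative_gain:.1%})")
```

The relative gain can differ from the table in the last digit, depending on whether the averages are rounded before dividing. Most of the gain comes from uav-1, while the other two videos lose a little accuracy.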

The left figure depicts the average improvement of the concierge. Here we can see that the improvement decreases when the total bandwidth increases. The reason is that at higher bandwidth, the sensitivity tends to be closer to 0 and the concierge won’t do any reallocation. Overall, this confirms our previous result that with the help of uav-1, the concierge can improve the f1-score by up to 0.1. The next experiment randomly picks 3 dds videos out of the 10 videos, repeated 10 times. We would like to see how the concierge performs without any help from uav-1.

From the result, we still managed to get an improvement. However, the average improvement is lower than in the previous experiment. The reason for this will be discussed later.

Overhead Measurement

In the figure above, each graph corresponds to one total-bandwidth setting. In this experiment, a lower MI clearly leads to higher overhead, since profiling runs more often than with a higher MI. From the 4 graphs above, we can see that lowering the MI involves a significant trade-off, since the improvement itself is not highly significant. The highest improvement is at 1200 kbps. Hence, for higher bandwidth, there is no need to do the profiling too often.

Discussion

There are some limitations of our current design. If we look at the box plot in figure 5 above, we can see that there are some combinations where the improvement is negative.

The figure above depicts the profiling process from segment 6 used to determine the bandwidth for segment 7. Here we can see that the f1-score at that bandwidth for (jakarta) drops significantly. Our current design cannot address this issue yet, since we only consider the current video segment. We need to take into account not only the current segment but also the previous and future segments.

Regarding the overhead, we are aware that 50% overhead is still considered bad. We might try a dynamic MI, or skip the profiling for certain videos when it is not necessary.

Conclusion

Despite the aforementioned limitations, this report shows that the concierge is generally capable of improving the f1-score. Further updates will be presented in the final report.

References

[1] https://drive.google.com/file/d/1U_o0IwYcBNF98cb5K_h56Nl-bQJSAtMj/view?usp=sharing
[2] Kuntai Du, Ahsan Pervaiz, Xin Yuan, Aakanksha Chowdhery, Qizheng Zhang, Henry Hoffmann, and Junchen Jiang. 2020. Server-driven video streaming for deep learning inference. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication. 557–570.

Mid-term blog post for Public Artifact Data and Visualization

Mon, 31 Jul 2023 00:00:00 +0000

Over the past few weeks, our platform development has been progressing steadily, and we are excited to share the milestones we have achieved so far. As planned in our introductory blog, we have successfully laid the groundwork for the platform with the guidance and support of our mentor.

Milestones and Accomplishments

Here are some of the key functionalities we have implemented so far:

Modular Architecture: We successfully designed the platform with a modular architecture, separating the Graphical User Interface (GUI) and Command-Line Interface (CLI) functionalities. This modularity allows users to interact with the platform in their preferred way.
Experiment and Bucket Creation: Users can now create experiments, buckets (for storing different implementations of experiments), and iterations using either the GUI or CLI.
Real-time Backend Environment Monitoring: Through the command line interface, users have the capability to control the monitoring of backend environment data, allowing for real-time tracking and analysis of important metrics.
Visualizing Environment Variables: Users can now visualize detected environment variables on the platform. Moreover, they can compare iterations within different buckets and gain more insights by observing the timeseries data, such as CPU usage, in a graphical format.

Challenges

In the early stages of designing our platform, we encountered significant challenges at the system design level. One of the most daunting obstacles we faced was devising an effective method to monitor backend environment variables. To tackle this obstacle, we engaged in extensive discussions and sought guidance from our mentor. After careful consideration, we decided to adopt a multi-process approach to monitor the backend environment variables effectively. Specifically, we devised a meticulous strategy of creating a separate process in the background for each specific metric we needed to monitor. By allocating a dedicated process to each metric, we ensured a streamlined and efficient monitoring process.
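The one-process-per-metric idea can be sketched roughly as follows. This is a minimal illustration and not the platform's actual code: the metric readers (`read_load_average`, `read_cpu_time`), the record layout, and the sampling parameters are all assumptions made for the example.

```python
import multiprocessing as mp
import os
import time


def read_load_average():
    """Hypothetical metric: 1-minute system load average (Unix-like systems only)."""
    return os.getloadavg()[0]


def read_cpu_time():
    """Hypothetical metric: CPU seconds consumed so far by the calling process."""
    return time.process_time()


def monitor(name, read_metric, interval, n_samples, queue):
    """Sample one metric n_samples times and push (name, timestamp, value) records."""
    for _ in range(n_samples):
        queue.put((name, time.time(), read_metric()))
        time.sleep(interval)


def start_monitors(metrics, interval, n_samples, queue):
    """Start one background process per metric and return the process handles."""
    processes = []
    for name, read_metric in metrics.items():
        proc = mp.Process(
            target=monitor,
            args=(name, read_metric, interval, n_samples, queue),
            daemon=True,
        )
        proc.start()
        processes.append(proc)
    return processes


if __name__ == "__main__":
    metrics = {"load_avg_1m": read_load_average, "cpu_time_s": read_cpu_time}
    queue = mp.Queue()
    workers = start_monitors(metrics, interval=0.1, n_samples=3, queue=queue)
    # Drain the queue before joining so that no worker blocks on a full pipe.
    for _ in range(len(metrics) * 3):
        print(queue.get())
    for worker in workers:
        worker.join()
```

Keeping each reader in its own process means a slow or failing metric does not delay the others, at the cost of one extra process per metric.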

Currently, we are facing a challenge related to monitoring metrics. Since different users have varying monitoring requirements, it is impractical for us to manually write monitoring solutions for each user. To address this issue, we are actively working on implementing a pluggable design that allows users to configure their own monitoring preferences.

Our approach involves providing users with the flexibility to define their custom configuration files or write monitoring programs following our documented guidelines. This way, users can specify the specific metrics they wish to monitor and tailor the monitoring process to their individual needs.

Try it Out!

GUI Repository and CLI Repository
- In the README.md file of GUI repo, you will find detailed installation instructions to set up the Graphical User Interface (GUI). Follow the steps provided to get started with our platform.
Sample Repository
- In this repository, we have included scripts that allow you to run our program. Additionally, you can use these scripts as templates to monitor your own programs according to your specific requirements.

We welcome you to take the platform for a test drive and feel free to raise any issues you encounter during the installation process. Your feedback is invaluable to us, as it helps us identify and address any potential installation challenges and improve the user experience.

Enhancing Drift Detection through Fine-Tuning Llama2

Sun, 30 Jul 2023 00:00:00 +0000

Greetings everyone, I’m Kangrui. Over the past few weeks, we’ve dedicated our efforts and have consequently made significant progress in our drift detection methods. Now, I’m excited to present to you a detailed elaboration on how we prompted and fine-tuned Llama2 to efficiently carry out the drift detection task.

Motivation

Why LLM in drift detection method?

The use of large language models (LLMs) in drift detection methods presents numerous benefits that place it as a prominent solution in this domain.

Rapid Development: LLMs are in the vanguard of technological advancement. This field is evolving rapidly with continuous enhancements in model architecture, training techniques, and data handling. With every new version, these models are showing an increasing capacity to understand and generate human-like text, pushing the limits of what is achievable in Natural Language Processing (NLP) and Artificial Intelligence (AI) as a whole.
Superior Performance: Traditional drift detection methodologies such as Page-Hinkley, EDDM, and HDDM have their merits and have found success in numerous scenarios. Even Deep Learning (DL) techniques, like training a predictive model based on error rates, have made significant strides in the field. However, when handling complex, high-dimensional, and real-time data, LLMs have demonstrated exceptional results. They are not only able to effectively predict and respond to drifts but also adapt to new trends more swiftly. Our experiments using LLMs like GPT-3.5-turbo have yielded impressive results, notably outperforming other methods.

Fig. 1: Concept drifts detected by GPT-3.5-turbo in the Cori dataset

Flexibility: One of the major advantages of using LLMs is their flexibility in dealing with different types of input and output. In contrast to traditional methods, which are confined to single feature concept drift detection and can only process numerical values, LLMs can handle a range of input types including text, numbers, and more complex data structures. This capability allows them to detect multi-feature concept drifts, thereby broadening the scope and complexity of problems they can tackle. Moreover, the generation capability of LLMs can provide rich and detailed output, facilitating more comprehensive insights into the detected drifts.

Why Llama2 in drift detection method?

Llama2 presents a series of advantages that make it an excellent choice for applying LLMs to drift detection. Here’s a breakdown of the key reasons:

Performance Guarantee: As a newly released model, Llama2 has undergone extensive development and testing, providing a reliable guarantee of performance. It represents the cutting edge in AI technology, having benefited from the latest research and advancements in language model design.
Accessibility Guarantee: One significant advantage of Llama2 is that it is open-source. It is readily accessible on HuggingFace, which also provides a range of mature tools to fine-tune and deploy the model.
Flexibility for Fine-Tuning: Llama2 comes in different sizes, such as 7B, 13B, and 70B parameters, which allows for flexibility in model selection based on the task’s requirements and computational resources.

Data

Dataset

In our study, we employed synthetic data streams for the fine-tuning of Llama2. Synthetic data streams serve as an invaluable resource for controlled experiments in the domain of drift detection. These curated datasets encompass varied types of drifts, providing us with the capability to assess the efficacy of our detection algorithms under diverse scenarios.

Here is a brief introduction to the synthetic datasets we used:

Sine1 & Sine2: These datasets induce abrupt concept drift within a two-dimensional feature space. The classification rule, a sine function, dictates the instance labels, which are flipped at every drift point.
Mixed: This dataset, characterized by its combination of numeric and boolean features, uses a composite classification rule. The abrupt concept drift is simulated via a periodic reversal of class labels.
Stagger: This categorical dataset incorporates abrupt concept drift by periodically altering the classification rules tied to the features.
Circles & LED: These datasets are designed to simulate gradual concept drift. In Circles, the classification of instances is determined by their spatial relation to specific circles. LED imitates a seven-segment digit display, introducing drift by interchanging the pertinent attributes.
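To make the idea of an abrupt drift concrete, here is a minimal sketch of a Sine1-style generator. It assumes that both features are drawn uniformly from [0, 1], that points below the curve y = sin(x) are labelled 'p', and that the rule is reversed instantly at every drift point (the 50-instance transition window is ignored). The function name and parameters are illustrative and are not taken from the datasets' reference implementation.

```python
import math
import random


def generate_sine1(n_instances, drift_every, seed=0):
    """Yield ((x, y), label) pairs; the labelling rule is reversed at each drift point."""
    rng = random.Random(seed)
    for i in range(n_instances):
        x, y = rng.random(), rng.random()
        below_curve = y < math.sin(x)
        # Concepts alternate: original rule, reversed rule, original rule, ...
        flipped = (i // drift_every) % 2 == 1
        label = "p" if below_curve != flipped else "n"
        yield (x, y), label


if __name__ == "__main__":
    for features, label in generate_sine1(n_instances=6, drift_every=3):
        print(features, label)
```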

Typically, the synthetic datasets contain 100,000 or 1,000,000 instances. Concept drift occurs every 25,000 or 33,333 instances and is either abrupt (with a drifting period of 50 instances) or gradual (with a drifting period of 500 instances).

Data Preprocessing and Metrics

Given the token limit of Llama2 and the specific requirements of our project, we needed to transform the data into an appropriate format.

As such, we processed each data stream into three sections: the ‘undrifted’ period, the ‘drifting’ period, and the ‘drifted’ period. All instances in each section were randomly and independently drawn from the original data stream, summing up to a maximum of 100 instances. The number of instances for the undrifted and drifted periods ranged from 20 to 50, and for the drifting period, it ranged from 10 to 20.

For instance, let’s consider a dataset containing 100,000 instances where the concept drift occurs every 25,000 instances, causing abrupt concept drift. To format a data point, we could draw 20 to 50 instances from the first 25,000 as the undrifted period. Then, we could draw 10 to 20 instances from the 25,001st to 25,050th instance as the drifting period. Finally, we would draw 10 to min(100 - num(undrifted period) - num(drifting period), 50) from the 25,051st to 50,050th instance as the drifted period. This newly formatted data stream would then be fed into Llama2.
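The sampling in this example can be sketched as below. This is an illustration under stated assumptions rather than the project's preprocessing script: `build_data_point`, `sample_period`, and the `sampled_positions` key are hypothetical names, positions are 0-based, and the `*_index` fields are assumed to hold the first and last sampled positions of each period.

```python
import random


def sample_period(start, stop, count, rng):
    """Draw `count` distinct stream positions from [start, stop), kept in stream order."""
    return sorted(rng.sample(range(start, stop), count))


def build_data_point(drift_at=25_000, drift_width=50, max_total=100, seed=0):
    """Sample the undrifted, drifting, and drifted periods around one drift point."""
    rng = random.Random(seed)
    n_before = rng.randint(20, 50)
    n_transition = rng.randint(10, 20)
    n_after = rng.randint(10, min(max_total - n_before - n_transition, 50))

    before = sample_period(0, drift_at, n_before, rng)
    transition = sample_period(drift_at, drift_at + drift_width, n_transition, rng)
    after = sample_period(drift_at + drift_width, 2 * drift_at + drift_width, n_after, rng)

    total = n_before + n_transition + n_after
    return {
        # Positions inside the shortened stream that is shown to the model
        "before_period": [0, n_before - 1],
        "transition_period": [n_before, n_before + n_transition - 1],
        "after_period": [n_before + n_transition, total - 1],
        # First and last sampled positions in the original stream (assumed meaning)
        "before_index": [before[0], before[-1]],
        "transition_index": [transition[0], transition[-1]],
        "after_index": [after[0], after[-1]],
        # Every sampled position, in order, for slicing the original stream
        "sampled_positions": before + transition + after,
    }


if __name__ == "__main__":
    point = build_data_point(seed=42)
    print(point["before_period"], point["transition_period"], point["after_period"])
```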

We also included some additional information to assist Llama2’s inference process. A typical data point in our processed dataset includes:

{
 "before_period": [0, 31],
 "transition_period": [32, 38],
 "after_period": [39, 59],
 "before_index": [196, 19963],
 "transition_index": [20002, 20030],
 "after_index": [20310, 39984],
 "meta": "Dataset: MIXED\n\tv's type is nominal, range is ('False', 'True')\n\tw's type is nominal, range is ('False', 'True')\n\tx's type is numeric\n\ty's type is numeric\n\tclass's type is nominal, range is ('p', 'n')\n",
 "data_stream": ...
}

From this dictionary, the “meta” and “data_stream” entries are fed into Llama2. The “transition_period” serves as the criterion: if Llama2’s answer lies within the “transition_period”, we deem it correct.
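A minimal sketch of this correctness criterion follows. It assumes that both bounds of "transition_period" are inclusive and that the first integer in the model's reply is taken as its answer. The helper names are hypothetical.

```python
import re


def parse_index(answer):
    """Return the first integer found in the model's reply, or None if there is none."""
    match = re.search(r"\d+", answer)
    return int(match.group()) if match else None


def is_correct(answer, transition_period):
    """True if the predicted index lies within the transition period (inclusive bounds)."""
    start, end = transition_period
    index = parse_index(answer)
    return index is not None and start <= index <= end


def accuracy(answers, transition_periods):
    """Fraction of replies whose index falls inside the matching transition period."""
    hits = sum(is_correct(a, p) for a, p in zip(answers, transition_periods))
    return hits / len(answers)


if __name__ == "__main__":
    print(is_correct("35", [32, 38]))
    print(accuracy(["35", "no idea", "12"], [[32, 38], [32, 38], [32, 38]]))
```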

Llama2

Inference

We experimented with three variations of prompts during the inference phase.

Prompt Version 1:

[INST] <<SYS>>
 You are a helpful, respectful, and honest assistant. Always provide the most helpful responses possible while ensuring safety. Ensure that your responses are socially unbiased, positive, and free from harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. If a question lacks coherence or sense, explain why instead of providing incorrect information. If you are uncertain about an answer, refrain from sharing false information.
 <</SYS>>
 Your task is to identify the index in a given data stream where the relationship between the features and labels begins to change. The data stream is formatted as a list, with each element being a two-element list: the first represents the features (also a list), and the second is the label. If your answer is 'x', it indicates that the data pattern starts shifting at the xth data point in the stream.
 Here's an example of the data's metadata: Dataset: SINE1
 x's type is numeric
 y's type is numeric
 class's type is nominal, range is ('p', 'n')

 The given data stream is: [[[0.7, 0.07], 'p'], [[0.45, 0.78], 'n'], ..., [[0.64, 0.45], 'n']]
 Your task is to respond with a single index. No additional information is required.
[/INST]

Prompt Version 2:

The same as Prompt 1, but with a specific range for the index response:

Please provide an index ranging from 0 to 96. No additional information is required.

Prompt Version 3:

This prompt uses an instruction-input-output design, which we adopted for fine-tuning:

Below is an instruction paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Identify the index in a given data stream where the relationship between features and labels begins to change. The data stream is formatted as a list, each element being a two-element list: the first represents the features (also a list), and the second is the label. For instance, if the response is 'x', it means that the data pattern starts shifting at the xth data point in the stream. Only respond with an index, no further information is necessary.

### Input:
Meta Data:
Dataset: SINE1
 x's type is numeric
 y's type is numeric
 class's type is nominal, range is ('p', 'n')

Data stream:
[[[0.7, 0.07], 'p'], [[0.45, 0.78], 'n'], ..., [[0.64, 0.45], 'n']]

### Response:

Despite minor differences between Prompt Version 1 and Version 2, both suggested by Meta, the results varied significantly, a topic we will delve into in the following section. Prompt Version 3, employing the instruction-input-output structure, was used during our fine-tuning process.
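As an illustration, a Prompt Version 3 string could be assembled from a processed data point roughly as follows. The template text mirrors the prompt shown above, while `build_prompt` and the exact whitespace handling are assumptions made for this sketch.

```python
INSTRUCTION = (
    "Identify the index in a given data stream where the relationship between "
    "features and labels begins to change. The data stream is formatted as a list, "
    "each element being a two-element list: the first represents the features "
    "(also a list), and the second is the label. For instance, if the response is "
    "'x', it means that the data pattern starts shifting at the xth data point in "
    "the stream. Only respond with an index, no further information is necessary."
)

TEMPLATE = (
    "Below is an instruction paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\nMeta Data:\n{meta}\n\n"
    "Data stream:\n{data_stream}\n\n"
    "### Response:\n"
)


def build_prompt(meta, data_stream):
    """Fill the instruction-input-output template with one data point."""
    return TEMPLATE.format(
        instruction=INSTRUCTION,
        meta=meta.strip(),
        data_stream=repr(data_stream),
    )


if __name__ == "__main__":
    meta = "Dataset: SINE1\n\tx's type is numeric\n\ty's type is numeric\n"
    stream = [[[0.7, 0.07], "p"], [[0.45, 0.78], "n"]]
    print(build_prompt(meta, stream))
```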

Fine-Tuning

We utilized the tools provided by llama-recipes to fine-tune Llama2. The key command used to initiate the fine-tuning process is illustrated below:

python llama_finetuning.py --use_peft \
 --peft_method lora \
 --quantization \
 --model_name meta-llama/Llama-2-13b-chat-hf \
 --output_dir ./fine_tuned_model/Llama-2-13b-chat-hf-test_finetune \
 --dataset alpaca_dataset \
 --batch_size_training 40 \
 --num_epochs 1

Some explanation of the parameters:

--use_peft: This flag indicates the use of the Parameter-Efficient Fine-Tuning (PEFT) method. PEFT allows us to fine-tune the model more efficiently.
--peft_method lora: Here, we specify that LoRA (Low-Rank Adaptation) should be used for PEFT. LoRA freezes the original weights and trains small low-rank update matrices instead.
--quantization: The quantization flag reduces the memory footprint of the model by loading its weights at reduced precision.
--dataset alpaca_dataset: Specifies the dataset configuration used for fine-tuning; here, 'alpaca_dataset' corresponds to the instruction-input-output structure.

Results

The performance of various models and prompt versions is depicted in Fig. 2.

Fig. 2: Performance comparison of different models and prompt versions.

It is evident from the results that the design of the prompt has a significant impact on Llama2’s performance. Furthermore, due to computational resource constraints, we have only managed to fine-tune Llama2 on a portion of our dataset (approximately 1,000 instances). The entire training set consists of 19,000 instances, and the test set includes 5,000 instances. Despite these limitations, a performance increase is noticeable after fine-tuning.

GPU Emulator for Easy Reproducibility of DNN Training -- Interim Blog Post

Sun, 30 Jul 2023 00:00:00 +0000

Introduction

Motivation

The growing popularity of Deep Neural Networks has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects. Therefore, building an emulator can ameliorate the trouble of reserving GPUs, and the emulator can be modified to gather the profiles needed for optimization much quicker.

Overture

The following sections introduce the completed tasks and the details of each. The contents are summarized briefly and present only the necessary information. We finished the following tasks:

Literature Review
Emulator implementation:
- Time Profiling
- Pinned Memory
- Inter-GPUs Computation
Reproducing Figures

I will introduce them and the importance of each one.

Tasks + Reason

Literature Review

While waiting for the measurements, I started reading other GPU-related papers, especially the ones about GPU schedulers. We found that besides emulating computation and transfer time, we should also emulate the GPU memory profile in order to reproduce some other papers. Fortunately, it’s doable. In fact, without actually using a GPU, we can emulate many aspects of the GPU beyond its timing. I found several papers that are theoretically reproducible, but they use TensorFlow while my current work targets PyTorch. Therefore I need to keep looking for ones that use PyTorch.

Afterwards, we reviewed more papers on GPU scheduling from 2018-2023 to see if we could reproduce figures from them. We went through over 150 papers, looking for ones with a PyTorch implementation and an accompanying GitHub repository. We managed to find about 15 papers built on PyTorch, 6 of which had code published on GitHub.

We found the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs” and its GitHub page. The paper has three badges of “Artifacts Available, Evaluated, and Reproduced.” The paper’s content is implemented in PyTorch which means we can probably emulate this paper’s result with the emulator we already have by adding more features. We have started testing out to see if we can set up a similar environment and reproduce the experiments in the paper. After checking out the reproducibility of the paper, we will try to reproduce it using our emulator, and we might add new features to our emulator during this process.

First, I tried to reproduce the figures in the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs”, but stopped after a considerable number of attempts because the README was incomplete and too hard to follow. I started from the paper’s GitHub repository. From the paper I understood that GNN training differs from regular deep learning training because of its input irregularity, and that CoGNN’s algorithm schedules jobs onto machines more effectively. However, when I tried to install the software according to the environment README in order to reproduce the figures, I ran into many dependency issues, and barely any of the required packages installed successfully. The README in the software module was also unclear on how to run the experiments, and following the experiment setup did not give me the expected results. After struggling to complete even one suggested experiment, we eventually decided to abandon this paper and move on to others, which reminded me again of the importance of reproducibility.

Second, we found another paper, “Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent”. After reading it, we understood that its main focus is on allocating node resources (CPU, GPU) to the jobs dispatched by the Kubernetes scheduler, which reduces GPU fragmentation and raises resource utilization. The paper uses a simulator to model a large number of nodes and run jobs in simulation. I successfully ran the experiments demonstrated in the repo and even created a smaller sample so that we could get results faster, because their original experiment is run 1020 times, which would take about a month. However, when we dug deeper into the paper, we realized that their emulator is not a “real” one. Although it is built on Kubernetes, the part they used to create the figures is a pure simulator, and therefore does not fit our goal of emulating only the GPU-related parts while running the other parts of the system for real.

Reason:

The purpose is to figure out which papers can be reproduced using the emulator, and what other features are needed for the emulator to work.

Emulator implementation

Time Profiling

I profiled the performance of different GPUs, covering CPU-to-GPU data transfer time and GPU computation time. These two quantities are fairly constant on a given GPU, so they can be profiled once and then replayed in the emulation. We did this for 6 different GPUs: k80, rtx6000, m40, a100pcie, v100, and p100.

Once I had the performance profiles for a few types of GPU nodes, I implemented the first naive version of the emulator. It uses the recorded profile and the sleep() function to represent the amount of time each step needs. The time also varies with the command given, so some simple arithmetic was implemented too. Although the emulator runs on a CPU node, it still produces the time profile of a GPU, just as a real GPU node would.
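The profile-and-sleep approach can be sketched as follows. The timing values in `PROFILE` are made up for illustration, and the assumption that step time scales linearly with batch size stands in for the "simple arithmetic" mentioned above. None of the names below come from the actual emulator.

```python
import time

# Hypothetical per-sample timings in seconds; real values would come from profiling runs.
PROFILE = {
    ("p100", "resnet18"): {"transfer": 2.0e-5, "compute": 4.0e-4},
}


def emulated_step_time(gpu, model, batch_size, profile=PROFILE):
    """Estimate (transfer, compute) seconds for one batch by scaling per-sample timings."""
    entry = profile[(gpu, model)]
    return entry["transfer"] * batch_size, entry["compute"] * batch_size


def emulate_step(gpu, model, batch_size, sleep=time.sleep, profile=PROFILE):
    """Stand in for one training step by sleeping for the profiled durations."""
    transfer, compute = emulated_step_time(gpu, model, batch_size, profile)
    sleep(transfer)  # emulated CPU-to-GPU copy
    sleep(compute)   # emulated forward and backward pass
    return transfer + compute


if __name__ == "__main__":
    start = time.perf_counter()
    expected = emulate_step("p100", "resnet18", batch_size=128)
    print(f"expected {expected:.4f}s, measured {time.perf_counter() - start:.4f}s")
```

Passing `sleep` as a parameter makes it possible to swap in a fake clock when testing the emulator logic without actually waiting.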

Reason:

The time profile collected can be compared with Data Wait Time to conduct research on minimizing pipeline stall across different GPUs and models.

Pinned Memory

Pin memory threads: GPU-based PyTorch uses such threads to copy data from SHM to pinned memory, but CPU-based PyTorch doesn’t. Therefore, I need to implement an emulation of the pin memory threads. Fortunately, the data copy time is predictable. I have already found that pin memory time has little to do with the number of workers or the model type and depends only on the batch size. I still need to find out whether it depends on the GPU node, which I assume it does not at this point.

While implementing the features, we first emulated the CPU-to-GPU transfer time and GPU computation time for the p100 GPU based on the profiled information. Another CUDA behavior that requires emulation is that CUDA copies data from shared memory to pinned memory. To emulate it, we measured and emulated the time for copying such data (pinned memory). However, the emulator did not behave exactly like the real GPU. This was because we only emulated the time cost of using pinned_memory, but didn’t emulate its memory cost. To resolve this, we wrote a CPython module to manually allocate page-locked memory (which behaves the same as CUDA’s pinned_memory). Once this mechanism was in place, the emulator had its fundamental functions and properly mimicked CUDA’s behaviors.

Reason:

After collecting the GPU profile, I compared the emulator with the actual GPU and noticed some differences in IO time, meaning the emulation-based PyTorch did not yet match the actual GPU-based PyTorch.

Inter-GPUs Computation

We worked on emulating inter-GPU computation time in order to reproduce Figure 9 in the DNN stall paper. This is one of the influential factors in multi-GPU training, so we decided to first figure out how to implement this feature. As claimed in the paper, the larger the batch size, the less time it takes to update the model, and the smaller the batch size, the larger the overhead. However, our current emulator would report the same computation time, since we have not yet added features to emulate inter-GPU behavior. The first step was to reserve a lease with 2 GPUs and observe the effect of using multiple GPUs on computation time. We found that there was a small amount of overhead when running two GPUs instead of one on the p100 node. My job was to find out where and how this overhead arises and to find ways to emulate it in order to reproduce Figure 9. We used resnet18 with 4 workers and 10 batches to run a batch size of 128 on 1 GPU (Group A) and a batch size of 256 on 2 GPUs (Group B). With our current emulator, both experiments would take the same computation time to finish one batch. However, on real hardware the computation time of Group B was longer than that of Group A, meaning there was some overhead in computation time. I then dug into the PyTorch source code and identified one of the factors contributing to the overhead.

Reason:

To make the emulator more complete, so that it can provide accurate emulation even when using more than 1 GPU on a machine.

Reproducing Figures

After implementing the emulator, we managed to use it to reproduce Figures 3, 4, 5, and 6 in the paper “Analyzing and Mitigating Data Stalls in DNN Training” after a series of experiments and testing. Some environments in the paper were not the same as the ones we ran in the past week, but the general patterns matched the expected hypotheses and measurements. We double-checked all the data and figures produced and found that our prototype meets our expectations, so it was time to look for other papers to reproduce to make the emulator more interesting. The original and reproduced figures are shown below; the patterns reflect our expected results.

Original Figure 3:

Reproduced Figure 3:

Original Figure 4:

Reproduced Figure 4:

Original Figure 5:

Reproduced Figure 5:

Original Figure 6:

Reproduced Figure 6:

Reason:

Our original goal was to reproduce papers. Therefore, reproducing figures is a really good step towards that.

Summary + Coming Future

We will keep working to complete the emulator and figure out the exact mechanisms needed for the implementation. We will also look for further features that would be worth adding to the emulator.

Building extensions between Python libraries for Biotechnology laboratories

Fri, 28 Jul 2023 00:00:00 +0000

Hello again! This is Luiza, a GSoC contributor for the LabOp Project. My task is to build bridges between programming languages for Biotechnology Laboratory automation.

In the life sciences, reproducibility is an issue at most research centers. Biotechnology-focused laboratories usually have their own protocols, developed in house for their own applications. Researchers rely on such protocols to perform their experiments and collect data, but many difficulties arise when it comes to sharing those protocols and performing them in different laboratories. Whether because of missing equipment or reagents, or because of a different order of execution, replicating a protocol in another laboratory is a challenge. To address this issue, LabOp was developed to represent a protocol and convert it into as many forms as possible, so it can be executed by humans and by machines.

PyLabRobot and PyHamilton also come into the picture. These libraries make it possible to write protocols for Hamilton robots (and, for PyLabRobot, Tecan machines as well), but they share the limitation of only being able to represent laboratory protocols at a low level, with the user having to write every single command in Python for the protocol to be executed. Thus I’m currently developing an extension that converts LabOp protocols into PyLabRobot/PyHamilton scripts. This way the researcher can write protocols for robot execution in a friendlier fashion, using human-friendly terms.

BehaviourSpecialization for Liquid Handling class

The first step is building a correspondence spreadsheet with a hello world protocol written in both languages (LabOp | PyLabRobot). This way we can establish an equivalence between the functions, parameters and default commands of both libraries, as well as their structure. This spreadsheet will serve as guidance for converting the liquid handling steps from their representation in LabOp to their representation in PyLabRobot.

The second step is to create a file that executes the conversion. In this file I will define a labware map, which is basically a dictionary translating LabOp resource names into labware IDs recognizable by PyLabRobot’s “resource” classes, and a BehaviourSpecialization class that converts LabOp actions into operations of PyLabRobot’s Liquid Handler class, which coordinates the commands sent from the script to the machines. (See featured images.)

Dictionary for LabOp to Pylabrobot container correspondence
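As a rough illustration of the labware map idea, the sketch below uses a plain dictionary and a lookup helper. Every name on both sides of the map is a made-up placeholder, not a real LabOp resource name or PyLabRobot labware ID, and `to_robot_labware` is a hypothetical helper.

```python
# Hypothetical names on both sides; the real map depends on the protocol and deck layout.
LABWARE_MAP = {
    "96 well plate": "example_96_wellplate",
    "tip rack": "example_tip_rack",
    "reagent reservoir": "example_reservoir",
}


def to_robot_labware(labop_name, labware_map=LABWARE_MAP):
    """Translate a LabOp resource name into the labware ID used on the robot side."""
    try:
        return labware_map[labop_name]
    except KeyError:
        raise ValueError(f"No labware mapping for LabOp resource {labop_name!r}") from None


if __name__ == "__main__":
    print(to_robot_labware("tip rack"))
```

Failing loudly on an unmapped resource is a deliberate choice in this sketch, since silently skipping a container would produce an incomplete robot script.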

Then we move to the protocol that will be tested on the Hamilton machines. This is a plasmid purification protocol that is usually performed by a human at a very low level, one sample at a time. This limitation is not present on Hamilton robots, as they can handle many samples at the same time with only one protocol execution. The robot that will be running this protocol has two modules that are not yet present in PyLabRobot’s extensions: a pressure pump module and an on-deck heater shaker. I’ll be implementing these modules in PyLabRobot based on their default commands in PyHamilton, and will run the protocol on a Hamilton Starlet unit.

The steps of the protocol have been decoupled to facilitate the pilot testing, they are as follows:

Liquid handling - GOOD TO GO
Pressure pump module- requires adjustments
plate grippers(necessary to move the plasmid plate from one module to another) - requires adjustment
On deck heaterShaker- GOOD TO GO

The first pilot tests of the protocol will be run with water instead of plasmid to verify that all the steps are going smoothly. When that’s out of the way, we will perform the protocol with dirty plasmids that require purification (which is what the protocol is for). Success will be measured by sequencing the plasmid (if possible), performing a gel electrophoresis, and measuring the absorbance of the DNA.

The goal of these tests is to gather data on the effectiveness of the protocol and its execution on the machine, thus confirming that it is in fact a useful mechanism for DNA purification.

Uncovering Actionable Insights using ReadTheDocs Analytics

Thu, 27 Jul 2023 00:00:00 +0000

Introduction

Hello again! This is Jack, a GSoC contributor for the OpenROAD Project. My task is to update and optimise the documentation to encourage user adoption and engagement.

For open-source repo maintainers, ReadTheDocs is a godsend. One of its more underrated features is providing search and traffic analytics of up to 90 days for Community tier users. This is awesome, because ReadTheDocs is “always free for open source and community projects”.

Motivation

Why are analytics important?

Analytics are great as a proxy indicator for documentation engagement. For instance, high traffic to a page could show how popular the tool is, or it could mean the tool is unclear and people need more visits to the page to understand its usage. Either way, the increased visits indicate that the page needs to be taken care of.

In what follows we aim to provide a quick tutorial as well as list out some of the actionable insights we uncovered in the OpenROAD/OpenROAD-flow-scripts documentation project.

Preamble

To download the analytics raw csv files, refer to this website.

You should also have the following packages installed: pandas, numpy, matplotlib, scipy.

Traffic Analytics

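Traffic analytics are easy to understand. The export has the columns Date, Version, Path, and Views (the daily view count), and can be loaded as follows:

import pandas as pd

# Load the export, reverse the row order, and keep only the date part of each timestamp
df = pd.read_csv('ta_or.csv')[::-1].reset_index(drop=True)
df.Date = df.Date.apply(lambda x: x.split()[0])
df.head()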

Figure 1: Loading traffic analytics dataframe

The raw data is not all that informative. Let us aggregate the data to obtain the weekly views.

weeklydf = df.copy()
weeklydf.Date = pd.to_datetime(weeklydf.Date) - pd.to_timedelta(7, unit='d')
weeklydf = weeklydf.groupby(['Path', pd.Grouper(key='Date', freq='W')])['Views']\
 .sum()\
 .reset_index()\
 .sort_values('Date')
weeklydf[weeklydf.Path == '/index.html']

Figure 2: Aggregated weekly traffic

Note that we can replace the page path with any interesting page path we desire. A useful command to obtain all possible page paths in this dataset is to use:

weeklydf.Path.unique()

Figure 3: Unique paths in dataset

With these neat data in our arsenal, let us do some plotting! For the visualisation, we have chosen to use the traffic aggregated on a daily scale. On top of this, we also plot a linear best-fit line of all the points to track the trendline over time.

The code below shows how to plot the top 20 pages.

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def plot_views(df, numPages=20):
    # Group by Path, sum views, and rank pages by total views
    pathResults = df.groupby('Path').Views.sum().sort_values(ascending=False)
    fig, ax = plt.subplots(numPages, figsize=(15, 30))
    fig.tight_layout()

    for i in range(numPages):
        key = pathResults.index[i]
        temp = df[df.Path == key]
        ax[i].scatter(temp.Date, temp.Views)
        ax[i].set_xticks(np.arange(0, 90, 7))  # one tick per week, to avoid cluttering the x-axis
        ax[i].set_ylabel('Views')
        ax[i].set_title(key)

        # Linear regression of views against the day number (0, 1, 2, ...)
        x, y = temp.Date, temp.Views
        bestfit = stats.linregress(range(len(y)), y)
        print(bestfit)
        equation = str(round(bestfit[0], 2)) + "x + " + str(round(bestfit[1], 2))
        ax[i].plot(range(len(y)), np.poly1d(np.polyfit(range(len(y)), y, 1))(range(len(y))), '--', label=equation)
        ax[i].legend(loc='upper right')

Figure 4: Top 20 pages by daily view counts (in descending order)

Also, we can aggregate the total views by day to plot daily traffic:

def plot_daily_traffic(df):
    # Groupby Date, sum views
    fig = plt.figure(figsize = (15,10))
    dateResults = df.groupby('Date').Views.sum()
    x, y = dateResults.index, dateResults.values
    plt.scatter(x, y)
    plt.xticks(np.arange(0,90, 7))
    plt.ylabel('Views')
    plt.title('Traffic by Day')

    # linear regression
    bestfit = stats.linregress(range(len(y)),y)
    print(bestfit)
    equation = str(round(bestfit[0],2)) + "x + " + str(round(bestfit[1],2))
    plt.plot(range(len(y)), np.poly1d(np.polyfit(range(len(y)), y, 1))(range(len(y))), '--',label=equation)
    plt.legend(loc='upper right')

Figure 5: Daily aggregated traffic

Key Trends:

Notice how there seems to be a cyclical pattern every week - rise in average view counts during Mon-Fri, then a falloff on weekends. This is most evident in the pages /index.html, /main/README.html. This could be attributed to the standard work or study week of Mon-Fri.
According to the gradient of the best-fit line in Figure 5, there seems to be a slow decline in traffic for the OpenROAD docs. A gradient of -0.77 views per day translates roughly to a decline of 22 views per month. The small decline could be attributed to the higher traffic from 19-29 March 2023, the dates of the OpenROAD 7nm design contest. Contests are always good for driving traffic.
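As a rough check of the arithmetic above, the daily gradient can be converted into a monthly change. This is a minimal sketch that assumes a month is counted as four weeks (28 days); that choice, the function name and the `days_per_month` parameter are illustrative assumptions rather than something stated in the analytics data.

```python
def monthly_change(slope_per_day, days_per_month=28):
    """Convert a best-fit gradient in views/day into an approximate change in views/month."""
    return slope_per_day * days_per_month


# Assumed four-week month versus a 30-day month
print(round(monthly_change(-0.77)))
print(round(monthly_change(-0.77, 30)))
```

Either convention gives a decline in the low twenties of views per month, so the conclusion of a slow decline does not depend on the exact month length.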

Actionable insights:

Top pages are usually landing pages: index.html, main/README.html, main/src/README.html. We thus prioritised making these pages more readable and concise.
These are followed by the tutorials page /tutorials/index.html and /search.html. The prominence of the tutorials page made us shift the tutorials link to a higher position on the left navigation sidebar. Search tips were also included to help users obtain better search results. More about search in the next section.
Next, as OpenROAD consists of 20 tools, traffic analytics helps us come up with an order in which to update their pages: ifp, gui, odb, ppl, sta, grt, mpl, gpl, rsz, rcx, pdn, cts, psm.

Search Analytics

Search analytics come in the form Date, Query, TotalResults. Unlike the view counts in traffic analytics, TotalResults does not refer to the number of times the query was searched that day; rather, it is the total number of results returned by that query on that day. A separate aggregation is still needed to obtain the search counts.

Firstly, let us load the dataset and perform a groupby on the column Date to obtain the daily count aggregates.

df = pd.read_csv('sa_or.csv')[::-1].reset_index(drop=True)
df = df.rename(columns ={'Created Date': 'Date', 'Total Results': 'TotalResults'})
df.Date = df.Date.apply(lambda x: x.split()[0])

dateResults = df.groupby('Date').TotalResults.count()
dateResults

Figure 6: Code output for daily aggregated search counts.

Now we are ready to plot the daily aggregated searches. This represents the number of times a search was performed on the documentation website.

def plot_daily_searches(df):
    dateResults = df.groupby('Date').TotalResults.count()
    x, y = dateResults.index, dateResults.values
    plt.scatter(x, y)
    plt.xticks(np.arange(0,90, 7))
    plt.ylabel('# Times Searched')
    plt.title('Search count by day')

    # linear regression
    bestfit = stats.linregress(range(len(y)),y)
    print(bestfit)
    equation = str(round(bestfit[0],2)) + "x + " + str(round(bestfit[1],2))
    plt.plot(range(len(y)), np.poly1d(np.polyfit(range(len(y)), y, 1))(range(len(y))), '--',label=equation)
    plt.legend(loc='upper right')

Figure 7: Daily aggregated search counts

We can also do an additional analysis of queries that return zero results. In other words, we are interested in the terms people are curious about but that are not currently covered by our documentation. Think of it as on-site search engine optimisation.

zeroResults = df[df.TotalResults == 0]
zeroResults = zeroResults.groupby('Query').Date.count().sort_values(ascending=False)
print('\nAll 0 results queries (desc)\n')
print(zeroResults.index.tolist())

Example output as follows:

['autotuner', 'tdms', '*macro*', 'rtlmp_max_inst', 'get_property',
'check_setup', 'centos', 'initialize_padring', 'core_utilization',
'pin_access', 'read_libraries', 'config', 'eco', 'rpt',
'improve_placement', 'define_process_corner', 'global_place',
'report_worst_slack', 'max_phi_cof', 'report_power', 'get_pins',
'registerfile', 'set_global_routing', 'prebuilt', 'env',
'repair_clock_inverters', 'set_thread_count', 'report_',
'partition_design', 'place_cell', 'blockage', 'partitionmgr',
'nmos', 'tuner', 'write_sdf', 'place_density', 'place_pins_args',
'size_cell', '*macor*', 'repair_clock_inverter', 'misk',
'readhaty', 'readhat', 'obstruct', 'odbpy', 'openpdn', 'openram',
'placement_cfg', 'read_macro_placement', 'output_drc', 'positon',
'pct', 'qrctechtable', 'qrctechfile', 'qrctech', 'qrc',
'properly covered', 'precision innovations', 'repeater', '"rcx-0487"',
'report_worst', 'report_area', 'report_clock_properties', 'skywater',
'study', 'sv', 'synth', 'synth_hierarchical', 'systemverilog',
'tdm', 'tdms_place', 'triton', 'ungroup', 'verilog_files',
'wrc', 'write_lef', 'write_partition_verilog', 'שואם',
'si2', 'sever', 'setrc', 'rtl_macro', 'report_dcalc', 'report_design',
'report_design_info', 'report_instance', 'report_slews', 'resize',
'rtlmp', 'set_power_activity', 'rtree', 'run_all', 'run_all.tcl',
'sc', 'set_all_input_output_delays', 'set_io_pin_constraints', 'metis',
'lefdef', 'make_result_file', 'macro_placement_cfg', 'clock__details',
'clocks__details', 'combinational', 'config.mk', 'coord',
'core_margin', 'db_process_node', 'dbblocjs', 'dbdatabase',
'dbr', 'dbrt', 'dbrttree', 'debian', 'define_pin_shape',
'densiy', 'desgin', 'diff_file', 'clk_period', 'clk_io_ptc',
'cdl', 'analog', './env.sh', '178', '6_final',
'6_final.odb', '_placement', 'abat', 'add_stripe', 'arch',
'ccs', 'binaries', 'bookshelf', 'buff_cell', 'buildwithdocker',
'busbitchars', 'buschar', 'captable', 'directoryobject',
'disallow_one_site_gaps', 'distribute', 'is_port', 'hierarch',
'hop', 'hyper', 'initialie_flooorplan', 'initialize_flooorplan',
'instance_count', 'is_chip', 'lean', 'gui_final', 'lec',
'*def*', 'limitation', 'lyp', 'maco', 'macro_pin',
'macro_place', 'harness', 'gui.py', 'dont', 'fill_cell',
'dreamplace', 'em', 'enable_dpo', 'energy', 'env.sh', 'erc',
'export', 'findmaste', 'grt_layer_adjustments', 'findmaster',
'freepdk45', 'gdt', 'global_', 'global_place_db',
'global_placementy', 'graph', '갲']

In our case, the zero-result queries roughly fall under one of these categories:

Missing documentation: Either the parameter or the functionality is not documented.
Typo: The user has the right keyword, but did not type it correctly. We therefore provide search tips, such as using the fuzziness ~N operator, for better matches.
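The two categories can be roughly separated automatically by checking whether a zero-result query is a near miss of a term the documentation already covers. The sketch below uses `difflib.get_close_matches` from the Python standard library. The `known_terms` list, the function name and the 0.8 cutoff are illustrative assumptions and not part of the original analysis.

```python
import difflib


def classify_zero_result_queries(queries, known_terms, cutoff=0.8):
    """Split queries into likely typos (close to a known term) and likely missing documentation."""
    typos, missing = {}, []
    for query in queries:
        matches = difflib.get_close_matches(query, known_terms, n=1, cutoff=cutoff)
        if matches:
            typos[query] = matches[0]  # closest documented term
        else:
            missing.append(query)
    return typos, missing


# Hypothetical list of terms that the documentation is assumed to cover
known_terms = ['density', 'design', 'position', 'initialize_floorplan']
# A few zero-result queries taken from the example output
queries = ['densiy', 'desgin', 'positon', 'autotuner', 'centos']

typos, missing = classify_zero_result_queries(queries, known_terms)
print('Likely typos:', typos)
print('Likely missing documentation:', missing)
```

Queries in the second group are candidates for new documentation, while the first group mostly confirms that search tips are worthwhile. The cutoff would need tuning on the real query list.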

Future Work

ReadTheDocs could also be linked with Google Analytics, but this remains for more advanced users.

Another rich source of information for open-source maintainers is GitHub issues, the platform where users directly discuss their problems. Documentation engagement could also be tracked with metrics such as installation issues per week, or the user-issue retention rate, which tracks the number of users that continue to file issues after their first.
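As a minimal sketch of how these two metrics might be computed, assume the issues have already been exported as (author, date opened, installation-related flag) records. The record layout, the sample data and both function names are hypothetical, and the retention rate is expressed here as a fraction of issue authors, which is only one possible reading of the metric.

```python
from collections import Counter
from datetime import date

# Hypothetical issue records: (author, date opened, is it an installation issue?)
issues = [
    ("alice", date(2023, 3, 6), True),
    ("bob", date(2023, 3, 7), True),
    ("alice", date(2023, 3, 15), False),
    ("carol", date(2023, 3, 16), True),
    ("alice", date(2023, 3, 28), False),
]


def installation_issues_per_week(issues):
    """Count installation issues per ISO (year, week)."""
    weeks = Counter()
    for _, opened, is_install in issues:
        if is_install:
            iso = opened.isocalendar()
            weeks[(iso[0], iso[1])] += 1
    return dict(weeks)


def issue_retention_rate(issues):
    """Fraction of issue authors who filed at least one more issue after their first."""
    per_user = Counter(author for author, _, _ in issues)
    if not per_user:
        return 0.0
    returning = sum(1 for count in per_user.values() if count > 1)
    return returning / len(per_user)


print(installation_issues_per_week(issues))
print(issue_retention_rate(issues))
```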

Conclusion

This post showcases the amount of insight one can gather from parsing traffic and search analytics. It also provides useful Python functions that can be applied to the analytics dataset for fast prototyping and experimentation. If you are a contributor to open-source projects, try uncovering some insights for your doc pages today!

Halfway Through GSOC: My Experience and Learnings

Mon, 17 Jul 2023 00:00:00 +0000

Hello there! I’m Jonathan Edwin, all the way from the beautiful archipelago of Indonesia. This year, I got the exciting chance to jump on board the 2023 Summer of Reproducibility initiative. It’s been quite the adventure! Right now, I’m pouring my energy into a fascinating project titled Using Reproducibility in Machine Learning Education. I’m thrilled to be able to make my own little mark on it.

For those of you who are not familiar with what I’m working on, let me shed some light. My project, as part of the “Using Reproducibility in Machine Learning Education” initiative under the guidance of Fraida Fund, focuses on creating educational resources that center around reproducing some key machine learning techniques. These include Cutout data augmentation, U-Net, and Siamese networks, to name a few. The end product will be a series of interactive Jupyter notebooks that provide step-by-step guidance for students, helping them not only understand these complex models but also gain hands-on experience in achieving research reproducibility.

Progress and Challenges

Embarking on this project, I dove headfirst into the world of Cutout data augmentation, immersing myself in the many experiments outlined in the foundational paper. This initial study proved to be an intricate blend of multiple datasets, two network architectures, and a performance evaluation of models with and without Cutout data augmentation. Additionally, it included the exploration of these models in combination with other data augmentation techniques.

One of our main objectives has been to help students visualize how the model interacts with the data, and for this, we’ve been leveraging a tool called Grad-CAM. The initial paper provided a rich landscape for exploration and learning, leading us to segment our journey into six interactive Jupyter notebooks - Introduction, CutOut, ResNet, WideResNet, Regularization, and Grad-CAM.

I’m excited to share that, as we’ve hit the mid-term milestone, I’ve managed to make significant strides and completed the notebooks up to the WideResNet section. It’s been a journey full of learning and growth, overcoming various challenges along the way - understanding the intricacies of the experiments, deconstructing complex architectures, and distilling all this into digestible, interactive notebooks for students. Despite the challenges, the process has been incredibly rewarding. As we gear up for the next half of the project, I’m eager to tackle the remaining sections and share my work with the community.

Learnings and Skills Gained

Embracing the Iterative Process of Open Source Development: My initial foray into open source development had me writing and running code in one environment, then copying parts of it to another environment and pushing it from there to GitHub. This occasionally led to mistakes during the code migration. However, I’ve since learned to write or change a little bit of code, run the new version directly from GitHub, catch errors, and improve. In open source development, the end goal is to ensure everything works flawlessly, even if it involves several iterations. This is especially true considering the code from GitHub might directly run on platforms like Chameleon or Google Colab.

Understanding the Distinction between Reproducing Experiments and Crafting Educational Content: There’s a stark difference between merely reproducing an experiment from a research paper and creating an educational resource around that experiment. The former generally involves cloning and running the code, verifying it against the claims in the paper with minimal modifications. The latter, however, necessitates adapting and simplifying the code, regardless of the learner’s skill level, to ensure their comprehension. It’s about carefully guiding learners through each step for a more profound understanding.

The Power of ‘Show, Don’t Tell’: This priceless lesson was imparted by my mentor, Ms. Fraida Fund. Rather than telling me what to do when I erred or needed to learn something new, she demonstrated the correct way first-hand. This hands-on approach made understanding far easier. This principle is also reflected in the creation of our notebooks. For instance, we chose to include the Grad-CAM notebook. Although not directly referenced in the paper, it offers students a clear visual understanding of the impact of the Cutout technique, embodying the “show, don’t tell” philosophy.

Next Steps

As we step into the second half of this thrilling journey, our primary goal is to complete the remaining sections of our Cutout project. We’re setting our sights on the final notebook - Grad-CAM. The Grad-CAM notebook will offer a visual exploration of how our models interpret and interact with data, thereby solidifying the students’ understanding of Cutout data augmentation. So, stay tuned for more as we plunge into these fascinating topics!

Conclusion

Looking back, my time with the Summer of Reproducibility initiative has been nothing short of a profound learning experience. Working on the “Using Reproducibility in Machine Learning Education” project has been both challenging and rewarding, and I am incredibly grateful for this opportunity.

I’ve gained valuable insights into open-source development, delved deeper into the intricacies of machine learning techniques, and experienced firsthand the transformative power of a ‘show, don’t tell’ teaching approach. Moreover, I’ve learned that the creation of educational resources requires a delicate balance between preserving the essence of original research and adapting it to foster easy understanding.

As we press forward, I’m excited about the prospects of the coming weeks. The completion of the Grad-CAM notebook lies ahead, marking the final pieces of our Cutout project. Beyond this project, the skills and lessons I’ve acquired during this initiative will undoubtedly guide me in future endeavours.

I can confidently say that my GSOC journey has been a remarkable chapter in my growth as a developer and researcher. Here’s to more learning, more coding, and more breakthroughs in the future!

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Wed, 12 Jul 2023 00:00:00 +0000

As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, my proposal, under the mentorship of In Kee Kim and Martin Putra and in collaboration with Charis Christopher Hulu (another OSRE fellow), aims to analyze large-scale sequencing datasets in order to gain insights into how ‘input quality’ affects genomic workflows’ execution times.
Recent advancements in Next-Generation Sequencing (NGS) technologies have resulted in massive amounts of nucleotide sequence data and automated genomic workflows to streamline analysis and data interpretation. The success of NGS-driven research has also led to a sudden increase in data of varying size and complexity, making it more time-consuming for researchers to test hypotheses. Analyzing high-throughput genomic data requires a step-by-step execution of dedicated tools - also known as workflows. The first step toward the execution of a typical genomic analysis workflow is quality control of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Prior work in this area has suggested that the runtimes of genomic workflows are affected by qualitative differences in the data. Additionally, there is very little consensus on what constitutes “input quality” regarding data from large genomic experiments. In this proposal, we hypothesize that genomic data quality significantly impacts the genomic workflows’ execution time. We aim to leverage machine learning techniques to extract predictive features from quality control tools that robustly predict workflow execution time.

Public Artifact Data and Visualization

Sat, 17 Jun 2023 00:00:00 +0000

Hello! As part of the Public Artifact Data and Visualization project, our proposals (proposal from Jiayuan Zhu and proposal from Krishna Madhwani), under the mentorship of Anjo Vahldiek-Oberwagner, aim to design a system that allows researchers to conveniently record and compare the environmental information, such as CPU utilization, of different iterations and versions of code during an experiment.

In academic experiments, there is often a need to compare results and performance between different iterations and versions. This comparative analysis helps researchers evaluate the impact of different experimental parameters and algorithms on the results and enables them to optimize experimental design and algorithm selection. However, to conduct effective comparative analysis, it is essential to record and compare environmental information, alongside the experimental data. This information provides valuable insights into the factors that may influence the observed outcomes.

Through this summer, we aim to develop a system that offers a streamlined interface, enabling users to effortlessly monitor their running programs using simple command-line commands. Moreover, our system will feature a user-friendly dashboard where researchers can access historical runtime information and visualize comparisons between different iterations. The dashboard will present comprehensive graphs and charts, facilitating the analysis of trends and patterns in the environmental data.

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Fri, 16 Jun 2023 00:00:00 +0000

Hi! I’m Charis, an undergraduate student in the IT and Big Data Analytics program at the Calvin Institute of Technology. As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, my proposal, under the mentorship of In Kee Kim and Martin Putra, aims to gain insight into features that are highly correlated with execution times of genomics workflows and to build machine learning models for predicting workflow execution time.

Genomics workflows exhibit a long-tail pattern in their execution times. According to the previous project team’s findings, approximately 2% of genomics workflows had a median execution time of up to 15%, resulting in weeks of execution. Interestingly, it was observed that input quality plays a role in these execution time differences. Therefore, we will analyze features such as the quality of input data as well as the amount of resources allocated in the execution of genomics workflows to find features that correlate with execution time. Based on these features we will build a machine learning model that can predict the execution time of genomics workflows.

In collaboration with Shayantan Banerjee (another contributor), who will study data quality, I will study the system metrics of genomics workflows at both the workflow level and the tool level. Metrics will be collected by running genomics workflows using the Slurm workload manager under various resource allocation conditions. Genomics workflows will be executed on Chameleon clusters of different sizes.

GPU Emulator for Easy Reproducibility of DNN Training

Tue, 13 Jun 2023 00:00:00 +0000

Hi! I’m Haoran Wu, a third year at the University of Chicago majoring in Economics and Computer Science. With my proposal, I’m working on the GPU Emulator for Easy Reproducibility of DNN Training project with Professor Vijay Chidambaram. A Deep Neural Network (DNN) is an advanced artificial neural network that employs multiple layers to process intricate patterns and relationships within data. It finds applications in various fields such as image and speech recognition, natural language processing, and predictive modeling. The layers in a DNN progressively extract higher-level features from raw input data, enabling the network to learn and generalize patterns effectively.

Nevertheless, not all DNN research experiments require the use of a GPU. System researchers, for instance, may be primarily interested in performance profiles and not necessarily in the accuracy of training or inference. These researchers might focus on optimizing the storage layer and data loading of DNN training. In such cases, a GPU emulator that accurately replicates GPU behavior without needing a physical GPU can fulfill their requirements. By utilizing a GPU emulator, system researchers can evaluate their system optimizations’ performance without competing for limited GPU resources in the cloud, thereby avoiding unnecessary delays in their research progress. Our work will eventually be open source and benefit the community.

Using Reproducibility in Machine Learning Education

Mon, 05 Jun 2023 00:00:00 +0000

I am Jonathan Edwin, coming from Indonesia, and I am extremely thrilled to be involved in the 2023 Summer of Reproducibility initiative, where I am contributing to the Using Reproducibility in Machine Learning Education project.

As part of the Using Reproducibility in Machine Learning Education project, my proposal, under the mentorship of Fraida Fund, aims to develop educational resources focusing on reproducing and replicating fundamental machine-learning techniques, such as Cutout data augmentation, U-Net, and Siamese networks. The project aims to provide students with a hands-on learning experience that enhances their understanding of the models and their underlying principles while imparting valuable skills in ensuring research reproducibility. The project will involve the creation of a series of interactive Jupyter notebooks covering the selected papers, guiding students through reproducing results, and focusing on best practices for ensuring reproducibility. Upon completion, the notebooks will provide a comprehensive and accessible learning experience for students while emphasizing the importance of reproducibility in machine learning education. The proposal also identifies potential challenges associated with the project and proposes solutions to address them. Challenges include incompatibility between the original code and current frameworks or environments, difficulty in reproducing the exact results due to factors such as randomness or lack of specific details in the paper, and ensuring that the interactive elements in the Jupyter notebooks are engaging and effective in teaching reproducibility concepts.

FlashNet: Towards Reproducible Continual Learning for Storage System

Sun, 04 Jun 2023 00:00:00 +0000

Hello! I’m Rani, a third-year undergraduate student at Institut Teknologi Bandung majoring in Informatics. As part of the FlashNet project, my proposal, under the mentorship of Haryadi S. Gunawi and Daniar Kurniawan, aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques.

In real world workloads, it is known that the I/O stream changes and varies. Hence, the performance of I/O read/write could vary and introduce the tail latency. We would like to predict the latency of I/O read to cut the tail and improve the system’s performance. This project focuses on improving the FlashNet pipeline and introducing adaptability to the machine learning models built.

During the summer, we plan to implement the continual learning pipeline using the machine learning models we built previously in the project. Of course, continual learning isn’t continual learning without the ability of self-motivated retraining. Thus, we will implement, evaluate, and test several drift detection algorithms. We will also build a visualization platform to evaluate and monitor the performance of the models. Lastly, we plan to create Chameleon Trovi artifacts to demonstrate our experiments and make these implementations available and reproducible to the public.

Introducing Levels of Reproduction and Replication in Machine Learning

Thu, 01 Jun 2023 00:00:00 +0000

Greetings everyone,

I am Mohamed Saeed and I am delighted to be part of the 2023 Summer of Reproducibility program, where I am contributing to the Using Reproducibility in Machine Learning Education project.

My proposal was accepted, and I am fortunate to have Fraida Fund as my mentor. The objective of my project is to develop highly interactive open educational resources that can be utilized by instructors teaching graduate or undergraduate machine learning courses. These resources will focus on integrating instruction on reproducibility and reproducible research principles.

Understanding and practicing reproducibility in machine learning (ML) research is of utmost importance in today’s scientific and technological landscape. Reproducibility ensures the reliability, transparency, and credibility of ML findings and discoveries. By learning the principles of reproducibility, students at different levels can validate research results, test introduced methodologies, and understand the level of reproducibility of research.

My contribution will involve developing interactive educational resources that encompass code examples, writing exercises, and comprehensive explanations of key concepts of reproducing ML research. These resources will be carefully crafted to assist students at various levels of expertise. Our aim is for these resources to be widely adopted by instructors teaching graduate or undergraduate machine learning courses, as they seek to enhance the understanding of reproducibility and reproducible research principles.

I think this is a great opportunity to learn more about ML research reproducibility. I’ll be posting regular updates and informative blogs throughout the summer, so stay tuned!

ScaleBugs: Reproducible Scalability Bugs

Thu, 01 Jun 2023 00:00:00 +0000

Hello! As part of the ScaleBugs project, our proposals (proposal from Goodness Ayinmode and proposal from Zahra Nabila Maharani), under the mentorship of Cindy Rubio González, Haryadi S. Gunawi and Hao-Nan Zhu, aim to build a dataset of reproducible scalability bugs by analyzing bug reports from popular distributed systems like Cassandra, HDFS, Ignite, and Kafka. For each bug report, we will analyze whether the reported bug is influenced by the scale of the operation, such as the number of nodes being used or the number of requests. The resulting dataset will consist of bug artifacts containing the buggy and fixed versions of the scalability system, a reproducible runtime environment, and workload shell scripts designed to demonstrate bug symptoms under different scales. These resources will help support research and development efforts in addressing scalability issues and optimizing system performance.

Reproducible Evaluation of Multi-level Erasure Coding

Wed, 31 May 2023 00:00:00 +0000

Hi! My name is Alex, an undergraduate student at the University of Chicago. As part of the Reproducible Evaluation of Multi-level Erasure Coding, my proposal under the mentorship of John Bent and Anjus George aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations.

To provide some context, Erasure Coding (EC) is a common approach to protect data from disk failures. Data centers nowadays increasingly use Multi-Level Erasure Coding (MLEC), a newly developed erasure coding method that aims to deal with the drawbacks of Single-Level Erasure Coding (SLEC). Despite its increasing popularity, there have not been many systematic studies to analyze and evaluate MLEC, which is the focus of this project.

The evaluation will primarily be conducted through simulations, since modifying configurations in a real large-scale system is costly and impractical. The expected deliverables of this project will be:

An MLEC simulator that can reproducibly simulate different configurations of the MLEC system, e.g. coding parameter selection, chunk placement scheme, repair method choice, etc.
An analysis of the performance and durability tradeoffs between different MLEC design choices based on the evaluation results from the simulation
Reproduced SLEC evaluation results using existing SLEC simulators
A comparison between MLEC and SLEC on performance and durability tradeoffs
Well-written documents and detailed guides on how to reproduce the evaluation results

Our plan is to build the simulator throughout the summer. We hope our simulator and evaluation results can provide designers of large-scale storage systems with valuable insights on choosing the most appropriate erasure coding configuration per their needs.

[FLASHNET]: Leveraging ML-augmented I/O in Linux

Tue, 30 May 2023 00:00:00 +0000

Hi! I’m Justin, an undergraduate at the University of Chicago. As part of the Flashnet project, my proposal, under the mentorship of Daniar Kurniawan and Haryadi S. Gunawi, aims to port the Flashnet model into the Linux kernel.

In this attempt, I will borrow architecture/design choices from LAKE (to take advantage of its integration of ML-focused hardware acceleration in the kernel) and evaluation criteria from LinnOS to test for model inference accuracy. I also plan to support latency “bucket” inference output to improve accuracy. Ultimately, my goal is to gain further insight into best practices for integrating ML models into real-life operating systems like Linux and to inform general design choices for the Flashnet pipeline.

Reproduce and benchmark self-adaptive edge applications under dynamic resource management

Tue, 30 May 2023 00:00:00 +0000

Hello there!

I am Faishal Zharfan, a senior-year student studying Telecommunication Engineering at Bandung Institute of Technology (ITB) in Bandung, Indonesia. With my proposal, I’m currently part of the Edgebench project under the mentorship of Yuyang Huang. The main goal of this project is to be able to reproduce and benchmark self-adaptive video applications using the proposed solution.

The topic that I’m currently working on, “Reproduce and benchmark self-adaptive edge applications under dynamic resource management”, also known as Edgebench, is led by Prof. Junchen Jiang and Yuyang Huang. Edgebench is a project that focuses on how to efficiently distribute resources (bandwidth and CPU usage) across several video applications. Nowadays, video applications process their data or video on a server, an approach known as edge computing. Bandwidth and compute units are therefore the greatest concern for edge computing over a WAN, because they are strictly limited. We could distribute the bandwidth evenly across the cameras, but the bandwidth and compute needs of each camera are different, so we need another solution to tackle this problem. The recently proposed solution is called “accuracy gradient”: it tells us how much bandwidth an application needs at a given time to achieve higher accuracy. The goal is to allocate more bandwidth to the applications with the higher F1-score improvement and to reduce it for the others, whose F1-score does not drop significantly. In the end, we obtain a higher total F1-score.
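To illustrate the idea of shifting bandwidth towards the application that benefits most, here is a minimal greedy sketch. It is not the Edgebench implementation: the per-application F1 profiles, the unit-sized bandwidth steps, and the names `f1_at` and `reallocate` are all hypothetical and chosen only to make the reasoning concrete.

```python
def f1_at(profile, bandwidth):
    """Look up the F1-score for a bandwidth level, clamping to the profile's range."""
    return profile[max(0, min(bandwidth, len(profile) - 1))]


def reallocate(profiles, allocation, max_iters=100):
    """Move one bandwidth unit at a time from the app that loses the least F1
    to the app that gains the most, for as long as the gain exceeds the loss."""
    alloc = dict(allocation)
    for _ in range(max_iters):
        gains = {a: f1_at(profiles[a], alloc[a] + 1) - f1_at(profiles[a], alloc[a])
                 for a in alloc}
        receiver = max(gains, key=gains.get)
        losses = {a: f1_at(profiles[a], alloc[a]) - f1_at(profiles[a], alloc[a] - 1)
                  for a in alloc if a != receiver and alloc[a] > 0}
        if not losses:
            break
        donor = min(losses, key=losses.get)
        if gains[receiver] <= losses[donor]:
            break  # no transfer improves the total F1-score any further
        alloc[receiver] += 1
        alloc[donor] -= 1
    return alloc


def total_f1(profiles, alloc):
    """Sum of F1-scores across all applications for a given allocation."""
    return sum(f1_at(profiles[a], bw) for a, bw in alloc.items())


# Hypothetical F1-score per allocated bandwidth unit (list index = units allocated)
profiles = {
    "cam_a": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "cam_b": [0.0, 0.3, 0.6, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9],
}
even_split = {"cam_a": 5, "cam_b": 5}
adapted = reallocate(profiles, even_split)
print(adapted, total_f1(profiles, even_split), total_f1(profiles, adapted))
```

In this toy example the second camera saturates early, so the units it cannot use are handed to the first camera and the total F1-score rises while the overall bandwidth budget stays the same.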

Throughout this summer, we plan to implement the “accuracy gradient” and test several baselines to compare against it. As for the implementation, we are currently working on the latency measurement. We are aware that this solution introduces some overhead, so the latency should be taken into account.

Automatic Cluster Performance Shifts Detection Toolkit

Sat, 27 May 2023 00:00:00 +0000

Hi! I am Kangrui, a Pre-doc student at the University of Chicago. As part of the Automatic Cluster Performance Shifts Detection Toolkit project, my proposal, under the mentorship of Sandeep Madireddy and Ray Andrew, aims to design a real-time performance shift detection algorithm for high-performance computing clusters, ensuring minimal overheads.

This project focuses on developing a real-time performance shift detection algorithm tailored to heterogeneous workloads, aiming to promptly inform administrators about performance changes. The primary goal is to design an algorithm that efficiently detects shifts in real-time, with minimal system overheads.
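The post does not name a specific detection method. Purely as an illustration of what a low-overhead online detector can look like, the sketch below uses a two-sided CUSUM test, a standard change-detection technique swapped in here as a stand-in. The parameter names and the synthetic throughput values are assumptions.

```python
def cusum_detect(samples, target, slack, threshold):
    """Return the index of the first sample at which the cumulative deviation
    from `target` exceeds `threshold`, or None if no shift is detected.

    `slack` absorbs ordinary noise; only deviations beyond it accumulate.
    """
    high = 0.0  # accumulated upward deviation
    low = 0.0   # accumulated downward deviation
    for i, x in enumerate(samples):
        high = max(0.0, high + (x - target - slack))
        low = max(0.0, low + (target - x - slack))
        if high > threshold or low > threshold:
            return i
    return None

# Synthetic I/O throughput (MB/s): steady near 100, then a drop to about 80.
stream = [100, 101, 99, 100, 102, 98, 100, 80, 81, 79, 80, 82]
shift_at = cusum_detect(stream, target=100.0, slack=2.0, threshold=30.0)
```

The detector keeps two running sums and does constant work per sample, which is the kind of property a minimal-overhead requirement points toward. In practice `target` would itself have to be estimated from recent data.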

In addition to algorithm development, we plan to enhance the Darshan toolkit’s functionality by integrating our algorithm, offering users early performance shift detection. This integration will aid administrators in making informed system utilization and scheduling decisions.

To promote transparency and reproducibility, we’ll package our findings, scripts, and profiling data in Jupyter notebooks shared through Chameleon Trovi, enabling other researchers to reproduce our experiments easily.

Looking ahead, we plan to expand the algorithm’s applicability to cater to diverse HPC workloads and infrastructures. Other areas of interest include its use in detecting shifts in financial markets or monitoring IoT data streams. Further refinement of our algorithm, to reduce overheads and improve real-time detection capabilities, is also a part of our future endeavours. This task may involve evaluating various shift detection methods and noise filtering techniques.

Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions

Sat, 27 May 2023 00:00:00 +0000

Hey,

I am Shekhar, one of several students working on the project Using Reproducibility in Machine Learning Education under the mentorship of Fraida Fund. My proposal aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. My project is inspired by my experience in the Machine Learning Reproducibility Challenge, where I found that a major challenge for reproducibility was that some details were left ambiguous in the paper I was trying to reproduce. For my project, I will develop an interactive tutorial demonstrating that when the methodology is not fully specified in a publication, someone trying to reproduce the result has to make choices that may not match the authors’, and that these choices affect whether or not the final result is validated.

Update OpenROAD Documentation and Tutorials

Fri, 26 May 2023 00:00:00 +0000

Hi! I am Jack, a Master’s student at the National University of Singapore. In GSoC 2023, I will be undertaking the project entitled Update OpenROAD Documentation and Tutorials to improve the user experience and documentation of this exciting open-source RTL-to-GDSII framework, jointly mentored by Indira Iyer Almeida and Vitor Bandeira. Check out my proposal here!

This project aims to review and update missing documentation and tutorials in OpenROAD-flow-scripts. A key focus will be on increasing ease-of-setup by updating documentation, setup scripts and docker-based commands. Next, we will also update documentation for the following OpenROAD components: Makefile flow variable, distributed detailed routing, Hier-RTLMP, Autotuner. If time permits, cloud enablement will be implemented, alongside notebook-based packaging to further increase ease of adoption.

Advancing Reproducible Science through Open Source Laboratory Protocols as Software

Thu, 25 May 2023 00:00:00 +0000

Hello everyone!

My name is Luiza, and I am an eighth-semester BSc Biological Sciences student from São Paulo, Brazil. As part of the LabOP working group, my proposal, under the mentorship of Dan Bryce and Tim Fallon, aims to build a converter that takes ordinary laboratory protocols and translates them into machine-executable protocols. This is possible thanks to LabOP’s versatility in representing what a laboratory protocol should look like. I’ll be testing this specialization on Hamilton machines, which are well suited to scaling up experiments.

Biotechnology laboratories commonly face the problem that protocols are difficult to share and to adapt for machine execution. Laboratory protocols are critical to biological research and development, yet complicated to communicate and reproduce across projects, investigators, and organizations. While many attempts have been made to address this challenge, there is currently no available protocol representation that is unambiguous enough for precise interpretation and automation, yet simultaneously abstract enough to enable reuse and adaptation.

With LabOP, we can take a protocol and convert it in multiple ways, depending on whether the researcher needs automation or human execution, while keeping flexibility for execution and experimentation. I’ll be building a specialization that translates protocols so that they can be executed by Hamilton machines.

Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings

Thu, 25 May 2023 00:00:00 +0000

The project plans to measure the impact of missing settings on open-source database systems, such as MySQL and PostgreSQL, particularly under the TPC-C benchmark. This requires running experiments under popular settings that are not reported and fixing any problems that arise in the target systems during the experiments. The project will compare the performance characteristics and analyze the impact of missing settings on the performance of the target systems.

Verify the reproducibility of an experiment

Wed, 24 May 2023 00:00:00 +0000

Hello everyone, my name is Jesse and I’m proud to be a fellow in this 2023 Summer of Reproducibility program, contributing to the noWorkflow project.

My proposal was accepted under the mentorship of João Felipe Pimentel and Juliana Freire. It aims to map and test the capture of provenance in typical Data Science and Machine Learning experiments.

What…

Although much can be said about what reproducibility means, the ability to replicate results in day-to-day Data Science and Machine Learning experiments can pose a significant challenge for individuals, companies, and research centers. This challenge becomes even more pronounced with the emergence of analytics and AI, where scientific methodologies are extensively applied on an industrial scale. Reproducibility then assumes a key role in the productivity and accountability expected from Data Scientists, Machine Learning Engineers, and other roles engaged in ML/AI projects.

How…

In the day-to-day, the pitfalls of non-reproducibility appear at different points of the experiment lifecycle. These challenges arise when multiple experiments need to be managed for an individual or a team of scientists. In a typical experiment workflow, reproducibility appears in different steps of the process:

The need to track the provenance of datasets.
The need to manage changes in hypothesis tests.
Addressing the management of system hardware and OS setups.
Dealing with outputs from multiple experiments, including the results of various model trials.

In academic environments, these issues can result in mistakes and inaccuracies. In companies, they can lead to inefficiencies and technical debts that are difficult to address in the future.

Finally…

I believe this is a great opportunity to explore the intersection of two hot topics, AI and reproducibility! I will share more updates here throughout this summer and hope we can learn a lot together!

Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery

Tue, 23 May 2023 00:00:00 +0000

As part of the Teaching Computer Networks with Reproducible Research project, my proposal, under the mentorship of Fraida Fund, aims to develop a classroom competition for adaptive video delivery policies, leveraging an existing open-source reproducible result. The competition will challenge students to extend the original work and design their own adaptive policies for head-to-head competition against their classmates. The project will involve packaging the existing result for easy reproducibility and building on it by implementing other adaptive video policies from the literature, developing different network settings for evaluating student submissions, and creating an evaluation framework for scoring submissions based on various criteria (so that the competition remains fair and unbiased). The deliverables include a functional submission and evaluation process, an evaluation framework, and documentation and materials for course instructors to use in the classroom.
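To give a sense of what a student-written adaptive policy might look like, here is a minimal sketch of a simple rate-based policy: estimate throughput from recent segment downloads and pick the highest bitrate that fits under a discounted estimate. The function names, the bitrate ladder, and the safety factor are assumptions, not the competition's actual interface.

```python
def harmonic_mean(values):
    """Harmonic mean, which damps the effect of short throughput spikes."""
    return len(values) / sum(1.0 / v for v in values)

def rate_based_policy(bitrates_kbps, recent_throughput_kbps, safety=0.8):
    """Pick the highest bitrate that fits under a discounted throughput estimate.

    Falls back to the lowest bitrate when nothing fits.
    """
    estimate = safety * harmonic_mean(recent_throughput_kbps)
    feasible = [b for b in sorted(bitrates_kbps) if b <= estimate]
    return feasible[-1] if feasible else min(bitrates_kbps)

# Hypothetical bitrate ladder and throughput samples from the last three segments.
ladder = [300, 750, 1200, 2400, 4800]
choice = rate_based_policy(ladder, [3000, 2000, 1500])
```

Policies from the literature typically also consider buffer occupancy. This sketch leaves that out to keep the decision rule visible.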

Reproducible Evaluation of Multipath Network Protocols

Thu, 16 Feb 2023 00:00:00 +0000

Lead Mentor: Ilknur Aydin

As mobile devices with dual WiFi and cellular interfaces become widespread, network protocols have been developed that utilize the availability of multiple paths. However, the relative effectiveness of these protocols is highly dependent on the characteristics of the network (including the relationship between the two paths, which are often not independent). Researchers typically evaluate a multipath protocol for a small set of network scenarios, which vary from one publication to the next. It is therefore difficult to get a good picture of how different protocols perform in a range of settings.

Framework for repeatable, direct comparison of multipath transport protocols

Topics: Computer networks, wireless systems
Skills: Linux, networking, data analysis and visualization, writing
Difficulty: Large
Size: 350 hours
Mentor(s): Ilknur Aydin and Fraida Fund

In single-path congestion control, the Pantheon work created a reference set of executable benchmarks that researchers could use to evaluate novel congestion control designs against existing work in a wide range of scenarios. This project seeks to achieve something similar for multipath protocols, using publicly available networking testbeds like FABRIC. For this project, the participant will:

Prepare a set of network benchmarks for multipath protocols, using live network links, real link traces, and emulated scenarios
Develop an experiment using the benchmarks to evaluate existing multipath protocol implementations
Prepare materials that researchers can use to evaluate novel multipath protocols against the others in the benchmark

Is Reproducibility Enough? Understanding the Impact of Missing Settings in Artifact Evaluation

Wed, 08 Feb 2023 00:00:00 +0000

While Artifact Evaluation tries to ensure that the evaluation results in a paper are reproducible, it leaves one question: How about experiment settings NOT reported by the paper? Such “missing settings” may create multiple problems: 1) sometimes the artifacts simply do not work under these missing settings, creating problems when a later work needs to compare to an earlier work under these settings; 2) sometimes the artifacts do not perform well under these missing settings, which may create a bias during the evaluation; 3) to improve the artifact to work under these missing settings, sometimes one needs to re-design the system, which may change the results of the original experiments.

In this project, we plan to understand the impact of this problem: On the necessity side, how would these missing settings affect the conclusions of the original work? On the feasibility side, how much effort does it require to carry out extensive experiments? We plan to answer these questions by reproducing prior works, running them on popular settings that are not reported by these works, and fixing problems if any.

Measuring Research Prototypes under Unreported Settings

Topics: reproducibility, databases, key-value stores, DNN training
Skills: Java/Python, Linux, TPC/YCSB
Difficulty: Medium
Size: 350 hours
Mentor(s): Yang Wang, Miao YU
Contributor(s): Xueyuan Ren

The student will first pick one or a few systems she is interested in and try to reproduce their reported results. If successful, she will then measure these systems under previously unreported settings. During the procedure, she will need to diagnose and fix any problems that may show up. Finally, she will analyze whether the original conclusions still hold under these new settings and whether fixing any problems changes the performance characteristics of the target systems.

noWorkflow

Tue, 07 Feb 2023 00:00:00 +0000

The noWorkflow project aims at allowing scientists to benefit from provenance data analysis even when they don’t use a workflow system. Also, the goal is to allow them to avoid using naming conventions to store files originated in previous executions. Currently, when this is not done, the result and intermediate files are overwritten by every new execution of the pipeline.

noWorkflow was developed in Python, and it is currently able to capture provenance of Python scripts using Software Engineering techniques such as abstract syntax tree (AST) analysis, reflection, and profiling, to collect provenance without the need of a version control system or any other environment.

At the moment of this writing, the main version of noWorkflow is in the 2.0-alpha branch. We intend to release it before the summer.

Verify the reproducibility of an experiment

Topics: Reproducibility
Skills: Python, SQL or SQLAlchemy ORM
Difficulty: Moderate
Size: Medium or large (175 or 350 hours)
Mentors: João Felipe Pimentel, Juliana Freire

Implement an algorithm to compare the provenance from two (or more) trials (i.e., executions of an experiment) to check their reproducibility. The provenance stored in the relational (sqlite) database by noWorkflow 2 contains intermediate variable values from a trial. These values could be compared to check how much or where executions deviate from each other.

Specific tasks:

Compare trials of the same script (Medium)
Estimate how much one trial deviates from another (Medium)
Consider different scripts and execution flows (Large)
Indicate which parts of the scripts are not reproducible (Large)
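The first two tasks above can be sketched as follows, assuming each trial has already been reduced to a mapping from variable name to recorded value. The sketch does not use noWorkflow's database schema or API; the function names and sample values are hypothetical.

```python
import math

def compare_trials(trial_a, trial_b, rel_tol=1e-9):
    """Compare two {variable: value} mappings and report where they differ."""
    report = {"match": [], "differ": [], "only_a": [], "only_b": []}
    for name in sorted(set(trial_a) | set(trial_b)):
        if name not in trial_b:
            report["only_a"].append(name)
        elif name not in trial_a:
            report["only_b"].append(name)
        else:
            a, b = trial_a[name], trial_b[name]
            # Floats are compared with a tolerance; everything else exactly.
            both_float = isinstance(a, float) and isinstance(b, float)
            same = math.isclose(a, b, rel_tol=rel_tol) if both_float else a == b
            report["match" if same else "differ"].append(name)
    return report

def deviation(report):
    """Fraction of variables that do not match across the two trials."""
    total = sum(len(names) for names in report.values())
    return 0.0 if total == 0 else 1.0 - len(report["match"]) / total

# Made-up values standing in for provenance collected from two executions.
trial_1 = {"seed": 42, "accuracy": 0.913, "n_rows": 1000}
trial_2 = {"seed": 42, "accuracy": 0.897, "n_rows": 1000, "cache_hit": True}
report = compare_trials(trial_1, trial_2)
```

A single fraction is a crude deviation measure. The Large tasks (different scripts and execution flows) would need alignment of the executions, which this sketch does not attempt.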

Control levels of provenance collection

Topics: Log experiments
Skills: Python
Difficulty: Challenging
Size: Large (350 hours)
Mentors: João Felipe Pimentel, Juliana Freire
Contributor(s): Jesse Lima

Add support for different levels of provenance collection in noWorkflow 2. Currently, noWorkflow 2 collects Python construct evaluations and all the dependencies among the evaluations. However, this collection is inefficient, since some of the collected provenance may not be necessary for end-users. In this project, it is desirable to provide ways to temporarily disable the provenance collection and to manually indicate the provenance in this situation.

Specific tasks:

Disable the collection inside specific functions (through decorators?)
Disable the collection inside specific regions of the code (through with statements?)
Collect only function activations in a region, instead of all variable dependencies
Disable the collection of specific modules
Design a DSL to express general dependencies for parts of the code where the collection is disabled

Upgrade noWorkflow collection to support new Python constructs

Topics: Log experiments
Skills: Python
Difficulty: Moderate
Size: Large (350 hours)
Mentors: João Felipe Pimentel, Juliana Freire

Implement new AST transformations for provenance collection. While noWorkflow 2 works for newer Python versions, most of its implementation was targeted at Python 3.7. Newer Python versions have new constructs in which the provenance is ignored.

Specific tasks:

Identify which AST construct implementations are missing
Design AST transformations to execute functions before and after the evaluation of the constructs
Create the dependencies for the new constructs
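As a small illustration of the second task, the sketch below uses Python's standard `ast` module to instrument one newer construct, the assignment expression (`:=`, available since Python 3.8), so that a hook runs after its value is evaluated. It implements only an "after" hook, and `record_value` is a hypothetical name, not a noWorkflow function.

```python
import ast

captured = []

def record_value(name, value):
    """Hypothetical hook: note the value bound to `name`, then pass it through."""
    captured.append((name, value))
    return value

class NamedExprRecorder(ast.NodeTransformer):
    """Rewrite `(x := expr)` into `(x := record_value('x', expr))`."""

    def visit_NamedExpr(self, node):
        self.generic_visit(node)  # handle nested assignment expressions first
        node.value = ast.Call(
            func=ast.Name(id="record_value", ctx=ast.Load()),
            args=[ast.Constant(value=node.target.id), node.value],
            keywords=[],
        )
        return node

source = "total = 0\nif (n := 3 + 4) > 5:\n    total = n * 2\n"
tree = NamedExprRecorder().visit(ast.parse(source))
ast.fix_missing_locations(tree)  # new nodes need line/column information

namespace = {"record_value": record_value}
exec(compile(tree, "<instrumented>", "exec"), namespace)
```

After execution, `captured` holds the name and value of the walrus assignment, and the instrumented script behaves as before. Creating dependencies between evaluations, as the third task requires, would need the hook to know which evaluations fed into `expr`.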

ScaleBugs: Reproducible Scalability Bugs

Tue, 07 Feb 2023 00:00:00 +0000

Scalable systems lay essential foundations of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. As the name suggests, scalability bugs manifest depending on the scale of a run, so their symptoms may be observable only in large-scale deployments and not in small or medium ones. For example, Cassandra-6127 is a scalability bug detected in the popular distributed database Cassandra. The bug causes unnecessary CPU usage, but the symptom is not observed unless ~1000 nodes are deployed. This demonstrates the main challenge of studying scalability bugs: they are extremely difficult to reproduce without deploying the system at a large scale.

In this project, our goal is to build a dataset of reproducible scalability bugs. To achieve this, we will go through the existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. For each bug report, we determine if the reported bug depends on the scale of the run, such as the number of nodes utilized. With the collected scale-dependent bugs, we then will craft the workload to reproduce those scalability bugs. Our workloads will be designed to trigger some functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance. For example, a successful reproduction should be able to show the performance drop along with an increasing number of nodes.
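One way to check whether a reproduction shows scale dependence is to measure a cost at several node counts and estimate how fast it grows. The sketch below fits the slope of log(cost) against log(nodes); the measurements are made-up numbers for illustration, and the function name is hypothetical.

```python
import math

def growth_exponent(scales, costs):
    """Least-squares slope of log(cost) versus log(scale).

    A slope near 1 suggests linear growth; near 2 suggests quadratic growth.
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(c) for c in costs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Made-up CPU-seconds at increasing cluster sizes, not real measurements.
nodes = [32, 64, 128, 256]
buggy_cpu_seconds = [1.0, 4.1, 15.8, 65.0]  # roughly quadratic
fixed_cpu_seconds = [1.0, 2.1, 3.9, 8.2]    # roughly linear

buggy_slope = growth_exponent(nodes, buggy_cpu_seconds)
fixed_slope = growth_exponent(nodes, fixed_cpu_seconds)
```

Comparing the exponent of the buggy and fixed versions gives a compact summary of the performance drop that a successful reproduction should show as the number of nodes increases.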

Building a Dataset of Reproducible Scalability Bugs

Topics: Scalability systems, bug patterns, reproducibility, bug dataset
Skills: Linux Shell, Docker, Java, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Cindy Rubio González, Haryadi S. Gunawi, Hao-Nan Zhu
Contributor(s): Goodness Ayinmode, Zahra Nabila Maharani

The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalability system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that could demonstrate the symptoms of the bug under different scales.

Specific Tasks

Work with the mentors to understand the context of the project.
Learn the background of scalability systems.
Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.
Craft shell scripts to trigger the exact scalability bug described by the bug report.
Organize the reproducible scalability bugs and write documentation to build the code and trigger the bug.

LabOP - an open specification for laboratory protocols, that solves common interchange problems stemming from variations in scale, labware, instruments, and automation.

Mon, 06 Feb 2023 00:00:00 +0000

Project idea 1: Software, hardware, and wetware building LabOP with simultaneous language & protocol development & test executions

Topics: Software standard development, Laboratory automation, Biology
Skills: Python, Semantic Web Technologies (RDF, OWL), interest to think about describing biological & chemical laboratory processes
Difficulty: Moderate
Size: Large (350 hours)
Mentors:
1. Tim Fallon
2. Dan Bryce

About: The Laboratory Open Protocol Language (LabOP)

See link: https://bioprotocols.github.io/labop/

LabOP is an open specification for laboratory protocols, that solves common interchange problems stemming from variations in scale, labware, instruments, and automation. LabOP was built from the ground-up to support protocol interchange. It provides an extensible library of protocol primitives that capture the control and data flow needed for simple calibration and culturing protocols to industrial control.

Software Ecosystem

LabOP’s rich representation underpins an ecosystem of several powerful software tools, including:

labop: the Python LabOP library, which supports:
- Programming LabOP protocols in Python,
- Serialization of LabOP protocols conforming to the LabOP RDF specification,
- Execution in the native LabOP semantics (rooted in the UML activity model),
- Specialization of protocols to 3rd-party protocol formats (including Autoprotocol, OpenTrons, and human-readable formats), and
- Integration with instruments (including OpenTrons OT2, Echo, and SiLA-based automation).
laboped: the web-based LabOP Editor, which supports:
- Programming LabOP protocols quickly with low-code visual scripts,
- Storing protocols on the cloud,
- Exporting protocol specializations for use in other execution frameworks.

About the Bioprotocols Working Group

The Bioprotocols Working Group is an open community organization developing a free and open standard for representation of biological protocols.

To join the Bioprotocols Working Group:

Join the community mailing list at: https://groups.google.com/g/bioprotocols
Join the #collab-bioprotocols channel on the Bits in Bio Slack.

Leadership

Elected Term: August 24th, 2022 - August 23rd, 2023

Chair: Dan Bryce (SIFT)

Finance Committee:

Governance

Approved by community vote on August 16th, 2022

https://bioprotocols.github.io/labop/about#Governance

Mission:

The Bioprotocols Working Group is an open community organization developing free and open standards for representation of biological protocols. In support of that goal, the organization also develops tools and practices and works with other organizations to facilitate dissemination and adoption of these standards.

As an organization, the Bioprotocols Working Group holds the following values:

The standards developed by the community should be available under permissive free and open licenses.
Technical decisions of the community should be made following open and inclusive processes.
The community is strengthened by fostering a culture of diversity and inclusion, in which all constructive participants feel comfortable making their voices heard.

GPU Emulator for Easy Reproducibility of DNN Training

Sun, 05 Feb 2023 00:00:00 +0000

Deep Neural Networks (DNNs) have achieved success in many machine learning (ML) tasks, including image recognition, video classification, and natural language processing. Nonetheless, training DNN models is highly computation-intensive and usually requires running complex computations on GPUs, which are expensive and scarce. As a result, much research on DNN training is delayed by a lack of access to GPUs. However, many research prototypes don’t require GPUs themselves, only the performance profiles of GPUs. For example, research on storage systems for DNN training doesn’t need to run real computations on GPUs; it only needs to know how much time each GPU computation will take. Meanwhile, GPU performance in DNN training is predictable and reproducible, as every training batch performs a deterministic sequence of mathematical operations on a fixed amount of data.

Therefore, in this project we seek to build a GPU emulator platform on PyTorch to easily reproduce DNN training without using real GPUs. We will measure the performance profiles of GPU computations for different models, GPU types, and batch sizes. Based on the measured GPU performance profiles, we will build a platform to emulate GPU behavior and reproduce DNN training using CPUs only. We will make the platform and the measurements open-source, allowing other researchers to reproduce the performance measurements and easily conduct research on DNN training systems. We will also encourage the community to enrich the database by adding GPU performance measurements for their own models and GPU types. We will be the first to build and release this kind of GPU emulator for DNN training, and we believe researchers and the community can benefit a lot from it, especially as more GPU performance profiles are added by the community.
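The core idea, replacing GPU computation with its measured duration, can be sketched in plain Python as below. The sketch does not use PyTorch and is not the project's design; the profile table, its timing values, and all names are made up for illustration.

```python
# Hypothetical profile table: (model, gpu, batch_size) -> seconds per training step.
# The numbers are placeholders, not measurements.
PROFILES = {
    ("resnet50", "V100", 32): 0.120,
    ("resnet50", "V100", 64): 0.230,
    ("resnet50", "A100", 64): 0.140,
}

class EmulatedGPU:
    """Advances a virtual clock instead of running real GPU kernels."""

    def __init__(self, model, gpu, batch_size):
        self.step_time = PROFILES[(model, gpu, batch_size)]
        self.clock = 0.0  # virtual seconds elapsed

    def train_step(self, batch):
        # No computation happens; only the profiled duration is charged.
        self.clock += self.step_time

def emulate_epoch(device, num_samples, batch_size, load_time_per_batch):
    """Replay one epoch: data-loading time plus profiled compute time per step."""
    steps = num_samples // batch_size
    for _ in range(steps):
        device.clock += load_time_per_batch  # e.g., the storage system under study
        device.train_step(batch=None)
    return device.clock

gpu = EmulatedGPU("resnet50", "V100", 64)
epoch_seconds = emulate_epoch(gpu, num_samples=6400, batch_size=64,
                              load_time_per_batch=0.05)
```

Using a virtual clock rather than sleeping keeps emulation fast. A storage-systems researcher could then vary `load_time_per_batch` to see how data loading changes total epoch time without touching a GPU.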

Building a platform to emulate GPU performance in DNN training

Topics: DNN training, reproducibility, GPU emulator, performance measurement
Skills: Linux, Python, PyTorch, deep learning
Difficulty: Medium
Size: 350 hours
Mentor(s): Vijay Chidambaram, Yeonju Ro
Contributor(s): Haoran Wu

The student will measure the GPU performance profiles for different models and GPU types, based on which the student will build a platform to emulate the GPU behaviors and easily reproduce DNN training. The GPU performance measurements should be made open-source and reproducible for other researchers to reproduce results and add GPU profiles for their own needs.

Specific tasks:

Work with mentors on understanding the context of the project.
Study and get familiar with the PyTorch DNN training pipelines
Measure GPU performance profiles for different DNN models and GPU types
Based on the GPU performance measurements, build a platform to emulate the GPU behaviors and reproduce DNN training without using real GPUs
Organize and document the code to make it reproducible for the community

Reproduce and benchmark self-adaptive edge applications under dynamic resource management

Thu, 02 Feb 2023 00:15:56 -0700

With the flourishing of ideas like smart cities and smart manufacturing, a massive number of edge devices (e.g., traffic or security cameras, thermometers, flood sensors, etc.) are deployed and connected to the network to collect and analyze data across space and time and to help stakeholders such as city governments and manufacturers optimize their plans and operations. Such a large number of edge devices, and the large amount of communication among the devices or to the central servers, raises a big challenge: how to manage and schedule resources (i.e., network bandwidth between the devices and/or computing power on both edge devices and bare-metal servers) so that the running applications can provide a reliable service. Furthermore, because the resources available to edge devices are limited, there is a growing trend to reduce average compute and/or bandwidth usage by leveraging the uneven distribution of interesting events across both time and space in the input data. This brings further challenges for provisioning and managing the resources available to the edge devices, as the running applications’ resource demands can depend greatly on the input data, which is both dynamic and unpredictable.

With these challenges in mind, the team previously designed and implemented a dynamic resource manager that could understand the applications and make decisions based on such understanding at run time. This understanding is achieved based on a key insight - applications will have different magnitudes of performance improvement/degradation toward the change in the amount of resources available depending on the input data and how many resources the applications currently have, which we define as applications’ sensitivities. However, such a resource manager has only been tested with a limited number and types of video analytic applications. Hence, through the OSRE23 project, we aim to:

reproduce other state-of-the-art self-adaptive video analytic applications,
integrate the reproducible applications into the resource manager framework,
compare the performance with and without resource manager.

Reproduce/benchmark the self-adaptive video analytic applications’ performance under dynamic resource management

Topics: Benchmark, Reproducibility, Video analytics, Machine Learning, Resource Management
Skills: Python, PyTorch, TensorFlow
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Junchen Jiang, Yuyang Huang
Contributor(s): Faishal Zharfan

Integrate various types of video analytic applications into the aforementioned dynamic resource manager and reproduce/benchmark the applications’ performance.

Specific tasks:

Reproduce state-of-the-art video analytic applications
Integrate such applications into the resource manager framework
Benchmark video analytic applications
Analyze the benchmarked performance results

FlashNet: Towards Reproducible Data Science for Storage System

Thu, 02 Feb 2023 00:00:00 +0000

The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers. The person must know both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:

How can we encourage data scientists to look into storage problems?
How can we create a transparent platform that allows such decoupling?
Within the storage/ML community can we create two collaborative communities, the storage engineers and the storage data scientists?

In the ML/Deep Learning community, the large ImageNet benchmarks have spurred research in image recognition. Similarly, we would like to provide benchmarks for fostering storage research in ML-based per-IO latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. As a first step toward this larger goal, we use I/O latency prediction as a case study, so FlashNet has been built for I/O latency prediction tasks. With FlashNet, data engineers can collect the I/O traces of various devices. Data scientists can then train ML models to predict I/O latency based on those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which uses Chameleon Trovi for better reproducibility.

In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary-class and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world’s first end-to-end data science platform for storage systems.

Building FlashNet Platform

Topics: Storage systems, reproducibility, machine learning, continual learning
Skills: C++, Python, PyTorch, experience with machine learning pipelines
Difficulty: Medium
Size: Large (350 hours)
Mentors: Haryadi S. Gunawi
Contributor(s): Justin Shin, Maharani Ayu Putri Irawan

Build an open-source platform to enable collaboration between storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models/architecture, dataset patterns, data preprocessing techniques, and various feature engineering strategies.

Specific tasks:

Work with mentors on understanding the context of the project.
Reproduce the FlashNet evaluation results from prior works.
Build and improve FlashNet components based on the existing blueprint.
Collect and analyze the FlashNet evaluation results.

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Thu, 02 Feb 2023 00:00:00 +0000

A high-throughput workflow execution system is needed to continuously gain insights from the increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit a tail behavior that inflates their execution time to up to 15x the median, resulting in weeks of execution.

On the other hand, input files for genomic workflows often vary in quality due to differences in how they are collected. Prior works suggested that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.

In this project, we seek to analyze a huge dataset from a large-scale genomics processing platform in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, in particular those which exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can offload some burdens commonly faced by operators of systems for genomics and, at the same time, benefit future researchers who work on the intersection of computer systems and genomics.

Analyze genomics data quality & build exec. time prediction models

Topics: genomics, data analysis, machine learning
Skills: Linux, Python, Matplotlib, Pandas/Numpy, any ML library
Difficulty: Medium
Size: 350 hours
Mentor(s): In Kee Kim
Contributor(s): Charis Christopher Hulu

Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomic workflows’ execution times.

Specific tasks:

Acquire basic understanding of genomics data processing & workflow execution (will be guided by the mentor)
Reproduce past analysis & models built by prior members of the project
Propose features from FastQC/Picard/GATK metrics that can be used as a predictor for execution time and tail behavior
Write a brief analysis as to why those features might work
Build ML models for predicting execution time
Package the analysis in the form of Jupyter notebooks
Package the models in a reloadable format (e.g., pickle)
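
The feature-selection and tail-detection tasks above could start from something like the following minimal sketch. It ranks candidate features by their Pearson correlation with execution time and flags runs that are much slower than the median. The feature names, the numbers, and the 3x cutoff are illustrative assumptions, not values from the platform or from prior project work.

```python
from statistics import median

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (var_x * var_y) ** 0.5

def rank_features(features, exec_times):
    """Sort feature names by absolute correlation with execution time."""
    scores = {name: pearson(vals, exec_times) for name, vals in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

def tail_runs(exec_times, factor=3.0):
    """Indices of runs slower than `factor` times the median (assumed cutoff)."""
    cutoff = factor * median(exec_times)
    return [i for i, t in enumerate(exec_times) if t > cutoff]

# Made-up values for illustration only.
exec_hours = [2.0, 2.5, 3.0, 2.2, 30.0]
features = {
    "duplication_rate": [0.10, 0.12, 0.15, 0.11, 0.60],
    "mean_base_quality": [35.0, 34.0, 36.0, 35.5, 34.5],
}
ranking = rank_features(features, exec_hours)
tails = tail_runs(exec_hours)
```

Pearson correlation only captures linear relationships, so a rank-based measure or a model-based feature importance may be a better fit once real metrics are available.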

Reproducible Evaluation of Multi-level Erasure Coding

Thu, 02 Feb 2023 00:00:00 +0000

Massive storage systems rely heavily on erasure coding (EC) to protect data from drive failures and provide data durability. Existing storage systems mostly adopt single-level erasure coding (SLEC) to protect data, either performing EC at the network level or performing EC at the local level. However, both SLEC approaches have limitations, as network-only SLEC introduces heavy network traffic overhead, and local-only SLEC cannot tolerate rack failures.

Accordingly, some data centers are starting to use multi-level erasure coding (MLEC), a hybrid approach that performs EC at both the network level and the local level. However, prior EC research and evaluations mostly focused on SLEC, and it remains an open question how MLEC compares to SLEC in terms of durability, capacity overhead, encoding throughput, network traffic, and other overheads.

Therefore, in this project we seek to build a platform to evaluate the durability and overheads of MLEC. The platform will allow us to evaluate dozens of EC strategies in many dimensions including recovery strategies, chunk placement choices, various parity schemes, etc. To the best of our knowledge, there is no other evaluation platform like what we propose here. We seek to make the platform open-source and the evaluation reproducible, allowing future researchers to benefit from it and conduct more research on MLEC.
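
As a small worked example of one of the dimensions mentioned above, capacity overhead, the sketch below compares a single-level (k+p) code with a two-level scheme. It assumes the simplest composition, in which every network-level chunk is encoded again at the local level, so the two overheads multiply. The (10+2) and (17+3) parameters are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def slec_overhead(k, p):
    """Raw-to-usable storage ratio of a single-level (k+p) code."""
    return Fraction(k + p, k)

def mlec_overhead(k_net, p_net, k_loc, p_loc):
    """Ratio when every network-level chunk is encoded again locally."""
    return slec_overhead(k_net, p_net) * slec_overhead(k_loc, p_loc)

single = slec_overhead(10, 2)        # 12/10 = 6/5
multi = mlec_overhead(10, 2, 17, 3)  # 6/5 * 20/17 = 24/17
```

Capacity overhead is the easy part to compute in closed form. Durability and repair traffic depend on failure models and placement, which is why the project calls for an evaluation platform.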

Building a platform to evaluate MLEC

Topics: Storage systems, reproducibility, erasure coding, evaluation
Skills: Linux, C, Python
Difficulty: Medium
Size: 350 hours
Mentor(s): John Bent and Anjus George
Contributor(s): Zhiyan "Alex" Wang

Build a platform to evaluate the durability and overheads of MLEC. The platform will be able to evaluate different EC strategies in various dimensions including repair strategies, chunk placement choices, parity schemes, etc. Analyze the evaluation results.

Specific tasks:

Work with mentors on understanding the context of the project.
Reproduce the SLEC evaluation results from prior SLEC evaluation tools
Based on prior SLEC evaluation tools, build a platform to evaluate the durability and overheads of MLEC under various EC strategies
Collect and analyze the MLEC evaluation results

Automatic Cluster Performance Shifts Detection Toolkit

Wed, 01 Feb 2023 10:15:56 -0700

High-performance computing (HPC) clusters typically suffer from performance degradation over time. The heterogeneous nature of clusters and the inevitable defects in various infrastructure layers make their performance harder to predict. On the other hand, software upgrades or similar events can also improve or degrade performance even though nothing changes in the hardware. Due to these uncertainties, it is necessary to notify administrators early of changes in cluster performance within a specific time window, to inform scheduling decisions and increase cluster utilization.

We are targeting HPC clusters that serve heterogeneous, compute- and I/O-intensive workloads with a high degree of parallelism, ranging from scientific simulation to AI model training. In this scenario, we plan to use the Darshan open-source toolkit (https://github.com/darshan-hpc/darshan) as the data collection and profiling tool on which to design our performance drift algorithms. Furthermore, we may incorporate distribution shift detection into Darshan so that it can notify HPC system administrators.

Our goal is to show the efficacy of our algorithm by plotting the profiling data and highlighting the specific time windows in which our algorithm detects performance shifts. Finally, we will package all our profiling data and experiment scripts in Jupyter notebooks, in particular on Chameleon Trovi, to help others reproduce our experiments.

Through this research, we seek to contribute the following:

Designing an algorithm to detect performance shifts in HPC clusters that can be adapted for heterogeneous workloads
Detecting performance shifts in real time without introducing significant overhead into the system
Contributing to Darshan so that it can automatically detect performance changes while profiling clusters

Automatic and Adaptive Performance Shifts Detection

Topics: Statistical Machine Learning, Deep Learning, and High-Performance Computing (HPC)
Skills: C++, Python, Statistics, good to have: Machine Learning, Deep learning
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Sandeep Madireddy (https://www.anl.gov/profile/sandeep-r-madireddy, http://www.mcs.anl.gov/~smadireddy/ ), Ray Andrew Sinurat (https://rayandrew.me)
Contributor(s): Kangrui Wang

All in all, these are the specific tasks that the student should do:

Collaborate and work with mentors to understand the goal of this project.
Implement distribution shift detection using purely statistical or machine/deep learning methods
Deploy the algorithm and try to see its efficacy in the clusters.
Package this experiment to make it easier for others to reproduce
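
One purely statistical starting point for the detection task above is to compare consecutive windows of a metric with the two-sample Kolmogorov-Smirnov statistic. The sketch below is only an illustration of that idea: the window size, the fixed threshold, and the bandwidth readings are assumptions, and it does not use Darshan or any Darshan API.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    gap = 0.0
    for x in sa + sb:
        cdf_a = bisect_right(sa, x) / len(sa)
        cdf_b = bisect_right(sb, x) / len(sb)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def detect_shifts(samples, window=5, threshold=0.8):
    """Start indices of windows whose distribution differs from the previous one."""
    shifts = []
    for start in range(window, len(samples) - window + 1, window):
        previous = samples[start - window:start]
        current = samples[start:start + window]
        if ks_statistic(previous, current) >= threshold:
            shifts.append(start)
    return shifts

# Made-up I/O bandwidth readings (MB/s); throughput drops at index 10.
bandwidth = [100, 102, 98, 101, 99, 100, 103, 97, 102, 100,
             60, 62, 58, 61, 59]
shift_points = detect_shifts(bandwidth)
```

A fixed threshold is a simplification. A real detector would more likely derive a p-value from the statistic and the window sizes, and would need to handle gradual drift as well as abrupt changes.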

OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for VLSI Designs (2023)

Wed, 01 Feb 2023 00:00:00 +0000

The OpenROAD project is a non-profit, DARPA-funded and Google sponsored project committed to creating low-cost and innovative Electronic Design Automation (EDA) tools and flows for IC design. Our mission is to democratize IC design, break down barriers of cost and access and mitigate schedule risk through native and open source innovation and collaboration with ecosystem partners. OpenROAD provides an autonomous, no-human-in-the-loop, 24-hour, RTL-GDSII flow for fast ASIC design exploration, QoR estimation and physical implementation for a range of technologies above 12 nm. We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact. OpenROAD has been used in > 600 tapeouts across a range of ASIC applications with a rapidly growing and diverse user community.

Enhance OpenROAD GUI Flow Manager

Topics: GUI, Visualization, User Interfaces
Skills: C++, Qt
Difficulty: Medium
Size: Medium or Large (175 or 350 hours)
Mentor: Matt Liberty, Ethan Mahintorabi

Develop custom features for analysis and visualizations in the OpenROAD GUI (https://openroad.readthedocs.io/en/latest/main/src/gui/README.html) to support native and third-party flows. These include OpenROAD-flow-scripts, OpenLane and other third-party flows. Create documentation: commands, developer guide notes, tutorials to show GUI usage for supported flows.

Profile and tune OpenROAD flow for Runtime improvements

Topics: OpenROAD-flow-scripts, Flow Manager, Runtime Optimization
Skills: Knowledge about Computational resource optimization, Cloud-based computation, Basic VLSI design and tools knowledge
Difficulty: Medium
Size: Medium or Large (175 or 350 hours)
Mentor: Matt Liberty, Ethan Mahintorabi

Test, analyze and develop verifiable and reproducible strategies to improve run times in OpenROAD-flow-scripts. These include optimizing computational resources in the cloud and tuning algorithmic and design flow parameters. Create test plans using existing or new designs to show runtime improvements.

Update OpenROAD Documentation and Tutorials

Topics: Documentation, Tutorials, VLSI design basics
Skills: Knowledge of EDA tools, basics of VLSI design flow, tcl, shell scripts, Documentation, Markdown
Difficulty: Medium
Size: Medium or Large (175 or 350 hours)
Mentor: Indira Iyer, Vitor Bandeira
Contributor(s): Jack Luar

Review and update missing documentation and tutorials in OpenROAD-flow-scripts for existing and new features. Here is an example Tutorial link: https://openroad-flow-scripts.readthedocs.io/en/latest/tutorials/FlowTutorial.html for reference.

LEF and Liberty Model Testing

Topics: Testing, LEF, ‘LIB’, VLSI design basics
Skills: Knowledge of EDA tools, basics of VLSI design, lef and lib model abstracts, tcl, shell scripts, Verilog, Layout
Difficulty: Medium
Size: Medium or Large (175 or 350 hours)
Mentor: Matt Liberty

Test the accuracy of generated LIB and LEF models for signoff in OpenROAD-flow-scripts for flat and hierarchical design flows. Build test cases to validate and add to the regression suite.

Teaching Computer Networks with Reproducible Research

Wed, 18 Jan 2023 00:00:00 +0000

Lead Mentor: Fraida Fund

In the field of computer networks and wireless communication systems, the availability of open access networking and cloud computing testbeds (GENI, CloudLab, Chameleon, FABRIC, and others) has been transformative in promoting reproducible research and in making high-quality experiential learning available to students and educators at a wide range of colleges and universities. This project seeks to unite research and education use of these testbeds by developing new ways of using reproducible research to teach computer networks and related topics.

Bringing foundational results into the classroom

Topics: Computer networks, reproducibility, education
Skills: Linux, writing
Difficulty: Medium
Size: 350 hours
Mentor(s): Fraida Fund and TBD

To make foundational results from computer networks more concrete, this project seeks to reproduce a selection of key results and package them for use as interactive classroom demonstrations. (An example of a “foundational” result might be the result from the 1980s that motivates congestion control by showing how congestion collapse occurs when the network is under heavy load.) This involves:

Reproducing the original results on an open-access testbed
Packaging the materials for use as a classroom demo, with interactive elements
Creating assessment questions and sample “solutions” related to the materials, that instructors may use in homework assignments or exams

Developing a “classroom competition” for adaptive video delivery policies

Topics: Computer networks, adaptive video, reproducibility, education
Skills: Linux, Python, writing
Difficulty: Medium
Size: 350 hours
Mentor(s): Fraida Fund and TBD
Contributor(s): Srishti Jaiswal

A carefully designed competition can be a fun and exciting way for students to challenge themselves and gain “ownership” of a new topic. This project builds on an existing open source reproducible result for adaptive video delivery, and will challenge students to extend this work and design their own adaptive video policies for head-to-head competition against their classmates. This includes:

Packaging the result to make it easier for students to reproduce and then build on the original work
Implementing other adaptive video policies from the literature, so that students can use them as a baseline
Developing different network settings (using live link traces and emulated link patterns) in which student submissions may be evaluated
Developing an evaluation framework for scoring student submissions on different criteria and in different network settings, and making the results available in a leaderboard format
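
To give a sense of what a baseline policy might look like, here is a minimal sketch of a simple rate-based rule: estimate throughput as the harmonic mean of recent download rates, discount it by a safety factor, and pick the highest bitrate that fits. The bitrate ladder, the safety factor, and the function names are assumptions for illustration and are not taken from the existing reproducible result.

```python
def harmonic_mean(values):
    """Harmonic mean, which damps the effect of brief throughput spikes."""
    return len(values) / sum(1.0 / v for v in values)

def choose_bitrate(bitrates_kbps, recent_throughput_kbps, safety=0.9):
    """Pick the highest bitrate not exceeding a discounted throughput estimate."""
    estimate = safety * harmonic_mean(recent_throughput_kbps)
    affordable = [b for b in sorted(bitrates_kbps) if b <= estimate]
    # Fall back to the lowest rung when even that exceeds the estimate.
    return affordable[-1] if affordable else min(bitrates_kbps)

ladder = [300, 750, 1200, 2400, 4800]   # hypothetical bitrate ladder (kbps)
choice = choose_bitrate(ladder, [3000, 2000, 6000])
fallback = choose_bitrate(ladder, [100])
```

Student submissions could expose the same `choose_bitrate`-style interface so that the evaluation framework can swap policies in and out under each network setting.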

Using Reproducibility in Machine Learning Education

Wed, 18 Jan 2023 00:00:00 +0000

Lead Mentor: Fraida Fund

The computer science and engineering classroom is an essential part of the reproducibility “ecosystem” - because of its broad reach and potential for big impact, and because for many students, the classroom is their first exposure to research in their field. For machine learning in particular, reproducibility is an important element of the research culture, and can be a valuable part of any introductory or advanced course in the field. These projects will develop highly interactive open educational resources that may be adopted by instructors of graduate or undergraduate machine learning courses to incorporate more instruction about reproducibility and reproducible research.

Introducing “levels” of reproduction and replication in ML

Topics: Machine learning, reproducibility, education
Skills: Python, machine learning, writing
Difficulty: Medium
Size: 350 hours
Mentor(s): Fraida Fund and TBD
Contributor(s): Mohamed Saeed

In machine learning, replicating a published result to confirm the validity of the experimental results and the broader conclusions of the paper can take several forms, with increasing levels of effort:

using authors’ code and pre-trained weights, run the model on the same benchmarks as the original paper
training a model using authors’ code and published hyperparameters,
training a model using authors’ code and a new hyperparameter search,
validating the authors’ code e.g. with unit tests, in addition to training,
re-implementing the model,
designing additional experiments to validate that the suggested mechanism is in fact responsible for the result,
and more.

This project will develop interactive materials (using one or more exemplar published results) to illustrate and to highlight relevant aspects and pitfalls of each of these “levels” of reproduction and replication.

Packaging existing reproducible results for the ML classroom

Topics: Machine learning, reproducibility, education
Skills: Python, machine learning, writing
Difficulty: Medium
Size: 350 hours
Mentor(s): Fraida Fund and TBD
Contributor(s): Shekhar, Jonathan Edwin

The goal is to make it easier for instructors to expose students to state-of-the-art research in the classroom. This project will work with an existing set of recent reproducible results in machine learning, and will package them for easier consumption by students and more effective use in the classroom. This may include, but is not necessarily limited to:

Re-validating the result and re-packaging along with computational environment on an open access testbed
Creating tutorial material around the result, including interactive visualizations to demonstrate key elements of the work
Creating one-click demos for applying the model/technique to a new test sample
Curating test samples to highlight important advantages and limitations of the result
Creating assessment questions and sample “solutions” that instructors may use to “assign” the work to students

Public Artifact Data and Visualization

Mon, 09 Jan 2023 10:15:56 -0700

Reproducibility and Artifact Evaluation efforts have focused on reproducing the results, but not necessarily on storing, visualizing and making the results accessible. This set of projects builds the initial building blocks to log, capture, and visualize experiments.

Experiment Log

Topics: Provide tools to log experiments
Difficulty: Simple
Size: Medium or large (175 or 350 hours)
Mentors: Anjo Vahldiek-Oberwagner

Develop a client- and server-side tool to start, stop, and timestamp an experiment. Document each iteration of the experiment and create a database to visualize the log of experiments.
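
A minimal sketch of the logging core, assuming a single SQLite table and leaving out the client/server split entirely. The table layout and function names are hypothetical, and the in-memory database is used only to keep the example self-contained.

```python
import sqlite3
from datetime import datetime, timezone

def open_log(path=":memory:"):
    """Create the (hypothetical) experiment-log table if it does not exist."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "id INTEGER PRIMARY KEY, name TEXT, note TEXT, "
        "started_at TEXT, stopped_at TEXT)"
    )
    return db

def now():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()

def start_run(db, name, note=""):
    """Record a new iteration of an experiment and return its id."""
    cursor = db.execute(
        "INSERT INTO runs (name, note, started_at) VALUES (?, ?, ?)",
        (name, note, now()),
    )
    db.commit()
    return cursor.lastrowid

def stop_run(db, run_id):
    """Stamp the stop time on an existing run."""
    db.execute("UPDATE runs SET stopped_at = ? WHERE id = ?", (now(), run_id))
    db.commit()

db = open_log()
run_id = start_run(db, "baseline", note="first iteration")
stop_run(db, run_id)
row = db.execute(
    "SELECT name, started_at, stopped_at FROM runs WHERE id = ?", (run_id,)
).fetchone()
```

In a client/server design, `start_run` and `stop_run` would sit behind a network API on the server, with the client sending the experiment name and notes.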

Capture HW/SW state & continuous monitoring

Topics: Record initial state
Difficulty: Medium
Size: Medium or large (175 or 350 hours)
Mentors: Anjo Vahldiek-Oberwagner

Provide simple tools to gather the initial state of each experimental machine and its connected devices, configurations, software versions, … Upload the state into the experiment log database and visualize the recorded data. Ideally, provide a diff function between experimental runs.

In a second step, monitor the machine’s state during the execution. This includes network, memory, CPU, and general OS statistics.
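
The initial-state capture and the diff between runs described above could be sketched as follows, using only the Python standard library. The chosen fields and the example version strings are assumptions. A real tool would also need to record devices, configurations, and installed packages.

```python
import platform
import sys

def capture_state():
    """Collect a few basic facts about the machine running the experiment."""
    return {
        "hostname": platform.node(),
        "os": platform.system(),
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }

def diff_states(before, after):
    """Map each key whose value differs to an (old, new) pair."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }

snapshot = capture_state()

# Hypothetical snapshots from two runs, for illustration only.
changed = diff_states(
    {"kernel": "5.15.0", "gcc": "11.4"},
    {"kernel": "6.1.0", "gcc": "11.4", "cuda": "12.2"},
)
```

Keys that appear in only one snapshot show up with `None` on the missing side, which makes added or removed software visible in the diff.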

Record and visualize experimental results

Topics: Record results in various formats and visualize them
Difficulty: Hard
Size: Medium or large (175 or 350 hours)
Mentors: Anjo Vahldiek-Oberwagner
Contributor(s): Jiayuan Zhu, Krishna Madhwani

Description: Experiments generate results in various formats (e.g., CSV, JSON, text files, …). The goal of this project is to provide tools to extract results from common formats, connect them to the experiment log, and visualize them, ideally allowing comparison of different experimental runs. Initially, the project could dump its results into a Prometheus instance (https://prometheus.io/), which would later become available for everyone to explore the data.
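
As an illustration of extracting results from common formats into one shape, the sketch below normalizes a CSV file and a JSON object into the same record layout and then compares one metric across runs. The `metric,value` column layout, the flat JSON shape, and the record fields are assumptions. Uploading to Prometheus is not shown.

```python
import csv
import io
import json

def load_csv(text, run_id):
    """Turn CSV rows with `metric,value` columns into common records."""
    return [
        {"run": run_id, "metric": row["metric"], "value": float(row["value"])}
        for row in csv.DictReader(io.StringIO(text))
    ]

def load_json(text, run_id):
    """Turn a flat JSON object of metric -> value into common records."""
    return [
        {"run": run_id, "metric": metric, "value": float(value)}
        for metric, value in json.loads(text).items()
    ]

def compare(records, metric):
    """Collect one metric's value per run so runs can be compared side by side."""
    return {r["run"]: r["value"] for r in records if r["metric"] == metric}

# Made-up results from two hypothetical runs.
records = load_csv("metric,value\nlatency_ms,12.5\nthroughput,900\n", run_id=1)
records += load_json('{"latency_ms": 10.0, "throughput": 950}', run_id=2)
latency_by_run = compare(records, "latency_ms")
```

Keeping a run identifier on every record is what lets results be linked back to the experiment log.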

Package Management & Reproducibility

Mon, 07 Nov 2022 10:15:56 -0700

Project ideas related to reproducibility and package management, especially as it relates to store type package managers (NixOS, Guix or Spack).

Lead Mentor: Farid Zakaria (fmzakari@ucsc.edu)

Investigate the dynamic linking landscape

Topics: Operating Systems, Compilers, Linux, Package Management, NixOS
Skills: Experience with systems programming and Linux familiarity
Difficulty: Moderate to Challenging
Size: Large (350 hours)
Mentors: Farid Zakaria & Tom Scogland (scogland1@llnl.gov)

Dynamic linking as specified in the ELF file format has gone unchallenged since its invention. With many new package management models that eschew the filesystem hierarchy standard (i.e. Nix, Guix and Spack), many of the idiosyncrasies that define the way in which libraries are discovered are no longer useful and are potentially harmful.

Specific tasks:

Continue development on Shrinkwrap, a tool to make dynamic library loading simpler and more robust.
Evaluate its effectiveness across a wide range of binaries.
Upstream contributions to NixOS or Guix to leverage the improvement when suitable.
Investigate alternative improvements to dynamic linking by writing a dynamic linker “loader wrapper” to explore new ideas.