Michael Sherman | UCSC OSPO

Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments

Tue, 18 Feb 2025 00:00:00 +0000

Overview

A critical challenge in computer systems research reproducibility is establishing and sharing experimental environments. While open testbeds like Chameleon provide access to hardware resources, researchers still face significant barriers when attempting to recreate the precise software configurations, dependencies, and system states needed for reproducible experiments. Environment snapshotting tools offer a solution, but they struggle to capture running systems consistently without introducing distortions or requiring disruptive system modifications. This project addresses these reproducibility barriers by enhancing CC-Snapshot, a tool that captures the experimental environment a user has configured on a bare-metal image, so that system captures are more reliable and consistent and can be shared and redeployed without loss of fidelity.

CC-Snapshot is a tool on the Chameleon testbed that enables users to package their customized environments as complex images or appliances. By allowing researchers to share these environments easily, CC-Snapshot offers a powerful mechanism for reproducibility, ensuring that experiments can be replicated and extended by others.

In this project, you will review existing CC-Snapshot workflows, research the latest snapshotting technologies, and develop enhancements that improve the tool’s usability and reliability. This includes ensuring snapshots are created consistently (even when the OS is actively running), preserving the integrity of user systems, and exploring advanced features such as out-of-band snapshotting and API-based triggers.

Key Outcomes

Improved Snapshot Consistency: New methods to capture the full state of a disk without risking corruption or data inconsistency.
Enhanced Reproducibility: A refined workflow that allows researchers to reliably share custom environments, facilitating collaborative and repeatable experiments.
User-Friendly Tooling: Streamlined processes that reduce disruption to running systems—so installing dependencies or rebooting into special environments is less burdensome.
Exploratory Features (Stretch Goals): Advanced mechanisms to stream disk data in real time during snapshotting and to initiate snapshots via an API call (for parity with VM snapshots).

Topics: Cloud Computing, Systems & Infrastructure, Reproducibility, Operating System Internals

Skills: Linux / OS Concepts, Cloud Tools, Systems Programming / Scripting, DevOps / CI

Difficulty: Moderate

Size: Medium

Mentors: Michael Sherman, Mark Powers

Tasks:

Ensure Snapshot Consistency
- Reboot into a ramdisk and copy the offline disk.
- Use kexec to switch to/from a ramdisk environment without a full reboot.
- Change images to use a snapshot-capable storage layer (e.g., LVM) for safer live snapshots.
- Investigate additional methods (e.g., blog.benjojo.co.uk) for safely imaging live disks.
Prevent System Modifications During Snapshot
- Currently, CC-Snapshot installs dependencies (e.g., qemu-img) on the running system, affecting its state.
- In-Band Fix: Download and run tools in a temp directory with static linking, avoiding system-level changes.
- Out-of-Band Approach: Snapshots done via ramdisk or kexec do not require altering the running system.
API-Triggered Snapshots
- Extend or integrate with the Nova “snapshot instance” API to support the same workflow for bare metal.
- Leverage Ironic’s new “service steps” feature for an automated snapshot pipeline.
(Stretch Goal) Streaming Snapshots
- Modify the workflow to stream data directly to storage, rather than making a full local copy first.
- Explore incremental or differential snapshot techniques to reduce bandwidth usage and storage overhead.
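
The streaming and differential ideas above can be illustrated with a minimal sketch. It is not how CC-Snapshot currently works: it assumes the source is already consistent (for example, an offline disk read from a ramdisk), and the chunk size, the use of SHA-256 digests, and every function name are hypothetical choices made for illustration. Each chunk is read once and handed to the caller, so no full local copy is needed, and only chunks whose digest differs from the previous snapshot are yielded.

```python
import hashlib
from typing import BinaryIO, Iterator, List, Tuple

CHUNK_SIZE = 4 * 1024 * 1024  # hypothetical default; tune for the storage backend


def iter_chunks(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes]]:
    """Yield (index, data) pairs so the whole disk is never held in memory."""
    index = 0
    while True:
        data = source.read(chunk_size)
        if not data:
            break
        yield index, data
        index += 1


def chunk_digests(source: BinaryIO, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Compute one SHA-256 digest per chunk; this list describes a base snapshot."""
    return [hashlib.sha256(data).hexdigest() for _, data in iter_chunks(source, chunk_size)]


def changed_chunks(
    source: BinaryIO, previous: List[str], chunk_size: int = CHUNK_SIZE
) -> Iterator[Tuple[int, bytes]]:
    """Yield only the chunks that are new or differ from the previous digests."""
    for index, data in iter_chunks(source, chunk_size):
        digest = hashlib.sha256(data).hexdigest()
        if index >= len(previous) or previous[index] != digest:
            yield index, data
```

A caller would upload each yielded chunk to object storage as it arrives. The design trades extra hashing on every run for lower bandwidth and storage use, which only pays off when most of the disk is unchanged between snapshots.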

Widgets for Python-chi in Jupyter

Tue, 18 Feb 2025 00:00:00 +0000

Overview

Reproducibility challenges in research extend beyond code and environments to the experimental workflow itself. When experiments involve dynamic resource allocation, monitoring, and reconfiguration, researchers often struggle to document these interactive steps in a way that others can precisely follow. The lack of structured workflow documentation and real-time feedback creates barriers for reviewers attempting to reproduce experiments, as they cannot easily verify whether their resource configurations match the original experiment’s state. This project addresses these challenges by developing interactive Jupyter widgets that make experiment resource management more visual, intuitive, and self-documenting—transforming ad-hoc command sequences into reproducible workflows that automatically log interactions and configuration changes while providing immediate visual feedback on experiment topology and resource states.

As cloud researchers often work with Jupyter Notebooks for interactive data analysis and experimentation, the python-chi library offers a powerful way to automate and control resources on Chameleon Cloud. This project will extend python-chi by adding interactive widgets specifically designed for use in Jupyter, empowering users to launch, monitor, and manage their experiments without leaving the notebook environment. By bringing visual and intuitive controls directly into the user’s workflow, we aim to improve both reproducibility and usability for complex resource management tasks.

Key Outcomes

User-Friendly Jupyter Widgets: Develop a suite of widgets to visualize reserved resources, hardware availability, and experiment topologies in real time.
Integrated Experiment Management: Enable researchers to orchestrate experiments (launch, configure, monitor) within a single, notebook-centric workflow.
Enhanced Feedback & Usability: Provide clear, asynchronous status updates and resource reconfiguration progress, reducing confusion and user error.
Improved Reproducibility: By automating and logging widget interactions, experiments become more traceable and easier to replicate.
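
One way the logging of widget interactions might be prototyped is sketched below. This is an assumption-laden illustration, not part of python-chi: the `InteractionLog` class, its method names, and the JSON Lines output format are all hypothetical. The idea is that any callback attached to a widget is wrapped so each invocation is recorded before it runs, which yields a replayable trace of what the user did.

```python
import functools
import json
import time
from typing import Any, Callable, Dict, List


class InteractionLog:
    """Append-only record of user actions (hypothetical helper, not a python-chi API)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so tests can use a fixed timestamp
        self.events: List[Dict[str, Any]] = []

    def record(self, action: str, **details: Any) -> None:
        """Store one event with a timestamp and free-form details."""
        self.events.append({"time": self._clock(), "action": action, "details": details})

    def logged(self, action: str) -> Callable:
        """Decorator: record each call to the wrapped callback, then run it."""
        def decorator(callback: Callable) -> Callable:
            @functools.wraps(callback)
            def wrapper(*args: Any, **kwargs: Any) -> Any:
                self.record(action, args=list(args), kwargs=kwargs)
                return callback(*args, **kwargs)
            return wrapper
        return decorator

    def to_jsonl(self) -> str:
        """Serialize events as JSON Lines; non-JSON values fall back to repr()."""
        return "\n".join(json.dumps(e, sort_keys=True, default=repr) for e in self.events)
```

In a notebook, a button or dropdown handler decorated with `log.logged("some_action")` would leave a trace that could be saved next to the notebook, so a reviewer can compare their own sequence of actions with the original one.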

Topics: Interactive Data Tools, Cloud Resource Management, DevOps & Automation, User Experience (UX)

Skills:

Python & Jupyter: Experience creating custom Jupyter widgets, using ipywidgets or similar frameworks.
Cloud Automation: Familiarity with how resources are provisioned, monitored, and deprovisioned on Chameleon.
Frontend / GUI Development: Basic understanding of web technologies (HTML/CSS/JavaScript) can be helpful for widget design.
Software Engineering & CI: Ability to version-control, test, and deploy Python packages.

Difficulty: Moderate

Size: Medium

Mentors: Michael Sherman, Mark Powers

Tasks:

Resource Visualization Widgets
- Build custom widgets that show reserved resources (nodes, networks, storage) in Jupyter.
- Provide an interactive topology view for experiments, indicating node statuses and connections.
Experiment Setup & Execution
- Add controls for launching and managing experiments directly from notebooks.
- Show feedback (e.g., progress bars, status messages) as resources are being allocated or reconfigured.
Hardware Availability & Status Tracking
- Implement a widget that provides real-time data on Chameleon’s hardware availability (bare metal, VMs, GPU nodes, etc.).
- Allow users to filter or select specific resources based on current hardware states.
Usability & Feedback Loop
- Gather user feedback on the widget designs and workflows.
- Refine the interface to minimize clicks, improve clarity, and reduce friction for common tasks.

ML-Powered Problem Detection in Chameleon

Wed, 06 Mar 2024 16:33:57 -0600

Today’s Continuous Integration/Continuous Delivery (CI/CD) trends encourage rapid design of software using a wide range of software components, followed by frequent updates that are immediately deployed on the cloud. The complexity of cloud systems, along with the component diversity and break-neck pace of development, amplifies the difficulty of identifying or fixing problems related to performance, resilience, and security. Furthermore, existing approaches that rely on human experts (e.g., methods involving manually written rules or scripts) have limited applicability to modern CI/CD processes, as they are fragile, costly, and often not scalable. Consequently, there is growing interest in applying machine learning (ML) based methods for identifying vulnerabilities in code, non-compliant or otherwise problematic software, and resilience problems in systems and networks. However, despite some success stories in applying AI for cloud operations (e.g., in resource management), much of cloud operations still relies on human-centric methods, which require updates as the cloud undergoes CI/CD cycles. The goal of this summer project is to explore methods of automation for the Chameleon Cloud to enable faster detection and diagnosis of problems. Overall, the project will contribute to an overarching vision of building an infrastructure that collects and synthesizes cross-layer data from large-scale cloud systems, applies ML-powered methods to automate cloud ops, and makes this data available to researchers through coherent APIs and analytics engines.

Currently, Chameleon uses runbooks as manual guides for operational tasks, including routine maintenance and troubleshooting. However, these traditional runbooks often fall short in dynamic and fast-paced CI/CD environments, as they lack the flexibility to adapt to changes in software versions, deployment configurations, and newly emerging issues. To overcome these limitations, the project will leverage ML to automate anomaly detection based on telemetry data collected from Chameleon Cloud’s monitoring frameworks. This method will not only facilitate rapid identification of performance anomalies but also enable automated generation of runbooks. These runbooks can then give operators actionable steps, shortening the time needed to mitigate an anomaly. Furthermore, this approach supports the automatic creation of targeted runbooks for newly generated support tickets, improving response times and system reliability.
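
As a point of reference for the anomaly detection step, the sketch below shows a simple statistical baseline rather than an ML model: a rolling z-score over a single telemetry series. It is not the project's method. The function name, window length, and threshold are hypothetical, and real telemetry would likely need multivariate and seasonal handling that this ignores.

```python
import statistics
from collections import deque
from typing import Deque, Iterable, List, Tuple


def rolling_zscore_anomalies(
    samples: Iterable[float], window: int = 30, threshold: float = 3.0
) -> List[Tuple[int, float, float]]:
    """Return (index, value, z) for samples far from the trailing window's mean."""
    history: Deque[float] = deque(maxlen=window)
    anomalies: List[Tuple[int, float, float]] = []
    for index, value in enumerate(samples):
        # Only score once a full window of earlier samples is available.
        if len(history) == window:
            mean = statistics.fmean(history)
            stdev = statistics.pstdev(history)
            if stdev > 0:  # a flat window gives no scale to compare against
                z = (value - mean) / stdev
                if abs(z) > threshold:
                    anomalies.append((index, value, z))
        history.append(value)
    return anomalies
```

A baseline like this is mainly useful for comparison: a learned detector should flag at least what the z-score flags. Note that a flagged sample still enters the window and inflates the variance for the following samples, which is one of the weaknesses a more robust method would need to address.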

Time permitting, we will use the collection of automated runbooks (each targeting a specific problem) to analyze support tickets, common problems, and their frequency. The resulting insights can inform Chameleon Cloud's roadmap by pointing to the fixes that offer the best return on investment.

A key aspect of this summer project is enhancing the reproducibility of experiments in the cloud and improving data accessibility. We plan to design infrastructures and APIs so that the telemetry data that is essential for anomaly detection and automated runbooks is systematically documented and made available. We also aim to collect and share insights and modules on applying ML for cloud operations, including ML pipelines, data labeling strategies, data preprocessing techniques, and feature engineering. By sharing these insights, we aim to promote best practices and support reproducible experiments on public clouds, thus fostering future ML-based practices within the Chameleon Cloud community and beyond. Time permitting, we will explore applying lightweight privacy-preserving approaches on telemetry data as well.

Topics: Machine Learning, Anomaly Detection, Automated Runbooks, Telemetry Data

Skills:

Proficiency in Machine Learning: Understanding of ML algorithms for anomaly detection and automation.
Cloud Computing Knowledge: Familiarity with CI/CD environments and cloud architectures.
Programming Skills: Proficiency in languages such as Python, especially in cloud and ML contexts.
Data Analysis: Ability to analyze telemetry data using data analytics tools and libraries.

Difficulty: Hard

Size: Large

Mentor: Michael Sherman