<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Michael Sherman | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/index.xml" rel="self" type="application/rss+xml"/><description>Michael Sherman</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/avatar_huefa27a8892765b18bdd66d9aadac4c15_86280_270x270_fill_q75_lanczos_center.jpg</url><title>Michael Sherman</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/</link></image><item><title>Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/cc-snapshot/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/cc-snapshot/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>A critical challenge in computer systems research reproducibility is establishing and sharing experimental environments. While open testbeds like Chameleon provide access to hardware resources, researchers still face significant barriers when attempting to recreate the precise software configurations, dependencies, and system states needed for reproducible experiments. Environment snapshotting tools offer a solution, but they face technical challenges in consistently capturing running systems without introducing distortions or requiring disruptive system modifications. This project addresses these reproducibility barriers by enhancing CC-Snapshot, a tool that captures the experimental environment configured by the user on bare metal images, to create more reliable and consistent system captures that can be shared and redeployed without loss of fidelity.&lt;/p>
&lt;p>&lt;a href="https://chameleoncloud.readthedocs.io/en/latest/technical/images.html#the-cc-snapshot-utility" target="_blank" rel="noopener">CC-Snapshot&lt;/a> is a tool on the &lt;a href="https://chameleoncloud.org">Chameleon&lt;/a> testbed that enables users to package their customized environments as complex images or appliances. By allowing researchers to share these environments easily, CC-Snapshot offers a powerful mechanism for reproducibility, ensuring that experiments can be replicated and extended by others.&lt;/p>
&lt;p>In this project, you will review existing CC-Snapshot workflows, research the latest snapshotting technologies, and develop enhancements that improve the tool’s usability and reliability. This includes ensuring snapshots are created consistently (even when the OS is actively running), preserving the integrity of user systems, and exploring advanced features such as out-of-band snapshotting and API-based triggers.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Improved Snapshot Consistency: New methods to capture the full state of a disk without risking corruption or data inconsistency.&lt;/li>
&lt;li>Enhanced Reproducibility: A refined workflow that allows researchers to reliably share custom environments, facilitating collaborative and repeatable experiments.&lt;/li>
&lt;li>User-Friendly Tooling: Streamlined processes that reduce disruption to running systems—so installing dependencies or rebooting into special environments is less burdensome.&lt;/li>
&lt;li>Exploratory Features (Stretch Goals): Advanced mechanisms to stream disk data in real time during snapshotting and to initiate snapshots via an API call (for parity with VM snapshots).&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Cloud Computing, Systems &amp;amp; Infrastructure, Reproducibility, Operating System Internals&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>: Linux / OS Concepts, Cloud Tools, Systems Programming / Scripting, DevOps / CI&lt;/p>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Ensure Snapshot Consistency
&lt;ul>
&lt;li>Reboot into a ramdisk and copy the offline disk.&lt;/li>
&lt;li>Use kexec to switch to/from a ramdisk environment without a full reboot.&lt;/li>
&lt;li>Change images to use a snapshot-capable storage layer (e.g., LVM) for safer live snapshots.&lt;/li>
&lt;li>Investigate additional methods (e.g., blog.benjojo.co.uk) for safely imaging live disks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Prevent System Modifications During Snapshot
&lt;ul>
&lt;li>Currently, CC-Snapshot installs dependencies (e.g., qemu-img) on the running system, affecting its state.&lt;/li>
&lt;li>In-Band Fix: Download and run tools in a temp directory with static linking, avoiding system-level changes.&lt;/li>
&lt;li>Out-of-Band Approach: Snapshots done via ramdisk or kexec do not require altering the running system.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>API-Triggered Snapshots
&lt;ul>
&lt;li>Extend or integrate with the Nova “snapshot instance” API to support the same workflow for bare metal.&lt;/li>
&lt;li>Leverage Ironic’s new “service steps” feature for an automated snapshot pipeline.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>(Stretch Goal) Streaming Snapshots
&lt;ul>
&lt;li>Modify the workflow to stream data directly to storage, rather than making a full local copy first.&lt;/li>
&lt;li>Explore incremental or differential snapshot techniques to reduce bandwidth usage and storage overhead.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Widgets for Python-chi in Jupyter</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/jupyter-widgets/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/jupyter-widgets/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Reproducibility challenges in research extend beyond code and environments to the experimental workflow itself. When experiments involve dynamic resource allocation, monitoring, and reconfiguration, researchers often struggle to document these interactive steps in a way that others can precisely follow. The lack of structured workflow documentation and real-time feedback creates barriers for reviewers attempting to reproduce experiments, as they cannot easily verify whether their resource configurations match the original experiment&amp;rsquo;s state. This project addresses these challenges by developing interactive Jupyter widgets that make experiment resource management more visual, intuitive, and self-documenting—transforming ad-hoc command sequences into reproducible workflows that automatically log interactions and configuration changes while providing immediate visual feedback on experiment topology and resource states.&lt;/p>
&lt;p>Cloud researchers often work in Jupyter Notebooks for interactive data analysis and experimentation, and the &lt;a href="https://python-chi.readthedocs.io/" target="_blank" rel="noopener">python-chi&lt;/a> library offers a powerful way to automate and control resources on &lt;a href="https://chameleoncloud.org">Chameleon Cloud&lt;/a>. This project will extend python-chi by adding interactive widgets specifically designed for use in Jupyter, empowering users to launch, monitor, and manage their experiments without leaving the notebook environment. By bringing visual and intuitive controls directly into the user’s workflow, we aim to improve both reproducibility and usability for complex resource management tasks.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>User-Friendly Jupyter Widgets: Develop a suite of widgets to visualize reserved resources, hardware availability, and experiment topologies in real time.&lt;/li>
&lt;li>Integrated Experiment Management: Enable researchers to orchestrate experiments (launch, configure, monitor) within a single, notebook-centric workflow.&lt;/li>
&lt;li>Enhanced Feedback &amp;amp; Usability: Provide clear, asynchronous status updates and resource reconfiguration progress, reducing confusion and user error.&lt;/li>
&lt;li>Improved Reproducibility: By automating and logging widget interactions, experiments become more traceable and easier to replicate.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Interactive Data Tools, Cloud Resource Management, DevOps &amp;amp; Automation, User Experience (UX)&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Python &amp;amp; Jupyter: Experience creating custom Jupyter widgets, using ipywidgets or similar frameworks.&lt;/li>
&lt;li>Cloud Automation: Familiarity with how resources are provisioned, monitored, and deprovisioned on Chameleon.&lt;/li>
&lt;li>Frontend / GUI Development: Basic understanding of web technologies (HTML/CSS/JavaScript) can be helpful for widget design.&lt;/li>
&lt;li>Software Engineering &amp;amp; CI: Ability to version-control, test, and deploy Python packages.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Resource Visualization Widgets
&lt;ul>
&lt;li>Build custom widgets that show reserved resources (nodes, networks, storage) in Jupyter.&lt;/li>
&lt;li>Provide an interactive topology view for experiments, indicating node statuses and connections.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Experiment Setup &amp;amp; Execution
&lt;ul>
&lt;li>Add controls for launching and managing experiments directly from notebooks.&lt;/li>
&lt;li>Show feedback (e.g., progress bars, status messages) as resources are being allocated or reconfigured.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Hardware Availability &amp;amp; Status Tracking
&lt;ul>
&lt;li>Implement a widget that provides real-time data on Chameleon’s hardware availability (bare metal, VMs, GPU nodes, etc.).&lt;/li>
&lt;li>Allow users to filter or select specific resources based on current hardware states.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Usability &amp;amp; Feedback Loop
&lt;ul>
&lt;li>Gather user feedback on the widget designs and workflows.&lt;/li>
&lt;li>Refine the interface to minimize clicks, improve clarity, and reduce friction for common tasks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</link><pubDate>Wed, 06 Mar 2024 16:33:57 -0600</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</guid><description>&lt;p>Today’s Continuous Integration/Continuous Deployment (CI/CD) trends encourage
rapid design of software using a wide range of software components, followed by
frequent updates that are immediately deployed on the cloud. The complexity of
cloud systems along with the component diversity and break-neck pace of
development amplify the difficulty in identifying or fixing problems related to
performance, resilience, and security. Furthermore, existing approaches that
rely on human experts—e.g., methods involving manually-written
rules/scripts—have limited applicability to modern CI/CD processes, as they are
fragile, costly, and often not scalable. Consequently, there is growing
interest in applying machine learning (ML) based methods for identifying
vulnerabilities in code, non-compliant or otherwise problematic software, and
resilience problems in systems and networks. However, despite some success
stories in applying AI for cloud operations (e.g., in resource management),
much of cloud operations still relies on human-centric methods, which require
updates as the cloud undergoes CI/CD cycles. The goal of this summer project is
to explore methods of automation for the Chameleon Cloud to enable faster
detection and diagnosis of problems. Overall, the project will contribute to an
overarching vision of building an infrastructure that collects and synthesizes
cross-layer data from large-scale cloud systems, applying ML-powered methods to
automate cloud ops, and, further, making this data available to researchers
through coherent APIs and analytics engines.&lt;/p>
&lt;p>Currently, Chameleon uses runbooks as manual guides for operational tasks,
including routine maintenance and troubleshooting. However, these traditional
runbooks often fall short in dynamic and fast-paced CI/CD environments, as they
lack the flexibility to adapt to changes in software versions, deployment
configurations, and the unique challenges of emerging issues. To overcome these
challenges, the project will leverage ML to automate anomaly detection based on
telemetry data collected from Chameleon Cloud&amp;rsquo;s monitoring frameworks. This
method will not only facilitate rapid identification of performance anomalies
but also enable automated generation of runbooks. These runbooks can then offer
operators actionable steps to resolve issues, thereby making the anomaly
mitigation process more efficient. Furthermore, this approach supports
the automatic creation of targeted runbooks for newly generated support
tickets, enhancing response times and system reliability.&lt;/p>
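&lt;p>As a rough illustration of what automated anomaly detection on telemetry can look like, the sketch below flags samples that deviate sharply from a rolling window of recent values. It is a simple statistical baseline (a rolling z-score), not an ML model and not the project's actual method; the function name &lt;code>find_anomalies&lt;/code>, the window size, the threshold, and the CPU readings are all hypothetical.&lt;/p>

```python
from collections import deque
from statistics import mean, pstdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate strongly from the recent window.

    A sample is flagged when its z-score against the previous `window`
    samples exceeds `threshold`. All names and defaults are hypothetical.
    """
    recent = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(samples):
        # Only score a sample once a full window of history is available.
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            # Skip the test when the window is flat to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(index)
        recent.append(value)
    return flagged


# Hypothetical CPU-utilisation readings (percent) with one obvious spike.
cpu_percent = [21.0, 22.5, 20.8, 21.7, 22.1, 21.4, 95.0, 22.0, 21.9]
print(find_anomalies(cpu_percent))
```

&lt;p>With these made-up readings, only the spike at index 6 is flagged. A learned model could replace the z-score test while keeping the same interface: telemetry samples in, flagged indices out.&lt;/p>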
&lt;p>Time permitting, we will use a collection of automated runbooks (each targeting a
specific problem) to analyze support tickets, common problems, and their
frequency, and to inform the Chameleon Cloud roadmap by suggesting which fixes
offer the best return on investment.&lt;/p>
&lt;p>A key aspect of this summer project is enhancing the reproducibility of
experiments in the cloud and improving data accessibility. We plan to design
infrastructures and APIs so that the telemetry data that is essential for
anomaly detection and automated runbooks is systematically documented and made
available. We also aim to collect and share insights and modules on applying ML
for cloud operations, including ML pipelines, data labeling strategies, data
preprocessing techniques, and feature engineering. By sharing these insights,
we aim to promote best practices and support reproducible experiments on public
clouds, thus fostering future ML-based practices within the Chameleon Cloud
community and beyond. Time permitting, we will explore applying lightweight
privacy-preserving approaches on telemetry data as well.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Machine Learning&lt;/code>, &lt;code>Anomaly Detection&lt;/code>, &lt;code>Automated Runbooks&lt;/code>, &lt;code>Telemetry Data&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>:
&lt;ul>
&lt;li>Proficiency in Machine Learning: Understanding of ML algorithms for anomaly detection and automation.&lt;/li>
&lt;li>Cloud Computing Knowledge: Familiarity with CI/CD environments and cloud architectures.&lt;/li>
&lt;li>Programming Skills: Proficiency in languages such as Python, especially in cloud and ML contexts.&lt;/li>
&lt;li>Data Analysis: Ability to analyze telemetry data using data analytics tools and libraries.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>