cloud computing | UCSC OSPO

Final Report: MPI Appliance for HPC Research on Chameleon

Mon, 01 Sep 2025 00:00:00 +0000

Hi Everyone, This is my final report for the project I completed during my summer as a Summer of Reproducibility (SOR) student. The project, titled “MPI Appliance for HPC Research in Chameleon,” was undertaken in collaboration with Argonne National Laboratory and the Chameleon Cloud community. The project was mentored by Ken Raffenetti and was completed over the summer. This blog details the work and outcomes of the project.

Background

Message Passing Interface (MPI) is the backbone of high-performance computing (HPC), enabling efficient scaling across thousands of processing cores. However, reproducing MPI-based experiments remains challenging due to dependencies on specific library versions, network configurations, and multi-node setups.

To address this, we introduce a reproducibility initiative that provides standardized MPI environments on the Chameleon testbed. This is set up as a master–worker MPI cluster. The master node manages tasks and communication, while the worker nodes do the computations. All nodes have the same MPI libraries, software, and network settings, making experiments easier to scale and reproduce.

Objectives

The aim of this project is to create an MPI cluster that is reproducible, easily deployable, and efficiently configurable.

The key objectives of this project were:

Pre-built MPI Images: Create ready-to-use images with MPI and all dependencies installed.
Automated Cluster Configuration: Develop Ansible playbooks to configure master–worker communication, including host setup, SSH key distribution, and MPI configuration across nodes.
Cluster Orchestration: Develop orchestration template to provision resources and invoke Ansible playbooks for automated cluster setup.

Implementation Strategy and Deliverables

Openstack Image Creation

The first step was to create a standardized pre-built image, which serves as the base image for all nodes in the cluster.

Some important features of the image include:

Built on Ubuntu 22.04 for a stable base environment.
Spack + Lmod integration:
- Spack handles reproducible, version-controlled installations of software packages.
- Lmod (Lua Modules) provides a user-friendly way to load/unload software environments dynamically.
- Together, they allow users to easily switch between MPI versions, libraries, and GPU toolkits
MPICH and OpenMPI pre-installed for standard MPI support and can be loaded/unloaded.
Three image variants for various HPC workloads: CPU-only, NVIDIA GPU (CUDA 12.8), and AMD GPU (ROCm 6.4.2).

These images have been published and are available in the Chameleon Cloud Appliance Catalog:

MPI and Spack for HPC (Ubuntu 22.04) - CPU Only
MPI and Spack for HPC (Ubuntu 22.04 - CUDA) - NVIDIA GPU (CUDA 12.8)
MPI and Spack for HPC (Ubuntu 22.04 - ROCm) - AMD GPU (ROCm 6.4.2)

Cluster Configuration using Ansible

The next step is to create scripts/playbooks to configure these nodes and set up an HPC cluster. We assigned specific roles to different nodes in the cluster and combined them into a single playbook to configure the entire cluster automatically.

Some key steps the playbook performs:

Configure /etc/hosts entries for all nodes.
Mount Manila NFS shares on each node.
Generate an SSH key pair on the master node and add the master’s public key to the workers’ authorized_keys.
Scan worker node keys and update known_hosts on the master.
(Optional) Manage software:
- Install new compilers with Spack
- Add new Spack packages
- Update environment modules to recognize them
Create a hostfile at /etc/mpi/hostfile.

The code is publicly available and can be found on the GitHub repository: https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact

Orchestration

With the image now created and deployed, and the Ansible scripts ready for cluster configuration, we put everything together to orchestrate the cluster deployment.

This can be done in two primary ways:

Python CHI(Jupyter) + Ansible

Python-CHI is a python library designed to facilitate interaction with the Chameleon testbed. Often used within environments like Jupyter notebooks.

This setup can be put up as:

Create leases, launch instances, and set up shared storage using python-chi commands.
Automatically generate inventory.ini for Ansible based on launched instances.
Run Ansible playbook programmatically using ansible_runner.
Outcome: fully configured, ready-to-use HPC cluster; SSH into master to run examples.

If you would like to see a working example, you can view it in the Trovi example

Heat Orchestration Template

Heat Orchestration Template(HOT) is a YAML based configuration file. Its purpose is to define/create a stack to automate the deployment and configuration of OpenStack cloud resources.

Challenges

We faced some challenges while working with Heat templates and stacks in particular in Chameleon Cloud

OS::Nova::Keypair(new version): In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.
OS::Heat::SoftwareConfig: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.

To tackle these challenges, we designed an approach that is both easy to implement and reproducible. First, we launch instances by provisioning master and worker nodes using the HOT template in OpenStack. Next, we set up a bootstrap node, install Git and Ansible, and run an Ansible playbook from the bootstrap node to configure the master and worker nodes, including SSH, host communication, and MPI setup. The outcome is a fully configured, ready-to-use HPC cluster, where users can simply SSH into the master node to run examples.

Users can view/use the template published in the Appliance Catalog: MPI+Spack Bare Metal Cluster. For example, a demonstration of how to pass parameters is available on Trovi.

Conclusion

In conclusion, this work demonstrates a reproducible approach to building and configuring MPI clusters on the Chameleon testbed. By using standardized images, Ansible automation, and Orchestration Templates, we ensure that every node is consistently set up, reducing manual effort and errors. The artifact, published on Trovi, makes the entire process transparent, reusable, and easy to implement, enabling users/researchers to reliably recreate and extend the cluster environment for their own experiments.

Future Work

Maintaining these images and possibly creating a script to reproduce MPI and Spack on a different image base environment.

Final Update(Mid-Term -> Final): MPI Appliance for HPC Research on Chameleon

Sun, 31 Aug 2025 00:00:00 +0000

Hi everyone! This is my final update, covering the progress made every two weeks from the midterm to the end of the project MPI Appliance for HPC Research on Chameleon, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community. This blog follows up on my earlier post, which you can find here.

🔧 July 29 – August 11, 2025

With the CUDA- and MPI-Spack–based appliances published, we considered releasing another image variant (ROCm-based) for AMD GPUs. This will be primarily used in CHI@TACC, which provides AMD GPUs. We have successfully published a new image on Chameleon titled MPI and Spack for HPC (Ubuntu 22.04 - ROCm), and we also added an example to demonstrate its usage.

🔧 August 12 – August 25, 2025

With the examples now available on Trovi for creating an MPI cluster using Ansible and Python-CHI, my next step was to experiment with stack orchestration using Heat Orchestration Templates (HOT) on OpenStack Chameleon Cloud. This turned out to be more challenging due to a few restrictions:

OS::Nova::Keypair (new version): In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.
OS::Heat::SoftwareConfig: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.

To address these issues, we adopted a new strategy for configuring and creating the MPI cluster: using a temporary bootstrap node.

In simple terms, the workflow of the Heat template is:

Provision master and worker nodes via the HOT template on OpenStack.
Launch a bootstrap node, install Git and Ansible on it, and then run an Ansible playbook from the bootstrap node to configure the master and worker nodes. This includes setting up SSH, host communication, and the MPI environment.

This provides an alternative method for creating an MPI cluster.

We presented this work on August 26, 2025, to the Chameleon Team and the Argonne MPICH Team. The project was very well received.

Stay tuned for my final report on this work, which I’ll be sharing in my next blog post.

Mid-Term Update: MPI Appliance for HPC Research on Chameleon

Sun, 03 Aug 2025 00:00:00 +0000

Hi everyone! This is my mid-term blog update for the project MPI Appliance for HPC Research on Chameleon, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community. This blog follows up on my earlier post, which you can find here.

🔧 June 15 – June 29, 2025

Worked on creating and configuring images on Chameleon Cloud for the following three sites: CHI@UC, CHI@TACC, and KVM@TACC.

Key features of the images:

Spack: Pre-installed and configured for easy package management of HPC software.
Lua Modules (LMod): Installed and configured for environment module management.
MPI Support: Both MPICH and Open MPI are pre-installed, enabling users to run distributed applications out-of-the-box.

These images are now publicly available and can be seen directly on the Chameleon Appliance Catalog, titled MPI and Spack for HPC (Ubuntu 22.04).

I also worked on some example Jupyter notebooks on how to get started using these images.

🔧 June 30 – July 13, 2025

With the MPI Appliance now published on Chameleon Cloud, the next step was to automate the setup of an MPI-Spack cluster.

To achieve this, I developed a set of Ansible playbooks that:

Configure both master and worker nodes with site-specific settings
Set up seamless access to Chameleon NFS shares
Allow users to easily install Spack packages, compilers, and dependencies across all nodes

These playbooks aim to simplify the deployment of reproducible HPC environments and reduce the time required to get a working cluster up and running.

🔧 July 14 – July 28, 2025

This week began with me fixing some issues in python-chi, the official Python client for the Chameleon testbed. We also discussed adding support for CUDA-based packages, which would make it easier to work with NVIDIA GPUs. We successfully published a new image on Chameleon, titled MPI and Spack for HPC (Ubuntu 22.04 - CUDA), and added an example to demonstrate its usage.

We compiled the artifact containing the Jupyter notebooks and Ansible playbooks and published it on Chameleon Trovi. Feel free to check it out here. The documentation still needs some work.

📌 That’s it for now! I’m currently working on the documentation, a ROCm-based image for AMD GPUs, and some container-based examples. Stay tuned for more updates in the next blog.

MPI Appliance for HPC Research on Chameleon

Sat, 14 Jun 2025 00:00:00 +0000

Hi Everyone,

I’m Rohan Babbar from Delhi, India. This summer, I’m excited to be working with the Argonne National Laboratory and the Chameleon Cloud community. My project focuses on developing an MPI Appliance to support reproducible High-Performance Computing (HPC) research on the Chameleon testbed.

For more details about the project and the planned work for the summer, you can read my proposal here.

👥 Community Bonding Period

Although the project officially started on June 2, 2025, I made good use of the community bonding period beforehand.

I began by getting access to the Chameleon testbed, familiarizing myself with its features and tools.
I experimented with different configurations to understand the ecosystem.
My mentor, Ken Raffenetti, and I had regular check-ins to align our vision and finalize our milestones, many of which were laid out in my proposal.

🔧 June 2 – June 14, 2025

Our first milestone was to build a base image with MPI pre-installed. For this:

We decided to use Spack, a flexible package manager tailored for HPC environments.
The image includes multiple MPI implementations, allowing users to choose the one that best suits their needs and switch between them using simple Lua Module commands.

📌 That’s all for now! Stay tuned for more updates in the next blog.

Thanks for reading!

Chameleon Trovi Support for Complex Experiment Appliances

Tue, 18 Feb 2025 00:00:00 +0000

Overview

The discoverability and accessibility of research artifacts remains a significant barrier to reproducibility in computer science research. While digital libraries index research papers, they rarely provide direct access to the artifacts needed to reproduce experiments, especially complex multi-node systems. Additionally, when artifacts are available, they often lack standardized metadata, versioning, and deployment mechanisms that would enable researchers to easily find and reuse them. This project addresses these challenges by extending Trovi, a repository of experimental artifacts executable on open platforms, to support complex, multi-node appliances, making sophisticated experimental environments discoverable, shareable, and deployable through a standardized interface - ultimately lowering the barriers to reproducing complex systems experiments.

Chameleon has historically enabled researchers to orchestrate complex appliances—large, multi-node clusters configured via OpenStack Heat—to conduct advanced experiments. Meanwhile, Chameleon team introduced Trovi as repository for open platforms (beyond Chameleon) that pioneers mechanisms for artifact and platform integration leading to immediate execution for pratical reproducibility. This project aims to bridge the two by adding support in Trovi for importing, discovering, and launching complex appliances. By integrating these capabilities, researchers will be able to one-click deploy complex appliances directly from the Trovi dashboard, archive them for future reference, and reproduce experiments on demand.

Key Outcomes

Extended Trovi API: Enable the import and management of complex appliances as artifacts.
Streamlined One-Click Launch: Integrate with Chameleon’s existing provisioning workflows so users can launch multi-node clusters directly from Trovi.
Enhanced Dashboard Experience: Provide UI assistance for discovering, reviewing, and customizing complex appliance artifacts.
Improved Artifact Reproducibility: Automate the process of exporting CC-snapshot images and other resources to ensure everything is preserved across sites (UC, TACC), highlighting any parameters that need user attention for cross-site portability.

Topics: Reproducible Research, Cloud Computing & Orchestration, OpenStack Heat, UI/UX & Web Development

Skills: Python, APIs, Cloud (OpenStack), DevOps & Automation, Frontend

Difficulty: Hard

Size: Large

Mentors: Mark Powers

Tasks:

Extensions to the Trovi API
- Add support for importing complex appliances as artifacts (including Heat templates, metadata, and associated disk images).
- Develop methods for tagging, versioning, and categorizing these appliances, making them easier to discover.
One-Click Launch of Complex Appliances
- Integrate with Chameleon’s orchestration engine, enabling single-click cluster deployments from the Trovi UI.
- Validate correct configuration and resource availability through automated checks.
Trovi Dashboard Enhancements
- Update the front-end to provide intuitive controls for customizing or parameterizing complex appliances before launching.
- Offer a clear workflow for reviewing dependencies, resource requirements, and usage instructions.
Automated Export & Multi-Site Testing
- Streamline the export of snapshots or images into Trovi as part of the appliance import process.
- Optionally re-run the imported appliances at multiple sites (UC, TACC), detecting any unparameterized settings or missing dependencies.

Contextualization – Extending Chameleon’s Orchestration for One-Click Experiment Deployment

Tue, 18 Feb 2025 00:00:00 +0000

Overview

Reproducibility in computer systems research is often hindered by the quality and completeness of artifact descriptions and the complexity of establishing experimental environments. When experiments involve multiple interconnected components, researchers struggle with hardcoded configurations, inadequate documentation of setup processes, and missing validation steps that would verify correct environment establishment. This project addresses these challenges by extending orchestration capabilities beyond basic hardware provisioning to include comprehensive contextualization—making complex, multi-component experimental environments deployable via parameterized templates with clear validation points, standardized metadata, and minimal user intervention—thus significantly reducing the barriers to reproducing complex distributed systems experiments.

Chameleon already provides powerful capabilities to orchestrate and configure resources through Heat templates (similar to Terraform) and the python-chi library. However, these focus primarily on provisioning (i.e., allocating and configuring hardware resources). This project goes a step further by addressing contextualization—the process of creating complete, ready-to-use experimental environments that incorporate everything from network layout to instance-level configuration and discovery—with additional features such as parameterized templates, experiment-level metadata, and output reporting.

Key Outcomes

Template-Based One-Click Launch: Users can deploy multi-resource experiments (VMs, networks, storage, etc.) via a single click or a minimal set of input parameters.
Enhanced Experiment Contextualization: Each launched resource can gain access to global “experiment-level” metadata (e.g., IP-to-hostname mappings for cluster authentication) and outputs that summarize important details.
Streamlined User Experience: An asynchronous deployment workflow that provides notifications and uses “outputs” to highlight critical connection information (e.g., bastion host IP, final results).
Optional Advanced Features: Partial reconfiguration to avoid full rebuilds when changes are minor, an “export” function to capture existing deployments into a new template, and potential publishing to Trovi for reproducibility and archiving.

Topics: Cloud Computing & Orchestration, Infrastructure as Code, DevOps & Automation, Reproducible Research Environments

Skills:

OpenStack & Heat Templates: Familiarity with provisioning resources on Chameleon using Heat or Terraform-like workflows.
Python & Scripting: For enhancing or extending the python-chi library.
Systems / Network Knowledge: Understanding multi-VM topologies, cluster configurations, and network-level interactions.
CI/CD & DevOps: Experience building or integrating asynchronous deployment and notifications.

Difficulty: Hard

Size: Large (suitable for a semester-long project or a summer internship)

Mentors: Paul Marshall

Tasks:

One-Click Template Launch
- Design a template (in Heat or similar) specifying multiple cloud resources (images, networks, disk images, SSH keys, etc.).
- Ensure the template author can define input parameters with defaults.
- Allow the user to launch the template quickly with default values or adjust parameters before deployment.
Asynchronous Provisioning & Notifications
- Implement a long-running process that deploys resources step-by-step.
- Provide status updates to the user (e.g., via UI notifications, email, or logs) when deployments complete or fail.
Experiment-Level Metadata
- Inject metadata such as IP-to-hostname mappings to each instance for easy cluster authentication.
- Allow the template to define “outputs” (like a public IP of a bastion or location of final results).
Partial Reconfiguration (Optional)
- Enable partial updates if only one of several servers changes, saving time and resources.
- Improve fault tolerance by avoiding full redeploys in the event of partial failures.
Export Running Configurations into a New Template (Optional)
- Build a web-interface or script to detect existing user-owned resources (servers, networks, etc.).
- Generate a proposed template from those resources, suggesting parameters (e.g., flavor, disk image, or SSH key).
- Extend or modify existing templates by adding discovered resources.
Integration with Trovi / Multi-Site Testing (Optional)
- Provide a method to archive or publish the final template (and associated disk images, data sets) in Trovi.
- Attempt to re-run the template at multiple Chameleon sites (e.g., UC, TACC) to identify parameters or modifications needed for cross-site reproducibility.

MPI Appliance for HPC Research on Chameleon

Tue, 18 Feb 2025 00:00:00 +0000

Overview

Message Passing Interface (MPI) is the dominant programming model for high-performance computing (HPC), enabling applications to scale efficiently across thousands of processing cores. In reproducibility initiatives for HPC research, MPI implementations are critical as they manage the complex communications that underpin parallel scientific applications. However, reproducing MPI-based experiments remains challenging due to the need for specific library versions, network configurations, and multi-node setups that must be precisely orchestrated.

The popularity of an “MPI cluster” as a base layer for many results in HPC caused support for MPI template and appliance to be specifically requested by the SC24 reproducibility chair to support the conference’s reproducibility initiative, providing researchers with standardized environments for validating results. By extending the work begun for SC24, this project aims to create higher-quality, ready-to-use, and maintainable MPI environments for the Chameleon testbed that abstracts away complex configuration details while ensuring consistent performance across experiments—thus making HPC experiments more accessible and reproducible for the broader research community.

You will lead efforts to configure disk images with the necessary MPI dependencies and provide orchestration templates that set up networking and instances automatically. The resulting appliance will allow researchers to quickly and consistently deploy distributed computing environments with MPI. The goal is to facilitate reproducible and scalable computational experiments for a wide range of scientific and engineering applications.

Key Outcomes

Ready-to-Use MPI Disk Images: Create one or more images pre-configured with the correct versions of MPI and dependencies, ensuring a consistent environment.
Simple Cluster Configuration Scripts: Provide scripts or playbooks that efficiently bring up a fully functional MPI cluster on Chameleon, abstracting away manual setup steps.
Orchestration Template: An automated workflow that sets up networks, instances, and additional resources needed to run large-scale MPI workloads.

Topics: High-Performance Computing (HPC), Cloud Computing, MPI & Distributed Systems, DevOps & Automation

Skills:

MPI & Parallel Programming: Understanding of MPI libraries, cluster configuration, and typical HPC workflows.
Cloud Orchestration: Familiarity with OpenStack Heat or other Infrastructure-as-Code (IaC) tools for provisioning resources.
Linux System Administration: Experience configuring and troubleshooting packages, network settings, and performance optimizations.
Scripting & Automation: Ability to write scripts (e.g., Bash, Python) to automate setup and deployment steps.

Difficulty: Moderate to Hard

Size: Medium

Mentor: Ken Raffenetti

Tasks

Disk Images with MPI Dependencies
- Build base images with the correct versions of MPI (e.g., MPICH, OpenMPI) and any required libraries (e.g., GCC, network libraries).
- Ensure all packages are up to date and tested for compatibility with Chameleon’s bare metal and/or VM environments.
Cluster Setup Scripts
- Develop lightweight scripts or Ansible playbooks that join new instances into an MPI cluster, configuring hostnames, SSH keys, and MPI runtime settings.
- Validate cluster functionality by running simple distributed “Hello World” tests and more advanced benchmarks (e.g., Intel MPI Benchmarks).
Orchestration Template
- Provide a Heat template (or similar) specifying the network configuration, instance counts, and environment variables for MPI.
- Enable easy parameterization of cluster size, disk images, and other variables so users can customize their setups on the fly.
Integration & Testing
- Document best practices for launching and using the MPI images in Chameleon.
- Demonstrate reproducibility with multiple cluster sizes and workloads to ensure reliability.

ML-Powered Problem Detection in Chameleon

Fri, 18 Oct 2024 00:00:00 +0000

Hello! My name is Syed Mohammad Qasim, a PhD candidate at the Department of Electrical and Computer Engineering, Boston University. This summer I worked on the project ML-Powered Problem Detection in Chameleon as part of the Summer of Reproducibility (SoR) program with the mentorship of Ayse Coskun and Michael Sherman.

Chameleon is an open testbed that has supported over 5,000 users working on more than 500 projects. It provides access to over 538 bare metal nodes across various sites, offering approximately 15,000 CPU cores and 5 petabytes of storage. Each site runs independent OpenStack services to deliver its offerings. Currently, Chameleon Cloud comprehensively monitors the sites at the Texas Advanced Computing Center (TACC) and the University of Chicago. Metrics are collected using Prometheus at each site and fed into a central Mimir cluster. All logs are sent to a central Loki, with Grafana used for visualization and alerting. Chameleon currently collects around 3,000 metrics. Manually reviewing and setting alerts for them is time-consuming and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service to augment the existing alerting framework.

Over the summer, we focused on analyzing the data and identified 33 key metrics, after discussions with Chameleon operators, from the Prometheus Node Exporter that serve as leading indicators of resource usage on the nodes. For example:

CPU usage: Metrics like node_load1, node_load5, and node_load15.
Memory usage: Including buffer utilization.
Disk usage: Metrics for I/O time, and read/write byte rates.
Network activity: Rate of bytes received and transmitted.
Filesystem metrics: Such as inode_utilization_ratio and node_procs_blocked.
System-level metrics: Including node forks, context switches, and interrupts.

Collected at a rate of every 5 minutes, these metrics provide a comprehensive view of node performance and resource consumption. After finalizing the metrics we wanted to monitor, we selected the following four anomaly detection methods, primarily due to their popularity in academia and recent publication in high-impact conferences such as SIG-KDD and SC.

Omni Anomaly, [KDD 2019] [without POT selection as it requires labels.]
USAD, [KDD 2020]
TranAD, [KDD 2022]
Prodigy, [SC 2023] [Only the VAE, not using their feature selection as it requires labels.]

We collected 75 days of healthy data from Chameleon, and after applying min-max scaling, we trained the models. We then used these models to run inference on the metrics collected during outages, as marked by Chameleon operators. The goal was to determine whether the outage data revealed something interesting or anomalous. We can verify our approach by manually reviewing the results generated by these four anomaly detection methods. Below are the results from the four methods on different outages, followed by an example of how these methods identified the root cause of an anomaly.

The above figure shows the percentage of outage data that was flagged as anomalous by different models.

The above two plots shows two examples of the top 5 metrics which contributed to the anomaly score by each anomaly detection model.

Although the methods seem to indicate anomalies during outages, they are not able to pinpoint the affected service or the exact cause. For example, the first partial authentication outage was due to a DNS error, which can manifest in various ways, such as reduced CPU, memory, or network usage. This work is still in progress, and we are conducting the same analysis on container-level metrics for each service, allowing us to narrow the scope to the affected service and more effectively identify the root cause of anomalies. We will share the next set of results soon.

Thanks for your time, please feel free to reach out to me for any details or questions.

ML-Powered Problem Detection in Chameleon

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I am Syed Mohammad Qasim, a PhD candidate in Electrical and Computer Engineering at Boston University. I will be spending my summer working on the project ML-Powered Problem Detection in Chameleon under the mentorship of Ayse Coskun and Michael Sherman.

Currently, Chameleon Cloud monitors sites at the Texas Advanced Computing Center (TACC), University of Chicago, Northwestern University, and Argonne National Lab. They collect metrics using Prometheus at each site and feed them all to a central Mimir cluster. All the logs go to a central Loki, and Grafana is used to visualize and set alerts. Chameleon currently collects around 3000 metrics. Manually reviewing and setting alerts on them is time-consuming and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service that can augment the existing alerting framework.

ML-Powered Problem Detection in Chameleon

Wed, 06 Mar 2024 16:33:57 -0600

Today’s Continuous Integration/Continuous Development (CI/CD) trends encourage rapid design of software using a wide range of software components, followed by frequent updates that are immediately deployed on the cloud. The complexity of cloud systems along with the component diversity and break-neck pace of development amplify the difficulty in identifying or fixing problems related to performance, resilience, and security. Furthermore, existing approaches that rely on human experts—e.g., methods involving manually-written rules/scripts—have limited applicability to modern CI/CD processes, as they are fragile, costly, and often not scalable. Consequently, there is growing interest in applying machine learning (ML) based methods for identifying vulnerabilities in code, non-compliant or otherwise problematic software, and resilience problems in systems and networks. However, despite some success stories in applying AI for cloud operations (e.g., in resource management), much of cloud operations still rely on human-centric methods, which require updates as the cloud undergoes CI/CD cycles. The goal of this summer project is to explore methods of automation for the Chameleon Cloud to enable faster detection and diagnosis of problems. Overall, the project will contribute to an overarching vision of building an infrastructure that collects and synthesizes cross-layer data from large-scale cloud systems, applying ML-powered methods to automate cloud ops, and, further, making this data available to researchers through coherent APIs and analytics engines.

Currently, Chameleon uses runbooks as manual guides for operational tasks, including routine maintenance and troubleshooting. However, these traditional runbooks often fall short in dynamic and fast-paced CI/CD environments, as they lack the flexibility to adapt to changes in software versions, deployment configurations, and the unique challenges of emerging issues. To overcome these challenges, the project will leverage ML to automate anomaly detection based on telemetry data collected from Chameleon Cloud’s monitoring frameworks. This method will not only facilitate rapid identification of performance anomalies but also enable automated generation of runbooks. These runbooks can then offer operators actionable steps to resolve issues efficiently, thereby making the anomaly mitigation process more efficient. Furthermore, this approach supports the automatic creation of targeted runbooks for newly generated support tickets, enhancing response times and system reliability.

Time-permitting, using a collection of automated runbooks (each targeting a specific problem), we will analyze support tickets, common problems, and their frequency to offer insights and suggestions to help roadmapping for Chameleon Cloud to offer the best return on investment on fixing problems.

A key aspect of this summer project is enhancing the reproducibility of experiments in the cloud and improving data accessibility. We plan to design infrastructures and APIs so that the telemetry data that is essential for anomaly detection and automated runbooks is systematically documented and made available. We also aim to collect and share insights and modules on applying ML for cloud operations, including ML pipelines, data labeling strategies, data preprocessing techniques, and feature engineering. By sharing these insights, we aim to promote best practices and support reproducible experiments on public clouds, thus fostering future ML-based practices within the Chameleon Cloud community and beyond. Time permitting, we will explore applying lightweight privacy-preserving approaches on telemetry data as well.

Topics: Machine Learning, Anomaly Detection, Automated Runbooks, Telemetry Data
Skills:
- Proficiency in Machine Learning: Understanding of ML algorithms for anomaly detection and automation.
- Cloud Computing Knowledge: Familiarity with CI/CD environments and cloud architectures.
- Programming Skills: Proficiency in languages such as Python, especially in cloud and ML contexts.
- Data Analysis: Ability to analyze telemetry data using data analytics tools and libraries.
Difficulty: Hard
Size: Large
Mentors: Michael Sherman