<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>experiment tracking | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/experiment-tracking/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/experiment-tracking/index.xml" rel="self" type="application/rss+xml"/><description>experiment tracking</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Thu, 18 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>experiment tracking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/experiment-tracking/</link></image><item><title>Final Report : Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/</link><pubDate>Thu, 18 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/</guid><description>&lt;h1 id="final-report-applying-mlops-to-overcome-reproducibility-barriers-in-ml">Final Report: Applying MLOps to Overcome Reproducibility Barriers in ML&lt;/h1>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Generating project" srcset="
/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_8793561656181f829e3597ae957831b0.webp 400w,
/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_c1f605866d28e52418a2120d1e90b899.webp 760w,
/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_8793561656181f829e3597ae957831b0.webp"
width="760"
height="447"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Hello! I’m Ahmed Alghali, and this is my final report for the project &lt;a href="https://ucsc-ospo.github.io/project/osre25/nyu/mlops/" target="_blank" rel="noopener">&lt;strong>Applying MLOps to Overcome Reproducibility Barriers in ML&lt;/strong>&lt;/a>, under the mentorship of Professor &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a>.&lt;/p>
&lt;p>This project aims to address the &lt;strong>reproducibility problem&lt;/strong> in machine learning—both in core ML research and in applications to other areas of science.&lt;/p>
&lt;p>The focus is on making large-scale ML experiments &lt;strong>reproducible on &lt;a href="https://www.chameleoncloud.org/" target="_blank" rel="noopener">Chameleon Cloud&lt;/a>&lt;/strong>. To do this, we developed &lt;a href="https://github.com/A7med7x7/ReproGen" target="_blank" rel="noopener">&lt;strong>ReproGen&lt;/strong>&lt;/a>, a template generator that produces ready-to-use, reproducible ML training workflows. The goal is to make the cloud easy for researchers setting up experiments, without the worry of stitching all the pieces together themselves.&lt;/p>
&lt;hr>
&lt;h2 id="progress-since-mid-report">Progress Since Mid-Report&lt;/h2>
&lt;h3 id="migration-from-cookiecutter-to-copier">Migration from Cookiecutter to Copier&lt;/h3>
&lt;p>We initially used &lt;a href="https://www.cookiecutter.io/" target="_blank" rel="noopener">Cookiecutter&lt;/a> as our templating engine, but it lacked features we needed (e.g., conditional questions). We switched to &lt;a href="https://copier.readthedocs.io/en/stable/" target="_blank" rel="noopener">Copier&lt;/a>, which provides more flexibility and better matches our use case.&lt;/p>
&lt;h3 id="support-for-multiple-setup-modes">Support for Multiple Setup Modes&lt;/h3>
&lt;p>We now offer &lt;strong>two setup modes&lt;/strong>, designed to serve both beginners and users who want advanced options/customization:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Basic Mode&lt;/strong> – minimal prompts (project name, repository link, framework).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advanced Mode&lt;/strong> – detailed control (compute site, GPU type, CUDA version, storage site, etc.).&lt;/p>
&lt;/li>
&lt;/ul>
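&lt;p>Copier&amp;rsquo;s conditional questions let the advanced prompts be skipped entirely in Basic Mode. A hypothetical &lt;code>copier.yml&lt;/code> fragment (the variable names here are ours for illustration, not necessarily ReproGen&amp;rsquo;s actual ones):&lt;/p>

```yaml
# copier.yml — illustrative fragment
setup_mode:
  type: str
  help: Choose a setup mode
  choices:
    - basic
    - advanced
  default: basic

# Advanced-only prompt: Copier skips it entirely when setup_mode is basic
gpu_type:
  type: str
  help: GPU vendor for the training image
  choices:
    - nvidia
    - amd
  default: nvidia
  when: "{{ setup_mode == 'advanced' }}"
```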
&lt;p>This ensures accessibility for new users while still enabling fine-grained control for advanced users.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="prompting" srcset="
/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_c0169673360dadfbcd30a72263676479.webp 400w,
/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_416b0bbcc859df3cd794d760ce0308c8.webp 760w,
/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_c0169673360dadfbcd30a72263676479.webp"
width="760"
height="448"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="automated-credential-generation">Automated Credential Generation&lt;/h3>
&lt;p>Previously, users had to manually generate application credentials via the Horizon OpenStack UI. Now, we provide scripts that generate two types of credentials programmatically—&lt;strong>Swift&lt;/strong> and &lt;strong>EC2&lt;/strong>—using &lt;strong>Chameleon JupyterHub credentials&lt;/strong> with &lt;code>python-chi&lt;/code> and the &lt;code>openstack-sdk&lt;/code> client.&lt;/p>
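&lt;p>A minimal sketch of the application-credential path with the OpenStack SDK (the function name and return shape are our illustration and assume a configured &lt;code>clouds.yaml&lt;/code> entry; ReproGen&amp;rsquo;s actual scripts may differ):&lt;/p>

```python
def make_app_credential(cloud_name, name):
    """Create an OpenStack application credential programmatically.

    Sketch only: requires the openstacksdk package and credentials for
    the target cloud (e.g. a clouds.yaml entry named `cloud_name`).
    """
    import openstack  # deferred import so the sketch reads standalone

    conn = openstack.connect(cloud=cloud_name)
    cred = conn.identity.create_application_credential(
        user=conn.current_user_id, name=name
    )
    # The secret is only returned once, at creation time; save it now.
    return {"id": cred.id, "secret": cred.secret}
```

&lt;p>The returned id/secret pair can then feed other tools (e.g. rclone or the OpenStack CLI) without ever opening Horizon.&lt;/p>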
&lt;h3 id="automatic-readmemd-generation">Automatic README.md Generation&lt;/h3>
&lt;p>Each generated project includes a &lt;strong>customized README.md&lt;/strong> containing setup guidance and commands tailored to the user’s configuration.&lt;/p>
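&lt;p>Since Copier renders templates with Jinja, the README can branch on the user&amp;rsquo;s answers. A hypothetical fragment (the variable names are ours, not necessarily ReproGen&amp;rsquo;s):&lt;/p>

```jinja
# README.md.jinja — hypothetical fragment
# {{ project_name }}

## Launch the virtual lab
{% if gpu_type == "nvidia" %}
docker compose --profile nvidia up -d
{% else %}
docker compose --profile amd up -d
{% endif %}
```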
&lt;h3 id="bug-fixes-and-ux-enhancements">Bug Fixes and UX Enhancements&lt;/h3>
&lt;p>Alongside major features, we implemented numerous smaller changes and fixes to improve the reliability and user experience of the tool.&lt;/p>
&lt;hr>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://github.com/A7med7x7/ReproGen" target="_blank" rel="noopener">&lt;strong>ReproGen GitHub Repository&lt;/strong>&lt;/a>: source code for the template generator.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/A7med7x7/ReproGen/tree/mlflow-replay" target="_blank" rel="noopener">&lt;strong>mlflow-replay branch&lt;/strong>&lt;/a>: explore a past experiment, artifacts, and logged insights.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/A7med7x7/ReproGen/tree/training-demo" target="_blank" rel="noopener">&lt;strong>LLM-Demo branch&lt;/strong>&lt;/a>: hands-on demo to track fine-tuning of an LLM using infrastructure generated by ReproGen.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Compatibility Matrix&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The tool and the generated setup both depend on software whose compatibility must be tracked at every level: hardware, OS, drivers, computing platforms, and core and third-party libraries. As a first step, we are writing documentation of these constraints to help future debugging and to allow adding pieces without breaking what is already there.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Maintain Docker Images&lt;/strong>&lt;/p>
&lt;p>So far we have CPU and GPU Docker images for the most frequently used frameworks:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>CPU image&lt;/strong>: for data science workloads (Scikit-Learn)&lt;/li>
&lt;li>&lt;strong>GPU NVIDIA variant&lt;/strong>: for deep learning workloads on NVIDIA machines (PyTorch, Lightning, TensorFlow)&lt;/li>
&lt;li>&lt;strong>GPU AMD variant&lt;/strong>: for deep learning workloads on AMD machines (PyTorch, Lightning, TensorFlow)&lt;/li>
&lt;li>Adding more variants for additional frameworks and enhancing the experience of the existing images is recommended.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>When I first joined SoR 2025, I had trouble crystallizing how I could practically achieve reproducibility and package a tool that would maximize the chance of reproducing an experiment built with it. Throughout the journey my mentors took me under their wings and helped me understand the &lt;strong>reproducibility challenges in ML&lt;/strong>. My mentor Professor &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a> wrote materials that saved me a lot of time in familiarizing myself with the &lt;a href="chameleoncloud.org">testbed&lt;/a> and important Linux tools and commands, and even gave me hands-on practice with how &lt;a href="https://teaching-on-testbeds.github.io/mltrain-chi/" target="_blank" rel="noopener">large model training&lt;/a> with an MLflow tracking server is done in the cloud. &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a> took the time to review my presentation and pushed me to do my best. I&amp;rsquo;m forever thankful for the way they shaped the project and my personal growth. This hands-on experience helped me view &lt;strong>MLOps, cloud APIs, and workflow design&lt;/strong> through different lenses, and I’m proud to have contributed a tool that can help simplify reproducible research for others.&lt;/p></description></item><item><title>Midterm Report : Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/</link><pubDate>Wed, 30 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/</guid><description>&lt;h3 id="refresher-about-the-project">Refresher about the Project&lt;/h3>
&lt;p>Hi everyone! For the last month I have been working with my mentors, Professor &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a>, on our project &lt;a href="https://ucsc-ospo.github.io/project/osre25/nyu/mlops/" target="_blank" rel="noopener">Applying MLOps to overcome reproducibility barriers in machine learning research&lt;/a>. As a refresher, our goal is to build a template generator for reproducible machine learning training workflows on the Chameleon testbed. We want to provide our users with the necessary environment configuration in a handy way, so they won&amp;rsquo;t be overwhelmed by all the intricate details of setting up the environment. This will allow for validation and further development of their setup.&lt;/p>
&lt;hr>
&lt;h3 id="what-we-have-done-so-far">What we have done so far&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="userflow" srcset="
/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_d0aee31c44beeded617d15565a3078b7.webp 400w,
/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_23aab3e41951725ceb2ba1683e8a5455.webp 760w,
/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_d0aee31c44beeded617d15565a3078b7.webp"
width="760"
height="307"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The current workflow begins in JupyterHub, where the user provides basic details such as project name, site, and node type. The notebooks handle key setup tasks, like creating storage buckets, provisioning and configuring a server with GPU support, and mounting buckets locally via rclone. Once the host environment is ready, the user SSHes into that machine, generates the necessary variables via a script, and launches a containerized virtual lab that integrates Jupyter and MLflow. Inside the container, users authenticate with GitHub, connect or initialize their repositories, and can immediately begin training models, with all metrics, artifacts, and environment details logged for reproducibility.&lt;/p>
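&lt;p>For the bucket-mounting step, rclone can talk to the object store through a Swift remote backed by an application credential. A hedged sketch (the remote name, auth URL, and bucket below are placeholders for illustration, not the project&amp;rsquo;s actual values):&lt;/p>

```ini
# ~/.config/rclone/rclone.conf — illustrative fragment
[chi_object_store]
type = swift
auth_version = 3
# Keystone auth URL for the chosen site (placeholder)
auth = https://SITE_AUTH_URL/v3
# Replace with the generated application credential
application_credential_id = APP_CRED_ID
application_credential_secret = APP_CRED_SECRET
```

&lt;p>followed by something like &lt;code>rclone mount chi_object_store:my-bucket /mnt/data --daemon --allow-other&lt;/code> on the host.&lt;/p>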
&lt;p>The progress on the project so far is as follows:&lt;/p>
&lt;h4 id="we-finalized-the-selection-of-frameworks-and-storage-options">We finalized the selection of frameworks and storage options.&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="results" srcset="
/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_cc94f8d2983a972d5d551a1fd1b51c86.webp 400w,
/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_bd2f06761e3836b650d87a84b3ed4d00.webp 760w,
/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_cc94f8d2983a972d5d551a1fd1b51c86.webp"
width="760"
height="346"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Artifacts are now logged directly from the MLflow server to the Chameleon object store, without relying on a database backend or an intermediate MinIO S3 layer.&lt;/p>
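&lt;p>One plausible shape for the server launch, assuming the object store is reached through its S3-compatible interface (the endpoint, credentials, paths, and bucket name here are placeholders, not the project&amp;rsquo;s actual values):&lt;/p>

```shell
# Illustrative only — replace the placeholders with real values.
export MLFLOW_S3_ENDPOINT_URL="https://OBJECT_STORE_S3_ENDPOINT"
export AWS_ACCESS_KEY_ID="EC2_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="EC2_SECRET_KEY"

# --artifacts-destination makes the server proxy artifacts straight to the
# object store, with a plain file store for run metadata (no database).
mlflow server \
  --host 0.0.0.0 --port 8000 \
  --backend-store-uri file:///home/cc/mlflow-runs \
  --artifacts-destination s3://my-experiment-artifacts
```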
&lt;h4 id="different-jupyter-lab-images-for-each-framework">Different jupyter lab images for each framework.&lt;/h4>
&lt;p>We’ve started with the top ML frameworks — PyTorch Lightning, Keras/TensorFlow, and Scikit-Learn. Each framework now has its own image, which will later be tailored to the user’s selection.&lt;/p>
&lt;h4 id="github-cli-and-hugging-face-integration-inside-the-container">Github CLI and Hugging Face integration inside the container.&lt;/h4>
&lt;p>The Jupyter container now integrates both the GitHub CLI and Hugging Face authentication. Users can manage their code repositories via GitHub CLI commands and authenticate with Hugging Face tokens to download/upload models and datasets. This eliminates the need for manual credential setup and streamlines ML experimentation within the environment.&lt;/p>
&lt;h4 id="custom-logging-utility">Custom Logging Utility&lt;/h4>
&lt;p>To ensure robust tracking of code versioning and environment details, we added a custom logging utility.&lt;br>
These logs are stored alongside metrics and model artifacts in MLflow, ensuring every experiment is fully documented and reproducible. A summary of the functionalities:&lt;/p>
&lt;hr>
&lt;h5 id="log_git--captures-code-versioning">&lt;code>log_git()&lt;/code> — Captures Code Versioning&lt;/h5>
&lt;p>Uses Git commands (via subprocess) to log:&lt;/p>
&lt;ul>
&lt;li>Current branch name&lt;/li>
&lt;li>Commit hash&lt;/li>
&lt;li>Repository status (clean or dirty)&lt;/li>
&lt;/ul>
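&lt;p>A minimal sketch of what such a helper might look like (the function names and the logging callback are illustrative, not necessarily the project&amp;rsquo;s actual API):&lt;/p>

```python
import subprocess

def _git(args):
    """Run a git command; return stripped stdout, or None if it fails."""
    try:
        out = subprocess.run(
            ["git", *args], capture_output=True, text=True, check=True
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None

def summarize_status(porcelain):
    """Turn `git status --porcelain` output into a clean/dirty summary."""
    changed = [line for line in porcelain.splitlines() if line.strip()]
    if not changed:
        return "clean"
    plural = "s" if len(changed) > 1 else ""
    return f"dirty ({len(changed)} file{plural} modified)"

def log_git(log_param):
    """Log branch, commit, and status via a callback such as mlflow.log_param."""
    branch = _git(["rev-parse", "--abbrev-ref", "HEAD"])
    commit = _git(["rev-parse", "--short", "HEAD"])
    porcelain = _git(["status", "--porcelain"])
    if branch is not None:
        log_param("branch", branch)
    if commit is not None:
        log_param("commit", commit)
    if porcelain is not None:
        log_param("status", summarize_status(porcelain))
```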
&lt;p>&lt;strong>Example Output:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">commit: a7c3e9d
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">branch: main
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">status: dirty (1 file modified)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># and git diff output
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h5 id="log_python-tracks-the-python-environment">&lt;code>log_python()&lt;/code>— Tracks the Python Environment&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>Platform information + Python environment info (version)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Exports a full pip freeze list to a .txt file&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Saved as an MLflow artifact to guarantee exact package version reproducibility&lt;/p>
&lt;/li>
&lt;/ul>
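&lt;p>This step can be sketched with the standard library alone (names and file layout are illustrative; the actual utility hands the file to &lt;code>mlflow.log_artifact&lt;/code>):&lt;/p>

```python
import platform
from importlib import metadata

def platform_info():
    """Platform + Python interpreter details for the run record."""
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "machine": platform.machine(),
    }

def freeze():
    """A pip-freeze-style listing built from installed distributions."""
    lines = sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    )
    return "\n".join(lines)

def log_python(out_path="requirements_frozen.txt"):
    """Write the environment snapshot to a text file; the file can then
    be attached to the run (e.g. via mlflow.log_artifact(out_path))."""
    with open(out_path, "w") as f:
        for key, value in platform_info().items():
            f.write(f"# {key}: {value}\n")
        f.write(freeze() + "\n")
    return out_path
```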
&lt;p>Example Output (pip freeze extract):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-txt" data-lang="txt">&lt;span class="line">&lt;span class="cl">numpy==1.26.4
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">pandas==2.2.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">scikit-learn==1.4.2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">torch==2.2.0
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h5 id="log_gpu---records-gpu-information">&lt;code>log_gpu()&lt;/code> - Records GPU Information&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>Detects available GPU devices&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Collects details using NVIDIA’s pynvml or AMD’s ROCm tools&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Logs GPU name, driver version, and CUDA/ROCm version&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Captures the matching &lt;code>nvidia-smi&lt;/code> or &lt;code>rocm-smi&lt;/code> output for deeper inspection&lt;/p>
&lt;/li>
&lt;/ul>
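&lt;p>On the NVIDIA side, the detection step can be sketched with pynvml (a hedged illustration; the actual utility also covers the AMD/ROCm path):&lt;/p>

```python
def gpu_info():
    """Best-effort NVIDIA GPU inventory via pynvml.

    Returns an empty list when pynvml or an NVIDIA driver is unavailable,
    so the logger degrades gracefully on CPU-only or AMD nodes.
    """
    try:
        import pynvml
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []
    try:
        driver = pynvml.nvmlSystemGetDriverVersion()
        if isinstance(driver, bytes):  # older pynvml releases return bytes
            driver = driver.decode()
        gpus = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):
                name = name.decode()
            gpus.append({"index": i, "name": name, "driver": driver})
        return gpus
    except pynvml.NVMLError:
        return []
    finally:
        pynvml.nvmlShutdown()
```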
&lt;hr>
&lt;p>These utilities ensure that each run can be traced back with:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The exact code version&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The full Python environment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The hardware details used&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="initial-customizable-template">Initial customizable template&lt;/h3>
&lt;p>We’ve prototyped an initial customizable template using Cookiecutter. It provides an interactive CLI where users supply key project details (e.g., project name, frameworks, GPU type, and integrations if any). Cookiecutter then generates a ready-to-use project structure with pre-configured integrations, reducing manual setup and ensuring consistency across environments.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="template generator"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/generator.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The user will have notebooks to communicate with Chameleon testbed resources, a containerized environment, and custom training scripts to plug their code into.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="emelents" srcset="
/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_d71a4a6bed166f1ba25e0480abe6d891.webp 400w,
/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_0451200eb97ac154443b7261da58399a.webp 760w,
/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_d71a4a6bed166f1ba25e0480abe6d891.webp"
width="760"
height="262"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="whats-next">What’s Next&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Template Generation via Config + interactive widgets&lt;/strong>&lt;br>
We are exploring different ways to generate experiment templates using configuration files and interactive widgets in Jupyter notebooks. This would let users quickly customize logging setups and is considered more user-friendly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>AMD-Compatible Images&lt;/strong>&lt;br>
Extend support by building and testing Docker images optimized for AMD GPUs. Up to now, our development efforts have focused on NVIDIA GPUs using CUDA-based images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>End-to-End Lifecycle Example&lt;/strong>&lt;br>
Provide a larger example demonstrating the entire ML workflow:&lt;/p>
&lt;ul>
&lt;li>Data preparation&lt;/li>
&lt;li>Training with GPU logging&lt;/li>
&lt;li>Tracking metrics, artifacts, and environment info in MLflow&lt;/li>
&lt;li>Model evaluation and logging&lt;/li>
&lt;li>Reproducing results on different hardware backends&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Working on this project so far has been both challenging and eye-opening. I’ve seen how many moving parts need to come together for a smooth workflow. The support from my mentors has been key in helping me turn challenges into real progress.&lt;/p>
&lt;p>Thank you for following along — I’m looking forward to sharing more concrete results soon.&lt;/p></description></item><item><title>Applying MLOps to overcome reproducibility barriers in machine learning research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/</link><pubDate>Sun, 22 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/</guid><description>&lt;h3 id="about-the-project">About the Project&lt;/h3>
&lt;p>Hello! I&amp;rsquo;m Ahmed, an undergraduate Computer Science student at the University of Khartoum. I&amp;rsquo;m working on making machine learning research more reproducible for open-access research facilities like the &lt;a href="chameleoncloud.org">Chameleon testbed&lt;/a>, under the project &lt;a href="https://ucsc-ospo.github.io/project/osre25/nyu/mlops/" target="_blank" rel="noopener">Applying MLOps to overcome reproducibility barriers in machine learning research&lt;/a>, mentored by Prof. &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a>. As part of this project, my &lt;a href="https://docs.google.com/document/d/146PutdVy7cWSf_Gn8qcn0Ba2llMHjNtHIQzZ5a-xRvQ/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> aims to build a template generator that generates repositories for reproducible model training on the Chameleon testbed.&lt;/p>
&lt;h3 id="reproducibility">Reproducibility&lt;/h3>
&lt;blockquote>
&lt;p>&lt;em>We argue that unless reproducing research becomes as vital and mainstream part of scientific exploration as reading papers is today, reproducibility will be hard to sustain in the long term because the incentives to make research results reproducible won’t outweigh the still considerable costs&lt;/em>&lt;/p>
&lt;p>— &lt;a href="https://www.chameleoncloud.org/media/filer_public/25/18/25189b96-c3a2-4a55-b99b-c25322fe6682/reproducibility_on_chameleon-3.pdf" target="_blank" rel="noopener">Three Pillars of Practical Reproducibility Paper&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Acadamic code quality" srcset="
/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_1883a01619446991471adb625dc1a04c.webp 400w,
/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_a0629a8267968adb7dca83065a454987.webp 760w,
/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_1883a01619446991471adb625dc1a04c.webp"
width="733"
height="646"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>By reproducibility in science we refer to the ability to obtain consistent results using the same methods and conditions as a previous study. In simple words: if I use the same data and methodology that were used before, I should obtain the same results. This principle applies to almost every scientific field, including both machine learning applied to science and core machine learning research.&lt;/p>
&lt;h3 id="challenges-in-reproducibility">Challenges in Reproducibility&lt;/h3>
&lt;p>Just as the famous paper on the &lt;a href="https://www.nature.com/articles/d41586-019-00067-3" target="_blank" rel="noopener">reproducibility crisis in science&lt;/a> was published in 2016, similar discussions have appeared in the machine learning research setting. The paper &lt;a href="https://ojs.aaai.org/index.php/AAAI/article/view/11503" target="_blank" rel="noopener">State of the Art: Reproducibility in Artificial Intelligence&lt;/a>, after analyzing 400 papers from top AI conferences, found that only around 6% shared code and approximately 33% shared test data; in contrast, 54% shared only pseudocode (a summary of the algorithm).&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Percentage of papers documenting each variable for the three factors" srcset="
/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_98f72f91d5f4040ac93d46a70ece1f4c.webp 400w,
/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_af62a4672817798441065a29b632ce1d.webp 760w,
/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_98f72f91d5f4040ac93d46a70ece1f4c.webp"
width="760"
height="312"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The lack of software dependency management, proper version control, log tracking, and effective artifact sharing has made it very difficult to reproduce machine learning research.&lt;/p>
&lt;p>Reproducibility in machine learning is largely supported by MLOps practices. This is the case in industry, where most researchers are backed by software engineers who set up experimental environments or develop tools that streamline the workflow. In academic settings, however, reproducibility remains a great challenge: researchers prefer to focus on coding and worry little about the complexities involved in configuring their experimental environment. As a result, the adoption and standardization of MLOps practices in academia progress slowly. The best way to ensure a seamless experience with MLOps is to make these capabilities easily accessible within the researchers&amp;rsquo; workflow, by developing a tool that streamlines provisioning resources, environment setup, model training, and artifact tracking, and that ensures reproducible results.&lt;/p>
&lt;h3 id="proposed-solution">Proposed Solution&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Solution Architecture" srcset="
/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_eca83abc0b11e0d295efffaa464eaf53.webp 400w,
/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_4abd128ad260ffc60e4a7ebd623e4e32.webp 760w,
/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_eca83abc0b11e0d295efffaa464eaf53.webp"
width="760"
height="547"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We want researchers to spin up ML research instances/bare metal on the Chameleon testbed while keeping the technical complexity of configuring and stitching everything together abstracted away. Users simply answer a few questions about their project info, frameworks, tools, features, and integrations (if any), and receive a fully generated, reproducible project. It contains a provisioning/infrastructure configuration layer for provisioning resources on the cloud; a Dockerfile to spin up services and persistent storage for data; and an ML tracking server system that logs artifacts, metadata, environment configuration, system specifications (GPU type), and Git status using MLflow, powered by PostgreSQL for storing metadata and a MinIO S3 bucket for storing artifacts. At its core, the ML code runs in a containerized training environment backed by persistent storage for the datasets and the artifacts generated from experiments; containerizing all of these ensures reproducibility. We aim to make the cloud experience easier by handling the configuration needed to set up the environment with a third-party framework, so that benchmark datasets and other necessary components from services like Hugging Face and GitHub are easily accessible from the container. For more technical details about the solution, you can read my proposal &lt;a href="https://docs.google.com/document/d/1ilm-yMEq-UTiJPGMl8tQc3Anl5cKM5RD2sUGInLjLbU" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
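&lt;p>The tracking stack described above could be wired together with a Compose file along these lines (a sketch with illustrative service names and throwaway credentials; the stock MLflow image may additionally need the psycopg2 driver installed):&lt;/p>

```yaml
# docker-compose.yml — illustrative sketch
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow
      POSTGRES_DB: mlflow
    volumes:
      - pgdata:/var/lib/postgresql/data

  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: minio123
    volumes:
      - miniodata:/data

  mlflow:
    image: ghcr.io/mlflow/mlflow
    command: mlflow server --host 0.0.0.0
      --backend-store-uri postgresql://mlflow:mlflow@postgres/mlflow
      --artifacts-destination s3://mlflow-artifacts
    environment:
      MLFLOW_S3_ENDPOINT_URL: http://minio:9000
      AWS_ACCESS_KEY_ID: minio
      AWS_SECRET_ACCESS_KEY: minio123
    ports:
      - "5000:5000"
    depends_on: [postgres, minio]

volumes:
  pgdata:
  miniodata:
```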
&lt;p>By addressing these challenges we can accelerate scientific discovery. This benefits not only those conducting the research but also the ones building on top of it in the future. I look forward to sharing more updates as the project progresses, and I welcome feedback from others interested in advancing reproducibility in ML research.&lt;/p></description></item></channel></rss>