<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Fraida Fund | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/index.xml" rel="self" type="application/rss+xml"/><description>Fraida Fund</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/avatar_hu6e2d1061c49df606882c6c830bd8f692_83777_270x270_fill_q75_lanczos_center.jpg</url><title>Fraida Fund</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/</link></image><item><title>Applying MLOps to overcome reproducibility barriers in machine learning research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</link><pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> machine learning, MLOps, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, machine learning, GitOps, systems, Linux, data, Docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Reproducibility remains a significant problem in machine learning research, both in core ML and in the application of ML to other areas of science. In many cases, due to inadequate experiment tracking, dependency capturing, source code versioning, data versioning, and artifact sharing, even the authors of a paper may find it challenging to reproduce their own study several years later. This makes it difficult to validate and build on previous work, and raises concerns about its trustworthiness.&lt;/p>
&lt;p>In contrast, outside of academic research, MLOps tools and frameworks have been identified as a key enabler of reliable, reproducible, and trustworthy machine learning systems in production. A good reference on this topic is:&lt;/p>
&lt;blockquote>
&lt;p>Firas Bayram and Bestoun S. Ahmed. 2025. Towards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach. ACM Comput. Surv. 57, 5, Article 121 (May 2025), 35 pages. &lt;a href="https://doi.org/10.1145/3708497" target="_blank" rel="noopener">https://doi.org/10.1145/3708497&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;p>This project seeks to bridge the gap between widely adopted practices in industry and academic research:&lt;/p>
&lt;ul>
&lt;li>by making it easier for researchers and scientists to use MLOps tools to support reproducibility. To achieve this, we will develop starter templates and recipes for research in computer vision, NLP, and ML for science, that have reproducibility &amp;ldquo;baked in&amp;rdquo; thanks to the integration of MLOps tools and frameworks. Researchers will launch these templates on open access research facilities like &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>.&lt;/li>
&lt;li>and, by developing complementary education and training materials to emphasize the importance of reproducibility in ML, and how the tools and frameworks used in the starter templates can support this goal.&lt;/li>
&lt;/ul>
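&lt;p>As a minimal sketch of what reproducibility &amp;ldquo;baked in&amp;rdquo; can mean at the lowest level, a starter template might record run metadata (seed, parameters, source version, environment) before any dedicated MLOps framework is layered on. The pure-Python example below illustrates the idea; the file name and recorded fields are illustrative assumptions, not a proposed design.&lt;/p>

```python
# Minimal sketch of run-metadata capture, of the kind a starter template
# might "bake in". All file names and fields are illustrative assumptions.
import json
import platform
import random
import subprocess
import sys
import time

def snapshot_run(params: dict, out_path: str = "run_metadata.json") -> dict:
    """Record enough context to re-run this experiment later."""
    seed = params.get("seed", 0)
    random.seed(seed)  # fix randomness up front so the run is repeatable
    try:
        # Source version, if the experiment lives in a git repository.
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"
    meta = {
        "params": params,
        "seed": seed,
        "git_commit": commit,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "started_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with open(out_path, "w") as f:
        json.dump(meta, f, indent=2)  # artifact to archive alongside results
    return meta

meta = snapshot_run({"seed": 42, "lr": 0.01})
```

&lt;p>A real template would replace this hand-rolled record with the experiment tracking and data versioning tools that the project will integrate.&lt;/p>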
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should -&lt;/p>
&lt;ul>
&lt;li>demonstrate a good understanding of the current barriers to reproducibility in machine learning research (specific examples are welcome),&lt;/li>
&lt;li>describe a &amp;ldquo;base&amp;rdquo; starter template, including the platforms and tools that will be integrated, as well as specific adaptations of this template for computer vision, NLP, and ML for science,&lt;/li>
&lt;li>explain the &amp;ldquo;user flow&amp;rdquo; - how a researcher would use the template to conduct an experiment or series of experiments, what the lifecycle of that experiment would look like, and how it would be made reproducible,&lt;/li>
&lt;li>include the contributor&amp;rsquo;s own ideas about how to make the starter templates more usable, and how to make the education and training materials relatable and useful,&lt;/li>
&lt;li>and show that the contributor has the necessary technical background and soft skills to contribute to this project. In particular, the contributor will need to create education and training materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>GitHub link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p></description></item><item><title>Data leakage in applied ML: reproducing examples of irreproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> applied machine learning, data leakage, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Data leakage &lt;a href="https://www.cell.com/patterns/pdfExtended/S2666-3899%2823%2900159-9" target="_blank" rel="noopener">has been identified&lt;/a> as a major cause of irreproducibility of a paper&amp;rsquo;s findings, when machine learning techniques are applied to problems in science. Data leakage includes errors such as:&lt;/p>
&lt;ul>
&lt;li>pre-processing before splitting into training/test sets&lt;/li>
&lt;li>feature selection before splitting into training/test sets&lt;/li>
&lt;li>duplicated data points in both training and test sets&lt;/li>
&lt;li>temporal leakage (e.g. shuffled K-fold cross validation with temporal data)&lt;/li>
&lt;li>group leakage (e.g. shuffled K-fold cross validation with data that has group structure)&lt;/li>
&lt;/ul>
&lt;p>and leads to an overly optimistic evaluation of model performance, such that the finding may no longer be the same when the error is corrected.&lt;/p>
&lt;p>Despite the seriousness of this problem, data leakage is often not covered in introductory machine learning courses, and many users of machine learning across varied science domains are unaware of it. Even those who have learned &amp;ldquo;rules&amp;rdquo; for avoiding data leakage (e.g. &amp;ldquo;never do feature selection on the test set&amp;rdquo;) may not understand the reasons for these &amp;ldquo;rules&amp;rdquo;, and how important they are for ensuring that the final result is valid and reproducible.&lt;/p>
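&lt;p>To make the problem concrete, one of the leakage types above - duplicated data points in both training and test sets - can be demonstrated on synthetic data in a few lines of plain Python. The 1-nearest-neighbour &amp;ldquo;model&amp;rdquo; and toy data points below are purely illustrative.&lt;/p>

```python
# Toy illustration of duplicate-row leakage on synthetic data.
# A 1-nearest-neighbour "model" memorises training points, so any test
# point that also appears in training is scored perfectly.

def predict_1nn(train, test_x):
    """Label of the training point whose feature is closest to test_x."""
    return min(train, key=lambda p: abs(p[0] - test_x))[1]

def accuracy(train, test):
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

# Synthetic points: (feature, label). Labels alternate, so they are not
# actually predictable from the feature.
data = [(0.0, 0), (1.0, 1), (2.0, 0), (3.0, 1), (4.0, 0), (5.0, 1)]

leaky_train = data       # every test point also appears in training
clean_train = data[:3]   # proper split: disjoint train/test
test = data[3:]

print(accuracy(leaky_train, test))  # 1.0: perfect, but meaningless
print(accuracy(clean_train, test))  # about 0.33: the honest estimate
```

&lt;p>Because the memorised points reappear in the test set, the leaky evaluation reports perfect accuracy even though the labels carry no learnable signal.&lt;/p>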
&lt;p>The goal of this project is to create &lt;em>learning materials&lt;/em> demonstrating how instances of data leakage invalidate a result. These materials should be easily adoptable by instructors teaching machine learning in a wide variety of contexts, including those teaching a non-CS audience. To achieve this, the project proposes to re-implement published results that have been affected by data leakage, and package these implementations along with supporting material in a format suitable for use in classrooms and by independent learners. For each &amp;ldquo;irreproducible result&amp;rdquo;, the &amp;ldquo;package&amp;rdquo; should include -&lt;/p>
&lt;ul>
&lt;li>a re-implementation of the original result&lt;/li>
&lt;li>an explanation of the data leakage problem affecting the result, with an implementation of a &amp;ldquo;toy example&amp;rdquo; on synthetic data&lt;/li>
&lt;li>a re-implementation of the result without the data analysis error, to show how the finding is affected&lt;/li>
&lt;li>and examples of exam or homework questions that an instructor adopting this package may use to assess understanding.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should include, for at least a few &amp;ldquo;types&amp;rdquo; of data leakage mentioned above -&lt;/p>
&lt;ul>
&lt;li>a specific published result that could be used as an exemplar (you may find ideas among the review papers listed &lt;a href="https://reproducible.cs.princeton.edu/#rep-failures" target="_blank" rel="noopener">here&lt;/a>)&lt;/li>
&lt;li>a brief description of the details of the experiment that will reproduce that result (e.g. what data is used, what machine learning technique is used, what are the hyperparameters used for training)&lt;/li>
&lt;li>and an explanation of why this result is suitable for this use (it uses a publicly available dataset, a machine learning technique that is familiar and accessible to students in an introductory course, the paper has sufficient detail to reproduce the result, etc.)&lt;/li>
&lt;/ul>
&lt;p>The contributor will need to create learning materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/p>
&lt;p>&lt;strong>GitHub link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p>
&lt;p>To get a sense of the type of code you would be writing, here is an example of a learning module related to data leakage (however, it is not in the format described above): &lt;a href="https://colab.research.google.com/github/ffund/ml-notebooks/blob/master/notebooks/4-linear-regression-case-study-part-2.ipynb" target="_blank" rel="noopener">Beauty in the Classroom&lt;/a>&lt;/p>
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Packages&amp;rdquo; of learning materials for teaching about common types of data leakage&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifacts for &amp;ldquo;playing back&amp;rdquo; each of the &amp;ldquo;packages&amp;rdquo;&lt;/li>
&lt;/ul></description></item><item><title>Evaluating congestion controls past and future</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/congestion-control/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/congestion-control/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computer networks, congestion control, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Bash scripting, Linux, computer network performance evaluation&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ashutosh-srivastava/">Ashutosh Srivastava&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In computer networks, congestion control protocols play an outsize role in determining our experience with networked applications. New congestion control algorithms are regularly proposed by researchers to improve throughput and latency performance, adapt to new types of networks, and align more closely with the needs of new applications.&lt;/p>
&lt;p>However, our understanding of the benefits of a new congestion control protocol depends to a large extent on the evaluation - the network topology, the network delay and throughput, the type of flow, the type of competing traffic - and there is no single standard way to evaluate a congestion control protocol. The &lt;a href="https://pantheon.stanford.edu/static/pantheon/documents/pantheon-paper.pdf" target="_blank" rel="noopener">Pantheon&lt;/a> project (which is no longer supported) sought to fill this gap somewhat and address the problem of reproducibility of congestion control results, but its approach is not easily adapted to evaluation scenarios representative of new types of applications or networks. Nor is it capable of representing the evaluation scenarios in most published results related to congestion control.&lt;/p>
&lt;p>The goal of this project, therefore, is to create an evaluation suite for congestion control protocols that can be used to reproduce existing congestion control results in the academic literature, &lt;em>and&lt;/em> to evaluate new protocols under similar evaluation conditions, &lt;em>and&lt;/em> to be easily extended to new scenarios. An &amp;ldquo;evaluation scenario&amp;rdquo; includes:&lt;/p>
&lt;ul>
&lt;li>a Python notebook to realize the network topology on the FABRIC and/or Chameleon testbed, and configure the network characteristics,&lt;/li>
&lt;li>scripts to generate the data flow(s) needed for the evaluation,&lt;/li>
&lt;li>and scripts to capture data from the experiment and visualize the results.&lt;/li>
&lt;/ul>
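&lt;p>As one concrete piece of such a scenario, the network configuration step typically needs the emulated bottleneck queue sized relative to the bandwidth-delay product (BDP) of the path. The sketch below shows that calculation and the kind of netem command a configuration script might emit; the interface name, parameter values, and exact command syntax are illustrative assumptions to be adapted to the testbed.&lt;/p>

```python
# Sketch: sizing an emulated bottleneck for a congestion control evaluation.
# A common rule of thumb sets the router buffer to one bandwidth-delay
# product (BDP). All names and defaults here are illustrative assumptions.

def bdp_packets(rate_mbit: float, rtt_ms: float, mtu_bytes: int = 1500) -> int:
    """Bandwidth-delay product of the path, in MTU-sized packets."""
    bits_in_flight = rate_mbit * 1e6 * (rtt_ms / 1e3)  # bits at full rate
    return max(1, round(bits_in_flight / (mtu_bytes * 8)))

def netem_command(iface: str, rate_mbit: float, rtt_ms: float) -> str:
    """Illustrative netem invocation: one-way delay, rate limit, 1-BDP queue."""
    limit = bdp_packets(rate_mbit, rtt_ms)
    return (f"tc qdisc replace dev {iface} root netem "
            f"delay {rtt_ms / 2:g}ms rate {rate_mbit:g}mbit limit {limit}")

# e.g. a 10 Mbit/s bottleneck with a 40 ms RTT and a 33-packet queue:
print(netem_command("eth0", rate_mbit=10, rtt_ms=40))
```

&lt;p>In an actual scenario, a notebook would run this kind of command (with root privileges) on the emulated bottleneck node, and the same parameters would be recorded alongside the measurement results.&lt;/p>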
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>To write a good proposal for this project, you should review the most influential papers on TCP congestion control, and especially those related to TCP protocols that are available in the Linux kernel.&lt;/p>
&lt;p>Use your findings to explain what your proposed evaluation suite will include (what network topologies, what flow generators), and justify this with reference to the academic literature. Also indicate which &lt;em>specific results&lt;/em> you expect to be able to reproduce using this suite (e.g. include figures from influential papers showing evaluation results! with citation, of course).&lt;/p>
&lt;p>You can also take advantage of existing open source code that reproduces a congestion control result, e.g. &lt;a href="https://github.com/sdatta97/imcbbrrepro" target="_blank" rel="noopener">Replication: When to Use and When Not to Use BBR&lt;/a>, or &lt;a href="https://github.com/ashutoshs25/bbr-dominance-experiments" target="_blank" rel="noopener">Some of the Internet may be heading towards BBR dominance: an experimental study&lt;/a>.&lt;/p>
&lt;p>&lt;strong>GitHub link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository for this project.&lt;/p>
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Packages&amp;rdquo; of evaluation scenarios that can be used to evaluate a congestion control algorithm implemented in the Linux kernel&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifacts for realizing each evaluation scenario on Chameleon&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Evaluation of Multipath Network Protocols</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</link><pubDate>Thu, 16 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a>&lt;/p>
&lt;p>As mobile devices with dual WiFi and cellular interfaces become widespread, network protocols have been developed that utilize the availability of multiple paths. However, the relative effectiveness of these protocols is highly dependent on the characteristics of the network (including the relationship between the two paths, which are often not independent). Researchers typically evaluate a multipath protocol for a small set of network scenarios, which vary from one publication to the next. It is therefore difficult to get a good picture of how different protocols perform in a range of settings.&lt;/p>
&lt;h3 id="framework-for-repeatable-direct-comparison-of-multipath-transport-protocols">Framework for repeatable, direct comparison of multipath transport protocols&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, wireless systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, networking, data analysis and visualization, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a> and &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In single-path congestion control, the &lt;a href="https://pantheon.stanford.edu/" target="_blank" rel="noopener">Pantheon&lt;/a> work created a reference set of executable benchmarks that researchers could use to evaluate novel congestion control designs against existing work in a wide range of scenarios. This project seeks to achieve something similar for multipath protocols, using publicly available networking testbeds like &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>. For this project, the participant will:&lt;/p>
&lt;ul>
&lt;li>Prepare a set of network benchmarks for multipath protocols, using live network links, real link traces, and emulated scenarios&lt;/li>
&lt;li>Develop an experiment using the benchmarks to evaluate existing multipath protocol implementations&lt;/li>
&lt;li>Prepare materials that researchers can use to evaluate novel multipath protocols against the others in the benchmark&lt;/li>
&lt;/ul></description></item><item><title>Teaching Computer Networks with Reproducible Research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</guid><description>&lt;p>Lead Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>&lt;/p>
&lt;p>In the field of computer networks and wireless communication systems, the availability of open access networking and cloud computing testbeds (&lt;a href="https://portal.geni.net/" target="_blank" rel="noopener">GENI&lt;/a>, &lt;a href="https://cloudlab.us/" target="_blank" rel="noopener">CloudLab&lt;/a>, &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>, &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>, and others) has been transformative in promoting reproducible research &lt;em>and&lt;/em> in making high-quality experiential learning available to students and educators at a wide range of colleges and universities. This project seeks to unite research and education use of these testbeds by developing new ways of using reproducible research to teach computer networks and related topics.&lt;/p>
&lt;h3 id="bringing-foundational-results-into-the-classroom">Bringing foundational results into the classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;/ul>
&lt;p>To make foundational results from computer networks more concrete, this project seeks to reproduce a selection of key results and package them for use as interactive classroom demonstrations. (An example of a &amp;ldquo;foundational&amp;rdquo; result might be the result from the 1980s that motivates congestion control by showing how &lt;a href="http://dx.doi.org/10.1016/0169-7552%2889%2990019-6" target="_blank" rel="noopener">congestion collapse occurs when the network is under heavy load&lt;/a>.) This involves:&lt;/p>
&lt;ul>
&lt;li>Reproducing the original results on an open-access testbed&lt;/li>
&lt;li>Packaging the materials for use as a classroom demo, with interactive elements&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; related to the materials, that instructors may use in homework assignments or exams&lt;/li>
&lt;/ul>
&lt;h3 id="developing-a-classroom-competition-for-adaptive-video-delivery-policies">Developing a &amp;ldquo;classroom competition&amp;rdquo; for adaptive video delivery policies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, adaptive video, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/srishti-jaiswal/">Srishti Jaiswal&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>A carefully designed competition can be a fun and exciting way for students to challenge themselves and gain &amp;ldquo;ownership&amp;rdquo; of a new topic. This project builds on an existing open source &lt;a href="https://witestlab.poly.edu/blog/adaptive-video-reproducing/" target="_blank" rel="noopener">reproducible result&lt;/a> for adaptive video delivery, and will challenge students to extend this work and design their own adaptive video policies for head-to-head competition against their classmates. This includes:&lt;/p>
&lt;ul>
&lt;li>Packaging the result to make it easier for students to reproduce and then build on the original work&lt;/li>
&lt;li>Implementing other adaptive video policies from the literature, so that students can use them as a baseline&lt;/li>
&lt;li>Developing different network settings (using live link traces and emulated link patterns) in which student submissions may be evaluated&lt;/li>
&lt;li>Developing an evaluation framework for scoring student submissions on different criteria and in different network settings, and making the results available in a leaderboard format&lt;/li>
&lt;/ul></description></item><item><title>Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/p>
&lt;p>The computer science and engineering classroom is an essential part of the reproducibility &amp;ldquo;ecosystem&amp;rdquo; - because of its broad reach and potential for significant impact, and because for many students, the classroom is their first exposure to research in their field. For machine learning in particular, reproducibility is an important element of the research culture, and can be a valuable part of any introductory or advanced course in the field. These projects will develop highly interactive open educational resources that may be adopted by instructors of graduate or undergraduate machine learning courses to incorporate more instruction about reproducibility and reproducible research.&lt;/p>
&lt;h3 id="introducing-levels-of-reproduction-and-replication-in-ml">Introducing &amp;ldquo;levels&amp;rdquo; of reproduction and replication in ML&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In machine learning, replicating a published result to confirm the validity of the experimental results and the broader conclusions of the paper can take several forms, with increasing levels of effort:&lt;/p>
&lt;ul>
&lt;li>running the model on the same benchmarks as the original paper, using the authors&amp;rsquo; code and pre-trained weights,&lt;/li>
&lt;li>training a model using authors&amp;rsquo; code and published hyperparameters,&lt;/li>
&lt;li>training a model using authors&amp;rsquo; code and a new hyperparameter search,&lt;/li>
&lt;li>validating the authors&amp;rsquo; code e.g. with unit tests, in addition to training,&lt;/li>
&lt;li>re-implementing the model,&lt;/li>
&lt;li>designing additional experiments to validate that the suggested mechanism is in fact responsible for the result,&lt;/li>
&lt;li>and more.&lt;/li>
&lt;/ul>
&lt;p>This project will develop interactive materials (using one or more exemplar published results) to illustrate and to highlight relevant aspects and pitfalls of each of these &amp;ldquo;levels&amp;rdquo; of reproduction and replication.&lt;/p>
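&lt;p>At the first of these levels, the check itself reduces to asking whether the re-run metric matches the published number within a stated tolerance. The sketch below illustrates that comparison; the example figures and the 0.5-point tolerance are illustrative assumptions, and an acceptable tolerance should be justified for each result.&lt;/p>

```python
# Sketch of the first "level": comparing a re-run benchmark metric against
# the published number. Figures and tolerance are illustrative assumptions.

def within_reported(reproduced: float, reported: float, tol: float = 0.5) -> bool:
    """True when the re-run metric matches the published one within tol points."""
    gap = abs(reproduced - reported)
    return not gap > tol  # i.e. the gap stays within the tolerance

# e.g. the paper reports 76.1% top-1 accuracy, and our re-run gives 75.8%:
print(within_reported(75.8, 76.1))  # True: consistent at this level
print(within_reported(71.2, 76.1))  # False: flag for deeper investigation
```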
&lt;h3 id="packaging-existing-reproducible-results-for-the-ml-classroom">Packaging existing reproducible results for the ML classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shekhar/">Shekhar&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jonathan-edwin/">Jonathan Edwin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal is to make it easier for instructors to expose students to state-of-the-art research in the classroom. This project will work with an existing set of recent reproducible results in machine learning, and will package them for easier consumption by students and more effective use in the classroom. This may include, but is not necessarily limited to:&lt;/p>
&lt;ul>
&lt;li>Re-validating the result and re-packaging along with computational environment on an open access testbed&lt;/li>
&lt;li>Creating tutorial material around the result, including interactive visualizations to demonstrate key elements of the work&lt;/li>
&lt;li>Creating one-click demos for applying the model/technique to a new test sample&lt;/li>
&lt;li>Curating test samples to highlight important advantages and limitations of the result&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; that instructors may use to &amp;ldquo;assign&amp;rdquo; the work to students&lt;/li>
&lt;/ul></description></item></channel></rss>