<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SummerofReproducibility23 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/summerofreproducibility23/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/summerofreproducibility23/index.xml" rel="self" type="application/rss+xml"/><description>SummerofReproducibility23</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 20 Oct 2023 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>SummerofReproducibility23</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/summerofreproducibility23/</link></image><item><title>Final Blog on Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/</link><pubDate>Fri, 20 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/</guid><description>&lt;p>Hello Again!&lt;/p>
&lt;p>I&amp;rsquo;m excited to present my final blog post summarizing the progress and achievements made over the 2023 Summer of Reproducibility Fellowship. I will be sharing the work I&amp;rsquo;ve created for the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet">Teaching Computer Networks with Reproducible Research: Developing a &amp;lsquo;classroom competition&amp;rsquo; for adaptive video delivery&lt;/a>.&lt;/p>
&lt;h2 id="recap-of-the-journey">Recap of the Journey&lt;/h2>
&lt;p>In my &lt;a href="content/report/osre23/nyu/edunet/20230801-Srishti-j18">mid-term&lt;/a> evaluation, I discussed the initial milestones and challenges I encountered during this program. At that point, I studied the key figures from the research paper &amp;lsquo;&lt;a href="https://dl.acm.org/doi/10.1145/2491172.2491179" target="_blank" rel="noopener">Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming&lt;/a>&amp;rsquo;. My primary objectives were to ensure compatibility with both Python 2 and Python 3 and to incorporate an &amp;lsquo;Estimated Download Rate&amp;rsquo; metric into the output file generated by the adaptive video client. Furthermore, I expanded the project to include two crucial visualizations: buffer occupancy vs. time and estimated download rate vs. time.&lt;/p>
&lt;h2 id="final-project-progress">Final Project Progress&lt;/h2>
&lt;p>In the final weeks of my fellowship, I worked towards my ultimate goal: reproducing existing work and creating a clear guide that future students can build upon and improve. To achieve this, I created a new experiment based on an existing one,&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_f279db1f4805fb171d3cff4ae4a908dc.webp 400w,
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_9e5cf37b8721460bee97304092b3b9fa.webp 760w,
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_f279db1f4805fb171d3cff4ae4a908dc.webp"
width="760"
height="442"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>which I titled &amp;ldquo;&lt;a href="https://github.com/Srishti-j18/adaptive-video/blob/68bd537a65eeec0f221ae095b35b18c1e8ffd2ef//notebooks/exec_policy.ipynb" target="_blank" rel="noopener">Compare Adaptive Video Policies&lt;/a>&amp;rdquo;.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_9db9d6a3e27e1f9a70c791dbc5fb72d7.webp 400w,
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_12adfaacac4f310f07b71c83727dd13e.webp 760w,
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_9db9d6a3e27e1f9a70c791dbc5fb72d7.webp"
width="760"
height="575"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>This experiment compares two policies: a rate-based (basic) policy and a buffer-based (Netflix) policy. In the experiment, I covered the following key aspects:&lt;/p>
&lt;p>&lt;strong>How both policies work:&lt;/strong> I detailed the workings of both the rate-based and buffer-based policies, explaining how each policy selects the next bitrate, among other relevant details.&lt;/p>
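To make the two selection rules concrete, here is a minimal sketch in Python. It is purely illustrative: the bitrate ladder, function names, and the reservoir/cushion values are my own assumptions, not taken from the actual adaptive video client.

```python
# Illustrative sketch only: the bitrate ladder and the reservoir/cushion
# values below are assumptions, not the settings used in the real client.
BITRATES = [250, 500, 1000, 2000, 4000]  # kbps, lowest to highest

def rate_based(estimated_rate_kbps):
    """Rate-based (basic): pick the highest bitrate the estimated
    download rate can sustain."""
    candidates = [b for b in BITRATES if estimated_rate_kbps >= b]
    return candidates[-1] if candidates else BITRATES[0]

def buffer_based(buffer_s, reservoir_s=5, cushion_s=10):
    """Buffer-based (Netflix/BBA-style): map buffer occupancy onto the
    bitrate ladder, ignoring the rate estimate."""
    if reservoir_s >= buffer_s:              # buffer nearly empty: play safe
        return BITRATES[0]
    if buffer_s >= reservoir_s + cushion_s:  # buffer comfortable: go maximal
        return BITRATES[-1]
    frac = (buffer_s - reservoir_s) / cushion_s
    return BITRATES[int(frac * (len(BITRATES) - 1))]

print(rate_based(1800))  # 1000: highest rung not above the 1800 kbps estimate
print(buffer_based(3))   # 250: buffer is still inside the reservoir
print(buffer_based(20))  # 4000: buffer is past reservoir + cushion
```

The contrast the experiment surfaces is visible even in this toy version: after a network interruption, the rate-based policy reacts to the (possibly stale) rate estimate, while the buffer-based policy keeps requesting high bitrates until the buffer itself drains.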
&lt;p>&lt;strong>Instructions for executing the policies:&lt;/strong> After conducting several experiments with different settings, I determined the most appropriate settings for this experiment. These settings have been added to the instructions for executing both policies, with a focus on ensuring similar &amp;ldquo;high&amp;rdquo; data rates, similar &amp;ldquo;low&amp;rdquo; data rates, similar durations of the &amp;ldquo;high&amp;rdquo; data rate before the interruption, and similar durations of the &amp;ldquo;interruption.&amp;rdquo; This setup allows for an easy and clear comparison of the two policies.&lt;/p>
&lt;p>&lt;strong>Discussion:&lt;/strong> In the discussion section, I addressed the differences that students can observe after conducting the experiment and visualizing the graphs and videos.&lt;/p>
&lt;p>In conclusion, I would like to thank my mentor, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>, for her excellent guidance, and to express my gratitude to OSRE23, where I have learned so much. This experience has been amazing for my personal and professional growth.&lt;/p></description></item><item><title>Final Blog on Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/</link><pubDate>Wed, 18 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/</guid><description>&lt;p>Welcome back!&lt;/p>
&lt;p>In my final blog post for the 2023 Summer of Reproducibility Fellowship, I&amp;rsquo;ll be sharing my experiences and the materials I&amp;rsquo;ve created for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education project&lt;/a>. As a quick reminder, my mentor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and I have been working on developing interactive open-source educational resources that teach reproducibility and reproducible research in machine learning. You can find my &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal here&lt;/a>.&lt;/p>
&lt;p>In this post, I&amp;rsquo;ll give you a rundown of my experience and share the materials I&amp;rsquo;ve created. If you haven&amp;rsquo;t checked out my previous blog posts, definitely take a look before diving into this one. Let&amp;rsquo;s get started!&lt;/p>
&lt;h2 id="why-is-this-project-important-">Why is this project important 🤔&lt;/h2>
&lt;p>Reproducibility is an essential aspect of scientific research, and it&amp;rsquo;s becoming increasingly important in the field of computer science. However, most efforts to promote reproducibility in education focus on students who are actively involved in research, leaving a significant gap in the curriculum for introductory courses. Our project aims to address this issue by incorporating reproducibility experiences into machine learning education.&lt;/p>
&lt;h2 id="why-reproducibility-matters-in-education-">Why Reproducibility Matters in Education 🎓&lt;/h2>
&lt;p>There are two primary reasons why we believe reproducibility belongs in the computer science classroom. Firstly, it allows students to experience the process of reproducing research firsthand, giving them a deeper understanding of the scientific method and its importance in the field. This exposure can inspire students to adopt reproducible practices in their future careers, contributing to a more transparent and reliable scientific community.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_0f4e9e6ba00e070430ccd90e09800a28.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_22ad37d3ef94bfc2aa93cf4ba651684e.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_0f4e9e6ba00e070430ccd90e09800a28.webp"
width="760"
height="207"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Source: Fund, Fraida. &amp;ldquo;We Need More Reproducibility Content Across the Computer Science Curriculum.&amp;rdquo; Proceedings of the 2023 ACM Conference on Reproducibility and Replicability. 2023.&lt;/em>&lt;/p>
&lt;p>Secondly, as shown in the figure, involving students in reproducibility efforts can have a significant impact on the reproducibility ecosystem itself. Students can create reproducibility artifacts, such as replicable experiments or data analysis, that can be used by other researchers, including authors and graduate students. Additionally, students can consume reproducibility artifacts created by the research community, provide feedback, and suggest improvements. Authors appreciate this type of engagement, as it adds value to their work and promotes open science.&lt;/p>
&lt;h2 id="focusing-on-machine-learning-">Focusing on Machine Learning 🧐&lt;/h2>
&lt;p>Given the growing interest in machine learning and its relevance to reproducibility, our project decided to focus on this area. Machine learning already has a strong culture of reproducibility, with initiatives like &lt;a href="https://paperswithcode.com/" target="_blank" rel="noopener">Papers with Code&lt;/a> and the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">ML Reproducibility Challenge&lt;/a>. These efforts encourage researchers to share their code and reproduce recent machine learning papers, validating their results. By leveraging these existing resources, we can create learning materials that utilize real-world examples and foster hands-on reproducibility experiences for students.&lt;/p>
&lt;h2 id="the-interactive-notebooks-">The Interactive Notebooks 📖&lt;/h2>
&lt;p>We have created two learning materials that focus on machine learning and reproducibility. &lt;strong>The first material&lt;/strong> looks at a paper titled &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">&amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/a> by Jordan T. Ash and Ryan P. Adams. This paper discusses the concept of warm-starting, which involves initializing a new model with the weights of a model previously trained on a subset of the dataset. The authors compare the performance of warm-started models with randomly initialized models and find that the warm-started models perform worse, as shown in the figure below.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_5815498dd015ebc84b00505c90a65354.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_df01d4772e731cee04ae4783ac0cc994.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_5815498dd015ebc84b00505c90a65354.webp"
width="760"
height="306"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Our material takes students through the process of identifying the different claims made in the paper and finding the corresponding experiments that support them. They will also learn how to use open-source code and available data to reproduce these experiments and understand the computational complexity associated with reproducing each experiment. This material can be found on both &lt;a href="https://github.com/mohammed183/re_warm_start_nn/tree/main" target="_blank" rel="noopener">GitHub&lt;/a> and &lt;a href="https://chameleoncloud.org/experiment/share/5b5717df-9aa9-470f-b393-c1e189c008a8" target="_blank" rel="noopener">Chameleon&lt;/a>, where Chameleon provides the resources needed to run it.&lt;/p>
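To illustrate what warm-starting means mechanically, here is a tiny hypothetical sketch using NumPy. The weight shapes and names are invented for this post; the paper's actual experiments use real networks and full training loops.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights():
    """A stand-in for a freshly initialized model."""
    return {"W": rng.normal(size=(10, 2)), "b": np.zeros(2)}

# Phase 1: pretend this model was trained on 50% of the data (updates omitted).
pretrained = init_weights()

# Warm start: the second training run begins from the pretrained weights.
warm = {name: w.copy() for name, w in pretrained.items()}

# Cold start: the second run begins from fresh random weights instead.
cold = init_weights()

print(np.allclose(warm["W"], pretrained["W"]))  # True: warm start reuses weights
print(np.allclose(cold["W"], pretrained["W"]))  # False: cold start does not
```

The paper's surprising result is that, after training on the full dataset, the warm-started model tends to generalize worse than the cold-started one despite its head start.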
&lt;p>&lt;strong>The second material&lt;/strong> examines the paper &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">&amp;ldquo;An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale&amp;rdquo;&lt;/a> by Dosovitskiy et al., which introduces a novel way of applying the transformer architecture, originally designed for natural language processing, to image recognition tasks. The paper shows that transformers can achieve state-of-the-art results on several image classification benchmarks, such as ImageNet, when trained on large-scale datasets, as shown in the following table.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_5441e5e4c6ffed9b29244a3a3dcde852.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp"
width="760"
height="354"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Our material guides students through the process of understanding which claims can and cannot be validated based on the available datasets and how complex it can be to validate each claim. Additionally, they will learn how to use pre-trained models to replicate computationally expensive experiments. Again, this material can be found on both &lt;a href="https://github.com/mohammed183/re_vit/tree/main" target="_blank" rel="noopener">GitHub&lt;/a> and &lt;a href="https://chameleoncloud.org/experiment/share/8f0e34c5-d2c4-45be-8425-36686ad57650" target="_blank" rel="noopener">Chameleon&lt;/a>.&lt;/p>
&lt;p>Both materials are designed to be easy to understand and interactive, allowing students to engage with the content and gain a deeper understanding of the concepts. Instructors can use these materials to assess their students&amp;rsquo; understanding of machine learning and reproducibility.&lt;/p>
&lt;h2 id="reflecting-on-the-journey">Reflecting on the Journey&lt;/h2>
&lt;p>As we wrap up our journey of creating beginner-friendly learning materials for machine learning using reproducibility, it&amp;rsquo;s time to reflect on the rewarding experiences and valuable lessons learned along the way. Our deep dive into the world of machine learning and reproducibility not only enriched our knowledge but also provided us with an opportunity to contribute to the community at the &lt;strong>UC Open Source Symposium 2023&lt;/strong> at UCSC.&lt;/p>
&lt;p>The symposium was a memorable event where we presented our work in a poster session. The diversity of the audience, ranging from professors and researchers to students, added depth to our understanding through their valuable feedback and insights. It was intriguing to see the potential applications of our work in various contexts and its capacity to benefit the broader community.&lt;/p>
&lt;p>This project has been a personal journey of growth, teaching me much more than just machine learning and reproducibility. It honed my skills in collaboration, communication, and problem-solving. I learned to distill complex ideas into simple, accessible language and create engaging, interactive learning experiences. The most fulfilling part of this journey has been seeing our work come alive and realizing its potential to positively impact many people. The gratification that comes from creating something useful for others is unparalleled, and we are thrilled to share our materials with the world.&lt;/p>
&lt;p>Your time and interest in our work are greatly appreciated! Hope you enjoyed this blog!&lt;/p></description></item><item><title>Learning Machine Learning by Reproducing Vision Transformers</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/</link><pubDate>Fri, 06 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/</guid><description>&lt;p>Hello again!&lt;/p>
&lt;p>In this blog post, I will be discussing the second material I created for the 2023 Summer of Reproducibility Fellowship. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed">first post&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. My goal is to create interactive open-source educational resources that teach reproducibility and reproducible research in machine learning (ML), as outlined in my &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>In this post, I will share with you my second material, and how it can be used in a machine learning class to teach students about vision transformers and reproducibility at the same time. If you haven&amp;rsquo;t seen my first work, be sure to check out my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed">previous blog post&lt;/a>. Without further ado, let&amp;rsquo;s dive in!&lt;/p>
&lt;h2 id="reproducing-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale">Reproducing “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”&lt;/h2>
&lt;p>This material is a reproduction of Dosovitskiy et al.&amp;rsquo;s 2020 paper, &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”&lt;/a>. This paper introduces the Vision Transformer (ViT), a novel architecture that applies the transformer model, originally designed for natural language processing tasks, to image recognition. The ViT model achieves state-of-the-art performance on several image classification benchmarks, demonstrating the potential of transformers for computer vision tasks.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_485ae1a0cccbdc73994be22901c125d5.webp 400w,
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_f8af78acab4a91489ecff3308bc9c9c1.webp 760w,
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_485ae1a0cccbdc73994be22901c125d5.webp"
width="760"
height="229"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The figure illustrates the key idea behind ViT, which is to treat an image as a sequence of patches, similar to how a transformer treats a sentence as a sequence of words. Each patch is flattened into a vector and fed into the transformer encoder, which learns to capture the complex relationships between these patches. The resulting representation is then fed into an MLP head, which produces a final prediction for the image. This approach allows ViT to handle large input images and capture both global context and fine-grained details. ViT models can also be pre-trained on large datasets and fine-tuned on smaller datasets for improved performance.&lt;/p>
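The patch-splitting step described above can be sketched in a few lines of NumPy. This is a hypothetical illustration only; the real model applies a learned linear projection to each flattened patch and feeds the resulting sequence to a transformer encoder.

```python
import numpy as np

image = np.random.rand(224, 224, 3)  # H x W x C, the standard ViT input size
P = 16                               # patch size: the "16x16 words"

# Split into non-overlapping 16x16 patches, then flatten each into a vector.
H, W, C = image.shape
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)

print(patches.shape)  # (196, 768): a "sentence" of 14*14 patch "words"
```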
&lt;p>To reproduce this paper, I followed a systematic approach to ensure reliable results:&lt;/p>
&lt;ul>
&lt;li>Critically analyze the paper&amp;rsquo;s qualitative and quantitative claims.&lt;/li>
&lt;li>Identify the necessary experiments to verify each claim.&lt;/li>
&lt;li>Determine the required data, code, and hyperparameters for each experiment.&lt;/li>
&lt;li>Utilize pre-trained models for validating claims that require high computational resources.&lt;/li>
&lt;li>Investigate resources shared by the authors, such as code, data, and models.&lt;/li>
&lt;li>Assess the feasibility of verifying different types of claims.&lt;/li>
&lt;li>Design new experiments for validating qualitative claims when certain models or datasets are unavailable.&lt;/li>
&lt;/ul>
&lt;p>I utilized &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> as my platform for conducting and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental environment that supports computer science systems research. It enables users to create and share Jupyter notebooks capable of running Python code on Chameleon’s cloud servers. For this work, a GPU with at least 24 GB of memory is required to run the notebooks, and Chameleon offers several GPU types that meet this requirement.&lt;/p>
&lt;p>I have set up a &lt;a href="https://github.com/mohammed183/re_vit" target="_blank" rel="noopener">GitHub repository&lt;/a> where you can access all of my reproduction work. The repository contains interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research. These notebooks provide a hands-on approach to understanding the concepts and techniques presented in my reproduction work.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Reproducing a paper can be a challenging task, and I encountered several obstacles during the process, including:&lt;/p>
&lt;ul>
&lt;li>The unavailability of pretraining datasets and pretrained models&lt;/li>
&lt;li>Inexact or unspecified hyperparameters&lt;/li>
&lt;li>The need for expensive resources for some hyperparameters&lt;/li>
&lt;li>The use of different frameworks for baseline CNNs and Vision Transformers&lt;/li>
&lt;/ul>
&lt;p>These issues posed significant difficulties in replicating the following table, a key result from the Vision Transformer paper that demonstrates its superiority over prior state-of-the-art models.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp 400w,
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_5441e5e4c6ffed9b29244a3a3dcde852.webp 760w,
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp"
width="760"
height="354"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To overcome these challenges, I used the same models mentioned in the paper but pretrained on different datasets, experimented with various hyperparameter combinations to achieve the best results, and wrote my own code to ensure that both the baseline and Vision Transformer were fine-tuned using the same framework. I also faced other challenges, which I discussed in my notebooks along with the solutions I applied.&lt;/p>
&lt;h2 id="how-to-use-this-material">How to use this material?&lt;/h2>
&lt;p>This material consists of a series of notebooks that guide you through the paper, its claims, experiments, and results. You will learn how to analyze, interpret, and validate the authors&amp;rsquo; claims. To get started, I recommend briefly skimming the &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">original paper&lt;/a> to gain an understanding of the main ideas and public information. This will help you see how the authors could have been more transparent and clear in certain sections. The notebooks provide clear instructions and explanations, as well as details on how I addressed any missing components.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this blog post, I&amp;rsquo;ve walked you through the contents of this material and the insights users can gain from it. This material is particularly intriguing as it replicates a paper that has significantly influenced the field of computer vision. The interactive nature of the material makes it not only educational but also engaging and enjoyable. I believe users will find this resource both fun and beneficial.&lt;/p>
&lt;p>I hope you found this post informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for reading and stay tuned for more updates!&lt;/p></description></item><item><title>Mid Term Blog : Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230804-indianspeedster/</link><pubDate>Fri, 04 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230804-indianspeedster/</guid><description>&lt;p>Hey,&lt;/p>
&lt;p>I am Shekhar and I am one of several students who are working on developing materials for reproducibility in machine learning education, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>. My &lt;a href="https://drive.google.com/file/d/1rCzLGIJ8HYCVjY_MfndgrQjAQa2SQbqZ/view?usp=sharing" target="_blank" rel="noopener">Proposal&lt;/a> aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. Our goal is to help students and researchers (1) understand some of the challenges they may face when trying to reproduce someone else&amp;rsquo;s published result, and (2) in their own publications, to specify the methodology so that the result will be more easily reproduced by others.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>My work is inspired by my participation in the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">2022 Machine Learning Reproducibility Challenge&lt;/a>, where I was reproducing a result related to bias in hate speech classifiers. The paper seemed at first to have complete methodology details. However, when I tried to implement their approach based on the description of the paper, I realized some important details were missing - for example, in the part where they replaced swear words in the text with other words having similar meaning. I wasn&amp;rsquo;t able to identify the exact list of swear words they used, or what approach they followed if the selected replacement was also a swear word. The choices I made when the authors&amp;rsquo; approach was left ambiguous had a significant impact on the magnitude of the final result.&lt;/p>
&lt;h2 id="milestones-and-accomplishments">Milestones and Accomplishments&lt;/h2>
&lt;p>To inform researchers and students about this problem, I created a fictitious machine learning research paper, and a sequence of accompanying Python notebooks to highlight various choices that can be made to fill in the gaps, and explore how these choices can impact the overall results of the research. Our &amp;ldquo;research paper&amp;rdquo; is about the impact of data augmentation on few-shot learning for intent classification. We implemented a basic data augmentation strategy with synonym replacement using the HWU64 dataset and a BERT classifier, and the results suggest that synonym replacement as a data augmentation technique leads to only minor improvement in accuracy.
In the fictitious paper, we left some of the methodology details ambiguous. When reproducing the results using the accompanying notebooks, the reader follows a &amp;ldquo;Choose Your Own Adventure&amp;rdquo; format, selecting a path through a tree where each node represents an ambiguous methodology detail and branches into the different choices that can be made at that point. The leaf nodes represent the final results, providing insight into the magnitude of the differences resulting from each choice. Some of the choices the reader makes are:&lt;/p>
&lt;ul>
&lt;li>what subset of the source dataset to use.&lt;/li>
&lt;li>some of the details of data pre-processing.&lt;/li>
&lt;li>some of the details of the synonym replacement data augmentation strategy.&lt;/li>
&lt;li>some training hyperparameters and the details of the hyperparameter search.&lt;/li>
&lt;/ul>
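As a hypothetical illustration of this kind of ambiguity (the synonym list, function, and parameters below are invented for this sketch, not taken from our notebooks): even a simple synonym-replacement step hides choices, such as the random seed, how many words to replace, and how a synonym is picked, that a paper may never state but that change the augmented data.

```python
import random

# Invented synonym list for illustration; a real pipeline would use a
# WordNet-style resource.
SYNONYMS = {"book": ["reserve", "schedule"], "flight": ["plane", "trip"]}

def augment(sentence, n_replace=1, seed=None, pick="random"):
    """Replace up to n_replace words with a synonym.

    `seed` and `pick` ("random" vs. "first") stand in for exactly the kind
    of unspecified details that alter the augmented data, and hence results.
    """
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    for i in rng.sample(candidates, min(n_replace, len(candidates))):
        options = SYNONYMS[words[i]]
        words[i] = options[0] if pick == "first" else rng.choice(options)
    return " ".join(words)

# Two "reasonable" readings of the same paper yield different training data:
print(augment("book a flight", n_replace=2, pick="first"))  # "reserve a plane"
print(augment("book a flight", n_replace=1, seed=0))
```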
&lt;p>During the first phase of our project, we have implemented an initial draft of these notebooks, to explore various scenarios and see their impact on results. Next, we will further develop the interactive educational material around them.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the first half of the project, I faced two main challenges. First, I had to come up with a hypothetical research scenario that was realistic, yet easy for students without much expertise to understand. Attaining the right balance was essential to make it engaging and educational. The second challenge was to deliberately leave some details unclear in a realistic way while ensuring that the choices based on that ambiguity had a significant impact on the results. Fortunately, I had the guidance and support of my mentor, which allowed me to successfully tackle these challenges.&lt;/p>
&lt;p>Throughout this project, I faced various challenges and obstacles, but it turned out to be an incredible learning experience. I had the opportunity to dive deep into the domains of few-shot learning and meta-learning, which were entirely new to me. Moreover, I learned to identify ambiguous methodology descriptions in academic papers and explore diverse scenarios related to them. Looking ahead, I am eager to continue working on this project throughout the summer, as it promises further learning and personal growth.&lt;/p></description></item><item><title>Introducing Levels of Reproduction and Replication in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/</guid><description>&lt;p>Hello again,&lt;/p>
&lt;p>I am Mohamed Saeed and this is my second blog post for the 2023 Summer of Reproducibility Fellowship. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed">previous post&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. My goal is to create interactive open educational resources that teach reproducibility and reproducible research in machine learning (ML) as I &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposed&lt;/a>.&lt;/p>
&lt;p>In this post, I will share with you some of the progress I have made so far, as well as some of the challenges I have faced and how I overcame them. I will also highlight some of the specific accomplishments that I am proud of and what I plan to do next.&lt;/p>
&lt;h2 id="reproducing-on-warm-starting-neural-network-training">Reproducing &amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/h2>
&lt;p>This material is a reproduction of the paper &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">&amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/a> by Jordan T. Ash and Ryan P. Adams (2020). The paper investigates the effect of warm-starting neural network training, i.e., initializing a new model with the weights of a previous model trained on a subset of the data, and then training it on a larger dataset that includes new data.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_0c5af6e4428dce728fe7a643b2b8e6d3.webp 400w,
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_f3e332c8b81d6d3146e54527a273bbfe.webp 760w,
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_0c5af6e4428dce728fe7a643b2b8e6d3.webp"
width="760"
height="383"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The figure illustrates how the new model uses the weights from the previous model as its initial values. This allows the new model to train on both the “Original” data, which it has already seen, and the new data, which it has not encountered before. In contrast, the randomly initialized model treats the entire data as unfamiliar and starts from scratch.&lt;/p>
&lt;p>The paper shows that this method can lead to lower test accuracy than starting from scratch with random weights, even though the training loss is similar. It also proposes a simple way to improve the test accuracy of warm-starting: shrinking the previous weights and adding some noise to them.&lt;/p>
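&lt;p>The core idea can be sketched in a few lines of plain Python. This is a toy illustration, not the authors&amp;rsquo; code: weights are represented as a flat list, and the shrink and noise values are arbitrary placeholders rather than the paper&amp;rsquo;s settings:&lt;/p>

```python
import random

def warm_start(old_weights, shrink=0.4, noise_std=0.01, rng=random):
    """Warm-start initialization in the shrink-and-perturb style:
    scale down each previous weight and add a little Gaussian noise.
    The shrink and noise_std values here are illustrative only."""
    return [shrink * w + rng.gauss(0.0, noise_std) for w in old_weights]

# Weights learned on the original subset of the data...
old = [0.8, -1.2, 0.3]
# ...become the starting point for training on the enlarged dataset.
new = warm_start(old, rng=random.Random(42))
print(new)
```

&lt;p>In a real experiment, the same transformation would be applied to every parameter tensor of the network before resuming training on the enlarged dataset.&lt;/p>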
&lt;p>To reproduce this paper, I followed a systematic approach that ensured reliable results. This approach involved:&lt;/p>
&lt;ul>
&lt;li>Reading the paper and its main claims carefully.&lt;/li>
&lt;li>Finding out what resources the authors shared, such as code, data, and models.&lt;/li>
&lt;li>Looking for additional materials online that could help me save time and fill in the gaps left by the authors.&lt;/li>
&lt;li>Setting up the environment and dependencies needed to run the code smoothly.&lt;/li>
&lt;li>Writing code and updating any outdated functions that might cause errors.&lt;/li>
&lt;li>Running the code and verifying that it matched the results reported in the paper.&lt;/li>
&lt;li>Analyzing and interpreting the results and comparing them with the paper’s findings.&lt;/li>
&lt;/ul>
&lt;p>I used &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> as my platform for running and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental platform that supports computer science systems research. It allows users to create and share Jupyter notebooks that can run Python code on Chameleon’s cloud servers.&lt;/p>
&lt;p>I created a &lt;a href="https://github.com/mohammed183/re_warm_start_nn" target="_blank" rel="noopener">GitHub repository&lt;/a> where you can find everything related to my reproduction work, in the form of interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Reproducing a paper is not an easy task, and I faced several challenges along the way. One of the biggest was the lack of code and pretrained models from the authors, a common problem for many reproducibility projects. Fortunately, I found a previous reproducibility publication for this paper in the &lt;a href="https://rescience.github.io/bibliography/Kireev_2021.html" target="_blank" rel="noopener">ReScience journal&lt;/a>. I used some of their code and added new functions and modifications to match the original paper’s descriptions. The other challenges I encountered, along with the solutions I applied, are discussed in the notebooks.&lt;/p>
&lt;h2 id="how-to-use-this-material">How to use this material?&lt;/h2>
&lt;p>This material is a series of notebooks that walk you through the paper and its claims, experiments, and results. You will learn how to analyze, explain, and validate the authors’ claims. To get started, I suggest skimming the &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">original paper&lt;/a> to get the main idea and the publicly available information. This will also help you see where the authors could have been clearer and more transparent. I have provided clear instructions and explanations in the notebooks, including how I dealt with the missing components. You can use this material for self-learning or as an assignment by hiding the final explanation notebook.&lt;/p>
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>In this blog post, I have shared with you some of my work on reproducing warm starting neural network training. I have learned a lot from this experience and gained a deeper understanding of reproducibility and reproducible research principles in ML.&lt;/p>
&lt;p>I am very happy with what I have achieved so far, but I still have more work to do. I am working on reproducing the Vision Transformer paper &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">&amp;ldquo;An Image is Worth 16x16 Words&amp;rdquo;&lt;/a> by Alexey Dosovitskiy et al. This time, my approach is to use the pretrained models provided by the authors to verify the claims made in the paper. However, I face some challenges here as well. For example, some of the datasets and code that the authors used are not publicly available, which makes it hard to replicate their experiments exactly. These challenges are common in reproducing research papers, especially in computer vision, so it is important to learn how to deal with them and find ways to validate at least some of the claims.&lt;/p>
&lt;p>I hope you enjoyed reading this blog post and found it informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the ScaleBugs project, we have worked on building a dataset of reproducible scalability bugs. To achieve this, we went through existing bug reports for popular distributed systems, including Cassandra, HDFS, Ignite, and Kafka. Workloads are designed to reproduce these scalability bugs by triggering certain functionalities of the system under different configurations (e.g., different numbers of nodes), and we observe the resulting impact on performance.&lt;/p>
&lt;p>So far, we have worked on packaging, inside Docker containers, the buggy and fixed versions of each system, a runtime environment that ensures reproducibility, and the workloads used to trigger the symptoms of the bug. Packaging these together simplifies deployment and testing, and lets us switch between versions efficiently, aiding in the identification and comparison of each bug&amp;rsquo;s behavior. For each system, we have carefully built a consistent, reproducible runtime environment, ensuring that the conditions remain identical every time we run tests or investigations.&lt;/p>
&lt;h2 id="new-terms">New Terms&lt;/h2>
&lt;p>In order to make sense of the various bug reports, we had to learn some terminology associated with scalability systems:&lt;/p>
&lt;p>&lt;strong>Clusters&lt;/strong>: Clusters are groups of related or connected items, often found in various fields such as computer science, data analysis, or even social sciences. For example, in data analysis, clusters might represent groups of data points with similar characteristics, making it easier to understand patterns or trends in the data.&lt;/p>
&lt;p>&lt;strong>Cluster Membership&lt;/strong>: Cluster membership refers to the process of determining which items or entities belong to a particular cluster. This task can be done based on various criteria, such as similarity in attributes, spatial proximity, or shared characteristics.&lt;/p>
&lt;p>&lt;strong>Locks&lt;/strong>: In computer programming, locks are mechanisms used to manage access to shared resources, such as files, data structures, or hardware devices. When multiple processes or threads need to access a shared resource simultaneously, locks ensure that only one process or thread can access it at a time, preventing data corruption or conflicts.&lt;/p>
&lt;p>&lt;strong>Lock Contentions&lt;/strong>: Lock contention occurs when multiple processes or threads attempt to acquire the same lock simultaneously. When this happens, one process or thread must wait until the lock becomes available, leading to potential delays and reduced performance.&lt;/p>
&lt;p>&lt;strong>Critical Paths&lt;/strong>: In project management or process analysis, a critical path is the longest chain of dependent tasks that determines the overall duration of the project or process. Any delay in tasks along the critical path will directly impact the project&amp;rsquo;s completion time.&lt;/p>
&lt;p>&lt;strong>Tokens&lt;/strong>: Tokens can have various meanings depending on the context. In computer programming, tokens are the smallest units of source code recognized by a compiler or interpreter. In cryptography, tokens can represent digital certificates or authentication data used for secure communication.&lt;/p>
&lt;p>&lt;strong>Nodes&lt;/strong>: In the context of network theory or graph theory, nodes are individual points or entities that form a network or graph. In a computer network, nodes can be devices like computers or routers, and in a social network, nodes can represent individuals or entities.&lt;/p>
&lt;p>&lt;strong>Peers&lt;/strong>: Peers are entities within a network that have the same status or capabilities. In peer-to-peer networks, each node can act as both a client and a server, enabling direct communication between nodes without relying on a central server.&lt;/p>
&lt;p>&lt;strong>Gossipers, Gossip Protocol&lt;/strong>: In distributed systems, gossipers are nodes that share information with each other using the gossip protocol. The gossip protocol involves randomly selecting peers and exchanging information in a decentralized manner, allowing information to spread quickly across the network.&lt;/p>
&lt;p>&lt;strong>Threads&lt;/strong>: Threads are the smallest units of execution within a process in computer programming. Multiple threads can run concurrently within a single process, enabling multitasking and parallel processing. Threads can share the same resources within the process, making them more lightweight than separate processes. However, proper synchronization is essential to prevent data corruption or conflicts when multiple threads access shared resources.&lt;/p>
&lt;p>&lt;strong>Flush and Writes Contention&lt;/strong>: This refers to a situation where simultaneous operations involving data flushing (saving data to a storage medium) and data writing (updating or adding data) are causing conflicts or delays. This contention can arise when multiple processes or threads attempt to perform these operations concurrently, leading to performance bottlenecks or potential data integrity issues.&lt;/p>
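&lt;p>To make terms like locks and lock contention concrete, here is a small self-contained Python sketch (unrelated to any of the systems studied): four threads repeatedly contend for one lock guarding a shared counter. The lock keeps the count correct, but with many threads, much of their time is spent waiting for it:&lt;/p>

```python
import threading

counter = 0
lock = threading.Lock()  # guards the shared counter

def worker(iterations):
    global counter
    for _ in range(iterations):
        # All threads contend for this one lock; with enough threads,
        # most of their time is spent waiting here (lock contention).
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: the lock prevented any lost updates
```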
&lt;h2 id="accomplishments">Accomplishments&lt;/h2>
&lt;p>We have been able to build Docker containers for the following scalability bugs:&lt;/p>
&lt;p>&lt;strong>IGNITE 12087&lt;/strong>&lt;/p>
&lt;p>This bug stems from the resolution of the IGNITE-5227 issue (another bug), which led to a significant decline in the performance of a particular operation. Before IGNITE-5227 was addressed, inserting 30,000 entries was remarkably efficient, completing in roughly 1 second. After the resolution, the same insertion of 30,000 entries slowed considerably, taking approximately 130 seconds, a slowdown of more than 100 times.&lt;/p>
&lt;p>&lt;strong>CASSANDRA 14660&lt;/strong>&lt;/p>
&lt;p>This bug is related to how clusters work together and how a lock is causing conflicts with the critical path. The issue arises from a method call that uses O(Peers * Tokens) resources while contending for a lock, which is causing problems in the write path. The lock is used to protect cached tokens that are essential for determining the correct replicas. The lock is implemented as a synchronized block in the TokenMetadata class.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>It was fixed by reducing the complexity of the operation to O(Peers), taking advantage of some properties of the token list and the data structure.&lt;/p>
&lt;p>&lt;strong>CASSANDRA 12281&lt;/strong>&lt;/p>
&lt;p>This bug is also related to how clusters work together and a lock conflict. The issue arises when a specific method accesses a lot of resources (O(Tokens^2)) while contending for a read lock. As reported, a cluster with around 300 nodes has around 300 * 256 tokens (assuming the default number of tokens per node), so joining a new member reportedly takes more than 30 minutes. Because of the long execution time, this lock delays every gossip message, so the node never becomes active.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>The granularity of the lock was reduced: the expensive function calls no longer take the problematic read lock and instead use a synchronized block on a specific field, which does the job much better.&lt;/p>
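&lt;p>A quick back-of-the-envelope calculation, using the numbers from the report above, shows why an O(Tokens^2) method call under a lock is so damaging:&lt;/p>

```python
peers = 300            # nodes in the cluster, as reported
tokens_per_node = 256  # Cassandra's default num_tokens
total_tokens = peers * tokens_per_node

# Steps performed while holding the read lock by an O(Tokens^2) method call:
ops = total_tokens ** 2
print(total_tokens)  # 76800
print(ops)           # 5898240000, roughly 5.9 billion steps
```

&lt;p>Billions of steps executed while every gossip message queues up behind the lock is consistent with the reported 30-minute join times.&lt;/p>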
&lt;p>&lt;strong>HA16850&lt;/strong>&lt;/p>
&lt;p>This is a bug related to obtaining thread information in the JvmMetrics package. The original buggy version used MXBeans to obtain thread information. That call uses an underlying native implementation that holds a lock on threads, preventing thread termination or creation. The more threads we have to obtain information for, the longer the function call holds the lock; as a result, the execution time scales with the number of active threads, O(threads).&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>The developers used a ThreadGroup to obtain thread metrics instead. As a result, no lock is held for every thread.&lt;/p>
&lt;p>&lt;strong>CA13923&lt;/strong>&lt;/p>
&lt;p>This issue revolves around conflicts between the &amp;ldquo;flush&amp;rdquo; and &amp;ldquo;writes&amp;rdquo; processes. The main problem is that during the &amp;ldquo;flush&amp;rdquo; process, a resource-intensive function called &amp;ldquo;getAddressRanges&amp;rdquo; is invoked. This function has a high computational cost and its complexity is O(Tokens^2). In other words, the time it takes to complete this function grows quickly as the number of &amp;ldquo;tokens&amp;rdquo; increases. This situation is causing challenges and delays in the overall process.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>This function call affected many code paths, and the fix ensures that getAddressRanges is no longer called on critical paths.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>&lt;strong>Demanding Memory Requirements&lt;/strong>: Running certain builds consumes a significant amount of memory. This places a strain on system resources and can impact the overall performance and stability of the process.&lt;/p>
&lt;p>&lt;strong>Little Issues Impacting Execution&lt;/strong>: Often, seemingly minor details can obstruct the successful execution of a build. Resolving such issues requires thorough investigation and extensive research into similar problems faced by others in the past.&lt;/p>
&lt;p>&lt;strong>Complexities of Scalability Bugs&lt;/strong>: Identifying the underlying causes of scalability-related bugs is intricate. These bugs exhibit unique characteristics that can complicate the process of pinpointing and comprehending their root origins.&lt;/p>
&lt;h2 id="what-is-docker--for-those-who-dont-know-about-it-">What is Docker? ( For those who don&amp;rsquo;t know about it )&lt;/h2>
&lt;p>Docker is a platform that facilitates the containerization of applications, leading to consistent and efficient deployment across diverse environments. Its benefits include portability, resource efficiency, isolation, and rapid development cycles. DockerHub complements Docker by providing a centralized hub for sharing and accessing container images, fostering collaboration and ease of use within the Docker ecosystem.&lt;/p>
&lt;p>More about docker &lt;a href="https://docs.docker.com/get-started/overview/" target="_blank" rel="noopener">https://docs.docker.com/get-started/overview/&lt;/a>&lt;/p></description></item><item><title>Mid-term blog post for Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230801-srishti-j18/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230801-srishti-j18/</guid><description>&lt;p>Hello!&lt;/p>
&lt;p>I am Srishti Jaiswal and this is my second blog post for the 2023 Summer of Reproducibility Fellowship.&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As I reach the halfway mark of my internship journey, I have had the incredible opportunity to work on a project that revolves around reproducing an adaptive video research result using cloud-based experimentation. This blog post delves into my exciting work so far, the significant milestones achieved, specific accomplishments to celebrate, and the challenges overcome. Utilizing CloudLab and FABRIC, I embarked on a journey to reproduce essential figures from the research paper &lt;a href="https://dl.acm.org/doi/10.1145/2491172.2491179" target="_blank" rel="noopener">Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming&lt;/a>, ensure Python2 and Python3 compatibility and incorporate an Estimated Download Rate column in the log file produced by the video client. Let&amp;rsquo;s explore the details of this captivating internship experience.&lt;/p>
&lt;h2 id="major-milestones-reached">Major Milestones Reached&lt;/h2>
&lt;p>Here are the milestones we have reached so far:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Familiar with CloudLab and FABRIC Testbeds: I learned how to run an adaptive video experiment, which is the jumping-off point for my project, on the CloudLab and FABRIC platforms.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Python2 and Python3 Compatibility: My first task was to port an existing open-source code base developed for Python2 (which is no longer supported) so that it can run in Python3.
The code now runs successfully under both versions for all the policies of the existing open-source client, i.e. Basic, Netflix, and Sara.
Fixed &lt;a href="https://github.com/Srishti-j18/AStream/issues/1" target="_blank" rel="noopener">issue #1&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimated Download Rate for Basic Policy: To make it easier for users to understand and visualize how the adaptive video policy works, I added an additional metric, “Estimated Download Rate”, to the output file produced by the adaptive video client.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Graphing Buffer Occupancy and Estimated Download Rate: I extended the existing experiment to show two additional visualizations that are important for understanding how the adaptive video client works: buffer occupancy vs time and estimated download rate vs time.&lt;/p>
&lt;/li>
&lt;/ol>
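&lt;p>For readers curious what the &amp;ldquo;Estimated Download Rate&amp;rdquo; metric might look like: one plausible estimator (not necessarily the exact formula used in the client) derives the rate from the last segment&amp;rsquo;s size and download time, optionally smoothed with an exponentially weighted moving average. The function name and parameters below are illustrative:&lt;/p>

```python
def estimated_download_rate(segment_bytes, download_seconds,
                            prev_estimate=None, alpha=0.8):
    """Hypothetical estimator: throughput of the last segment in bits/s,
    optionally smoothed with an exponentially weighted moving average."""
    instantaneous = segment_bytes * 8 / download_seconds
    if prev_estimate is None:
        return instantaneous
    return alpha * prev_estimate + (1 - alpha) * instantaneous

# A 1 MB segment fetched in 2 s gives a raw estimate of 4 Mbit/s:
print(estimated_download_rate(1_000_000, 2.0))  # 4000000.0
```

&lt;p>Logging such an estimate alongside buffer occupancy makes it much easier to see why an adaptive policy switches between video quality levels.&lt;/p>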
&lt;h2 id="overcoming-challenges">Overcoming Challenges&lt;/h2>
&lt;p>I encountered several challenges throughout this project, especially as it was my first time working independently on a research paper as a third-year engineering student. However, with my mentor&amp;rsquo;s guidance and support, I persevered and learned to tackle each obstacle with determination.&lt;/p>
&lt;p>One significant challenge was porting the entire code from Python2 to Python3. This transition resulted in numerous errors, and I often found it challenging to pinpoint where the mistakes occurred. To overcome this, I adopted a step-by-step approach, fixing errors one by one and verifying them using Python2 for comparison.&lt;/p>
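&lt;p>For readers unfamiliar with this kind of port, a few of the classic Python2-to-Python3 differences that typically surface as errors look like this (a generic illustration, not code from this project):&lt;/p>

```python
# Integer division: Python 2's `/` truncated for ints; Python 3 returns a float.
assert 7 / 2 == 3.5
assert 7 // 2 == 3  # the old truncating behaviour, now spelled explicitly

# print is a function, not a statement.
print("buffer occupancy:", 12.5)

# Renamed standard-library modules are a frequent source of ImportError:
try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib2 import urlopen         # Python 2 fallback

# dict.iteritems() is gone; items() works in both (a view in Python 3).
config = {"policy": "basic"}
for key, value in config.items():
    print(key, value)
```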
&lt;p>Understanding the complex codebase was another hurdle that led to moments of feeling stuck in an infinite loop. But every time I faced such situations, I sought my mentor&amp;rsquo;s advice, and together, we made strategic changes to overcome these challenges.&lt;/p>
&lt;p>I am immensely grateful for my mentor&amp;rsquo;s expertise and support throughout this internship. Her guidance played a crucial role in helping me navigate through the challenges and grow both professionally and personally. I eagerly look forward to the rest of the journey, knowing that I can continue making meaningful contributions to this research project with her inspiring mentorship.&lt;/p>
&lt;h2 id="future-prospects">Future Prospects&lt;/h2>
&lt;p>As the second half of my internship approaches, I am eager to further refine and expand our experimentation.
Our main aim is to reproduce the existing work and provide a clear guide for other students to do the same; for this, I have to create a framework that helps them improve and build upon this work.&lt;/p>
&lt;p>I hope you enjoyed reading this blog post. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230801-shayantan/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230801-shayantan/</guid><description>&lt;p>We are currently midway into the OSRE 2023 program and the following post lists the progress that I have made on the project so far.
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our overall goal was to quantify the effect of sequence data quality on execution times. Towards that end, we decided to first identify suitable datasets from two commonly available -omics data modalities: transcriptomics and genomics. Albrecht et al. [1] developed &lt;em>&lt;strong>seqQscorer&lt;/strong>&lt;/em> to automate the quality control step of NGS data analysis through predictive modeling, and they published the list of ENCODE datasets used for training the models. A quality label was assigned as 0 for released files or 1 for revoked files. Based on the guidelines set forth by ENCODE&amp;rsquo;s Data Coordination Centre (DCC), scientists performed a comprehensive manual annotation of the data, and the resulting quality variable &amp;ldquo;status&amp;rdquo; was published to serve as an indication of data quality. The following steps outline the process of generating the data table for building the machine learning models.&lt;/p>
&lt;ul>
&lt;li>Step 1: Programmatically accessed 86 (34 released; 34 revoked) RNA-seq files from the ENCODE database. All the fastq files were single-end.&lt;/li>
&lt;li>Step 2: Programmatically accessed 288 (144 released; 144 revoked) DNA-seq files from the ENCODE database. All the fastq files were paired-end.&lt;/li>
&lt;li>Step 3: Implemented the STAR aligner for RNA-seq and the BWA aligner for DNA-seq. The resulting outputs contained the alignment times for both the &amp;ldquo;revoked&amp;rdquo; and &amp;ldquo;released&amp;rdquo; files.&lt;/li>
&lt;li>Step 4: Ran statistical tests to determine whether there are any significant differences in the runtimes of the two types of files.&lt;/li>
&lt;/ul>
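&lt;p>The statistical comparison in Step 4 can be sketched with a simple stdlib-only permutation test on mean runtimes. The data below are made up for illustration, and the actual test used in the project may differ:&lt;/p>

```python
import random

def permutation_test(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation test on the difference of mean runtimes
    (a stand-in for whatever statistical test the project actually used)."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_iter  # estimated p-value

# Hypothetical alignment times in seconds for released vs. revoked files:
released = [101, 98, 105, 99, 102]
revoked = [140, 150, 138, 155, 149]
print(permutation_test(released, revoked))  # a small p-value suggests a real difference
```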
&lt;p>Currently I am running the FastQC tool to extract data quality metrics for the same set of files as discussed above. Once I have collected those metrics, I can start building regression models to determine whether data quality has a significant impact on execution time. The first step toward executing a typical genomic analysis workflow is quality control of the raw data, a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Through our analysis, we aim to develop a reproducible ML model that will give the user an estimate of the runtime based on the raw FASTQ file as input.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;p>[1] Albrecht, S., Sprang, M., Andrade-Navarro, M.A. &lt;em>et al.&lt;/em> seqQscorer: automated quality control of next-generation sequencing data using machine learning. &lt;em>Genome Biol&lt;/em> &lt;strong>22&lt;/strong>, 75 (2021). &lt;a href="https://doi.org/10.1186/s13059-021-02294-2" target="_blank" rel="noopener">https://doi.org/10.1186/s13059-021-02294-2&lt;/a>&lt;/p></description></item><item><title>Halfway Through GSOC: My Experience and Learnings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edumle/20230718-kokoedwin/</link><pubDate>Mon, 17 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edumle/20230718-kokoedwin/</guid><description>&lt;p>Hello there! I&amp;rsquo;m Jonathan Edwin, all the way from the beautiful archipelago of Indonesia. This year, I got the exciting chance to jump on board the 2023 Summer of Reproducibility initiative. It&amp;rsquo;s been quite the adventure! Right now, I&amp;rsquo;m pouring my energy into a fascinating project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project. I&amp;rsquo;m thrilled to be able to make my own little mark on it.&lt;/p>
&lt;p>For those of you who are not familiar with what I&amp;rsquo;m working on, let me shed some light. My project, as part of the &amp;ldquo;Using Reproducibility in Machine Learning Education&amp;rdquo; initiative under guidance of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>, focuses on creating educational resources that center around reproducing some key machine learning techniques. These include Cutout data augmentation, U-Net, and Siamese networks, to name a few. The end product will be a series of interactive Jupyter notebooks that provide step-by-step guidance for students, helping them not only understand these complex models but also gain hands-on experience in achieving research reproducibility.&lt;/p>
&lt;p>&lt;strong>Progress and Challenges&lt;/strong>&lt;/p>
&lt;p>Embarking on this project, I dove headfirst into the world of Cutout data augmentation, immersing myself in the many experiments outlined in the foundational paper. This initial study proved to be an intricate blend of multiple datasets, two network architectures, and a performance evaluation of models with and without Cutout data augmentation. Additionally, it included the exploration of these models in combination with other data augmentation techniques.&lt;/p>
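&lt;p>For readers new to Cutout: the core idea is to mask out a random square patch of each training image. Here is a minimal pure-Python sketch; the patch size and zero masking value are illustrative choices, not the paper&amp;rsquo;s exact configuration:&lt;/p>

```python
import random

def cutout(image, size, rng=random):
    """Mask one random size x size patch of a 2-D image with zeros,
    mirroring the core idea of Cutout data augmentation; patches that
    overlap the border are simply clipped."""
    h, w = len(image), len(image[0])
    cy, cx = rng.randrange(h), rng.randrange(w)  # patch centre
    out = [row[:] for row in image]              # leave the input untouched
    for y in range(max(0, cy - size // 2), min(h, cy + size // 2 + 1)):
        for x in range(max(0, cx - size // 2), min(w, cx + size // 2 + 1)):
            out[y][x] = 0  # zero masking is one common choice
    return out

img = [[1] * 6 for _ in range(6)]
masked = cutout(img, size=3, rng=random.Random(1))
print(sum(v == 0 for row in masked for v in row))  # pixels masked out
```

&lt;p>Applying this to each image at training time forces the network to rely on a wider range of features instead of any single region, which is the regularization effect studied in the paper.&lt;/p>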
&lt;p>One of our main objectives has been to help students visualize how the model interacts with the data, and for this, we&amp;rsquo;ve been leveraging a tool called Grad-CAM. The initial paper provided a rich landscape for exploration and learning, leading us to segment our journey into a series of interactive Jupyter notebooks: Introduction, Cutout, ResNet, WideResNet, Regularization, and Grad-CAM.&lt;/p>
&lt;p>I&amp;rsquo;m excited to share that, as we&amp;rsquo;ve hit the mid-term milestone, I&amp;rsquo;ve managed to make significant strides and completed the notebooks up to the WideResNet section. It&amp;rsquo;s been a journey full of learning and growth, overcoming various challenges along the way - understanding the intricacies of the experiments, deconstructing complex architectures, and distilling all this into digestible, interactive notebooks for students. Despite the challenges, the process has been incredibly rewarding. As we gear up for the next half of the project, I&amp;rsquo;m eager to tackle the remaining sections and share my work with the community.&lt;/p>
&lt;p>&lt;strong>Learnings and Skills Gained&lt;/strong>&lt;/p>
&lt;p>&lt;em>&lt;strong>Embracing the Iterative Process of Open Source Development&lt;/strong>&lt;/em>: My initial foray into open source development had me writing and running code in one environment, then copying parts of it to another environment and pushing it from there to GitHub. This occasionally led to mistakes during the code migration. However, I&amp;rsquo;ve since learned to write or change a little bit of code, run the new version directly from GitHub, catch errors, and improve. In open source development, the end goal is to ensure everything works flawlessly, even if it involves several iterations. This is especially true considering the code from GitHub might directly run on platforms like Chameleon or Google Colab.&lt;/p>
&lt;p>&lt;em>&lt;strong>Understanding the Distinction between Reproducing Experiments and Crafting Educational Content&lt;/strong>&lt;/em>: There&amp;rsquo;s a stark difference between merely reproducing an experiment from a research paper and creating an educational resource around that experiment. The former generally involves cloning and running the code, verifying it against the claims in the paper with minimal modifications. The latter, however, necessitates adapting and simplifying the code, regardless of the learner&amp;rsquo;s skill level, to ensure their comprehension. It&amp;rsquo;s about carefully guiding learners through each step for a more profound understanding.&lt;/p>
&lt;p>&lt;em>&lt;strong>The Power of &amp;lsquo;Show, Don’t Tell&amp;rsquo;&lt;/strong>&lt;/em>: This priceless lesson was imparted by my mentor, Ms. Fraida Fund. Rather than telling me what to do when I erred or needed to learn something new, she demonstrated the correct way first-hand. This hands-on approach made understanding far easier. This principle is also reflected in the creation of our notebooks. For instance, we chose to include the Grad-CAM notebook. Although not directly referenced in the paper, it offers students a clear visual understanding of the impact of the Cutout technique, embodying the &amp;ldquo;show, don’t tell&amp;rdquo; philosophy.&lt;/p>
&lt;p>&lt;strong>Next Steps&lt;/strong>&lt;/p>
&lt;p>As we step into the second half of this thrilling journey, our primary goal is to complete the remaining sections of our Cutout project. We&amp;rsquo;re setting our sights on the final notebook - Grad-CAM. The Grad-CAM notebook will offer a visual exploration of how our models interpret and interact with data, thereby solidifying the students&amp;rsquo; understanding of Cutout data augmentation. So, stay tuned for more as we plunge into these fascinating topics!&lt;/p>
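For readers curious about what the Grad-CAM notebook will compute: at its core, Grad-CAM weights each convolutional feature map by the average gradient of the class score with respect to that map, sums the weighted maps, and applies a ReLU. A framework-agnostic sketch of just this combination step, assuming the activations and gradients have already been extracted from a trained model (the function name and array shapes here are illustrative):

```python
import numpy as np

def grad_cam_heatmap(activations, gradients):
    """Combine conv activations and their gradients into a Grad-CAM map.

    activations: (K, H, W) feature maps from the chosen conv layer.
    gradients:   (K, H, W) gradients of the class score w.r.t. those maps.
    The deep-learning framework that produces both tensors is out of
    scope here; this sketch only shows the combination step.
    """
    # Global-average-pool the gradients to get one weight per channel.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps, then ReLU to keep positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0)
    # Normalise to [0, 1] for display, guarding against an all-zero map.
    return cam / cam.max() if cam.max() > 0 else cam
```

The resulting low-resolution map is then upsampled and overlaid on the input image, which is what lets students see directly which regions the model relied on, with and without Cutout.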
&lt;p>&lt;strong>Conclusion&lt;/strong>&lt;/p>
&lt;p>Looking back, my time with the Summer of Reproducibility initiative has been nothing short of a profound learning experience. Working on the &amp;ldquo;Using Reproducibility in Machine Learning Education&amp;rdquo; project has been both challenging and rewarding, and I am incredibly grateful for this opportunity.&lt;/p>
&lt;p>I&amp;rsquo;ve gained valuable insights into open-source development, delved deeper into the intricacies of machine learning techniques, and experienced firsthand the transformative power of a &amp;lsquo;show, don&amp;rsquo;t tell&amp;rsquo; teaching approach. Moreover, I&amp;rsquo;ve learned that the creation of educational resources requires a delicate balance between preserving the essence of original research and adapting it to foster easy understanding.&lt;/p>
&lt;p>As we press forward, I&amp;rsquo;m excited about the prospects of the coming weeks. The completion of the Grad-CAM notebook lies ahead, marking the final pieces of our Cutout project. Beyond this project, the skills and lessons I&amp;rsquo;ve acquired during this initiative will undoubtedly guide me in future endeavours.&lt;/p>
&lt;p>I can confidently say that my Summer of Reproducibility journey has been a remarkable chapter in my growth as a developer and researcher. Here&amp;rsquo;s to more learning, more coding, and more breakthroughs in the future!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230712-shayantan/</link><pubDate>Wed, 12 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230712-shayantan/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1N81dqvdTDcKjz5WDAUCdf5yi1BNR9Au6/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a> and Martin Putra, with collaborator &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a> (another OSRE fellow), aims to analyze large-scale sequencing datasets to gain insight into how ‘input quality’ affects genomic workflows’ execution times.&lt;br>
Recent advancements in Next-Generation Sequencing (NGS) technologies have resulted in massive amounts of nucleotide sequence data and automated genomic workflows to streamline analysis and data interpretation. The success of NGS-driven research has also led to a sudden increase in data of varying size and complexity, making it more time-consuming for researchers to test hypotheses. Analyzing
high-throughput genomic data requires a step-by-step execution of dedicated tools - also known as workflows. The first step toward the execution of a typical genomic analysis workflow is quality control
of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Prior work in this area has suggested that the runtimes of genomic workflows are affected by qualitative differences in the data. Additionally, there is very little consensus on what constitutes “input quality” for data from large genomic experiments. In this proposal, we hypothesize that genomic data quality significantly impacts the genomic workflows’ execution time. We aim to leverage machine learning techniques to extract predictive features from quality control tools that robustly predict workflow execution time.&lt;/p></description></item><item><title>Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230605-kokoedwin/</link><pubDate>Mon, 05 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230605-kokoedwin/</guid><description>&lt;p>I am Jonathan Edwin, from Indonesia, and I am thrilled to be part of the 2023 Summer of Reproducibility initiative, where I am contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project.&lt;/p>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1UEIKfZuPwJ88fMQ1-109vzpA7r4-7ehG/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>, aims to develop educational resources focused on reproducing and replicating fundamental machine-learning techniques, such as Cutout data augmentation, U-Net, and Siamese networks. The goal is to give students a hands-on learning experience that enhances their understanding of the models and their underlying principles while imparting valuable skills in ensuring research reproducibility.
The project will involve the creation of a series of interactive Jupyter notebooks covering the selected papers, guiding students through reproducing results, and focusing on best practices for ensuring reproducibility. Upon completion, the notebooks will provide a comprehensive and accessible learning experience for students while emphasizing the importance of reproducibility in machine learning education.
The proposal also identifies potential challenges associated with the project and proposed solutions to address them. Challenges include incompatibility issues with the original code and current frameworks or environments, difficulty in reproducing the exact results due to factors such as randomness or lack of specific details in the paper, and ensuring that the interactive elements in the Jupyter Notebooks are engaging and effective in teaching reproducibility concepts.&lt;/p></description></item><item><title>Introducing Levels of Reproduction and Replication in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed/</guid><description>&lt;p>Greetings everyone,&lt;/p>
&lt;p>I am Mohamed Saeed and I am delighted to be part of the 2023 Summer of Reproducibility program, where I am contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project.&lt;/p>
&lt;p>My &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> was accepted, and I am fortunate to have &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. The objective of my project is to develop highly interactive open educational resources that can be utilized by instructors teaching graduate or undergraduate machine learning courses. These resources will focus on integrating instruction on reproducibility and reproducible research principles.&lt;/p>
&lt;p>Understanding and practicing reproducibility in machine learning (ML) research is of utmost importance in today&amp;rsquo;s scientific and technological landscape. Reproducibility ensures the reliability, transparency, and credibility of ML findings and discoveries. By learning the principles of reproducibility, students at all levels can validate research results, test the methodologies introduced, and assess the level of reproducibility of published research.&lt;/p>
&lt;p>My contribution will involve developing interactive educational resources that encompass code examples, writing exercises, and comprehensive explanations of key concepts of reproducing ML research. These resources will be carefully crafted to assist students at various levels of expertise. Our aim is for these resources to be widely adopted by instructors teaching graduate or undergraduate machine learning courses, as they seek to enhance the understanding of reproducibility and reproducible research principles.&lt;/p>
&lt;p>I think this is a great opportunity to learn more about ML research reproducibility. I&amp;rsquo;ll be posting regular updates and informative blogs throughout the summer, so stay tuned!&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</guid><description>&lt;p>Hello! As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/">ScaleBugs&lt;/a> project our proposals (&lt;a href="https://drive.google.com/file/d/17iANa5ei_gguZsGGwR1sfPHOoJysnNsf/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a> and &lt;a href="https://drive.google.com/file/d/199ZsiWHXsLYbSJ896vaf8tjrYs23P5xN/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>) under the mentorship under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>,&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a> aims to build a dataset of reproducible scalability bugs by analyzing bug reports from popular distributed systems like Cassandra, HDFS, Ignite, and Kafka. 
For each bug report, we will analyze whether the reported bug is influenced by the scale of the operation, such as the number of nodes being used or the number of requests. The resulting dataset will consist of bug artifacts containing the buggy and fixed versions of the system, a reproducible runtime environment, and workload shell scripts designed to demonstrate bug symptoms under different scales. These resources will support research and development efforts in addressing scalability issues and optimizing system performance.&lt;/p></description></item><item><title>Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230527-indianspeedster/</link><pubDate>Sat, 27 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230527-indianspeedster/</guid><description>&lt;p>Hey,&lt;/p>
&lt;p>I am Shekhar, and I am one of several students working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>. My &lt;a href="https://drive.google.com/file/d/1rCzLGIJ8HYCVjY_MfndgrQjAQa2SQbqZ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. My project is inspired by my experience in the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">Machine Learning Reproducibility Challenge&lt;/a>, where I found that a major obstacle to reproducibility was that some details were left ambiguous in the paper I was trying to reproduce. For my project, I will develop an interactive tutorial demonstrating that when methodology details are not fully specified in a publication, anyone trying to reproduce the result has to make choices that may not match the authors’, and that these choices affect whether or not the final result is validated.&lt;/p></description></item><item><title>Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230524-srishti-j18/</link><pubDate>Tue, 23 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230524-srishti-j18/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet">Teaching Computer Networks with Reproducible Research project&lt;/a> my &lt;a 
href="https://drive.google.com/file/d/1EI0Zhh6YFwufEZ-53VWwhTOyJUuw7-Rf/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> aims to develop a classroom competition for adaptive video delivery policies, leveraging an existing open-source reproducible result. The competition will challenge students to extend the original work and design their own adaptive policies for head-to-head competition against their classmates. The project will involve packaging the existing result for easy reproducibility and building on it by implementing other adaptive video policies from the literature, developing different network settings for evaluating student submissions, and creating an evaluation framework for scoring submissions based on various criteria (so that the competition remains fair and unbiased). The deliverables include a functional submission and evaluation process, an evaluation framework, and documentation and materials for course instructors to use in the classroom.&lt;/p></description></item></channel></rss>