<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>osre23 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre23/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre23/index.xml" rel="self" type="application/rss+xml"/><description>osre23</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sun, 29 Oct 2023 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>osre23</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre23/</link></image><item><title>These 4 new features will change the way you use OpenROAD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/</link><pubDate>Sun, 29 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Welcome to the final blog post for my GSoC’23! Once again, my name is
Jack, and I am working on OpenROAD, an open-source electronic design
automation project. OpenROAD is a fast-growing, leading open-source
foundational application for semiconductor digital design, as evidenced
by its consistent star growth since inception. You can check us out
at this &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD/" target="_blank" rel="noopener">link&lt;/a>.
Allow me to share the four significant contributions I made during this
GSoC project.&lt;/p>
&lt;p>&lt;a href="https://star-history.com/#The-OpenROAD-Project/OpenROAD&amp;amp;Date" target="_blank" rel="noopener">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://api.star-history.com/svg?repos=The-OpenROAD-Project/OpenROAD&amp;amp;type=Date" alt="Star History Chart" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/a>&lt;/p>
&lt;h2 id="1-improving-ease-of-installation">1) Improving Ease of Installation&lt;/h2>
&lt;p>Firstly, OpenROAD now supports multiple operating systems.
This is essential because one of our primary goals is to democratise chip
implementation, and installation is often one of the hardest steps
to get right, so it was one of our priorities. Today, we provide
several installation options:&lt;/p>
&lt;ul>
&lt;li>&lt;em>Prebuilt binaries&lt;/em>: Local builds can be riddled
with incompatibilities or unexpected bugs, and compilation takes a long
time. We sidestepped this by providing semi-regular updates to the
OpenROAD binaries, reducing the time to installation.&lt;/li>
&lt;li>&lt;em>Docker&lt;/em>: Echoing the concerns above, we also enabled Docker-based
installation, verified on 9 major operating systems. Docker is extremely
flexible and runs on any platform that Docker itself supports.&lt;/li>
&lt;/ul>
&lt;p>With these changes, we have observed a 10% reduction in installation-related GitHub issues posted on a weekly basis.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-1-supported-os-matrix">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic1" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_5ec6a03672875da1d114ed8b24e54d81.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_256594bafdfffa842322c55b991f1ae1.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic1_hu40f387e99db4aa81085a02b3bc75ebae_22326_5ec6a03672875da1d114ed8b24e54d81.webp"
width="650"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 1: Supported OS matrix
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="2-filling-missing-documentation">2) Filling Missing Documentation&lt;/h2>
&lt;p>Next, we made considerable improvements to over 20 tool-specific
documentation pages, introducing a consistent formatting style for each
page. We added default values and datatypes so that users can apply the
tools with greater ease.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-2-helpful-documentation-defaults-and-datatype">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic2" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_f20854090d02e2c8c4eab994e275b52a.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_2d201fd5ada34b46714b076a84194e28.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic2_hu909e40a774da931354132b6c4f3b2165_22459_f20854090d02e2c8c4eab994e275b52a.webp"
width="691"
height="368"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 2: Helpful documentation defaults and datatype
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Rather than listing all arguments for a command in one common table,
we separated them into developer arguments and developer commands.
This makes our documentation more beginner-friendly to read,
while not alienating our technical user base. We have also added sections
for example scripts and regression tests, to help onboard
newcomers to each tool in the flow.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-3-useful-developer-commands-example-scripts-and-regression-test-instructions">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic3" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_e3fcd088f5df4574a67cf6d097c9e73a.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_1ceeb7f590547f00904c173b5a084798.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic3_huf8b2e6da7ee6998c3390f4691d0458af_30285_e3fcd088f5df4574a67cf6d097c9e73a.webp"
width="690"
height="670"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 3: Useful developer commands, example scripts, and regression test instructions
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="3-extensible-documentation-framework">3) Extensible Documentation Framework&lt;/h2>
&lt;p>Thirdly, we have introduced an extensible documentation framework.
Now, what do we mean by &lt;em>extensible&lt;/em>? It means we have created
infrastructure that is easy for developers to use and allows for
greater maintainability. Our goal is a setup that requires minimal
changes to add new documentation content.&lt;/p>
&lt;p>So, how did we do this?&lt;/p>
&lt;p>We introduced four initiatives. The first is the warning/error
message glossary. We noticed that people were searching for error and
warning messages, but our documentation did not include them. So we added
a page where all the error/warning messages, along with the relevant
source code line numbers, are generated automatically. On top of that,
developers can add useful debug information to help the end user.&lt;/p>
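In spirit, the glossary generation works by scanning the source tree for logging calls and tabulating them. The sketch below is a hypothetical illustration, not the actual OpenROAD tooling; in particular, the `logger->warn("TOOL", id, "message")` call pattern is an assumption:

```python
import re
from pathlib import Path

# Assumed pattern for calls such as: logger->warn("TAP", 12, "row overlap");
MSG_RE = re.compile(r'logger->(warn|error)\(\s*"(\w+)",\s*(\d+),\s*"([^"]*)"')

def build_glossary(root: str) -> list[dict]:
    """Scan C++ sources under `root` and collect warning/error messages
    with their file and line number, sorted by tool and message id."""
    entries = []
    for path in Path(root).rglob("*.cpp"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            m = MSG_RE.search(line)
            if m:
                level, tool, msg_id, text = m.groups()
                entries.append({
                    "level": level, "tool": tool, "id": int(msg_id),
                    "message": text, "file": path.name, "line": lineno,
                })
    return sorted(entries, key=lambda e: (e["tool"], e["id"]))
```

Running such a scan in CI keeps the glossary page in sync with the source automatically, which is exactly the "minimal changes to add content" property described above.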
&lt;p>
&lt;figure id="figure-figure-4-warningerror-messages-glossary">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic4" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_aa069c4f5f2d1682fc92525139f6d57c.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_881f56c79ec21ee86b422f9eb12ef3c8.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic4_hu4de9242319e92c4f80050403ede9a5eb_17089_aa069c4f5f2d1682fc92525139f6d57c.webp"
width="687"
height="348"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 4: Warning/Error messages glossary.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Next, we also introduced automatically generated Doxygen pages, which
integrate nicely with our C++/Tcl source code. With this automatic
generation, developers simply insert comments into their source code
and let Doxygen produce the documentation.&lt;/p>
&lt;p>
&lt;figure id="figure-figure-5-doxygen-pages">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic5" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_567ad1b2725278073bfe8cdf4d2dad6a.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_35b25ed8006816a0cd300dba6aedb4a3.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic5_hu5bbe3008d2202e9240368dd966dc7b39_37072_567ad1b2725278073bfe8cdf4d2dad6a.webp"
width="760"
height="578"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 5: Doxygen pages.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Next, we introduced cloud-based packaging. It is important that our
framework can run in the cloud and in the ever-popular notebook
format. Our Colab-based notebook was created with this in mind, and it
can be ported to other notebook providers with some modifications.
Check out the notebooks here!&lt;/p>
&lt;p>
&lt;figure id="figure-figure-6-google-colab-can-now-run-openroad-scripts">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic6" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_0f20b3a36a05036a4602868c18f0da9b.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_125685c82e5be8372c2ae4b937fdd412.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic6_hu84acc4eba83f1de30ea399aa678d63ae_48463_0f20b3a36a05036a4602868c18f0da9b.webp"
width="760"
height="321"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 6: Google Colab can now run OpenROAD scripts.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Lastly, we added a changelog workflow that can be triggered manually.
For our open-source project, we have chosen not to do software releases,
which makes it difficult to track the changes between commits. This
workflow helps newcomers track changes more easily, grouped by
month.&lt;/p>
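The core of such a by-month grouping could be sketched as below. This is an illustrative stand-in, not the actual GitHub Actions workflow, and the sample commit subjects are made up:

```python
from collections import defaultdict
from datetime import date

def changelog_by_month(commits):
    """Group (date, subject) commit pairs into a per-month changelog.

    `commits` is an iterable of (datetime.date, str) pairs, e.g. parsed
    from `git log --pretty='%as %s'`. Returns {"YYYY-MM": [subjects]}.
    """
    grouped = defaultdict(list)
    for day, subject in commits:
        grouped[day.strftime("%Y-%m")].append(subject)
    return dict(grouped)

# Hypothetical commit history for illustration.
commits = [
    (date(2023, 8, 3), "gui: fix crash on zoom"),
    (date(2023, 8, 19), "docs: add regression test section"),
    (date(2023, 9, 1), "drt: speed up routing"),
]
```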
&lt;p>
&lt;figure id="figure-figure-7-sample-output-of-github-changelog">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pic7" srcset="
/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_a13d29d9b1d8fe53307365f5dfd84d86.webp 400w,
/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_9baeb333eb95f59c9ac1004e0e9fd54c.webp 760w,
/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20231029-luarss/pic7_hu4e7dae0ef8916646279c834f2bbbed59_40244_a13d29d9b1d8fe53307365f5dfd84d86.webp"
width="760"
height="400"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 7: Sample output of github changelog
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h2 id="4-openroad-chatbot">4) OpenROAD Chatbot&lt;/h2>
&lt;p>Finally, we are also discussing the potential of creating a chatbot to
answer user queries. There is a lot of domain knowledge scattered across
Slack channels, GitHub repositories, and so on, so why not build an
LLM-based chatbot on top of it? Stay tuned for updates!&lt;/p>
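As a rough illustration of the retrieval side of such a chatbot, the toy sketch below ranks documents by naive keyword overlap. A real LLM-based assistant would use embedding search plus generation; every document name and snippet here is hypothetical:

```python
def retrieve(query: str, documents: dict, top_k: int = 2):
    """Rank documents by how many query words appear in them.
    A toy stand-in for the embedding-based retrieval an LLM chatbot
    would pair with answer generation."""
    words = set(query.lower().split())
    scored = []
    for name, text in documents.items():
        score = sum(1 for w in words if w in text.lower())
        scored.append((score, name))
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]

# Hypothetical knowledge base scraped from docs/Slack/GitHub.
docs = {
    "install.md": "Install OpenROAD via prebuilt binaries or Docker.",
    "gpl.md": "Global placement arguments and developer commands.",
}
```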
&lt;h2 id="personal-reflections">Personal Reflections&lt;/h2>
&lt;p>To me, the most valuable takeaway concerns code quality. Oftentimes,
we as coders tend to opt for the quickest working solution and “hack” something
out. Hacking is fine as a proof of concept, but not for
long-term code development. Working in an open-source project like this,
I have learnt to avoid creating unnecessary files, to shorten code,
and to optimise runtime. In doing our job, we also wish to make life
easier, not harder, for future developers.&lt;/p>
&lt;h2 id="final-words">Final Words&lt;/h2>
&lt;p>I would like to express my gratitude to my mentors Indira and Vitor for
their guidance and insight throughout the project, as well as to the
OpenROAD dev team for their assistance. I would also like to thank the
Google Summer of Code organising committee and UCSC for creating such a
wonderful program. Being able to contribute to real open-source
projects with real needs is truly the best of both worlds for aspiring
programmers.&lt;/p></description></item><item><title>Final Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20231030-ren.450/</link><pubDate>Wed, 25 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20231030-ren.450/</guid><description>&lt;p>In my final blog, I will first introduce the project, then describe the achievements after the midterm and summarize our experiments. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a>, aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>In my &lt;a href="/report/osre23/osu/missingsettings/20230802-ren.450/">midterm blog&lt;/a>, I used three parameters in the PostgreSQL configuration to test TPC-C benchmark performance and obtained some initial results on how each parameter separately affects throughput. After the midterm, I continued the experiments on four parameters (shared_buffers, min_wal_size, max_wal_size, and effective_cache_size) with more values, combining them to measure their effect on performance. These parameters relate to memory consumption, checkpoints, and planner cost in the database server. You can refer to my previous blog for details.&lt;/p>
&lt;p>For the experiment, we continue to measure benchmark throughput by setting the scale factor to 10 and incrementing the number of worker terminals. All database server settings are defaults except the four parameters we tune. For shared_buffers, we choose six values, from the initial 128MB up to 8GB. Then, for each shared_buffers setting, effective_cache_size takes three values, from the initial 4GB up to 16GB. Next, for each effective_cache_size setting, we tune min_wal_size and max_wal_size as a tuple; min_wal_size takes two values and max_wal_size takes four. We run three rounds for each setting and report the average of the three throughput numbers.&lt;/p>
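The sweep described above amounts to a Cartesian grid over the four parameters. The sketch below generates such a grid; the concrete values are placeholders consistent with the ranges mentioned, not the exact values used in our runs:

```python
from itertools import product

# Placeholder values spanning the ranges described in the text.
shared_buffers = ["128MB", "256MB", "512MB", "1GB", "4GB", "8GB"]  # 6 values
effective_cache_size = ["4GB", "8GB", "16GB"]                      # 3 values
min_wal_size = ["80MB", "1GB"]                                     # 2 values
max_wal_size = ["1GB", "2GB", "4GB", "8GB"]                        # 4 values

def config_grid():
    """Yield one PostgreSQL configuration dict per combination to benchmark."""
    for sb, ecs, mn, mx in product(shared_buffers, effective_cache_size,
                                   min_wal_size, max_wal_size):
        yield {"shared_buffers": sb, "effective_cache_size": ecs,
               "min_wal_size": mn, "max_wal_size": mx}

configs = list(config_grid())
```

With three benchmark rounds per configuration, the cost of a full grid grows multiplicatively, which is why sampling a subset of settings is attractive.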
&lt;p>Based on the &lt;a href="https://docs.google.com/spreadsheets/d/12OeSwZGq2G4-YGY5BTH5uZbVcAaxcZqYhqciCaBiF2E/edit?usp=sharing" target="_blank" rel="noopener">results&lt;/a>, the observations are consistent with the conclusions from the midterm blog. Benchmark throughput can be improved by tuning shared_buffers and max_wal_size, while effective_cache_size and min_wal_size have no obvious effect for this benchmark. The improvement plateaus once shared_buffers and max_wal_size reach a certain value.&lt;/p>
&lt;p>In our experiment, we chose only four parameters for one benchmark. The experiment is expensive given how time-consuming it is, and there are more values of the above-mentioned parameters left to test. This experiment also indicates that we may need to sample a subset of settings to generate observations that match those from a full, extensive artifact evaluation.&lt;/p></description></item><item><title>Public Artifact and Data Visualization: A Journey to Empower</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/</link><pubDate>Tue, 24 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/</guid><description>&lt;p>Hello, friends!
As we draw the curtains on our project, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz">Public Artifact and Data Visualization&lt;/a>, we&amp;rsquo;re thrilled to present the advancements we&amp;rsquo;ve achieved since our mid-term update. Our mission has been to foster a deeper understanding of data and empower users to make informed decisions. Let&amp;rsquo;s delve into the evolution of our project.&lt;/p>
&lt;h2 id="unveiling-new-functionalities">Unveiling New Functionalities&lt;/h2>
&lt;ol>
&lt;li>Modular Architecture: Your Way, Your Choice&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>At the core of our project is a modular architecture designed to cater to your unique preferences. We firmly believe that choice empowers users. Thus, we&amp;rsquo;ve given you the option to select between a Graphical User Interface (GUI) and a Command-Line Interface (CLI). It&amp;rsquo;s about providing a platform that adapts to your specific requirements and style of interaction.&lt;/li>
&lt;/ul>
&lt;ol start="2">
&lt;li>Real-time Backend Environment Monitoring: Data as it Happens&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Real-time monitoring of backend environment data is at the heart of our project. It&amp;rsquo;s not just about collecting data; it&amp;rsquo;s about providing continuous insights into system performance. This feature empowers you to make real-time, data-driven decisions—an essential capability in today&amp;rsquo;s fast-paced computing landscape.&lt;/li>
&lt;/ul>
&lt;ol start="3">
&lt;li>Visualizing Environment Variables: Clarity Amidst Complexity&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>We&amp;rsquo;ve placed a strong emphasis on user-friendly data visualization. Our enhancements enable you to navigate through detected variables effortlessly and compare iterations within different buckets. The result is a visual representation of complex data, making it easier to comprehend and analyze.&lt;/li>
&lt;/ul>
&lt;ol start="4">
&lt;li>Predefined Monitoring Commands: Your Head Start&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>We understand that monitoring can be a daunting task. To simplify the process, we&amp;rsquo;ve introduced predefined monitoring commands such as mpstat and iostat. These templates serve as a launchpad for monitoring common system metrics, helping you get started quickly and efficiently.&lt;/li>
&lt;/ul>
&lt;ol start="5">
&lt;li>Comprehensive Customization: Tailoring the Experience&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Recognizing that every user has unique needs, our platform now offers extensive documentation. This documentation serves as a guide, enabling users to fine-tune their monitoring commands. It&amp;rsquo;s about tailoring the platform to match your specific requirements and preferences. The power to customize is firmly in your hands.&lt;/li>
&lt;/ul>
&lt;ol start="6">
&lt;li>Import and Export Functionality: Seamless Collaboration&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>In an era where collaboration and data management are essential, we&amp;rsquo;ve introduced the capability to import and export environment data. This feature simplifies data management and supports collaborative efforts, making it easy to share monitoring data and conduct analysis across various environments.&lt;/li>
&lt;/ul>
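The real-time monitoring described above boils down to sampling a metric source on a fixed interval. The sketch below is an illustrative stand-in for the platform's actual implementation; in practice `sample_fn` would wrap a command such as `mpstat` or `iostat` run via subprocess:

```python
import time

def monitor(sample_fn, interval_s: float, n_samples: int):
    """Collect `n_samples` readings from `sample_fn`, one every
    `interval_s` seconds, timestamping each reading. `sample_fn`
    stands in for a parsed system command like mpstat or iostat."""
    readings = []
    for _ in range(n_samples):
        readings.append({"t": time.time(), "value": sample_fn()})
        time.sleep(interval_s)
    return readings

# Example: a fake CPU-usage source standing in for parsed mpstat output.
fake_cpu = iter([12.5, 30.1, 22.4])
samples = monitor(lambda: next(fake_cpu), interval_s=0.01, n_samples=3)
```

The list of timestamped dicts maps directly onto the import/export and visualization features: it can be serialized as-is, or plotted as value vs. time.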
&lt;h2 id="exploring-our-repositories">Exploring Our Repositories&lt;/h2>
&lt;p>
As mentioned earlier, we have completed the core functionalities of our platform, and we would love for you to try it out and give us valuable feedback. Here are the links to the repositories where you can explore and experiment with our platform:
&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentGUI" target="_blank" rel="noopener">GUI Repository&lt;/a> and &lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentCLI" target="_blank" rel="noopener">CLI Repository&lt;/a>
&lt;ul>
&lt;li>The journey begins with a choice. Our repositories cater to a diverse range of user preferences. The README.md file of the GUI repository contains detailed installation instructions to guide you through setting up the Graphical User Interface (GUI). It&amp;rsquo;s your portal to a user-friendly experience.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/test-experiment" target="_blank" rel="noopener">Sample Repository&lt;/a>
&lt;ul>
&lt;li>For those eager to embark on their monitoring journey, our Sample Repository is a valuable resource. It provides scripts that not only let you run our program but also serve as templates, designed to simplify monitoring your own programs, tailored to your unique requirements.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h2 id="project-demo">Project Demo&lt;/h2>
&lt;p>
To give you a glimpse of what our project can do, here are some demo images showcasing the capabilities and features of &amp;ldquo;Public Artifact and Data Visualization.&amp;rdquo;
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_39e93d5df25c8b9261ed5b60f3a49091.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_435c9e662168ef7e029d1c36702fca84.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_39e93d5df25c8b9261ed5b60f3a49091.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_df590742e12a23dea8d1f3414c9e5c16.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_b47182cd4c3ea07108c723e7c18875e4.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_df590742e12a23dea8d1f3414c9e5c16.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_6117bb9125bca9a4f63ad1631b5f7bcc.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_3979a5588d47e6a37a482b5f2184d3af.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_6117bb9125bca9a4f63ad1631b5f7bcc.webp"
width="736"
height="656"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="thank-you-for-joining-us">Thank You for Joining Us&lt;/h2>
&lt;p>We appreciate your support and participation in this journey of data visualization and empowerment. Our commitment to enhancing the world of data comprehension remains unwavering. As we mark the end of this chapter, we eagerly anticipate the exciting future that awaits in the realm of data visualization. The path doesn&amp;rsquo;t end here; it&amp;rsquo;s just the beginning of a new chapter in our collective exploration of data&amp;rsquo;s potential.
​&lt;/p></description></item><item><title>Final Blog on Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/</link><pubDate>Fri, 20 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/</guid><description>&lt;p>Hello Again!&lt;/p>
&lt;p>I&amp;rsquo;m excited to present my final blog post summarizing the progress and achievements made over the 2023 Summer of Reproducibility Fellowship. I will be sharing the work I&amp;rsquo;ve created for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet">Teaching Computer Networks with Reproducible Research: Developing a &amp;lsquo;classroom competition&amp;rsquo; for adaptive video delivery&lt;/a> project.&lt;/p>
&lt;h2 id="recap-of-the-journey">Recap of the Journey&lt;/h2>
&lt;p>In my &lt;a href="content/report/osre23/nyu/edunet/20230801-Srishti-j18">mid-term&lt;/a> evaluation, I discussed the initial milestones and challenges I encountered during this program. At that point, I studied the key figures from the research paper &amp;lsquo;&lt;a href="https://dl.acm.org/doi/10.1145/2491172.2491179" target="_blank" rel="noopener">Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming&lt;/a>&amp;rsquo;. My primary objectives were to ensure compatibility with both Python 2 and Python 3 and to incorporate an &amp;lsquo;Estimated Download Rate&amp;rsquo; metric into the output file generated by the adaptive video client. Furthermore, I expanded the project to include two crucial visualizations: buffer occupancy vs. time and estimated download rate vs. time.&lt;/p>
&lt;h2 id="final-project-progress">Final Project Progress&lt;/h2>
&lt;p>In the final weeks of my internship, I worked towards my ultimate goal: to reproduce existing work and create a clear guide that future students can build upon and improve. To achieve this, I created a new experiment based on an existing one,&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_f279db1f4805fb171d3cff4ae4a908dc.webp 400w,
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_9e5cf37b8721460bee97304092b3b9fa.webp 760w,
/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/feature1_hu9d52f6d5cbdd1ece23828e42c8b71316_147352_f279db1f4805fb171d3cff4ae4a908dc.webp"
width="760"
height="442"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>which I titled &amp;ldquo;&lt;a href="https://github.com/Srishti-j18/adaptive-video/blob/68bd537a65eeec0f221ae095b35b18c1e8ffd2ef//notebooks/exec_policy.ipynb" target="_blank" rel="noopener">Compare Adaptive Video Policies&lt;/a>&amp;rdquo;&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_9db9d6a3e27e1f9a70c791dbc5fb72d7.webp 400w,
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_12adfaacac4f310f07b71c83727dd13e.webp 760w,
/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230820-srishti-j18/featured_hu28a6c98f585340505adb453b1827e333_184681_9db9d6a3e27e1f9a70c791dbc5fb72d7.webp"
width="760"
height="575"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>This experiment compares two policies: a rate-based (basic) policy and a
buffer-based (Netflix) policy. In the experiment, I covered the following key aspects:&lt;/p>
&lt;p>How Both Policies Work: I detailed the workings of both the rate-based and buffer-based policies, explaining how each policy selects the next bitrate, among other relevant information.&lt;/p>
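In simplified form, the two selection rules can be sketched as follows. The bitrate ladder, safety factor, reservoir, and cushion values here are illustrative placeholders, not the values from the paper or the experiment:

```python
BITRATES_KBPS = [235, 375, 560, 750, 1050, 1750, 2350, 3000]  # illustrative ladder

def rate_based_next(estimated_rate_kbps: float) -> int:
    """Rate-based (basic) policy: pick the highest bitrate at or below a
    conservative fraction of the estimated download rate."""
    safe = 0.8 * estimated_rate_kbps  # 0.8 is an assumed safety factor
    candidates = [b for b in BITRATES_KBPS if b <= safe]
    return candidates[-1] if candidates else BITRATES_KBPS[0]

def buffer_based_next(buffer_s: float, reservoir_s: float = 10.0,
                      cushion_s: float = 30.0) -> int:
    """Buffer-based (Netflix-style) policy: map buffer occupancy linearly
    onto the bitrate ladder between a reservoir and a cushion."""
    if buffer_s <= reservoir_s:
        return BITRATES_KBPS[0]
    if buffer_s >= cushion_s:
        return BITRATES_KBPS[-1]
    frac = (buffer_s - reservoir_s) / (cushion_s - reservoir_s)
    return BITRATES_KBPS[int(frac * (len(BITRATES_KBPS) - 1))]
```

The contrast is visible directly: the rate-based rule reacts to the (noisy) throughput estimate, while the buffer-based rule reacts only to how much video is already buffered, which is why it rides out short interruptions more smoothly.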
&lt;p>Instructions for Execution of Policies: After conducting several experiments with different settings, I determined the most appropriate settings for this experiment and added them to the instructions for executing both policies. The focus is on ensuring similar &amp;ldquo;high&amp;rdquo; network rates, similar &amp;ldquo;low&amp;rdquo; data rates, similar durations of the &amp;ldquo;high&amp;rdquo; data rate before the
interruption, and similar durations of the &amp;ldquo;interruption.&amp;rdquo; This setup allows for an easy and clear comparison of the two policies.&lt;/p>
&lt;p>Discussion Part: In the discussion section, I addressed the differences that students can observe after conducting the experiment and visualising the graphs and videos.&lt;/p>
&lt;p>In conclusion, I would like to thank my mentor, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>, who has given me excellent guidance, and I would like to express my gratitude to OSRE23, where I have learned so much. This experience has been amazing for my personal and professional growth.&lt;/p></description></item><item><title>Final Blog on Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/</link><pubDate>Wed, 18 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/</guid><description>&lt;p>Welcome back!&lt;/p>
&lt;p>In my final blog post for the 2023 Summer of Reproducibility Fellowship, I&amp;rsquo;ll be sharing my experiences and the materials I&amp;rsquo;ve created for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education project&lt;/a>. As a quick reminder, my mentor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and I have been working on developing interactive open-source educational resources that teach reproducibility and reproducible research in machine learning. You can find my &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal here&lt;/a>.&lt;/p>
&lt;p>In this post, I&amp;rsquo;ll give you a rundown of my experience and share the materials I&amp;rsquo;ve created. If you haven&amp;rsquo;t checked out my previous blog posts, definitely take a look before diving into this one. Let&amp;rsquo;s get started!&lt;/p>
&lt;h2 id="why-is-this-project-important-">Why is this project important 🤔&lt;/h2>
&lt;p>Reproducibility is an essential aspect of scientific research, and it&amp;rsquo;s becoming increasingly important in the field of computer science. However, most efforts to promote reproducibility in education focus on students who are actively involved in research, leaving a significant gap in the curriculum for introductory courses. Our project aims to address this issue by incorporating reproducibility experiences into machine learning education.&lt;/p>
&lt;h2 id="why-reproducibility-matters-in-education-">Why Reproducibility Matters in Education 🎓&lt;/h2>
&lt;p>There are two primary reasons why we believe reproducibility belongs in the computer science classroom. Firstly, it allows students to experience the process of reproducing research firsthand, giving them a deeper understanding of the scientific method and its importance in the field. This exposure can inspire students to adopt reproducible practices in their future careers, contributing to a more transparent and reliable scientific community.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_0f4e9e6ba00e070430ccd90e09800a28.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_22ad37d3ef94bfc2aa93cf4ba651684e.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/reproducibilityBenifits_hudb7baafb83412fd51973fc577da0863d_141778_0f4e9e6ba00e070430ccd90e09800a28.webp"
width="760"
height="207"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Source: Fund, Fraida. &amp;ldquo;We Need More Reproducibility Content Across the Computer Science Curriculum.&amp;rdquo; Proceedings of the 2023 ACM Conference on Reproducibility and Replicability. 2023.&lt;/em>&lt;/p>
&lt;p>Secondly, as shown in the figure, involving students in reproducibility efforts can have a significant impact on the reproducibility ecosystem itself. Students can create reproducibility artifacts, such as replicable experiments or data analysis, that can be used by other researchers, including authors and graduate students. Additionally, students can consume reproducibility artifacts created by the research community, provide feedback, and suggest improvements. Authors appreciate this type of engagement, as it adds value to their work and promotes open science.&lt;/p>
&lt;h2 id="focusing-on-machine-learning-">Focusing on Machine Learning 🧐&lt;/h2>
&lt;p>Given the growing interest in machine learning and its relevance to reproducibility, our project decided to focus on this area. Machine learning already has a strong culture of reproducibility, with initiatives like &lt;a href="https://paperswithcode.com/" target="_blank" rel="noopener">Papers with Code&lt;/a> and the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">ML Reproducibility Challenge&lt;/a>. These efforts encourage researchers to share their code and reproduce recent machine learning papers, validating their results. By leveraging these existing resources, we can create learning materials that utilize real-world examples and foster hands-on reproducibility experiences for students.&lt;/p>
&lt;h2 id="the-interactive-notebooks-">The Interactive Notebooks 📖&lt;/h2>
&lt;p>We have created two learning materials that focus on machine learning and reproducibility. &lt;strong>The first material&lt;/strong> looks at a paper titled &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">&amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/a> by Jordan T. Ash and Ryan P. Adams. This paper discusses the concept of warm-starting, which involves using weights from a model previously trained on a subset of the dataset to initialize training of a new model. The authors compare the performance of warm-started models with randomly initialized models and find that the warm-started models perform worse, as shown in the figure below.&lt;/p>
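&lt;p>The warm-starting setup can be sketched on a toy problem. This is a minimal illustration of the idea only, not the authors&amp;rsquo; experiment: a simple logistic-regression model (the data and names here are made up) is first trained on half the data, and its weights are reused to initialize training on the full dataset, alongside a freshly initialized baseline. On a small convex problem like this the warm-started model will not necessarily do worse; the paper&amp;rsquo;s effect concerns deep neural networks.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)

def train_logreg(X, y, w0, steps=200, lr=0.1):
    """Plain gradient-descent logistic regression, starting from w0."""
    w = w0.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)      # gradient step
    return w

# Toy data: 200 points, 5 features, linearly separable labels.
X = rng.normal(size=(200, 5))
y = (X @ rng.normal(size=5) > 0).astype(float)

# Warm start: weights trained on the first half of the data...
w_half = train_logreg(X[:100], y[:100], np.zeros(5))
# ...are reused to initialize training on the full dataset,
w_warm = train_logreg(X, y, w_half)
# while the baseline starts from a fresh (here: zero) initialization.
w_cold = train_logreg(X, y, np.zeros(5))

def accuracy(w):
    return float(((X @ w > 0).astype(float) == y).mean())

print("warm:", accuracy(w_warm), "cold:", accuracy(w_cold))
```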
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_5815498dd015ebc84b00505c90a65354.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_df01d4772e731cee04ae4783ac0cc994.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/figure1_hu9d945c7dbee9ef6ef608a89a33d817c5_76602_5815498dd015ebc84b00505c90a65354.webp"
width="760"
height="306"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Our material takes students through the process of identifying the different claims made in the paper and finding the corresponding experiments that support them. They will also learn how to use open-source code and available data to reproduce these experiments and understand the computational complexity associated with reproducing each experiment. This material is available on both &lt;a href="https://github.com/mohammed183/re_warm_start_nn/tree/main" target="_blank" rel="noopener">GitHub&lt;/a> and &lt;a href="https://chameleoncloud.org/experiment/share/5b5717df-9aa9-470f-b393-c1e189c008a8" target="_blank" rel="noopener">Chameleon&lt;/a>, where Chameleon provides the resources needed to run the material.&lt;/p>
&lt;p>&lt;strong>The second material&lt;/strong> examines the paper &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">&amp;ldquo;An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale&amp;rdquo;&lt;/a> by Dosovitskiy et al., which introduces a novel way of applying the transformer architecture, which was originally designed for natural language processing, to image recognition tasks. The paper shows that transformers can achieve state-of-the-art results on several image classification benchmarks, such as ImageNet, when trained on large-scale datasets as shown in the following table.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp 400w,
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_5441e5e4c6ffed9b29244a3a3dcde852.webp 760w,
/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231018-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp"
width="760"
height="354"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Our material guides students through the process of understanding which claims can and cannot be validated based on the available datasets and how complex it can be to validate each claim. Additionally, they will learn how to use pre-trained models to replicate computationally expensive experiments. Again, this material can be found on both &lt;a href="https://github.com/mohammed183/re_vit/tree/main" target="_blank" rel="noopener">GitHub&lt;/a> and &lt;a href="https://chameleoncloud.org/experiment/share/8f0e34c5-d2c4-45be-8425-36686ad57650" target="_blank" rel="noopener">Chameleon&lt;/a>.&lt;/p>
&lt;p>Both materials are designed to be easy to understand and interactive, allowing students to engage with the content and gain a deeper understanding of the concepts. Instructors can use these materials to assess their students&amp;rsquo; understanding of machine learning and reproducibility.&lt;/p>
&lt;h2 id="reflecting-on-the-journey">Reflecting on the Journey&lt;/h2>
&lt;p>As we wrap up our journey of creating beginner-friendly learning materials for machine learning using reproducibility, it&amp;rsquo;s time to reflect on the rewarding experiences and valuable lessons learned along the way. Our deep dive into the world of machine learning and reproducibility not only enriched our knowledge but also provided us with an opportunity to contribute to the community at the &lt;strong>UC Open Source Symposium 2023&lt;/strong> at UCSC.&lt;/p>
&lt;p>The symposium was a memorable event where we presented our work in a poster session. The diversity of the audience, ranging from professors and researchers to students, added depth to our understanding through their valuable feedback and insights. It was intriguing to see the potential applications of our work in various contexts and its capacity to benefit the broader community.&lt;/p>
&lt;p>This project has been a personal journey of growth, teaching me much more than just machine learning and reproducibility. It honed my skills in collaboration, communication, and problem-solving. I learned to distill complex ideas into simple, accessible language and create engaging, interactive learning experiences. The most fulfilling part of this journey has been seeing our work come alive and realizing its potential to positively impact many people. The gratification that comes from creating something useful for others is unparalleled, and we are thrilled to share our materials with the world.&lt;/p>
&lt;p>Your time and interest in our work are greatly appreciated! Hope you enjoyed this blog!&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training -- Final Blog Post</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20231006-haoranwu/</link><pubDate>Fri, 06 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20231006-haoranwu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>For the second half of the project, I spent some time reproducing the figures and then focused on hacking the source code of PyTorch to distinguish the Inter-GPU Computation (1GPU vs. 2GPUs).&lt;/p>
&lt;h4 id="summarization">Summarization&lt;/h4>
&lt;ul>
&lt;li>Finished reproducing Figures 3, 4, 5, and 6 from &lt;a href="https://ospo.ucsc.edu/project/osre23/utexas/gpuemulator" target="_blank" rel="noopener">GPU Emulator for Easy Reproducibility of DNN Training&lt;/a>.&lt;/li>
&lt;li>Explored inter-GPU computation in order to reproduce Figure 9.&lt;/li>
&lt;/ul>
&lt;h4 id="reporsitory-of-reproducing-figures">Reporsitory of Reproducing Figures&lt;/h4>
&lt;p>I have placed the repository of the deliverable here: &lt;a href="https://github.com/FarFlyField/OSRE_DELIVERABLE/tree/main" target="_blank" rel="noopener">https://github.com/FarFlyField/OSRE_DELIVERABLE/tree/main&lt;/a>
You can use the repository on any CPU-only machine, or you can rent a GPU from Chameleon and compare the real GPU&amp;rsquo;s results against the emulator&amp;rsquo;s.&lt;/p>
&lt;p>The repository explains how to set up the experiments and how to interpret the data they produce; you will need to understand the spreadsheet and some of the graphing files.&lt;/p>
&lt;h4 id="study-of-inter-gpu-computation">Study of Inter-GPU Computation&lt;/h4>
&lt;p>I have dissected the PyTorch source code to identify the computation-time differences between using 1 GPU and 2 GPUs (inter-GPU computation time). The most significant difference arises during the forward pass. Here are the main factors that lengthen computation time when using 2 GPUs:&lt;/p>
&lt;ul>
&lt;li>When using 1 GPU, PyTorch puts the training images onto the GPU in the main application once and for all. With 2 GPUs, however, PyTorch transfers the images before each forward pass using a parallel function. The function does two things:
&lt;ul>
&lt;li>It splits the images into sections and puts each section onto its respective GPU.&lt;/li>
&lt;li>It replicates the model onto each GPU and creates threads to train the replicas in parallel.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>These two steps account for much of the extra computation time when training on multiple GPUs, and they also make the measured transfer time for 2 GPUs appear smaller, because the image-transfer time is counted toward computation time.&lt;/li>
&lt;li>After the forward pass finishes, the parallel function gathers the outputs from the two GPUs onto the first GPU.&lt;/li>
&lt;/ul>
&lt;p>After gathering the outputs onto the first GPU, the code trains the next batch, repeating the steps of transferring the data, copying the model, running the parallel forward pass, and gathering the outputs once again.&lt;/p>
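&lt;p>The scatter/replicate/gather pattern described above can be sketched without GPUs. The following is an illustrative NumPy sketch of the pattern only, not PyTorch&amp;rsquo;s actual implementation (which uses device transfers and threads); the function names are hypothetical.&lt;/p>

```python
import numpy as np

def forward(weights, inputs):
    # Stand-in for a model's forward pass: a single linear layer.
    return inputs @ weights

def parallel_forward(weights, batch, n_devices=2):
    # 1) Scatter: split the batch into one chunk per device.
    chunks = np.array_split(batch, n_devices)
    # 2) Replicate: copy the model weights onto every device.
    replicas = [weights.copy() for _ in range(n_devices)]
    # 3) Run forward on each replica (threads in the real implementation).
    outputs = [forward(w, c) for w, c in zip(replicas, chunks)]
    # 4) Gather: concatenate all outputs back onto the "first device".
    return np.concatenate(outputs)

rng = np.random.default_rng(1)
weights = rng.normal(size=(8, 4))
batch = rng.normal(size=(32, 8))

single = forward(weights, batch)          # one-device forward
multi = parallel_forward(weights, batch)  # scatter/replicate/gather
print(np.allclose(single, multi))
```

&lt;p>The gathered result matches the single-device forward pass; the time difference in the real system comes from the extra scatter, replicate, and gather steps, not from a different numerical result.&lt;/p>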
&lt;p>The second significant difference, which I’m working on right now, appears when PyTorch runs the backward functions; these are broadly similar to the forward pass but differ in important ways. I have identified the loss.backward() call in our application code as the only contributor to the difference in computation time. Here are a few tasks I completed after locating it:&lt;/p>
&lt;ul>
&lt;li>Recorded the functions’ call stack when using 1 GPU and 2 GPUs.&lt;/li>
&lt;li>Recorded the time spent in the functions in the call stack of the functions.&lt;/li>
&lt;li>Identified inconsistencies in the measurements, then repeated and verified them until the results were consistent.&lt;/li>
&lt;/ul>
&lt;p>I have finished the basic measurements and drafted the call stack, but I haven’t yet pinpointed the exact differences. Because most of these functions are implemented in C++, printing their inputs for inspection will be somewhat harder, but doable.&lt;/p>
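&lt;p>The per-function timing described above can be approximated with a small wrapper that records how long each call takes. This is a hedged stand-in: it times an ordinary CPU function with time.perf_counter, whereas real GPU measurements would additionally need to synchronize the device before reading the clock. The function names are illustrative.&lt;/p>

```python
import time
from collections import defaultdict

durations = defaultdict(list)

def timed(fn):
    """Record the wall-clock time of each call, keyed by function name.
    (Real GPU timing would need a device synchronization first.)"""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations[fn.__name__].append(time.perf_counter() - start)
        return result
    return wrapper

@timed
def backward_step(n):
    # Stand-in for loss.backward(); burns a little CPU time.
    return sum(i * i for i in range(n))

# Repeat the measurement several times to check consistency,
# as described in the blog post.
for _ in range(5):
    backward_step(10_000)

mean = sum(durations["backward_step"]) / len(durations["backward_step"])
print(f"backward_step mean: {mean:.6f}s over {len(durations['backward_step'])} runs")
```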
&lt;p>The data recorded and analyzed are placed here:
&lt;a href="https://docs.google.com/spreadsheets/d/1vFj-UE3mjtsHIc5OesKX1sDvr6fpPwtPUMl0pM3V8SA/edit?usp=sharing" target="_blank" rel="noopener">https://docs.google.com/spreadsheets/d/1vFj-UE3mjtsHIc5OesKX1sDvr6fpPwtPUMl0pM3V8SA/edit?usp=sharing&lt;/a>
Summarized doc:
&lt;a href="https://docs.google.com/document/d/10XWNwCZ3kLzy4i6WgJ6KPsujEs2X1gzXblZtUoqMuJw/edit" target="_blank" rel="noopener">https://docs.google.com/document/d/10XWNwCZ3kLzy4i6WgJ6KPsujEs2X1gzXblZtUoqMuJw/edit&lt;/a>&lt;/p></description></item><item><title>Learning Machine Learning by Reproducing Vision Transformers</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/</link><pubDate>Fri, 06 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/</guid><description>&lt;p>Hello again!&lt;/p>
&lt;p>In this blog post, I will be discussing the second material I created for the 2023 Summer of Reproducibility Fellowship. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed">first post&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. My goal is to create interactive open-source educational resources that teach reproducibility and reproducible research in machine learning (ML), as outlined in my &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>In this post, I will share with you my second material, and how it can be helpful in machine learning class to teach students about vision transformers and reproducibility at the same time. If you haven&amp;rsquo;t seen my first work, be sure to check out my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed">previous blog post&lt;/a>. Without further ado, let&amp;rsquo;s dive in!&lt;/p>
&lt;h2 id="reproducing-an-image-is-worth-16x16-words-transformers-for-image-recognition-at-scale">Reproducing “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”&lt;/h2>
&lt;p>This material is a reproduction of Dosovitskiy et al.‘s 2020 paper, &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”&lt;/a>. This paper introduces the Vision Transformer (ViT), a novel architecture that applies the transformer model, originally designed for natural language processing tasks, to image recognition. The ViT model achieves state-of-the-art performance on several image classification benchmarks, demonstrating the potential of transformers for computer vision tasks.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_485ae1a0cccbdc73994be22901c125d5.webp 400w,
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_f8af78acab4a91489ecff3308bc9c9c1.webp 760w,
/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/ViT_hud9ed20979bb56dae4d8e9f4231875a17_383197_485ae1a0cccbdc73994be22901c125d5.webp"
width="760"
height="229"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The figure illustrates the key idea behind ViT, which is to treat an image as a sequence of patches, similar to how a transformer treats a sentence as a sequence of words. Each patch is flattened into a vector and fed into the transformer encoder, which learns to capture the complex relationships between these patches. The resulting representation is then fed into an MLP head, which produces a final prediction for the image. This approach allows ViT to handle large input images and capture both global context and fine-grained details. ViT models can also be pre-trained on large datasets and fine-tuned on smaller datasets for improved performance.&lt;/p>
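&lt;p>The &amp;ldquo;image as a sequence of patches&amp;rdquo; step can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the patching idea only, not the paper&amp;rsquo;s code: a 224x224 RGB image with 16x16 patches yields 14*14 = 196 tokens of dimension 16*16*3 = 768.&lt;/p>

```python
import numpy as np

def image_to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    mirroring ViT's "an image is a sequence of patches" idea."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    rows, cols = h // patch, w // patch
    # Reshape into (rows, patch, cols, patch, C), reorder so each patch
    # is contiguous, then flatten every patch into one vector.
    x = image.reshape(rows, patch, cols, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return x.reshape(rows * cols, patch * patch * c)

# A 224x224 RGB image becomes a sequence of 196 tokens of dimension 768.
img = np.random.default_rng(0).normal(size=(224, 224, 3))
tokens = image_to_patches(img)
print(tokens.shape)  # (196, 768)
```

&lt;p>Each row of the result plays the role of a &amp;ldquo;word&amp;rdquo; fed to the transformer encoder, after a learned linear projection and position embeddings are added.&lt;/p>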
&lt;p>To reproduce this paper, I followed a systematic approach to ensure reliable results:&lt;/p>
&lt;ul>
&lt;li>Critically analyze the paper&amp;rsquo;s qualitative and quantitative claims.&lt;/li>
&lt;li>Identify the necessary experiments to verify each claim.&lt;/li>
&lt;li>Determine the required data, code, and hyperparameters for each experiment.&lt;/li>
&lt;li>Utilize pre-trained models for validating claims that require high computational resources.&lt;/li>
&lt;li>Investigate resources shared by the authors, such as code, data, and models.&lt;/li>
&lt;li>Assess the feasibility of verifying different types of claims.&lt;/li>
&lt;li>Design new experiments for validating qualitative claims when certain models or datasets are unavailable.&lt;/li>
&lt;/ul>
&lt;p>I utilized &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> as my platform for conducting and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental environment that supports computer science systems research. It enables users to create and share Jupyter notebooks capable of running Python code on Chameleon’s cloud servers. For this work, a GPU with 24 GB or more of memory is required to run the notebooks, and Chameleon offers several GPU types that meet this requirement.&lt;/p>
&lt;p>I have set up a &lt;a href="https://github.com/mohammed183/re_vit" target="_blank" rel="noopener">GitHub repository&lt;/a> where you can access all of my reproduction work. The repository contains interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research. These notebooks provide a hands-on approach to understanding the concepts and techniques presented in my reproduction work.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Reproducing a paper can be a challenging task, and I encountered several obstacles during the process, including:&lt;/p>
&lt;ul>
&lt;li>The unavailability of pretraining datasets and pretrained models&lt;/li>
&lt;li>Inexact or unspecified hyperparameters&lt;/li>
&lt;li>The need for expensive resources for some hyperparameters&lt;/li>
&lt;li>The use of different frameworks for baseline CNNs and Vision Transformers&lt;/li>
&lt;/ul>
&lt;p>These issues posed significant difficulties in replicating the following table, a key result from the Vision Transformer paper that demonstrates its superiority over prior state-of-the-art models.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp 400w,
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_5441e5e4c6ffed9b29244a3a3dcde852.webp 760w,
/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20231006-msaeed/table1_hu639b2ac18dac1313dd35f10cc0ae8db7_237634_7faf7451ff08ac87e7d12ab941c77f8e.webp"
width="760"
height="354"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To overcome these challenges, I used the same models mentioned in the paper but pretrained on different datasets, experimented with various hyperparameter combinations to achieve the best results, and wrote my own code to ensure that both the baseline and Vision Transformer were fine-tuned using the same framework. I also faced other challenges, which I discussed in my notebooks along with the solutions I applied.&lt;/p>
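&lt;p>The hyperparameter exploration can be sketched as a simple grid search driven by one shared evaluation interface for both model families. Everything below is hypothetical: the scoring function merely stands in for an expensive fine-tune-and-evaluate run, and its shape (accuracy peaking near one learning rate) is made up for illustration. It only shows the workflow of trying combinations and keeping the best.&lt;/p>

```python
import itertools

def validation_accuracy(model, lr, weight_decay):
    """Hypothetical stand-in for fine-tuning `model` with these
    hyperparameters and evaluating it on a validation set; the shape
    of the curve (peak near lr=1e-3) is invented for illustration."""
    base = {"vit": 0.80, "resnet": 0.75}[model]
    return base - abs(lr - 1e-3) * 10 - weight_decay * 0.1

# Candidate hyperparameter values to try for every model.
grid = {"lr": [1e-4, 1e-3, 1e-2], "weight_decay": [0.0, 1e-4]}

best = {}
for model in ("vit", "resnet"):
    # The same loop (one shared "framework") evaluates both model families,
    # so the comparison between baseline and ViT stays fair.
    results = {
        combo: validation_accuracy(model, *combo)
        for combo in itertools.product(grid["lr"], grid["weight_decay"])
    }
    best[model] = max(results, key=results.get)

print(best)
```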
&lt;h2 id="how-to-use-this-material">How to use this material?&lt;/h2>
&lt;p>This material consists of a series of notebooks that guide you through the paper, its claims, experiments, and results. You will learn how to analyze, interpret, and validate the authors&amp;rsquo; claims. To get started, I recommend briefly skimming the &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">original paper&lt;/a> to gain an understanding of the main ideas and public information. This will help you see how the authors could have been more transparent and clear in certain sections. The notebooks provide clear instructions and explanations, as well as details on how I addressed any missing components.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In this blog post, I&amp;rsquo;ve walked you through the contents of this material and the insights users can gain from it. This material is particularly intriguing as it replicates a paper that has significantly influenced the field of computer vision. The interactive nature of the material makes it not only educational but also engaging and enjoyable. I believe users will find this resource both fun and beneficial.&lt;/p>
&lt;p>I hope you found this post informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for reading and stay tuned for more updates!&lt;/p></description></item><item><title>Final GSoC Blog - Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230925-kirandeol/</link><pubDate>Mon, 25 Sep 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230925-kirandeol/</guid><description>&lt;p>As I send in my final work submission for the final GSoC evaluation, I&amp;rsquo;m excited to share with you the progress we&amp;rsquo;ve made this summer (and future plans for Polyglot!). You can view the repository and web app here: &lt;a href="https://polyphyhub.github.io/PolyGlot/" target="_blank" rel="noopener">https://polyphyhub.github.io/PolyGlot/&lt;/a>. As a quick reminder of the project, we sought to extend the Polyglot web app, as developed by Hongwei (Henry) Zhou. For context, the web app follows this methodology:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Given a set of words, use an embedding model (such as Word2Vec, BERT, etc.) to generate a set of high dimensional points associated with each word.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use a dimensionality reduction method (such as UMAP) to reduce the dimensionality of each word-vector point to 3 dimensions&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Use the novel MCPM (Monte Carlo Physarum Machine) to compute the similarities between a set of anchor points and the rest of the point cloud. You could use any similarity metric here, too, such as the Euclidean distance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The web app then displays the point cloud of 3-dimensional embeddings, but uses coloring to indicate the level of MCPM similarity each word has with the anchor point (e.g., if the anchor point is the word “dog”, the rest of the point cloud is colored such that words identified as similar to “dog” by the MCPM metric are brighter, whereas dissimilar words are darker).&lt;/p>
&lt;/li>
&lt;/ol>
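&lt;p>Steps 3 and 4 above can be sketched with Euclidean distance standing in for the MCPM metric, as the methodology explicitly allows. The words and 3-D coordinates below are made up for illustration; the brightness value models the anchor-similarity coloring the web app displays.&lt;/p>

```python
import numpy as np

def similarity_brightness(points, anchor_idx):
    """Color a 3-D point cloud by similarity to an anchor point.
    Euclidean distance stands in for the MCPM metric here;
    brighter (closer to 1.0) means more similar to the anchor."""
    anchor = points[anchor_idx]
    dist = np.linalg.norm(points - anchor, axis=1)
    # Map distance to [0, 1] brightness: the anchor itself is brightest.
    return 1.0 - dist / dist.max()

# Hypothetical 3-D embeddings of five words, with "dog" as the anchor.
words = ["dog", "puppy", "cat", "car", "train"]
points = np.array([[0.0, 0.0, 0.0],
                   [0.2, 0.1, 0.0],
                   [0.5, 0.4, 0.2],
                   [3.0, 2.5, 1.0],
                   [3.2, 2.8, 1.1]])

brightness = similarity_brightness(points, anchor_idx=words.index("dog"))
for word, b in zip(words, brightness):
    print(f"{word}: {b:.2f}")
```

&lt;p>Words near the anchor (&amp;ldquo;puppy&amp;rdquo;) come out bright; distant ones (&amp;ldquo;car&amp;rdquo;, &amp;ldquo;train&amp;rdquo;) come out dark, which is the visual effect described in step 4.&lt;/p>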
&lt;p>The main results since the last blog are summarized as follows:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Novel timeline feature in which users can track the importance of certain words over time by watching the change in size of points (computes the TF-IDF metric for a word across all documents in a given year). Uses linear interpolation for years which do not have an explicit importance score.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>An industrial collaboration with UK startup Lautonomy, where we have pre-processed and entered their data into Polyglot. Pre-processing consisted of first computing a high dimensional embedding of their set of words using OpenAI&amp;rsquo;s CLIP model &lt;a href="https://openai.com/research/clip" target="_blank" rel="noopener">https://openai.com/research/clip&lt;/a> and the CLIP-as-service Python package &lt;a href="https://clip-as-service.jina.ai" target="_blank" rel="noopener">https://clip-as-service.jina.ai&lt;/a>. Next, we used UMAP to reduce the dimensionality of these embeddings to 3D. We computed the Euclidean distance on this data (in place of MCPM metric). Finally, we formatted the data to enter into Polyglot.&lt;/p>
&lt;/li>
&lt;/ol>
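&lt;p>The timeline feature&amp;rsquo;s interpolation step can be sketched with np.interp. The years and importance scores below are hypothetical; the point is only that years without an explicit TF-IDF score receive a linearly interpolated value between their nearest known neighbors.&lt;/p>

```python
import numpy as np

# Hypothetical TF-IDF importance of one word, known only for some years.
known_years = np.array([2010, 2014, 2020])
known_scores = np.array([0.10, 0.45, 0.30])

# Linearly interpolate a score for every year on the timeline, as the
# timeline feature does for years lacking an explicit importance score.
timeline = np.arange(2010, 2021)
scores = np.interp(timeline, known_years, known_scores)

for year, score in zip(timeline, scores):
    print(int(year), round(float(score), 3))
```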
&lt;p>Although the app has developed a lot over the summer, we are planning to continue working on Polyglot, particularly with respect to one of our original goals: to set up a pipeline from PolyPhy to Polyglot. Unfortunately, with PolyPhy undergoing refactoring this summer, we weren&amp;rsquo;t able to set this pipeline up. However, that is one of our goals for the next few months. We are also moving forward with the industrial collaboration with legal analytics startup Lautonomy. We hope to release an output together soon!&lt;/p>
&lt;p>If you&amp;rsquo;re curious about Polyglot or are interested in getting involved, please feel free to reach out to me, Oskar Elek, and Jasmine Otto!&lt;/p></description></item><item><title>noWorkflow as an experiment management tool - Final Report</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230914-jesselima/</link><pubDate>Thu, 14 Sep 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230914-jesselima/</guid><description>&lt;p>This post describes our midterm work status and some achievements we
have made so far in our project
&lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>
for
&lt;a href="https://ospo.ucsc.edu/project/osre23/nyu/noworkflow" target="_blank" rel="noopener">noWorkflow&lt;/a>.&lt;/p>
&lt;p>For a more friendly introduction to our work, please, refer to this
&lt;a href="https://github.com/jaglima/noworkflow_usecase/blob/main/README.md" target="_blank" rel="noopener">tutorial
available&lt;/a>.&lt;/p>
&lt;p>Our final code to merge is available in &lt;a href="https://github.com/jaglima/noworkflow/tree/sor_features" target="_blank" rel="noopener">this repository&lt;/a>.&lt;/p>
&lt;h2 id="different-ways-of-managing-experiments">Different ways of managing experiments&lt;/h2>
&lt;p>From our starting point at the midterm, and from our initial aspirations
for the SoR, we kept on track with the goal of adding features to
noWorkflow related to managing DS/ML experimental setups focusing on
reproducibility.&lt;/p>
&lt;p>With the emergence of AI across multiple fields in industry and
academia, the subject of reproducibility has become increasingly
relevant. In [1] we have an
interesting description of the sources of irreproducibility in Machine
Learning. All these sources are present at different stages of a
project's experimental phases and may even persist in production
environments, leading to the accumulation of technical debt
[2]. The problem of
irreproducibility is also discussed in [[3],
[4]], which point out that
delivery speed usually comes at the expense of reproducibility,
among other things.&lt;/p>
&lt;p>The CRISP-DM process as reviewed in
[5] demonstrates that Data
Science experiments follow a typical path of execution. In the same
manner, [[3], [6],
[7]] point out that
Machine Learning pipelines are composed of well-defined layers (or
stages) throughout their lifecycle. The emergence of AI in real-world
applications has stressed the almost artisanal ways of creating and managing
analytical experiments and reinforced that there is room to make things
more efficient.&lt;/p>
&lt;p>In the search for possible approaches to the problem, we came across
several projects that aimed to address these issues. Not surprisingly,
multiple authors pursued the same goal, for instance [[9],
[10]]. In these references,
and confirmed in our survey, we found everything from solutions targeted
at specific modeling steps to services aiming for end-to-end AIOps
management. Some are available as software packages, others as SaaS in
cloud environments. In general terms, all of them end up offering
features in different layers of the workflow (i.e. data, feature,
scoring, and evaluation) or with different conceptualizations of
reproducibility/replicability/repeatability, as noticed by
[11]. On one hand, this lack of
standards makes any assessment difficult. On the other hand, it suggests
a community still in an exploratory phase around a hot topic.&lt;/p>
&lt;p>Specifically for this project, our focus is on the initial stages of
computational scientific experiments. As studied in [8], in this
phase experiments are i) implemented by people as prototypes, ii) built
with little focus on pipeline design, and iii) developed in tools like
notebooks, which mix documentation, visualization, and code with no
required sequential structure. These three practices impact
reproducibility and efficiency and are prone to creating technical debt.
However, tools like noWorkflow show huge potential in such scenarios.
noWorkflow is promising because it i) demands minimal setup to be
functional, ii) works well with almost nonexistent workflows,
iii) requires minimal intrusive code alongside the experimental code,
and iv) integrates well with notebooks, which are the typical artifact
in these experiments.&lt;/p>
&lt;p>According to its core team, the primary goal of noWorkflow is to
&amp;quot;...allow scientists to benefit from provenance data analysis even
when they don't use a workflow system.&amp;quot;. Unlike other tools,
&amp;quot;noWorkflow captures provenance from Python scripts without needing a
version control system or any other environment&amp;quot;. It is particularly
interesting when we are in the scenario described above, where we lack
any structured system at the beginning of experiments. In fact, after
going through the docs, we can verify that noWorkflow provides:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Command-line accessibility&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Seamless integration with Jupyter Notebooks&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Minimal setup requirements in your environment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Elimination of the need for virtual machines or containers in its
setup&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Workflow-free operation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Open source license&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Framework-agnostic position&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Finally, our research confirmed that there is an open spot in the
management of scientific experiments that reproducibility needs to
occupy. Provenance tools can help academia and industry groups
toward this goal, and this summer we focused on adding relevant
features to push noWorkflow in this direction.&lt;/p>
&lt;h2 id="different-tools-for-different-needs">Different tools for different needs&lt;/h2>
&lt;p>In our research phase, we didn't find any taxonomy that fully
accommodated our review of the different categories of tools providing
reproducibility and experimental management. So, we describe some
tools in the following categories (freely adapted from these online
references
&lt;a href="https://ml-ops.org/content/mlops-principles" target="_blank" rel="noopener">[here]&lt;/a> and
&lt;a href="https://ambiata.com/blog/2020-12-07-mlops-tools/" target="_blank" rel="noopener">[here]&lt;/a>):&lt;/p>
&lt;p>&lt;strong>Data and Pipeline Versioning&lt;/strong>: Platforms dealing with the ingestion,
processing, and exposure of features for model training and inference.
They enable collaboration and discoverability of already existing
Feature Sets throughout teams and organizations, and provide provenance
and lineage for data at different levels of complexity.&lt;/p>
&lt;p>&lt;strong>Metadata Stores/Experiment Trackers&lt;/strong>: These are specifically built to
store metadata about ML experiments and expose it to stakeholders. They
help with debugging, comparing, and collaborating on experiments. It is
possible to divide them into Experiment Trackers and Model Registries.
Moreover, there are projects offering reproducibility features like
hyperparameter search, experiment versioning, etc. However, they demand
more robust workflows and are better suited for projects in the
production/monitoring phases.&lt;/p>
&lt;p>&lt;strong>Pipeline frameworks&lt;/strong>: These operate within the realm of production,
similar to Data Engineering workflows. Their usual goal is to allow any
ML/AI product to be served across a wide range of architectures, and to
integrate all the low-hanging fruit along the way: for instance,
pipelines adding hyperparameter optimization tasks, experiment tracking
integrations, boilerplate containerized deployment, etc.&lt;/p>
&lt;p>&lt;strong>Deployment and Observability&lt;/strong>: These focus on deploying models for
real-time inference and monitoring model quality once they are deployed
in production. Their aim is to facilitate post-deployment control tasks
such as monitoring feature drift, conducting A/B testing, facilitating
fast model shifts, and more.&lt;/p>
&lt;p>The most remarkable aspect of this survey is that there are different
tools for different phases in the life cycle of AI products. Tools
like DVC and Pachyderm are Metadata Stores, allowing
Experiment Tracking with variable-tagging features as well as Data
and Pipeline tracking. They are the tools most similar to noWorkflow in
functionality. However, DVC has a more complex framework for
dealing with different 'types' of tags, and relies on command-line
tools to extract and analyze tagged variables. It also depends strongly
on git and replicates git's logic. Pachyderm requires a more
sophisticated setup at the start, relying on containers and a server. This
is an obstacle for small and lean prototypes, requiring the installation
of a Docker image and all the friction of managing it.&lt;/p>
&lt;p>There are other tools, like MLflow and Neptune, that position themselves as
Model Experiment Versioning with Monitoring and Deployment features.
They also have elements of pipeline frameworks, offering full
integration and boilerplate for seamless integration with cloud
platforms.&lt;/p>
&lt;p>Pipelines are a vast field. Examples include AWS SageMaker, Google Vertex,
DataRobot, and Weights &amp;amp; Biases, among others. All of them offer features
in all categories, with a strong focus on exploring every
automation that can be offered to the final user: automatic
parameter tuning, model selection, retraining, data lineage, metadata
storing, etc.&lt;/p>
&lt;p>Finally, Deployment and Observability frameworks are in the deployment
realm, which is another stage far removed from the prototypical phases of
experiments. They come into the scene when all experimental and
inferential processes are done, and there is an AI artifact that needs
to be deployed and monitored. Tools like Seldon, H2O, and DataRobot do
this job, again with some features for hyperparameter tuning, pipeline
frameworks, and data and pipeline tracking.&lt;/p>
&lt;p>In light of this, when considering the management and operation of
experiments, we have a reduced sample of alternatives. Among them,
Notebook integration/management is rare. Some rely on other
tools like Git or impose an overhead in coding/setup with reserved
keywords, tags, and managerial workflows that hinder the process.&lt;/p>
&lt;p>At first sight, our &amp;quot;informal&amp;quot; taxonomy positions noWorkflow as a
Data/Pipeline Versioning and Metadata Store/Experiment Tracker tool. It is
not a Pipeline Framework, which works like a building block facilitating
the integration of artifacts at production stages. It is not a
Deployment and Observability framework, because those live in the
post-deployment realm, which is another stage far removed from the
prototypical phases of experiments.&lt;/p>
&lt;h2 id="desiderata">Desiderata&lt;/h2>
&lt;p>As mentioned earlier, a typical workflow in DS/ML projects is well
described by CRISP-DM [5]
and precedes the deployment and production phases in the whole lifecycle
of DS/ML projects.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image1.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Fig 1: CRISP-DM example of trajectory through a data science project&lt;/p>
&lt;p>Briefly speaking, a workflow starts when a user creates a Jupyter
Notebook and starts writing code. Usually, they import or select
data from a source, explore the features expected to have the
highest inference potential, tune some parameters to set up
training, then train the model and evaluate its predictive power through
different metrics. At this final step, we have delineated a trial. This
trial's result can suggest further improvements and new hypotheses about
data, features, model types, and hyperparameters. Then, we have a new
experiment in mind that will result in a new trial.&lt;/p>
&lt;p>When this process repeats multiple times, a researcher may end up with
different notebooks, each storing a different experiment. Each
notebook has multiple hyperparameters, modeling choices, and modeling
hypotheses. Alternatively, the experimenter may have a single notebook where
different experiments were executed in a nonlinear order across the
cells. This latter case is pointed out in
[8], where Notebook flexibility
makes it difficult to understand which execution order resulted in a
specific output.&lt;/p>
&lt;p>Ideally, any researcher or team would benefit most if
they could:&lt;/p>
&lt;p>a) In a running Notebook, retrieve all the operations
that contributed to the result of a variable of interest. In this
case, modifications applied to the inputs or to the order of
operations would be easily detectable, as would any
nonlinear execution that interferes with a control result.&lt;/p>
&lt;p>b) Compare trials across different experiments. After experimenting with
different hypotheses about hyperparameters, features, or operation
order, the user should be able to easily compare the history of two trials
and spot differences.&lt;/p>
&lt;p>c) Retrieve a target variable across the different trials executed
in the context of an experiment. After carrying out multiple
experimental trials, users should be able to compare results
that are stored in different Notebooks (or even elsewhere).&lt;/p>
&lt;p>d) Be as &amp;quot;no workflow&amp;quot; as possible. All the former requisites
should be possible with minimal code intervention, tags, reserved
words, or any other active coding effort.&lt;/p>
&lt;p>With these goals in mind, we worked on our deliverables and used the
experiment carried out by [12]
as a guideline to validate the new noWorkflow features.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;p>In this section, we describe what we implemented during this
summer.&lt;/p>
&lt;p>We started with tagging cells and variables and then navigating through
their dependencies, i.e., all the other variables and function calls that
contributed to their final values. This was a fundamental step that allowed
us to evolve toward features that are really useful in day-to-day
practice.&lt;/p>
&lt;p>From the features of tagging a cell and tagging a variable, we evolved
to the following features (an interactive notebook is available here):&lt;/p>
&lt;ul>
&lt;li>&lt;em>backwards_deps('var_name', granularity_level)&lt;/em> : returns a
dictionary storing operations/function calls and their associated
values that contributed to the final value of the tagged variable.
granularity_level controls whether the internal operations of
functions are included or not.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image5.png" alt="backwards_deps" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
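&lt;p>To make the idea concrete, here is a toy sketch of how one might walk such a dependency structure. This is purely illustrative: the dictionary layout and the &lt;em>backwards_deps_sketch&lt;/em> helper are invented for this example and are not noWorkflow's actual data format or API.&lt;/p>

```python
# Toy sketch of a backwards-dependency lookup. Illustrative only:
# noWorkflow's backwards_deps returns its own (richer) structure.
def backwards_deps_sketch(var, graph):
    """Collect every operation that contributed to `var`'s final value.

    `graph` maps a variable name to (operation_string, input_variables).
    """
    ops = {}

    def visit(name):
        if name not in graph or name in ops:
            return
        operation, inputs = graph[name]
        ops[name] = operation
        for dep in inputs:
            visit(dep)

    visit(var)
    return ops

# A made-up mini-experiment: score depends on model and X; model on X and y.
graph = {
    "score": ("evaluate(model, X)", ["model", "X"]),
    "model": ("fit(X, y)", ["X", "y"]),
    "X": ("load('features.csv')", []),
    "y": ("load('labels.csv')", []),
}
print(backwards_deps_sketch("score", graph))
```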
&lt;ul>
&lt;li>
&lt;p>&lt;em>global_backwards_deps&lt;/em>('var_name', granularity_level) : does the
same as backwards_deps, but over all tagging and
re-tagging events in the notebook. It allows retrieval of the
complete operation history of a tagged variable across all executed cells in
the notebook.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>store_operations(trial_id, dictionary_ops)&lt;/em> : saves the current
trial in order to make further comparisons with other experiments.
The dictionaries aren't stored in &lt;em>.noworkflow/db.sqlite&lt;/em>, but
in a shelve object named &lt;em>ops.db&lt;/em> in the notebook's local
folder.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>resume_trials()&lt;/em> : to support the management of experiments, the
user can see the trial_ids of all experiments stored in ops.db that are
available for comparison/analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;em>trial_intersection_diff(trial_id1, trial_id2)&lt;/em> : all mutual
variables/function_calls between two experiments have their scalar
values compared.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image2.png" alt="trial_intersection_diff" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
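&lt;p>As a rough illustration of this storage scheme, the snippet below persists trial dictionaries in a shelve file using only the standard library. The function names mirror the features above, but noWorkflow's actual implementation and schema may differ.&lt;/p>

```python
import os
import shelve
import tempfile

# Minimal sketch of persisting trial dictionaries in a shelve file,
# mimicking the ops.db idea described above (illustrative only;
# noWorkflow's actual schema may differ).
def store_operations(db_path, trial_id, dictionary_ops):
    with shelve.open(db_path) as db:
        db[str(trial_id)] = dictionary_ops

def resume_trials(db_path):
    with shelve.open(db_path) as db:
        return sorted(db.keys())

# Usage: store two trials, then list the ids available for comparison.
db_path = os.path.join(tempfile.mkdtemp(), "ops")
store_operations(db_path, 1, {"accuracy": 0.91, "lr": 0.10})
store_operations(db_path, 2, {"accuracy": 0.93, "lr": 0.01})
print(resume_trials(db_path))
```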
&lt;ul>
&lt;li>&lt;em>trial_diff(trial_id1, trial_id2)&lt;/em> : the values of variables and
function calls are exhibited in a diff file format, emphasizing the
order of operations. The goal here is to show whether the order of
operations differed between the two experiments. Again, only
scalar values are exhibited; more complex data structures (matrices,
vectors, tensors, etc.) are only signaled as &lt;em>'complex_type'&lt;/em>&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image3.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
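&lt;p>The diff idea can be sketched with the standard library's difflib, collapsing non-scalar values to 'complex_type' as described above. This is an illustrative approximation, not noWorkflow's actual output format.&lt;/p>

```python
import difflib

# Sketch of diffing two trials' operation sequences, collapsing
# non-scalar values to 'complex_type' (illustrative; the real
# trial_diff output format may differ).
def render(ops):
    lines = []
    for name, value in ops:
        shown = value if isinstance(value, (int, float, str, bool)) else "complex_type"
        lines.append(f"{name} = {shown}")
    return lines

# Made-up trials: same operations, different order and values.
trial_1 = [("X", [1, 2, 3]), ("lr", 0.1), ("accuracy", 0.91)]
trial_2 = [("lr", 0.01), ("X", [1, 2, 3]), ("accuracy", 0.93)]

diff = list(difflib.unified_diff(render(trial_1), render(trial_2),
                                 "trial_1", "trial_2", lineterm=""))
print("\n".join(diff))
```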
&lt;ul>
&lt;li>&lt;em>var_tag_plot('var_name')&lt;/em> : charts the evolution of a given
variable across multiple trials in the database. In this case, all
experiments stored in ops.db and tagged as &lt;em>target_var&lt;/em> have their
values plotted.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image4.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/blockquote>
&lt;ul>
&lt;li>&lt;em>var_tag_values('var_name') :&lt;/em> provides access to a pandas DataFrame
of var_name entries with corresponding values across different trials.&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/media/image6.png" alt="" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
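&lt;p>Conceptually, this amounts to collecting one row per trial for the tagged variable. The sketch below assembles such a table with plain dictionaries instead of the pandas DataFrame that noWorkflow provides; the trial contents are made up for illustration.&lt;/p>

```python
# Sketch of retrieving a tagged variable across stored trials as rows,
# instead of the pandas DataFrame noWorkflow provides (trial contents
# here are made up for illustration).
trials = {
    "1": {"accuracy": 0.91, "lr": 0.10},
    "2": {"accuracy": 0.93, "lr": 0.01},
}

def var_tag_values_sketch(var_name, trials):
    return [
        {"trial_id": tid, var_name: ops[var_name]}
        for tid, ops in sorted(trials.items())
        if var_name in ops
    ]

rows = var_tag_values_sketch("accuracy", trials)
print(rows)
```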
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>As expected, we had unexpected findings along the project. Below, we
delve into the most significant challenges we had to face:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Jupyter notebooks allow nonlinear execution of small parts of code
through cells. More than once, we had to align on how to design
functionality to handle unexpected scenarios. One example was the
backwards_deps() and global_backwards_deps() functions: the latter
was born to cover the case where the user wants all dependencies
rather than only the local cell dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Despite the high quality of the current version of the package, the
project lacks documentation, which slows down the analysis of any
new development. In this project, the aid of mentors was crucial at
points where deeper knowledge was needed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>What is the vocation of noWorkflow? At some points in the project,
we had to discuss forcing some kind of workflow on the user, which
would go against the philosophy of the project.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When working on comparing results, especially in DS/ML fields,
complex types arise. Numerical vectors, matrices, and tensors from
NumPy and other frameworks, as well as data frames, can't be
properly manipulated based on our current approach.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The dilemma of focusing on graphic visual features versus more
sophisticated APIs. More than once, we needed to choose between
making a visual add-on to Jupyter or implementing a more complete
API.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The current version of Jupyter support in noWorkflow doesn&amp;rsquo;t
integrate well with JupyterLab. Also, IPython itself keeps releasing
new versions, and noWorkflow needs to adapt to them.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="future-improvements">Future Improvements&lt;/h2>
&lt;p>Given our current achievements and the insights gained throughout the
project, we would highlight the following points as crucial
roadmap improvements:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Add complex-type treatment for comparisons. Today, visualizing and
navigating through matrices, data frames, and tensors isn't possible
with noWorkflow, although users can do so by their own means.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Migrate the dictionaries storing sequences of operations from
shelve objects to a more efficient storage and retrieval mechanism.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Make it easier for users to manage (store, retrieve, and navigate)
through different trials.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add graphical management instead of relying upon API calls only.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Evolve the feature of tagging cells.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>When tagging a model, save its binary representation to be recovered
in the future.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add the capability of tracking local dataset reads.
Currently, it is possible to track changes in the name/path of the
dataset; however, modifications to the contents of a dataset are
not traceable.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="what-ive-learned">What I've learned&lt;/h2>
&lt;p>This was a great summer with two personal discoveries. The first was
my first formal contact with the subject of Reproducibility. The second was
fully contributing to an Open Source project. In the research phase,
I got in touch with the state of the art of reproducibility
research and some of its nuances. In the Open Source contributing
experience, I was mentored by the core team of noWorkflow and
exercised all the skills required to build high-quality software.&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I would like to thank the Summer of Reproducibility organization for
providing this wonderful opportunity for interested people to engage with
Open Source software. Also, thanks to the core team of noWorkflow for
supporting me in doing this work.&lt;/p>
&lt;h2 id="bibliography">Bibliography&lt;/h2>
&lt;p>[1] [O. E. Gundersen, K. Coakley, C. Kirkpatrick, and Y. Gil, &amp;ldquo;Sources
of irreproducibility in machine learning: A review,&amp;rdquo; &lt;em>arXiv preprint
arXiv:2204.07610&lt;/em>.]&lt;/p>
&lt;p>[2] [D. Sculley &lt;em>et al.&lt;/em>, &amp;ldquo;Machine Learning: The High Interest Credit
Card of Technical Debt,&amp;rdquo; in &lt;em>SE4ML: Software Engineering for Machine
Learning (NIPS 2014 Workshop)&lt;/em>,
2014.]&lt;/p>
&lt;p>[3] [P. Sugimura and F. Hartl, &amp;ldquo;Building a reproducible machine
learning pipeline,&amp;rdquo; &lt;em>arXiv preprint arXiv:1810.04570&lt;/em>,
2018.]&lt;/p>
&lt;p>[4] [D. Sculley &lt;em>et al.&lt;/em>, &amp;ldquo;Hidden technical debt in machine learning
systems,&amp;rdquo; &lt;em>Adv. Neural Inf. Process. Syst.&lt;/em>, vol. 28,
2015.]&lt;/p>
&lt;p>[5] [F. Martínez-Plumed &lt;em>et al.&lt;/em>, &amp;ldquo;CRISP-DM twenty years later: From
data mining processes to data science trajectories,&amp;rdquo; &lt;em>IEEE Trans. Knowl.
Data Eng.&lt;/em>, vol. 33, no. 8, pp. 3048&amp;ndash;3061,
2019.]&lt;/p>
&lt;p>[6] [N. A. Lynnerup, L. Nolling, R. Hasle, and J. Hallam, &amp;ldquo;A Survey on
Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on
Real-World Robots,&amp;rdquo; in &lt;em>Proceedings of the Conference on Robot
Learning&lt;/em>, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., in
Proceedings of Machine Learning Research, vol. 100. PMLR, 30 Oct--01
Nov 2020, pp. 466&amp;ndash;489.]&lt;/p>
&lt;p>[7] [A. Masood, A. Hashmi, A. Masood, and A. Hashmi, &amp;ldquo;AIOps:
predictive analytics &amp;amp; machine learning in operations,&amp;rdquo; &lt;em>Cognitive
Computing Recipes: Artificial Intelligence Solutions Using Microsoft
Cognitive Services and TensorFlow&lt;/em>, pp. 359&amp;ndash;382,
2019.]&lt;/p>
&lt;p>[8] [J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire,
&amp;ldquo;Understanding and improving the quality and reproducibility of Jupyter
notebooks,&amp;rdquo; &lt;em>Empirical Software Engineering&lt;/em>, vol. 26, no. 4, p. 65,
2021.]&lt;/p>
&lt;p>[9] [D. Kreuzberger, N. Kühl, and S. Hirschl, &amp;ldquo;Machine Learning
Operations (MLOps): Overview, Definition, and Architecture,&amp;rdquo; &lt;em>IEEE
Access&lt;/em>, vol. 11, pp. 31866&amp;ndash;31879,
2023.]&lt;/p>
&lt;p>[10] [N. Hewage and D. Meedeniya, &amp;ldquo;Machine learning operations: A
survey on MLOps tool support,&amp;rdquo; &lt;em>arXiv preprint arXiv:2202.10169&lt;/em>,
2022.]&lt;/p>
&lt;p>[11] [H. E. Plesser, &amp;ldquo;Reproducibility vs. replicability: a brief
history of a confused terminology,&amp;rdquo; &lt;em>Front. Neuroinform.&lt;/em>, vol. 11, p.
76, 2018.]&lt;/p>
&lt;p>[12] [Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, &amp;ldquo;The
effect of feature extraction and data sampling on credit card fraud
detection,&amp;rdquo; &lt;em>Journal of Big Data&lt;/em>, vol. 10, no. 1, pp. 1&amp;ndash;17,
2023.]&lt;/p></description></item><item><title>KV store final Blog</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</link><pubDate>Fri, 25 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230825-manank/</guid><description>&lt;p>Hello again!
Before we get started, take a look at my previous blogs: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">Introduction&lt;/a> and
&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank">Mid Term&lt;/a>. The goal of the project was to implement an io_uring-based backend driver for the client side, which at
that time used traditional sockets. The objective was to improve performance through the zero-copy capabilities of io_uring. In the process, I learned many things
about &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/tree/develop" target="_blank" rel="noopener">libkinetic&lt;/a> and KV stores in general.&lt;/p>
&lt;p>I started by writing a separate driver using io_uring in libkinetic/src, in ktli_uring.c, most of which is similar to the sockets backend in ktli_sockets.c. The only
difference is in the send and receive functions. For a more detailed description of the implementation, refer to the midterm blog.&lt;/p>
&lt;p>After the implementation, it was time to put it to the test. We ran extensive benchmarks with a tool called &lt;a href="https://fio.readthedocs.io/en/latest/fio_doc.html" target="_blank" rel="noopener">fio&lt;/a>, which
is generally used to test filesystems and other IO-related workloads. Thanks to Philip, who had already written an IO engine for testing the kinetic KV store (&lt;a href="https://github.com/pkufeldt/fio" target="_blank" rel="noopener">link&lt;/a>), I didn&amp;rsquo;t have much trouble setting up the testbench. Philip also set up an Ubuntu server with the kinetic server
and gave me access through ssh. We ran extensive tests on that server, with both the socket and uring backends, across several different block sizes. The benchmark spreadsheet can be found &lt;a href="https://docs.google.com/spreadsheets/d/1HE7-KbxSqYZ3vmTZiJYoq21P7zfymU7N/edit?usp=sharing&amp;amp;ouid=116274960434137108384&amp;amp;rtpof=true&amp;amp;sd=true" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>We spent a lot of time reading and discussing the numbers, probably the most time-consuming part of the project. We had several long discussions analyzing the numbers
and their implications. For example, in the initial tests we were getting a very high standard deviation in mean send times; we then figured out it was caused by a network
bottleneck, as we were using large block sizes and quickly filling up the 2.5G network bandwidth.&lt;/p>
&lt;p>In conclusion, we found that there are many other major factors affecting the performance of the KV store, for example the network and the server side of the KV
store. Thus, although io_uring offers a performance benefit at the userspace-kernel boundary, in this case other factors had a more significant effect than the
kernel IO stack on the client side. Therefore, to increase performance, we need to look at the server side.&lt;/p>
&lt;p>I would like to thank Philip and Aldrin for their unwavering support and in-depth discussions on the topic in our weekly meetings. I learned a lot from them
throughout the entire duration of the project.&lt;/p>
&lt;p>The pyrope hardware description language now has syntax highlighting available for neovim users.
The &lt;a href="https://github.com/masc-ucsc/tree-sitter-pyrope" target="_blank" rel="noopener">repository&lt;/a> includes a guide to installing the parser, and activating highlights.
After we have tested the syntax highlighting, a pull request will be made to the &lt;a href="https://github.com/nvim-treesitter/nvim-treesitter" target="_blank" rel="noopener">nvim-treesitter repository&lt;/a>.
In this post, I will outline the highlighting process and reflect on a useful feature of neovim.&lt;/p>
&lt;h3 id="syntax-trees">Syntax Trees&lt;/h3>
&lt;p>The pyrope language is described by a grammar. A grammar is a set of rules that describes the allowed structure of a language.
A parser uses the grammar to generate a syntax tree. For example, consider this line of pyrope code.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">var a:u32 &lt;span class="o">=&lt;/span> &lt;span class="m">0&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Using the pyrope parser, we can get a syntax tree for this statement.
The command &lt;code>tree-sitter parse file.prp&lt;/code> gives us the following output.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>statement &lt;span class="o">[&lt;/span>1, 0&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 13&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">(&lt;/span>assignment_or_declaration_statement &lt;span class="o">[&lt;/span>1, 0&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 13&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> decl: &lt;span class="o">(&lt;/span>var_or_let_or_reg &lt;span class="o">[&lt;/span>1, 0&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 3&lt;span class="o">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> lvalue: &lt;span class="o">(&lt;/span>complex_identifier &lt;span class="o">[&lt;/span>1, 4&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 5&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">(&lt;/span>identifier &lt;span class="o">[&lt;/span>1, 4&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 5&lt;span class="o">]))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> type: &lt;span class="o">(&lt;/span>type_cast &lt;span class="o">[&lt;/span>1, 5&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 9&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> type: &lt;span class="o">(&lt;/span>primitive_type &lt;span class="o">[&lt;/span>1, 6&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 9&lt;span class="o">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">(&lt;/span>sized_integer_type &lt;span class="o">[&lt;/span>1, 6&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 9&lt;span class="o">])))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> operator: &lt;span class="o">(&lt;/span>assignment_operator &lt;span class="o">[&lt;/span>1, 10&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 11&lt;span class="o">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> rvalue: &lt;span class="o">(&lt;/span>constant &lt;span class="o">[&lt;/span>1, 12&lt;span class="o">]&lt;/span> - &lt;span class="o">[&lt;/span>1, 13&lt;span class="o">])))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The nvim-treesitter syntax highlighting is based on this tree structure.&lt;/p>
&lt;h3 id="queries">Queries&lt;/h3>
&lt;p>A query is an expression that selects nodes from the tree.
For example,&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>complex_identifier &lt;span class="o">(&lt;/span>identifier&lt;span class="o">))&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>matches any identifier that is the child of a complex_identifier.
Color schemes in neovim assign colors to different highlight groups.
So, we can assign highlight groups to tree queries.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="o">(&lt;/span>constant&lt;span class="o">)&lt;/span> @number
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Now, when a constant shows up in the syntax tree, it will highlight according to the @number group.
Most of the work I did on this project involved studying the pyrope grammar, and writing queries based on it.&lt;/p>
&lt;h2 id="neovim">neovim&lt;/h2>
&lt;p>The text editor &lt;a href="https://neovim.io/" target="_blank" rel="noopener">neovim&lt;/a> is a popular choice among programmers. It allows advanced user control with configuration files.
It also has an active community working on plugins to extend its functionality.
Tools such as lazyvim allow for features like code completion and file management that give neovim the same functionality as IDEs.
However, because neovim configuration is unique to each user, it can be difficult to write reproducible neovim instructions.
For example, Professor Renau was going to test pyrope syntax highlighting in neovim.
However, I did not know what configuration was necessary for him to see highlights in neovim.
While I knew that syntax highlighting worked on my setup, I have lots of configuration files that may have contributed to that success.
There is no guarantee that Professor Renau, or other potential users, have the same neovim configuration that I do.&lt;/p>
&lt;h3 id="nvim_appname">NVIM_APPNAME&lt;/h3>
&lt;p>So, Professor Renau suggested I use the &lt;code>$NVIM_APPNAME&lt;/code> variable to test the process on a fresh configuration.
This feature allows the user to specify the configuration files used to launch neovim.
For example, I installed &lt;a href="https://www.lazyvim.org/" target="_blank" rel="noopener">lazyvim&lt;/a> to the folder &lt;code>~/.config/lazy&lt;/code>. Then, I launched neovim with &lt;code>NVIM_APPNAME=lazy nvim&lt;/code>.
So instead of using my default configuration from &lt;code>~/.config/nvim&lt;/code>, the lazyvim configuration was used.
This allowed me to use a neovim instance that was unaffected by my configuration files.
I was able to preview the process of setting up syntax highlighting from the perspective of a lazyvim user.
Similarly, the process can be done with an empty folder to mimic a brand new neovim installation.
The point is, configuration files can impact reproducibility in neovim.
However, this feature allows us to bypass our individual configurations and create reproducible guidelines.&lt;/p>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>In conclusion, most of my work involved writing queries for the pyrope tree-sitter grammar.
This was for the purpose of syntax highlighting in neovim.
However, an important part of any open source project is communicating the results and providing documentation.
The NVIM_APPNAME feature helps view neovim from the perspective of different users, which helps for writing useful documentation.&lt;/p></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding (Midterm)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230801-zhiyanw/</link><pubDate>Sat, 05 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230801-zhiyanw/</guid><description>&lt;p>Hi Everyone,&lt;/p>
&lt;p>I hope everything goes well! This is my second blog post for my project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/MultiLevelErasure">Reproducible Evaluation of Multi-level Erasure Coding&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>, and Meng Wang. In summary, my project aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations. The details are in this &lt;a href="https://docs.google.com/document/d/1dO1aING1QcSB---XklzUjNz0usVh7qWffVGC3GZq2AE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>In the course of these few weeks, I&amp;rsquo;ve completed several tasks to achieve the aim of this project, including&lt;/p>
&lt;ul>
&lt;li>Literature Review&lt;/li>
&lt;li>Studying the Erasure Coding Simulator and Creating Reproducible Evaluations, with the following policies
&lt;ul>
&lt;li>Clustered/Declustered Local-level SLEC&lt;/li>
&lt;li>Clustered/Declustered Network-level SLEC&lt;/li>
&lt;li>MLEC with C/C, C/D, D/C, D/D configuration&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="literature-review">Literature Review&lt;/h2>
&lt;p>Prior to developing the simulator, my first step was to delve into literature on distinct erasure coding policies. To understand a simulator for a complex erasure coding policy such as MLEC, I wanted to start from simpler EC policies and then extend my knowledge to more complex ones. Moreover, I also aimed to contrast the durability of MLEC with comparable EC policies like LRC in my evaluations, making it vital to understand how these policies are implemented.&lt;/p>
&lt;p>Over the first week, I read several papers on different chunk placement policies for erasure coding, including LRC (Local Reconstruction Codes), CL-LRC (Combined Locality for Local Reconstruction Codes), SODP (Single Overlap Declustered Parity), and MLEC (Multi-Level Erasure Coding). These papers offered a fundamental comprehension of each policy, its respective advantages and drawbacks, and its practical usage in production environments.&lt;/p>
&lt;h2 id="simulator-reproduction">Simulator Reproduction&lt;/h2>
&lt;p>After gaining some understanding from the papers, I started to study the EC simulator by rebuilding it myself. I got the MLEC simulator from the mentors. However, the simulator lacks documentation and guides, making it hard for others to reproduce evaluation results. It is also complicated to understand, as it simulates various EC schemes, chunk placements, and rebuild policies across roughly 13,000 lines of code. Therefore, my goal is to understand the design and implementation details of the simulator, after which I will create guides for reproducible evaluations.&lt;/p>
&lt;p>The best way to fully understand the simulator was to rebuild it myself. The simulator is designed to mimic disk failures over the span of a year under varying chunk placement policies. Once successfully rebuilt, it will enable me to assess the durability of MLEC relative to other widely-used chunk placement policies. I followed the given simulator and rewrote it on my own in Python.&lt;/p>
&lt;p>Based on the skeleton of the given simulator, I first rebuilt a simple simulator that simulates SLEC (single-level erasure coding, in both local and network settings) with clustered parities. Given the arguments, the simulator can run an arbitrary number of iterations, each simulating disk failures over one year. It then collects the iterations in which there is a data loss; the ratio of failed iterations to total executed iterations estimates the probability of data loss, from which the durability of the erasure coding policy is derived. This simulation allows us to evaluate the durability of SLEC, laying the foundation for the later evaluation of MLEC.&lt;/p>
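To make the iteration loop concrete, here is a toy Monte-Carlo sketch of the failed-iterations counting (my own illustration with made-up parameters such as `afr`, not the actual simulator):

```python
import random

def estimate_loss_probability(iterations, disks=14, parity=4, afr=0.01, seed=42):
    """Toy sketch: each disk independently fails within the year with
    probability `afr`; a clustered stripe of `disks` disks loses data
    once more than `parity` of them have failed."""
    rng = random.Random(seed)
    failed_iterations = 0
    for _ in range(iterations):
        failures = sum(rng.random() < afr for _ in range(disks))
        if failures > parity:
            failed_iterations += 1
    # Ratio of failed iterations to total iterations: the estimated
    # probability of data loss, from which durability follows.
    return failed_iterations / iterations
```

A real run would model failure and repair times within the year rather than a single end-of-year draw; this only mirrors the counting logic.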
&lt;p>Next, I extended my simulator from local-level SLEC implementation by adding more policies. I began by introducing a network-level SLEC policy with clustered parities. This differs slightly from the local-level EC as it necessitates the consideration of factors like network bandwidth within the simulator.&lt;/p>
&lt;p>In addition, I have delved deeper into simulating declustered parities and successfully discovered a method to simulate disk failures. Basically, the simulator generates failures within a one-year timeframe and subsequently repairs them using priority queues. The disks associated with stripes experiencing the most failures are given the highest repair priority. With this construction, the simulator is capable of simulating local-level declustered parities, with the ability to specify parameters.&lt;/p>
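The priority ordering described above can be sketched with a heap (the stripe data here is hypothetical, not the simulator's actual structures):

```python
import heapq

# Hypothetical stripe state: (stripe id, number of failed chunks).
stripe_failures = [(0, 1), (1, 3), (2, 2)]

# Python's heapq is a min-heap, so we negate the failure count to pop
# the most-degraded stripe first.
repair_queue = []
for stripe_id, failed in stripe_failures:
    heapq.heappush(repair_queue, (-failed, stripe_id))

repair_order = []
while repair_queue:
    _, stripe_id = heapq.heappop(repair_queue)
    repair_order.append(stripe_id)
# Stripe 1 (3 failures) is repaired first, then stripe 2, then stripe 0.
```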
&lt;p>Upon successfully simulating local-level declustered parities, constructing the simulator for network-level declustered parities was rather straightforward. I then validated it using the simulator and math models provided by the mentors. The results agree perfectly, which confirms my understanding of the SLEC declustered placements. By implementing the simulator myself, I strengthened my understanding of erasure coding designs and simulation techniques, equipping me with a solid foundation for reproducing MLEC simulations.&lt;/p>
&lt;p>Building on the knowledge gained from implementing the SLEC simulators, I then reverse-engineered the MLEC simulator provided by the mentors from their MLEC paper. I chose to start from the simplest policy, clustered parities at both levels. After spending considerable time digging into the simulator source code, I was able to understand the simulation workflow, the different repair methods it implements, and the splitting method it uses to simulate high durabilities. I then revised my simulator based on this understanding. I also ran a few experiments using the same configuration setups specified in the paper. The results agree well with those in the paper, which verifies the success of my reproduction work.&lt;/p>
&lt;h2 id="technical-issues">Technical Issues&lt;/h2>
&lt;p>In the process of rebuilding the MLEC simulator, I&amp;rsquo;ve encountered many issues, both conceptual and technical. The mentors were super helpful and responsive throughout, so I was able to make steady progress.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Overall, I&amp;rsquo;ve rebuilt a Python simulator for various EC policies, and it can successfully reproduce the results from the paper.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>My next step is to package the simulator into a Trovi artifact, so others can reproduce evaluations of the performance and durability of various EC policies, in particular MLEC.&lt;/p>
&lt;p>I am Shekhar and I am one of several students who are working on developing materials for reproducibility in machine learning education, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>. My &lt;a href="https://drive.google.com/file/d/1rCzLGIJ8HYCVjY_MfndgrQjAQa2SQbqZ/view?usp=sharing" target="_blank" rel="noopener">Proposal&lt;/a> aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. Our goal is to help students and researchers (1) understand some of the challenges they may face when trying to reproduce someone else&amp;rsquo;s published result, and (2) in their own publications, to specify the methodology so that the result will be more easily reproduced by others.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>My work is inspired by my participation in the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">2022 Machine Learning Reproducibility Challenge&lt;/a>, where I was reproducing a result related to bias in hate speech classifiers. The paper seemed at first to have complete methodology details. However, when I tried to implement their approach based on the description of the paper, I realized some important details were missing - for example, in the part where they replaced swear words in the text with other words having similar meaning. I wasn&amp;rsquo;t able to identify the exact list of swear words they used, or what approach they followed if the selected replacement was also a swear word. The choices I made when the authors&amp;rsquo; approach was left ambiguous had a significant impact on the magnitude of the final result.&lt;/p>
&lt;h2 id="milestones-and-accomplishments">Milestones and Accomplishments&lt;/h2>
&lt;p>To inform researchers and students about this problem, I created a fictitious machine learning research paper, and a sequence of accompanying Python notebooks to highlight various choices that can be made to fill in the gaps, and explore how these choices can impact the overall results of the research. Our &amp;ldquo;research paper&amp;rdquo; is about the impact of data augmentation on few-shot learning for intent classification. We implemented a basic data augmentation strategy with synonym replacement using the HWU64 dataset and a BERT classifier, and the results suggest that synonym replacement as a data augmentation technique leads to only minor improvement in accuracy.
In the fictitious paper, we left some of the methodology details ambiguous. When reproducing the results using the accompanying notebooks, the reader follows a &amp;ldquo;Choose Your Own Adventure&amp;rdquo; format, selecting a path through a tree, where each node represents ambiguous methodology details and branches out to different choices that are made at that instance. The leaf nodes will represent the final results, providing insights into the magnitude of the differences resulting from each node selection. Some of the choices that the reader makes are -&lt;/p>
&lt;ul>
&lt;li>what subset of the source dataset to use.&lt;/li>
&lt;li>some of the details of data pre-processing.&lt;/li>
&lt;li>some of the details of the synonym replacement data augmentation strategy.&lt;/li>
&lt;li>some training hyperparameters and the details of the hyperparameter search.&lt;/li>
&lt;/ul>
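As one concrete illustration of such an ambiguous choice, consider synonym replacement: which tokens are eligible, how many are replaced, and how a replacement is drawn are exactly the kinds of details a paper may leave unstated. A minimal sketch (a hypothetical helper, not the notebooks' actual code):

```python
import random

def synonym_replace(tokens, synonyms, n_replace=1, seed=0):
    """Replace up to `n_replace` eligible tokens with a synonym.
    Eligibility, the replacement count, and the sampling strategy are
    all choices a reproducer must make if the paper omits them."""
    rng = random.Random(seed)
    eligible = [i for i, tok in enumerate(tokens) if tok in synonyms]
    out = list(tokens)
    for i in rng.sample(eligible, min(n_replace, len(eligible))):
        out[i] = rng.choice(synonyms[tokens[i]])
    return out
```

Each branch of the "Choose Your Own Adventure" tree corresponds to a different way of filling in such gaps, and the notebooks let the reader compare the resulting accuracies.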
&lt;p>During the first phase of our project, we have implemented an initial draft of these notebooks, to explore various scenarios and see their impact on results. Next, we will further develop the interactive educational material around them.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the first half of the project, I faced two main challenges. First, I had to come up with a hypothetical research scenario that was realistic, yet easy for students without much expertise to understand. Attaining the right balance was essential to make it engaging and educational. The second challenge was to deliberately leave some details unclear in a realistic way while ensuring that the choices based on that ambiguity had a significant impact on the results. Fortunately, I had the guidance and support of my mentor, which allowed me to successfully tackle these challenges.&lt;/p>
&lt;p>Throughout this project, I faced various challenges and obstacles, but it turned out to be an incredible learning experience. I had the opportunity to dive deep into the domains of few-shot learning and meta-learning, which were entirely new to me. Moreover, I was able to find ambiguous methodologies present in academic papers and explore diverse scenarios related to them. Looking ahead, I am eager to continue working on this project throughout the summer, as it promises further learning and personal growth.&lt;/p></description></item><item><title>Midpoint Blog Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230803-kirandeol/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230803-kirandeol/</guid><description>&lt;p>The last few months of my GSoC project have been very exciting and I hope to share why with you here in this blog post! To briefly summarize, my project has been focused on further developing the Polyglot app, a tool for visualizing 3D language embeddings. One important part of Polyglot is its utilization of the novel MCPM metric, where points are colored according to their MCPM similarity to a user-chosen “anchor point” (e.g., if “hat” is our anchor point, then similar words like “cap” or “fedora” will be colored more prominently).&lt;/p>
&lt;p>The first issue we wanted to tackle was actually navigating the point cloud. With hundreds of thousands of points, it can be difficult to find what you’re looking for! Thus, the first few features added were a search bar for points and anchor points and a “jump to point” feature which changes a user’s center of rotation and “jumps” to a chosen point. There were a few hiccups with implementing these features, mainly due to the large number of points and the particular quirks of the graphics library Polyglot uses. In the end though, these simple features made Polyglot feel a lot easier to use.&lt;/p>
&lt;p>The next set of features related to our desire to actually annotate the point cloud. Similar to how one might annotate a Google doc (i.e., highlight a chunk of text and leave a comment), we wanted to set up something similar, but with points! Indeed, this led to the development of a cool brush tool for coloring points, named and commented annotations (up to 5), a search bar within annotations, and finally a button to export annotations and comments to a CSV.&lt;/p>
&lt;p>The next few weeks are looking bright as we strive to finish the PolyPhy-Polyglot pipeline (a notebook for quickly formatting MCPM data from PolyPhy and getting it into Polyglot). We also hope to add a unique “timeline” feature in which users can analyze sections of the point cloud based on the associated time of each point. Overall, it’s been a very stimulating summer and I’m excited to push this project even further!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our goal was to characterize the tools in genomic workflows in terms of system metrics and data quality in order to build machine learning models that predict the elapsed time of genomic workflows. While Shayantan (another contributor) did the analysis on data quality metrics, I contributed the system metrics analysis. We are getting closer to that goal, having collected datasets and performed initial analysis.&lt;/p>
&lt;h2 id="steps">Steps&lt;/h2>
&lt;p>In this project, we selected the DNA-Seq Pipeline as the workflow to be analyzed. This pipeline consists of four tools for processing single-end reads: BWA-mem, Samtool-view, Picard-SortSam, and Picard-MarkDuplicates. We executed each tool with various configurations and stored system metrics for each execution. To do this, we took two steps:&lt;/p>
&lt;ul>
&lt;li>Step 1: Building the tools execution environment.&lt;/li>
&lt;li>Step 2: Developing a program to execute tools using some configurations and collect runtime parameters (e.g., CPU, RSS, VSZ, and IO) automatically.&lt;/li>
&lt;/ul>
&lt;h2 id="execution-environment">Execution Environment&lt;/h2>
&lt;p>Tools are executed on Chameleon instances by submitting them via Slurm. The machine used to collect system metrics is a Haswell instance on the Chameleon Texas server. This instance uses an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz with the following specifications.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Number of CPUs&lt;/th>
&lt;td>48&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of threads per core&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of cores per socket&lt;/th>
&lt;td>12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of sockets&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>In this experiment, we use n+1 instances: n compute nodes and 1 master node. Each execution is done by submitting a job (a tool with a certain configuration) from the master node, which is then processed by one of the compute nodes. For the tools to be executable on every node, we set up the master node to export a shared directory over NFS. This shared directory stores the input files and the commands for executing the tools, so all nodes can access them without having to download and install them locally.&lt;/p>
&lt;h2 id="executing-and-collecting-system-metrics">Executing and Collecting System Metrics&lt;/h2>
&lt;p>Tools are executed in specific configurations by varying parameters such as input size, CPU allocation, memory allocation, and thread count. For example, BWA-mem has 5, 4, and 5 variations of CPU allocation, memory allocation, and thread count respectively, applied to 10 different input files, giving 5 x 4 x 5 x 10 = 1000 configuration combinations. Each configuration is executed 8 times, yielding 8000 data points. Configuration details can be seen in the following table.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>#repetitions&lt;/th>
&lt;th>#files&lt;/th>
&lt;th>#allocated CPU&lt;/th>
&lt;th>#allocated memory&lt;/th>
&lt;th>#threads&lt;/th>
&lt;th>total&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>BWA-mem&lt;/th>
&lt;td>8&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Samtool-view&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-Sortsam&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;/table>
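The sweep in the table can be enumerated directly; here is a sketch using the BWA-mem row (the SRA identifiers are placeholders, not our actual input files):

```python
from itertools import product

# Values from the BWA-mem row of the table above.
input_files = [f"SRR00{i}" for i in range(10)]  # placeholder SRA ids
allocated_cpus = [2, 4, 8, 16, 32]
allocated_memory_gb = [8, 16, 32, 64]
threads = [2, 4, 8, 16, 32]
repetitions = 8

configs = list(product(input_files, allocated_cpus, allocated_memory_gb, threads))
total_runs = len(configs) * repetitions
# 10 * 5 * 4 * 5 = 1000 configurations, executed 8 times each = 8000 runs
```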
&lt;p>Meanwhile, to run the tools, we use the following commands:&lt;/p>
&lt;ul>
&lt;li>BWA-mem&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$BWA&lt;/span> mem -t &lt;span class="nv">$threads&lt;/span> &lt;span class="nv">$REF_DIR&lt;/span>/hg19.fa &lt;span class="si">${&lt;/span>&lt;span class="nv">INPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>*.fastq &amp;gt; &lt;span class="si">${&lt;/span>&lt;span class="nv">OUTPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Samtool-view&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$SAMTOOLS&lt;/span> view &lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam -Shb -o &lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-SortSam&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> SortSam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">SORT_ORDER&lt;/span>&lt;span class="o">=&lt;/span>coordinate &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-MarkDuplicates&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> MarkDuplicates &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">METRICS_FILE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>_rmd.txt &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In Slurm, each job has a job ID, and the &lt;code>scontrol listpids&lt;/code> command shows the job-ID-to-PID mapping. Using this, we can obtain system metrics for a job by reading the &lt;code>/proc/$PID&lt;/code> filesystem entries, which expose CPU usage, physical memory, virtual memory, read bytes, and write bytes at a given moment. We record these features along with a timestamp at 1-second intervals throughout the execution.&lt;/p>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>We have also calculated the correlation of each feature with the elapsed time. For BWA-mem, the features with an absolute correlation above 0.5 are input size, average CPU usage, and output file size (SAM format). For Samtool-view they are input size, average CPU usage, and output size (BAM format).
For Picard-SortSam they are input size, write bytes, and BAM output size; for Picard-MarkDuplicates, input size and BAM output size.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Features\Tools&lt;/th>
&lt;th>BWA-mem&lt;/th>
&lt;th>Samtool-view&lt;/th>
&lt;th>Picard-SortSam&lt;/th>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>Allocated CPU&lt;/th>
&lt;td>-0.145&lt;/td>
&lt;td>-0.095&lt;/td>
&lt;td>-0.179&lt;/td>
&lt;td>-0.156&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Allocated physical memory&lt;/th>
&lt;td>-0.010&lt;/td>
&lt;td>-0.038&lt;/td>
&lt;td>-0.069&lt;/td>
&lt;td>0.132&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Input size&lt;/th>
&lt;td>&lt;b>0.583&lt;/b>&lt;/td>
&lt;td>&lt;b>0.651&lt;/b>&lt;/td>
&lt;td>&lt;b>0.937&lt;/b>&lt;/td>
&lt;td>&lt;b>0.922&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Threads&lt;/th>
&lt;td>-0.072&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average CPU&lt;/th>
&lt;td>&lt;b>-0.607&lt;/b>&lt;/td>
&lt;td>&lt;b>-0.567&lt;/b>&lt;/td>
&lt;td>-0.479&lt;/td>
&lt;td>-0.480&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak CPU&lt;/th>
&lt;td>-0.175&lt;/td>
&lt;td>0.174&lt;/td>
&lt;td>-0.170&lt;/td>
&lt;td>0.046&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average RSS&lt;/th>
&lt;td>0.040&lt;/td>
&lt;td>0.034&lt;/td>
&lt;td>0.131&lt;/td>
&lt;td>0.182&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak RSS&lt;/th>
&lt;td>0.068&lt;/td>
&lt;td>0.046&lt;/td>
&lt;td>0.314&lt;/td>
&lt;td>0.175&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average VSZ&lt;/th>
&lt;td>0.032&lt;/td>
&lt;td>-0.349&lt;/td>
&lt;td>-0.127&lt;/td>
&lt;td>0.090&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak VSZ&lt;/th>
&lt;td>0.048&lt;/td>
&lt;td>0.074&lt;/td>
&lt;td>-0.130&lt;/td>
&lt;td>0.088&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Write bytes&lt;/th>
&lt;td>0.037&lt;/td>
&lt;td>0.190&lt;/td>
&lt;td>&lt;b>0.735&lt;/b>&lt;/td>
&lt;td>0.244&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Read bytes&lt;/th>
&lt;td>-0.031&lt;/td>
&lt;td>0.109&lt;/td>
&lt;td>0.070&lt;/td>
&lt;td>0.110&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output SAM size&lt;/th>
&lt;td>&lt;b>0.589&lt;/b>&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAM size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>&lt;b>0.763&lt;/b>&lt;/td>
&lt;td>&lt;b>0.934&lt;/b>&lt;/td>
&lt;td>&lt;b>0.923&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAI size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>0.400&lt;/td>
&lt;td>0.399&lt;/td>
&lt;/tr>
&lt;/table>
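The screening rule used above (keep features whose absolute correlation with elapsed time exceeds 0.5) can be expressed in a few lines; the numbers below are toy values, not our measurements:

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Toy data standing in for per-run measurements.
elapsed = [10.0, 20.0, 30.0, 40.0]
features = {
    "input_size": [1.0, 2.0, 3.0, 4.0],  # strongly correlated
    "peak_cpu": [5.0, 3.0, 6.0, 4.0],    # essentially uncorrelated
}
selected = [name for name, xs in features.items()
            if abs(pearson(xs, elapsed)) > 0.5]
```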
&lt;h2 id="future-works">Future Works&lt;/h2>
&lt;p>For further work, we will analyze the correlation between elapsed time and the features whose absolute scores are below 0.5. These features may in fact correlate with elapsed time but appear uncorrelated because the correlations were computed over the whole dataset; we therefore also need to compute feature correlations for the data grouped by input file. Then, we will create a machine learning model to predict elapsed time.&lt;/p>
&lt;p>This is my second blog post for SoR 2023. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/">initial blogpost&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/">Flashnet&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>.&lt;/p>
&lt;p>I&amp;rsquo;ve been assigned two major tasks under Flashnet:&lt;/p>
&lt;ol>
&lt;li>Perform post-training quantization (PTQ) on existing Flashnet models&lt;/li>
&lt;li>Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication&lt;/li>
&lt;/ol>
&lt;h2 id="task-1-perform-post-training-quantization-ptq-on-existing-flashnet-models">Task 1: Perform post-training quantization (PTQ) on existing Flashnet models&lt;/h2>
&lt;p>Since all of our models are currently built using the keras API, I decided to use the tensorflow-lite library, which supports direct conversion. Unfortunately, I encountered several persistent bugs while attempting to apply full-integer quantization on our binary neural network model:&lt;/p>
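For background, full-integer quantization maps float32 values onto the int8 range via a scale and zero-point. A self-contained sketch of that affine scheme (my own illustration of the general idea, not tflite internals):

```python
def quantize(values, num_bits=8):
    """Affine (scale/zero-point) quantization sketch: map float values
    onto the signed integer range, e.g. int8's [-128, 127].
    Assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return ([max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values],
            scale, zero_point)

def dequantize(q, scale, zero_point):
    """Recover approximate float values; the quantization error is
    bounded by about half the scale."""
    return [(v - zero_point) * scale for v in q]
```

If, as suspected below, weights explode or vanish during conversion, the dequantized values will no longer round-trip like this, which is one way to spot the problem.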
&lt;h3 id="shapedimension-distortion">Shape/dimension distortion:&lt;/h3>
&lt;p>Bug description: The quantized tflite model produces outputs of shape (8, 1), the same as the input shape, when the original model produces single-value outputs (1, 1).&lt;/p>
&lt;p>Status: Resolved&lt;/p>
&lt;ul>
&lt;li>The original model has an input dimension of 8 for each input/x-value and there could be several inputs grouped in a single batch.&lt;/li>
&lt;li>The input/batch size is also determined implicitly in the normalization layer of the original model.&lt;/li>
&lt;li>However, the &amp;ldquo;interpreter&amp;rdquo; in the quantized model runs inference one input at a time, so the batch size needs to be explicitly set to 1, i.e. the shape of a single input, (1, 8).&lt;/li>
&lt;li>Doing so resolves the shape distortion.&lt;/li>
&lt;/ul>
&lt;h3 id="incorrect-y-value-range">Incorrect y-value range:&lt;/h3>
&lt;p>Bug description: There is no variation in the quantized model outputs (i.e., it produces the same value for every input row).&lt;/p>
&lt;p>In the original model, each inference output is a floating point value between 0 and 1. Outputs also vary according to input. This output is rounded towards 0 or 1 using a 0.5 standard cutoff (i.e. x &amp;gt; 0.5 → x = 1). Since the quantized model condenses 32-bit floats into 8-bit integers, we should expect a similar variation in output values across an 8-bit integer range.&lt;/p>
&lt;p>Printing the quantized model weights, I discovered that weight bursting/exploding gradients may occur during the quantization process, i.e. the weight values explode to infinity or vanish to 0 and therefore cannot carry any meaningful signal. The likely consequence is that the inference output always equals the bias matrix (since the Wx term in y = Wx + B gets zeroed out).&lt;/p>
&lt;p>Status: Open&lt;/p>
&lt;ul>
&lt;li>Multiple potential causes were considered, without any success:
&lt;ul>
&lt;li>Improper quantization of inputs/outputs&lt;/li>
&lt;li>Insufficient training time/number of epochs&lt;/li>
&lt;li>Incompatible model type/structure&lt;/li>
&lt;li>Incompatible tensorflow-lite version&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>At this point, I concluded that tensorflow-lite was too bug-ridden to make any further attempts with the library worthwhile.&lt;/li>
&lt;/ul>
&lt;h2 id="task-2-implement-a-rocksdb-client-to-interface-with-the-flashnet-kernel-with-3-way-replication">Task 2: Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication&lt;/h2>
&lt;p>RocksDB is an embedded database for key-value data. Our Flashnet team is currently implementing a Flashnet client in Ceph, and they have tasked me with exploring an implementation in RocksDB as an alternative.&lt;/p>
&lt;p>I&amp;rsquo;ve started on this segment of the project only recently, so my current work is still in its formative stages. As of writing, I&amp;rsquo;ve been primarily occupied with setting up software (on a new Chameleon instance), running toy database examples, and familiarizing myself with basic terminology and the RocksDB documentation.&lt;/p>
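&lt;p>As a conceptual sketch of what 3-way replication involves, the toy class below writes each key to three stores and reads from the first replica holding the key. Plain dicts stand in for RocksDB instances; the class name and quorum rule are illustrative, not the actual Flashnet client design:&lt;/p>

```python
# Toy 3-way replicated key-value client. A real client would open
# three RocksDB databases; here in-memory dicts stand in for them.

class ReplicatedKV:
    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]

    def put(self, key, value):
        # Write to every replica. In this in-memory toy every write
        # succeeds; a real client would count successful replica
        # writes and require a majority before acknowledging.
        acks = 0
        for store in self.replicas:
            store[key] = value
            acks += 1
        return acks >= len(self.replicas) // 2 + 1

    def get(self, key):
        # Read from the first replica that holds the key.
        for store in self.replicas:
            if key in store:
                return store[key]
        return None

db = ReplicatedKV()
db.put(b"io_lat", b"123us")
print(db.get(b"io_lat"))  # b'123us'
```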
&lt;h2 id="future-work">Future work&lt;/h2>
&lt;p>I expect to continue working on Task 1 (implementing quantization from the ground up or using a different library) and Task 2 as detailed above. I also hope to implement a transformer-based model to supplement our existing suite of Flashnet models.&lt;/p></description></item><item><title>[Midterm] FlashNet: Towards Reproducible Continual Learning for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/</guid><description>&lt;h2 id="mid-term-report">Mid-Term Report&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">FlashNet&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1EhJm3kqrpybOkpXiiRMfqVxGeKe9iIsh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;strong>Daniar Kurniawan&lt;/strong> aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques. We focus on predicting I/O latency to decide whether or not an I/O should be failed over to another SSD. The following sections elaborate on the work description, major milestones achieved, accomplishments, and challenges during the first half of the summer.&lt;/p>
&lt;h2 id="work-description-major-milestones-achieved-and-accomplishments">Work Description, Major Milestones Achieved, and Accomplishments&lt;/h2>
&lt;p>For the first half of the summer, I implemented the continual learning pipeline for the model along with several drift detection algorithms, then evaluated their effectiveness. Below is a detailed description of each subtask.&lt;/p>
&lt;h3 id="1-continual-learning-pipeline">1. Continual Learning pipeline&lt;/h3>
&lt;p>Firstly, I designed the pipeline. As shown in the flowchart below, the pipeline contains four main modules, namely initial train, retrain, inference, and monitor.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pipeline Flowchart" srcset="
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_2067e763ad30087275106bc5b2921a5a.webp 400w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_fcd6d4a25c164fcfc872329662c36fa5.webp 760w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_2067e763ad30087275106bc5b2921a5a.webp"
width="760"
height="249"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The modules were first developed in Python using a linear regression model. It turned out that linear regression was not good enough: it gave poor accuracy. To overcome this problem, I introduced more models and learning tasks.&lt;/p>
&lt;p>Hence, in the final implementation, we have random forest and neural network models for both the regression and classification tasks. These models outperform linear regression, and the pipeline has also been optimized.&lt;/p>
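&lt;p>The four modules can be sketched as a simple loop. The following is a toy illustration, not the actual pipeline code: a trivial mean predictor stands in for our random forest / neural network models, and the drift check is a placeholder threshold rule:&lt;/p>

```python
# Skeleton of the four-module loop: initial train -> inference ->
# monitor -> retrain. The "model" just predicts the mean latency
# of its training history; real models are far richer.

def train(history):
    mean = sum(history) / len(history)
    return lambda x: mean  # predict the historical mean latency

def drift_detected(window, history, threshold=2.0):
    # Monitor: flag drift when the recent mean latency deviates
    # strongly from the training-history mean (placeholder rule).
    return abs(sum(window) / len(window) - sum(history) / len(history)) > threshold

history = [1.0, 1.2, 0.9, 1.1]          # latencies seen so far
model = train(history)                   # initial train
incoming = [[1.0, 1.1], [5.0, 5.2]]      # batches of new I/O latencies
for window in incoming:
    preds = [model(x) for x in window]   # inference on the new batch
    if drift_detected(window, history):  # monitor
        history += window
        model = train(history)           # retrain on extended history
print(round(model(None), 2))             # 2.4 after the drifted batch
```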
&lt;h3 id="2-drift-detection-algorithms">2. Drift detection algorithms&lt;/h3>
&lt;p>Sometimes, the built model&amp;rsquo;s performance may degrade when it faces recent I/Os with different characteristics than those it was trained on. Hence, the model must be retrained, and the retraining must be triggered somehow. The trigger could be as simple as a periodic schedule, or a technique called drift detection. Retraining too often incurs a large computational overhead, while retraining too seldom causes performance degradation. Hence, we should build a good and reliable drift detection algorithm that can sense the presence of concept and covariate drift in recent data.&lt;/p>
&lt;p>To build a good algorithm, I first used heuristics derived from an understanding of how latency and throughput change over time. However, the results turned out not to be very good. Thus, I&amp;rsquo;ve been relying on statistical tests as the drift detector. So far, the Kolmogorov-Smirnov test, commonly known as the KS-test, is the best drift detector.&lt;/p>
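&lt;p>For intuition, the two-sample KS statistic is simply the maximum gap between the two samples&amp;rsquo; empirical CDFs. A small pure-Python sketch (in the actual pipeline a library implementation would be used, which also provides a p-value):&lt;/p>

```python
import bisect

# Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
# difference between the empirical CDFs of two samples.

def ks_statistic(a, b):
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # Fraction of the sample that is at most x.
        return bisect.bisect_right(sample, x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

same = [1, 2, 3, 4, 5]
shifted = [11, 12, 13, 14, 15]      # e.g. latencies after a workload change
print(ks_statistic(same, same))     # 0.0
print(ks_statistic(same, shifted))  # 1.0 -> clear drift
```

A drift detector would compare recent I/O latencies against the training sample and trigger retraining when the statistic exceeds a chosen threshold.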
&lt;h3 id="3-evaluation">3. Evaluation&lt;/h3>
&lt;p>The featured image in the headline of this blog, also shown below, is the result of the evaluation. I evaluated the models and drift detection algorithms using Cumulative Distribution Function (CDF) graphs, to see whether the latency tail is cut.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_4866e846612d96725d801519edf06392.webp 400w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_9203cd36fc4c6de03e02a799cd564f1d.webp 760w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_4866e846612d96725d801519edf06392.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the implementation, I encountered the following challenges.&lt;/p>
&lt;h3 id="1-choice-of-model">1. Choice of Model&lt;/h3>
&lt;p>Since we want to integrate the pipeline into real storage systems, we had to be mindful of model choice. Classical machine learning models are lighter than deep learning models. However, deep learning models offer higher accuracy and are thus preferable. Hence, I implemented both and examined the effectiveness of each.&lt;/p>
&lt;h3 id="2-choice-of-drift-detection-algorithm">2. Choice of Drift Detection Algorithm&lt;/h3>
&lt;p>The continual learning technique chosen for this task may require the model to be retrained, since the workload may change over time. The implication is that we need a condition that triggers the retraining. As training a model is costly, we need to retrain it mindfully. Thus, we use a drift detection algorithm to detect whether or not retraining is needed.&lt;/p>
&lt;p>There are two types of drift detection algorithms, namely statistical tests and model-based drift detection. To minimize overhead, we picked statistical tests. There are various algorithms to choose from; I picked five of them to implement and evaluate.&lt;/p>
&lt;h2 id="plan">Plan&lt;/h2>
&lt;p>For the second half of the summer, I am going to study Riak and create a Chameleon Trovi artifact for deploying Riak in a cluster.&lt;/p></description></item><item><title>Introducing Levels of Reproduction and Replication in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/</guid><description>&lt;p>Hello again,&lt;/p>
&lt;p>I am Mohamed Saeed and this is my second blog post for the 2023 Summer of Reproducibility Fellowship. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed">previous post&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. My goal is to create interactive open educational resources that teach reproducibility and reproducible research in machine learning (ML) as I &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposed&lt;/a>.&lt;/p>
&lt;p>In this post, I will share with you some of the progress I have made so far, as well as some of the challenges I have faced and how I overcame them. I will also highlight some of the specific accomplishments that I am proud of and what I plan to do next.&lt;/p>
&lt;h2 id="reproducing-on-warm-starting-neural-network-training">Reproducing &amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/h2>
&lt;p>This material is a reproduction of the paper &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">&amp;ldquo;On Warm Starting Neural Network Training&amp;rdquo;&lt;/a> by Jordan T. Ash and Ryan P. Adams (2020). This paper investigates the effect of warm-starting neural networks, which means using the weights of previous models trained on a subset of the data, to train on a new dataset that has more data.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_0c5af6e4428dce728fe7a643b2b8e6d3.webp 400w,
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_f3e332c8b81d6d3146e54527a273bbfe.webp 760w,
/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230802-msaeed/warm_start_huf40f540ab6672b609385b58179d23d2a_3423296_0c5af6e4428dce728fe7a643b2b8e6d3.webp"
width="760"
height="383"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The figure illustrates how the new model uses the weights from the previous model as its initial values. This allows the new model to train on both the “Original” data, which it has already seen, and the new data, which it has not encountered before. In contrast, the randomly initialized model treats the entire data as unfamiliar and starts from scratch.&lt;/p>
&lt;p>The paper also shows that this method can lead to lower test accuracy than starting from scratch with random weights, even though the training loss is similar. It then proposes a simple way to improve the test accuracy of warm-starting: shrink the previous weights and add some noise to them.&lt;/p>
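&lt;p>A minimal sketch of this &amp;ldquo;shrink and perturb&amp;rdquo; style remedy applied to a flat list of weights (the lam and sigma values below are illustrative, not the paper&amp;rsquo;s tuned hyperparameters):&lt;/p>

```python
import random

# Shrink the warm-start weights toward zero and add a little
# Gaussian noise before continuing training on the larger dataset.

def shrink_perturb(weights, lam=0.6, sigma=0.01, seed=0):
    rng = random.Random(seed)
    return [lam * w + rng.gauss(0.0, sigma) for w in weights]

old_weights = [0.8, -1.2, 0.3]   # weights from the model trained on the subset
new_init = shrink_perturb(old_weights)
print([round(w, 2) for w in new_init])
```

Each new initial weight stays close to a scaled-down copy of the old one, so the network keeps some of what it learned while regaining enough plasticity to fit the new data.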
&lt;p>To reproduce this paper, I followed a systematic approach that ensured reliable results. This approach involved:&lt;/p>
&lt;ul>
&lt;li>Reading the paper and its main claims carefully.&lt;/li>
&lt;li>Finding out what resources the authors shared, such as code, data, and models.&lt;/li>
&lt;li>Looking for additional materials online that could help me save time and fill in the gaps left by the authors.&lt;/li>
&lt;li>Setting up the environment and dependencies needed to run the code smoothly.&lt;/li>
&lt;li>Writing code and updating any outdated functions that might cause errors.&lt;/li>
&lt;li>Running the code and verifying that it matched the results reported in the paper.&lt;/li>
&lt;li>Analyzing and interpreting the results and comparing them with the paper’s findings.&lt;/li>
&lt;/ul>
&lt;p>I used &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> as my platform for running and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental platform that supports computer science systems research. It allows users to create and share Jupyter notebooks that can run Python code on Chameleon’s cloud servers.&lt;/p>
&lt;p>I created a &lt;a href="https://github.com/mohammed183/re_warm_start_nn" target="_blank" rel="noopener">GitHub repository&lt;/a> where you can find everything related to my reproduction work in the form of interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Reproducing a paper is not an easy task. I faced several challenges along the way. One of the biggest challenges was the lack of code and pretrained models from the authors. This is a common problem for many reproducibility projects. Fortunately, I found a previous reproducibility publication for this paper on &lt;a href="https://rescience.github.io/bibliography/Kireev_2021.html" target="_blank" rel="noopener">ReScience journal&lt;/a>. I used some of their code and added some new functions and modifications to match the original paper’s descriptions. I also encountered other challenges that I discussed in the notebooks with the solutions that I applied.&lt;/p>
&lt;h2 id="how-to-use-this-material">How to use this material?&lt;/h2>
&lt;p>This material is a series of notebooks that walk you through the paper and its claims, experiments, and results. You will learn how to analyze, explain, and validate the authors’ claims. To get started, I suggest you skim the &lt;a href="https://arxiv.org/abs/1910.08475" target="_blank" rel="noopener">original paper&lt;/a> briefly to get the main idea and the public information. This will help you understand how the authors could have been more clear and transparent in some sections. I have given clear instructions and explanations in the notebooks, as well as how I dealt with the missing components. You can use this material for self-learning or as an assignment by hiding the final explanation notebook.&lt;/p>
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>In this blog post, I have shared with you some of my work on reproducing warm starting neural network training. I have learned a lot from this experience and gained a deeper understanding of reproducibility and reproducible research principles in ML.&lt;/p>
&lt;p>I am very happy with what I have achieved so far, but I still have more work to do. I am working on reproducing the &lt;a href="https://arxiv.org/abs/2010.11929" target="_blank" rel="noopener">Vision Transformer: An Image is Worth 16x16 Words&lt;/a> paper by Alexey Dosovitskiy et al. This time my approach is to use the available pretrained models provided by the authors to verify the claims made in the paper. However, there are some challenges that I face in reproducing the paper. For example, some of the datasets and code that the authors used are not publicly available, which makes it hard to replicate their experiments exactly. These challenges are common in reproducing research papers, especially in computer vision. Therefore, it is important to learn how to deal with them and find ways to validate some of the claims.&lt;/p>
&lt;p>I hope you enjoyed reading this blog post and found it informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!&lt;/p></description></item><item><title>Midterm Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230802-ren.450/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230802-ren.450/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a> aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>Based on our project proposal, the first step is to test the benchmark application on targeted systems. We picked the open-source database system PostgreSQL as the target system, and we test the TPC-C benchmark on it under default settings. We measure the throughput of the benchmark by setting the scalefactor to 10 and incrementing the number of worker terminals. The database server settings are all default values; we take these results as the baseline. In order to test more parameters and system settings, we need to choose a combination of parameters that gives optimal throughput.&lt;/p>
&lt;p>We use an online tool, &lt;a href="https://pgtune.leopard.in.ua/#/" target="_blank" rel="noopener">PGTune&lt;/a>, which tunes the PostgreSQL configuration based on the hardware. We select shared_buffers, min/max_wal_size and effective_cache_size as the first set of parameters to measure. They are related to memory consumption, checkpoints and planner cost in the database server. Based on the PostgreSQL &lt;a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank" rel="noopener">official documentation&lt;/a>, shared_buffers sets the amount of memory the database server uses for shared memory buffers. Max_wal_size sets the maximum size to let the WAL grow during automatic checkpoints. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time. Effective_cache_size sets the planner&amp;rsquo;s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.&lt;/p>
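&lt;p>For illustration, such a tuned configuration ends up as a few lines in postgresql.conf. The values below are examples of what PGTune might suggest for a machine with 16 GB of RAM, not our exact measured settings:&lt;/p>

```ini
# Illustrative postgresql.conf fragment (example values for a
# 16 GB server; actual values depend on the hardware):
shared_buffers = 4GB            # memory for the shared buffer cache
min_wal_size = 2GB
max_wal_size = 8GB              # let the WAL grow between checkpoints
effective_cache_size = 12GB     # planner's estimate of the OS disk cache
```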
&lt;p>We conduct the experiments by setting the parameters in increments and comparing the throughput with each other and with the baseline. Based on the results, the throughput of the benchmark with larger shared_buffers and max_wal_size is up to 1.5X the performance under default settings. The improvement from tuning max_wal_size is larger than that from tuning shared_buffers. An increased effective_cache_size has no effect for this benchmark workload compared to the system&amp;rsquo;s default value.&lt;/p>
&lt;p>There are more values of the above-mentioned parameters to test. Next, I will test those parameters with incremented values. Furthermore, we need to choose a combination of more parameters to get optimal throughput. Also, by its own description, the tuning tool may not generate optimal values for very high memory systems. This requires us to test more possible parameters and their values for better performance.&lt;/p></description></item><item><title>Midterm: High Fidelity UAV Simulation Using Unreal Engine with specular reflections</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230802-damodardatta/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230802-damodardatta/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/18g-WRZj_7ufIt6YZNn4OG1s7VKi1u5hV/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop an Unreal Engine based simulator for testing. The simulator will use Unreal Engine for physics and visualization.&lt;/p>
&lt;h2 id="what-we-have-done-so-far">What we have done so far&lt;/h2>
&lt;ul>
&lt;li>We found that we can use Unreal Engine as a physics simulator and co-simulate with Simulink using the tools provided by MathWorks.&lt;/li>
&lt;li>We simulated an example provided by MathWorks, but it did not show the expected behaviour and there were very few resources available.&lt;/li>
&lt;li>So we decided to use Gazebo and ROS for simulation instead of Unreal Engine and Simulink, starting with the example of a balancing bot that had been designed in SolidWorks.&lt;/li>
&lt;li>To use Gazebo, I converted the SolidWorks model into a URDF and imported it into Gazebo.&lt;/li>
&lt;/ul>
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Currently, I am working on using Gazebo and ROS to control a balancing bot with a PID control algorithm. Afterwards, I will document the process of importing a model into Gazebo for testing a control algorithm.&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230802-boluwarinayinmode/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the Scalebugs Project, we have worked on building a dataset of reproducible scalability bugs. To achieve this, we go through existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. Workloads are designed to reproduce these scalability bugs by triggering some functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance.&lt;/p>
&lt;p>So far we have worked on packaging the buggy and fixed versions of scalability systems, a runtime environment that ensures reproducibility, and the workloads used to trigger the symptoms of the bug inside docker containers. By packaging these versions together, we are simplifying the process of deployment and testing. This enables us to switch between different versions efficiently, aiding in the identification and comparison of the bug&amp;rsquo;s behavior. For each scalability system, we have carefully built a runtime environment that is consistent and reproducible. This approach ensures that each time we run tests or investigations, the conditions remain identical.&lt;/p>
&lt;h2 id="new-terms">New Terms&lt;/h2>
&lt;p>In order to make sense of the various bug reports, we had to learn some terminologies associated with scalability systems:&lt;/p>
&lt;p>&lt;strong>Clusters&lt;/strong>: Clusters are groups of related or connected items, often found in various fields such as computer science, data analysis, or even social sciences. For example, in data analysis, clusters might represent groups of data points with similar characteristics, making it easier to understand patterns or trends in the data.&lt;/p>
&lt;p>&lt;strong>Cluster Membership&lt;/strong>: Cluster membership refers to the process of determining which items or entities belong to a particular cluster. This task can be done based on various criteria, such as similarity in attributes, spatial proximity, or shared characteristics.&lt;/p>
&lt;p>&lt;strong>Locks&lt;/strong>: In computer programming, locks are mechanisms used to manage access to shared resources, such as files, data structures, or hardware devices. When multiple processes or threads need to access a shared resource simultaneously, locks ensure that only one process or thread can access it at a time, preventing data corruption or conflicts.&lt;/p>
&lt;p>&lt;strong>Lock Contentions&lt;/strong>: Lock contention occurs when multiple processes or threads attempt to acquire the same lock simultaneously. When this happens, one process or thread must wait until the lock becomes available, leading to potential delays and reduced performance.&lt;/p>
&lt;p>&lt;strong>Critical Paths&lt;/strong>: In project management or process analysis, a critical path is the longest chain of dependent tasks that determines the overall duration of the project or process. Any delay in tasks along the critical path will directly impact the project&amp;rsquo;s completion time.&lt;/p>
&lt;p>&lt;strong>Tokens&lt;/strong>: Tokens can have various meanings depending on the context. In computer programming, tokens are the smallest units of source code recognized by a compiler or interpreter. In cryptography, tokens can represent digital certificates or authentication data used for secure communication.&lt;/p>
&lt;p>&lt;strong>Nodes&lt;/strong>: In the context of network theory or graph theory, nodes are individual points or entities that form a network or graph. In a computer network, nodes can be devices like computers or routers, and in a social network, nodes can represent individuals or entities.&lt;/p>
&lt;p>&lt;strong>Peers&lt;/strong>: Peers are entities within a network that have the same status or capabilities. In peer-to-peer networks, each node can act as both a client and a server, enabling direct communication between nodes without relying on a central server.&lt;/p>
&lt;p>&lt;strong>Gossipers, Gossip Protocol&lt;/strong>: In distributed systems, gossipers are nodes that share information with each other using the gossip protocol. The gossip protocol involves randomly selecting peers and exchanging information in a decentralized manner, allowing information to spread quickly across the network.&lt;/p>
&lt;p>&lt;strong>Threads&lt;/strong>: Threads are the smallest units of execution within a process in computer programming. Multiple threads can run concurrently within a single process, enabling multitasking and parallel processing. Threads can share the same resources within the process, making them more lightweight than separate processes. However, proper synchronization is essential to prevent data corruption or conflicts when multiple threads access shared resources.&lt;/p>
&lt;p>&lt;strong>Flush and Writes Contention&lt;/strong>: This refers to a situation where simultaneous operations involving data flushing (saving data to a storage medium) and data writing (updating or adding data) are causing conflicts or delays. This contention can arise when multiple processes or threads attempt to perform these operations concurrently, leading to performance bottlenecks or potential data integrity issues.&lt;/p>
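&lt;p>The lock and lock-contention terms above can be made concrete with a small, hypothetical Python example: three threads contend for one lock, so their critical sections are serialized even though the threads start together:&lt;/p>

```python
import threading
import time

# Toy demonstration of lock contention: work that could overlap is
# serialized because every thread must hold the same lock.

lock = threading.Lock()
order = []

def worker(name):
    with lock:            # only one thread may hold the lock at a time
        order.append(name)
        time.sleep(0.01)  # simulate work done under the lock

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# The three critical sections ran back to back, so the total time is
# roughly the sum of the three sleeps, not the time of one.
print(len(order), round(elapsed, 3))
```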
&lt;h2 id="accomplishments">Accomplishments&lt;/h2>
&lt;p>We have been able to build docker containers for the following scalability bugs:&lt;/p>
&lt;p>&lt;strong>IGNITE 12087&lt;/strong>&lt;/p>
&lt;p>This bug stems from the resolution of the IGNITE-5227 issue (another bug), which led to a significant decline in the performance of a particular operation. Before IGNITE-5227 was addressed, inserting 30,000 entries was remarkably efficient, completing in roughly 1 second. After the resolution, the same insertion of 30,000 entries slowed considerably, taking approximately 130 seconds, a performance degradation of nearly 100 times.&lt;/p>
&lt;p>&lt;strong>CASSANDRA 14660&lt;/strong>&lt;/p>
&lt;p>This bug is related to how clusters work together and how a lock is causing conflicts with the critical path. The issue arises from a method call that uses O(Peers * Tokens) resources while contending for a lock, which is causing problems in the write path. The lock is used to protect cached tokens that are essential for determining the correct replicas. The lock is implemented as a synchronized block in the TokenMetadata class.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>It was fixed by reducing the complexity of the operation to O(Peers) taking advantage of some properties of the token list and the data structure.&lt;/p>
&lt;p>&lt;strong>CASSANDRA 12281&lt;/strong>&lt;/p>
&lt;p>This bug is also related to how clusters work together and a lock conflict. The issue arises when a specific method accesses a lot of resources (O(Tokens^2)) while contending for a read lock. As reported, a cluster with around 300 nodes has around 300 * 256 tokens (assuming the default number of tokens), so joining a new member reportedly takes more than 30 minutes. Due to the long execution time, this lock delays every gossip message, so the node never becomes active.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>The granularity of the lock was decreased: the expensive function calls no longer take the problematic read lock and instead use a synchronized block on a specific field, which does the job much better.&lt;/p>
&lt;p>&lt;strong>HA16850&lt;/strong>&lt;/p>
&lt;p>This is a bug related to obtaining thread information in the JvmMetrics package. The original buggy version used MXBeans to obtain thread information. That call uses an underlying native implementation that holds a lock on threads, preventing thread termination or creation. This means that the more threads we have to obtain information for, the longer the function call holds the lock. The result is that the execution time scales with the number of active threads, O(threads).&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>Developers utilized a ThreadGroup to keep track of obtaining metrics for threads. The result is that there is no lock held for every thread.&lt;/p>
&lt;p>&lt;strong>CA13923&lt;/strong>&lt;/p>
&lt;p>This issue revolves around conflicts between the &amp;ldquo;flush&amp;rdquo; and &amp;ldquo;writes&amp;rdquo; processes. The main problem is that during the &amp;ldquo;flush&amp;rdquo; process, a resource-intensive function called &amp;ldquo;getAddressRanges&amp;rdquo; is invoked. This function has a high computational cost and its complexity is O(Tokens^2). In other words, the time it takes to complete this function grows quickly as the number of &amp;ldquo;tokens&amp;rdquo; increases. This situation is causing challenges and delays in the overall process.&lt;/p>
&lt;p>&lt;em>How was this fixed?&lt;/em>&lt;/p>
&lt;p>This function call affected many paths and they made sure no one calls getAddressRanges in critical paths.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>&lt;strong>Demanding Memory Requirements&lt;/strong>: Running certain builds consumes a significant amount of memory. This places a strain on system resources and can impact the overall performance and stability of the process.&lt;/p>
&lt;p>&lt;strong>Little Issues Impacting Execution&lt;/strong>: Often, seemingly minor details can obstruct the successful execution of a build. Resolving such issues requires thorough investigation and extensive research into similar problems faced by others in the past.&lt;/p>
&lt;p>&lt;strong>Complexities of Scalability Bugs&lt;/strong>: Identifying the underlying causes of scalability-related bugs is intricate. These bugs exhibit unique characteristics that can complicate the process of pinpointing and comprehending their root origins.&lt;/p>
&lt;h2 id="what-is-docker--for-those-who-dont-know-about-it-">What is Docker? ( For those who don&amp;rsquo;t know about it )&lt;/h2>
&lt;p>Docker is a platform that facilitates the containerization of applications, leading to consistent and efficient deployment across diverse environments. Its benefits include portability, resource efficiency, isolation, and rapid development cycles. DockerHub complements Docker by providing a centralized hub for sharing and accessing container images, fostering collaboration and ease of use within the Docker ecosystem.&lt;/p>
&lt;p>More about docker &lt;a href="https://docs.docker.com/get-started/overview/" target="_blank" rel="noopener">https://docs.docker.com/get-started/overview/&lt;/a>&lt;/p></description></item><item><title>Mid-term blog post for Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230801-srishti-j18/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230801-srishti-j18/</guid><description>&lt;p>Hello!&lt;/p>
&lt;p>I am Srishti Jaiswal and this is my second blog post for the 2023 Summer of Reproducibility Fellowship.&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As I reach the halfway mark of my internship journey, I have had the incredible opportunity to work on a project that revolves around reproducing an adaptive video research result using cloud-based experimentation. This blog post delves into my work so far: the significant milestones achieved, specific accomplishments to celebrate, and the challenges overcome. Utilizing CloudLab and FABRIC, I embarked on a journey to reproduce essential figures from the research paper &lt;a href="https://dl.acm.org/doi/10.1145/2491172.2491179" target="_blank" rel="noopener">Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming&lt;/a>, ensure Python 2 and Python 3 compatibility, and incorporate an Estimated Download Rate column in the log file produced by the video client. Let&amp;rsquo;s explore the details of this internship experience.&lt;/p>
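&lt;p>For readers unfamiliar with the paper, its core idea can be illustrated with a minimal sketch. This is not the paper&amp;rsquo;s or the client&amp;rsquo;s actual code, and the rate ladder and thresholds below are invented: the next bitrate is chosen purely from buffer occupancy, ramping linearly between a &amp;ldquo;reservoir&amp;rdquo; and a &amp;ldquo;cushion&amp;rdquo; of buffered video.&lt;/p>

```python
# Buffer-based rate selection, sketched with illustrative constants.
RATES = [350, 600, 1000, 2000, 3000]  # available bitrates in kbps (invented)
RESERVOIR, CUSHION = 10, 50           # buffer thresholds in seconds (invented)

def choose_rate(buffer_seconds):
    """Map current buffer occupancy to a bitrate from RATES."""
    if buffer_seconds <= RESERVOIR:   # buffer nearly empty: play it safe
        return RATES[0]
    if buffer_seconds >= CUSHION:     # buffer comfortable: use the top rate
        return RATES[-1]
    # In between, ramp linearly and pick the highest rate under the target.
    frac = (buffer_seconds - RESERVOIR) / (CUSHION - RESERVOIR)
    target = RATES[0] + frac * (RATES[-1] - RATES[0])
    return max(r for r in RATES if r <= target)
```

&lt;p>With these constants, a nearly empty buffer yields the lowest rate, a full one the highest, and intermediate occupancies pick a rate along the linear ramp.&lt;/p>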
&lt;h2 id="major-milestones-reached">Major Milestones Reached&lt;/h2>
&lt;p>Here are the milestones we have reached so far:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Familiarity with CloudLab and FABRIC testbeds: I learned how to run an adaptive video experiment, which is the jumping-off point for my project, on the CloudLab and FABRIC platforms.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Python 2 and Python 3 compatibility: My first task was to port an existing open-source code base developed for Python 2 (which is no longer supported) so that it can run in Python 3.
The code now runs successfully under both versions for all the policies in the existing open-source client, i.e., Basic, Netflix, and Sara.
This fixed &lt;a href="https://github.com/Srishti-j18/AStream/issues/1" target="_blank" rel="noopener">issue #1&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Estimated Download Rate for Basic Policy: To make it easier for users to understand and visualize how the adaptive video policy works, I added an additional metric, “Estimated Download Rate”, to the output file produced by the adaptive video client.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Graphing Buffer Occupancy and Estimated Download Rate: I extended the existing experiment to show two additional visualizations that are important for understanding how the adaptive video client works: buffer occupancy vs time and estimated download rate vs time.&lt;/p>
&lt;/li>
&lt;/ol>
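&lt;p>To give a flavor of the &amp;ldquo;Estimated Download Rate&amp;rdquo; metric from milestone 3, here is a hypothetical sketch (the function name and smoothing weight are my own, not the client&amp;rsquo;s actual code): the estimated rate for a segment is its size divided by its download time, optionally smoothed across segments.&lt;/p>

```python
def estimated_download_rate(segment_bits, download_seconds,
                            prev_estimate=None, weight=0.8):
    """Estimated download rate in bits/second, smoothed across segments."""
    instantaneous = segment_bits / download_seconds
    if prev_estimate is None:
        return instantaneous
    # Exponential smoothing so a single slow segment does not dominate.
    return weight * prev_estimate + (1 - weight) * instantaneous

first = estimated_download_rate(4_000_000, 2.0)          # a 2 Mbps download
second = estimated_download_rate(4_000_000, 1.0, first)  # smoothed toward 4 Mbps
```

&lt;p>Logging this value per segment makes it easy to plot the estimated rate against time, as done in milestone 4.&lt;/p>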
&lt;h2 id="overcoming-challenges">Overcoming Challenges&lt;/h2>
&lt;p>I encountered several challenges throughout this project, especially as it was my first time working independently on a research paper as a third-year engineering student. However, with my mentor&amp;rsquo;s guidance and support, I persevered and learned to tackle each obstacle with determination.&lt;/p>
&lt;p>One significant challenge was porting the entire code from Python2 to Python3. This transition resulted in numerous errors, and I often found it challenging to pinpoint where the mistakes occurred. To overcome this, I adopted a step-by-step approach, fixing errors one by one and verifying them using Python2 for comparison.&lt;/p>
&lt;p>Understanding the complex codebase was another hurdle that led to moments of feeling stuck in an infinite loop. But every time I faced such situations, I sought my mentor&amp;rsquo;s advice, and together, we made strategic changes to overcome these challenges.&lt;/p>
&lt;p>I am immensely grateful for my mentor&amp;rsquo;s expertise and support throughout this internship. Her guidance played a crucial role in helping me navigate through the challenges and grow both professionally and personally. I eagerly look forward to the rest of the journey, knowing that I can continue making meaningful contributions to this research project with her inspiring mentorship.&lt;/p>
&lt;h2 id="future-prospects">Future Prospects&lt;/h2>
&lt;p>As the second half of my internship approaches, I am eager to further refine and expand our experimentation.
Our main aim is to reproduce the existing work and provide a clear guide for other students to do the same; for this, I have to create a framework that helps them improve and build upon this work.&lt;/p>
&lt;p>I hope you enjoyed reading this blog post. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!&lt;/p></description></item><item><title>Midterm: Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230801-25chilingh/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230801-25chilingh/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller Project&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1hDU87aAzbn88vWwOHH0ggIID2W4KKzp8SKF1Lb8LU90/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong>, aimed to create comprehensive technical documentation to help onboard new users of the OSAVC controller.&lt;/p>
&lt;p>I have accomplished the following:&lt;/p>
&lt;ul>
&lt;li>From the KiCad Schematic Editor, created pinouts of the I/O connectors on the OSAVC.&lt;/li>
&lt;li>Detailed a hardware overview of the OSAVC by labeling and describing each electrical component.&lt;/li>
&lt;li>Documented the setup for loading code on the OSAVC, including software such as Git, MPLAB X, XC32 Compiler, and serial terminal and hardware by showing how to connect the PICKit3 and OSAVC to a PC.&lt;/li>
&lt;li>Tested the OSAVC by receiving and transmitting characters in the serial port into a buffer.&lt;/li>
&lt;li>Fixed bugs/errors in the NEO_M8N GPS module library and PWM motors library.&lt;/li>
&lt;li>Created a new library for the uni and bidirectional ESC brushless motors.&lt;/li>
&lt;li>Created a user-interfaced test harness for all peripherals: serial, IMU, GPS, encoder, PWM actuators, radio telemetry, Mavlink heartbeat, radio controller, and LIDAR.&lt;/li>
&lt;li>Incorporated new user interface element and fixed video streaming errors in the Flask app running on the Raspberry Pi 4 communicating with the OSAVC.&lt;/li>
&lt;li>Documented both software and hardware steps to run the OSAVC with a companion computer such as a Raspberry Pi 4.&lt;/li>
&lt;li>Highlighted common problems encountered with the OSAVC.&lt;/li>
&lt;li>Created a contributor&amp;rsquo;s guide for others to create new libraries or contribute to the OSAVC project.&lt;/li>
&lt;li>Designed a &lt;a href="https://grabcad.com/library/ptn78020w-1" target="_blank" rel="noopener">switching voltage regulator&lt;/a> in SOLIDWORKS.&lt;/li>
&lt;li>Designed a self-balancing bot that employs the OSAVC in SOLIDWORKS.&lt;/li>
&lt;/ul>
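&lt;p>As an aside on the serial test in the list above: receiving and transmitting characters through a buffer is typically done with a ring (circular) buffer on microcontrollers. Here is a minimal sketch of that data structure, purely illustrative and not the OSAVC firmware&amp;rsquo;s actual C implementation:&lt;/p>

```python
class RingBuffer:
    """Fixed-size circular byte buffer, as commonly used for serial RX/TX."""

    def __init__(self, size=64):
        self.buf = bytearray(size)
        self.head = self.tail = self.count = 0

    def put(self, byte):
        if self.count == len(self.buf):
            raise OverflowError("buffer full")
        self.buf[self.head] = byte
        self.head = (self.head + 1) % len(self.buf)  # wrap around
        self.count += 1

    def get(self):
        if self.count == 0:
            raise IndexError("buffer empty")
        byte = self.buf[self.tail]
        self.tail = (self.tail + 1) % len(self.buf)
        self.count -= 1
        return byte

rb = RingBuffer(4)
for b in b"hi":
    rb.put(b)                                # "receive" two characters
echoed = bytes(rb.get() for _ in range(2))   # "transmit" them back
```

&lt;p>The head and tail indices wrap around the fixed-size array, so an interrupt handler can enqueue received bytes while the main loop dequeues and echoes them.&lt;/p>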
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Currently, the laser cutter at UCSC is under maintenance, so we couldn&amp;rsquo;t assemble the self-balancing bot yet. Once we assemble it, I will finish and document the control algorithms. We can also try incorporating ML models on the Raspberry Pi with the Coral USB accelerator on the self-balancing bot.&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230801-shayantan/</link><pubDate>Tue, 01 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230801-shayantan/</guid><description>&lt;p>We are currently midway through the OSRE 2023 program, and the following post lists the progress that I have made on the project so far.
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our overall goal was to enumerate the effect of sequence data quality on execution times. Towards that end, we decided to first identify suitable datasets from the two commonly available -omics data modalities - transcriptomics and genomics. Albrecht et al. [1] developed &lt;em>&lt;strong>seqQscorer&lt;/strong>&lt;/em> to automate the quality control step of NGS data analysis through predictive modeling. They have also published the list of ENCODE datasets used for training the models. A quality label of 0 was assigned to released files and 1 to revoked files. Based on the guidelines set forth by ENCODE&amp;rsquo;s Data Coordination Centre (DCC), a comprehensive manual annotation of the data was done by scientists, and the resulting quality variable &amp;ldquo;status&amp;rdquo; was published to serve as an indication of the quality of the data. The following steps outline the process of generating the data table for building the machine learning models.&lt;/p>
&lt;ul>
&lt;li>Step 1: Programmatically accessed 86 (34 released; 34 revoked) RNA-seq files from the ENCODE database. All the FASTQ files were single-ended.&lt;/li>
&lt;li>Step 2: Programmatically accessed 288 (144 released; 144 revoked) DNA-seq files from the ENCODE database. All the FASTQ files were paired-end.&lt;/li>
&lt;li>Step 3: Implemented the STAR aligner for RNA-seq and the BWA aligner for DNA-seq. The resulting outputs contained the alignment times for both the &amp;ldquo;revoked&amp;rdquo; and &amp;ldquo;released&amp;rdquo; files.&lt;/li>
&lt;li>Step 4: Ran statistical tests to determine whether there are any significant differences in the runtimes of the two types of files.&lt;/li>
&lt;/ul>
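&lt;p>Step 4 can be sketched with a simple two-sample comparison. The snippet below computes Welch&amp;rsquo;s t statistic from scratch as one example of such a test; the alignment times are made up for illustration, and the actual analysis may use a different test:&lt;/p>

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic: compares two sample means without assuming
    equal variances (e.g. runtimes of released vs. revoked files)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / \
        math.sqrt(va / len(a) + vb / len(b))

released = [10.2, 11.0, 9.8, 10.5]  # invented alignment times (minutes)
revoked = [12.1, 13.4, 12.8, 11.9]
t = welch_t(released, revoked)      # strongly negative: released files are faster
```

&lt;p>A large-magnitude t statistic relative to the appropriate t distribution would indicate a significant runtime difference between the two groups.&lt;/p>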
&lt;p>Currently I am running the FastQC tool to extract data quality metrics for the same set of files as discussed above. Once I have collected those metrics, I can start building regression models to determine whether data quality has any significant impact on execution time. The first step in the execution of a typical genomic analysis workflow is quality control of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Through our analysis, we aim to develop a reproducible ML model that will give the user an estimate of the runtime based on the raw FASTQ file as input.&lt;/p>
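&lt;p>The planned regression can likewise be sketched in a few lines. Everything below is invented for illustration (a single quality metric with made-up values); the real models will be trained on the full set of quality metrics:&lt;/p>

```python
def ols(x, y):
    """Ordinary least squares for one predictor: y ~ slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return slope, my - slope * mx

# One quality metric vs. alignment runtime (values invented).
quality = [0.9, 0.8, 0.7, 0.6, 0.5]       # e.g. fraction of high-quality reads
runtime = [10.0, 11.5, 13.0, 14.5, 16.0]  # minutes

slope, intercept = ols(quality, runtime)
predicted = slope * 0.75 + intercept      # runtime estimate for a new file
```

&lt;p>With real FastQC metrics as predictors, a model of this shape (or a richer one) would give users the runtime estimate described above.&lt;/p>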
&lt;h2 id="references">References&lt;/h2>
&lt;p>[1] Albrecht, S., Sprang, M., Andrade-Navarro, M.A. &lt;em>et al.&lt;/em> seqQscorer: automated quality control of next-generation sequencing data using machine learning. &lt;em>Genome Biol&lt;/em> &lt;strong>22&lt;/strong>, 75 (2021). &lt;a href="https://doi.org/10.1186/s13059-021-02294-2" target="_blank" rel="noopener">https://doi.org/10.1186/s13059-021-02294-2&lt;/a>&lt;/p></description></item><item><title>[Mid-term] Capturing provenance into Data Science/Machine Learning workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230731-jesselima/</guid><description>&lt;p>This post describes our midterm work status and some achievements we have done so far in &lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit#heading=h.nnxl1g16trg0" target="_blank" rel="noopener">the project&lt;/a> for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/">noWorkflow&lt;/a> package.&lt;/p>
&lt;h4 id="the-initial-weeks">The initial weeks&lt;/h4>
&lt;p>I started doing a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms. It was a new subject to me, and I aimed to build a more robust theoretical background in the field. Meanwhile, I took notes in &lt;a href="https://jaglima.github.io/" target="_blank" rel="noopener">this series of posts&lt;/a>.&lt;/p>
&lt;p>Then, as planned, I integrated with the current noWorkflow supporters in order to get a broader view of the project and their contributions. Additionally, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, and I set up a weekly one-hour schedule to keep track of my activities.&lt;/p>
&lt;h3 id="brainstormed-opportunities">Brainstormed opportunities&lt;/h3>
&lt;p>At the beginning of June, we also met with other project supporters to brainstorm about our initial proposal. From this meeting, we came up with a plan for how to technically approach a new noWorkflow feature for Data Science and Machine Learning experiment management.&lt;/p>
&lt;p>In this brainstorm, we agreed that &lt;em>Jupyter Notebooks are, by far, the most frequent setup in DS/ML computational experiments. They established themselves as the fundamental artifact by embedding code and text and enabling execution and visualization. Entire experiments are created and kept in Jupyter notebooks until they are sent to production. The opportunity at hand is to integrate noWorkflow with Jupyter Notebooks&lt;/em>.
Our mid-term goal was therefore adapted from the original plan of only selecting and executing a prototypical ML experiment: we added the goal of paving the way for a tagging feature for Notebook cells.&lt;/p>
&lt;p>More specifically, DS/ML experimental workflows usually have well-defined stages composed of &lt;em>data reading&lt;/em>, &lt;em>feature engineering&lt;/em>, &lt;em>model scoring&lt;/em>, and &lt;em>metrics evaluation&lt;/em>. In our dream space, the user would tag a cell in their experiment, enabling the capture of the tagged metadata into a database. This step integrates the ultimate goal of facilitating comparisons, management, and even causal inference across different trials of a DS/ML experiment.&lt;/p>
&lt;h3 id="current-deliverables">Current deliverables&lt;/h3>
&lt;p>So, based on our plans, we created a separate table to store the metadata from cell tagging. This table stores the cell hash codes and the information needed to match the code executed within a cell. As a result, we can store tags and the activation ids of the cells, enabling us to identify the cell containing a given stage of a DS/ML experiment.&lt;/p>
&lt;p>The second feature implemented was tagging a specific variable. As with cells, it is now possible to stamp a given variable with a tag, keeping its name, id, and received value in this separate table.&lt;/p>
&lt;p>Finally, we worked on displaying the dependencies of a given variable. In this case, by tagging a given variable, we can display the other variables, values, and cells activated in its construction. Then, we can visualize the dependencies that contributed to its final value.&lt;/p>
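&lt;p>To make the idea concrete, here is a sketch of what such a separate tag table could look like. The schema, column names, and values below are illustrative only, not noWorkflow&amp;rsquo;s actual schema:&lt;/p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tag (
        trial_id      INTEGER,  -- which execution of the experiment
        name          TEXT,     -- e.g. 'feature_engineering' or 'model_scoring'
        cell_hash     TEXT,     -- hash matching the code executed in the cell
        activation_id INTEGER,  -- activation id of the tagged cell
        variable_name TEXT,     -- filled when a variable, not a cell, is tagged
        value_repr    TEXT      -- repr of the tagged variable's value
    )
""")
conn.execute("INSERT INTO tag VALUES (?, ?, ?, ?, ?, ?)",
             (1, "model_scoring", "ab12cd", 42, "score", "0.93"))
rows = conn.execute(
    "SELECT name, variable_name, value_repr FROM tag WHERE trial_id = 1"
).fetchall()
```

&lt;p>Keyed by trial, a table of this shape is what makes cross-trial comparisons of tagged stages and variables straightforward.&lt;/p>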
&lt;p>For an overview of current developments, please refer to my &lt;a href="https://github.com/jaglima/noworkflow/tree/stage_tagging" target="_blank" rel="noopener">fork of the main project&lt;/a>.&lt;/p>
&lt;h3 id="challenges">Challenges&lt;/h3>
&lt;p>During this period, we had to make choices along the way. For instance, capturing the provenance of cells through tags is a different solution than tagging code chunks in scripts. In this case, we decided to stick with tagging Notebook cells at this moment. We also opted to start storing the metadata to enable comparisons between trials rather than focus on a sophisticated graphic and user-friendly cell tagging system. We also opted to keep this metadata info stored in a separate table in the database.&lt;/p>
&lt;h3 id="next-steps">Next steps&lt;/h3>
&lt;p>In the second half of the summer, our goal is to integrate these features in order to proceed with comparisons among experiments. Such comparisons would use the tagged variables as the hyperparameters of DS/ML experiments or key variables to assess the experiments, such as errors or scores. As a result, we will be able to compare the results of two trials in a more accurate, and easily reproducible experiment.&lt;/p></description></item><item><title>Implemented IO uring for Key-Value Drives</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel (&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank">link&lt;/a> to my introduction post), and I am currently working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a>. The goal of the project was to leverage the capabilities of io_uring and implement a new backend driver.&lt;/p>
&lt;p>In the existing sockets backend, we use non-blocking sockets with looping to ensure all the data is written. Here is a simplified flow diagram for the
same. The reasoning behind using non-blocking sockets and TCP_NODELAY is to get proper network utilization. This snippet from the code explains it further.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">NODELAY means that segments are always sent as soon as possible,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">even if there is only a small amount of data. When not set,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">data is buffered until there is a sufficient amount to send out,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">thereby avoiding the frequent sending of small packets, which
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">results in poor utilization of the network. This option is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">overridden by TCP_CORK; however, setting this option forces
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">an explicit flush of pending output, even if TCP_CORK is
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">currently set.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sockets flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_44c789c0dc2dbae770c40595d35ae941.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_socket_huf9f86d17a6f220de349bb1b61ce1052f_93743_fe3f3d8030752b92e5fb87ea1d67e0c2.webp"
width="469"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>In the above figure, we have a &lt;a href="https://gitlab.com/kinetic-storage/libkinetic/-/blob/manank/src/ktli_socket.c?ref_type=heads#L436" target="_blank" rel="noopener">loop&lt;/a> with a writev call. We check the return value: if not all the data has been written, we adjust the
offsets and loop again; once all the data has been written, we exit the loop and return from the function. This works well with traditional sockets, as we get the return value from the writev call as soon as it returns. If we try to follow the same design with io_uring, we get the
following flow diagram.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_bc898fc227145dff9464f87e8f66363f.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_nonb_huf47400b8be9e2650586ffc8c37d95fc6_108831_eaf262f65651ce613bf0a033f897afde.webp"
width="417"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Here, as you can see, there is considerable additional overhead if we want to check the return value before sending the
next writev, because we need to know how many bytes have been written so far to adjust the offsets and issue
the next request accordingly. Thus, in every iteration of the loop we need to get an SQE, prep it for writev,
submit it, and then wait for the CQE to get the return value of the writev call.&lt;/p>
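&lt;p>The partial-write retry loop from the sockets backend can be sketched in a few lines of Python. This is an illustration of the pattern, not the project&amp;rsquo;s C code from ktli_socket.c:&lt;/p>

```python
import socket

def send_all(sock, payload):
    """Keep writing from the current offset until the whole payload is sent,
    mirroring the writev loop: a partial write advances the offset and loops."""
    offset = 0
    while offset < len(payload):
        written = sock.send(payload[offset:])  # may accept fewer bytes than asked
        offset += written
    return offset

a, b = socket.socketpair()
sent = send_all(a, b"x" * 10_000)  # delivered in one or more writes
a.close()
received = b.recv(10_000)
b.close()
```

&lt;p>With plain sockets this return-check-retry cycle is cheap; with io_uring, each iteration would also pay the SQE/CQE round trip described above.&lt;/p>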
&lt;p>The alternate approach would be to write the full message/iovec atomically in one call, as shown in following diagram.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="possible uring flow" srcset="
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp 400w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_056949985d6ef71540ba0c4992f11376.webp 760w,
/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230730-manank/ktli_uring_ideal_hu2d99f0bee974127b66eb083c255358d0_60614_df20a0788e55e56bf7af70d91c7275c6.webp"
width="535"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>However, on trying this method and running fio tests, we noticed that it worked well with smaller block sizes, like
16k, 32k, and 64k, but failed consistently with larger block sizes like 512k or 1m. This was because it was not able to
write all the data to the socket in one go. This method showed good results compared to the sockets backend (for small block
sizes, that is). We tried increasing the send/recv buffers to 1MiB-10MiB, but it still struggled with larger block sizes.&lt;/p>
&lt;p>Going forward, we discussed a few ideas to understand the performance trade-offs. One is to use a static variable and increment it on
every loop iteration; in this way we can find out whether that is really the contributing factor to our problem. Another idea
is to break the message down into small chunks, say 256k, set up io_uring with SQE polling, and then link and submit
those requests in a loop, without calling io_uring_submit and waiting for a CQE after each one. The plan is to try these ideas, discuss, and
come up with new ideas on how we can leverage io_uring for ktli backend.&lt;/p></description></item><item><title>Improving Video Applications' Accuracy by Enabling The Use of Concierge</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230731-zharfanf/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230731-zharfanf/</guid><description>&lt;style>
p {
text-align: justify;
}
img {
display: block;
margin-left: auto;
margin-right: auto;
}
&lt;/style>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello, it&amp;rsquo;s me again, Faishal, a SoR project contributor for the edgebench project. For the past two months, my mentors and I have been working on improving the performance of our system. In this report, I would like to share what we have been working on.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Edgebench is a project that focuses on how to efficiently distribute resources (bandwidth and CPU usage) across several video applications. Nowadays, video applications process their data or video on a server, an approach known as edge computing. Bandwidth and compute may therefore be the greatest concerns when we talk about edge computing over a WAN, because they are strictly limited.&lt;/p>
&lt;p>Consider the following case: suppose we have 3 video applications running, located in several areas across a city, and suppose the total bandwidth allocated to those 3 video applications is fixed. Naively, we might divide the bandwidth evenly among the cameras in the system. We would then have the following graph of allocated bandwidth over time.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/baseline_alloc.png" alt="Baseline Allocation" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>These allocations are fixed and won&amp;rsquo;t change. However, every video application has its own characteristics that determine how well it can deliver a good result, i.e., f1-score. It is our task to maintain a high average f1-score; therefore, we need to implement a new solution which is accuracy-oriented. This is where the accuracy-gradient&lt;a href="#acc">[1]&lt;/a> comes in.&lt;/p>
&lt;h2 id="system-design">System Design&lt;/h2>
&lt;p>In our current design, we need a resource allocator, namely the concierge. The concierge determines how much bandwidth every video application (vap) in the system needs, performing the allocation at a predetermined time interval. This process is called profiling: the concierge first asks every vap to calculate its f1-score on a certain video segment when its bandwidth is increased by profile_delta. The difference between this f1-score and the default f1-score is stored as &lt;code>f1_diff_high&lt;/code>. The concierge then asks each vap to reduce its bandwidth by profile_delta and repeat the same process, producing &lt;code>f1_diff_low&lt;/code>. Both results are sent back to the concierge for the next step, the sensitivity calculation, where the sensitivity is&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://latex.codecogs.com/svg.image?&amp;amp;space;sensitivity[i]=f1%5c_diff%5c_high[i]-%5cSigma_%7bk=1%7d%5enf1%5c_diff%5c_low[k];k%5cneq&amp;amp;space;i&amp;amp;space;" alt="sensitivity[i] = f1_diff_high[i] - \Sigma_{k=1}^nf1_diff_low[k]; k \neq i" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>This equation tells us which video application will give us the best f1-score improvement if we add more bandwidth to one vap while reducing the others&amp;rsquo;. Based on this, the concierge will give bandwidth to the app with the highest sensitivity and take bandwidth from the app with the lowest sensitivity.&lt;/p>
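&lt;p>The profiling and reallocation step can be sketched as follows. The function and variable names are illustrative rather than the project&amp;rsquo;s actual API, and the numbers are invented:&lt;/p>

```python
def sensitivities(f1_diff_high, f1_diff_low):
    """sensitivity[i] = f1_diff_high[i] - sum of f1_diff_low[k] for k != i."""
    total_low = sum(f1_diff_low)
    return [high - (total_low - f1_diff_low[i])
            for i, high in enumerate(f1_diff_high)]

def reallocate(bandwidth, f1_diff_high, f1_diff_low, profile_delta=80):
    """Move profile_delta kbps from the least to the most sensitive app."""
    s = sensitivities(f1_diff_high, f1_diff_low)
    bandwidth[s.index(max(s))] += profile_delta
    bandwidth[s.index(min(s))] -= profile_delta
    return bandwidth

# Three apps at 400 kbps each; app 0 benefits most from extra bandwidth.
bw = reallocate([400, 400, 400], [0.10, 0.02, 0.01], [0.05, 0.01, 0.00])
```

&lt;p>Repeating this at every profiling interval gradually shifts bandwidth toward the apps where it improves the f1-score the most.&lt;/p>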
&lt;h2 id="results">Results&lt;/h2>
&lt;p>As mentioned, our main objective is to improve accuracy. However, two parameters will be taken into account: the improvement itself and the overhead of obtaining it. We first chose 3 dds apps&lt;a href="#dds">[2]&lt;/a> that we think represent our ideal case. The following graphs show the profile of this ideal case.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/ideal_case.png" alt="Ideal Case" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We can see that two of them have high sensitivity, especially at lower bandwidths, and one of them has low sensitivity. This is a perfect scenario, since we may sacrifice one app&amp;rsquo;s bandwidth and give it to the app that has the highest sensitivity at that iteration. We will run the experiment under the following setup:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">DATASETS&lt;/span>&lt;span class="o">=(&lt;/span>&lt;span class="s2">&amp;#34;&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;uav-1&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;coldwater&amp;#34;&lt;/span> &lt;span class="s2">&amp;#34;roppongi&amp;#34;&lt;/span>&lt;span class="o">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">MAX_BW&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">1200&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">PROFILING_DELTA&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">80&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nv">MI&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="m">5&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This setup means we use a total bandwidth of 1200 kbps, so initially the bandwidth is distributed evenly (400 kbps per app). The profiling delta (&lt;code>PROFILING_DELTA&lt;/code>) is 80 kbps and the profiling interval (&lt;code>MI&lt;/code>) is 5 seconds.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/merged_ideal.png" alt="Merged Ideal" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">&lt;strong>Mode&lt;/strong>&lt;/th>
&lt;th style="text-align:center">&lt;em>DDS&lt;/em> &lt;br> (&lt;span style="color:blue">&lt;em>uav-1&lt;/em>&lt;/span>)&lt;/th>
&lt;th style="text-align:center">&lt;em>DDS&lt;/em> &lt;br> (&lt;span style="color:orange">&lt;em>coldwater&lt;/em>&lt;/span>)&lt;/th>
&lt;th style="text-align:center">&lt;em>DDS&lt;/em> &lt;br> (&lt;span style="color:green">&lt;em>roppongi&lt;/em>&lt;/span>)&lt;/th>
&lt;th style="text-align:center">Average&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">Baseline&lt;/td>
&lt;td style="text-align:center">0.042&lt;/td>
&lt;td style="text-align:center">0.913&lt;/td>
&lt;td style="text-align:center">0.551&lt;/td>
&lt;td style="text-align:center">0.502&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Concierge&lt;/strong>&lt;/td>
&lt;td style="text-align:center">0.542&lt;/td>
&lt;td style="text-align:center">0.854&lt;/td>
&lt;td style="text-align:center">0.495&lt;/td>
&lt;td style="text-align:center">&lt;strong>0.63&lt;/strong> (&lt;span style="color:green">&lt;em>+25.5%&lt;/em>&lt;/span>)&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>From the result, we managed to improve the average f1-score by &lt;strong>0.1&lt;/strong>, or &lt;strong>25.5%&lt;/strong>, which is a very good result. There are a total of 10 videos in our dataset; for the next experiment, we first generate 6 combinations of DDS apps. Note that each combination includes uav-1, since we know it has the highest sensitivity. We run the experiment with 4 bandwidth scenarios &lt;strong>(1200, 1500, 1800, 2100)&lt;/strong> in kbps.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/only_uav_merged.png" alt="Only Uav-1" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The left figure depicts the average improvement from the concierge. We can see that the improvement decreases as the total bandwidth increases. The reason is that at higher bandwidths the sensitivity tends toward 0, so the concierge does not perform any reallocation. Overall, this confirms our previous result that, with the help of uav-1, the concierge can improve the f1-score by up to 0.1. In the next experiment we randomly pick 3 DDS videos out of the 10, repeating the draw 10 times, to see how the concierge performs without any help from uav-1.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/random_merged.png" alt="Random Merged" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>From the result, we still obtain an improvement. However, the average improvement decreases compared to the previous experiment. The reason for this phenomenon is discussed later.&lt;/p>
&lt;h3 id="overhead-measurement">Overhead Measurement&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/overhead_1.png" alt="Overhead Measurement" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Each graph above represents a different total bandwidth. In this experiment, a lower &lt;code>MI&lt;/code> clearly leads to higher overhead, since more profiling runs are performed than with a higher &lt;code>MI&lt;/code>. The four graphs show that lowering the &lt;code>MI&lt;/code> incurs a significant trade-off, since the resulting improvement is not highly significant. The highest improvement is at &lt;strong>1200 kbps&lt;/strong>; hence, for higher bandwidths, there is no need to profile too often.&lt;/p>
&lt;h2 id="discussion">Discussion&lt;/h2>
&lt;p>There are some limitations in our current design. If we look at the box plot in figure 5 above, we can see that there are some combinations where the improvement is negative.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./images/recovery_failed.png" alt="Failed Recovery" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The figure above depicts the profiling process on segment 6, used to determine the bandwidth for segment 7. We can see that the f1-score at that bandwidth for (&lt;span style="color:blue">&lt;em>jakarta&lt;/em>&lt;/span>) drops significantly. Our current design cannot address this issue yet, since we only consider the current video segment. We need to take not only the current segment into account, but also the previous and future segments.&lt;/p>
&lt;p>Regarding the overhead, we are aware that a 50% overhead is still considered bad. We may try a dynamic &lt;code>MI&lt;/code>, or skip profiling for certain videos when it is not necessary.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Despite the aforementioned limitations, this report shows that the concierge is generally capable of delivering an f1-score improvement. Further updates will appear in the final report.&lt;/p>
&lt;h2 id="references">References&lt;/h2>
&lt;p>&lt;a id="acc">[1]&lt;/a> &lt;a href="https://drive.google.com/file/d/1U_o0IwYcBNF98cb5K_h56Nl-bQJSAtMj/view?usp=sharing" target="_blank" rel="noopener">https://drive.google.com/file/d/1U_o0IwYcBNF98cb5K_h56Nl-bQJSAtMj/view?usp=sharing&lt;/a> &lt;br>
&lt;a id="dds">[2]&lt;/a> Kuntai Du, Ahsan Pervaiz, Xin Yuan, Aakanksha Chowdhery, Qizheng Zhang, Henry Hoffmann, and Junchen Jiang. 2020. Server-driven video streaming for deep learning inference. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication. 557–570.&lt;/p></description></item><item><title>Mid-term blog post for Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230731-zjyhhhhh/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230731-zjyhhhhh/</guid><description>&lt;p>Over the past few weeks, our platform development has been progressing steadily, and we are excited to share the milestones we have achieved so far. As planned in our &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230617-zjyhhhhh">introductory blog&lt;/a>, we have successfully laid the groundwork for the platform with the guidance and support of our mentor.&lt;/p>
&lt;h2 id="milestones-and-accomplishments">Milestones and Accomplishments&lt;/h2>
&lt;p>Here are some of the key functionalities we have implemented so far:&lt;/p>
&lt;ol>
&lt;li>Modular Architecture: We successfully designed the platform with a modular architecture, separating the Graphical User Interface (GUI) and Command-Line Interface (CLI) functionalities. This modularity allows users to interact with the platform in their preferred way.&lt;/li>
&lt;li>Experiment and Bucket Creation: Users can now create experiments, buckets (for storing different implementations of experiments), and iterations using either the GUI or CLI.&lt;/li>
&lt;li>Real-time Backend Environment Monitoring: Through the command line interface, users have the capability to control the monitoring of backend environment data, allowing for real-time tracking and analysis of important metrics.&lt;/li>
&lt;li>Visualizing Environment Variables: Users can now visualize detected environment variables on the platform. Moreover, they can compare iterations within different buckets and gain more insights by observing the timeseries data, such as CPU usage, in a graphical format.&lt;/li>
&lt;/ol>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>In the early stages of designing our platform, we encountered significant challenges at the system design level. One of the most daunting obstacles we faced was devising an effective method to monitor backend environment variables. To tackle this obstacle, we engaged in extensive discussions and sought guidance from our mentor. After careful consideration, we decided to adopt a multi-process approach to monitor the backend environment variables effectively. Specifically, we devised a meticulous strategy of creating a separate process in the background for each specific metric we needed to monitor. By allocating a dedicated process to each metric, we ensured a streamlined and efficient monitoring process.&lt;/p>
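&lt;p>The one-process-per-metric idea described above can be sketched with Python&amp;rsquo;s &lt;code>multiprocessing&lt;/code> module. This is our own illustrative sketch, not the project&amp;rsquo;s actual code; the function names, the shared queue, and the sampling interval are assumptions:&lt;/p>

```python
import multiprocessing as mp
import time

def monitor(metric_name, sample_fn, interval, out_queue, stop_event, max_samples=None):
    # Runs inside its own process: sample one metric and push readings
    # onto a shared queue until asked to stop (or max_samples is reached).
    taken = 0
    while not stop_event.is_set():
        out_queue.put((metric_name, time.time(), sample_fn()))
        taken += 1
        if max_samples is not None and taken >= max_samples:
            break
        time.sleep(interval)

def start_monitors(metrics, interval=1.0):
    # metrics: dict mapping a metric name to a zero-argument sampling function.
    # One dedicated background process is created per metric.
    out_queue, stop_event = mp.Queue(), mp.Event()
    procs = [mp.Process(target=monitor, args=(name, fn, interval, out_queue, stop_event))
             for name, fn in metrics.items()]
    for p in procs:
        p.start()
    return out_queue, stop_event, procs
```

&lt;p>The main process then drains the queue to record samples, and sets the stop event to shut all monitors down cleanly.&lt;/p>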
&lt;p>Currently, we are facing a challenge related to monitoring metrics. Since different users have varying monitoring requirements, it is impractical for us to manually write monitoring solutions for each user. To address this issue, we are actively working on implementing a pluggable design that allows users to configure their own monitoring preferences.&lt;/p>
&lt;p>Our approach involves providing users with the flexibility to define their custom configuration files or write monitoring programs following our documented guidelines. This way, users can specify the specific metrics they wish to monitor and tailor the monitoring process to their individual needs.&lt;/p>
&lt;h2 id="try-it-out">Try it Out!&lt;/h2>
&lt;p>As mentioned earlier, we have completed the core functionalities of our platform, and we would love to have you try it out and provide us with valuable feedback. Here are the links to our repositories where you can explore and experiment with our platform:&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentGUI" target="_blank" rel="noopener">GUI Repository&lt;/a> and &lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentCLI" target="_blank" rel="noopener">CLI Repository&lt;/a>
&lt;ul>
&lt;li>In the README.md file of GUI repo, you will find detailed installation instructions to set up the Graphical User Interface (GUI). Follow the steps provided to get started with our platform.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/test-experiment" target="_blank" rel="noopener">Sample Repository&lt;/a>
&lt;ul>
&lt;li>In this repository, we have included scripts that allow you to run our program. Additionally, you can use these scripts as templates to monitor your own programs according to your specific requirements.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>We welcome you to take the platform for a test drive and feel free to raise any issues you encounter during the installation process. Your feedback is invaluable to us, as it helps us identify and address any potential installation challenges and improve the user experience.&lt;/p></description></item><item><title>Enhancing Drift Detection through Fine-Tuning Llama2</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/</link><pubDate>Sun, 30 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/</guid><description>&lt;p>Greetings everyone, I&amp;rsquo;m Kangrui. Over the past few weeks, we&amp;rsquo;ve dedicated our efforts and have consequently made significant progress in our drift detection methods. Now, I&amp;rsquo;m excited to present to you a detailed elaboration on how we prompted and fine-tuned Llama2 to efficiently carry out the drift detection task.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="why-llm-in-drift-detection-method">Why LLM in drift detection method?&lt;/h3>
&lt;p>The use of large language models (LLMs) in drift detection methods presents numerous benefits that place it as a prominent solution in this domain.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Rapid Development:&lt;/strong> LLMs are in the vanguard of technological advancement. This field is evolving rapidly with continuous enhancements in model architecture, training techniques, and data handling. With every new version, these models are showing an increasing capacity to understand and generate human-like text, pushing the limits of what is achievable in Natural Language Processing (NLP) and Artificial Intelligence (AI) as a whole.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Superior Performance:&lt;/strong> Traditional drift detection methodologies such as Page-Hinkley, EDDM, and HDDM have their merits and have found success in numerous scenarios. Even Deep Learning (DL) techniques, like training a predictive model based on error rates, have made significant strides in the field. However, when handling complex, high-dimensional, and real-time data, LLMs have demonstrated exceptional results. They are not only able to effectively predict and respond to drifts but also adapt to new trends more swiftly. Our experiments using LLMs like GPT-3.5-turbo have yielded impressive results, notably outperforming other methods.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="GPT-3.5-turbo Performance" srcset="
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_986c57531b096aac2ea5604c7942efed.webp 400w,
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_534b4ca0b9e767d820ed9b45d754db9f.webp 760w,
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_986c57531b096aac2ea5604c7942efed.webp"
width="760"
height="303"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;em>Fig. 1: Concept drifts detected by GPT-3.5-turbo in the Cori dataset&lt;/em>&lt;/p>
&lt;ol start="3">
&lt;li>&lt;strong>Flexibility:&lt;/strong> One of the major advantages of using LLMs is their flexibility in dealing with different types of input and output. In contrast to traditional methods, which are confined to single feature concept drift detection and can only process numerical values, LLMs can handle a range of input types including text, numbers, and more complex data structures. This capability allows them to detect multi-feature concept drifts, thereby broadening the scope and complexity of problems they can tackle. Moreover, the generation capability of LLMs can provide rich and detailed output, facilitating more comprehensive insights into the detected drifts.&lt;/li>
&lt;/ol>
&lt;h2 id="why-llama2-in-drift-detection-method">Why Llama2 in drift detection method?&lt;/h2>
&lt;p>Llama2 presents a series of advantages that make it an excellent choice for applying LLMs to drift detection. Here&amp;rsquo;s a breakdown of the key reasons:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Performance Guarantee:&lt;/strong> As a newly released model, Llama2 has undergone extensive development and testing, providing a reliable guarantee of performance. It represents the cutting edge in AI technology, having benefited from the latest research and advancements in language model design.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Accessibility Guarantee:&lt;/strong> One significant advantage of Llama2 is that it is open-source. It is readily accessible on HuggingFace, which also provides a range of mature tools to fine-tune and deploy the model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Flexibility for Fine-Tuning:&lt;/strong> Llama2 comes in different sizes, such as 7B, 13B, and 70B parameters, which allows for flexibility in model selection based on the task&amp;rsquo;s requirements and computational resources.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="data">Data&lt;/h2>
&lt;h3 id="dataset">Dataset&lt;/h3>
&lt;p>In our study, we employed &lt;a href="https://github.com/alipsgh/data-streams" target="_blank" rel="noopener">Synthetic data streams&lt;/a> for the fine-tuning of Llama2. Synthetic data streams serve as an invaluable resource for controlled experiments in the domain of drift detection. These curated datasets encompass varied types of drifts, providing us with the capability to assess the efficacy of our detection algorithms under diverse scenarios.&lt;/p>
&lt;p>Here is a brief introduction to the synthetic datasets we used:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sine1 &amp;amp; Sine2:&lt;/strong> These datasets induce abrupt concept drift within a two-dimensional feature space. The classification rule, a sine function, dictates the instance labels, which are flipped at every drift point.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mixed:&lt;/strong> This dataset, characterized by its combination of numeric and boolean features, uses a composite classification rule. The abrupt concept drift is simulated via a periodic reversal of class labels.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Stagger:&lt;/strong> This categorical dataset incorporates abrupt concept drift by periodically altering the classification rules tied to the features.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Circles &amp;amp; LED:&lt;/strong> These datasets are designed to simulate gradual concept drift. In Circles, the classification of instances is determined by their spatial relation to specific circles. LED imitates a seven-segment digit display, introducing drift by interchanging the pertinent attributes.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Typically, the synthetic datasets contain 100,000 or 1,000,000 instances. Concept drift happens every 25,000 or 33,333 instances, and each drift is either abrupt (with a drifting period of 50 instances) or gradual (with a drifting period of 500 instances).&lt;/p>
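&lt;p>This drift schedule can be expressed as a small helper. The function name, and the assumption that a drifting period begins exactly at each multiple of the drift interval, are ours:&lt;/p>

```python
def in_drifting_period(index, drift_every=25000, width=50):
    # A drift starts at instance drift_every, 2 * drift_every, ...;
    # the following `width` instances form the drifting period.
    return index >= drift_every and width > index % drift_every
```

&lt;p>With the defaults (abrupt drift every 25,000 instances), instances 25,000 through 25,049 fall in the first drifting period.&lt;/p>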
&lt;h3 id="data-preprocessing-and-metrics">Data Preprocessing and Metrics&lt;/h3>
&lt;p>Given the token limit of Llama2 and the specific requirements of our project, we needed to transform the data into an appropriate format.&lt;/p>
&lt;p>As such, we processed each data stream into three sections: the &amp;lsquo;undrifted&amp;rsquo; period, the &amp;lsquo;drifting&amp;rsquo; period, and the &amp;lsquo;drifted&amp;rsquo; period. All instances in each section were randomly and independently drawn from the original data stream, summing up to a maximum of 100 instances. The number of instances for the undrifted and drifted periods ranged from 20 to 50, and for the drifting period, it ranged from 10 to 20.&lt;/p>
&lt;p>For instance, let&amp;rsquo;s consider a dataset containing 100,000 instances where abrupt concept drift occurs every 25,000 instances. To format a data point, we could draw 20 to 50 instances from the first 25,000 as the undrifted period. Then, we could draw 10 to 20 instances from the 25,001st to 25,050th instance as the drifting period. Finally, we would draw 10 to min(100 - num(undrifted period) - num(drifting period), 50) instances from the 25,051st to 50,050th instance as the drifted period. This newly formatted data stream is then fed into Llama2.&lt;/p>
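&lt;p>The sampling procedure above can be sketched as follows. This is an illustrative reconstruction under the stated ranges; the function name and the exact endpoint of the drifted window are our assumptions:&lt;/p>

```python
import random

def make_data_point(stream, drift_at=25000, drift_width=50):
    # stream: list of (features, label) pairs with a drift starting at drift_at.
    # Draw instances for the undrifted, drifting, and drifted periods,
    # capping the formatted stream at 100 instances in total.
    n_before = random.randint(20, 50)
    n_during = random.randint(10, 20)
    n_after = random.randint(10, min(100 - n_before - n_during, 50))
    before = sorted(random.sample(range(0, drift_at), n_before))
    during = sorted(random.sample(range(drift_at, drift_at + drift_width), n_during))
    after = sorted(random.sample(range(drift_at + drift_width, 2 * drift_at + drift_width), n_after))
    indices = before + during + after
    return {
        "before_period": [0, n_before - 1],
        "transition_period": [n_before, n_before + n_during - 1],
        "after_period": [n_before + n_during, len(indices) - 1],
        "before_index": [before[0], before[-1]],
        "transition_index": [during[0], during[-1]],
        "after_index": [after[0], after[-1]],
        "data_stream": [stream[i] for i in indices],
    }
```

&lt;p>The returned dictionary mirrors the data-point layout shown below: the &lt;code>*_period&lt;/code> keys index into the formatted stream, while the &lt;code>*_index&lt;/code> keys record positions in the original stream.&lt;/p>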
&lt;p>We also included some additional information to assist Llama2&amp;rsquo;s inference process. A typical data point in our processed dataset includes:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;before_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">31&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;transition_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">32&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">38&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;after_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">39&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">59&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;before_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">196&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">19963&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;transition_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">20002&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">20030&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;after_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">20310&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">39984&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;meta&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Dataset: MIXED&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">v&amp;#39;s type is nominal, range is (&amp;#39;False&amp;#39;, &amp;#39;True&amp;#39;)&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">w&amp;#39;s type is nominal, range is (&amp;#39;False&amp;#39;, &amp;#39;True&amp;#39;)&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">x&amp;#39;s type is numeric&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">y&amp;#39;s type is numeric&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;data_stream&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="o">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>From this dictionary, the &amp;ldquo;meta&amp;rdquo; and &amp;ldquo;data_stream&amp;rdquo; entries are fed into Llama2. The &amp;ldquo;transition_period&amp;rdquo; serves as the criterion: if Llama2&amp;rsquo;s answer lies within the &amp;ldquo;transition_period&amp;rdquo;, we deem it correct.&lt;/p>
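&lt;p>This correctness criterion can be written directly; the function name is illustrative, and we assume both endpoints of the transition period are inclusive:&lt;/p>

```python
def is_correct(predicted_index, data_point):
    # The model answers with an index into the formatted stream; the answer
    # counts as correct when it lies within the labelled transition period.
    lo, hi = data_point["transition_period"]
    return predicted_index in range(lo, hi + 1)
```
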
&lt;h2 id="llama2">Llama2&lt;/h2>
&lt;h3 id="inference">Inference&lt;/h3>
&lt;p>We experimented with three variations of prompts during the inference phase.&lt;/p>
&lt;p>&lt;strong>Prompt Version 1:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">[INST] &amp;lt;&amp;lt;SYS&amp;gt;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> You are a helpful, respectful, and honest assistant. Always provide the most helpful responses possible while ensuring safety. Ensure that your responses are socially unbiased, positive, and free from harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. If a question lacks coherence or sense, explain why instead of providing incorrect information. If you are uncertain about an answer, refrain from sharing false information.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;&amp;lt;/SYS&amp;gt;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Your task is to identify the index in a given data stream where the relationship between the features and labels begins to change. The data stream is formatted as a list, with each element being a two-element list: the first represents the features (also a list), and the second is the label. If your answer is &amp;#39;x&amp;#39;, it indicates that the data pattern starts shifting at the xth data point in the stream.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Here&amp;#39;s an example of the data&amp;#39;s metadata: Dataset: SINE1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> x&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> y&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> The given data stream is: [[[0.7, 0.07], &amp;#39;p&amp;#39;], [[0.45, 0.78], &amp;#39;n&amp;#39;], ..., [[0.64, 0.45], &amp;#39;n&amp;#39;]]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Your task is to respond with a single index. No additional information is required.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[/INST]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Prompt Version 2:&lt;/strong>&lt;/p>
&lt;p>The same as Prompt 1, but with a specific range for the index response:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">Please provide an index ranging from 0 to 96. No additional information is required.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Prompt Version 3:&lt;/strong>&lt;/p>
&lt;p>This prompt uses an instruction-input-output design, which we adopted for fine-tuning:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">Below is an instruction paired with an input that provides further context. Write a response that appropriately completes the request.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Instruction:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Identify the index in a given data stream where the relationship between features and labels begins to change. The data stream is formatted as a list, each element being a two-element list: the first represents the features (also a list), and the second is the label. For instance, if the response is &amp;#39;x&amp;#39;, it means that the data pattern starts shifting at the xth data point in the stream. Only respond with an index, no further information is necessary.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Input:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Meta Data:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Dataset: SINE1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> x&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> y&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Data stream:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[[[0.7, 0.07], &amp;#39;p&amp;#39;], [[0.45, 0.78], &amp;#39;n&amp;#39;], .., [[0.64, 0.45], &amp;#39;n&amp;#39;]]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Response:
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Despite minor differences between Prompt Version 1 and Version 2, both suggested by Meta, the results varied significantly, a topic we will delve into in the following section. Prompt Version 3, employing the instruction-input-output structure, was used during our fine-tuning process.&lt;/p>
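&lt;p>Assembling Prompt Version 3 from a data point&amp;rsquo;s metadata and its formatted stream is mechanical; the sketch below follows the template above, with the function name and the use of Python&amp;rsquo;s default list formatting as our assumptions:&lt;/p>

```python
def build_prompt_v3(meta, data_stream):
    # Assemble the instruction-input-output prompt (Prompt Version 3)
    # from a data point's metadata string and its formatted stream.
    instruction = (
        "Identify the index in a given data stream where the relationship "
        "between features and labels begins to change. The data stream is "
        "formatted as a list, each element being a two-element list: the "
        "first represents the features (also a list), and the second is the "
        "label. For instance, if the response is 'x', it means that the data "
        "pattern starts shifting at the xth data point in the stream. Only "
        "respond with an index, no further information is necessary."
    )
    return (
        "Below is an instruction paired with an input that provides further "
        "context. Write a response that appropriately completes the request.\n"
        "### Instruction:\n" + instruction + "\n\n"
        "### Input:\nMeta Data:\n" + meta + "\n"
        "Data stream:\n" + str(data_stream) + "\n\n"
        "### Response:\n"
    )
```

&lt;p>The model&amp;rsquo;s completion is then read from whatever follows the final &lt;code>### Response:&lt;/code> marker.&lt;/p>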
&lt;h3 id="fine-tuning">Fine-Tuning&lt;/h3>
&lt;p>We utilized the tools provided by &lt;a href="https://github.com/facebookresearch/llama-recipes" target="_blank" rel="noopener">llama-recipes&lt;/a> to fine-tune Llama2. The key command used to initiate the fine-tuning process is illustrated below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">python llama_finetuning.py --use_peft &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --peft_method lora &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --quantization &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --model_name meta-llama/Llama-2-13b-chat-hf &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --output_dir ./fine_tuned_model/Llama-2-13b-chat-hf-test_finetune &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --dataset alpaca_dataset &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --batch_size_training &lt;span class="m">40&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --num_epochs &lt;span class="m">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Some explanation of the parameters:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">--use_peft: This flag indicates the use of the Parameter-Efficient Fine-Tuning (PEFT) method. PEFT allows us to fine-tune the model more efficiently.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--peft_method lora: Here, we specify that the Lora (Layer-wise Optimal Brain Surgeon with Relevance-based Adjustment) method should be used for PEFT.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--quantization: The quantization flag is used to reduce the memory footprint of the model during the inference stage. It does so by reducing the precision of the model&amp;#39;s weights.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--dataset alpaca_dataset: Specifies the dataset setting used for fine-tuning, in this case, the &amp;#39;alpaca_dataset&amp;#39; indicates the instruction-input-output structure for fine-tuning.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="results">Results&lt;/h2>
&lt;p>The performance of various models and prompt versions is depicted in Fig. 2.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="All Performance" srcset="
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_f4b54b1d163428a3bbdd2373c5e7d6c6.webp 400w,
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_ba09d14d8674a9735bf9bb60ce301dae.webp 760w,
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_f4b54b1d163428a3bbdd2373c5e7d6c6.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;em>Fig. 2: Performance comparison of different models and prompt versions.&lt;/em>&lt;/p>
&lt;p>It is evident from the results that the design of the prompt has a significant impact on Llama2&amp;rsquo;s performance. Furthermore, due to computational resource constraints, we have only managed to fine-tune Llama2 on a portion of our dataset (approximately 1,000 instances). The entire training set consists of 19,000 instances, and the test set includes 5,000 instances. Despite these limitations, a performance increase is noticeable after fine-tuning.&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training -- Interim Blog Post</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/</link><pubDate>Sun, 30 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h4 id="motivation">Motivation&lt;/h4>
&lt;p>The growing popularity of Deep Neural Networks has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects. Therefore, building an emulator can ameliorate the trouble of reserving GPUs, and the emulator can be modified to gather the profiles needed for optimization much quicker.&lt;/p>
&lt;h4 id="overture">Overture&lt;/h4>
&lt;p>The following sections introduce the completed tasks and the details of each. The contents are briefly summarized to present only the necessary information. We finished the following tasks:&lt;/p>
&lt;ul>
&lt;li>Literature Review&lt;/li>
&lt;li>Emulator implementation:
&lt;ul>
&lt;li>Time Profiling&lt;/li>
&lt;li>Pinned Memory&lt;/li>
&lt;li>Inter-GPU Computation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Reproducing Figures&lt;/li>
&lt;/ul>
&lt;p>I will introduce each of them and explain its importance.&lt;/p>
&lt;h2 id="tasks--reason">Tasks + Reason&lt;/h2>
&lt;h4 id="literature-review">Literature Review&lt;/h4>
&lt;p>While waiting for the measurements, I started reading other GPU-related papers, especially those about GPU schedulers. We found that besides emulating computation and transfer time, we should also emulate the GPU memory profile in order to reproduce some other papers. Fortunately, it’s doable. In fact, without actually using a GPU, we can emulate many aspects of the GPU beyond just its timing. I found several papers that are theoretically reproducible, but they use TensorFlow while my current work targets PyTorch, so I need to keep looking for ones that use PyTorch.&lt;/p>
&lt;p>Afterwards, we continued the paper review, looking over papers about GPU scheduling from 2018-2023 to see whether we could reproduce their figures. We went through 150 papers in search of ones with a PyTorch implementation and an accompanying GitHub page. We managed to find about 15 papers built in PyTorch, and 6 of them were published on GitHub.&lt;/p>
&lt;p>We found the paper &amp;ldquo;CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs&amp;rdquo; and its GitHub page. The paper has three badges: &amp;ldquo;Artifacts Available, Evaluated, and Reproduced.&amp;rdquo; Its content is implemented in PyTorch, which means we can probably emulate this paper&amp;rsquo;s results with the emulator we already have by adding more features. We started testing whether we can set up a similar environment and reproduce the experiments in the paper. After checking the paper&amp;rsquo;s reproducibility, we will try to reproduce it using our emulator, and we might add new features to the emulator during this process.&lt;/p>
&lt;p>Firstly, I tried to reproduce the figures in the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs”, but stopped after a considerable number of attempts because the README was incomplete and too hard to follow. I first headed to the paper’s GitHub. From reading the paper I understood that GNN training is not the same as regular deep learning training, because it has input irregularity, and CoGNN’s algorithm helps schedule the jobs onto the machines more effectively. However, when I tried to install the software following their environment README in order to reproduce the figures, I ran into many dependency issues, and barely any of the required packages installed successfully. Their README in the software module was also unclear on how to run the experiments, and following the experiment setup did not give me the expected results. After struggling to complete even one suggested experiment, we eventually abandoned this paper and moved on to others, which once again reminded me of the importance of reproducibility.&lt;/p>
&lt;p>Secondly, we found another paper, “Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent”. After reading it, we figured that its main focus is distributing node resources (CPU, GPU) to the jobs dispatched by the Kubernetes scheduler, so that there is less GPU fragmentation and a higher utilization rate of the resources. The paper uses a simulator to simulate a large number of nodes and run the jobs. I successfully ran the experiments demonstrated in the repo and even created a smaller sample so that we could get results faster, because their original experiment consists of 1020 runs, which would take about a month. However, when we dug deeper into the paper, we soon realized that their emulator is not a “real” one. Although it is built off Kubernetes, the component they used to create the figures is a mere simulator, and therefore it does not fit our goal of emulating only the GPU-related parts while running the other parts on a real system.&lt;/p>
&lt;h5 id="reason">Reason:&lt;/h5>
&lt;p>The purpose is to figure out which papers can be reproduced using the emulator, and what other features are needed for the emulator to work.&lt;/p>
&lt;h4 id="emulator-implementation">Emulator implementation&lt;/h4>
&lt;h5 id="time-profiling">Time Profiling&lt;/h5>
&lt;p>I did performance profiling of different GPUs, which included CPU-to-GPU data transfer time and GPU computation time. These two quantities are fairly constant on a given GPU, so they can be profiled once and then replayed during emulation. We did this for 6 different GPUs, including k80, rtx6000, m40, a100pcie, v100, and p100.&lt;/p>
&lt;p>After gathering the performance profiles of a few types of GPU nodes, I implemented the first naive version of the emulator. I used the recorded profile and the sleep() function to represent the amount of time each step needs to complete. The time also varies with the command given, so some simple arithmetic was implemented as well. The emulator runs on a CPU node, yet it reports a GPU&amp;rsquo;s time profile just as a real GPU node would.&lt;/p>
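&lt;p>A minimal sketch of this naive emulator, assuming illustrative profiled values (the real profiles are recorded per GPU and per command):&lt;/p>

```python
import time

# Hypothetical profiled timings (seconds) for one training step, keyed by
# (gpu_model, batch_size). The real emulator would load these from the
# recorded profiles rather than hard-code them.
PROFILE = {
    ("p100", 128): {"h2d_copy": 0.012, "compute": 0.045},
    ("v100", 128): {"h2d_copy": 0.010, "compute": 0.021},
}

def emulated_step(gpu_model, batch_size):
    """Sleep for the profiled transfer + compute time instead of using a GPU."""
    entry = PROFILE[(gpu_model, batch_size)]
    start = time.perf_counter()
    time.sleep(entry["h2d_copy"])   # emulate CPU-to-GPU data transfer
    time.sleep(entry["compute"])    # emulate GPU computation
    return time.perf_counter() - start

elapsed = emulated_step("p100", 128)
print(f"emulated p100 step: {elapsed:.3f}s")
```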
&lt;h5 id="reason-1">Reason:&lt;/h5>
&lt;p>The time profile collected can be compared with Data Wait Time to conduct research on minimizing pipeline stall across different GPUs and models.&lt;/p>
&lt;h5 id="pinned-memory">Pinned Memory&lt;/h5>
&lt;p>Pin-memory threads: GPU-based PyTorch uses such threads to copy data from SHM to pinned memory, but CPU-based PyTorch does not. Therefore, I need to emulate the pin-memory threads. Fortunately, the data copy time is predictable: I have already found that pin-memory time has little to do with the number of workers or the model type, but only with the batch size. I still need to find out whether it depends on the GPU node type, which I assume it does not at this point.&lt;/p>
&lt;p>While implementing the features, we first emulated the CPU-to-GPU transfer time and GPU computation time for the p100 GPU based on the profiled information. Another CUDA behavior that requires emulation is copying data from shared memory to pinned memory. To emulate it, we measured and replayed the time for copying such data. However, the emulator did not behave exactly like the real GPU, because we had only emulated the time cost of using pinned_memory, not its memory cost. To resolve this, we wrote a CPython module that manually allocates page-locked memory (which behaves the same as CUDA&amp;rsquo;s pinned_memory). Once this mechanism was implemented, the emulator&amp;rsquo;s fundamental functions were in place and properly mimicked CUDA&amp;rsquo;s behavior.&lt;/p>
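&lt;p>The idea can be sketched as follows. This stand-in uses a plain anonymous mmap buffer instead of the actual CPython module (which locks the pages), and the copy-time rate is an assumed placeholder, not a measured value:&lt;/p>

```python
import mmap
import time

def emulate_pin_memory(nbytes, copy_time_per_mib=0.0004):
    """Emulate CUDA pinned-memory staging: pay the memory cost by reserving
    a buffer of the same size, and the time cost by sleeping for the
    profiled SHM-to-pinned copy time (placeholder rate here)."""
    buf = mmap.mmap(-1, nbytes)  # anonymous buffer standing in for page-locked memory
    time.sleep(copy_time_per_mib * nbytes / 2**20)
    return buf

buf = emulate_pin_memory(4 * 2**20)  # a 4 MiB batch
print(len(buf))
```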
&lt;h5 id="reason-2">Reason:&lt;/h5>
&lt;p>After collecting the GPU profile, I compared it with the actual GPU and noticed some differences in IO time, meaning there was a gap between emulation-based PyTorch and actual GPU-based PyTorch.&lt;/p>
&lt;h5 id="inter-gpus-computation">Inter-GPUs Computation&lt;/h5>
&lt;p>We worked on emulating inter-GPU computation time in order to reproduce Figure 9 in the DNN stall paper. This is one of the influential factors in multi-GPU training, and we decided to first figure out how to implement this feature. As claimed in the paper, the larger the batch size, the less time it takes to update the model; the smaller the batch size, the larger the overheads prove to be. However, our current emulator would report the same computation time, since we had not yet added features to emulate inter-GPU behavior. The first step was to rent a lease with 2 GPUs and observe the effect of inter-GPU communication on computation time. We found a small amount of overhead when running two GPUs instead of one on the p100 node. My job was to find out where and how these overheads arise and to emulate them in order to reproduce Figure 9. We used resnet18, 4 workers, and 10 batches to run a 128 batch size with 1 GPU (Group A) and a 256 batch size with 2 GPUs (Group B). With our current emulator, both experiments would report the same time to finish one batch. However, we saw that Group B&amp;rsquo;s computation time was longer than Group A&amp;rsquo;s, meaning there was some overhead in computation time. I then dug into the PyTorch source code and successfully identified one of the factors contributing to the overhead.&lt;/p>
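&lt;p>Conceptually, the missing piece can be modeled as an extra synchronization term that is paid only when more than one GPU is used; the numbers below are illustrative assumptions, not measured values:&lt;/p>

```python
# Illustrative per-batch model of the Group A / Group B experiment above.
PER_GPU_COMPUTE = 0.045      # assumed profiled time for a 128-sample batch on one p100
INTER_GPU_OVERHEAD = 0.006   # assumed gradient-sync overhead per batch

def batch_time(per_gpu_compute, n_gpus):
    # Each GPU processes its own shard; multi-GPU runs add a sync term.
    sync = INTER_GPU_OVERHEAD if n_gpus > 1 else 0.0
    return per_gpu_compute + sync

group_a = batch_time(PER_GPU_COMPUTE, n_gpus=1)  # 128 batch on 1 GPU
group_b = batch_time(PER_GPU_COMPUTE, n_gpus=2)  # 256 batch split over 2 GPUs
print(group_a, group_b)  # Group B is slower per batch, matching the observation
```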
&lt;h5 id="reason-3">Reason:&lt;/h5>
&lt;p>To better complete the emulator so that it can provide accurate emulation even when using more than one GPU on a machine.&lt;/p>
&lt;h4 id="reproducing-figures">Reproducing Figures&lt;/h4>
&lt;p>After implementing the emulator, we managed to use it to reproduce Figures 3, 4, 5, and 6 in the paper &lt;a href="https://vldb.org/pvldb/vol14/p771-mohan.pdf">“Analyzing and Mitigating Data Stalls in DNN Training”&lt;/a> after a series of experiments and testing. Some environments in the paper were not identical to what we ran in the past week, but the general patterns matched the expected hypothesis and measurements. We double-checked all the data and figures produced, found that our prototype meets our expectations, and decided it was time to look for other papers to reproduce to make the emulator more interesting.
The original figures are shown below alongside the reproduced ones; you can see that the patterns reflect our expected results:
Original Figure 3:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure3" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_b55e965312579f5be79be0c6d21c853a.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_def01bfdab18fc7d262d08c4c2388828.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_b55e965312579f5be79be0c6d21c853a.webp"
width="710"
height="399"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 3:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure3" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_8b936fcb3ddfc9bb3592d7628c1f8641.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_7e97d9d32a7ef0dc9e6fd3817d59f028.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_8b936fcb3ddfc9bb3592d7628c1f8641.webp"
width="676"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 4:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure4" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_981ac3e3445c520fd3934870e4eddab4.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_d783b7ce2dba6c9c3d988fc836f9f0ca.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_981ac3e3445c520fd3934870e4eddab4.webp"
width="687"
height="390"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 4:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_05469d8674723f8da1bb12f8f7e3e989.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_60c661b4eb6240ac2765c6540bcaf26c.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_05469d8674723f8da1bb12f8f7e3e989.webp"
width="695"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4a" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_6b8b84d9f461863c0223d3d2992f1557.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_732e04a2227ba54fb9fb64e0dbe90717.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_6b8b84d9f461863c0223d3d2992f1557.webp"
width="687"
height="701"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 5:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure5" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_c74d4ac38263a1767beef877888c7a0e.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_01b1945b0dc6c47dee7de7ade64b50bc.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_c74d4ac38263a1767beef877888c7a0e.webp"
width="549"
height="299"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 5:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure5" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_fdba37e325a306223f6a391e9a69a4ac.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_8c9f78e75740fb2681a32c842527250b.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_fdba37e325a306223f6a391e9a69a4ac.webp"
width="719"
height="750"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 6:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure6" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_a24c7a3b53c18cb5a6f22a52536fd86f.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_5f6250df82bf4852c73c925bc3934b14.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_a24c7a3b53c18cb5a6f22a52536fd86f.webp"
width="527"
height="304"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 6:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure6" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1c46178f0a8f8ccef8efa61f5fe40809.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_97e1110894b3975f840ad24bcbc0df12.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1c46178f0a8f8ccef8efa61f5fe40809.webp"
width="760"
height="626"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h5 id="reason-4">Reason:&lt;/h5>
&lt;p>Our original goal was to reproduce papers, so reproducing figures is a solid step toward achieving that.&lt;/p>
&lt;h2 id="summary--coming-future">Summary + Coming Future&lt;/h2>
&lt;p>We will keep working to complete the emulator and figure out the exact mechanisms needed for the implementation. We will also look for additional features worth adding to the emulator.&lt;/p>
&lt;p>Hello! I&amp;rsquo;m Nick, a GSoC contributor for the Proactive Data Containers (PDC) Project.
Over the past few weeks I&amp;rsquo;ve worked on verifying the functionality of the Python API for the PDC project and ensuring the smooth onboarding for new users of the data containers.&lt;/p>
&lt;p>I began by documenting the installation of the Ubuntu virtual machine needed to run the PDC repository, since the project was not initially supported on Apple silicon hardware. The installation notes I recorded for PDC contribute toward a more refined and precise process, which can be seen updated on the GitHub page.&lt;/p>
&lt;p>After installing the project&amp;rsquo;s dependencies onto the VM, I began maintaining the existing Python API and making changes that allow the tests to compile and run successfully. The manual setup had a few problems with file directory paths that prevented some files from being installed on new devices, which I fixed manually by linking the path and removing a few header files. However, this proved to be only a temporary fix, as the prior issue was evidence of a hardcoded path, which was resolved by some digging and alteration in the source code.&lt;/p>
&lt;p>Now the PDC and PDCpy installations should go smoothly regardless of what OS is being used, and the instruction documentation can be found from the github page which should allow any user to access the data containers.&lt;/p></description></item><item><title>Building extensions between Python libraries for Biotechnology laboratories</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/</link><pubDate>Fri, 28 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/</guid><description>&lt;p>Hello again! This is Luiza, a GSoC contributor for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop">LabOp&lt;/a> Project.
My task is to build bridges between programming languages for Biotechnology Laboratory automation.&lt;/p>
&lt;p>When talking about life sciences, reproducibility is an issue for most research centers. Biotechnology-focused laboratories usually have their own protocols, developed in house for their own applications. Researchers rely on such protocols to perform their experiments and collect data, but when it comes to sharing those protocols and performing them in different laboratories, many difficulties arise. Whether it is from a lack of equipment or reagents, or even from different orders of execution, replicating a protocol in another laboratory is a challenge. To address this issue, LabOp was developed to represent a protocol and convert it into as many forms as possible, so that it can be executed by humans and by machines.&lt;/p>
&lt;p>PylabRobot and PyHamilton also come into the picture, as these libraries make it possible to write protocols for Hamilton robots (and, in PylabRobot’s case, Tecan machines as well). However, both share the limitation of representing laboratory protocols only at a low level, with the user having to write every single command in Python for the protocol to be executed. Thus I am currently developing an extension for converting LabOp protocols into PylabRobot/PyHamilton scripts. This way the researcher writing the protocol can do so in a friendlier fashion, using human-friendly terms to write protocols for robot execution.&lt;/p>
&lt;figure id="figure-behaviourspecialization-for-liquid-handling-class">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="BehaviourSpecialization for Liquid Handling class" srcset="
/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_7dea1005b9355831aab4fd48906afaec.webp 400w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_67bd573e81d4a87cd9d10cf5cb216d81.webp 760w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/featured_hu4f9fc1ff392d0f6236dd97921cc62ee1_67178_7dea1005b9355831aab4fd48906afaec.webp"
width="760"
height="436"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption data-pre="Figure&amp;nbsp;" data-post=":&amp;nbsp;" class="numbered">
BehaviourSpecialization for Liquid Handling class
&lt;/figcaption>&lt;/figure>
&lt;p>The first step is building a correspondence spreadsheet with a hello-world protocol written in both languages (LabOp | PylabRobot). This way we can establish an equivalence between the functions, parameters, and default commands of both libraries, as well as their structure. This spreadsheet will serve as guidance for converting the liquid-handling steps from their representation in LabOp to their representation in PylabRobot.&lt;/p>
&lt;p>The second step is to create a file that will execute the conversion. In this file I will define a labware map, essentially a dictionary translating LabOp resource names into labware IDs recognizable by PylabRobot&amp;rsquo;s &amp;ldquo;resource&amp;rdquo; classes, and a BehaviourSpecialization class that converts LabOp actions into operations of PylabRobot&amp;rsquo;s Liquid Handler class, which coordinates the commands sent from the script to the machines (see featured images).&lt;/p>
&lt;figure id="figure-dictionary-for-labop-to-pylabrobot-container-correspondence">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Dictionary for LabOp to Pylabrobot container correspondence" srcset="
/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_76e3dd3c112ca74ef8e3b7459123e154.webp 400w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_8337c1f75572828ec38252d4fdee0f96.webp 760w,
/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230804-luhesketh/featured_2_hu5b9f0fb3cb1cb61c40db218a2048e04a_278481_76e3dd3c112ca74ef8e3b7459123e154.webp"
width="760"
height="465"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption data-pre="Figure&amp;nbsp;" data-post=":&amp;nbsp;" class="numbered">
Dictionary for LabOp to Pylabrobot container correspondence
&lt;/figcaption>&lt;/figure>
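&lt;p>A minimal sketch of these two pieces; every name below (labware IDs, method names) is an illustrative placeholder, not the actual LabOp or PylabRobot API:&lt;/p>

```python
# Hypothetical labware map: LabOp resource names to PylabRobot-style labware IDs.
LABWARE_MAP = {
    "96 well plate": "Cos_96_DW_1mL",   # placeholder ID
    "trough": "Trough_CAR_4R200",       # placeholder ID
}

class LiquidHandlingSpecialization:
    """Converts LabOp-style actions into liquid-handler operations."""

    def __init__(self):
        self.commands = []

    def transfer(self, source, destination, volume_ul):
        src = LABWARE_MAP[source]
        dst = LABWARE_MAP[destination]
        # The real extension would call PylabRobot's Liquid Handler here;
        # this sketch just records the converted commands.
        self.commands.append(("aspirate", src, volume_ul))
        self.commands.append(("dispense", dst, volume_ul))

spec = LiquidHandlingSpecialization()
spec.transfer("trough", "96 well plate", 100)
print(spec.commands)
```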
&lt;p>Then we move to the protocol that will be tested on the Hamilton machines: a plasmid purification protocol that is usually performed by a human at a very low level, one sample at a time. This limitation is not present on Hamilton robots, as they can handle many samples at the same time within a single protocol execution. The robot that will run this protocol has two modules that are not yet present in PylabRobot’s extensions: a pressure pump module and an on-deck heater-shaker. I will be implementing these modules in PylabRobot based on their default commands in PyHamilton and running the protocol on a Hamilton Starlet unit.&lt;/p>
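&lt;p>One way such a module could be structured is as a thin wrapper that turns method calls into the raw commands a backend already exposes; the class name and command strings below are invented placeholders, not real PyHamilton or PylabRobot identifiers:&lt;/p>

```python
# Hypothetical sketch of a new hardware-module wrapper.
class PressurePumpModule:
    def __init__(self, send_command):
        # send_command: callable that ships a raw command string to the robot
        self._send = send_command

    def apply_pressure(self, seconds):
        # Build and send an invented placeholder command string.
        return self._send(f"PUMP:ON:{seconds}")

sent = []
pump = PressurePumpModule(lambda cmd: sent.append(cmd) or cmd)
pump.apply_pressure(30)
print(sent)
```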
&lt;p>The steps of the protocol have been decoupled to facilitate the pilot testing, they are as follows:&lt;/p>
&lt;ul>
&lt;li>Liquid handling - GOOD TO GO&lt;/li>
&lt;li>Pressure pump module - requires adjustments&lt;/li>
&lt;li>Plate grippers (necessary to move the plasmid plate from one module to another) - require adjustments&lt;/li>
&lt;li>On-deck heater-shaker - GOOD TO GO&lt;/li>
&lt;/ul>
&lt;p>The first pilot tests of the protocol will be run with water instead of plasmid to verify that all the steps go smoothly. Once that is out of the way, we will perform the protocol with dirty plasmids that require purification (which is what the protocol is for). The measures of success will be sequencing the plasmid (if possible), performing gel electrophoresis, and measuring the absorbance of the DNA.&lt;/p>
&lt;p>The goal of these tests is to gather data on the effectiveness of the protocol and its execution on the machine, thus confirming that it is in fact a useful mechanism for DNA purification.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Fixed GitHub CI Workflows and Release to PyPI:&lt;/strong>
During the first phase, I focused on refining the GitHub CI workflows by implementing new flows that facilitate seamless releases to PyPI. This ensures that the project can be easily distributed and installed by users, making it more accessible and user-friendly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Encapsulation from Jupyter into Module:&lt;/strong>
I successfully encapsulated the code from Jupyter notebooks into a module. This step is crucial as it prepares the codebase to be released as a standalone module, making it easier for developers to use and integrate into their own projects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SonarCloud Integration for Better Code Analysis:&lt;/strong>
To ensure the codebase&amp;rsquo;s quality, I set up SonarCloud to perform comprehensive code analysis. This helps in identifying potential issues, bugs, and areas of improvement, leading to a more robust and reliable project.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Migration to Docker from Tox:&lt;/strong>
In order to improve the containerization process, I replaced the existing solution, Tox, with Docker. Docker provides better container management and ensures a consistent development and deployment environment across different platforms.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Research on Community Platforms for Self-Hosting:&lt;/strong>
I conducted extensive research on various community platforms suitable for self-hosting. This will enable the project to establish a thriving community and foster active collaboration among users and contributors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Enhanced Security Measures:&lt;/strong>
I implemented several security improvements to safeguard the project and its users. These include setting up a comprehensive security policy, implementing secret scanning to prevent unintentional exposure of sensitive information, code scanning to identify potential vulnerabilities, private vulnerability reporting to handle security issues responsibly, and Dependabot integration for monitoring and managing dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Upgraded Taichi to Utilize Class-Based Features:&lt;/strong>
As part of the project&amp;rsquo;s development, I upgraded the codebase to utilize Taichi&amp;rsquo;s available class-based features, thereby enhancing its organization and maintainability.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Moving forward, I plan to continue working diligently to achieve the goals outlined in my proposal. The improvements made during the first half of the GSoC program have laid a strong foundation for the project&amp;rsquo;s growth and success.&lt;/p>
&lt;p>Stay tuned for further updates and exciting developments as the project progresses!&lt;/p></description></item><item><title>Uncovering Actionable Insights using ReadTheDocs Analytics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/</link><pubDate>Thu, 27 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello again! This is Jack, a GSoC contributor for the OpenROAD Project.
My task is to update and optimise the documentation to encourage user
adoption and engagement.&lt;/p>
&lt;p>For open-source repo maintainers, &lt;a href="https://readthedocs.org/" target="_blank" rel="noopener">readthedocs&lt;/a>
is a godsend. One of its more underrated features is providing
search and traffic analytics for up to &lt;strong>90 days&lt;/strong> for &lt;code>Community&lt;/code> tier
users. This is awesome, because ReadTheDocs is &amp;ldquo;always free for open source
and community projects&amp;rdquo;.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Why are analytics important?&lt;/p>
&lt;p>Analytics are a useful &lt;em>proxy&lt;/em> for documentation engagement.
For instance, heavy traffic to a page could mean the tool is popular,
or it could mean the tool is unclear and readers need repeated visits
to understand its usage. Either way, increased visits signal that the
page deserves attention.&lt;/p>
&lt;p>In what follows we aim to provide a quick tutorial as well as
list out some of the actionable insights we uncovered in the
OpenROAD/OpenROAD-flow-scripts documentation project.&lt;/p>
&lt;h2 id="preamble">Preamble&lt;/h2>
&lt;p>To download the analytics raw &lt;code>csv&lt;/code> files, refer to this
&lt;a href="https://docs.readthedocs.io/en/stable/analytics.html" target="_blank" rel="noopener">website&lt;/a>.&lt;/p>
&lt;p>You should also have the following packages installed: &lt;code>pandas&lt;/code>, &lt;code>numpy&lt;/code>, &lt;code>matplotlib&lt;/code>, &lt;code>scipy&lt;/code>.&lt;/p>
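The snippets in this post use the conventional aliases pd, np, plt, and stats without repeating the imports; a minimal setup sketch is:

```python
# Conventional aliases assumed by the analysis snippets below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
```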
&lt;h2 id="traffic-analytics">Traffic Analytics&lt;/h2>
&lt;p>Traffic analytics are easy to understand.
The data comes in the format &lt;code>Date&lt;/code>, &lt;code>Version&lt;/code>, &lt;code>Path&lt;/code>, &lt;code>DailyViews&lt;/code>, as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read_csv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;ta_or.csv&amp;#39;&lt;/span>&lt;span class="p">)[::&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reset_index&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">apply&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">head&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-1-loading-traffic-analytics-dataframe">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Load traffic analytics DF" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_0fe44b789026339d8a488b67e455af49.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_c34649440686784f502a8fa245519fe8.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic1_hu3ae39bf6bc653845cdf52f284f9914c8_18120_0fe44b789026339d8a488b67e455af49.webp"
width="420"
height="345"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 1: Loading traffic analytics dataframe
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>The raw data is not all that informative.
Let us aggregate the data to obtain the weekly views.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">copy&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to_datetime&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">-&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">to_timedelta&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">7&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">unit&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;d&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">([&lt;/span>&lt;span class="s1">&amp;#39;Path&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Grouper&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">key&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">freq&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;W&amp;#39;&lt;/span>&lt;span class="p">)])[&lt;/span>&lt;span class="s1">&amp;#39;Views&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">()&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">reset_index&lt;/span>&lt;span class="p">()&lt;/span>\
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="o">.&lt;/span>&lt;span class="n">sort_values&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Path&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="s1">&amp;#39;/index.html&amp;#39;&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-2-aggregated-weekly-traffic">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Aggregated weekly traffic" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_2356d16291dbea694b0bc9c05693ffe8.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_cf13de62f49742cd0e76c661feea93ed.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic2_hu4e5090e8319a278be3c23daaec31a810_14831_2356d16291dbea694b0bc9c05693ffe8.webp"
width="243"
height="393"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 2: Aggregated weekly traffic
&lt;/figcaption>&lt;/figure>
&lt;/p>
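To see how the pd.Grouper weekly bucketing behaves, here is a self-contained sketch on synthetic traffic data (the dates, paths, and view counts are made up for illustration):

```python
import pandas as pd

# Synthetic traffic export: daily views for two paths across two weeks.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2023-07-01', '2023-07-02',
                            '2023-07-08', '2023-07-09'] * 2),
    'Path': ['/index.html'] * 4 + ['/main/README.html'] * 4,
    'Views': [5, 3, 7, 2, 1, 1, 2, 2],
})

# Bucket each path's views into calendar weeks ending on Sunday ('W').
weekly = (df.groupby(['Path', pd.Grouper(key='Date', freq='W')])['Views']
            .sum()
            .reset_index()
            .sort_values('Date'))
print(weekly[weekly.Path == '/index.html'])
```

Summing per path-and-week pair is the same aggregation used to produce Figure 2.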
&lt;p>Note that we can substitute any page path of interest.
To list all page paths present in this dataset, use:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">weeklydf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Path&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">unique&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-3-unique-paths-in-dataset">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Unique paths" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_94dd6b47fa834b3c36ea619deffd3a6a.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_f50b03560ab266073e2dee2fa7a04e51.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic3_hub46945e77dd8a933670e33e4fea7dea8_54129_94dd6b47fa834b3c36ea619deffd3a6a.webp"
width="591"
height="538"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 3: Unique paths in dataset
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>With this data in hand, let us do some plotting!
For the visualisation, we use traffic aggregated on a daily scale.
On top of this, we also plot a linear best-fit line through all the
points to track the trend over time.&lt;/p>
&lt;p>The code below shows how to plot the top 20 pages.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">plot_views&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">numPages&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">20&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Groupby Path, sum views&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">pathResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Path&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sort_values&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ascending&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">ax&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">subplots&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">numPages&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">figsize&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">30&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tight_layout&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">for&lt;/span> &lt;span class="n">i&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">numPages&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">key&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pathResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">temp&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Path&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="n">key&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_xticks&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">arange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">7&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># this line is to not clutter the x-axis too much.&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Views&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">set_title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">key&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># linear regression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">temp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bestfit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">stats&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linregress&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">equation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;x + &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">poly1d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">polyfit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">))),&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">equation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">ax&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">i&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;upper right&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-4-top-20-pages-by-daily-view-counts-in-descending-order">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Top 20 plots" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_208fbbf3fe9f3d6b7b48a8f44d65e70b.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_523ed86a22800eb3addad7738facd6cc.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic4_hu0544053b26ff363bea669ad03cb25a33_298754_208fbbf3fe9f3d6b7b48a8f44d65e70b.webp"
width="379"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 4: Top 20 pages by daily view counts (in descending order)
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Also, we can aggregate the total views by day to plot daily traffic:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">plot_daily_traffic&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># Groupby Date, sum views&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">fig&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">figure&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">figsize&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="mi">15&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">10&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dateResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Views&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sum&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">values&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">arange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">7&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Views&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Traffic by Day&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># linear regression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bestfit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">stats&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linregress&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">equation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;x + &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">poly1d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">polyfit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">))),&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">equation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;upper right&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-5-daily-aggregated-traffic">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Daily aggregated traffic" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_f1cfe4f85a6f52b10851153e3759601f.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_be83d71fe2635b895829f733ef678a4f.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic5_hu284f436d507b391ad27b39b31846aa7d_24195_f1cfe4f85a6f52b10851153e3759601f.webp"
width="760"
height="503"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 5: Daily aggregated traffic
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;h3 id="key-trends">Key Trends:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Notice the weekly cyclical pattern: average view counts rise from
Monday to Friday, then fall off on weekends. This is most evident for
the pages &lt;code>/index.html&lt;/code> and &lt;code>/main/README.html&lt;/code>,
and likely reflects the standard Monday-to-Friday work or study week.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>According to the gradient of the best-fit line in Figure 2,
traffic to the OpenROAD docs is in slow decline. A gradient of -0.77
translates to a decline of roughly 22 views per month. The apparent
decline is partly attributable to the higher traffic from 19-29
March 2023, the dates of the
&lt;a href="https://openroaddesigncontest.org/" target="_blank" rel="noopener">OpenROAD 7nm design contest&lt;/a>.
Contests are always good for driving traffic.&lt;/p>
&lt;/li>
&lt;/ul>
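To make the arithmetic behind that estimate explicit (a sketch: the slope is read off the Figure 2 best-fit line, and a four-week month is assumed):

```python
# Best-fit gradient from Figure 2, in views per day.
daily_slope = -0.77

# Approximate the monthly change assuming a four-week (28-day) month.
monthly_change = daily_slope * 28
print(round(monthly_change))  # about -22 views per month
```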
&lt;h3 id="actionable-insights">Actionable insights:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>Top pages are usually landing pages: &lt;code>index.html&lt;/code>, &lt;code>main/README.html&lt;/code>, &lt;code>main/src/README.html&lt;/code>. We thus prioritised making these pages more readable and concise.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>This is followed by tutorial &lt;code>/tutorials/index.html&lt;/code> and &lt;code>/search.html&lt;/code>. The prominence of the tutorials page made us shift the tutorials link to a higher position on the left navigation sidebar. Search tips were also included to obtain better search results. More about search in the next section.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Next, since OpenROAD consists of 20 tools, traffic analytics helps us decide the order in which to update their documentation: &lt;code>ifp&lt;/code>, &lt;code>gui&lt;/code>, &lt;code>odb&lt;/code>, &lt;code>ppl&lt;/code>, &lt;code>sta&lt;/code>, &lt;code>grt&lt;/code>, &lt;code>mpl&lt;/code>, &lt;code>gpl&lt;/code>, &lt;code>rsz&lt;/code>, &lt;code>rcx&lt;/code>, &lt;code>pdn&lt;/code>, &lt;code>cts&lt;/code>, &lt;code>psm&lt;/code>.&lt;/p>
&lt;/li>
&lt;/ul>
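&lt;p>The per-tool update order can be derived from the same traffic data with a short aggregation. This is a sketch on made-up rows; the &lt;code>Path&lt;/code>/&lt;code>Views&lt;/code> columns and the &lt;code>/main/src/&amp;lt;tool&amp;gt;/README.html&lt;/code> path layout are assumptions.&lt;/p>

```python
import pandas as pd

# Hypothetical traffic rows for per-tool README pages.
df = pd.DataFrame({
    "Path": ["/main/src/ifp/README.html", "/main/src/gui/README.html",
             "/main/src/odb/README.html", "/main/src/ifp/README.html"],
    "Views": [10, 7, 5, 12],
})

# Extract the tool name from the path and rank tools by total views.
df["Tool"] = df.Path.str.split("/").str[3]
order = df.groupby("Tool").Views.sum().sort_values(ascending=False)
print(order.index.tolist())  # → ['ifp', 'gui', 'odb']
```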
&lt;h2 id="search-analytics">Search Analytics&lt;/h2>
&lt;p>Search analytics come in the form of: &lt;code>Date&lt;/code>, &lt;code>Query&lt;/code>, &lt;code>TotalResults&lt;/code>.
Unlike traffic analytics, &lt;code>TotalResults&lt;/code> does not refer to the search count
for the query that day; rather, it is the total number of results
returned by that query on that day. Separate aggregation is still needed
to obtain the final count.&lt;/p>
&lt;p>Firstly, let us load the dataset and perform a groupby on the column &lt;code>Date&lt;/code>
to obtain the daily count aggregates.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pd&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">read_csv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;sa_or.csv&amp;#39;&lt;/span>&lt;span class="p">)[::&lt;/span>&lt;span class="o">-&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">]&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">reset_index&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">drop&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">True&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">rename&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">columns&lt;/span> &lt;span class="o">=&lt;/span>&lt;span class="p">{&lt;/span>&lt;span class="s1">&amp;#39;Created Date&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;Total Results&amp;#39;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s1">&amp;#39;TotalResults&amp;#39;&lt;/span>&lt;span class="p">})&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">apply&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">lambda&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">x&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">split&lt;/span>&lt;span class="p">()[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">dateResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TotalResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">dateResults&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-6-code-output-for-daily-aggregated-search-counts">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Daily count code" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_303764681c719b59422e8ac4adff87d5.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_ae0b89dd9a05f1d083e0a5caf434a1c6.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic6_hufe284b2003be927a09036a17b0f147ed_7438_303764681c719b59422e8ac4adff87d5.webp"
width="390"
height="231"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 6: Code output for daily aggregated search counts.
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>Now we are ready to plot the daily aggregated searches. This represents
the number of times a search was performed on the documentation website.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">plot_daily_searches&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dateResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Date&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TotalResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">dateResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">values&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">scatter&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">x&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">xticks&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">arange&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="mi">90&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">7&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">ylabel&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;# Times Searched&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">title&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Search count by day&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># linear regression&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">bestfit&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">stats&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">linregress&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">equation&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="s2">&amp;#34;x + &amp;#34;&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="nb">str&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">round&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">bestfit&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>&lt;span class="mi">2&lt;/span>&lt;span class="p">))&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">plot&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">poly1d&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">polyfit&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="n">y&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))(&lt;/span>&lt;span class="nb">range&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="nb">len&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">y&lt;/span>&lt;span class="p">))),&lt;/span> &lt;span class="s1">&amp;#39;--&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span>&lt;span class="n">label&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">equation&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">plt&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">legend&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">loc&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s1">&amp;#39;upper right&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>
&lt;figure id="figure-figure-7-daily-aggregated-search-counts">
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Final search analytics graph" srcset="
/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_dfcee10fa9be516c148eb11ac3598591.webp 400w,
/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_2bfda1034e5a343c34c529e62f8279ba.webp 760w,
/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230727-luarss/pic7_huaf6e8114aa38a4f77afbcf6239df4596_24960_dfcee10fa9be516c148eb11ac3598591.webp"
width="760"
height="507"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;figcaption>
Figure 7: Daily aggregated search counts
&lt;/figcaption>&lt;/figure>
&lt;/p>
&lt;p>We can also make an additional plot of queries that return zero results.
In other words, these are the terms people are curious about
but that our documentation does not currently cover.
Think of it as on-site search engine optimisation.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">zeroResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">df&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">df&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">TotalResults&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">]&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">zeroResults&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">zeroResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">groupby&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;Query&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Date&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">count&lt;/span>&lt;span class="p">()&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">sort_values&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">ascending&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">False&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s1">All 0 results queries (desc)&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s1">&amp;#39;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">print&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">zeroResults&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">index&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">tolist&lt;/span>&lt;span class="p">())&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Example output as follows:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[&amp;#39;autotuner&amp;#39;, &amp;#39;tdms&amp;#39;, &amp;#39;*macro*&amp;#39;, &amp;#39;rtlmp_max_inst&amp;#39;, &amp;#39;get_property&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;check_setup&amp;#39;, &amp;#39;centos&amp;#39;, &amp;#39;initialize_padring&amp;#39;, &amp;#39;core_utilization&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;pin_access&amp;#39;, &amp;#39;read_libraries&amp;#39;, &amp;#39;config&amp;#39;, &amp;#39;eco&amp;#39;, &amp;#39;rpt&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;improve_placement&amp;#39;, &amp;#39;define_process_corner&amp;#39;, &amp;#39;global_place&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;report_worst_slack&amp;#39;, &amp;#39;max_phi_cof&amp;#39;, &amp;#39;report_power&amp;#39;, &amp;#39;get_pins&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;registerfile&amp;#39;, &amp;#39;set_global_routing&amp;#39;, &amp;#39;prebuilt&amp;#39;, &amp;#39;env&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;repair_clock_inverters&amp;#39;, &amp;#39;set_thread_count&amp;#39;, &amp;#39;report_&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;partition_design&amp;#39;, &amp;#39;place_cell&amp;#39;, &amp;#39;blockage&amp;#39;, &amp;#39;partitionmgr&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;nmos&amp;#39;, &amp;#39;tuner&amp;#39;, &amp;#39;write_sdf&amp;#39;, &amp;#39;place_density&amp;#39;, &amp;#39;place_pins_args&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;size_cell&amp;#39;, &amp;#39;*macor*&amp;#39;, &amp;#39;repair_clock_inverter&amp;#39;, &amp;#39;misk&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;readhaty&amp;#39;, &amp;#39;readhat&amp;#39;, &amp;#39;obstruct&amp;#39;, &amp;#39;odbpy&amp;#39;, &amp;#39;openpdn&amp;#39;, &amp;#39;openram&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;placement_cfg&amp;#39;, &amp;#39;read_macro_placement&amp;#39;, &amp;#39;output_drc&amp;#39;, &amp;#39;positon&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;pct&amp;#39;, &amp;#39;qrctechtable&amp;#39;, &amp;#39;qrctechfile&amp;#39;, &amp;#39;qrctech&amp;#39;, &amp;#39;qrc&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;properly covered&amp;#39;, &amp;#39;precision innovations&amp;#39;, &amp;#39;repeater&amp;#39;, &amp;#39;&amp;#34;rcx-0487&amp;#34;&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;report_worst&amp;#39;, &amp;#39;report_area&amp;#39;, &amp;#39;report_clock_properties&amp;#39;, &amp;#39;skywater&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;study&amp;#39;, &amp;#39;sv&amp;#39;, &amp;#39;synth&amp;#39;, &amp;#39;synth_hierarchical&amp;#39;, &amp;#39;systemverilog&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;tdm&amp;#39;, &amp;#39;tdms_place&amp;#39;, &amp;#39;triton&amp;#39;, &amp;#39;ungroup&amp;#39;, &amp;#39;verilog_files&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;wrc&amp;#39;, &amp;#39;write_lef&amp;#39;, &amp;#39;write_partition_verilog&amp;#39;, &amp;#39;שואם&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;si2&amp;#39;, &amp;#39;sever&amp;#39;, &amp;#39;setrc&amp;#39;, &amp;#39;rtl_macro&amp;#39;, &amp;#39;report_dcalc&amp;#39;, &amp;#39;report_design&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;report_design_info&amp;#39;, &amp;#39;report_instance&amp;#39;, &amp;#39;report_slews&amp;#39;, &amp;#39;resize&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;rtlmp&amp;#39;, &amp;#39;set_power_activity&amp;#39;, &amp;#39;rtree&amp;#39;, &amp;#39;run_all&amp;#39;, &amp;#39;run_all.tcl&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;sc&amp;#39;, &amp;#39;set_all_input_output_delays&amp;#39;, &amp;#39;set_io_pin_constraints&amp;#39;, &amp;#39;metis&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;lefdef&amp;#39;, &amp;#39;make_result_file&amp;#39;, &amp;#39;macro_placement_cfg&amp;#39;, &amp;#39;clock__details&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;clocks__details&amp;#39;, &amp;#39;combinational&amp;#39;, &amp;#39;config.mk&amp;#39;, &amp;#39;coord&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;core_margin&amp;#39;, &amp;#39;db_process_node&amp;#39;, &amp;#39;dbblocjs&amp;#39;, &amp;#39;dbdatabase&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;dbr&amp;#39;, &amp;#39;dbrt&amp;#39;, &amp;#39;dbrttree&amp;#39;, &amp;#39;debian&amp;#39;, &amp;#39;define_pin_shape&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;densiy&amp;#39;, &amp;#39;desgin&amp;#39;, &amp;#39;diff_file&amp;#39;, &amp;#39;clk_period&amp;#39;, &amp;#39;clk_io_ptc&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;cdl&amp;#39;, &amp;#39;analog&amp;#39;, &amp;#39;./env.sh&amp;#39;, &amp;#39;178&amp;#39;, &amp;#39;6_final&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;6_final.odb&amp;#39;, &amp;#39;_placement&amp;#39;, &amp;#39;abat&amp;#39;, &amp;#39;add_stripe&amp;#39;, &amp;#39;arch&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;ccs&amp;#39;, &amp;#39;binaries&amp;#39;, &amp;#39;bookshelf&amp;#39;, &amp;#39;buff_cell&amp;#39;, &amp;#39;buildwithdocker&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;busbitchars&amp;#39;, &amp;#39;buschar&amp;#39;, &amp;#39;captable&amp;#39;, &amp;#39;directoryobject&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;disallow_one_site_gaps&amp;#39;, &amp;#39;distribute&amp;#39;, &amp;#39;is_port&amp;#39;, &amp;#39;hierarch&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;hop&amp;#39;, &amp;#39;hyper&amp;#39;, &amp;#39;initialie_flooorplan&amp;#39;, &amp;#39;initialize_flooorplan&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;instance_count&amp;#39;, &amp;#39;is_chip&amp;#39;, &amp;#39;lean&amp;#39;, &amp;#39;gui_final&amp;#39;, &amp;#39;lec&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;*def*&amp;#39;, &amp;#39;limitation&amp;#39;, &amp;#39;lyp&amp;#39;, &amp;#39;maco&amp;#39;, &amp;#39;macro_pin&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;macro_place&amp;#39;, &amp;#39;harness&amp;#39;, &amp;#39;gui.py&amp;#39;, &amp;#39;dont&amp;#39;, &amp;#39;fill_cell&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;dreamplace&amp;#39;, &amp;#39;em&amp;#39;, &amp;#39;enable_dpo&amp;#39;, &amp;#39;energy&amp;#39;, &amp;#39;env.sh&amp;#39;, &amp;#39;erc&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;export&amp;#39;, &amp;#39;findmaste&amp;#39;, &amp;#39;grt_layer_adjustments&amp;#39;, &amp;#39;findmaster&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;freepdk45&amp;#39;, &amp;#39;gdt&amp;#39;, &amp;#39;global_&amp;#39;, &amp;#39;global_place_db&amp;#39;,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;global_placementy&amp;#39;, &amp;#39;graph&amp;#39;, &amp;#39;갲&amp;#39;]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In our case, the problem behind each of these zero-result queries
roughly falls into one of the following categories:&lt;/p>
&lt;ul>
&lt;li>Missing documentation: either the parameter or the functionality is not yet documented.&lt;/li>
&lt;li>Typo: User has the right keyword, but did not type it correctly. We will therefore provide them with search &lt;a href="https://openroad-flow-scripts.readthedocs.io/en/latest/user/FAQS.html#how-do-i-get-better-search-results" target="_blank" rel="noopener">tips&lt;/a> such as using fuzziness &lt;code>~N&lt;/code> operator for better matches.&lt;/li>
&lt;/ul>
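&lt;p>Likely typos in the zero-result list can be surfaced automatically by fuzzy-matching each query against the documented command names. The sketch below uses Python&amp;rsquo;s &lt;code>difflib&lt;/code>; the command list is an illustrative subset, not the full OpenROAD command set.&lt;/p>

```python
import difflib

# Illustrative subset of documented commands.
commands = ["global_placement", "initialize_floorplan", "report_design_area"]

# Zero-result queries from the search log; some are near-misses.
zero_queries = ["global_placementy", "initialize_flooorplan", "autotuner"]

# Map each query to its closest documented command, if similar enough.
suggestions = {
    q: difflib.get_close_matches(q, commands, n=1, cutoff=0.8)
    for q in zero_queries
}
print(suggestions)
```

&lt;p>Queries that match nothing (like &lt;code>autotuner&lt;/code> here) are candidates for genuinely missing documentation.&lt;/p>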
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>ReadTheDocs could also be linked with
&lt;a href="https://analytics.google.com/analytics/web/provision/#/provision" target="_blank" rel="noopener">Google Analytics&lt;/a>,
but this is left as an option for more advanced users.&lt;/p>
&lt;p>Another rich source of information for open-source maintainers
is GitHub issues, the platform where users discuss
their problems directly. Another great way to track documentation engagement
is to use metrics such as installation issues per week,
or user-issue retention rate, which tracks the number of users
who continue to file issues after their first.&lt;/p>
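&lt;p>Both metrics can be prototyped in a few lines once issues are exported to a table. This is a sketch on made-up data; real rows would come from the GitHub API, and the column names are assumptions.&lt;/p>

```python
import pandas as pd

# Hypothetical exported issues: creation date and author.
issues = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2023-07-03", "2023-07-05", "2023-07-12", "2023-07-13", "2023-07-20",
    ]),
    "author": ["alice", "bob", "alice", "carol", "alice"],
})

# Issues filed per ISO week.
per_week = issues.groupby(issues.created_at.dt.isocalendar().week).size()
print(per_week)

# User-issue retention: share of authors who file more than one issue.
counts = issues.author.value_counts()
retention = (counts > 1).mean()
print(retention)
```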
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This post showcases the amount of insight one can gather from parsing
traffic and search analytics. It also provides useful Python functions
that can be applied to the analytics dataset for fast prototyping
and experimentation. If you are a contributor to open-source projects,
try uncovering some insights for your doc pages today!&lt;/p></description></item><item><title>Halfway Through GSOC: My Experience and Learnings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edumle/20230718-kokoedwin/</link><pubDate>Mon, 17 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edumle/20230718-kokoedwin/</guid><description>&lt;p>Hello there! I&amp;rsquo;m Jonathan Edwin, all the way from the beautiful archipelago of Indonesia. This year, I got the exciting chance to jump on board the 2023 Summer of Reproducibility initiative. It&amp;rsquo;s been quite the adventure! Right now, I&amp;rsquo;m pouring my energy into a fascinating project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project. I&amp;rsquo;m thrilled to be able to make my own little mark on it.&lt;/p>
&lt;p>For those of you who are not familiar with what I&amp;rsquo;m working on, let me shed some light. My project, as part of the &amp;ldquo;Using Reproducibility in Machine Learning Education&amp;rdquo; initiative under guidance of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>, focuses on creating educational resources that center around reproducing some key machine learning techniques. These include Cutout data augmentation, U-Net, and Siamese networks, to name a few. The end product will be a series of interactive Jupyter notebooks that provide step-by-step guidance for students, helping them not only understand these complex models but also gain hands-on experience in achieving research reproducibility.&lt;/p>
&lt;p>&lt;strong>Progress and Challenges&lt;/strong>&lt;/p>
&lt;p>Embarking on this project, I dove headfirst into the world of Cutout data augmentation, immersing myself in the many experiments outlined in the foundational paper. This initial study proved to be an intricate blend of multiple datasets, two network architectures, and a performance evaluation of models with and without Cutout data augmentation. Additionally, it included the exploration of these models in combination with other data augmentation techniques.&lt;/p>
&lt;p>One of our main objectives has been to help students visualize how the model interacts with the data, and for this, we&amp;rsquo;ve been leveraging a tool called Grad-CAM. The initial paper provided a rich landscape for exploration and learning, leading us to segment our journey into six interactive Jupyter notebooks - Introduction, CutOut, ResNet, WideResNet, Regularization, and Grad-CAM.&lt;/p>
&lt;p>I&amp;rsquo;m excited to share that, as we&amp;rsquo;ve hit the mid-term milestone, I&amp;rsquo;ve managed to make significant strides and completed the notebooks up to the WideResNet section. It&amp;rsquo;s been a journey full of learning and growth, overcoming various challenges along the way - understanding the intricacies of the experiments, deconstructing complex architectures, and distilling all this into digestible, interactive notebooks for students. Despite the challenges, the process has been incredibly rewarding. As we gear up for the next half of the project, I&amp;rsquo;m eager to tackle the remaining sections and share my work with the community.&lt;/p>
&lt;p>&lt;strong>Learnings and Skills Gained&lt;/strong>&lt;/p>
&lt;p>&lt;em>&lt;strong>Embracing the Iterative Process of Open Source Development&lt;/strong>&lt;/em>: My initial foray into open source development had me writing and running code in one environment, then copying parts of it to another environment and pushing it from there to GitHub. This occasionally led to mistakes during the code migration. However, I&amp;rsquo;ve since learned to write or change a little bit of code, run the new version directly from GitHub, catch errors, and improve. In open source development, the end goal is to ensure everything works flawlessly, even if it involves several iterations. This is especially true considering the code from GitHub might directly run on platforms like Chameleon or Google Colab.&lt;/p>
&lt;p>&lt;em>&lt;strong>Understanding the Distinction between Reproducing Experiments and Crafting Educational Content&lt;/strong>&lt;/em>: There&amp;rsquo;s a stark difference between merely reproducing an experiment from a research paper and creating an educational resource around that experiment. The former generally involves cloning and running the code, verifying it against the claims in the paper with minimal modifications. The latter, however, necessitates adapting and simplifying the code, regardless of the learner&amp;rsquo;s skill level, to ensure their comprehension. It&amp;rsquo;s about carefully guiding learners through each step for a more profound understanding.&lt;/p>
&lt;p>&lt;em>&lt;strong>The Power of &amp;lsquo;Show, Don’t Tell&amp;rsquo;&lt;/strong>&lt;/em>: This priceless lesson was imparted by my mentor, Ms. Fraida Fund. Rather than telling me what to do when I erred or needed to learn something new, she demonstrated the correct way first-hand. This hands-on approach made understanding far easier. This principle is also reflected in the creation of our notebooks. For instance, we chose to include the Grad-CAM notebook. Although not directly referenced in the paper, it offers students a clear visual understanding of the impact of the Cutout technique, embodying the &amp;ldquo;show, don’t tell&amp;rdquo; philosophy.&lt;/p>
&lt;p>&lt;strong>Next Steps&lt;/strong>&lt;/p>
&lt;p>As we step into the second half of this thrilling journey, our primary goal is to complete the remaining sections of our Cutout project. We&amp;rsquo;re setting our sights on the final notebook - Grad-CAM. The Grad-CAM notebook will offer a visual exploration of how our models interpret and interact with data, thereby solidifying the students&amp;rsquo; understanding of Cutout data augmentation. So, stay tuned for more as we plunge into these fascinating topics!&lt;/p>
&lt;p>&lt;strong>Conclusion&lt;/strong>&lt;/p>
&lt;p>Looking back, my time with the Summer of Reproducibility initiative has been nothing short of a profound learning experience. Working on the &amp;ldquo;Using Reproducibility in Machine Learning Education&amp;rdquo; project has been both challenging and rewarding, and I am incredibly grateful for this opportunity.&lt;/p>
&lt;p>I&amp;rsquo;ve gained valuable insights into open-source development, delved deeper into the intricacies of machine learning techniques, and experienced firsthand the transformative power of a &amp;lsquo;show, don&amp;rsquo;t tell&amp;rsquo; teaching approach. Moreover, I&amp;rsquo;ve learned that the creation of educational resources requires a delicate balance between preserving the essence of original research and adapting it to foster easy understanding.&lt;/p>
&lt;p>As we press forward, I&amp;rsquo;m excited about the prospects of the coming weeks. The completion of the Grad-CAM notebook lies ahead, marking the final pieces of our Cutout project. Beyond this project, the skills and lessons I&amp;rsquo;ve acquired during this initiative will undoubtedly guide me in future endeavours.&lt;/p>
&lt;p>I can confidently say that my GSOC journey has been a remarkable chapter in my growth as a developer and researcher. Here&amp;rsquo;s to more learning, more coding, and more breakthroughs in the future!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230712-shayantan/</link><pubDate>Wed, 12 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230712-shayantan/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> my &lt;a href="https://drive.google.com/file/d/1N81dqvdTDcKjz5WDAUCdf5yi1BNR9Au6/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>, Martin Putra and collaborator &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a> (another OSRE fellow) aims to analyze large-scale sequencing datasets in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times.&lt;br>
Recent advancements in Next-Generation Sequencing (NGS) technologies have resulted in massive amounts of nucleotide sequence data and automated genomic workflows to streamline analysis and data interpretation. The success of NGS-driven research has also led to a sudden increase in data of varying size and complexity, making it more time-consuming for researchers to test hypotheses. Analyzing
high-throughput genomic data requires a step-by-step execution of dedicated tools - also known as workflows. The first step toward the execution of a typical genomic analysis workflow is quality control
of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Prior work in this area has suggested that the runtimes of genomic workflows are affected by qualitative differences in the data. Additionally, there is very little consensus on what constitutes “input quality” regarding data from large genomic experiments. In this proposal, we hypothesize that genomic data quality significantly impacts the genomic workflows’ execution time. We aim to leverage machine learning techniques to extract predictive features from quality control tools that robustly predict workflow execution time.&lt;/p></description></item><item><title>Highlighting and Formatting Pyrope HDL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230526-rbaxt/</link><pubDate>Thu, 22 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230526-rbaxt/</guid><description>&lt;p>As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Micro Architecture Santa Cruz (MASC)&lt;/a>, my &lt;a href="https://drive.google.com/file/d/1aJIF-geNoN49zjkFS1W7yur2-rYCxhrt/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Jose Renau aims to develop syntax highlighting and a vertical alignment tool for Pyrope. Pyrope is a modern hardware description language under development by MASC. Code is parsed with the &lt;a href="https://github.com/masc-ucsc/tree-sitter-pyrope/tree/main" target="_blank" rel="noopener">tree-sitter grammar for Pyrope&lt;/a>. I am working on developing a query file for the nvim-treesitter plugin. This gives neovim users Pyrope syntax highlighting based on the parse tree. In addition to syntax highlighting, I am working on a vertical alignment tool to improve code readability. 
These features will improve the usability and convenience of Pyrope.&lt;/p></description></item><item><title>Proactive Data Containers</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/pdc/20230620-nijwang/</link><pubDate>Tue, 20 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/pdc/20230620-nijwang/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc">Proactive Data Containers (PDC)&lt;/a> project, my &lt;a href="https://docs.google.com/document/d/1Pnt-iq9pWD70d_jmSsoJjnbXtIjJGY3IbXFrwyFT4Q4/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/houjun-tang/">Houjun Tang&lt;/a> aims to develop a novel data abstraction for managing science data in an object-oriented manner. PDCs will provide efficient strategies for moving data in deep storage hierarchies and techniques for transforming and reorganizing data based on application requirements. The container objects themselves are already well developed, so my goal is to verify the functionality tests for the Python API to ensure that it can be used with ease, and to create command-line tools so that PDC is a complete data object that can be used across platforms and is simple and helpful for users.&lt;/p></description></item><item><title>Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230617-zjyhhhhh/</link><pubDate>Sat, 17 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230617-zjyhhhhh/</guid><description>&lt;p>Hello! 
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz">Public Artifact Data and Visualization&lt;/a> project, our proposals (&lt;a href="https://drive.google.com/file/d/1egIQDLMQ5eV7Uc-S55-GTiSXdmrC3_Pj/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiayuan-zhu/">Jiayuan Zhu&lt;/a> and &lt;a href="https://drive.google.com/file/d/1Gf68Pz8v3YjcQ1sWkS9n2hnl7_lsme2l/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/krishna-madhwani/">Krishna Madhwani&lt;/a>) under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a> aim to design a system that allows researchers to conveniently record and compare the environmental information, such as CPU utilization, of different iterations and versions of code during an experiment.&lt;/p>
&lt;p>In academic experiments, there is often a need to compare results and performance between different iterations and versions. This comparative analysis helps researchers evaluate the impact of different experimental parameters and algorithms on the results and enables them to optimize experimental design and algorithm selection. However, to conduct effective comparative analysis, it is essential to record and compare environmental information, alongside the experimental data. This information provides valuable insights into the factors that may influence the observed outcomes.&lt;/p>
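To make the recording-and-comparison idea concrete, here is a minimal stdlib-only sketch; the log file name, the JSON fields, and the `record_run`/`compare_runs` helpers are hypothetical illustrations, not the project's actual interface, and a real system would also capture environment metrics such as CPU utilization:

```python
import json
import time
from pathlib import Path

# Hypothetical log file; the real system may store run records elsewhere.
LOG = Path("runs.jsonl")

def record_run(tag, fn):
    """Run fn(), appending wall time and a version tag for later comparison."""
    start = time.perf_counter()
    result = fn()
    entry = {"tag": tag,
             "wall_seconds": round(time.perf_counter() - start, 6),
             "timestamp": time.time()}
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return result

def compare_runs(tag_a, tag_b):
    """Return mean wall time per tag so two code versions can be compared."""
    runs = [json.loads(line) for line in LOG.read_text().splitlines()]
    def mean(tag):
        times = [r["wall_seconds"] for r in runs if r["tag"] == tag]
        return sum(times) / len(times)
    return {tag_a: mean(tag_a), tag_b: mean(tag_b)}

record_run("v1", lambda: sum(range(100000)))   # iteration 1 of an experiment
record_run("v2", lambda: sum(range(200000)))   # iteration 2, changed code
print(compare_runs("v1", "v2"))
```

A dashboard like the one described would then read the same log to plot trends across many tagged runs.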
&lt;p>Through this summer, we aim to develop a system that offers a streamlined interface, enabling users to effortlessly monitor their running programs using simple command-line commands. Moreover, our system will feature a user-friendly dashboard where researchers can access historical runtime information and visualize comparisons between different iterations. The dashboard will present comprehensive graphs and charts, facilitating the analysis of trends and patterns in the environmental data.&lt;/p></description></item><item><title>Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230616-kirandeol/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230616-kirandeol/</guid><description>&lt;p>Hello! My name is Kiran and this summer I&amp;rsquo;ll be working with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy">Polyphy&lt;/a> and &lt;a href="https://normand-1024.github.io/Bio-inspired-Exploration-of-Language-Embedding/" target="_blank" rel="noopener">Polyglot&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>.
The full &lt;a href="https://drive.google.com/file/d/1iwKU938uzUHn0oY2tM0jPADOYoF0kqbh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> is available online.&lt;/p>
&lt;p>For a brief overview, the Polyglot app allows users to interact with a 3D network of high-dimensional language embeddings, specifically the
&lt;a href="http://vectors.nlpl.eu/repository/" target="_blank" rel="noopener">Gensim Continuous Skipgram result of Wikipedia Dump of February 2017 (296630 words)&lt;/a> dataset. The high-dimensional
embeddings are reduced to 3 dimensions using UMAP. The novel &lt;a href="https://iopscience.iop.org/article/10.3847/2041-8213/ab700c/pdf" target="_blank" rel="noopener">MCPM slime mold metric&lt;/a> is then used
to compute the similarity levels between points (much like how you might compute the Euclidean distance between two points). These similarity levels are used
to filter the network and enable users to find interesting patterns in their data that they might not find using quantitative methods alone. For example, the network has
a distinct branch in which only years are nearby! Users might find other clusters, such as ones with sports words or even software engineering words.
Although such exploration may not lead to quantitatively significant conclusions, the ability to explore and test mini hypotheses about the data can lead to
important insights that later inform quantitatively significant conclusions.&lt;/p>
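The filtering step described above can be sketched in a few lines. Everything below is a toy stand-in: the points are random rather than UMAP-reduced embeddings, and the similarity formula is invented purely for illustration, since real similarity levels come from PolyPhy's MCPM simulation:

```python
import math
import random

# Toy 3D point cloud standing in for UMAP-reduced word embeddings.
random.seed(0)
points = [[random.gauss(0, 1) for _ in range(3)] for _ in range(6)]

def similarity(a, b):
    # Hypothetical stand-in for the MCPM metric: higher when points are closer.
    return 1.0 / (1.0 + math.dist(a, b))

def filter_edges(pts, threshold):
    """Keep only node pairs whose similarity level exceeds the threshold."""
    n = len(pts)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if similarity(pts[i], pts[j]) > threshold]

# Raising the threshold prunes the network down to the tightest clusters,
# which is how branches like the "years" cluster become visible.
print(len(filter_edges(points, 0.2)), len(filter_edges(points, 0.5)))
```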
&lt;p>In our project, we aim to expand Polyglot such that any user can upload their own data, once they have computed the MCPM metric using PolyPhy. This will have
important applications in building trust in our data and embeddings. This could also help with research on the MCPM metric, which presents a new, more naturalistic
way of computing similarity by relying on the principle of least effort. Overall, there is an exciting summer ahead and if you&amp;rsquo;re interested in keeping up please
feel free to check out the Polyglot app on GitHub!&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</guid><description>&lt;p>Hi! I&amp;rsquo;m Charis, an undergraduate student in the IT and Big Data Analytics program at the Calvin Institute of Technology. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dFkC2A0HUVaWd6NpCbTjRZVfYxQ7jRxJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a> and &lt;strong>Martin Putra&lt;/strong> aims to gain insight into features that are highly correlated with execution times of genomics workflows and build machine learning models for predicting workflow execution time.&lt;/p>
&lt;p>Genomics workflows exhibit a long-tail pattern in their execution times. According to the previous project team&amp;rsquo;s findings, approximately 2% of genomics workflows had a median execution time of up to 15%, resulting in weeks of execution. Interestingly, it was observed that input quality plays a role in these execution time differences. Therefore, we will analyze features such as the quality of input data as well as the amount of resources allocated in the execution of genomics workflows to find features that correlate with execution time. Based on these features we will build a machine learning model that can predict the execution time of genomics workflows.&lt;/p>
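As a toy illustration of the modeling direction, here is a one-feature least-squares fit of runtime against an input-quality metric. All numbers are invented, the quality feature is hypothetical, and the real project will use many features and richer machine learning models rather than a single linear fit:

```python
# Made-up QC feature (fraction of low-quality reads) and made-up runtimes.
frac_low_quality = [0.01, 0.05, 0.10, 0.20, 0.30]
runtime_hours = [2.1, 2.9, 4.2, 6.8, 9.5]

# Ordinary least squares for a single predictor: slope = cov(x, y) / var(x).
n = len(frac_low_quality)
mean_x = sum(frac_low_quality) / n
mean_y = sum(runtime_hours) / n
cov = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(frac_low_quality, runtime_hours))
var = sum((x - mean_x) ** 2 for x in frac_low_quality)
slope = cov / var
intercept = mean_y - slope * mean_x

def predict(x):
    """Predicted workflow runtime (hours) for a given quality-feature value."""
    return intercept + slope * x

print(round(predict(0.15), 2))
```

A positive slope here would mirror the hypothesis that lower input quality lengthens execution time.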
&lt;p>By collaborating with Shayantan Banerjee (another contributor) who will study data quality, I will study the system metrics of genomics workflows both at workflow-level and tool-level. Metrics will be collected by running genomics workflows using the Slurm workload manager under various resource allocation conditions. Genomics workflows will be executed on Chameleon clusters of different sizes.&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230613-haoranwu/</link><pubDate>Tue, 13 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230613-haoranwu/</guid><description>&lt;p>Hi! I’m Haoran Wu, a third year at the University of Chicago majoring in Economics and Computer Science. With my &lt;a href="https://docs.google.com/document/d/1CcNbvbNAmY0XkV9ckjHnILdMh92h1wqLUYqpT6qIsZY/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, I’m working on the &lt;a href="https://ospo.ucsc.edu/project/osre23/utexas/gpuemulator" target="_blank" rel="noopener">GPU Emulator for Easy Reproducibility of DNN Training&lt;/a> project with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vijay-chidambaram/">Vijay Chidambaram&lt;/a>. A Deep Neural Network (DNN) is an advanced artificial neural network that employs multiple layers to process intricate patterns and relationships within data. It finds applications in various fields such as image and speech recognition, natural language processing, and predictive modeling. The layers in a DNN progressively extract higher-level features from raw input data, enabling the network to learn and generalize patterns effectively.&lt;/p>
&lt;p>The growing popularity of Deep Neural Networks has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects.&lt;/p>
&lt;p>Nevertheless, not all DNN research experiments require the use of a GPU. System researchers, for instance, may be primarily interested in performance profiles and not necessarily in the accuracy of training or inference. These researchers might focus on optimizing the storage layer and data loading of DNN training. In such cases, a GPU emulator that accurately replicates GPU behavior without needing a physical GPU can fulfill their requirements. By utilizing a GPU emulator, system researchers can evaluate their system optimizations&amp;rsquo; performance without competing for limited GPU resources in the cloud, thereby avoiding unnecessary delays in their research progress. Our work will eventually be open source and benefit the community.&lt;/p></description></item><item><title>Optimizing FasTensor: Enabling Efficient Tensor Execution on GPUs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/fastensor/20230605-ris0801/</link><pubDate>Mon, 05 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/lbl/fastensor/20230605-ris0801/</guid><description>&lt;p>Greetings,&lt;/p>
&lt;p>I am Rishabh Singh, and I am excited to be part of the 2023 Google Summer of Code program. My &lt;a href="https://docs.google.com/document/d/14DRkbF1S0VnPcopd37Io0pgKVQ1bDSN3QMf3Os6JyBA/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> focuses on optimizing the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor">FasTensor&lt;/a> tensor computing library for efficient usage on GPUs, specifically targeting tensor contraction while preserving structure-locality. This optimization is crucial for scientific applications and advanced AI model training. Throughout the project, I will develop custom computational operations for GPUs, implement FasTensor on GPUs, assess its performance, and provide comprehensive documentation. By the end, I aim to deliver a working implementation, a performance report, and a detailed execution mechanism guide. Leveraging my background in software engineering and machine learning, I will use C++ with OpenMP to ensure efficient memory management and data movement. Stay tuned for regular updates and informative blogs as I progress through the summer.&lt;/p></description></item><item><title>Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230605-kokoedwin/</link><pubDate>Mon, 05 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230605-kokoedwin/</guid><description>&lt;p>I am Jonathan Edwin, coming from Indonesia, and I am extremely thrilled to be involved in the 2023 Summer of Reproducibility initiative. 
I am contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project.&lt;/p>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1UEIKfZuPwJ88fMQ1-109vzpA7r4-7ehG/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> aims to develop educational resources focusing on reproducing and replicating fundamental machine-learning techniques, such as Cutout data augmentation, U-Net, and Siamese networks. It aims to provide students with a hands-on learning experience that enhances their understanding of the models and their underlying principles while imparting valuable skills in ensuring research reproducibility.
The project will involve the creation of a series of interactive Jupyter notebooks covering the selected papers, guiding students through reproducing results, and focusing on best practices for ensuring reproducibility. Upon completion, the notebooks will provide a comprehensive and accessible learning experience for students while emphasizing the importance of reproducibility in machine learning education.
The proposal also identifies potential challenges associated with the project and proposed solutions to address them. Challenges include incompatibility issues with the original code and current frameworks or environments, difficulty in reproducing the exact results due to factors such as randomness or lack of specific details in the paper, and ensuring that the interactive elements in the Jupyter Notebooks are engaging and effective in teaching reproducibility concepts.&lt;/p></description></item><item><title>FlashNet: Towards Reproducible Continual Learning for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230604-rannnayy/</link><pubDate>Sun, 04 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230604-rannnayy/</guid><description>&lt;p>Hello! I&amp;rsquo;m Rani, a third year undergraduate student at Institut Teknologi Bandung majoring in Informatics. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">FlashNet&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1EhJm3kqrpybOkpXiiRMfqVxGeKe9iIsh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;strong>Daniar Kurniawan&lt;/strong> aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques.&lt;/p>
&lt;p>In real-world workloads, the I/O stream is known to change over time. As a result, I/O read/write performance varies and introduces tail latency. We would like to predict the latency of each I/O read so we can cut the tail and improve the system&amp;rsquo;s performance. This project focuses on improving the FlashNet pipeline and introducing adaptability to the machine learning models we build.&lt;/p>
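To illustrate what "the tail" means here, the sketch below draws synthetic read latencies and flags those beyond the 95th percentile; the distribution, threshold choice, and numbers are all made up for illustration and are not FlashNet's actual data or interface:

```python
import random

# Synthetic read latencies (microseconds) from a skewed distribution,
# standing in for a real I/O trace.
random.seed(42)
latencies_us = [random.expovariate(1 / 200) for _ in range(1000)]

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(p / 100 * len(ordered)))
    return ordered[idx]

# Reads beyond p95 form the tail a latency predictor would try to catch,
# e.g. so they could be redirected or failed over early.
p95 = percentile(latencies_us, 95)
slow = [x for x in latencies_us if x > p95]
print(f"p95 = {p95:.0f}us, {len(slow)} of {len(latencies_us)} reads in the tail")
```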
&lt;p>During the summer, we plan to implement the continual learning pipeline using the machine learning models we have built previously in the project. Of course, continual learning isn&amp;rsquo;t truly continual without the ability to retrain itself. Thus, we will implement, evaluate, and test several drift detection algorithms. We will also build a visualization platform to evaluate and monitor the performance of the models built. Lastly, we plan to create Chameleon Trovi artifacts to demonstrate our experiments and make these implementations available and reproducible to the public.&lt;/p></description></item><item><title>Introducing Levels of Reproduction and Replication in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230601-msaeed/</guid><description>&lt;p>Greetings everyone,&lt;/p>
&lt;p>I am Mohamed Saeed and I am delighted to be part of the 2023 Summer of Reproducibility program, where I am contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> project.&lt;/p>
&lt;p>My &lt;a href="https://drive.google.com/file/d/13HnCMZawpabiLdBoOiaJFF2mNXIPLCVJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> was accepted, and I am fortunate to have &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> as my mentor. The objective of my project is to develop highly interactive open educational resources that can be utilized by instructors teaching graduate or undergraduate machine learning courses. These resources will focus on integrating instruction on reproducibility and reproducible research principles.&lt;/p>
&lt;p>Understanding and practicing reproducibility in machine learning (ML) research is of utmost importance in today&amp;rsquo;s scientific and technological landscape. Reproducibility ensures the reliability, transparency, and credibility of ML findings and discoveries. By learning the principles of reproducibility, students at different levels can validate research results, test introduced methodologies, and understand the level of reproducibility of research.&lt;/p>
&lt;p>My contribution will involve developing interactive educational resources that encompass code examples, writing exercises, and comprehensive explanations of key concepts of reproducing ML research. These resources will be carefully crafted to assist students at various levels of expertise. Our aim is for these resources to be widely adopted by instructors teaching graduate or undergraduate machine learning courses, as they seek to enhance the understanding of reproducibility and reproducible research principles.&lt;/p>
&lt;p>I think this is a great opportunity to learn more about ML research reproducibility. I&amp;rsquo;ll be posting regular updates and informative blogs throughout the summer, so stay tuned!&lt;/p></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</link><pubDate>Thu, 01 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucdavis/scalebugs/20230601-boluwarinayinmode/</guid><description>&lt;p>Hello! As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/">ScaleBugs&lt;/a> project our proposals (&lt;a href="https://drive.google.com/file/d/17iANa5ei_gguZsGGwR1sfPHOoJysnNsf/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a> and &lt;a href="https://drive.google.com/file/d/199ZsiWHXsLYbSJ896vaf8tjrYs23P5xN/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> from &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>) under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a> aim to build a dataset of reproducible scalability bugs by analyzing bug reports from popular distributed systems like Cassandra, HDFS, Ignite, and Kafka. 
For each bug report, we will analyze whether the reported bug is influenced by the scale of the operation, such as the number of nodes being used or the number of requests. The resulting dataset will consist of bug artifacts containing the buggy and fixed versions of the scalability system, a reproducible runtime environment, and workload shell scripts designed to demonstrate bug symptoms under different scales. These resources will help support research and development efforts in addressing scalability issues and optimizing system performance.&lt;/p></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230531-zhiyanw/</link><pubDate>Wed, 31 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230531-zhiyanw/</guid><description>&lt;p>Hi! My name is Alex, an undergraduate student at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/MultiLevelErasure">Reproducible Evaluation of Multi-level Erasure Coding&lt;/a> project, my &lt;a href="https://docs.google.com/document/d/1dO1aING1QcSB---XklzUjNz0usVh7qWffVGC3GZq2AE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a> aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations.&lt;/p>
&lt;p>To provide some context, Erasure Coding (EC) is a common approach to protect data from disk failures. Data centers nowadays increasingly use Multi-Level Erasure Coding (MLEC), a newly developed erasure coding method that aims to deal with the drawbacks of Single-Level Erasure Coding (SLEC). Despite its increasing popularity, there have not been many systematic studies to analyze and evaluate MLEC, which is the focus of this project.&lt;/p>
&lt;p>The evaluation will primarily be conducted through simulations, since modifying configurations in a real large-scale system is costly and impractical. The expected deliverables of this project will be:&lt;/p>
&lt;ul>
&lt;li>An MLEC simulator that can reproducibly simulate different configurations of the MLEC system, e.g. coding parameter selection, chunk placement scheme, repair method choice, etc.&lt;/li>
&lt;li>An analysis of the performance and durability tradeoffs between different MLEC design choices based on the evaluation results from the simulation&lt;/li>
&lt;li>Reproduced SLEC evaluation results using existing SLEC simulators&lt;/li>
&lt;li>A comparison between MLEC and SLEC on performance and durability tradeoffs&lt;/li>
&lt;li>Well-written documents and detailed guides on how to reproduce the evaluation results&lt;/li>
&lt;/ul>
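As a minimal sketch of the kind of durability estimate the simulator deliverable describes (all parameters invented; a real MLEC simulator must also model both coding levels, chunk placement, and repair rather than a single-level yearly snapshot):

```python
import random

random.seed(1)

def data_loss_probability(n_disks, parity, afr, trials):
    """Monte Carlo estimate: fraction of simulated years in which more
    disks fail than a single-level code's parity can cover.
    afr is a made-up annual failure rate per disk."""
    losses = 0
    for _ in range(trials):
        # Each disk independently fails within the year with probability afr.
        failures = sum(1 for _ in range(n_disks) if afr > random.random())
        if failures > parity:
            losses += 1
    return losses / trials

# Toy 8+2 disk-level stripe with a hypothetical 2% annual failure rate:
# data loss requires 3 or more of the 10 disks failing in the same year.
print(data_loss_probability(10, 2, 0.02, 100000))
```

Sweeping `parity` or `afr` in such a loop is the simplest version of comparing erasure-coding configurations reproducibly.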
&lt;p>Our plan is to build the simulator throughout the summer. We hope our simulator and evaluation results can provide designers of large-scale storage systems with valuable insights on choosing the most appropriate erasure coding configuration per their needs.&lt;/p></description></item><item><title>[FLASHNET]: Leveraging ML-augmented I/O in Linux</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/</link><pubDate>Tue, 30 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/</guid><description>&lt;p>Hi! I&amp;rsquo;m Justin, an undergraduate at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">Flashnet&lt;/a> project my &lt;a href="https://drive.google.com/file/d/1gsNaYUYOgdN2ilpyPOmI7jjLeoZh219J/view" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of
&lt;strong>Daniar Kurniawan&lt;/strong> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> aims to port the Flashnet model into the Linux kernel.&lt;/p>
&lt;p>In this attempt, I will borrow architecture/design choices from LAKE (to take advantage of its integration of ML-focused hardware acceleration in the kernel) and evaluation criteria from LinnOS to test for model inference accuracy. I also plan to support latency &amp;ldquo;bucket&amp;rdquo; inference output to improve accuracy. Ultimately, my goal is to gain further insight into best practices for integrating ML models into real-life operating systems like Linux and to inform general design choices for the Flashnet pipeline.&lt;/p></description></item><item><title>Intro: Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230530-25chilingh/</link><pubDate>Tue, 30 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230530-25chilingh/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller Project&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1hDU87aAzbn88vWwOHH0ggIID2W4KKzp8SKF1Lb8LU90/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to create comprehensive technical documentation to help onboard new users of the OSAVC controller. I will be writing tutorials and examples to demonstrate how to start with an OSAVC, programming it with the robotic equivalent of HelloWorld and later moving on to more sophisticated explanations. 
Hence, this will encourage more applications and wider adoption in the field of autonomous vehicles and expand the community of OSAVC users.&lt;/p></description></item><item><title>Reproduce and benchmark self-adaptive edge applications under dynamic resource management</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230530-zharfanf/</link><pubDate>Tue, 30 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/edgebench/20230530-zharfanf/</guid><description>&lt;p>Hello there!&lt;/p>
&lt;p>I am Faishal Zharfan, a senior-year student studying Telecommunication Engineering at Bandung Institute of Technology (ITB) in Bandung, Indonesia; you can read my &lt;a href="https://drive.google.com/file/d/1u3UsCQZ40erpPmyoyn8DEVqH5Txmvvkz/view?usp=drive_link" target="_blank" rel="noopener">proposal&lt;/a> online. I&amp;rsquo;m currently part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/">Edgebench&lt;/a> project under the mentorship of Yuyang Huang. The main goal of this project is to reproduce and benchmark self-adaptive video applications using the proposed solution.&lt;/p>
&lt;p>The topic I&amp;rsquo;m currently working on, &amp;ldquo;Reproduce and benchmark self-adaptive edge applications under dynamic resource management&amp;rdquo; (also known as Edgebench), is led by Prof. Junchen Jiang and Yuyang Huang. Edgebench is a project that focuses on efficiently distributing resources (bandwidth and CPU) across several video applications. Today&amp;rsquo;s video applications typically process their data on an edge server; in this edge-computing setting, bandwidth and compute over the WAN are the greatest concern, because both are strictly limited. We could distribute the bandwidth evenly across the cameras, but each camera&amp;rsquo;s bandwidth and compute needs differ. We therefore need another solution. The recently proposed solution is called the &amp;ldquo;accuracy gradient&amp;rdquo;: it tells us how much bandwidth a given application needs at a certain time to achieve higher accuracy. The goal is to allocate more bandwidth to the applications with the largest F1-score improvement and take it from those whose F1 score would not diminish significantly, so that the total F1 score is higher in the end.&lt;/p>
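&lt;p>To make the idea concrete, here is a minimal, hypothetical sketch of greedy allocation driven by the accuracy gradient; the per-camera accuracy profiles below are made-up illustrative numbers, not Edgebench measurements:&lt;/p>

```python
# Greedy "accuracy gradient" allocation sketch: repeatedly give the next unit
# of bandwidth to the camera whose F1 score would improve the most.

def allocate_bandwidth(profiles, total_units):
    """profiles maps camera -> [F1 at 0 units, F1 at 1 unit, ...] (non-decreasing)."""
    alloc = {cam: 0 for cam in profiles}
    for _ in range(total_units):
        # Marginal F1 gain (the "accuracy gradient") of one more unit, per camera.
        gains = {
            cam: curve[alloc[cam] + 1] - curve[alloc[cam]]
            for cam, curve in profiles.items()
            if alloc[cam] + 1 < len(curve)
        }
        if not gains:
            break
        alloc[max(gains, key=gains.get)] += 1
    return alloc

profiles = {
    "cam_busy":  [0.20, 0.55, 0.75, 0.85, 0.88],  # gains a lot from extra bandwidth
    "cam_quiet": [0.60, 0.65, 0.68, 0.69, 0.70],  # already near saturation
}
print(allocate_bandwidth(profiles, 4))
```

&lt;p>With these invented profiles, an equal split would give each camera two units (total F1 of 0.75 + 0.68 = 1.43), while the gradient-driven split gives the busy camera three units and the quiet one a single unit (0.85 + 0.65 = 1.50), a higher total.&lt;/p>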
&lt;p>Throughout this summer, we have planned to implement the &amp;ldquo;accuracy gradient&amp;rdquo; and test several baselines to compare against the solution. As for the implementation, we are currently working on latency measurement: we are aware that this solution introduces overhead, so latency must be taken into account.&lt;/p></description></item><item><title>Enhancing and Validating LiveHD's Power Modeling Flow</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230529-shahzaibk23/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/livehd/20230529-shahzaibk23/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Enhancing and Validating LiveHD&amp;rsquo;s Power Modeling Flow&lt;/a> my &lt;a href="https://docs.google.com/document/d/1_GtzWf_gCKkreN1-6VSAI4h2BqwKEUDGkNNB1OM554I/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> aims to enhance and validate LiveHD&amp;rsquo;s power modeling flow, a critical feature for estimating power consumption in modern hardware designs. The existing flow requires further refinement to ensure its stability, accuracy, compatibility with a wider range of netlists and VCD files, and overall performance. To address these challenges, the project will focus on methodically debugging the current implementation, establishing a comprehensive validation methodology for verifying the accuracy of power estimates, and optimizing the flow to handle larger netlists and VCD files efficiently. Additionally, the project aims to improve existing documentation by providing detailed explanations, examples, and tutorials to facilitate user adoption and understanding.
Upon successful completion, the project will deliver a more reliable, accurate, and efficient power modeling flow within LiveHD, contributing to the development of energy-efficient hardware designs. This refined flow will not only enhance the capabilities of LiveHD but also encourage wider adoption and utilization by the hardware design community, fostering innovation in the field of energy-efficient devices and systems.&lt;/p></description></item><item><title>High Fidelity UAV Simulation Using Unreal Engine with specular reflections</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230601-damodardatta/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230601-damodardatta/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller&lt;/a> my &lt;a href="https://drive.google.com/file/d/18g-WRZj_7ufIt6YZNn4OG1s7VKi1u5hV/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop an Unreal Engine based simulator for testing. The simulator will use Unreal Engine for physics and visualization.&lt;/p>
&lt;p>The existing framework uses the Gazebo simulator with ROS, which limits development to the Python and C++ programming languages. I intend to develop this simulator so that it connects with Python and C++, and additionally to expand support to MATLAB so that future control algorithm design and validation become easier. To smooth future development, I intend to add detailed documentation consisting of weekly reports from the development period, examples, and tutorials. Upon successful completion, the project will deliver a powerful simulator with realistic Unreal Engine based simulation and support for additional programming languages such as MATLAB.&lt;/p>
&lt;p>For more information about the Open Source Autonomous Vehicle Controller and the UC OSPO organization, you can visit the &lt;a href="https://github.com/uccross/open-source-autonomous-vehicle-controller" target="_blank" rel="noopener">OSAVC project repository&lt;/a> and the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/">UC OSPO website.&lt;/a>&lt;/p></description></item><item><title>OpenRAM Layout versus Schematic (LVS) visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/openram/20230529-mahnoor-ismail01/</link><pubDate>Mon, 29 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/openram/20230529-mahnoor-ismail01/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram">OpenRAM Layout versus Schematic (LVS) visualization&lt;/a> my &lt;a href="https://docs.google.com/document/d/1QEBOglVgy20s0v1_vfpFHw8CdIYUbex12TOjSlAe1-E/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-cirimelli-low/">Jesse Cirimelli-Low&lt;/a> and &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a> aims to develop a comprehensive Python-based graphical user interface (GUI) with a robust backend system to effectively analyze, visualize, and debug layout versus schematic (LVS) mismatches in the OpenRAM framework. The proposed solution focuses on efficiently processing LVS report files in JSON format, identifying mismatched nets in the layout, and visually representing extra nets in the schematic graph using advanced backend algorithms. By implementing a powerful backend system, the GUI will streamline the debugging process and improve overall productivity, while maintaining high performance and reliability.
The deliverables for this project include a fully-functional GUI with a performant backend, features for visualizing and navigating through LVS mismatches, comprehensive documentation, and user guides.&lt;/p></description></item><item><title>Automatic Cluster Performance Shifts Detection Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230527-kangrui/</link><pubDate>Sat, 27 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230527-kangrui/</guid><description>&lt;p>Hi! I am Kangrui, a Pre-doc student at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift">Automatic Cluster Performance Shifts Detection Toolkit&lt;/a> my &lt;a href="https://drive.google.com/file/d/1AxpgWLzF3oKTFlD8q6JYS35CxxJ6c76X/view?usp=share_link" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Sandeep Madireddy&lt;/strong> and &lt;strong>Ray Andrew&lt;/strong> aims to design a real-time performance shift detection algorithm for high-performance computing clusters, ensuring minimal overheads.&lt;/p>
&lt;p>This project focuses on developing a real-time performance shift detection algorithm tailored to heterogeneous workloads, aiming to promptly inform administrators about performance changes. The primary goal is to design an algorithm that efficiently detects shifts in real-time, with minimal system overheads.&lt;/p>
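&lt;p>As a rough illustration of the kind of low-overhead online detector we have in mind, here is a minimal one-sided CUSUM sketch; the throughput numbers and thresholds are invented for illustration and are not Darshan data:&lt;/p>

```python
# One-sided CUSUM sketch: accumulate evidence that a performance metric has
# drifted below its baseline mean, and flag a shift once the accumulated
# evidence exceeds a threshold. O(1) work per sample keeps overhead minimal.

def detect_downward_shift(samples, baseline_mean, slack=0.5, threshold=3.0):
    """Return the index at which a downward shift is flagged, or None."""
    cusum = 0.0
    for i, x in enumerate(samples):
        # Deviations below baseline (beyond the slack) accumulate; the
        # statistic resets toward zero while the metric looks healthy.
        cusum = max(0.0, cusum + (baseline_mean - x) - slack)
        if cusum > threshold:
            return i
    return None

# Throughput (GB/s) holds near 10, then degrades to ~7 starting at index 5.
samples = [10.1, 9.8, 10.0, 9.9, 10.2, 7.1, 7.0, 6.9, 7.2, 7.0]
print(detect_downward_shift(samples, baseline_mean=10.0))
```

&lt;p>A real detector would also estimate the baseline online and filter noise, which is part of the design space of shift detection methods and noise filtering techniques we plan to evaluate.&lt;/p>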
&lt;p>In addition to algorithm development, we plan to enhance the Darshan toolkit&amp;rsquo;s functionality by integrating our algorithm, offering users early performance shift detection. This integration will aid administrators in making informed system utilization and scheduling decisions.&lt;/p>
&lt;p>To promote transparency and reproducibility, we&amp;rsquo;ll encapsulate our findings, scripts, and profiling data within a Jupyter notebook shared via Chameleon Trovi, enabling other researchers to reproduce our experiments easily.&lt;/p>
&lt;p>Looking ahead, we plan to expand the algorithm&amp;rsquo;s applicability to cater to diverse HPC workloads and infrastructures. Other areas of interest include its use in detecting shifts in financial markets or monitoring IoT data streams. Further refinement of our algorithm, to reduce overheads and improve real-time detection capabilities, is also a part of our future endeavours. This task may involve evaluating various shift detection methods and noise filtering techniques.&lt;/p></description></item><item><title>Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230527-indianspeedster/</link><pubDate>Sat, 27 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/eduml/20230527-indianspeedster/</guid><description>&lt;p>Hey,&lt;/p>
&lt;p>I am Shekhar and I am one of several students who are working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml">Using Reproducibility in Machine Learning Education&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>. My &lt;a href="https://drive.google.com/file/d/1rCzLGIJ8HYCVjY_MfndgrQjAQa2SQbqZ/view?usp=sharing" target="_blank" rel="noopener">Proposal&lt;/a> aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. My project is inspired by my experience in the &lt;a href="https://paperswithcode.com/rc2022" target="_blank" rel="noopener">Machine Learning Reproducibility Challenge&lt;/a>, where I found that a major challenge for reproducibility was that some details were left ambiguous in the paper I was trying to reproduce. For my project, I will develop an interactive tutorial to help demonstrate how if the methodology details are not fully specified in a publication, then someone trying to reproduce the result will have to make choices that may not match the authors’, and these choices will affect whether or not the final result is validated.&lt;/p></description></item><item><title>Efficient Communication with Key/Value Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</link><pubDate>Fri, 26 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/kvstore/20230526-manank/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Manank Patel, currently an undergraduate student at Birla Institute of Technology and Sciences - Pilani, KK Birla Goa Campus. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore">Efficient Communication with Key/Value Storage Devices&lt;/a> my &lt;a href="https://drive.google.com/file/d/1iJIlHuCpnvDeOyr5DphDDimqdl9s4hKH/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aldrin-montana/">Aldrin Montana&lt;/a> and &lt;strong>Philip Kufeldt&lt;/strong> aims to implement an io_uring-based communication backend for a network-based key-value store.&lt;/p>
&lt;p>The KV store clients currently use traditional network sockets and POSIX APIs to communicate with the KV store. A notable advancement of the past two years is io_uring, a new kernel interface that can be used instead of the POSIX API. It employs shared-memory queues for communication between the kernel and user space, enabling data transfer without per-operation system calls and supporting zero-copy transmission. By circumventing the overhead associated with system calls, this approach has the potential to enhance performance significantly.&lt;/p></description></item><item><title>Update OpenROAD Documentation and Tutorials</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230526-luarss/</link><pubDate>Fri, 26 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/openroad/20230526-luarss/</guid><description>&lt;p>Hi! I am Jack, a Master&amp;rsquo;s student at the National University of Singapore. In GSoC 2023, I will be undertaking the project entitled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad">Update OpenROAD Documentation and Tutorials&lt;/a> to improve the user experience and documentation of this exciting open-source RTL-to-GDSII framework, jointly mentored by &lt;strong>Indira Iyer Almeida&lt;/strong> and &lt;strong>Vitor Bandeira&lt;/strong>. Check out my proposal &lt;a href="https://drive.google.com/file/d/1_R4zDe2N05LtAsvDKa3w6C98GvIZ8HAI/view?usp=sharing" target="_blank" rel="noopener">here!&lt;/a>&lt;/p>
&lt;p>This project aims to review OpenROAD-flow-scripts and add missing documentation and tutorials. A key focus will be on increasing ease of setup by updating documentation, setup scripts, and Docker-based commands. Next, we will also update documentation for the following OpenROAD components: Makefile flow variables, distributed detailed routing, Hier-RTLMP, and AutoTuner. If time permits, cloud enablement will be implemented, alongside notebook-based packaging to further increase ease of adoption.&lt;/p></description></item><item><title>Advancing Reproducible Science through Open Source Laboratory Protocols as Software</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230621-luhesketh/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsd/labop/20230621-luhesketh/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>My name is Luiza, and I am an eighth-semester BSc Biological Sciences student from São Paulo, Brazil. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop">LabOP&lt;/a> working group, my &lt;a href="https://docs.google.com/document/d/1pJ7UIATZYASXjbLdUosvq08QkhPNTFxZFId9dapNp-o/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a> aims to build a converter that takes ordinary laboratory protocols and translates them into machine-executable protocols. This is possible thanks to LabOP&amp;rsquo;s versatile representation of what a laboratory protocol should look like. I&amp;rsquo;ll be testing this specialization on Hamilton machines, which are great for scaling up experiments.&lt;/p>
&lt;p>Today, biotechnology laboratories face a very common issue: protocols are difficult to share and to adapt for machine execution. Laboratory protocols are critical to biological research and development, yet complicated to communicate and reproduce across projects, investigators, and organizations. While many attempts have been made to address this challenge, there is currently no available protocol representation that is unambiguous enough for precise interpretation and automation, yet simultaneously abstract enough to enable reuse and adaptation.&lt;/p>
&lt;p>With LabOP, we can convert a protocol in multiple ways depending on the researcher&amp;rsquo;s needs for automation or human execution, allowing flexibility in how experiments are run. I&amp;rsquo;ll be building a specialization that translates protocols so they can be executed by Hamilton machines.&lt;/p></description></item><item><title>Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230526-ren.450/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230526-ren.450/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a> aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>The project plans to measure the impact of different missing settings for open-source database systems such as MySQL and PostgreSQL, particularly under the TPC-C benchmark. The objective requires running experiments with popular settings that are not reported and fixing any problems that arise for the target systems. The project will compare the performance characteristics and analyze the impact of missing settings on the performance of the target systems.&lt;/p></description></item><item><title>PolyPhy Infrastructure Enhancement</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230525-prashantjha/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/polyphy/20230525-prashantjha/</guid><description>&lt;p>Hey!&lt;/p>
&lt;p>I&amp;rsquo;m Prashant Jha from Pune, a recent graduate of BITS Pilani. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy">PolyPhy&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1y2X1_6_HliYowZn-qHd7x_Hz6QC3-KSe/view" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a> aims to develop and improve the project&amp;rsquo;s current infrastructure.&lt;/p>
&lt;p>Polyphorm / PolyPhy is led by Oskar Elek. PolyPhy is an organization that focuses on developing a GPU-oriented
agent-based system for reconstructing and visualizing optimal transport networks
defined over sparse data. With its roots in astronomy and inspiration drawn from nature,
PolyPhy has been instrumental in discovering network-like patterns in natural language
data and reconstructing the Cosmic web structure using its early prototype called
Polyphorm. The organization aims to provide a richer 2D / 3D scalar field representation
of the reconstructed network, making it a toolkit for a range of specialists across
different disciplines, including astronomers, neuroscientists, data scientists, and artists.
PolyPhy&amp;rsquo;s ultimate purpose is to create quantitatively comparable structural analytics
and discover connections between different disciplines. To achieve its goals, PolyPhy
requires a robust infrastructure that is engineered using DevOps, Code Refactoring, and
Continuous Integration/Continuous Deployment (CI/CD) practices.
You can see an instructive overview of PolyPhy in our workshop and more details about our research &lt;a href="https://polyphy.io/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p></description></item><item><title>Strengthening Underserved Segments of the Open Source Pipeline</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/sus/20230524-nandinisaagar/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/sus/20230524-nandinisaagar/</guid><description>&lt;p>Namaste everyone🙏🏻!&lt;/p>
&lt;p>I&amp;rsquo;m Nandini Saagar from Mumbai, an undergraduate student at the Indian Institute of Technology (BHU) Varanasi, Banaras Hindu University. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus">Strengthening Underserved Segments of the Open Source Pipeline&lt;/a> my &lt;a href="https://docs.google.com/document/d/1snzaUfBvptLcWP7I8IyKYFuBNfVGxNe9mnYkFXhb5ZM/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a> aims to strengthen underserved segments of the open source pipeline.&lt;/p>
&lt;p>My interest in Open Source was first piqued as a freshman, when I was introduced to Open Source as a place where people from all communities and backgrounds come together to create software with real-world impact, all in a completely autonomous and self-governed manner! I am so glad that I could go from imagining Open Source as a distant dream to being a part of multiple such communities. This journey has been life-defining for me, and that’s why I want to help deliver the message of Open Source to all teenagers!&lt;/p>
&lt;p>This project seeks to invite and support broader, more diverse participation in open source by supporting early contributors, especially those who have been historically minoritized within tech. It aims to create content that anyone with some Open Source experience can use to guide new students through the journey of Open Source, GitHub, and the relevant technologies; to provide a platform for contributors to share their Open Source experiences and testimonials; to conduct an Open Source themed hackathon/scavenger hunt; and to leverage social media engagement to acquaint young, brilliant minds with the technical and open-source world at an early age.&lt;/p>
&lt;p>Stay tuned to explore the enormous world of Open Source with me!&lt;/p></description></item><item><title>Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230525-aniruddha1261/</link><pubDate>Wed, 24 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/osavc/20230525-aniruddha1261/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc">Open Source Autonomous Vehicle Controller Project&lt;/a> my &lt;a href="https://drive.google.com/file/d/1_w9RfOM6XWruYUDR1d1yo45tQenpTQq5/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Aaron Hunter and Carlos Espinosa&lt;/strong> aims to develop a tutorial that serves as a comprehensive guide for new users of the OSAVC controller. The tutorial will start from scratch, demonstrating how to initialize and program the controller using the equivalent of a &amp;ldquo;Hello, World!&amp;rdquo; program. Subsequently, it will progress to more advanced applications.&lt;/p>
&lt;p>Throughout the project, I will work closely with my mentors to ensure the accuracy, clarity, and usability of the documentation. Their guidance and expertise will be instrumental in achieving the project&amp;rsquo;s objectives effectively.&lt;/p>
&lt;p>By creating comprehensive technical documentation, this project aims to empower new users to harness the capabilities of the OSAVC controller. It will facilitate their understanding of the controller&amp;rsquo;s functionalities and enable them to leverage its potential in the field of autonomous vehicle applications.&lt;/p>
&lt;p>I am excited to embark on this journey, contribute to the open-source community, and make a valuable impact in the field of autonomous vehicles. Stay tuned for regular updates and progress reports as I work towards achieving the goals set forth in this project.&lt;/p>
&lt;p>For more information about the Open Source Autonomous Vehicle Controller and the UC OSPO organization, you can visit the &lt;a href="https://github.com/uccross/open-source-autonomous-vehicle-controller" target="_blank" rel="noopener">OSAVC project repository&lt;/a> and the &lt;a href="https://ospo.ucsc.edu/" target="_blank" rel="noopener">UC OSPO website.&lt;/a>&lt;/p>
&lt;p>Stay connected and join me in this exciting endeavor!&lt;/p></description></item><item><title>Verify the reproducibility of an experiment</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230524-jesselima/</link><pubDate>Wed, 24 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/noworkflow/20230524-jesselima/</guid><description>&lt;p>Hello everyone,
my name is Jesse, and I&amp;rsquo;m proud to be a fellow in the 2023 Summer of Reproducibility program, contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow">noWorkflow&lt;/a> project.&lt;/p>
&lt;p>My &lt;a href="https://docs.google.com/document/d/1YMtPjZXcgt5eplyxIgQE8IBpQIiRlB9eqVSQiIPhXNU/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> was accepted under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a> and aims to
map and test the capture of provenance in typical Data Science and Machine Learning experiments.&lt;/p>
&lt;h4 id="what">What&amp;hellip;&lt;/h4>
&lt;p>Although much can be said about what reproducibility means, the ability to replicate results in day-to-day Data Science and Machine Learning experiments can pose a significant challenge for individuals, companies, and research centers. This challenge becomes even more pronounced with the rise of analytics and AI, where scientific methodologies are applied extensively on an industrial scale. Reproducibility thus assumes a key role in the productivity and accountability expected from Data Scientists, Machine Learning Engineers, and other roles engaged in ML/AI projects.&lt;/p>
&lt;h4 id="how">How&amp;hellip;&lt;/h4>
&lt;p>In day-to-day work, the pitfalls of non-reproducibility appear at different points of the experiment lifecycle. These challenges arise whenever an individual or a team of scientists must manage multiple experiments. In a typical experiment workflow, reproducibility concerns appear at several steps of the process:&lt;/p>
&lt;ul>
&lt;li>The need to track the provenance of datasets.&lt;/li>
&lt;li>The need to manage changes in hypothesis tests.&lt;/li>
&lt;li>Addressing the management of system hardware and OS setups.&lt;/li>
&lt;li>Dealing with outputs from multiple experiments, including the results of various model trials.&lt;/li>
&lt;/ul>
&lt;p>In academic environments, these issues can result in mistakes and inaccuracies. In companies, they can lead to inefficiencies and technical debts that are difficult to address in the future.&lt;/p>
&lt;h4 id="finally">Finally&amp;hellip;&lt;/h4>
&lt;p>I believe this is a great opportunity to explore the convergence of these two hot topics, AI and reproducibility! I will share more updates here throughout this summer, and I hope we can learn a lot together!&lt;/p></description></item><item><title>Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230524-srishti-j18/</link><pubDate>Tue, 23 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/nyu/edunet/20230524-srishti-j18/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet">Teaching Computer Networks with Reproducible Research project&lt;/a> my &lt;a href="https://drive.google.com/file/d/1EI0Zhh6YFwufEZ-53VWwhTOyJUuw7-Rf/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> aims to develop a classroom competition for adaptive video delivery policies, leveraging an existing open-source reproducible result. The competition will challenge students to extend the original work and design their own adaptive policies for head-to-head competition against their classmates. The project will involve packaging the existing result for easy reproducibility and building on it by implementing other adaptive video policies from the literature, developing different network settings for evaluating student submissions, and creating an evaluation framework for scoring submissions based on various criteria (so that competition remains fair and unbiased). 
The deliverables include a functional submission and evaluation process, an evaluation framework, and documentation and materials for course instructors to use in the classroom.&lt;/p></description></item><item><title>OSRE Catalyst</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/catalyst/</link><pubDate>Thu, 23 Mar 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/catalyst/</guid><description>&lt;p>Contributing to an open source project is a great way to build a technical portfolio, learn industry tools/practices, and have real-world impact – all while embedded in a collaborative community. The UC Santa Cruz Open Source Program Office (OSPO) wants to support more students on this path, especially those who have been minoritized in tech. We are partnering with an HBCU for a pilot summer program offering, with hopes to expand our reach in 2024.&lt;/p>
&lt;p>Through a hybrid (in-person/remote) model, participating students will spend four weeks on the UCSC campus learning about open source, followed by four weeks remotely contributing to an open source project. Participants will be well-supported by our instructional team, as well as their small peer cohort, through community-building and mentorship spanning the full eight weeks.&lt;/p>
&lt;h3 id="pilot-program-mentor--developer">Pilot Program Mentor &amp;amp; Developer&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Education&lt;/code>, &lt;code>Broadening Participation&lt;/code>, &lt;code>Mentorship and Support&lt;/code>, &lt;code>Community&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> communication, organization, GitHub/Markdown, basic web programming (HTML, CSS, JavaScript), open source contribution, version control/git workflow, mentorship, teaching&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Novice to Intermediate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a>, &lt;a href="mailto:davis@soe.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Given that this is a program pilot, your involvement and feedback will directly help shape its future!&lt;/p>
&lt;p>Possible tasks:&lt;/p>
&lt;ul>
&lt;li>Help cultivate a welcoming and supportive learning community&lt;/li>
&lt;li>Support students in completing hands-on activities related to open source contribution (e.g. evaluating potential projects/communities, using git, setting up a development environment)&lt;/li>
&lt;li>Develop technology-specific tutorials to introduce students to languages/libraries/etc. employed by their project&lt;/li>
&lt;li>Offer mentorship around how to navigate documentation, large codebases, and contributor communities&lt;/li>
&lt;li>Share your own input and perspective on what it&amp;rsquo;s like to be a newcomer to open source!&lt;/li>
&lt;/ul></description></item><item><title>eBPF Monitoring Tools</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lanl/ebpftools/</link><pubDate>Tue, 21 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lanl/ebpftools/</guid><description>&lt;p>&lt;a href="https://ebpf.io" target="_blank" rel="noopener">eBPF&lt;/a> is a technology that allows sandboxed programs to run in a privileged context such as the Linux kernel. eBPF is for operating systems what JavaScript is for web browsers: new functionality can be safely loaded and executed efficiently without restarting or continually upgrading the operating system or browser. eBPF is used to introduce new functionality into a running Linux kernel, including next-generation networking, observability, and security functionality. The following is just one of many possible ideas.&lt;/p>
&lt;h3 id="implement-darshan-functionality-as-ebpf-tool">Implement Darshan functionality as eBPF tool&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> performance, I/O, workload characterization&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:treddy@lanl.gov">Tyler Reddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://www.mcs.anl.gov/research/projects/darshan/" target="_blank" rel="noopener">Darshan&lt;/a> is an HPC I/O characterization tool that collect statistics using a lightweight design that makes it suitable for full time deployment. Darshan is an interposer library that catches and counts IO requests (open, write, read, etc.) to a file/file system and it keeps the counters in buckets in data structure that can be queried. How many reads of small size, medium size, large size) for example are the types of things that are counted.&lt;/p>
&lt;p>Being an interposer library means users must link their application with it. Implementing the same functionality in eBPF would make it transparent to users. Darshan already implements all of these functions and could provide the list of functions to implement; the programmer could then build and test them in eBPF on a Linux machine. The result would be a broadly available, generally useful open tool, and just one of perhaps hundreds of eBPF-based tools that the open community could leverage.&lt;/p></description></item><item><title>Reproducible Evaluation of Multipath Network Protocols</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</link><pubDate>Thu, 16 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a>&lt;/p>
&lt;p>As mobile devices with dual WiFi and cellular interfaces become widespread, network protocols have been developed that utilize the availability of multiple paths. However, the relative effectiveness of these protocols is highly dependent on the characteristics of the network (including the relationship between the two paths, which are often not independent). Researchers typically evaluate a multipath protocol for a small set of network scenarios, which vary from one publication to the next. It is therefore difficult to get a good picture of how different protocols perform in a range of settings.&lt;/p>
&lt;h3 id="framework-for-repeatable-direct-comparison-of-multipath-transport-protocols">Framework for repeatable, direct comparison of multipath transport protocols&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, wireless systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, networking, data analysis and visualization, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a> and &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In single-path congestion control, the &lt;a href="https://pantheon.stanford.edu/" target="_blank" rel="noopener">Pantheon&lt;/a> work created a reference set of executable benchmarks that researchers could use to evaluate novel congestion control designs against existing work in a wide range of scenarios. This project seeks to achieve something similar for multipath protocols, using publicly available networking testbeds like &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>. For this project, the participant will:&lt;/p>
&lt;ul>
&lt;li>Prepare a set of network benchmarks for multipath protocols, using live network links, real link traces, and emulated scenarios&lt;/li>
&lt;li>Develop an experiment using the benchmarks to evaluate existing multipath protocol implementations&lt;/li>
&lt;li>Prepare materials that researchers can use to evaluate novel multipath protocols against the others in the benchmark&lt;/li>
&lt;/ul></description></item><item><title>Proactive Data Containers (PDC)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc/</link><pubDate>Sun, 12 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/pdc/about.html" target="_blank" rel="noopener">Proactive Data Containers&lt;/a> (PDC) are containers within a locus of storage (memory, NVRAM, disk, etc.) that store science data in an object-oriented manner. Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning.&lt;/p>
&lt;h3 id="command-line-and-python-interface-to-an-object-centric-data-management-system">Command line and python interface to an object-centric data management system&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Python&lt;/code>, &lt;code>object-centric data management&lt;/code>, &lt;code>PDC&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, C, Python&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/houjun-tang/">Houjun Tang&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://github.com/hpc-io/pdc" target="_blank" rel="noopener">Proactive Data Containers (PDC)&lt;/a> is an object-centric data management system for scientific data on high performance computing systems. It manages objects and their associated metadata within a locus of storage (memory, NVRAM, disk, etc.). Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning. This project includes developing and updating efficient and user friendly command line and Python interfaces for PDC.&lt;/p></description></item><item><title>Is Reproducibility Enough? Understanding the Impact of Missing Settings in Artifact Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings/</guid><description>&lt;p>While Artifact Evaluation tries to ensure that the evaluation results in a paper are reproducible, it leaves one question: How about experiment settings NOT reported by the paper? Such “missing settings” may create multiple problems: 1) sometimes the artifacts simply do not work under these missing settings, creating problems when a later work needs to compare to an earlier work under these settings; 2) sometimes the artifacts do not perform well under these missing settings, which may create a bias during the evaluation; 3) to improve the artifact to work under these missing settings, sometimes one needs to re-design the system, which may change the results of the original experiments.&lt;/p>
&lt;p>In this project, we plan to understand the impact of this problem: On the necessity side, how would these missing settings affect the conclusions of the original work? On the feasibility side, how much effort does it require to carry out extensive experiments? We plan to answer these questions by reproducing prior works, running them on popular settings that are not reported by these works, and fixing problems if any.&lt;/p>
&lt;h3 id="measuring-research-prototypes-under-unreported-settings">Measuring Research Prototypes under Unreported Settings&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> reproducibility, databases, key-value stores, DNN training&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Java/Python, Linux, TPC/YCSB&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao Yu&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xueyuan-ren/">Xueyuan Ren&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will first pick one or a few systems she is interested in. Then she will try to reproduce their reported results. If successful, she will further try to measure these systems under previously unreported settings. During the procedure, she will need to diagnose and fix any problems that may show up. Finally, she will analyze whether the original conclusions still hold under these new settings and whether fixing any problems will change the performance characteristics of the target systems.&lt;/p></description></item><item><title>OpenRAM</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram/</guid><description>&lt;p>&lt;a href="https://github.com/VLSIDA/OpenRAM" target="_blank" rel="noopener">OpenRAM&lt;/a> is an award-winning open-source Python framework to create the layout, netlists, timing and power models, placement and routing models, and other views necessary to use SRAMs in ASIC design. OpenRAM supports integration in both commercial and open-source flows with both predictive and fabricable technologies. Most recently, it has created memories that are included on all of the &lt;a href="https://efabless.com/open_shuttle_program/" target="_blank" rel="noopener">eFabless/Google/Skywater MPW tape-outs&lt;/a>.&lt;/p>
&lt;h3 id="layout-verses-schematic-lvs-visualization">Layout verses Schematic (LVS) visualization&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>VLSI Design Basics&lt;/code>, &lt;code>Python&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, VLSI, JSON&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-cirimelli-low/">Jesse Cirimelli-Low&lt;/a>, &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mahnoor-ismail/">Mahnoor Ismail&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create a visualization interface to debug layout versus schematic mismatches in the &lt;a href="https://github.com/RTimothyEdwards/magic" target="_blank" rel="noopener">Magic&lt;/a> layout editor. Results will be parsed from a JSON output of &lt;a href="https://github.com/RTimothyEdwards/netgen" target="_blank" rel="noopener">Netgen&lt;/a>.&lt;/p></description></item><item><title>noWorkflow</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/</guid><description>&lt;p>The &lt;a href="https://github.com/gems-uff/noworkflow" target="_blank" rel="noopener">noWorkflow&lt;/a> project aims at allowing scientists to benefit from provenance data analysis even when they don&amp;rsquo;t use a workflow system. It also aims to free them from relying on naming conventions to store files originating from previous executions; without such conventions, result and intermediate files are overwritten by every new execution of the pipeline.&lt;/p>
&lt;p>noWorkflow was developed in Python and is currently able to capture the provenance of Python scripts using software engineering techniques such as abstract syntax tree (AST) analysis, reflection, and profiling, collecting provenance without the need for a version control system or any other special environment.&lt;/p>
&lt;p>At the time of this writing, the main version of noWorkflow is in the 2.0-alpha branch. We intend to release it before the summer.&lt;/p>
&lt;h3 id="verify-the-reproducibility-of-an-experiment">Verify the reproducibility of an experiment&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, SQL or SQLAlchemy ORM&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement an algorithm to compare the provenance from two (or more) trials (i.e., executions of an experiment) to check their reproducibility. The provenance stored in the relational (SQLite) database by noWorkflow 2 contains intermediate variable values from a trial. These values can be compared to check how much, or where, executions deviate from each other.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Compare trials of the same script (Medium)&lt;/li>
&lt;li>Estimate how much one trial deviates from another (Medium)&lt;/li>
&lt;li>Consider different scripts and execution flows (Large)&lt;/li>
&lt;li>Indicate which parts of the scripts are not reproducible (Large)&lt;/li>
&lt;/ul>
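A minimal sketch of the comparison idea, assuming trial provenance has already been flattened into dicts mapping variable name to value (an assumption for illustration; the real data lives in noWorkflow's SQLite database):

```python
# Compare intermediate values captured by two trials of the same script.
# The flat-dict input format is a simplification for illustration.

def compare_trials(trial_a, trial_b):
    """Return the deviating variables and the match rate over shared ones."""
    shared = sorted(set(trial_a).intersection(trial_b))
    deviating = [name for name in shared if trial_a[name] != trial_b[name]]
    match_rate = 1.0 if not shared else 1.0 - len(deviating) / len(shared)
    return deviating, match_rate

trial_1 = {"seed": 42, "mean": 0.513, "status": "ok"}
trial_2 = {"seed": 42, "mean": 0.498, "status": "ok"}
deviating, rate = compare_trials(trial_1, trial_2)
# deviating names the variables where the two runs diverge
```

A real implementation would also have to align evaluations across trials (loop iterations, different execution flows) rather than comparing by name alone, which is what makes the "Large" tasks above harder.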
&lt;h3 id="control-levels-of-provenance-collection">Control levels of provenance collection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Log experiments&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-lima/">Jesse Lima&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Add support for different levels of provenance collection in noWorkflow 2. Currently, noWorkflow 2 collects Python construct evaluations and all the dependencies among the evaluations. However, this collection is inefficient, since some of the collected provenance may not be necessary for end-users. This project should provide ways to temporarily disable provenance collection and to manually indicate the provenance in such situations.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Disable the collection inside specific functions (through decorators?)&lt;/li>
&lt;li>Disable the collection inside specific regions of the code (through with statements?)&lt;/li>
&lt;li>Collect only function activations in a region, instead of all variable dependencies&lt;/li>
&lt;li>Disable the collection of specific modules&lt;/li>
&lt;li>Design a DSL to express general dependencies for parts of the code where the collection is disabled&lt;/li>
&lt;/ul>
&lt;h3 id="upgrade-noworkflow-collection-to-support-new-python-constructs">Upgrade noWorkflow collection to support new Python constructs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Log experiments&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement new AST transformations for provenance collection. While noWorkflow 2 works with newer Python versions, most of its implementation targeted Python 3.7. Newer Python versions have new constructs whose provenance is currently ignored.&lt;/p>
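A toy sketch of the transformation approach: rewrite call expressions so that hypothetical hooks run before and after their evaluation (the hook names __before and __after are placeholders, not noWorkflow internals):

```python
import ast

class WrapCalls(ast.NodeTransformer):
    """Rewrite every call f(x) into __after(__before(), f(x))."""
    def visit_Call(self, node):
        self.generic_visit(node)  # transform nested calls first
        return ast.Call(
            func=ast.Name(id="__after", ctx=ast.Load()),
            args=[ast.Call(func=ast.Name(id="__before", ctx=ast.Load()),
                           args=[], keywords=[]),
                  node],
            keywords=[],
        )

events = []
def __before():
    events.append("before")
def __after(_, value):
    events.append("after")
    return value

tree = ast.fix_missing_locations(WrapCalls().visit(ast.parse("len('abc')")))
exec(compile(tree, "wrapped", "exec"), {"__before": __before, "__after": __after})
# events == ["before", "after"]: both hooks fired around the evaluation
```

The real work is writing one such transformation per construct (match statements, walrus expressions, etc.) and then emitting the right dependency edges from the hooks, rather than just logging events.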
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Identify which AST constructs implementations are missing&lt;/li>
&lt;li>Design AST transformations to execute functions before and after the evaluation of the constructs&lt;/li>
&lt;li>Create the dependencies for the new constructs&lt;/li>
&lt;/ul></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</guid><description>&lt;p>Scalable systems lay essential foundations of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. As the name suggests, scalability bugs manifest depending on the scale of a run, and thus their symptoms may only be observable in large-scale deployments, not in small or medium ones. For example, &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-6127" target="_blank" rel="noopener">Cassandra-6127&lt;/a> is a scalability bug detected in the popular distributed database Cassandra. The bug causes unnecessary CPU usage; however, the symptom is not observed unless ~1000 nodes are deployed. This demonstrates the main challenge of studying scalability bugs: they are extremely difficult to reproduce without deploying the system at a large scale.&lt;/p>
&lt;p>In this project, our goal is to build a dataset of &lt;strong>reproducible&lt;/strong> scalability bugs. To achieve this, we will go through existing bug reports for popular distributed systems, including Cassandra, HDFS, Ignite, and Kafka. For each bug report, we determine whether the reported bug depends on the scale of the run, such as the number of nodes utilized. For the collected scale-dependent bugs, we will then craft workloads to reproduce them. Our workloads will be designed to trigger certain functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance. For example, a successful reproduction should be able to show the performance drop as the number of nodes increases.&lt;/p>
&lt;h3 id="building-a-dataset-of-reproducible-scalability-bugs">Building a Dataset of Reproducible Scalability Bugs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Scalability systems, bug patterns, reproducibility, bug dataset&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux Shell, Docker, Java, Python&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalability system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that could demonstrate the symptoms of the bug under different scales.&lt;/p>
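The workload script's job can be sketched as a scale sweep; here run_workload is a hypothetical stand-in for the real (Docker-based) deployment and measurement, with a fake quadratic cost simulating a bug like Cassandra-6127:

```python
# Hypothetical driver: run the same workload at several scales and flag a
# cost that grows much faster than the node count.

def run_workload(nodes):
    # Placeholder for deploying the system and measuring per-op cost;
    # quadratic growth here simulates a scalability bug.
    return 0.5 * nodes * nodes

def looks_scale_dependent(scales, threshold=4.0):
    """True when cost grows at least threshold times faster than scale."""
    costs = [run_workload(n) for n in scales]
    growth = (costs[-1] / costs[0]) / (scales[-1] / scales[0])
    return growth >= threshold

# A 10x increase in nodes yields a 100x cost increase: flagged as a bug.
flagged = looks_scale_dependent([10, 100])
```

A real artifact would replace the placeholder with the workload shell script and record raw measurements, so that the symptom (and not just a pass/fail flag) is reproducible.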
&lt;h4 id="specific-tasks">Specific Tasks&lt;/h4>
&lt;ul>
&lt;li>Work with the mentors to understand the context of the project.&lt;/li>
&lt;li>Learn the background of scalability systems.&lt;/li>
&lt;li>Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.&lt;/li>
&lt;li>Craft shell scripts to trigger the exact scalability bug described by the bug report.&lt;/li>
&lt;li>Organize the reproducible scalability bugs and write documentation to build the code
and trigger the bug.&lt;/li>
&lt;/ul></description></item><item><title>Strengthening Underserved Segments of the Open Source Pipeline</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus/</guid><description>&lt;p>Contributing to an open source project offers novices the opportunity to join a community of practitioners, build a technical portfolio, gain experience with industry tools and technologies, and have real-world impact. This project seeks to invite and support broader, more diverse participation in open source by supporting &lt;em>early contributors&lt;/em> – especially those who have been historically minoritized within tech.&lt;/p>
&lt;p>This work builds upon a number of existing projects with similar or overlapping goals. Some examples:&lt;/p>
&lt;ul>
&lt;li>The &lt;a href="http://teachingopensource.org" target="_blank" rel="noopener">Teaching Open Source (TOS) community&lt;/a>, which brings together instructors teaching open source&lt;/li>
&lt;li>The &lt;a href="http://foss2serve.org/index.php/POSSE" target="_blank" rel="noopener">Professors&amp;rsquo; Open Source Software Experience (POSSE) workshops and wiki&lt;/a>, for faculty teaching - or wanting to teach - open source&lt;/li>
&lt;li>Internships such as &lt;a href="https://summerofcode.withgoogle.com" target="_blank" rel="noopener">Google Summer of Code (GSoC)&lt;/a>, &lt;a href="https://www.outreachy.org" target="_blank" rel="noopener">Outreachy&lt;/a>, and the &lt;a href="https://fellowship.mlh.io" target="_blank" rel="noopener">MLH Fellowship&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://campus.openhatch.org" target="_blank" rel="noopener">Open Source Comes to Campus&lt;/a>, offering student workshops on tools and culture &lt;em>[no longer active]&lt;/em>&lt;/li>
&lt;li>&lt;a href="https://codein.withgoogle.com/archive/" target="_blank" rel="noopener">Google Code-in&lt;/a>, inviting pre-university students to make open source contributions &lt;em>[no longer active]&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>This project will investigate gaps in currently available resources/programs and seek to address them, beginning with the exploration of engaging high school students with open source. Depending on early findings, this project could also entail the development of resources for independent learners and/or mentors.&lt;/p>
&lt;h3 id="learning-resource-development--repository-building">Learning Resource Development + Repository-Building&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Education&lt;/code>, &lt;code>Broadening Participation&lt;/code>, &lt;code>Mentorship and Support&lt;/code>, &lt;code>Community Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> independent research, communication, organization, GitHub/Markdown, basic web programming (HTML, CSS, JavaScript)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Novice to Intermediate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a>, &lt;a href="mailto:davis@soe.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/nandini-saagar/">Nandini Saagar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As an early contributor to this project, you will help gather information to inform the project direction – and then help bring it to life!&lt;/p>
&lt;p>Possible tasks:&lt;/p>
&lt;ul>
&lt;li>Meet with teachers and/or community members to identify new opportunities to engage with students (e.g. outside-of-school workshops, classroom visits, materials for teachers to use independently)&lt;/li>
&lt;li>Evaluate and test existing learning activities with a high school audience in mind (e.g. consider necessary pre-requisites, time required, ideal activity format)&lt;/li>
&lt;li>Evaluate and organize existing resources for newcomers (e.g. &lt;a href="https://up-for-grabs.net/#/" target="_blank" rel="noopener">Up For Grabs&lt;/a>, &lt;a href="https://hacktoberfest.com" target="_blank" rel="noopener">Hacktoberfest&lt;/a>, internship/fellowship opportunities)&lt;/li>
&lt;li>Help design and pilot new learning activities and/or workshops&lt;/li>
&lt;li>Assist in curating an open source repository of the aforementioned resources&lt;/li>
&lt;li>Conduct outreach to our target communities (e.g. brainstorm a catchy repository name, compose inviting and inclusive emails, design visual project elements)&lt;/li>
&lt;li>Share your own input and perspective on what it&amp;rsquo;s like to be a newcomer to open source!&lt;/li>
&lt;/ul></description></item><item><title>LabOP - an open specification for laboratory protocols, that solves common interchange problems stemming from variations in scale, labware, instruments, and automation.</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop/</link><pubDate>Mon, 06 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop/</guid><description>
&lt;h3 id="project-idea-1-software-hardware-and-wetware-building-labop-with-simultaneous-language--protocol-development--test-executions">Project idea 1: Software, hardware, and wetware building LabOP with simultaneous language &amp;amp; protocol development &amp;amp; test executions&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Software standard development, Laboratory automation, Biology&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Semantic Web Technologies (RDF, OWL), interest to think about describing biological &amp;amp; chemical laboratory processes&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong>
&lt;ol>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h4 id="about-the-laboratory-open-protocol-language-labop">About: The Laboratory Open Protocol Language (LabOP)&lt;/h4>
&lt;p>&lt;strong>See link: &lt;a href="https://bioprotocols.github.io/labop/" target="_blank" rel="noopener">https://bioprotocols.github.io/labop/&lt;/a>&lt;/strong>&lt;/p>
&lt;p>LabOP is an &lt;em>open&lt;/em> specification for laboratory protocols that solves common interchange problems stemming from variations in scale,
labware, instruments, and automation. LabOP was built from the ground up to support protocol interchange. It provides an extensible
library of protocol primitives that capture the control and data flow needed for everything from simple calibration and culturing
protocols to industrial control.&lt;/p>
&lt;h5 id="software-ecosystem">Software Ecosystem&lt;/h5>
&lt;p>LabOP&amp;rsquo;s rich representation underpins an ecosystem of several powerful software tools, including:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.github.com/bioprotocols/labop" target="_blank" rel="noopener">labop&lt;/a>: the Python LabOP library, which supports:
&lt;ul>
&lt;li>&lt;em>Programming&lt;/em> LabOP protocols in Python,&lt;/li>
&lt;li>&lt;em>Serialization&lt;/em> of LabOP protocols conforming to the LabOP RDF specification,&lt;/li>
&lt;li>&lt;em>Execution&lt;/em> in the native LabOP semantics (rooted in the UML activity model),&lt;/li>
&lt;li>&lt;em>Specialization&lt;/em> of protocols to 3rd-party protocol formats (including Autoprotocol, OpenTrons, and human-readable formats), and&lt;/li>
&lt;li>&lt;em>Integration&lt;/em> with instruments (including OpenTrons OT2, Echo, and SiLA-based automation).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://www.github.com/bioprotocols/laboped" target="_blank" rel="noopener">laboped&lt;/a>: the web-based LabOP Editor, which supports:
&lt;ul>
&lt;li>&lt;em>Programming&lt;/em> LabOP protocols quickly with low-code visual scripts,&lt;/li>
&lt;li>&lt;em>Storing&lt;/em> protocols on the cloud,&lt;/li>
&lt;li>&lt;em>Exporting&lt;/em> protocol specializations for use in other execution frameworks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="about-the-bioprotocols-working-group">About the Bioprotocols Working Group&lt;/h4>
&lt;p>The Bioprotocols Working Group is an open community organization developing a free and open standard for representation of biological
protocols.&lt;/p>
&lt;p>To join the Bioprotocols Working Group:&lt;/p>
&lt;ul>
&lt;li>Join the community mailing list at: &lt;a href="https://groups.google.com/g/bioprotocols" target="_blank" rel="noopener">https://groups.google.com/g/bioprotocols&lt;/a>&lt;/li>
&lt;li>Join the &lt;code>#collab-bioprotocols&lt;/code> channel on the &lt;a href="https://bitsinbio.org/" target="_blank" rel="noopener">Bits in Bio&lt;/a> Slack.&lt;/li>
&lt;/ul>
&lt;h5 id="leadership">Leadership&lt;/h5>
&lt;p>&lt;em>Elected Term: August 24th, 2022 - August 23rd, 2023&lt;/em>&lt;/p>
&lt;p>&lt;strong>Chair:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a> (SIFT)&lt;/p>
&lt;p>&lt;strong>Finance Committee:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="mailto:jeremy.cahill@metamerlabs.io">Jeremy Cahill (Metamer Labs)&lt;/a>&lt;/li>
&lt;li>&lt;a href="mailto:mark.doerr@uni-greifswald.de">Mark Doerr (University of Greifswald)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a> (UCSD)&lt;/li>
&lt;/ul>
&lt;h5 id="governance">Governance&lt;/h5>
&lt;p>&lt;em>Approved by community vote on August 16th, 2022&lt;/em>&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://bioprotocols.github.io/labop/about#Governance" target="_blank" rel="noopener">https://bioprotocols.github.io/labop/about#Governance&lt;/a>&lt;/strong>&lt;/p>
&lt;h5 id="mission">Mission:&lt;/h5>
&lt;p>The Bioprotocols Working Group is an open community organization developing free and open standards for representation of biological
protocols. In support of that goal, the organization also develops tools and practices and works with other organizations to
facilitate dissemination and adoption of these standards.&lt;/p>
&lt;p>As an organization, the Bioprotocols Working Group holds the following values:&lt;/p>
&lt;ul>
&lt;li>The standards developed by the community should be available under permissive free and open licenses.&lt;/li>
&lt;li>Technical decisions of the community should be made following open and inclusive processes.&lt;/li>
&lt;li>The community is strengthened by fostering a culture of diversity and inclusion, in which all constructive participants feel
comfortable making their voices heard.&lt;/li>
&lt;/ul></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/utexas/gpuemulator/</link><pubDate>Sun, 05 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/utexas/gpuemulator/</guid><description>&lt;p>Deep Neural Networks (DNNs) have achieved success in many machine learning (ML) tasks, including image recognition, video classification, and natural language processing. Nonetheless, training DNN models is highly computation-intensive and usually requires running complex computations on GPUs, which are expensive and scarce resources. As a result, much research on DNN training is delayed by the lack of access to GPUs. However, many research prototypes don&amp;rsquo;t require GPUs themselves but only the performance profiles of GPUs. For example, research on DNN training storage systems doesn’t need to run real computations on GPUs; it only needs to know how much time each GPU computation will take. Meanwhile, GPU performance in DNN training is predictable and reproducible, as every batch of training performs a deterministic sequence of mathematical operations on a fixed amount of data.&lt;/p>
&lt;p>Therefore, in this project we seek to build a GPU emulator platform on PyTorch to easily reproduce DNN training without using real GPUs. We will measure the performance profiles of GPU computations for different models, GPU types, and batch sizes. Based on the measured GPU performance profiles, we will build a platform to emulate GPU behavior and reproduce DNN training using CPUs only. We will make the platform and the measurements open-source, allowing other researchers to reproduce the performance measurements and easily conduct research on DNN training systems. We will also encourage the community to enrich the database by adding GPU performance measurements for their own models and GPU types. We will be the first to build and release this kind of GPU emulator for DNN training, and we believe researchers and the community can benefit greatly from it, especially as more GPU performance profiles are added by the community.&lt;/p>
&lt;h3 id="building-a-platform-to-emulate-gpu-performance-in-dnn-training">Building a platform to emulate GPU performance in DNN training&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> DNN training, reproducibility, GPU emulator, performance measurement&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, PyTorch, deep learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vijay-chidambaram/">Vijay Chidambaram&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yeonju-ro/">Yeonju Ro&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haoran-wu/">Haoran Wu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will measure the GPU performance profiles for different models and GPU types, based on which the student will build a platform to emulate the GPU behaviors and easily reproduce DNN training. The GPU performance measurements should be made open-source and reproducible for other researchers to reproduce results and add GPU profiles for their own needs.&lt;/p>
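&lt;p>As a rough illustration of the emulation idea (this is a sketch, not the project&amp;rsquo;s actual code), a profile could map each operation and batch size to a measured duration, and the emulator replays those durations on a CPU instead of running the GPU kernels. All operation names and timings below are hypothetical:&lt;/p>

```python
import time

# Hypothetical profile: measured per-op GPU times in milliseconds,
# keyed by (operation, batch_size). Real profiles would be collected
# on actual GPUs and shipped with the platform.
PROFILE = {
    ("forward", 32): 12.5,
    ("backward", 32): 25.0,
    ("optimizer_step", 32): 3.0,
}

def emulate_step(batch_size, ops=("forward", "backward", "optimizer_step")):
    """Replay one training step using recorded timings instead of a GPU."""
    total_ms = 0.0
    for op in ops:
        ms = PROFILE[(op, batch_size)]
        time.sleep(ms / 1000.0)   # stand-in for the real GPU computation
        total_ms += ms
    return total_ms

def emulate_epoch(num_batches, batch_size):
    """Emulated wall-clock time (ms) of one epoch, CPU-only."""
    return sum(emulate_step(batch_size) for _ in range(num_batches))

print(emulate_epoch(2, 32))   # 81.0
```

&lt;p>A real implementation would hook PyTorch&amp;rsquo;s dispatch so that each traced operation looks up its profile entry automatically, but the lookup-and-replay principle is the same.&lt;/p>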
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Study and get familiar with the PyTorch DNN training pipelines&lt;/li>
&lt;li>Measure GPU performance profiles for different DNN models and GPU types&lt;/li>
&lt;li>Based on the GPU performance measurements, build a platform to emulate the GPU behaviors and reproduce DNN training without using real GPUs&lt;/li>
&lt;li>Organize and document the code to make it reproducible for the community&lt;/li>
&lt;/ul></description></item><item><title>Reproduce and benchmark self-adaptive edge applications under dynamic resource management</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/</link><pubDate>Thu, 02 Feb 2023 00:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/</guid><description>&lt;p>With the flourishing of ideas like smart cities and smart manufacturing, a massive number of edge devices (e.g., traffic or security cameras, thermometers, flood sensors, etc.) are deployed and connected to the network to collect and analyze data across space and time, helping stakeholders like city governments or manufacturers optimize their plans and operations. Such a large number of edge devices, and the large amount of communication among the devices or to central servers, raises a big challenge: how to manage and schedule resources (i.e., network bandwidth between the devices and/or computing power on both edge devices and bare-metal servers) to ensure the running applications&amp;rsquo; capability of providing a reliable service. Furthermore, given the limited resources available to edge devices, there is a rising trend of reducing average compute and/or bandwidth usage by leveraging the uneven distribution of interesting events across both time and space in the input data. This brings further challenges for provisioning and managing the amount of resources available to the edge devices, as the running applications&amp;rsquo; resource demands can depend greatly on the input data, which is both dynamic and unpredictable.&lt;/p>
&lt;p>With these challenges in mind, the team previously designed and implemented a dynamic resource manager that can understand the applications and make decisions based on that understanding at run time. This understanding rests on a key insight: applications show different magnitudes of performance improvement or degradation in response to changes in the amount of resources available, depending on the input data and how many resources the applications currently have, which we define as the applications&amp;rsquo; sensitivities. However, such a resource manager has only been tested with a limited number and variety of video analytic applications. Hence, through the OSRE23 project, we aim to:&lt;/p>
&lt;ol>
&lt;li>reproduce other state-of-the-art self-adaptive video analytic applications,&lt;/li>
&lt;li>integrate the reproducible applications into the resource manager framework,&lt;/li>
&lt;li>compare the performance with and without resource manager.&lt;/li>
&lt;/ol>
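&lt;p>To make the notion of sensitivity concrete, a minimal sketch (with made-up numbers, not the team&amp;rsquo;s implementation) estimates it as the finite difference of measured performance with respect to the allocated resource:&lt;/p>

```python
def sensitivity(perf_curve, resource_level, delta=1):
    """Estimate an application's sensitivity at a given resource level:
    the change in performance per unit change in allocated resource.

    perf_curve maps a resource amount (e.g. CPU cores, Mbps) to a
    measured performance metric (e.g. frames processed per second).
    """
    lo = perf_curve[resource_level]
    hi = perf_curve[resource_level + delta]
    return (hi - lo) / delta

# Hypothetical measurements for one video-analytics pipeline:
# throughput (fps) observed at each CPU-core allocation.
fps_by_cores = {1: 4.0, 2: 9.0, 3: 11.0, 4: 11.5}

# The app is far more sensitive going from 1 to 2 cores than from
# 3 to 4, so a manager should prefer granting it the second core.
print(sensitivity(fps_by_cores, 1))  # 5.0
print(sensitivity(fps_by_cores, 3))  # 0.5
```

&lt;p>A real manager would measure these curves online and recompute sensitivities as the input data changes.&lt;/p>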
&lt;h3 id="reproducebenchmark-the-self-adaptive-video-analytic-applications-performance-under-dynamic-resource-management">Reproduce/benchmark the self-adaptive video analytic applications&amp;rsquo; performance under dynamic resource management&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Benchmark, Reproducibility, Video analytics, Machine Learning, Resource Management&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, TensorFlow&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:junchenj@uchicago.edu">Junchen Jiang&lt;/a>, &lt;a href="mailto:yuyangh@uchicago.edu">Yuyang Huang&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/faishal-zharfan/">Faishal Zharfan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integrate various types of video analytic applications into the aforementioned dynamic resource manager and reproduce/benchmark the applications&amp;rsquo; performance.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Reproduce state-of-the-art video analytic applications&lt;/li>
&lt;li>Integrate such applications into the resource manager framework&lt;/li>
&lt;li>Benchmark the video analytic applications&lt;/li>
&lt;li>Analyze the benchmarked performance results&lt;/li>
&lt;/ul></description></item><item><title>FlashNet: Towards Reproducible Data Science for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</guid><description>&lt;p>The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers: a person must know both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:&lt;/p>
&lt;ol>
&lt;li>How can we encourage data scientists to look into storage problems?&lt;/li>
&lt;li>How can we create a transparent platform that allows such decoupling?&lt;/li>
&lt;li>Within the storage/ML community can we create two collaborative communities, the storage engineers and the storage data scientists?&lt;/li>
&lt;/ol>
&lt;p>In the ML/deep learning community, the large ImageNet benchmark has spurred research in image recognition. Similarly, we would like to provide benchmarks that foster storage research in ML-based per-I/O latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. As a starting point for this big task, FlashNet has been built for I/O latency prediction. With FlashNet, data engineers can collect the I/O traces of various devices. Data scientists can then train ML models to predict I/O latency based on those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which utilizes Chameleon Trovi for better reproducibility.&lt;/p>
&lt;p>In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary-class and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world&amp;rsquo;s first end-to-end data science platform for storage systems.&lt;/p>
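&lt;p>As an illustrative sketch of the per-I/O prediction task (not FlashNet&amp;rsquo;s actual pipeline), one could label each I/O as fast or slow against a latency target and fit even a trivial one-feature classifier to a trace. The features, labels, and threshold logic below are entirely synthetic:&lt;/p>

```python
import random

random.seed(0)

# Synthetic stand-in for an I/O trace: each row has features that a
# FlashNet-style pipeline might extract (I/O size in KB, queue depth)
# and a binary label: 1 if the I/O's latency exceeded the target.
def make_trace(n=1000):
    rows = []
    for _ in range(n):
        size_kb = random.choice([4, 16, 64, 256])
        queue_depth = random.randint(1, 32)
        slow = 1 if (queue_depth > 16 or size_kb == 256) else 0
        rows.append((size_kb, queue_depth, slow))
    return rows

def train_stump(rows, feature_idx):
    """Pick the threshold on one feature that best separates slow I/Os."""
    best = (None, 0.0)
    for t in sorted(set(r[feature_idx] for r in rows)):
        correct = sum(1 for r in rows if (r[feature_idx] > t) == bool(r[2]))
        acc = correct / len(rows)
        if acc > best[1]:
            best = (t, acc)
    return best

trace = make_trace()
threshold, acc = train_stump(trace, feature_idx=1)  # queue depth
print(threshold, acc)  # a queue-depth cutoff and its training accuracy
```

&lt;p>Real FlashNet models are of course richer (neural binary/multiclass classifiers over many trace features), but the trace-in, predictor-out shape of the problem is the same.&lt;/p>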
&lt;h3 id="building-flashnet-platform">Building FlashNet Platform&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, machine learning, continual learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, PyTorch, Experienced with Machine Learning pipeline&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/justin-shin/">Justin Shin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/maharani-ayu-putri-irawan/">Maharani Ayu Putri Irawan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build an open-source platform to enable collaboration between storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models/architecture, dataset patterns, data preprocessing techniques, and various feature engineering strategies.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the FlashNet evaluation results from prior works.&lt;/li>
&lt;li>Build and improve FlashNet components based on the existing blueprint.&lt;/li>
&lt;li>Collect and analyze the FlashNet evaluation results.&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</guid><description>&lt;p>A high-throughput workflow execution system is needed to continuously gain insights from the increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit tail behavior that multiplies their execution time by up to 15x the median, resulting in weeks of execution.&lt;/p>
&lt;p>On the other hand, input files for genomic workflows often vary in quality due to differences in how they are collected. Prior works suggested that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.&lt;/p>
&lt;p>In this project, we seek to analyze a huge dataset from a large-scale genomics processing platform in order to gain insights on how ‘input quality’ affects genomic workflows’ execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, in particular for workflows that exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can ease some of the burdens commonly faced by operators of genomics systems and, at the same time, benefit future researchers working at the intersection of computer systems and genomics.&lt;/p>
&lt;h3 id="analyze-genomics-data-quality--build-exec-time-prediction-models">Analyze genomics data quality &amp;amp; build exec. time prediction models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> genomics, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, Matplotlib, Pandas/Numpy, any ML library&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that
correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomic workflows’ execution times.&lt;/p>
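&lt;p>A minimal sketch of the two analysis steps described above, using hypothetical numbers purely for illustration: flag tail executions relative to the median runtime, and correlate a candidate quality feature with runtime:&lt;/p>

```python
import statistics

def tail_workflows(runtimes, factor=3.0):
    """Flag executions whose runtime exceeds factor x the median."""
    med = statistics.median(runtimes)
    return [i for i, r in enumerate(runtimes) if r > factor * med]

def pearson(xs, ys):
    """Correlation between a candidate 'input quality' feature and runtime."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: read duplication rate (a FastQC-style metric)
# per run, and that run's execution time in hours.
dup_rate = [0.05, 0.10, 0.12, 0.20, 0.55]
hours = [2.0, 2.5, 2.4, 3.0, 30.0]

print(tail_workflows(hours))                # [4] -- the week-long outlier
print(round(pearson(dup_rate, hours), 2))   # strong positive correlation
```

&lt;p>The real analysis would sweep many FastQC/Picard/GATK metrics this way before feeding the most predictive ones into an ML model.&lt;/p>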
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Acquire basic understanding of genomics data processing &amp;amp; workflow execution (will be guided by the mentor)&lt;/li>
&lt;li>Reproduce past analysis &amp;amp; models built by prior members of the project&lt;/li>
&lt;li>Propose features from FastQC/Picard/GATK metrics that can be used as a predictor for execution time and tail behavior&lt;/li>
&lt;li>Write a brief analysis as to why those features might work&lt;/li>
&lt;li>Build ML models for predicting execution time&lt;/li>
&lt;li>Package the analysis in the form of Jupyter notebooks&lt;/li>
&lt;li>Package the models in a reloadable format (e.g., pickle)&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</guid><description>&lt;p>Massive storage systems rely heavily on erasure coding (EC) to protect data from drive failures and provide data durability. Existing storage systems mostly adopt single-level erasure coding (SLEC) to protect data, performing EC either at the network level or at the local level. However, both SLEC approaches have limitations: network-only SLEC introduces heavy network traffic overhead, and local-only SLEC cannot tolerate rack failures.&lt;/p>
&lt;p>Accordingly, some data centers are starting to use multi-level erasure coding (MLEC), a hybrid approach that performs EC at both the network level and the local level. However, prior EC research and evaluations mostly focused on SLEC, and it remains an open question how MLEC compares to SLEC in terms of durability, capacity overhead, encoding throughput, network traffic, and other overheads.&lt;/p>
&lt;p>Therefore, in this project we seek to build a platform to evaluate the durability and overheads of MLEC. The platform will allow us to evaluate dozens of EC strategies in many dimensions including recovery strategies, chunk placement choices, various parity schemes, etc. To the best of our knowledge, there is no other evaluation platform like what we propose here. We seek to make the platform open-source and the evaluation reproducible, allowing future researchers to benefit from it and conduct more research on MLEC.&lt;/p>
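&lt;p>One of the dimensions above, capacity overhead, is straightforward to sketch: an (n, k) code stores n chunks for every k chunks of user data, and in MLEC the network-level and local-level overheads multiply. The parameter choices below are hypothetical examples, not recommendations:&lt;/p>

```python
from fractions import Fraction

def capacity_overhead(n, k):
    """Raw-to-usable capacity ratio of a single-level (n, k) code:
    n chunks are stored for every k chunks of user data."""
    return Fraction(n, k)

def mlec_overhead(net, local):
    """MLEC stacks a network-level code on a local (per-node) code,
    so the capacity overheads multiply."""
    return capacity_overhead(*net) * capacity_overhead(*local)

# Hypothetical scheme: (8, 6) across racks, (10, 8) within each node.
net, local = (8, 6), (10, 8)
print(mlec_overhead(net, local))          # 5/3 of raw capacity per user byte
print(float(mlec_overhead(net, local)))   # about 1.67x
```

&lt;p>The planned platform goes much further, of course: durability and repair traffic depend on recovery strategy and chunk placement, which simple closed-form arithmetic like this cannot capture.&lt;/p>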
&lt;h3 id="building-a-platform-to-evaluate-mlec">Building a platform to evaluate MLEC&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, erasure coding, evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, C, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zhiyan-alex-wang/">Zhiyan &amp;quot;Alex&amp;quot; Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build a platform to evaluate the durability and overheads of MLEC. The platform will be able to evaluate different EC strategies in various dimensions including repair strategies, chunk placement choices, parity schemes, etc. Analyze the evaluation results.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the SLEC evaluation results from prior SLEC evaluation tools&lt;/li>
&lt;li>Based on prior SLEC evaluation tools, build a platform to evaluate the durability and overheads of MLEC under various EC strategies&lt;/li>
&lt;li>Collect and analyze the MLEC evaluation results&lt;/li>
&lt;/ul></description></item><item><title>Automatic Cluster Performance Shifts Detection Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</link><pubDate>Wed, 01 Feb 2023 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</guid><description>&lt;p>High-performance computing (HPC) clusters typically suffer from performance degradation over time. The heterogeneous nature of clusters and the inevitable defects in various infrastructure layers make performance inside them harder to predict. On the other hand, when events such as software upgrades happen, we might also observe performance improvement or degradation even though nothing changes in the hardware. Due to these uncertainties, it is necessary to send administrators early notification of changes in cluster performance within a specific time window, to inform scheduling decisions and increase cluster utilization.&lt;/p>
&lt;p>We are targeting HPC clusters that cater to heterogeneous compute- and I/O-intensive workloads, ranging from scientific simulation to AI model training, with a high degree of parallelism. In this scenario, we plan to use the open-source Darshan toolkit (&lt;a href="https://github.com/darshan-hpc/darshan" target="_blank" rel="noopener">https://github.com/darshan-hpc/darshan&lt;/a>) as the data collection and profiling tool around which we design our performance drift algorithms. Furthermore, we may incorporate the distribution shift detection into Darshan itself, making it able to notify HPC system administrators directly.&lt;/p>
&lt;p>Our goal is to show the efficacy of our algorithm by plotting the processed profiling data and highlighting the specific time windows in which performance shifts happened. Finally, we will package all our profiling data and experiment scripts in Jupyter notebooks, in particular on Chameleon Trovi, to help others reproduce our experiments.&lt;/p>
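&lt;p>As a rough sketch of the detection idea (a simple mean-shift test over sliding windows, not the algorithm we will ultimately design), each new window of profiled throughput can be compared against a baseline window; all numbers below are synthetic:&lt;/p>

```python
import statistics

def detect_shift(samples, window=20, z_threshold=4.0):
    """Flag window start indices where the window mean deviates from
    the baseline window's mean by more than z_threshold standard errors."""
    alerts = []
    base = samples[:window]
    mu = statistics.fmean(base)
    se = statistics.stdev(base) / (window ** 0.5)
    for start in range(window, len(samples) - window + 1, window):
        win = samples[start:start + window]
        z = abs(statistics.fmean(win) - mu) / se
        if z > z_threshold:
            alerts.append(start)
    return alerts

# Synthetic profile stream: throughput holds near 100 MB/s, then a
# software upgrade at sample 60 degrades it to around 80 MB/s.
stream = [100.0 + (i % 5) for i in range(60)] + [80.0 + (i % 5) for i in range(60)]
print(detect_shift(stream))   # [60, 80, 100] -- every window after the shift
```

&lt;p>A production detector would need an adaptive baseline and robustness to heavy-tailed, heterogeneous workloads, which is exactly where the statistical/ML design work in this project comes in.&lt;/p>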
&lt;p>Through this research, we seek to contribute the following:&lt;/p>
&lt;ul>
&lt;li>Designing an algorithm to detect performance shifts in HPC clusters that can be adapted for heterogeneous workloads&lt;/li>
&lt;li>Detecting performance shifts in real time without introducing significant overhead into the system&lt;/li>
&lt;li>Contributing to Darshan so that it can automatically detect performance changes while profiling the clusters&lt;/li>
&lt;/ul>
&lt;h3 id="automatic-and-adaptive-performance-shifts-detection">Automatic and Adaptive Performance Shifts Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Statistical Machine Learning, Deep Learning, and High-Performance Computing (HPC)&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, Statistics, good to have: Machine Learning, Deep learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> Sandeep Madireddy (&lt;a href="https://www.anl.gov/profile/sandeep-r-madireddy" target="_blank" rel="noopener">https://www.anl.gov/profile/sandeep-r-madireddy&lt;/a>, &lt;a href="http://www.mcs.anl.gov/~smadireddy/" target="_blank" rel="noopener">http://www.mcs.anl.gov/~smadireddy/&lt;/a> ), Ray Andrew Sinurat (&lt;a href="https://rayandrew.me" target="_blank" rel="noopener">https://rayandrew.me&lt;/a>)&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kangrui-wang/">Kangrui Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>All in all, these are the specific tasks that the student should do:&lt;/p>
&lt;ul>
&lt;li>Collaborate and work with mentors to understand the goal of this project.&lt;/li>
&lt;li>Implement distribution shift detection in pure statistical or machine/deep learning&lt;/li>
&lt;li>Deploy the algorithm and try to see its efficacy in the clusters.&lt;/li>
&lt;li>Package this experiment to make it easier for others to reproduce&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for VLSI Designs (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad/</link><pubDate>Wed, 01 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit, DARPA-funded, Google-sponsored project committed to creating low-cost and innovative Electronic Design Automation (EDA) tools and flows for IC design. Our mission is to democratize IC design, break down barriers of cost and access, and mitigate schedule risk through native, open-source innovation and collaboration with ecosystem partners. &lt;a href="https://github.com/The-OpenROAD-Project" target="_blank" rel="noopener">OpenROAD&lt;/a> provides an autonomous, no-human-in-the-loop, 24-hour RTL-GDSII flow for fast ASIC design exploration, QoR estimation, and physical implementation for a range of technologies above 12 nm. We welcome a diverse community of designers, researchers, enthusiasts, software engineers, and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact. OpenROAD has been used in &amp;gt;600 tapeouts across a range of ASIC applications, with a rapidly growing and diverse user community.&lt;/p>
&lt;h3 id="enhance-openroad-gui-flow-manager">Enhance OpenROAD GUI Flow Manager&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>GUI&lt;/code>, &lt;code>Visualization&lt;/code>, &lt;code>User Interfaces&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Qt&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="mailto:ethanmoon@google.com">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop custom features for analysis and visualizations in the &lt;a href="https://openroad.readthedocs.io/en/latest/main/src/gui/README.html" target="_blank" rel="noopener">OpenROAD GUI&lt;/a> to support native and third-party flows, including &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>, &lt;a href="https://github.com/The-OpenROAD-Project/OpenLane" target="_blank" rel="noopener">OpenLane&lt;/a>, and other third-party flows. Create documentation (commands, developer guide notes, and tutorials) to show GUI usage for supported flows.&lt;/p>
&lt;h3 id="profile-and-tune-openroad-flow-for-runtime-improvements">Profile and tune OpenROAD flow for Runtime improvements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>Flow Manager&lt;/code>, &lt;code>Runtime Optimization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge about Computational resource optimization, Cloud-based computation, Basic VLSI design and tools knowledge&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="mailto:ethanmoon@google.com">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test, analyze, and develop verifiable and reproducible strategies to improve runtimes in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>. These include optimization of computational resources over the cloud and tuning of algorithmic and design-flow parameters. Create test plans using existing or new designs to show runtime improvements.&lt;/p>
&lt;h3 id="update-openroad-documentation-and-tutorials">Update OpenROAD Documentation and Tutorials&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Documentation&lt;/code>, &lt;code>Tutorials&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design flow, tcl, shell scripts, Documentation, Markdown&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Review and update missing documentation and tutorials in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> for existing and new features. Here is an example Tutorial link: &lt;a href="https://openroad-flow-scripts.readthedocs.io/en/latest/tutorials/FlowTutorial.html" target="_blank" rel="noopener">https://openroad-flow-scripts.readthedocs.io/en/latest/tutorials/FlowTutorial.html&lt;/a> for reference.&lt;/p>
&lt;h3 id="lef-and-liberty-model-testing">LEF and Liberty Model Testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Testing&lt;/code>, &lt;code>LEF&lt;/code>, &lt;code>LIB&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design, LEF and LIB model abstracts, tcl, shell scripts, Verilog, Layout&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test the accuracy of generated LIB and LEF models for signoff in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> for flat and hierarchical design flows. Build test cases to validate and add to the regression suite.&lt;/p></description></item><item><title>Teaching Computer Networks with Reproducible Research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</guid><description>&lt;p>Lead Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>&lt;/p>
&lt;p>In the field of computer networks and wireless communication systems, the availability of open access networking and cloud computing testbeds (&lt;a href="https://portal.geni.net/" target="_blank" rel="noopener">GENI&lt;/a>, &lt;a href="https://cloudlab.us/" target="_blank" rel="noopener">CloudLab&lt;/a>, &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>, &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>, and others) has been transformative in promoting reproducible research &lt;em>and&lt;/em> in making high-quality experiential learning available to students and educators at a wide range of colleges and universities. This project seeks to unite research and education use of these testbeds by developing new ways of using reproducible research to teach computer networks and related topics.&lt;/p>
&lt;h3 id="bringing-foundational-results-into-the-classroom">Bringing foundational results into the classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;/ul>
&lt;p>To make foundational results from computer networks more concrete, this project seeks to reproduce a selection of key results and package them for use as interactive classroom demonstrations. (An example of a &amp;ldquo;foundational&amp;rdquo; result might be the result from the 1980s that motivates congestion control by showing how &lt;a href="http://dx.doi.org/10.1016/0169-7552%2889%2990019-6" target="_blank" rel="noopener">congestion collapse occurs when the network is under heavy load&lt;/a>.) This involves:&lt;/p>
&lt;ul>
&lt;li>Reproducing the original results on an open-access testbed&lt;/li>
&lt;li>Packaging the materials for use as a classroom demo, with interactive elements&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; related to the materials, that instructors may use in homework assignments or exams&lt;/li>
&lt;/ul>
&lt;h3 id="developing-a-classroom-competition-for-adaptive-video-delivery-policies">Developing a &amp;ldquo;classroom competition&amp;rdquo; for adaptive video delivery policies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, adaptive video, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/srishti-jaiswal/">Srishti Jaiswal&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>A carefully designed competition can be a fun and exciting way for students to challenge themselves and gain &amp;ldquo;ownership&amp;rdquo; of a new topic. This project builds on an existing open source &lt;a href="https://witestlab.poly.edu/blog/adaptive-video-reproducing/" target="_blank" rel="noopener">reproducible result&lt;/a> for adaptive video delivery, and will challenge students to extend this work and design their own adaptive video policies for head-to-head competition against their classmates. This includes:&lt;/p>
&lt;ul>
&lt;li>Packaging the result to make it easier for students to reproduce and then build on the original work&lt;/li>
&lt;li>Implementing other adaptive video policies from the literature, so that students can use them as a baseline&lt;/li>
&lt;li>Developing different network settings (using live link traces and emulated link patterns) in which student submissions may be evaluated&lt;/li>
&lt;li>Developing an evaluation framework for scoring student submissions on different criteria and in different network settings, and making the results available in a leaderboard format&lt;/li>
&lt;/ul></description></item><item><title>Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/p>
&lt;p>The computer science and engineering classroom is an essential part of the reproducibility &amp;ldquo;ecosystem&amp;rdquo; - because of its broad reach and potential for big impact, and because, for many students, the classroom is their first exposure to research in their field. For machine learning in particular, reproducibility is an important element of the research culture, and can be a valuable part of any introductory or advanced course in the field. These projects will develop highly interactive open educational resources that may be adopted by instructors of graduate or undergraduate machine learning courses to incorporate more instruction about reproducibility and reproducible research.&lt;/p>
&lt;h3 id="introducing-levels-of-reproduction-and-replication-in-ml">Introducing &amp;ldquo;levels&amp;rdquo; of reproduction and replication in ML&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In machine learning, replicating a published result to confirm the validity of the experimental results and the broader conclusions of the paper can take several forms, with increasing levels of effort:&lt;/p>
&lt;ul>
&lt;li>running the model on the same benchmarks as the original paper, using the authors&amp;rsquo; code and pre-trained weights,&lt;/li>
&lt;li>training a model using authors&amp;rsquo; code and published hyperparameters,&lt;/li>
&lt;li>training a model using authors&amp;rsquo; code and a new hyperparameter search,&lt;/li>
&lt;li>validating the authors&amp;rsquo; code e.g. with unit tests, in addition to training,&lt;/li>
&lt;li>re-implementing the model,&lt;/li>
&lt;li>designing additional experiments to validate that the suggested mechanism is in fact responsible for the result,&lt;/li>
&lt;li>and more.&lt;/li>
&lt;/ul>
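&lt;p>Even at the first level, deciding whether a run &amp;ldquo;reproduces&amp;rdquo; a result requires an agreement criterion. The following hypothetical Python helper (not part of any project deliverable) sketches one way to compare reproduced benchmark metrics against reported ones within a relative tolerance; the threshold is illustrative, and an appropriate value depends on the benchmark&amp;rsquo;s variance:&lt;/p>

```python
def within_tolerance(reported, reproduced, rel_tol=0.02):
    """Check whether a reproduced metric agrees with the reported one
    within a relative tolerance (2% here is an illustrative default,
    not a community standard)."""
    if reported == 0:
        return abs(reproduced) <= rel_tol
    return abs(reproduced - reported) / abs(reported) <= rel_tol

def compare_results(reported, reproduced, rel_tol=0.02):
    """Compare dicts of {benchmark: metric} and report which benchmarks
    replicate within tolerance; missing benchmarks count as failures."""
    return {
        bench: within_tolerance(reported[bench],
                                reproduced.get(bench, float("nan")),
                                rel_tol)
        for bench in reported
    }
```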
&lt;p>This project will develop interactive materials (using one or more exemplar published results) to illustrate and to highlight relevant aspects and pitfalls of each of these &amp;ldquo;levels&amp;rdquo; of reproduction and replication.&lt;/p>
&lt;h3 id="packaging-existing-reproducible-results-for-the-ml-classroom">Packaging existing reproducible results for the ML classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shekhar/">Shekhar&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jonathan-edwin/">Jonathan Edwin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal is to make it easier for instructors to expose students to state-of-the-art research in the classroom. This project will work with an existing set of recent reproducible results in machine learning, and will package them for easier consumption by students and more effective use in the classroom. This may include, but is not necessarily limited to:&lt;/p>
&lt;ul>
&lt;li>Re-validating the result and re-packaging along with computational environment on an open access testbed&lt;/li>
&lt;li>Creating tutorial material around the result, including interactive visualizations to demonstrate key elements of the work&lt;/li>
&lt;li>Creating one-click demos for applying the model/technique to a new test sample&lt;/li>
&lt;li>Curating test samples to highlight important advantages and limitations of the result&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; that instructors may use to &amp;ldquo;assign&amp;rdquo; the work to students&lt;/li>
&lt;/ul></description></item><item><title>Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz/</link><pubDate>Mon, 09 Jan 2023 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz/</guid><description>&lt;p>Reproducibility and Artifact Evaluation efforts have focused on reproducing the results, but not necessarily on storing, visualizing and making the results accessible. This set of projects builds the initial building blocks to log, capture, and visualize experiments.&lt;/p>
&lt;h3 id="experiment-log">Experiment Log&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Provide tools to log experiments&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Simple&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop client- and server-side tools to start/stop an experiment and timestamp it. Document each iteration of the experiment, and create a database to visualize the log of experiments.&lt;/p>
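&lt;p>As a rough sketch of what such a tool could look like, the following Python snippet logs start/stop timestamps for experiments into a SQLite database. The class name and schema are hypothetical, chosen only for illustration; a real implementation would split this into a client and a server and record each iteration separately:&lt;/p>

```python
import sqlite3
import time

class ExperimentLog:
    """Minimal experiment log: start/stop experiments with timestamps,
    backed by SQLite (hypothetical schema, for illustration only)."""

    def __init__(self, db_path=":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS experiments ("
            "id INTEGER PRIMARY KEY, name TEXT, "
            "started REAL, stopped REAL, notes TEXT)"
        )

    def start(self, name, notes=""):
        # Record the start timestamp and return an id for later stop().
        cur = self.db.execute(
            "INSERT INTO experiments (name, started, notes) VALUES (?, ?, ?)",
            (name, time.time(), notes),
        )
        self.db.commit()
        return cur.lastrowid

    def stop(self, exp_id):
        # Close out the experiment with a stop timestamp.
        self.db.execute(
            "UPDATE experiments SET stopped = ? WHERE id = ?",
            (time.time(), exp_id),
        )
        self.db.commit()

    def log(self):
        # The "visualization" here is just a row dump.
        return self.db.execute(
            "SELECT id, name, started, stopped FROM experiments"
        ).fetchall()
```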
&lt;h3 id="capture-hwsw-state--continuous-monitoring">Capture HW/SW state &amp;amp; continuous monitoring&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Record initial state&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Provide simple tools to gather the initial state of each experimental machine and its connected devices, configurations, software versions, &amp;hellip; Upload the data into the experiment log database and visualize it. Ideally, provide a diff function between experimental runs.&lt;/p>
&lt;p>In a second step, monitor the machine’s state during execution. This includes network, memory, CPU, and general OS statistics.&lt;/p>
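&lt;p>A minimal sketch of state capture and the suggested diff function, using only the Python standard library. The recorded fields are a small illustrative subset; a real tool would also gather devices, package versions, kernel parameters, and so on:&lt;/p>

```python
import platform
import sys

def capture_state():
    """Capture a (deliberately small) snapshot of the machine's HW/SW state.
    Fields here are illustrative only."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }

def diff_state(old, new):
    """Return {key: (old_value, new_value)} for every entry that changed
    between two experimental runs."""
    keys = set(old) | set(new)
    return {k: (old.get(k), new.get(k))
            for k in keys if old.get(k) != new.get(k)}
```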
&lt;h3 id="record-and-visualize-experimental-results">Record and visualize experimental results&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Record results in various formats and visualize them&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiayuan-zhu/">Jiayuan Zhu&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/krishna-madhwani/">Krishna Madhwani&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Experiments generate results in various formats (e.g., CSV, JSON, text files, …). The goal of this project is to provide tools to extract common formats, connect the results to the experiment log, and visualize them, ideally allowing comparison of different experimental runs. Initially, the project could dump its results into a Prometheus instance (&lt;a href="https://prometheus.io/" target="_blank" rel="noopener">https://prometheus.io/&lt;/a>), which would then become available for everyone to explore.&lt;/p></description></item><item><title>Polyphorm / PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</link><pubDate>Thu, 15 Dec 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="polyphy-infrastructure-engineering-and-practices">PolyPhy infrastructure engineering and practices&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>DevOps&lt;/code> &lt;code>Code Refactoring&lt;/code> &lt;code>CI/CD&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> fluidity in Python, experience with OOP, experience with building and packaging libraries, understanding GitHub and its tools ecosystem&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350+ hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/prashant-jha/">Prashant Jha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility in this project will be developing new infrastructure for the PolyPhy project, as well as maintaining the existing &lt;a href="https://github.com/PolyPhyHub/" target="_blank" rel="noopener">codebases&lt;/a>. This is a multifaceted role that will require coordination with the team and an active approach to understanding the technical needs of the community.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with the technical lead to develop effective interfaces for PolyPhy, providing access to its functionality on the level of both Python/Jupyter code and the command line.&lt;/li>
&lt;li>Maintain the existing &lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">codebase&lt;/a> and configure it according to the team&amp;rsquo;s needs.&lt;/li>
&lt;li>Develop and extend the current CI/CD functionality and related code metrics.&lt;/li>
&lt;li>Document the best practices related to the above.&lt;/li>
&lt;/ul>
&lt;h3 id="write-polyphys-technical-story-and-content">Write PolyPhy&amp;rsquo;s technical story and content&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Writing&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>Storytelling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experience writing structured text, well-read, technical or scientific education, webdev basics (preferably NodeJS)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integral to PolyPhy&amp;rsquo;s presentation is a &amp;ldquo;story&amp;rdquo; - a narrative understanding - that the users and the project contributors can relate to. Your responsibility will be to develop the written part of that understanding, as well as major portions of technical documentation that match it.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Write and edit diverse pages of the project &lt;a href="https://www.polyphy.io" target="_blank" rel="noopener">website&lt;/a>.&lt;/li>
&lt;li>Work with mentors to improve project&amp;rsquo;s written community practices (diversity, communication).&lt;/li>
&lt;li>Write and edit narrative and explanatory parts of PolyPhy&amp;rsquo;s documentation.&lt;/li>
&lt;li>Create tutorials that present core functionality of the toolkit.&lt;/li>
&lt;/ul>
&lt;h3 id="community-engagement-and-management">Community engagement and management&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Community Management&lt;/code> &lt;code>Social Media&lt;/code> &lt;code>Networking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> documented experience with the current social media landscape, sociable and well-spoken, ability to communicate technical concepts&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility will be to build and engage the community around PolyPhy. This includes its standing team and stakeholders, current expert users, potential adopters as well as the general public. The scope (size) of the project depends on the level of commitment during and beyond the Summer and is negotiable upfront.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Manage the team&amp;rsquo;s communication channels (Slack, Zoom, email) and maintain active presence therein.&lt;/li>
&lt;li>Develop social media presence for PolyPhy on Twitter, LinkedIn and other selected social media platforms.&lt;/li>
&lt;li>Manage and extend the online presence for the project, including its &lt;a href="https://polyphy.io" target="_blank" rel="noopener">website&lt;/a>, mailing list, and other applicable outreach activities.&lt;/li>
&lt;li>Research and engage with new communities that would benefit from PolyPhy, both as its expert users and contributors.&lt;/li>
&lt;/ul></description></item><item><title>FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/fastensor/" target="_blank" rel="noopener">FasTensor&lt;/a> is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used in scientific computing and are effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines. The FasTensor execution engine exploits the structural locality in the multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, and parallel execution.&lt;/p>
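&lt;p>To illustrate the stencil metaphor, here is a toy 1-D sketch in Python. It is not FasTensor&amp;rsquo;s actual C++ API, only a minimal model of applying a user-defined function over each neighborhood of an array:&lt;/p>

```python
def run_stencil(array, udf, radius=1, pad=0):
    """Apply a user-defined function to each stencil neighborhood of a
    1-D array (zero-padded at the edges). FasTensor generalizes this
    idea to parallel execution over multidimensional, partitioned
    arrays, handling I/O and communication automatically."""
    padded = [pad] * radius + list(array) + [pad] * radius
    return [udf(padded[i - radius:i + radius + 1])
            for i in range(radius, radius + len(array))]

# A 3-point moving average expressed as a stencil UDF:
smooth = run_stencil([1.0, 2.0, 3.0, 4.0], lambda w: sum(w) / len(w))
```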
&lt;h3 id="tensor-execution-engine-on-gpu">Tensor execution engine on GPU&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Tensor-based computing is needed by scientific applications and, increasingly, by advanced AI model training. Most tensor libraries are hand-customized and optimized for GPU, and most of them serve only one kind of application. For example, TensorFlow is optimized only for AI model training. Optimizing a generic tensor computing library on GPU can benefit a wide range of applications. FasTensor, a generic tensor computing library, currently works efficiently only on CPU. Running FasTensor on GPU is still unexplored. Research and development challenges include, but are not limited to: 1) how to maintain the structural locality of tensor data on GPU; 2) how to reduce the performance loss when the structural locality of a tensor is broken on GPU.&lt;/p>
&lt;ul>
&lt;li>Develop a mechanism to move user-defined computing kernels onto the GPU&lt;/li>
&lt;li>Evaluate the performance of the execution engine&lt;/li>
&lt;li>Document the execution mechanism&lt;/li>
&lt;li>Develop performance testing suite&lt;/li>
&lt;/ul>
&lt;h3 id="continuous-integration">Continuous Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (300 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="mailto:dbin@lbl.gov">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Develop a test suite for the public API of FasTensor&lt;/li>
&lt;li>Automate execution of the test suite&lt;/li>
&lt;li>Document the continuous integration process&lt;/li>
&lt;/ul></description></item><item><title>LiveHD (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd/</guid><description>&lt;p>Projects for &lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>.&lt;br>
Lead Mentors: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>.&lt;br>
Contributor(s): &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shahzaib-kashif/">Shahzaib Kashif&lt;/a>&lt;/p>
&lt;p>LiveHD is a &amp;ldquo;compiler&amp;rdquo; infrastructure for hardware design optimized for synthesis and simulation. The goal is to enable a more productive flow in which the ASIC/FPGA designer can work with multiple hardware description languages, such as CHISEL, Pyrope, or Verilog.&lt;/p>
&lt;p>There are several projects available around LiveHD. A longer explanation and more project options are available at
&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md" target="_blank" rel="noopener">projects&lt;/a>. Contact the
mentors to find a project that fits your interests.&lt;/p>
&lt;p>A sample of helpful projects:&lt;/p>
&lt;h3 id="mockturtle">Mockturtle&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Perform synthesis for graph in LiveHD using Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17, synthesis&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_large.md#medium-parallel-and-hierarchical-synthesis-with-mockturtle" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Mockturtle (&lt;a href="https://github.com/lsils/mockturtle" target="_blank" rel="noopener">https://github.com/lsils/mockturtle&lt;/a>) is a synthesis tool partially
integrated with LiveHD. The goal of this task is to iron out bugs and issues
and to use the LiveHD Tasks API to parallelize the synthesis.&lt;/p>
&lt;p>Main features:&lt;/p>
&lt;ul>
&lt;li>The current synthesis divides the circuit into partitions. Each partition can be synthesized in parallel.&lt;/li>
&lt;li>Support hierarchical synthesis to optimize across Lgraphs (cross-Verilog-module optimization)&lt;/li>
&lt;/ul>
&lt;p>The goal is to use Mockturtle (&lt;a href="https://github.com/lsils/mockturtle" target="_blank" rel="noopener">https://github.com/lsils/mockturtle&lt;/a>) with LiveHD. The main characteristics:&lt;/p>
&lt;ul>
&lt;li>Use mockturtle to tmap to LUTs&lt;/li>
&lt;li>Use mockturtle to synthesize (optimize) logic&lt;/li>
&lt;li>Enable cut-rewrite as an option&lt;/li>
&lt;li>Enable hierarchy cross optimization (hier:true option)&lt;/li>
&lt;li>Use the graph labeling to find clusters to optimize&lt;/li>
&lt;li>Re-timing&lt;/li>
&lt;li>Map to LUTs only gates and non-wide arithmetic. E.g., a 32-bit add is not mapped to LUTs, but a 2-bit add is.&lt;/li>
&lt;li>List of resources to not map:
&lt;ul>
&lt;li>Large ALUs. Large ALUs should have an OpenWare block (hardcoded in FPGAs and advanced adder options in ASIC)&lt;/li>
&lt;li>Multipliers and dividers&lt;/li>
&lt;li>Barrel shifters with non-trivial shifts (more than 1-2 bits) selectable at run-time&lt;/li>
&lt;li>Memories and LUTs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
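&lt;p>A back-of-the-envelope way to see why wide arithmetic is excluded: an n-bit adder viewed as a single lookup table has 2^(2n) entries. The following Python sketch (purely illustrative, not Mockturtle&amp;rsquo;s API) enumerates the truth table of a small adder:&lt;/p>

```python
from itertools import product

def adder_lut(n_bits):
    """Enumerate the truth table ("LUT contents") of an n-bit adder.
    The table has 2**(2*n_bits) entries, which is why only narrow
    arithmetic is worth mapping to LUTs; the sum includes the carry,
    so results occupy n_bits + 1 bits."""
    width = 2 ** n_bits
    return {(a, b): a + b for a, b in product(range(width), repeat=2)}

lut2 = adder_lut(2)   # 16 entries: feasible as a LUT
# adder_lut(32) would need 2**64 entries: clearly not mappable
```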
&lt;h3 id="livehd-console">LiveHD Console&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>LiveHD Console&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a console app that interacts with LiveHD to query parameters about designs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_small.md#medium-query-shell-not-lgshell-to-query-graphs" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>LiveHD currently uses replxx, but it is a no-longer-maintained shell/console library. As a result, it fails on newer versions of macOS.&lt;/p>
&lt;p>There is an alternative, Crossline (&lt;a href="https://github.com/jcwangxp/Crossline" target="_blank" rel="noopener">https://github.com/jcwangxp/Crossline&lt;/a>). This change affects main/main.cpp and nothing else.&lt;/p>
&lt;p>In addition to replacing the current console with one that supports auto-completion, the plan is to add &amp;ldquo;query&amp;rdquo; capabilities to visualize some of the LiveHD internals.&lt;/p>
&lt;ul>
&lt;li>Query bits, ports&amp;hellip; like
&lt;ul>
&lt;li>&lt;a href="https://github.com/rubund/netlist-analyzer" target="_blank" rel="noopener">https://github.com/rubund/netlist-analyzer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html" target="_blank" rel="noopener">https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>It would be cool if subsections (selected parts) could be visualized with something like &lt;a href="https://github.com/nturley/netlistsvg" target="_blank" rel="noopener">https://github.com/nturley/netlistsvg&lt;/a>&lt;/li>
&lt;li>The shell may be expanded to support simulation in the future&lt;/li>
&lt;li>Wavedrom/Duh dumps&lt;/li>
&lt;/ul>
&lt;p>Wavedrom and duh allow dumping bitfield information for structures. It would be interesting to explore dumping tables and bit fields for Lgraph IOs, and structs/fields inside the module. It may be a way to integrate with the documentation generation.&lt;/p>
&lt;p>Example of queries: show path, show driver/sink of, do topo traversal,&amp;hellip;.&lt;/p>
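&lt;p>As a sketch of what a &amp;ldquo;topo traversal&amp;rdquo; query could look like, here is a Python example over a netlist-like graph. The adjacency-dict representation is hypothetical (not LiveHD&amp;rsquo;s actual API), and the standard-library sorter even flags combinational loops for free:&lt;/p>

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical netlist: each node maps to the set of nodes driving it.
netlist = {
    "out": {"sum"},
    "sum": {"a", "b"},
    "a":   set(),
    "b":   set(),
}

def topo_order(drivers):
    """Return nodes in topological order (drivers before sinks).
    TopologicalSorter raises CycleError on a combinational loop,
    which is exactly the kind of integrity check a query shell
    could expose alongside path and driver/sink queries."""
    return list(TopologicalSorter(drivers).static_order())

order = topo_order(netlist)
```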
&lt;h3 id="compiler-error-generation-pass">Compiler error generation pass&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Lgraph and LNAST check pass&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a pass that check the integrity/correctness of Lgraph and LNAST&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Large 350 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_small.md#medium-diagnostics" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Create a pass that checks that the Lgraph (and/or LNAST) is semantically
correct. The LNAST already has quite a few tests (pass.semantic), but it can be
further expanded. Some checks:&lt;/p>
&lt;ul>
&lt;li>No combinational loops&lt;/li>
&lt;li>No mismatch in bit widths&lt;/li>
&lt;li>No disconnected nodes&lt;/li>
&lt;li>Check for inefficient splits (do not split buses that can be combined)&lt;/li>
&lt;li>Transformation stages should not drop names if the same net is preserved&lt;/li>
&lt;li>No writes in LNAST that are never read&lt;/li>
&lt;li>All edges must be legal; e.g., there is no pin &amp;lsquo;C&amp;rsquo; in a Sum_op&lt;/li>
&lt;/ul></description></item><item><title>Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc/</guid><description>&lt;p>The OSAVC is a vehicle-agnostic, open-source hardware and software project. It is designed to provide a real-time hardware controller adaptable to any vehicle type: aerial, terrestrial, marine, or extraterrestrial. It allows control researchers to develop state-estimation algorithms, sensor-calibration algorithms, and vehicle-control models in a modular fashion, such that once the hardware set has been developed, switching algorithms requires only modifying one C function and recompiling.&lt;/p>
&lt;p>Lead mentor: &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>&lt;/p>
&lt;p>Projects for the OSAVC:&lt;/p>
&lt;h3 id="vehiclecraft-sensor-driver-development">Vehicle/Craft sensor driver development&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Driver code to integrate sensor to a microcontroller&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C, I2C, SPI, UART interfaces&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>, &lt;a href="mailto:caiespin@ucsc.edu">Carlos Espinosa&lt;/a>, Pavlo Vlastos&lt;/li>
&lt;/ul>
&lt;p>Help develop sensor libraries for use in autonomous vehicles. We are particularly interested in sensors for UAVs: airspeed sensors (pitot tubes) and barometers, but also proximity detectors (ultrasonic) and range sensors. Code will be written in C using a state-machine methodology and non-blocking algorithms. Test the drivers on a Microchip microcontroller.&lt;/p>
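&lt;p>As a sketch of the non-blocking, state-machine style mentioned above, the toy model below illustrates the idea; the real driver would be C on the microcontroller, and the &lt;code>FakeBus&lt;/code> API here is invented purely for illustration.&lt;/p>

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    REQUEST = auto()
    WAIT = auto()
    READ = auto()

class SensorDriver:
    """Non-blocking driver: step() does a small amount of work and
    returns immediately, so the main loop is never stalled on I/O."""
    def __init__(self, bus):
        self.bus = bus
        self.state = State.IDLE
        self.sample = None

    def step(self):
        if self.state == State.IDLE:
            self.state = State.REQUEST
        elif self.state == State.REQUEST:
            self.bus.start_measurement()   # hypothetical bus API
            self.state = State.WAIT
        elif self.state == State.WAIT:
            if self.bus.data_ready():      # poll, never block
                self.state = State.READ
        elif self.state == State.READ:
            self.sample = self.bus.read_value()
            self.state = State.IDLE
        return self.sample

# A fake bus so the sketch runs without hardware.
class FakeBus:
    def __init__(self):
        self.ticks = 0
    def start_measurement(self):
        self.ticks = 0
    def data_ready(self):
        self.ticks += 1
        return self.ticks == 2   # conversion "finishes" after 2 polls
    def read_value(self):
        return 101.3             # e.g. a barometer reading in kPa

driver = SensorDriver(FakeBus())
for _ in range(6):
    driver.step()
print(driver.sample)
```

&lt;p>Because &lt;code>step()&lt;/code> never waits, several such drivers can be interleaved in one superloop, which is the usual pattern on a single-core microcontroller.&lt;/p>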
&lt;h3 id="technical-documentation">Technical Documentation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Documentation&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Technical writing, markdown language, website&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> Aaron Hunter/Carlos Espinosa/Pavlo Vlastos&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aniruddha-thakre/">Aniruddha Thakre&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Write a tutorial demonstrating how to start with an OSAVC and program it with the robotic equivalent of Hello World, moving on to more sophisticated applications. Create a web page interface to the OSAVC repo highlighting this tutorial. In this project you will start from scratch with an OSAVC PCB and bring it to life, documenting the process in a way that helps new users.&lt;/p>
&lt;h3 id="rosgazebo-robot-simulation">ROS/Gazebo Robot Simulation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Robot simulation with ROS/Gazebo&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong> ROS/Gazebo, Python&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>, &lt;a href="mailto:caiespin@ucsc.edu">Carlos Espinosa&lt;/a>, Pavlo Vlastos&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/damodar-datta-kancharla/">Damodar Datta Kancharla&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Generate a simulated world and a quadcopter model in ROS/Gazebo. Provide a link from MAVLink to ROS using the mavros package and simulate a real vehicle data stream to command the simulated quadcopter in Gazebo. At the same time, return the image stream from Gazebo to allow offline processing of the images with ML models.&lt;/p></description></item><item><title>Writing a blog about your OSRE 2023 project</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/admin/20221106-admin/</link><pubDate>Sun, 06 Nov 2022 11:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ucsc/admin/20221106-admin/</guid><description>&lt;p>Starting in 2023, the Organization Admins will be asking students and contributors to provide regular status updates, which will help us better highlight the work you are doing and track activities within our OSRE projects. These progress reports will also form the basis of blog reports prepared by students in the course of their summer. Blog reports should include links to proposals, presentations, reports, and an overview of the student&amp;rsquo;s experience.&lt;/p>
&lt;p>Your experience is invaluable for future OSRE candidates and for improving the program every year.&lt;/p>
&lt;h2 id="size-and-content">Size and content&lt;/h2>
&lt;p>Keep it short and crisp. Include a short description of your project, a link to your project proposal, and, later in the program, links to the GSoC reports you provided.&lt;/p>
&lt;h2 id="making-a-pull-request-for-your-blog">Making a pull request for your blog&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Fork the &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io" target="_blank" rel="noopener">git repository&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you haven&amp;rsquo;t already done so, add your profile using &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/osredocs/formentors/#instructions-for-adding-a-mentor">these instructions&lt;/a>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>IMPORTANT&lt;/strong>: Under &lt;code>user_groups:&lt;/code> add &lt;code>- 2023 Contributors&lt;/code> (as opposed to either of the two mentor groups)&lt;/li>
&lt;li>The short bio and any other information goes below the frontmatter&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Post your blog&lt;/p>
&lt;ul>
&lt;li>Add &lt;code>/content/report/osre23/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md&lt;/code>&lt;/li>
&lt;li>Add a frontmatter to &lt;code>index.md&lt;/code>, using the labels below&lt;/li>
&lt;li>Blog text goes below the frontmatter&lt;/li>
&lt;li>In that same directory include a picture and call it &lt;code>featured.png&lt;/code> (also supports &lt;code>.jpg&lt;/code>, &lt;code>.jpeg&lt;/code>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Commit to your fork, make a pull request, and &lt;a href="mailto:ospo-info-group@ucsc.edu">email the OSRE Admins&lt;/a> (currently: Stephanie Lieggi, Carlos Maltzahn).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="example-frontmatter-and-text-body">Example frontmatter and text body&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">title: &amp;#34;YOUR TITLE&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">subtitle: &amp;#34;YOUR SUBTITLE (OPTIONAL)&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">summary:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">authors:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">tags: [&amp;#34;osre23&amp;#34;]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">categories: []
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">date: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lastmod: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">featured: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">draft: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Featured image
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># To use, add an image named `featured.jpg/png` to your page&amp;#39;s folder.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">image:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> caption: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> focal_point: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> preview_only: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">As part of the [PROJECTNAME](/project/osre23/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Efficient Communication with Key/Value Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore/</link><pubDate>Sun, 27 Feb 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore/</guid><description>&lt;p>Network key value stores are used throughout the cloud as a storage backends (eg AWS ShardStore) and are showing up in devices (eg NVMe KV SSD). The KV clients use traditional network sockets and POSIX APIs to communicate with the KV store. An advancement that has occurred in the last 2 years is a new kernel interface that can be used in lieu of the POSIX API, namely &lt;code>io_uring&lt;/code>. This new interface uses a set of shared memory queues to provide for kernel-to-user communication and permits zero copy transfer of data. This scheme avoids the overhead of system calls and can improve performance.&lt;/p>
&lt;h3 id="implement-io_uring-communication-backend">Implement &lt;code>io_uring&lt;/code> communication backend&lt;/h3>
&lt;p>&lt;strong>Topics:&lt;/strong> performance, I/O, network, key-value, storage&lt;br>
&lt;strong>Difficulty:&lt;/strong> Medium&lt;br>
&lt;strong>Size:&lt;/strong> Medium or large (120 or 150 hours)&lt;br>
&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:philip.kufeldt@seagate.com">Philip Kufeldt (Seagate)&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aldrin-montana/">Aldrin Montana&lt;/a> (UC Santa Cruz)&lt;br>
&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manank-patel/">Manank Patel&lt;/a>&lt;/p>
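&lt;p>The shared-queue idea behind &lt;code>io_uring&lt;/code> can be sketched abstractly: the application batches requests into a submission queue, the kernel drains it and posts results to a completion queue, so a single kernel transition can cover many operations. The toy model below uses plain Python queues to stand in for the shared-memory rings; it is a conceptual sketch, not the liburing API, and the key-value operations are invented for illustration.&lt;/p>

```python
from collections import deque

sq = deque()   # submission queue (application side fills it)
cq = deque()   # completion queue (kernel side fills it)

def submit(op, key):
    """Application side: enqueue a request with no syscall at all."""
    sq.append((op, key))

def kernel_enter(store):
    """Models a single io_uring_enter()-style transition: drain every
    pending submission and post one completion per request."""
    syscalls = 1
    while sq:
        op, key = sq.popleft()
        if op == "get":
            cq.append(store.get(key))
        elif op == "put":
            store[key] = "value-for-" + key
            cq.append("ok")
    return syscalls

store = {}
for k in ("a", "b", "c"):
    submit("put", k)
submit("get", "b")
ncalls = kernel_enter(store)
print(ncalls, list(cq))
```

&lt;p>The point of the model is the ratio: four operations complete under one kernel transition, where a per-call socket/POSIX design would pay at least one syscall per operation.&lt;/p>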
&lt;p>Seagate has been using a network-based KV HDD as a research vehicle for computational storage. This research vehicle uses an open-source user library that implements a KV API by sending protobuf-based RPCs to a network KV store. Currently it is implemented with the standard socket and POSIX APIs to communicate with the KV backend. This project would implement an &lt;code>io_uring&lt;/code> communication backend and compare the results of both implementations.&lt;/p></description></item><item><title>DirtViz 2.0 (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</link><pubDate>Mon, 07 Feb 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</guid><description>&lt;p>DirtViz is a project to visualize data collected from sensors deployed in sensor networks. We have deployed a number of sensors measuring quantities such as soil moisture, temperature, current, and voltage in outdoor settings. This project involves extending our existing visualization stack, DirtViz 1.0 (see GitHub), and expanding it to version 2.0. The project goal is to create a fully-fledged dataviz tool tailored to the types of data collected from embedded-systems sensor networks.&lt;/p>
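&lt;p>One core interaction for such a tool is zooming a sensor time series to a date range. A minimal sketch with synthetic data (the samples and function names are invented for illustration; DirtViz&amp;rsquo;s actual stack may differ):&lt;/p>

```python
from bisect import bisect_left, bisect_right
from datetime import date, timedelta

# Synthetic daily soil-moisture samples (values are made up).
start = date(2023, 6, 1)
times = [start + timedelta(days=i) for i in range(10)]
values = [20.0 + i for i in range(10)]

def zoom(lo, hi):
    """Zoom to the closed date range [lo, hi] using binary search on the
    sorted timestamps, so re-rendering stays cheap for long deployments."""
    i = bisect_left(times, lo)
    j = bisect_right(times, hi)
    return times[i:j], values[i:j]

ts, vs = zoom(date(2023, 6, 3), date(2023, 6, 5))
print(len(ts), vs)
```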
&lt;h3 id="visualize-sensor-data">Visualize Sensor Data&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> javascript, python, bash, webservers, git, embedded systems&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large, 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:sonaderi@ucsc.edu">Sonia Naderi&lt;/a>, &lt;a href="mailto:sgtaylor@ucsc.edu">Stephen Taylor&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Refine our web-based visualization tools to easily allow users to zoom in on date ranges, change axes, etc.&lt;/li>
&lt;li>Create a system for remote collaborators/citizen scientists to upload their own data in a secure manner&lt;/li>
&lt;li>Craft an intuitive navigation system so that data from deployment sites around the world can be easily viewed&lt;/li>
&lt;li>Document the tool thoroughly for future maintenance&lt;/li>
&lt;li>If interested, we are also open to you investigating correlations between different data streams and doing self-directed data analysis&lt;/li>
&lt;/ul></description></item></channel></rss>