<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>osre24 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre24/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre24/index.xml" rel="self" type="application/rss+xml"/><description>osre24</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 31 Jan 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>osre24</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre24/</link></image><item><title>Final Blog: SS_Bench - Benchmarking SciStream</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240820-kraislaik/</link><pubDate>Fri, 31 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240820-kraislaik/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is Acheme, and I&amp;rsquo;m thrilled to have collaborated with my mentors &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/authors/chungmiranda/_index.md" target="_blank" rel="noopener">Joaquin Chung&lt;/a> and &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/authors/fcastro/_index.md" target="_blank" rel="noopener">Flavio Castro&lt;/a> under the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/scistream/">SciStream&lt;/a> project. This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.&lt;/p>
&lt;p>In the first half of the project, I focused on describing scientific streaming profiles based on use cases at Argonne National Laboratory. I developed Python scripts to generate bursty and constant-rate streaming traffic profiles.&lt;/p>
&lt;p>In the second half, I built on this foundation by running experiments with the traffic profiles and measuring latency, jitter, and throughput. These experiments used different message sizes across LAN and WAN network topologies.&lt;/p>
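&lt;p>As a rough illustration of how such metrics can be derived from per-message send and receive timestamps (a minimal sketch; the function and its jitter proxy are hypothetical, not SS_Bench&amp;rsquo;s actual code):&lt;/p>

```python
# Sketch: deriving latency, jitter, and throughput from per-message
# send/receive timestamps (illustrative only, not SS_Bench code).
def stream_metrics(send_times, recv_times, message_size_bytes):
    latencies = [r - s for s, r in zip(send_times, recv_times)]
    mean_latency = sum(latencies) / len(latencies)
    # simple jitter proxy: mean absolute difference of consecutive latencies
    jitter = sum(abs(b - a) for a, b in zip(latencies, latencies[1:])) / max(len(latencies) - 1, 1)
    duration = recv_times[-1] - send_times[0]
    throughput_bps = len(latencies) * message_size_bytes * 8 / duration
    return mean_latency, jitter, throughput_bps
```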
&lt;h2 id="key-achievements">Key Achievements&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Streaming Traffic Profile:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Developed scripts to generate streaming traffic profiles with configurable parameters.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Created an Artifact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>I created an artifact using a Jupyter notebook to document an easy-to-follow integration of SciStream with the FABRIC testbed for future experimenters.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
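&lt;p>The idea behind the profile scripts can be sketched as follows (a minimal illustration that outputs message send times; the function names are hypothetical and this is not the actual SS_Bench code):&lt;/p>

```python
# Sketch of constant-rate and bursty traffic profiles as lists of
# message send times (illustrative only).
def constant_profile(rate_msgs_per_s, duration_s):
    """Send times for a constant-rate stream."""
    gap = 1.0 / rate_msgs_per_s
    return [i * gap for i in range(int(duration_s * rate_msgs_per_s))]

def bursty_profile(burst_size, burst_gap_s, n_bursts, intra_gap_s=0.001):
    """Send times for bursts of back-to-back messages separated by idle gaps."""
    times = []
    for b in range(n_bursts):
        start = b * burst_gap_s
        times.extend(start + i * intra_gap_s for i in range(burst_size))
    return times
```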
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>The work demonstrated that SciStream adds tolerable overhead for secure data streaming, and that experimentation with this middlebox is possible on publicly available testbeds like FABRIC.
Future work includes a comparative analysis of SciStream&amp;rsquo;s performance with and without hardware acceleration or offloading.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>SciStream on FABRIC Demo:&lt;/strong> A demo of how to integrate SciStream with the FABRIC testbed: &lt;a href="https://www.youtube.com/watch?v=2NNAWPAreU8" target="_blank" rel="noopener">SciStream on FABRIC&lt;/a>.&lt;/li>
&lt;li>&lt;strong>Jupyter Notebook:&lt;/strong> An artifact on the FABRIC portal: &lt;a href="https://artifacts.fabric-testbed.net/artifacts/1d604943-b5c0-4046-9971-ffb8f2535e42" target="_blank" rel="noopener">FABRIC Artifact&lt;/a>.&lt;/li>
&lt;/ul></description></item><item><title>Final Report: Deriving Realistic Performance Benchmarks for Python Interpreters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20241113-mrigankpawagi/</link><pubDate>Tue, 12 Nov 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20241113-mrigankpawagi/</guid><description>&lt;p>Hi, I am Mrigank. As a &lt;em>Summer of Reproducibility 2024&lt;/em> fellow, I have been working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/">deriving realistic performance benchmarks for Python interpreters&lt;/a> with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a> from the University of Utah. In particular, we want to benchmark Meta&amp;rsquo;s Static Python interpreter (which is a part of their Cinder project) and compare its performance with CPython on different levels of typing. In this post, I will share updates on my work since my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240909-mrigankpawagi/">last update&lt;/a>. This post forms my final report for the &lt;em>Summer of Reproducibility 2024&lt;/em>.&lt;/p>
&lt;h2 id="since-last-time-typing-django-files">Since Last Time: Typing Django Files&lt;/h2>
&lt;p>Based on the profiling results from load testing a Wagtail blog site, I identified three modules in Django that were performance bottlenecks and added shallow types to them. These are available on our GitHub repository.&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/django/shallow/db/backends/sqlite3/_functions.py" target="_blank" rel="noopener">&lt;code>django.db.backends.sqlite3._functions&lt;/code>&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/django/shallow/utils/functional.py" target="_blank" rel="noopener">&lt;code>django.utils.functional&lt;/code>&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/django/shallow/views/debug.py" target="_blank" rel="noopener">&lt;code>django.views.debug&lt;/code>&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>I also wrote a &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Tool_shed/driver" target="_blank" rel="noopener">script&lt;/a> to mix untyped, shallow-typed, and advanced-typed versions of a Python module and create a series of such &lt;em>gradually typed&lt;/em> versions.&lt;/p>
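&lt;p>The mixing idea can be sketched as follows (an illustrative Python sketch; the names are hypothetical and this is not the actual driver script):&lt;/p>

```python
from itertools import product

# Sketch: given per-module typing variants, enumerate every gradually
# typed configuration of a set of modules (illustrative only).
def gradual_configs(modules, variants=("untyped", "shallow", "advanced")):
    """Yield one dict per configuration mapping module name to variant."""
    for combo in product(variants, repeat=len(modules)):
        yield dict(zip(modules, combo))
```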
&lt;h2 id="summary-of-experience-and-contributions">Summary of Experience and Contributions&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>I tried to set up different versions of Zulip to make them work with Static Python. My setup scripts are available in our &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Benchmark/zulip" target="_blank" rel="noopener">repository&lt;/a>. Unfortunately, Zulip&amp;rsquo;s Zerver did not run with Static Python due to the incompatibility of some Django modules. A few non-Django modules also initially threw errors when run with Static Python due to a &lt;a href="https://github.com/facebookincubator/cinder/issues/137" target="_blank" rel="noopener">bug in Cinder&lt;/a> – but I was able to get around it with a hack (which I have described in the linked GitHub issue I opened on Cinder&amp;rsquo;s repository).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>I created a &lt;em>locust-version&lt;/em> of the small Django-related benchmarks available in &lt;a href="https://github.com/python/pyperformance" target="_blank" rel="noopener">pyperformance&lt;/a> and &lt;a href="https://github.com/facebookarchive/skybison" target="_blank" rel="noopener">skybison&lt;/a>. This helped me confirm that Django by itself is compatible with Static Python, and helped me get started with Locust. This too is available in our &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Benchmark/django_sample" target="_blank" rel="noopener">repository&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>As described in the midterm report, I created a complete pipeline with Locust to simulate real-world load on a Wagtail blog site. The instructions and scripts for running these load tests as well as profiling the Django codebase are available (like everything else!) in our &lt;a href="https://github.com/utahplt/static-python-perf/tree/main/Benchmark/wagtail" target="_blank" rel="noopener">repository&lt;/a>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We added shallow types to the three Django modules mentioned above, and I created scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python module to create a series of &lt;em>gradually typed&lt;/em> versions to be tested for performance. We found that advanced-typed code may often be structurally incompatible with shallow-typed code, and we are looking for a solution. We are tracking some examples of this in a &lt;a href="https://github.com/utahplt/static-python-perf/issues/16" target="_blank" rel="noopener">GitHub issue&lt;/a>.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="going-forward">Going Forward&lt;/h2>
&lt;p>I had a great time exploring Static Python, typing in Python, load testing, and all other aspects of this project. I was also fortunate to have a helpful mentor along with other amazing team members in the group. During this project, we hit several roadblocks like the challenges in setting up real-world applications with Static Python and the difficulty in adding &lt;em>advanced&lt;/em> types – but are managing to work around them. I will be continuing to work on this project until we have a complete set of benchmarks and a comprehensive report on the performance of Static Python.&lt;/p>
&lt;p>Our work will continue to be open-sourced and available on our &lt;a href="https://github.com/utahplt/static-python-perf" target="_blank" rel="noopener">GitHub repository&lt;/a> for anyone interested in following along or contributing.&lt;/p></description></item><item><title>[Final Report] Automated Reproducibility Checklist support within StatWrap</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/</link><pubDate>Sat, 02 Nov 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/</guid><description>&lt;p>Namaste🙏🏻! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/adi-akhilesh-singh/">Adi Akhilesh Singh&lt;/a>, and I&amp;rsquo;m excited to share my final updates on the &lt;a href="https://drive.google.com/file/d/1xV7eHL9lIWGKueQJxBks6OB_rcXCr8JY/view?usp=sharing" target="_blank" rel="noopener">Reproducibility Checklists project&lt;/a> by StatWrap, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>This project introduces customizable reproducibility checklists in StatWrap, enabling metadata-driven and user-guided generation of checklists. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.&lt;/p>
&lt;h2 id="project-links">Project Links&lt;/h2>
&lt;p>Explore the StatWrap project repository and my contributions during GSoC &amp;lsquo;24:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/StatTag/StatWrap" target="_blank" rel="noopener">StatWrap&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/StatTag/StatWrap/tree/gsoc24" target="_blank" rel="noopener">GSoC &amp;lsquo;24 Contributions&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="progress-and-achievements">Progress And Achievements&lt;/h2>
&lt;p>During the timeline of this project, I worked on designing the interface for the checklist page and the data structure to support the project needs.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklist Interface" srcset="
/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_0e20d5ebd32af685d0d2ccea73085611.webp 400w,
/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_3b54d1a2b420d4de3f33b717849e243e.webp 760w,
/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/interface_hu5405d4f4fe0fcc5c29037ce596b14456_175744_0e20d5ebd32af685d0d2ccea73085611.webp"
width="760"
height="432"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
The interface was designed with user needs in mind, featuring components such as:&lt;/p>
&lt;ul>
&lt;li>A URLs component to manage external links or file URIs attached to the project.&lt;/li>
&lt;li>An Images component to display project image files.&lt;/li>
&lt;li>A Checklist Notes component to manage user-added notes.&lt;/li>
&lt;/ul>
&lt;p>All these assets (files, URLs, images) can be added to each checklist statement using the existing assets and external resources (URLs) present in the project.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Add Asset Dialog" srcset="
/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_046cbc4227c33853dde195b066b2af19.webp 400w,
/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_2076fca94822416fc8dfb0806ae54833.webp 760w,
/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/addasset_huf5f93b812eac7fe9e6235b66e18b25cf_152586_046cbc4227c33853dde195b066b2af19.webp"
width="760"
height="432"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Additionally, for each checklist item, StatWrap runs relevant scans to provide meaningful data based on its requirements. For example, for the item, “All the software dependencies for the project are documented,” StatWrap scans project files to list the languages and dependencies detected.
For each checklist statement supported in StatWrap, we implement methods to retrieve specific information by scanning project data. StatWrap currently supports six such checklist statements identified as foundational for ensuring research reproducibility.
Additionally, the checklist can be exported as a PDF summary, generated by StatWrap using the checklist data, with options to include notes.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklist Report" srcset="
/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_06eff861f558dd904b00349a9a2d2717.webp 400w,
/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_70ef332c2c3871d3a097995a59a7dd65.webp 760w,
/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20241102-adi/report_hu7a01407dc27c71052bc56e4eb6e3d4fb_270768_06eff861f558dd904b00349a9a2d2717.webp"
width="760"
height="432"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
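&lt;p>The kind of scan described above can be sketched as follows (purely illustrative Python for the dependency-documentation checklist item; StatWrap&amp;rsquo;s actual scanners live in its JavaScript codebase and differ in detail):&lt;/p>

```python
import re

# Illustrative sketch: collect imported top-level package names from
# Python source files, as evidence for the "dependencies documented"
# checklist item (not StatWrap's real scanner).
IMPORT_RE = re.compile(r"^\s*(?:import|from)\s+([A-Za-z_][\w]*)", re.MULTILINE)

def scan_dependencies(sources):
    """Map each file name to the set of top-level packages it imports."""
    return {name: set(IMPORT_RE.findall(text)) for name, text in sources.items()}
```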
&lt;h2 id="future-prospects">Future Prospects&lt;/h2>
&lt;p>As the project concludes, several areas for growth have emerged:&lt;/p>
&lt;ul>
&lt;li>Expanding language support within StatWrap. While StatWrap already includes key languages used in research, there is always scope to extend compatibility to even more technologies.&lt;/li>
&lt;li>Options to export a data-rich report that includes checklists and their associated scan results.&lt;/li>
&lt;/ul>
&lt;p>These and other enhancements, like adding new checklist statements with their scanning methods, will extend StatWrap’s impact on reproducibility in research.&lt;/p>
&lt;h2 id="earlier-blogs">Earlier Blogs&lt;/h2>
&lt;p>If you’re interested in seeing the project’s evolution, check out my earlier posts:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240614-adi/">Intro Blog&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/">MidTerm Blog&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Thank you for reading!&lt;/p></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/</link><pubDate>Fri, 18 Oct 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/syed-mohammad-qasim/">Syed Mohammad Qasim&lt;/a>, a PhD candidate at the Department of Electrical and Computer Engineering, Boston University.
This summer I worked on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/">ML-Powered Problem Detection in Chameleon&lt;/a>
as part of the Summer of Reproducibility (SoR) program with the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ayse-coskun/">Ayse Coskun&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>.&lt;/p>
&lt;p>Chameleon is an open testbed that has supported over 5,000 users working on more than 500 projects.
It provides access to over 538 bare metal nodes across various sites, offering approximately 15,000 CPU cores and 5 petabytes of storage.
Each site runs independent OpenStack services to deliver its offerings.
Currently, Chameleon Cloud comprehensively monitors the sites at the Texas Advanced Computing Center (TACC) and the University of Chicago.
Metrics are collected using Prometheus at each site and fed into a central Mimir cluster.
All logs are sent to a central Loki, with Grafana used for visualization and alerting.
Chameleon currently collects around 3,000 metrics. Manually reviewing and setting alerts for them is time-consuming and labor-intensive.
This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service to augment the existing alerting framework.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="High level data flow" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_deb097bd50da0d94a76fc0dc7719233e.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_deeb0941e942a319e1cc5a8b743b6993.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/ad_hubea58d1d850a610a2195fe38eece12fb_58199_deb097bd50da0d94a76fc0dc7719233e.webp"
width="760"
height="412"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Over the summer, we focused on analyzing the data and, after discussions with Chameleon operators, identified 33 key metrics from the Prometheus Node Exporter that serve as leading indicators of resource usage on the nodes. For example:&lt;/p>
&lt;ul>
&lt;li>CPU usage: Metrics like &lt;code>node_load1&lt;/code>, &lt;code>node_load5&lt;/code>, and &lt;code>node_load15&lt;/code>.&lt;/li>
&lt;li>Memory usage: Including buffer utilization.&lt;/li>
&lt;li>Disk usage: Metrics for I/O time and read/write byte rates.&lt;/li>
&lt;li>Network activity: Rate of bytes received and transmitted.&lt;/li>
&lt;li>Filesystem metrics: Such as &lt;code>inode_utilization_ratio&lt;/code> and &lt;code>node_procs_blocked&lt;/code>.&lt;/li>
&lt;li>System-level metrics: Including node forks, context switches, and interrupts.&lt;/li>
&lt;/ul>
&lt;p>Collected every 5 minutes, these metrics provide a comprehensive view of node performance and resource consumption.
After finalizing the metrics we wanted to monitor, we selected the following four anomaly detection methods, primarily for their popularity in academia and their publication at high-impact conferences such as SIGKDD and SC.&lt;/p>
&lt;ul>
&lt;li>OmniAnomaly (KDD 2019), without POT threshold selection, as it requires labels.&lt;/li>
&lt;li>USAD (KDD 2020).&lt;/li>
&lt;li>TranAD (KDD 2022).&lt;/li>
&lt;li>Prodigy (SC 2023), only the VAE, without their feature selection, as it requires labels.&lt;/li>
&lt;/ul>
&lt;p>We collected 75 days of healthy data from Chameleon, and after applying min-max scaling, we trained the models.
We then used these models to run inference on the metrics collected during outages, as marked by Chameleon operators.
The goal was to determine whether the outage data revealed something interesting or anomalous.
We can verify our approach by manually reviewing the results generated by these four anomaly detection methods.
Below are the results from the four methods on different outages, followed by an example of how these methods identified the root cause of an anomaly.&lt;/p>
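&lt;p>The evaluation loop can be sketched as follows (a minimal illustration of min-max scaling against the healthy training window and threshold-based flagging of reconstruction error; the actual models are the deep-learning methods listed above):&lt;/p>

```python
# Sketch: scale a metric with min/max learned from healthy data, then
# flag timesteps whose reconstruction error exceeds a threshold
# (illustrative stand-in for the model-based pipeline).
def min_max_scale(values, lo, hi):
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def flag_anomalies(observed, reconstructed, threshold):
    """Return 1 where the squared error exceeds the threshold, else 0."""
    return [1 if (o - r) ** 2 > threshold else 0 for o, r in zip(observed, reconstructed)]
```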
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Resulsts of different approaches" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_adb242a18524d714dae87d46b29e1612.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_9dcdbbc6bac285c06195f54d49bd5ffe.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/comparison_plot_huca4b68c1c2625c3b2e86230c54612ea2_129034_adb242a18524d714dae87d46b29e1612.webp"
width="760"
height="355"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The above figure shows the percentage of outage data that was flagged as anomalous by different models.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="cause of anomaly according to each model" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_6edd22782678b48ce3a7cebad859b982.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_3da0e020cdc4ddfd508b77b6a0adc3d2.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/partial-authentication-outage_plot_hu92456580e56fddf4f3c592621d13c105_392593_6edd22782678b48ce3a7cebad859b982.webp"
width="760"
height="532"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="cause of anomaly according to each model" srcset="
/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_03e6f344d24d9b37a7d615ee3207586b.webp 400w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_6193a4b7b9107cb2693435514d80d21d.webp 760w,
/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20241018-syed/chiuc-uplink-networking_plot_hu92456580e56fddf4f3c592621d13c105_376789_03e6f344d24d9b37a7d615ee3207586b.webp"
width="760"
height="532"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The above two plots show two examples of the top 5 metrics that contributed to the anomaly score for each anomaly detection model.&lt;/p>
&lt;p>Although the methods seem to indicate anomalies during outages, they are not able to pinpoint the affected service or the exact cause.
For example, the first partial authentication outage was due to a DNS error, which can manifest in various ways, such as reduced CPU, memory, or network usage.
This work is still in progress, and we are conducting the same analysis on container-level metrics for each service, allowing us to narrow the scope to the affected service and more effectively identify the root cause of anomalies.
We will share the next set of results soon.&lt;/p>
&lt;p>Thanks for your time, please feel free to reach out to me for any details or questions.&lt;/p></description></item><item><title>Data Leakage in Applied ML: model uses features that are not legitimate</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240924-shaivimalik/</link><pubDate>Tue, 24 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240924-shaivimalik/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>I have been working on reproducing the results from &lt;strong>Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches&lt;/strong>. This study aimed to distinguish COVID-19 cases from normal and pneumonia cases using chest X-ray images. Since my last blog post, we have successfully reproduced the results using the VGG19 model, achieving a 92% accuracy on the test set. However, a significant demographic inconsistency exists: normal and pneumonia chest X-ray images were from pediatric patients, while COVID-19 chest X-ray images were from adults. This allowed the model to achieve high accuracy by learning features that were not clinically relevant.&lt;/p>
&lt;p>In &lt;a href="https://github.com/shaivimalik/covid_illegitimate_features/blob/main/notebooks/Correcting_Original_Result.ipynb" target="_blank" rel="noopener">Reproducing “Identification of COVID-19 samples from chest X-Ray images using deep learning: A comparison of transfer learning approaches” without Data Leakage&lt;/a>, we followed the methodology outlined in the paper, but with a key change: we used datasets containing adult chest X-ray images. This time, the model achieved an accuracy of 51%, a 41% drop from the earlier results, confirming that the metrics reported in the paper were overly optimistic due to data leakage, where the model learned illegitimate features.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="GradCAM from husky vs wolf example " srcset="
/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_7bc94e0816aa962665434756bf41e27d.webp 400w,
/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_a160058d1708baa257daa63de5fada34.webp 760w,
/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_7bc94e0816aa962665434756bf41e27d.webp"
width="760"
height="329"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To further illustrate this issue, we created a &lt;a href="https://github.com/shaivimalik/covid_illegitimate_features/blob/main/notebooks/Exploring_ConvNet_Activations.ipynb" target="_blank" rel="noopener">toy example&lt;/a> demonstrating how a model can learn illegitimate features. Using a small dataset of wolf and husky images, the model achieved an accuracy of 90%. We then revealed that this performance was due to a data leakage issue: all wolf images had snowy backgrounds, while husky images had grassy backgrounds. When we trained the model on a dataset where both wolf and husky images had white backgrounds, the accuracy dropped to 70%. This shows that the accuracy obtained earlier was an overly optimistic measure due to data leakage.&lt;/p>
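&lt;p>A tiny numeric analogue of this leakage can be sketched as follows (purely illustrative; a classifier that keys on a spurious background feature scores well only while the correlation holds):&lt;/p>

```python
# Toy analogue of the husky-vs-wolf example: a "classifier" that only
# looks at background brightness (a spurious, illegitimate feature).
def background_classifier(sample):
    # predict "wolf" (1) whenever the background looks snowy (bright)
    return 1 if sample["background_brightness"] > 0.5 else 0

def accuracy(samples):
    hits = sum(1 for s in samples if background_classifier(s) == s["label"])
    return hits / len(samples)
```

On a leaky dataset where every wolf has a bright (snowy) background, this classifier looks accurate; equalize the backgrounds and its accuracy collapses, just as in the experiment above.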
&lt;p>You can explore our work on the COVID-19 paper &lt;a href="https://github.com/shaivimalik/covid_illegitimate_features" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Lastly, I would like to thank &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> for their support and guidance throughout my SoR journey.&lt;/p></description></item><item><title>Towards Scalable Performance Benchmarking of Genomics Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/</link><pubDate>Thu, 19 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/</guid><description>&lt;h2 id="project-background">Project Background&lt;/h2>
&lt;p>Optimizing the execution of genomics workflows on a large-scale &amp;amp; heterogeneous cluster requires an in-depth understanding of the resource requirements and utilization patterns of each application in the workflows. Such information can be obtained with a benchmarking tool. However, the performance data generated by such a tool should represent the scale of its target system, lest the design decisions made from it be misguided. My project aims to build &lt;em>GenScale&lt;/em>, the first benchmarking tool that can rapidly generate genomics workload performance data at a scale representative of production systems.&lt;/p>
&lt;p>As Summer of Reproducibility (SoR) 2024 comes to an end, I took the time to reflect on my time working on &lt;em>GenScale&lt;/em>, the challenges I faced, and the future work &amp;amp; impact I hope &lt;em>GenScale&lt;/em> creates for our community.&lt;/p>
&lt;h2 id="milestones--challenges">Milestones &amp;amp; Challenges&lt;/h2>
&lt;p>The time I spent working on &lt;em>GenScale&lt;/em> during SoR can be classified into three phases:&lt;/p>
&lt;p>&lt;strong>1. Per-Application Container &amp;amp; Input Creation.&lt;/strong>&lt;/p>
&lt;p>Containerization is the current de-facto standard for genomics workflow execution, so I designed &lt;em>GenScale&lt;/em> to execute applications as containers. This required me to package each application included in the benchmark as a container. I used state-of-the-art DNA-Seq &amp;amp; RNA-Seq alignment workflows as references for the list of applications &amp;amp; the workflow structure. The container images &amp;amp; source files I created are publicly available on GitHub &lt;a href="#deliverables">(Deliverables #1)&lt;/a>&lt;/p>
&lt;p>I also prepared sample inputs for each application to ease the burden on users who are not sufficiently familiar with genomics applications. The effort is not trivial, because in a workflow the inputs for a given step depend on the outputs of previous step(s). Simply put, to prepare inputs for the last application in a workflow, we need the outputs of the applications executed before it, which in turn require the outputs of another set of applications, and so on until we arrive at the beginning of the workflow. This translates into significant manual labor of carefully tracing &amp;amp; collecting intermediate files from each step of the reference workflows.&lt;/p>
&lt;p>All inputs are hosted in a public Google Drive and a ChameleonCloud object store &lt;a href="#deliverables">(Deliverables #2)&lt;/a>. In total, I prepared containers and inputs for 7 popular genomics applications: BWA, FastQC, Fastq Cleaner, GATK, Picard, STAR, and Trimmomatic.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_5d15908a9f03f47b787a549dbd280a24.webp 400w,
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_b606b0529a38b68c5979566b35e267ed.webp 760w,
/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/genscale-stack_hu2d5cfbf95523918b0bcbd89f95a37c1b_91166_5d15908a9f03f47b787a549dbd280a24.webp"
width="760"
height="353"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 1.&lt;/strong> Production-grade softwares used in GenScale: Kubernetes for task orchestration, and Prometheus + Grafana for real-time resource monitoring.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>2. Components Development.&lt;/strong>&lt;/p>
&lt;p>In this phase, the main components of &lt;em>GenScale&lt;/em> were developed. &lt;em>GenScale&lt;/em> consists of three components: (a) Workflow Manager, (b) Task Orchestrator, and (c) Resource Monitor. The Workflow Manager is built from scratch to allow a high degree of freedom when scheduling workflows. I use industry-grade solutions for the other components, namely Kubernetes for orchestrating tasks / containers, and Prometheus + Grafana for real-time resource monitoring. My deliverables include semi-automatic installation scripts &amp;amp; easy-to-follow instructions to set up all three components &lt;a href="#deliverables">(Deliverables #3)&lt;/a>.&lt;/p>
&lt;p>&lt;strong>3. Performance Data Generation.&lt;/strong>&lt;/p>
&lt;p>The last phase was to use the &lt;em>GenScale&lt;/em> prototype to generate performance data for each application. I focused on collecting data for three types of resources: compute (CPU utilization), memory (resident set size), and I/O (read &amp;amp; write operations over time). &lt;em>GenScale&lt;/em> exports this information into a single CSV file to facilitate easy analysis. My deliverables include performance data for the DNA-Seq and RNA-Seq workflows. I also provide a sample Python notebook which analyzes the CPU utilization pattern of each application in the DNA-Seq workflow. &lt;a href="#deliverables">(Deliverables #4)&lt;/a>&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_9d39d7375c21c3eae305d20af9a8b7ee.webp 400w,
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_b8d8ac52b9cb53496558934c8a2b441b.webp 760w,
/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240919-martinputra/dnaseq-cpu_util_hu80d53d27b8c7b822ba2a4a4a343ec503_499906_9d39d7375c21c3eae305d20af9a8b7ee.webp"
width="760"
height="614"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 2.&lt;/strong> CPU utilization pattern of 9 applications in DNA-Seq Alignment workflow collected by &lt;em>GenScale&lt;/em>. &lt;strong>y-axis&lt;/strong>: &lt;em>(num. cores) x 100%&lt;/em>, &lt;strong>x-axis&lt;/strong>: time elapsed in seconds.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
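&lt;p>As a rough illustration of the kind of analysis the sample notebook performs, the sketch below parses a GenScale-style CSV export with Python&amp;rsquo;s standard library. The column names and values here are invented placeholders, not &lt;em>GenScale&lt;/em>&amp;rsquo;s actual schema.&lt;/p>

```python
import csv
import io
from statistics import mean

# Hypothetical sample of a GenScale-style CSV export.
# Column names are illustrative assumptions, not the tool's real schema.
sample = """app,elapsed_s,cpu_util,rss_bytes,read_ops,write_ops
bwa,0,350.0,1073741824,120,10
bwa,15,410.5,2147483648,80,12
star,0,780.0,8589934592,300,40
star,15,820.5,9663676416,250,60
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Peak and mean CPU utilization per application
# (utilization is "(num. cores) x 100%", as in Figure 2).
for app in sorted({r["app"] for r in rows}):
    utils = [float(r["cpu_util"]) for r in rows if r["app"] == app]
    print(f"{app}: peak={max(utils):.1f}, mean={mean(utils):.1f}")
```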
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;p>This project&amp;rsquo;s deliverables can be found in the following Github repo: &lt;a href="https://github.com/martinluttap/sor24-genscale/tree/main" target="_blank" rel="noopener">https://github.com/martinluttap/sor24-genscale/tree/main&lt;/a>. In summary, the deliverables include:&lt;/p>
&lt;ol>
&lt;li>Container Images&lt;/li>
&lt;li>Input Dataset&lt;/li>
&lt;li>Source Code&lt;/li>
&lt;li>Performance Data &amp;amp; Sample Analysis Notebook&lt;/li>
&lt;/ol>
&lt;h2 id="future-works-broader-impacts">Future Works, Broader Impacts&lt;/h2>
&lt;p>Understanding workload characteristics is a crucial step in designing efficient scheduling policies &amp;amp; resource management techniques. &lt;em>GenScale&lt;/em> and the performance data it can generate might be a starting point for such efforts. Furthermore, I hope &lt;em>GenScale&lt;/em> will catalyze meaningful engagement between the computer systems community and the bioinformatics community. I believe state-of-the-art systems techniques can greatly aid the computing efforts of the bioinformatics community. Similarly, the domain-specific knowledge &amp;amp; problems within bioinformatics provide unique grounds for the systems community to further advance their field.&lt;/p></description></item><item><title>[Final] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240918-imzahra/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240918-imzahra/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>In my SoR 2024 project, the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep project&lt;/a>, I worked under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> to tackle the reproducibility challenges posed by scalability bugs in large-scale distributed systems. I’m excited to share the final progress and insights we’ve gathered. Below is a detailed summary of the investigations and findings we&amp;rsquo;ve conducted.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>As you may recall, our project, ScaleRep, aimed to tackle the challenge of scalability bugs—those insidious issues that often arise in large-scale distributed systems under heavy workloads. These bugs, when triggered, can lead to significant system issues such as downtime, performance bottlenecks, and even data loss. They are particularly difficult to catch using traditional testing methods.&lt;/p>
&lt;p>Our primary focus was on reproducing these bugs, documenting the challenges involved, and providing insights into how these bugs manifest under various conditions. This documentation will help researchers identify, benchmark, and resolve similar issues in the future.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>Since the midterm update, several Apache Ignite bugs have been investigated, some of which have been successfully reproduced and uploaded to Trovi for the research community to access and reuse. Below is the progress on the bugs investigated:&lt;/p>
&lt;h3 id="bugs-investigated">Bugs Investigated&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20602" target="_blank" rel="noopener">IGNITE-20602&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16582" target="_blank" rel="noopener">IGNITE-16582&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16581" target="_blank" rel="noopener">IGNITE-16581&lt;/a>&lt;/strong>&lt;/li>
&lt;/ol>
&lt;h2 id="key-insights--challenges">Key Insights &amp;amp; Challenges&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>Complexity of Scalability Bugs
Many scalability bugs involve subtle and complex interactions that are not easily detected in standard testing environments. For instance, IGNITE-20602 only manifested under certain high-load conditions and required a specific workload and environment to reliably trigger the issue. This highlights the importance of large-scale testing when investigating scalability issues.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Dependency and Documentation Gaps
We encountered significant challenges with outdated dependencies and incomplete documentation, particularly in older bugs like IGNITE-16072. In these cases, reproducing the bug required extensive modifications or wasn’t feasible without investing disproportionate effort in updating dependencies.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Effectiveness of Trovi and Chameleon
Packaging and sharing our reproducible investigations through Trovi and Chameleon have proven highly effective. By providing researchers with pre-configured environments and detailed documentation, we’ve laid the groundwork for future collaboration and further research on these bugs. We expect this to greatly benefit others attempting to reproduce similar issues.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Impact of Speed-Based Throttling
Our investigation into IGNITE-16600 revealed several important insights into speed-based throttling and its impact on system performance under high-load conditions. By analyzing the checkpoint starvation and thread throttling mechanisms, we were able to identify areas for improvement in the latest Ignite releases.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>Expanding Collaboration: The packaged bugs and replayable Trovi experiments will be made available to the broader research community, encouraging further investigation and enhancements to large-scale distributed systems.&lt;/p>
&lt;p>The ScaleRep project has been an exciting journey into the world of scalability bugs, pushing the boundaries of what’s possible in terms of reproducibility and benchmarking. Through this project, we’ve demonstrated the importance of rigorous testing and comprehensive documentation in improving the reliability of distributed systems.&lt;/p></description></item><item><title>Final Blog: Enhancing User Experience Reproducibility through TROVI Redesign</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleontroviredesign/20240918-aliciaem/</link><pubDate>Wed, 18 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleontroviredesign/20240918-aliciaem/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/alicia-esquivel-morel/">Alicia Esquivel Morel&lt;/a>, and I&amp;rsquo;m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I worked on a project to &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/trovi/">improve user experience reproducibility through a redesign of TROVI&lt;/a>, as part of the Summer of Reproducibility (SoR) program.&lt;/p>
&lt;p>Even before starting this project, as a rising researcher I always saw reproducibility as one of the biggest challenges in research: being able to consistently replicate experiments and share them in a way that others can follow.&lt;/p>
&lt;p>&lt;strong>TROVI&lt;/strong> is a platform designed to help with this. However, as I joined the project, I knew it had room for improvement, not only in the user interface, but also in the ease of integrating code and data.&lt;/p>
&lt;p>This project aimed to address these challenges by redesigning TROVI to streamline experiment replication, making the platform more intuitive and accessible. The goal was simple: create a user-friendly experience that eliminates confusion and frustration, allowing researchers to focus on their work instead of the technical aspects of running a research experiment.&lt;/p>
&lt;h2 id="our-goals-in-the-beginning-of-the-summer">Our goals in the beginning of the summer:&lt;/h2>
&lt;ul>
&lt;li>We wanted to simplify TROVI’s interface for intuitive navigation, inspired by platforms like Google Colab.&lt;/li>
&lt;li>We wanted to make uploading and sharing code and data easier, with seamless integration with tools like GitHub.&lt;/li>
&lt;li>We wanted to create a mechanism for users to provide feedback, allowing TROVI to evolve based on real user needs.&lt;/li>
&lt;/ul>
&lt;h2 id="how-was-the-progress-and-what-we-have-achieved">How was the progress and what we have achieved&lt;/h2>
&lt;p>I started by conducting thorough UX research and a literature review on reproducibility platforms, establishing a solid foundation for the redesign. With user feedback guiding the process, I created wireframes and low-fidelity prototypes, focusing on making the platform more intuitive.&lt;/p>
&lt;p>As the project progressed, I built a higher-fidelity prototype that connected various components of the platform, ensuring a seamless user journey. I then tackled the back-end integration, which tied together the front-end flows with TROVI’s API.&lt;/p>
&lt;p>Throughout this project, I received &lt;strong>valuable support and guidance from my mentors&lt;/strong>. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a> walked me through TROVI’s architecture and helped me understand exactly what was needed for a successful redesign. Thanks to his mentorship, I not only completed the project but learned a great deal along the way. Thanks &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>!!&lt;/p>
&lt;p>Through iterations and feedback from initial user testing, and with the help of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kate-keahey/">Kate Keahey&lt;/a>, I refined the design to ensure it met the needs of the research community. By the end of the program, TROVI had evolved into a cohesive, user-friendly platform that enhances experiment reproducibility.&lt;/p>
&lt;h2 id="accomplishments">Accomplishments&lt;/h2>
&lt;ul>
&lt;li>A simplified interface that makes navigating, uploading, and collaborating much easier.&lt;/li>
&lt;li>GitHub integration that streamlines the process of sharing code and data with collaborators.&lt;/li>
&lt;li>A built-in feedback loop that enables TROVI to grow with its users, adapting to their needs as they arise.&lt;/li>
&lt;/ul>
&lt;p>The platform is also getting ready to move into &lt;strong>production&lt;/strong> and will soon be available for the research community.&lt;/p>
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>While the core objectives have been successfully met, future improvements could further enhance the platform&amp;rsquo;s capabilities, such as additional integrations and more advanced collaboration features. User testing will continue to provide insights for ongoing development.&lt;/p>
&lt;p>I&amp;rsquo;m grateful for this opportunity! Thank you for following along!&lt;/p></description></item><item><title>[MidTerm] StatWrap: Automated Reproducibility Checklists Generation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/</link><pubDate>Mon, 16 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/</guid><description>&lt;p>Namaste🙏🏻! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/adi-akhilesh-singh/">Adi Akhilesh Singh&lt;/a>, and I&amp;rsquo;m excited to share progress updates on the &lt;a href="https://drive.google.com/file/d/1xV7eHL9lIWGKueQJxBks6OB_rcXCr8JY/view?usp=sharing" target="_blank" rel="noopener">Reproducibility Checklists project&lt;/a> by StatWrap, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>Over the past few months, my mentors and I have worked on developing the interface for the checklists page and designed key components to support our project goals. We’ve implemented logic that iterates over each checklist item, displaying its statement along with Boolean controls (Yes/No buttons) for user interaction.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklists Page" srcset="
/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_5fcb2c29a07fa3c85a9668932f8201f8.webp 400w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_0a0b05429d19c5fac3cf4bd2f233cd58.webp 760w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/checklist1_hu49ca38eb7e3448bf4ed2dfab22f3668a_108784_5fcb2c29a07fa3c85a9668932f8201f8.webp"
width="760"
height="416"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We’ve also developed components to display attached images and URLs linked to each checklist item. Additionally, we’ve integrated a notes feature that allows users to add, edit, and view project-related notes. Currently, we are writing methods to integrate real-time project data into the checklists. For example, one method we’ve implemented scans project files (assets) to detect the languages used.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Checklists Details" srcset="
/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_ed67e0f337158d4ded72db604d4b14df.webp 400w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_c2fdd3677c5b7099074f7846753c15ff.webp 760w,
/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240916-adi/checklist2_hu44aad6e1d0078aeaafbbf946cadf1130_201385_ed67e0f337158d4ded72db604d4b14df.webp"
width="760"
height="416"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
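&lt;p>To illustrate the idea behind the language-detection method mentioned above: StatWrap itself is a JavaScript/Electron application, but a Python sketch of extension-based scanning might look like this (the extension-to-language mapping is a hypothetical example, not StatWrap&amp;rsquo;s actual table).&lt;/p>

```python
from pathlib import Path

# Hypothetical extension-to-language mapping for illustration only.
EXTENSION_MAP = {
    ".r": "R", ".rmd": "R Markdown", ".py": "Python",
    ".sas": "SAS", ".do": "Stata", ".ipynb": "Jupyter Notebook",
}

def detect_languages(paths):
    """Infer the languages used in a project from its file extensions."""
    found = set()
    for p in paths:
        lang = EXTENSION_MAP.get(Path(p).suffix.lower())
        if lang:
            found.add(lang)
    return sorted(found)

print(detect_languages(["analysis.R", "clean_data.py", "report.Rmd", "notes.txt"]))
# → ['Python', 'R', 'R Markdown']
```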
&lt;h2 id="whats-next">What&amp;rsquo;s Next?&lt;/h2>
&lt;p>As we move closer to the final evaluation phase, our focus will be on the following objectives:&lt;/p>
&lt;ul>
&lt;li>Implement methods for each checklist item, integrating real-time data from the project to auto-populate checklist answers.&lt;/li>
&lt;li>Enhance the &lt;code>Attached Images&lt;/code> component to allow users to select and attach existing image assets from the project.&lt;/li>
&lt;li>Display the results of the scans for each checklist item, providing users with detailed outputs based on the automated analysis.&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for further updates as we continue developing this feature set! 🚀&lt;/p></description></item><item><title>Final Post: Enhancing Reproducibility and Portability in Network Experiments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tum/slices/20240905-warmuth/</link><pubDate>Thu, 05 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tum/slices/20240905-warmuth/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As my project with the Summer of Reproducibility (SoR) 2024 comes to a close, I’d like to reflect on the journey and the outcomes achieved. My project focused on &lt;strong>enhancing the reproducibility and portability of network experiments&lt;/strong> by integrating the &lt;strong>RO-Crate standard&lt;/strong> into the &lt;strong>TUM intern testbed pos (plain orchestrating service)&lt;/strong>, and deploying this testbed on the &lt;strong>Chameleon cloud infrastructure&lt;/strong>. The aim was to ensure that experiments conducted on one platform could be seamlessly reproduced on another, adhering to the &lt;strong>FAIR principles&lt;/strong> (Findable, Accessible, Interoperable, Reusable) for research data.&lt;/p>
&lt;h2 id="project-recap">Project Recap&lt;/h2>
&lt;p>The core goal was to make the experiments reproducible and portable between different testbeds like TUM’s pos and Chameleon. To achieve this, I integrated the &lt;strong>RO-Crate standard&lt;/strong>, which ensures that all experiment data is automatically documented and stored with metadata, making it easier for others and especially for machines to understand, replicate, and build on the results. Additionally, deploying a lightweight version of pos on the &lt;strong>Chameleon testbed&lt;/strong> enabled cross-testbed execution, allowing experiments to be replicated across both environments without significant modifications.&lt;/p>
&lt;h2 id="key-achievements">Key Achievements&lt;/h2>
&lt;p>Over the course of the project, several key milestones were achieved:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>RO-Crate Integration&lt;/strong>: The first step was restructuring the results folder and automating the generation of metadata using RO-Crate. This ensured that all experiment data was comprehensively documented with details like author information, hardware configurations, and experiment scripts resulting in comprehensive &lt;code>ro-crate-metadata.json&lt;/code> files as important part of each result folder.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Improved Data Management&lt;/strong>: The integration of RO-Crate greatly simplified the process of organizing and retrieving experiment data and metadata with information about the experiment and the result files. All metadata was automatically generated, making it easier to share and document the experiments for other researchers to replicate.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Automatic Upload to Zenodo&lt;/strong>: Another crucial achievement was the implementation of automatic uploading of pos experiment result folders to &lt;strong>Zenodo&lt;/strong>, an open-access repository. This step significantly improved the reproducibility and sharing of experiment results, making them easily accessible to the broader scientific community. By utilizing Zenodo, we ensured that experiment results, along with their RO-Crate metadata, could be archived and referenced, fostering greater transparency and collaboration in scientific research.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chameleon Deployment&lt;/strong>: Deploying the pos testbed within the Chameleon environment required managing various complexities, particularly related to Chameleon’s OpenStack API, networking setup, and hardware configurations. Coordinating the network components and infrastructure to support pos functionality in this testbed environment demanded significant adjustments to ensure smooth integration and operation.&lt;/p>
&lt;/li>
&lt;/ul>
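&lt;p>For readers unfamiliar with RO-Crate, a minimal &lt;code>ro-crate-metadata.json&lt;/code> in the spirit of what pos generates could be sketched as follows. The layout follows the RO-Crate 1.1 specification, but the experiment details are invented placeholders rather than actual pos output.&lt;/p>

```python
import json

# Minimal RO-Crate 1.1-style metadata file; all experiment-specific
# values (names, files, author) are placeholders, not real pos output.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # the metadata file descriptor
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root data entity describing the result folder
            "@id": "./",
            "@type": "Dataset",
            "name": "Example pos experiment results",      # placeholder
            "author": {"@id": "#experimenter"},
            "hasPart": [{"@id": "measurements.csv"}],      # placeholder
        },
        {"@id": "#experimenter", "@type": "Person", "name": "Jane Doe"},
        {"@id": "measurements.csv", "@type": "File"},
    ],
}

print(json.dumps(crate, indent=2))
```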
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Like any project, this one came with its own set of challenges:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Balancing Automation and Flexibility&lt;/strong>: While automating the generation of RO-Crate metadata, it was crucial to ensure that the flexibility required by researchers for customizing their documentation was not compromised. Finding this balance required in-depth adjustments to the testbed infrastructure.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Complexity of Testbed Systems&lt;/strong>: Integrating RO-Crate into a complex system like pos, and ensuring it works seamlessly with Chameleon, involved understanding and adapting to the complexities of both testbeds.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="future-directions">Future Directions&lt;/h2>
&lt;p>As I continue working on these challenges in my master&amp;rsquo;s thesis, we plan to expand on this work by:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Extending the Chameleon Deployment&lt;/strong>: We aim to deploy the full version of pos on Chameleon, supporting more complex and larger-scale experiments.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Supporting Complex Experiment Workflows&lt;/strong>: Future work will focus on handling more intricate and larger datasets, ensuring reproducibility for complex workflows. Only by executing more complex experiments will we be able to thoroughly analyze and compare the differences between executions in pos and the pos deployed on Chameleon, helping us better understand the impact of different testbed environments on experiment outcomes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Automation&lt;/strong>: The ultimate goal is to fully automate the process of experiment execution, result documentation, and sharing across testbeds, reducing manual intervention and further enhancing reproducibility.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="reflections">Reflections&lt;/h2>
&lt;p>By integrating the RO-Crate standard and deploying pos on the Chameleon testbed, we have made significant steps toward enhancing the reproducibility, accessibility, and portability of network experiments across research platforms. These efforts contribute to more shareable and replicable research processes in the scientific community.&lt;/p>
&lt;p>I am excited about the future work ahead and am grateful for the mentorship and support I received during this project.&lt;/p>
&lt;h2 id="deliverables-and-availability">Deliverables and Availability&lt;/h2>
&lt;p>Due to the current non-public status of the pos framework, &lt;strong>the code and deliverables are not publicly available&lt;/strong> at the moment.&lt;/p>
&lt;h2 id="previous-blogs">Previous Blogs&lt;/h2>
&lt;p>Make sure to check out my other blogs to see how I started this project and the challenges I faced along the way:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tum/slices/20240517-warmuth/">Introduction&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tum/slices/20240716-warmuth/">Midterm Blog&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Servus!&lt;/p></description></item><item><title>Understanding Data Leakage in Machine Learning: A Focus on TF-IDF</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240905-kyrillosishak/</link><pubDate>Thu, 05 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240905-kyrillosishak/</guid><description>&lt;p>Hello again!&lt;/p>
&lt;p>This is my final blog post, and I will be discussing the second material I created for the 2024 Summer of Reproducibility Fellowship. As you may recall from my first post, I am working on the &lt;strong>Exploring Data Leakage in Applied ML: Reproducing Examples of Irreproducibility&lt;/strong> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> as my mentors.&lt;/p>
&lt;p>This blog post will explore how data leakage can occur during feature extraction, particularly with the commonly used &lt;strong>TF-IDF&lt;/strong> vectorizer, and its impact on model generalization.&lt;/p>
&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>In machine learning, data leakage is a critical issue that can severely impact model performance. It occurs when information from outside the training dataset is improperly used to create the model, leading to overly optimistic performance during evaluation. One common source of leakage comes from how features, such as those extracted using &lt;strong>TF-IDF&lt;/strong> (Term Frequency-Inverse Document Frequency), are handled. In this post, we&amp;rsquo;ll explore how data leakage can happen during feature extraction with TF-IDF and how it affects model accuracy.&lt;/p>
&lt;h1 id="what-is-tf-idf">What is TF-IDF?&lt;/h1>
&lt;p>TF-IDF is a method used to evaluate how important a word is in a document relative to a collection of documents. It consists of two components:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Term Frequency (TF)&lt;/strong>: Measures how frequently a term appears in a document.&lt;/li>
&lt;li>&lt;strong>Inverse Document Frequency (IDF)&lt;/strong>: Reduces the importance of terms that appear frequently across many documents.&lt;/li>
&lt;/ol>
&lt;p>Together, they provide a weighted value for each word, reflecting its importance relative to the dataset.&lt;/p>
&lt;h1 id="how-data-leakage-occurs-with-tf-idf">How Data Leakage Occurs with TF-IDF&lt;/h1>
&lt;p>Data leakage with TF-IDF happens when the inverse document frequency (IDF) is calculated using the entire dataset (including the test set) before splitting it into training and test sets. This means the model has access to information from the test set during training, leading to artificially inflated results. This is a subtle form of data leakage, as it often goes unnoticed.&lt;/p>
&lt;p>For example, if the word &amp;ldquo;banana&amp;rdquo; appears frequently in the test set and those occurrences are counted when computing the IDF, its score is deflated and the model downplays its significance during training. As a result, the model may fail to predict correctly when &amp;ldquo;banana&amp;rdquo; is important in the test data.&lt;/p>
&lt;h1 id="why-does-this-matter">Why Does This Matter?&lt;/h1>
&lt;p>If the test data is included when calculating the IDF, the model gains unintended insight into the test set&amp;rsquo;s word distribution. In real-world scenarios, the test data is supposed to be unseen during training. By allowing the model to see this information, you&amp;rsquo;re essentially reducing the uncertainty that the model should have about future data.&lt;/p>
&lt;h1 id="impact-of-data-leakage-on-model-performance">Impact of Data Leakage on Model Performance&lt;/h1>
&lt;p>Let&amp;rsquo;s consider two cases to understand the impact of data leakage in detail:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>When a word is rare in the training set but common in the test set&lt;/strong>: The model will underestimate the importance of this word during training, leading to poor performance when the word is critical in test documents.&lt;/li>
&lt;li>&lt;strong>When a word is common in the training set but rare in the test set&lt;/strong>: The model will overemphasize the word during training, leading to poor predictions when the word doesn’t appear as often in unseen data.&lt;/li>
&lt;/ol>
&lt;h3 id="case-study-data-leakage-in-tf-idf">Case Study: Data Leakage in TF-IDF&lt;/h3>
&lt;p>To see this effect in action, consider a small toy dataset where the presence of the word &amp;ldquo;banana&amp;rdquo; determines the label. If the word &amp;ldquo;banana&amp;rdquo; appears in a sentence, the label is 1; otherwise, the label is 0. Using &lt;strong>TF-IDF&lt;/strong> to vectorize the text, we train a machine learning model to predict this label.&lt;/p>
&lt;p>In the &lt;strong>first scenario&lt;/strong>, we calculate the &lt;strong>TF-IDF&lt;/strong> using the entire dataset before splitting it into training and testing sets. This causes data leakage since the model now knows the distribution of words across both sets. For instance, if &amp;ldquo;banana&amp;rdquo; is more common in the test set than the training set, the &lt;strong>IDF&lt;/strong> score for &amp;ldquo;banana&amp;rdquo; will be lower across the entire dataset, leading the model to downplay its importance.&lt;/p>
&lt;p>In the &lt;strong>second scenario&lt;/strong>, we calculate &lt;strong>TF-IDF&lt;/strong> only on the training set, ensuring that the test set remains unseen. This preserves the integrity of the test set, giving us a more realistic evaluation of the model&amp;rsquo;s performance.&lt;/p>
&lt;p>In both scenarios, the model&amp;rsquo;s accuracy is drastically different. When leakage is present, performance is artificially high during training but poor when tested on unseen data. Without leakage, the model generalizes better, as it is evaluated on truly unseen data.&lt;/p>
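&lt;p>To make the effect concrete, here is a minimal, self-contained sketch. The toy corpus is invented for illustration, and the smoothed-IDF formula shown is one common variant; exact formulas differ between libraries.&lt;/p>

```python
import math

# Toy corpus: the label is 1 iff the sentence contains "banana".
train_docs = ["banana bread recipe", "apple pie recipe", "fresh apple juice"]
test_docs  = ["banana smoothie", "banana split", "ripe banana snack"]

def idf(word, docs):
    # Smoothed IDF, a common variant: log((1+N)/(1+df)) + 1
    df = sum(1 for d in docs if word in d.split())
    return math.log((1 + len(docs)) / (1 + df)) + 1

# Leaky: IDF computed over train + test before splitting.
leaky_idf = idf("banana", train_docs + test_docs)
# Correct: IDF computed on the training set only.
clean_idf = idf("banana", train_docs)

print(f"leaky IDF: {leaky_idf:.3f}, clean IDF: {clean_idf:.3f}")
# "banana" is common in the test set, so the leaky IDF is lower,
# and a model trained on leaky features underweights the word.
```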
&lt;h1 id="avoiding-data-leakage">Avoiding Data Leakage&lt;/h1>
&lt;p>Avoiding data leakage is essential for building reliable machine learning models that generalize well to new data. Here are a few guidelines to help prevent leakage:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Split the dataset before feature extraction&lt;/strong>: Always divide your data into training and test sets before applying any feature engineering techniques.&lt;/li>
&lt;li>&lt;strong>Ensure proper cross-validation&lt;/strong>: When using cross-validation, ensure that the training and test splits do not overlap in any way that can leak information between them.&lt;/li>
&lt;li>&lt;strong>Be cautious with time-series data&lt;/strong>: In time-series models, avoid using future data to predict past events, as this can lead to leakage.&lt;/li>
&lt;/ol>
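&lt;p>One practical way to enforce the split-before-extraction rule is to wrap feature extraction and the model in a single pipeline, so that cross-validation refits the vectorizer on each training fold only. A minimal sketch with scikit-learn (toy data, purely illustrative):&lt;/p>

```python
# Putting TF-IDF inside a Pipeline means cross_val_score refits the
# vectorizer on each training fold, so no IDF statistics leak from
# the held-out fold.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

docs = [
    "banana bread", "banana split", "ripe banana", "banana smoothie",
    "apple pie", "fresh apple", "apple juice", "green apple",
    "banana snack", "apple tart", "banana chips", "apple cider",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, docs, labels, cv=3)
print(scores)
```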
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Avoiding data leakage is crucial for building robust machine learning models. In the case of TF-IDF, ensuring that feature extraction is done &lt;strong>only on the training set&lt;/strong> and not on the entire dataset is key to preventing leakage. Properly addressing this issue leads to better generalization and more reliable models in real-world applications.&lt;/p>
&lt;p>This blog post provided a case study on how TF-IDF can introduce data leakage and why it&amp;rsquo;s important to carefully handle your dataset before feature extraction. By splitting your data properly and ensuring that no test data &amp;ldquo;leaks&amp;rdquo; into the training process, you can build models that truly reflect real-world performance.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>AutoAppendix: Towards One-Click reproducibility of high-performance computing experiments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tuwien/autoappendix/20240904-kkrassni/</link><pubDate>Wed, 04 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tuwien/autoappendix/20240904-kkrassni/</guid><description>&lt;p>Hi everyone,&lt;/p>
&lt;p>I&amp;rsquo;m excited to wrap up the AutoAppendix project with our final findings and
insights. Over the course of this initiative, we’ve worked to assess the
reproducibility of artifacts submitted to the SC24 conference and create
guidelines that aim to improve the standard for reproducible experiments in the
future. Here&amp;rsquo;s a summary of the project&amp;rsquo;s final phase and what we’ve learned.&lt;/p>
&lt;h2 id="project-goals-and-progress">Project Goals and Progress&lt;/h2>
&lt;p>The goal of AutoAppendix was to evaluate the computational artifacts provided by
SC24 paper submissions, focusing on reproducibility. These artifacts accompany
papers applying for the &amp;ldquo;Artifact Replicable&amp;rdquo; badge in the conference&amp;rsquo;s
reproducibility initiative. Volunteer members of this initiative assess 1-2 paper appendices each. In this project, we analyzed a larger portion of artifacts to gain a broader perspective on potential improvements to the reproducibility process.&lt;/p>
&lt;p>We selected 18 out of 45 submissions, focusing on experiments that could be
easily replicated on Chameleon Cloud. Our evaluation criteria were based on
simplicity (single-node setups) and availability of resources. The final
analysis expanded on the earlier midterm findings, shedding light on various
challenges and best practices related to artifact reproducibility.&lt;/p>
&lt;h2 id="artifact-evaluation-process">Artifact Evaluation Process&lt;/h2>
&lt;p>During the evaluation process, we focused on examining the completeness and
clarity of the provided artifacts, looking closely at documentation, setup
instructions, and the degree of automation.&lt;/p>
&lt;p>Our first step was to replicate the environments used in the original
experiments as closely as possible using the resources from Chameleon. Many papers included instructions for creating the necessary software environments,
but the clarity of these instructions varied significantly across submissions.
In some cases, we even encountered challenges in reproducing results due to unclear
instructions or missing dependencies, which reinforced the need for
standardized, clear documentation as part of the artifact submission process.&lt;/p>
&lt;p>We observed that &lt;em>containerization&lt;/em> and &lt;em>semi-automated setups&lt;/em> (with scripts
that break down the experiment into smaller steps) were particularly effective
in enhancing the reproducibility of the artifacts. One artifact
particularly caught our attention due to its usage of the Chameleon JupyterHub
platform, making it reproducible with a &lt;em>single click&lt;/em>. This highlighted the
potential for
streamlining the reproducibility process and showcased that, with sufficient
effort and the right tools, experiments can indeed be made replicable by
&lt;em>anyone&lt;/em>.&lt;/p>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Throughout the evaluation, we observed that reproducibility could vary widely
based on the clarity and completeness of the documentation and the automation of
setup procedures. Artifacts that were structured with clear, detailed steps for
installation and execution tended to perform well in terms of replicability.&lt;/p>
&lt;p>From our evaluation, we derived a set of guidelines (intended as must-haves) and
best practices (recommended) for artifact reproducibility, which can be found below.&lt;/p>
&lt;p>Fascinated by the potential of the Chameleon JupyterHub platform and its adjacent &lt;a href="https://www.chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifact repository, we decided to create several templates that authors can use as a starting point to integrate their artifacts with the platform more easily. In designing these templates, we made sure that artifacts structured according to our guidelines are particularly easy to integrate.&lt;/p>
&lt;h3 id="guidelines">Guidelines&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Clear Documentation&lt;/strong>: Provide clear and detailed documentation for the artifact in the corresponding appendix, such that the artifact can be replicated without the need for additional information. For third-party software, it is acceptable to refer to the official documentation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Software Setup&lt;/strong>: Clearly specify the versions of all (necessary) software components used
in the creation of the artifact. This includes the operating system, libraries, and tools.
In particular, state all software setup steps needed to replicate the software environment.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Hardware Specifications&lt;/strong>: Specify the hardware the experiment was conducted on. Importantly,
state the architecture the experiments are intended to run on, and ensure that the
provided software (e.g. Docker images) is compatible with commonly available
architectures.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expected Results&lt;/strong>: Always provide the expected outputs of the experiment, especially when run on different hardware, to make it easier for reviewers to assess the success of the replication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Public Data&lt;/strong>: Publish the experiment data to a public repository, and make
sure the data is available for download to reviewers and readers, especially during
the evaluation period. Zenodo is a recommended repository for this purpose.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Automated Reproducibility&lt;/strong>: For long-running experiments, provide
progress output to the reviewer to ensure the experiment is running as expected.
In the documentation, give an idea of&lt;/p>
&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>how much time long-running steps in the reproduction will take&lt;/li>
&lt;li>what the progress output looks like or how frequently it is emitted&lt;/li>
&lt;/ul>
&lt;ol start="7">
&lt;li>&lt;strong>Sample Execution&lt;/strong>: Conduct a sample evaluation with hardware and software
as similar as possible to the intended reproduction environment.&lt;/li>
&lt;/ol>
&lt;h3 id="best-practices">Best Practices&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Reproducible Environment&lt;/strong>:
Use a reproducible environment for the artifact. This can come in several forms:&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>&lt;strong>Containerization&lt;/strong>: Provide instructions for building the environment or,
ideally, provide a ready-to-use image. For example, Docker, Singularity, or VirtualBox images can be used for this purpose.&lt;/li>
&lt;li>&lt;strong>Reproducible Builds&lt;/strong>: Package managers like &lt;a href="https://nixos.org/" target="_blank" rel="noopener">Nix&lt;/a> or &lt;a href="https://guix.gnu.org/" target="_blank" rel="noopener">Guix&lt;/a> have recently spiked in popularity and allow their users to create reproducible environments, matching the exact software versions across different systems.&lt;/li>
&lt;/ul>
&lt;ol start="2">
&lt;li>
&lt;p>&lt;strong>Partial Automation&lt;/strong>: It often makes sense to break an experiment down into
smaller, more manageable steps. For Linux-based systems, bash scripts are particularly viable for this purpose. We recommend prefixing the scripts for each step with
a number, such that the order of execution is clear.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>X11 Availability&lt;/strong>: Usually, reviewers will not have access to a graphical user
interface on the system where the artifact is evaluated. If the artifact requires a
graphical user interface, provide a way to run the artifact without it. For example,
save &lt;code>matplotlib&lt;/code> plots to disk instead of showing them with &lt;code>plt.show()&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Experiment output&lt;/strong>: Do not include output files of the experiment in your artifact
unless explicitly intended. If provided output files are intended for comparison,
they should be marked as such (e.g. in their filenames). Similarly, any output logs
or interactive outputs in Jupyter notebooks should not be part of the artifact, but
rather be generated during the artifact evaluation.&lt;/p>
&lt;/li>
&lt;/ol>
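&lt;p>The X11 recommendation above can be sketched in a few lines of matplotlib: select a file-rendering backend before importing &lt;code>pyplot&lt;/code>, then save figures to disk instead of calling &lt;code>plt.show()&lt;/code>. Filenames and plot data here are invented for illustration.&lt;/p>

```python
# Headless plotting: choose a non-GUI backend before importing pyplot,
# then save figures instead of showing them.
import matplotlib
matplotlib.use("Agg")  # renders to files, needs no X11 display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9], marker="o")
ax.set_xlabel("step")
ax.set_ylabel("runtime")
fig.savefig("results.png", dpi=150)  # instead of plt.show()
plt.close(fig)
```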
&lt;h3 id="trovi-templates">Trovi Templates&lt;/h3>
&lt;p>Our templates share a common base that features
a &lt;em>central configuration file&lt;/em> for modifying the
Chameleon experiment parameters (such as node type). Building on this base, we provide three templates with sample experiments that each use different environments:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Docker template&lt;/strong>: This template is designed for containerized experiments and supports NVIDIA GPUs via the &lt;code>nvidia-container-toolkit&lt;/code> integration.&lt;/li>
&lt;li>&lt;strong>Nix template&lt;/strong>: Sets up the Nix package manager with a &lt;code>shell.nix&lt;/code> file that can be used to configure the environment.&lt;/li>
&lt;li>&lt;strong>Guix template&lt;/strong>: Installs the Guix package manager and executes a sample experiment from an existing reproducible paper that hinges on the reproducibility of the software environment.&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In summary, the AutoAppendix project has been an insightful journey into the
complexities of artifact reproducibility. Our evaluations highlight both the
challenges and potential solutions for future reproducibility initiatives. By
following these essential guidelines and implementing best practices, we aim for the
research community to achieve higher standards of transparency and reliability
in scientific research and help to ensure that the results of experiments can be replicated by others.&lt;/p>
&lt;p>Thanks for following along with our progress! We’re excited to see the positive
impact these findings will have on the research community.&lt;/p>
&lt;p>If you are interested in the full project report, you can find it &lt;a href="https://drive.google.com/drive/folders/113OsxGAlfyvlJnvpH5zL2XD-8gE3CYyu?usp=sharing" target="_blank" rel="noopener">here&lt;/a>, together with the &lt;em>Trovi&lt;/em> templates.&lt;/p></description></item><item><title>Reflecting on the ScaleRep Project: Achievements and Insights</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240902-shuangliang/</link><pubDate>Mon, 02 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240902-shuangliang/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>As we reach the conclusion of our ScaleRep project, I want to take a moment to reflect on the journey we’ve undertaken and the significant milestones we’ve achieved. Throughout this project, our primary focus was on identifying, reproducing, and analyzing scalability bugs in cloud systems such as Cassandra, HDFS, and Hadoop. Under the mentorship of Professor Yang Wang and Bogdan “Bo” Stoica, we have gained valuable insights into the complexities of scalability issues and their impact on large-scale distributed systems.&lt;/p>
&lt;h1 id="key-accomplishments">Key Accomplishments&lt;/h1>
&lt;p>Over the course of the project, we delved into various aspects of scalability bugs, reproducing some of the most challenging issues faced by cloud systems. One of our notable accomplishments was the successful reproduction and validation of developer fixes for several critical bugs in HDFS. These included:&lt;/p>
&lt;h2 id="1-throttling-bugs-in-hdfs">1. Throttling Bugs in HDFS:&lt;/h2>
&lt;p>We investigated HDFS-17087, where the absence of a throttler in &lt;code>DataXceiver#readBlock&lt;/code> led to unregulated data reads, causing potential performance degradation. By reproducing the bug and applying the developer’s patch, we were able to observe significant improvements in system stability.&lt;/p>
&lt;h2 id="2-reducing-datanode-load">2. Reducing DataNode Load:&lt;/h2>
&lt;p>HDFS-16386 was another crucial bug we worked on, which involved reducing the load on DataNodes when &lt;code>FsDatasetAsyncDiskService&lt;/code> was working. By analyzing the effects of high CPU and memory usage, we proposed and validated a solution that reduced the number of concurrent threads, ultimately improving the DataNode’s performance.&lt;/p>
&lt;h2 id="3-improving-log-throttling">3. Improving Log Throttling:&lt;/h2>
&lt;p>In HDFS-16872, we addressed excessive logging caused by unshared instances of &lt;code>LogThrottlingHelper&lt;/code>. By making &lt;code>LogThrottlingHelper&lt;/code> a static member, we were able to share throttling across instances, reducing unnecessary log entries and improving system efficiency.&lt;/p>
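&lt;p>As a rough illustration of that fix (a Python sketch with invented names, not the actual Java code), sharing the throttle state across all logger instances is what suppresses the duplicate log entries:&lt;/p>

```python
import time

class LogThrottler:
    # Sketch of the idea behind the HDFS-16872 fix: the throttle state is a
    # class-level (static) member, so every instance shares it and a burst
    # of identical messages is emitted at most once per interval.
    _last_emit = {}  # shared across all instances, like a Java static field

    def __init__(self, interval_s=1.0):
        self.interval_s = interval_s

    def should_log(self, key):
        now = time.monotonic()
        last = LogThrottler._last_emit.get(key)
        if last is None or now - last >= self.interval_s:
            LogThrottler._last_emit[key] = now
            return True
        return False

# Two instances still share one throttle, so only the first call logs.
a, b = LogThrottler(), LogThrottler()
print(a.should_log("block-report"))  # True
print(b.should_log("block-report"))  # False: suppressed by shared state
```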
&lt;h1 id="insights-and-learnings">Insights and Learnings&lt;/h1>
&lt;h2 id="1-systematic-bug-reproduction">1. Systematic Bug Reproduction:&lt;/h2>
&lt;p>One of the most critical aspects of our work was developing a systematic approach to bug reproduction. This involved carefully setting up the environment, applying patches, and validating results through detailed monitoring and analysis. Our reproducible artifacts and investigation scripts will serve as a resource for future researchers and developers.&lt;/p>
&lt;h2 id="2-impact-of-throttling-mechanisms">2. Impact of Throttling Mechanisms:&lt;/h2>
&lt;p>Our exploration of throttling bugs highlighted the importance of accurate throttling mechanisms in maintaining system performance and stability. Small issues, such as incorrect data rate calculations, can have significant ripple effects on system behavior, emphasizing the need for precise and effective solutions.&lt;/p>
&lt;h2 id="3-collaboration-and-open-source-contribution">3. Collaboration and Open Source Contribution:&lt;/h2>
&lt;p>Working on an open-source project like ScaleRep underscored the importance of collaboration within the community. The bugs we analyzed and fixed not only improved the systems we worked on but also contributed to the broader effort of enhancing the reliability of cloud systems.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>As we wrap up the ScaleRep project, I am proud of the progress we have made and the contributions we have delivered to the open-source community. The knowledge and experience gained from this project will undoubtedly shape our future endeavors in the field of distributed systems and cloud computing. I am grateful for the guidance and support provided by Professor Yang Wang and Bogdan “Bo” Stoica throughout this journey.&lt;/p>
&lt;p>Thank you for following along, and I look forward to continuing to explore the future of scalable and reliable cloud systems!&lt;/p></description></item><item><title>Final Report: Stream processing support for FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/fastensor/20240830-aditya_narayan/</link><pubDate>Fri, 30 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/fastensor/20240830-aditya_narayan/</guid><description>&lt;h1 id="final-report-stream-processing-support-for-fastensor">Final Report: Stream processing support for FasTensor&lt;/h1>
&lt;h2 id="project-description">Project Description&lt;/h2>
&lt;p>FasTensor is a scientific computing library specialized in performing computations over dense matrices that exhibit spatial locality, a characteristic often found in physical phenomena data. Our GSoC'24 project aimed to enhance FasTensor by enabling it to ingest and process live data streams from sensors and scientific equipment.&lt;/p>
&lt;h2 id="what-is-fastensor">What is FasTensor?&lt;/h2>
&lt;p>Imagine you&amp;rsquo;re working on a physical simulation or solving partial differential equations (PDEs). You&amp;rsquo;ve discretized your PDE, but now you face a new challenge: you need to run your computations fast and parallelize them across massive compute clusters.&lt;/p>
&lt;p>At this point, you find yourself describing a stencil &lt;a href="https://dl.acm.org/doi/abs/10.1145/2686745.2686756" target="_blank" rel="noopener">[1]&lt;/a> operation. But should you really spend your time tinkering with loop orders, data layouts, and countless other side-quests unrelated to your core problem?&lt;/p>
&lt;p>This is where FasTensor comes in: Describe your computation as a stencil, and it takes care of ensuring optimal execution. FasTensor lets you focus on the science, not the implementation details.&lt;/p>
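&lt;p>For readers unfamiliar with stencils, here is a minimal NumPy example (not FasTensor&amp;rsquo;s API) of the kind of neighborhood computation FasTensor optimizes: a one-dimensional diffusion step.&lt;/p>

```python
import numpy as np

# Illustrative 1-D three-point stencil: each output cell is updated from
# its immediate neighborhood. This is the access pattern FasTensor lets
# you express directly while it handles layout and parallelism.
def heat_step(u, alpha=0.1):
    out = u.copy()
    # interior points: u[i] + alpha * (u[i-1] - 2*u[i] + u[i+1])
    out[1:-1] = u[1:-1] + alpha * (u[:-2] - 2 * u[1:-1] + u[2:])
    return out

u = np.zeros(9)
u[4] = 1.0          # a spike of heat in the middle
u = heat_step(u)
print(u)            # the spike diffuses into its neighbors
```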
&lt;h2 id="repository-links">Repository Links&lt;/h2>
&lt;ul>
&lt;li>FasTensor: &lt;a href="https://github.com/BinDong314/FasTensor" target="_blank" rel="noopener">https://github.com/BinDong314/FasTensor&lt;/a>&lt;/li>
&lt;li>My fork: &lt;a href="https://github.com/my-name/FasTensor/tree/ftstream" target="_blank" rel="noopener">https://github.com/my-name/FasTensor/tree/ftstream&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="prs">PR(s)&lt;/h3>
&lt;ol>
&lt;li>&lt;a href="https://github.com/BinDong314/FasTensor/pull/1" target="_blank" rel="noopener">Stream processing support for FasTensor completed.&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/BinDong314/FasTensor/pull/2" target="_blank" rel="noopener">Merge ftstream into the FasTensor repo&lt;/a>&lt;/li>
&lt;/ol>
&lt;h2 id="work-done-this-summer">Work done this summer&lt;/h2>
&lt;h3 id="develop-streaming-simulator-ftstream">Develop Streaming simulator: FTStream&lt;/h3>
&lt;p>I was first tasked by Dr. Bin with developing a stream simulator for testing the streaming capability of FasTensor. For testing purposes, a stream is characterized by file size, count, and arrival interval. FTStream can generate streams of various sizes and intervals, up to the theoretical limits of the disk and filesystem. We&amp;rsquo;re talking speeds up to 2.5 GiB/s on a non-parallel NVMe!&lt;/p>
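&lt;p>A toy, stdlib-only sketch of what such a simulator does (function and file names here are hypothetical; the real FTStream uses much faster I/O paths, described below):&lt;/p>

```python
import os, tempfile, time

# Miniature stream simulator: emit `count` files of `size_bytes` each,
# one every `interval_s` seconds, mimicking sensor data arriving on disk.
def simulate_stream(out_dir, count=3, size_bytes=4096, interval_s=0.05):
    payload = os.urandom(size_bytes)
    for i in range(count):
        path = os.path.join(out_dir, f"frame_{i:05d}.bin")
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before "arrival"
        time.sleep(interval_s)    # model the arrival interval between files

out = tempfile.mkdtemp()
simulate_stream(out)
print(sorted(os.listdir(out)))
```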
&lt;p>Writing this tool was an adventure in throughput testing and exploring APIs. I wrote multiple drivers, each for a different whim and hijinks of systems in the HPC world. Here&amp;rsquo;s a brief journey through the APIs we explored:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>HDF5 APIs:&lt;/strong> Pretty fast in flush-to-disk operation, but the API design strongly binds to file handles, which inhibits high throughput duplication.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HDF5 VFL and VOL:&lt;/strong> We dabbled in these dark arts, but there be dragons! Keeping a long-term view of maintenance, we dropped the idea.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>POSIX O_DIRECT:&lt;/strong> This involved getting your buffers aligned right and handling remainders correctly. A step up, but not quite at the theoretical limits.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Linux AIO:&lt;/strong> Streaming is a latency-sensitive domain; to reach the theoretical limits, every syscall saved matters. Linux AIO let us batch syscalls with &lt;code>io_submit()&lt;/code>. It took a few testing sessions to get the right combo of queue depth, buffer size, and alignment.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>We settled on O_DIRECT + Linux AIO. Feel free to modify &lt;code>ftstream/fastflush.h&lt;/code> to suit your needs.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/aditya-narayan5/GSoC24-Final_Report/f486087ae3e6ef1f1077c885e9352c9440848724/images/ftstream.png" width=75% height=75%>
&lt;h3 id="stream-support">Stream Support&lt;/h3>
&lt;p>FasTensor has just one simple paradigm: you give it a data source, an output data store, and your transform, and it handles all the behind-the-scenes grunt work of computing over big datasets so you can focus on your research.&lt;/p>
&lt;p>We aimed to achieve the same for streaming: Drop in the STREAM keyword, append a pattern identifying your stream, and use your usual transform.&lt;/p>
&lt;img src="https://raw.githubusercontent.com/aditya-narayan5/GSoC24-Final_Report/f486087ae3e6ef1f1077c885e9352c9440848724/images/example_code.png" width=75% height=100%>
Voila! Now your previous FasTensor code supports live data streams.
&lt;img src="https://raw.githubusercontent.com/aditya-narayan5/GSoC24-Final_Report/da34fab7a857b0223332d84a0aa1c8cdf0811761/images/fastensor_streaming_demo.gif" width=75% height=75%>
&lt;h4 id="technical-tidbits">Technical tidbits:&lt;/h4>
&lt;ul>
&lt;li>Implements a manager-worker pattern, giving us the flexibility to later add different stream semantics such as windowing and CPU/memory-based load balancing&lt;/li>
&lt;li>Supports streams of indefinite size&lt;/li>
&lt;/ul>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>HPC has its fair share of challenges. Things you take for granted might not be available there, and it takes a while to adjust to paradigms of scale and parallelization.&lt;/p>
&lt;p>For example, when developing FTStream, we found O_DIRECT is available on some parallel file systems like GPFS but not supported on Lustre/CFS. We developed a separate MPIO driver for FTStream that will be upstreamed once thoroughly tested on Lustre.&lt;/p>
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;ul>
&lt;li>Implement windowing and explore more advanced stream semantics.&lt;/li>
&lt;li>Implement support for defining workload policies.&lt;/li>
&lt;li>Optimize interleaving IO and Compute.&lt;/li>
&lt;/ul>
&lt;h2 id="references">References&lt;/h2>
&lt;p>[1] Anshu Dubey. 2014. Stencils in Scientific Computations. In Proceedings of the Second Workshop on Optimizing Stencil Computations (WOSC &amp;lsquo;14). Association for Computing Machinery, New York, NY, USA, 57.
&lt;a href="https://doi.org/10.1145/2686745.2686756" target="_blank" rel="noopener">https://doi.org/10.1145/2686745.2686756&lt;/a>&lt;/p>
&lt;h2 id="acknowledgement">Acknowledgement&lt;/h2>
&lt;p>I struck gold when it comes to mentors.&lt;/p>
&lt;p>Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> was really kind and supportive throughout the journey. From the very first steps of giving a tour around the codebase to giving me a lot of freedom to experiment, refactor, and refine.&lt;/p>
&lt;p>Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a> was encouraging and nurturing of budding talent. We had great research presentations every Monday apart from usual mentor interactions, where different research groups presented their talks and students were invited to present their progress.&lt;/p>
&lt;p>I&amp;rsquo;ve come across Quantum computing many times in the news, but I never thought I&amp;rsquo;d get a frontline preview from the researchers working at the bleeding edge at the Lawrence Berkeley National Laboratory (LBL).&lt;/p>
&lt;p>This GSoC experience, made possible by Google and UC OSPO, has been invaluable for my growth as a developer and researcher.&lt;/p>
&lt;p>For people interested in HPC, ML, Systems, or Reproducibility, I encourage you all to apply to UC OSPO. It&amp;rsquo;s been an incredible journey, and I&amp;rsquo;m grateful for every moment of it!&lt;/p></description></item><item><title>Static and Interactive Visualization Capture</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20250301-aryas/</link><pubDate>Fri, 30 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20250301-aryas/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arya-sarkar/">Arya Sarkar&lt;/a>, and I am a machine learning engineer and researcher based in Kolkata, a city in Eastern India dubbed the City of Joy.
During the summer of 2024, I worked closely with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a> on the project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Reproducibility in Data Visualization&lt;/a>.
We explored multiple existing solutions, tested different strategies, and made great progress on visualization capture using a relatively uncommon method: embedding visualization meta-information into the final rendered image file as a JSON object.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;h3 id="static-visualization-capture">Static Visualization Capture&lt;/h3>
&lt;p>We successfully developed a method to capture static visualizations as .png files along with embedded metadata in a JSON format.
This approach enables seamless reproducibility of the visualization by storing all necessary metadata within the image file itself.
Our method supports both Matplotlib and Bokeh libraries and demonstrated near-perfect reproducibility, with only a minimal 1-2% pixel difference in cases where jitter (randomness) was involved.&lt;/p>
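&lt;p>The embedding idea can be sketched with Pillow&amp;rsquo;s PNG text chunks. The &lt;code>viz_meta&lt;/code> field name and the metadata fields below are our own choices for illustration, not a standard:&lt;/p>

```python
import json
from PIL import Image
from PIL.PngImagePlugin import PngInfo

# The specification needed to regenerate the figure travels inside the
# PNG itself as a JSON text chunk.
spec = {"library": "matplotlib", "chart": "scatter", "seed": 42}

img = Image.new("RGB", (64, 64), "white")   # stand-in for a rendered figure
meta = PngInfo()
meta.add_text("viz_meta", json.dumps(spec))
img.save("figure.png", pnginfo=meta)

# Anyone holding only the .png can recover the metadata:
recovered = json.loads(Image.open("figure.png").text["viz_meta"])
print(recovered["chart"])
```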
&lt;h3 id="interactive-visualization-capture">Interactive Visualization Capture&lt;/h3>
&lt;p>For interactive visualizations, our focus shifted to capturing state changes in Plotly visualizations on the web.
We developed a script that tracks user interactions (e.g., zoom, box, lasso, slider) using event listeners and automatically captures the visualization state as both image and metadata files.
This script also maintains a history of interactions to ensure reproducibility of all interaction states.&lt;/p>
&lt;p>The challenge of capturing web-based visualizations from platforms like ObservableHq remains, as iframe restrictions prevent direct access to SVG elements.
Further exploration is needed to create a more robust capture method for these environments.&lt;/p>
&lt;p align="center">
&lt;img src="./bokeh_interactive.png" alt="bokeh interactive capture" style="width: 80%; height: auto;">
&lt;/p>
&lt;h1 id="future-work">Future Work&lt;/h1>
&lt;p>We aim to:&lt;/p>
&lt;ul>
&lt;li>Package our interactive capture script into a Google Chrome extension.&lt;/li>
&lt;li>Temporarily store interaction session files in the browser’s local storage.&lt;/li>
&lt;li>Enable users to download captured files as a zip archive, using base64 encoding for images.&lt;/li>
&lt;/ul>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Last summer, we made significant strides in enhancing data visualization reproducibility.
Our innovative approach to embedding metadata directly into visualization files offers a streamlined method for recreating static visualizations.
The progress in capturing interactive visualization states opens new possibilities for tackling a long-standing challenge in the field of reproducibility.&lt;/p></description></item><item><title>Final Blog: BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/</link><pubDate>Thu, 29 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/</guid><description>&lt;p>Hello! I&amp;rsquo;m Qianru! I have been contributing to the BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking project under the mentorship of Ziheng Duan. My project aims to provide a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics.&lt;/p>
&lt;h1 id="motivation-and-overview">Motivation and Overview&lt;/h1>
&lt;p>The &amp;ldquo;BenchmarkST&amp;rdquo; project was driven by the need to address a critical challenge in spatial transcriptomics: the impact of sparse data on downstream tasks, such as spatial domain identification. Sparse data can significantly degrade the performance of these tasks. For example, in a 10X Visium dataset of human brain Dorsolateral Prefrontal Cortex (DLPFC), using the complete dataset with GraphST (a state-of-the-art clustering method) for clustering resulted in an ARI (Adjusted Rand Index) of 0.6347. However, when using only 20% of the data—a common scenario—the performance dropped dramatically to 0.1880. This stark difference highlights the importance of effective gene imputation, which can help restore the lost information and improve the accuracy of downstream analyses.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="fig1" srcset="
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1bdac9436ddd84b83023a2cd20d76fb3.webp 400w,
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_8a97a3a52a0fad3fb5d2dbf596e883a9.webp 760w,
/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240829-qianru/fig1_hu72c585df7604f28a748aa64a85602fac_159578_1bdac9436ddd84b83023a2cd20d76fb3.webp"
width="760"
height="496"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
To tackle this issue, the BenchmarkST project led to the creation of the Impeller package. This package provides a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics, offering preprocessed datasets, reproducible evaluation methods, and flexible inference interfaces. It spans different platforms, species, and organs, aiming to enhance the integrity and usability of spatial transcriptomics data.&lt;/p>
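&lt;p>To make the ARI comparison above concrete, here is a minimal sketch of how clustering agreement is scored with scikit-learn&amp;rsquo;s &lt;code>adjusted_rand_score&lt;/code>; the labels below are toy values, not the DLPFC data:&lt;/p>

```python
from sklearn.metrics import adjusted_rand_score

# Toy ground-truth spatial domain labels and two clustering results
true_labels = [0, 0, 1, 1, 2, 2, 2, 0]
full_data_clusters = [1, 1, 0, 0, 2, 2, 2, 1]    # clustering on complete data
sparse_data_clusters = [1, 0, 0, 2, 2, 1, 2, 1]  # clustering on downsampled data

# ARI is 1.0 for a perfect match up to relabeling of clusters, near 0 at chance
print(adjusted_rand_score(true_labels, full_data_clusters))   # 1.0
print(adjusted_rand_score(true_labels, sparse_data_clusters))
```

&lt;p>Because ARI is invariant to cluster label permutations, the first comparison scores 1.0 even though the label ids differ; any real information loss, as in the 20% DLPFC scenario, pulls the score down.&lt;/p>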
&lt;h1 id="what-was-accomplished">What Was Accomplished&lt;/h1>
&lt;h2 id="development-of-the-impeller-package">Development of the Impeller Package&lt;/h2>
&lt;h4 id="data-aggregation-and-preprocessing">Data Aggregation and Preprocessing:&lt;/h4>
&lt;p>We aggregated and preprocessed spatial transcriptomic datasets from multiple platforms (10X Visium, StereoSeq, SlideSeqV2), species (human, mouse), and organs (Dorsolateral Prefrontal Cortex, olfactory bulb). These datasets are readily available for download within the package.&lt;/p>
&lt;h4 id="unified-evaluation-framework">Unified Evaluation Framework:&lt;/h4>
&lt;p>A reproducible framework was developed, integrating methods such as K-Nearest Neighbors (KNN) and the deep learning-based Impeller method, enabling users to easily evaluate the performance of different gene imputation techniques.&lt;/p>
&lt;h4 id="inference-interfaces">Inference Interfaces:&lt;/h4>
&lt;p>We provided interfaces that allow users to apply gene imputation on custom datasets, offering the flexibility to predict any gene in any cell, maximizing the utility for diverse research needs.&lt;/p>
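&lt;p>As an illustration of the KNN baseline that the framework evaluates, here is a simplified sketch on toy data; &lt;code>knn_impute&lt;/code> is illustrative only and is not Impeller&amp;rsquo;s actual API:&lt;/p>

```python
import numpy as np

def knn_impute(expr, coords, k=3):
    """Impute each spot's missing (zero) gene values from the mean of its
    k nearest spatial neighbors -- a simplified stand-in for the KNN baseline."""
    imputed = expr.astype(float).copy()
    for i in range(expr.shape[0]):
        dist = np.linalg.norm(coords - coords[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]    # skip the spot itself
        means = expr[neighbors].mean(axis=0)
        missing = imputed[i] == 0                # treat zeros as dropouts
        imputed[i, missing] = means[missing]
    return imputed

# Toy spots-by-genes expression matrix and 2D spot coordinates
expr = np.array([[5., 0.], [4., 2.], [6., 3.], [0., 2.]])
coords = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.]])
print(knn_impute(expr, coords, k=2))
```

&lt;p>Evaluation then compares imputed values against held-out ground truth; the deep learning-based Impeller method replaces the neighbor average with a learned model, but the interface idea is the same.&lt;/p>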
&lt;h2 id="code-contributions-and-documentation">Code Contributions and Documentation&lt;/h2>
&lt;h4 id="repository">Repository:&lt;/h4>
&lt;p>All code related to the Impeller package has been committed to the &lt;a href="https://pypi.org/project/impeller/0.1.2/#files" target="_blank" rel="noopener">Impeller&lt;/a> repository.&lt;/p>
&lt;h4 id="link-to-versions">Link to Versions:&lt;/h4>
&lt;p>&lt;a href="https://pypi.org/project/impeller/0.1.2/#history" target="_blank" rel="noopener">Here&lt;/a> you can find all the versions made during the project, with detailed descriptions of each change.&lt;/p>
&lt;h4 id="readmemdhttpspypiorgprojectimpeller012description">&lt;a href="https://pypi.org/project/impeller/0.1.2/#description" target="_blank" rel="noopener">README.md&lt;/a>:&lt;/h4>
&lt;p>Detailed documentation on how to use the Impeller package, including installation instructions, usage examples, and explanations of the key components.&lt;/p></description></item><item><title>Final Blog: ML in Detecting and Addressing System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/</link><pubDate>Thu, 29 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/</guid><description>&lt;p>Hello! I&amp;rsquo;m Joanna! I have been contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">ML in Detecting and Addressing System Drift&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>. My project aims to design a pipeline to evaluate drift detection algorithms on system traces.&lt;/p>
&lt;h1 id="methodology">Methodology&lt;/h1>
&lt;p>Here is some background on my project: model drift, the degradation of model performance over time, is typically caused by data drift (a shift in the input distribution) or concept drift (a change in the relationship between input and output). This project focuses specifically on data drift, aiming to design a pipeline for evaluating drift detection algorithms on system traces. The goal is to benchmark different drift detection algorithms and gain a better understanding of the features of system traces. The project is divided into two main parts: dataset construction and algorithm benchmarking.&lt;/p>
&lt;h3 id="part-1-dataset-construction">PART 1: Dataset Construction&lt;/h3>
&lt;p>To benchmark drift detection algorithms on system data, it&amp;rsquo;s important to recognize that system trace data is inherently different from other data types, often containing more noise, which can complicate detection efforts. Therefore, constructing a labeled dataset specific to system data is crucial. In our case, we use the Tencent I/O block trace data as the dataset. This raw data was processed to extract timestamps along with various features such as IOPS, write size ratio, and read/write ratio, which were then used to create a data drift dataset.&lt;/p>
&lt;p>I constructed this dataset by labeling segments of the trace data as either exhibiting drift or not. To identify where the drift occurs and to help construct the dataset, I employed several offline drift detection algorithms, including Kolmogorov-Smirnov, Cramer-von Mises, KL-Divergence, and Jensen-Shannon Distance.&lt;/p>
&lt;p>To enhance the accuracy of the drift detection, especially in the presence of noise common in trace data, I applied additional preprocessing steps such as Fourier transform and moving average. These techniques help to smooth the data, making it easier to detect true drift signals. Finally, a voting strategy was used in combination with post-processing methods to build and refine the final datasets.&lt;/p>
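&lt;p>The detection steps above can be sketched as follows, using synthetic data and arbitrary window sizes rather than the actual Tencent trace pipeline:&lt;/p>

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Synthetic IOPS-like trace: a mean shift halfway through stands in for drift
trace = np.concatenate([rng.normal(100, 5, 500), rng.normal(140, 5, 500)])

# Moving average to suppress the high-frequency noise typical of trace data
smooth = np.convolve(trace, np.ones(20) / 20, mode="valid")

ref, cur = smooth[:400], smooth[-400:]   # reference window vs. current window

# Kolmogorov-Smirnov test: a tiny p-value flags a distribution shift
stat, p_value = ks_2samp(ref, cur)

# Jensen-Shannon distance between histograms on a shared binning
bins = np.histogram_bin_edges(smooth, bins=30)
p_hist, _ = np.histogram(ref, bins=bins, density=True)
q_hist, _ = np.histogram(cur, bins=bins, density=True)
js = jensenshannon(p_hist, q_hist)       # near its maximum for disjoint windows

print(p_value, js)
```

&lt;p>In the actual pipeline, several such detectors vote on each segment, and the voting plus post-processing produces the final drift/no-drift labels.&lt;/p>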
&lt;p>The first figure below illustrates the segments of IOPS where drift has been detected. The second figure shows the segments of data where no drift occurs.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Drift Data" srcset="
/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_ed8c1284ad85bf6b4049e6c666e015b1.webp 400w,
/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_8be8f041f7c86965792bf781e2489836.webp 760w,
/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/drift_hueed6613a6bb326df79ee6a6125caea30_218453_ed8c1284ad85bf6b4049e6c666e015b1.webp"
width="715"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Non-Drift Data" srcset="
/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_ac2898dbe6747b2a53de6ee136def2e4.webp 400w,
/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_046f624ccca1c3537b820060909a7bd2.webp 760w,
/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/nondrift_hu71453c92de8e4df0dd4aefaf6b160e99_327249_ac2898dbe6747b2a53de6ee136def2e4.webp"
width="734"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="part-2-benchmark-drift-detection-algorithms">PART 2: Benchmark Drift Detection Algorithms&lt;/h3>
&lt;p>This part focuses on benchmarking the Jensen-Shannon and Wasserstein drift detection methods using system trace data. The evaluation metrics are categorized into three main areas:&lt;/p>
&lt;ol>
&lt;li>Detection Accuracy Metrics&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>True Positive Rate (Recall)&lt;/li>
&lt;li>True Negative Rate (Specificity)&lt;/li>
&lt;li>Precision&lt;/li>
&lt;li>F1-Score&lt;/li>
&lt;/ul>
&lt;ol start="2">
&lt;li>Detection Overhead Metrics&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Time Taken: the computational time required to detect drifts, a critical overhead factor&lt;/li>
&lt;/ul>
&lt;ol start="3">
&lt;li>Stability Metrics&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>False Positive Rate&lt;/li>
&lt;li>False Negative Rate&lt;/li>
&lt;/ul>
&lt;p>(Additional) Comparative Analysis:&lt;/p>
&lt;ul>
&lt;li>Accuracy Across Different Features: How well the detection algorithms perform when applied to various features within the system trace data.&lt;/li>
&lt;/ul>
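&lt;p>All of the accuracy and stability metrics above derive from the confusion matrix of predicted versus labeled drift windows; a minimal sketch with toy labels:&lt;/p>

```python
def detection_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary per-window drift labels (1 = drift)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # true positive rate
    return {
        "recall": recall,
        "specificity": tn / (tn + fp),           # true negative rate
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "fnr": fn / (fn + tp),
    }

# Toy example: ground-truth labels for 6 windows vs. a detector's predictions
print(detection_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```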
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Jensen-Shannon Distance Results" srcset="
/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_a7d269c0f217c0b79c79d4f011f54fd9.webp 400w,
/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_06f6c204e8a4457868c4b2bc43fb7c28.webp 760w,
/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/js-result_hufb49199342183a3232a30a04b1d40959_183762_a7d269c0f217c0b79c79d4f011f54fd9.webp"
width="760"
height="607"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Wasserstein Distance Results" srcset="
/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_36d4aa25ff595624ac289c635f82a085.webp 400w,
/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_efd489c01a8f75e3d059b819fc51eb25.webp 760w,
/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240829-joanna/wd-result_hufb49199342183a3232a30a04b1d40959_190137_36d4aa25ff595624ac289c635f82a085.webp"
width="760"
height="607"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="discussion">Discussion&lt;/h1>
&lt;p>The results clearly demonstrate that the Jensen-Shannon distance method outperforms the Wasserstein distance method in detecting drift. Additionally, the write size ratio proves to be a more effective feature for representing the variations in the data, offering a more nuanced understanding of the underlying changes.&lt;/p>
&lt;h1 id="conclusion-and-next-steps">Conclusion and Next Steps&lt;/h1>
&lt;p>In conclusion, this project establishes a pipeline that encompasses data labeling, data processing, and the benchmarking of drift detection algorithms. It serves as a first step toward detecting drift in system data.&lt;/p>
&lt;p>There is significant potential for further improvement. Future work should focus on enhancing dataset construction by incorporating large language models (LLMs) and other advanced techniques to further clean and refine the datasets. Additionally, the evaluation of drift detection methods should be expanded beyond the current benchmarks, which only include two statistical methods. Incorporating additional statistical methods, as well as machine learning (ML) and deep learning (DL) approaches, could provide a more comprehensive analysis. Furthermore, exploring a broader range of evaluation metrics will ensure a more robust and accurate assessment of drift detection performance. These steps will help to advance the accuracy and reliability of drift detection in system trace data.&lt;/p>
&lt;h1 id="deliverables">Deliverables&lt;/h1>
&lt;p>The following are the deliverables of this project:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.chameleoncloud.org/experiment/share/384ee2bd-853c-427b-877b-3af2993fb502" target="_blank" rel="noopener">Trovi Artifact&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/JoannaCCJH/drift-detection-OSRE24" target="_blank" rel="noopener">Github Repository&lt;/a>: This repository contains the code for generating drift datasets with labels and notebooks with benchmarking results&lt;/li>
&lt;/ul></description></item><item><title>Final Blogpost: Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240828-triveni5/</link><pubDate>Wed, 28 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240828-triveni5/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Triveni, a Master&amp;rsquo;s student in Computer Science at Northern Illinois University (NIU). I&amp;rsquo;m excited to share my progress on the OSRE 2024 project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Categorize Differences in Reproduced Visualizations&lt;/a> focusing on data visualization reproducibility. Working under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>, I&amp;rsquo;ve made some significant strides and faced some interesting challenges.&lt;/p>
&lt;h1 id="reproducibility-in-data-visualization">Reproducibility in data visualization&lt;/h1>
&lt;p>Reproducibility is crucial in data visualization, ensuring that two visualizations accurately convey the same data. This is essential for maintaining transparency and trust in data-driven decision-making. When comparing two visualizations, the challenge is not just spotting differences but determining which differences are meaningful. Tools like OpenCV are often used for image comparison, but they may detect all differences, including those that do not impact the data&amp;rsquo;s interpretation. For example, slight shifts in labels might be flagged as differences even if the underlying data remains unchanged, making it challenging to assess whether the visualizations genuinely differ in terms of the information they convey.&lt;/p>
&lt;h1 id="a-breakthrough-with-chartdetective">A Breakthrough with ChartDetective&lt;/h1>
&lt;p>Among various tools like ChartOCR and ChartReader, ChartDetective proved to be the most effective. This tool enabled me to extract data from a range of visualizations, including bar charts, line charts, box plots, and scatter plots. To enhance its capabilities, I modified the codebase to capture pixel values alongside the extracted data and store both in a CSV file. This enhancement allowed for a direct comparison of data values and their corresponding pixel coordinates between two visualizations, focusing on meaningful differences that truly impact data interpretation.&lt;/p>
&lt;h1 id="example-comparing-two-bar-plots-with-chartdetective">Example: Comparing Two Bar Plots with ChartDetective&lt;/h1>
&lt;p>Consider two bar plots that visually appear similar but have slight differences in their data values. Using ChartDetective, I extracted the data and pixel coordinates from both plots and stored this information in a CSV file. The tool then compared these values to identify any discrepancies.&lt;/p>
&lt;p>For instance, in one bar plot, the heights of specific bars were slightly increased. By comparing the CSV files generated by ChartDetective, I was able to pinpoint these differences precisely. The final step involved highlighting these differences on one of the plots using OpenCV, making it clear where the visualizations diverged. This approach ensures that only meaningful differences—those that reflect changes in the data—are considered when assessing reproducibility.&lt;/p>
&lt;ul>
&lt;li>ChartDetective: SVG or PDF file of the visualization is uploaded to extract data.&lt;/li>
&lt;/ul>
&lt;p align="center">
&lt;img src="./barplot_chartdetective.png" alt="ChartDetective" style="width: 80%; height: auto;">
&lt;/p>
&lt;ul>
&lt;li>Data Extraction: Data values along with pixel details are stored in the CSV files.&lt;/li>
&lt;/ul>
&lt;p align="center">
&lt;img src="./barplots_pixels.png" alt="Data_Extraction" style="width: 80%; height: auto;">
&lt;/p>
&lt;ul>
&lt;li>Highlighting the differences: Differences are highlighted on one of the plots using OpenCV.&lt;/li>
&lt;/ul>
&lt;p align="center">
&lt;img src="./Highlighted_differences.png" alt="Highlighting the differences" style="width: 60%; height: auto;">
&lt;/p>
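&lt;p>The comparison and highlighting steps can be sketched as follows; the rows are toy stand-ins for the CSV output, and &lt;code>highlight&lt;/code> mimics what &lt;code>cv2.rectangle&lt;/code> does so the sketch stays dependency-free:&lt;/p>

```python
import math
import numpy as np

# Hypothetical rows as stored by the modified ChartDetective:
# (bar label, data value, pixel x, pixel y, width, height)
plot_a = [("A", 10.0, 40, 200, 30, 100), ("B", 15.0, 90, 150, 30, 150)]
plot_b = [("A", 10.0, 40, 200, 30, 100), ("B", 17.0, 90, 130, 30, 170)]

def differing_bars(a, b, tol=0.5):
    """Pair rows by label and keep only data-level differences."""
    by_label = {row[0]: row for row in b}
    return [(ra, by_label[ra[0]]) for ra in a
            if not math.isclose(ra[1], by_label[ra[0]][1], abs_tol=tol)]

def highlight(img, box, thickness=2):
    """Draw a rectangle outline, as cv2.rectangle would."""
    x, y, w, h = box
    img[y:y + h, x:x + thickness] = 255
    img[y:y + h, x + w - thickness:x + w] = 255
    img[y:y + thickness, x:x + w] = 255
    img[y + h - thickness:y + h, x:x + w] = 255

img = np.zeros((400, 400), dtype=np.uint8)
for _, rb in differing_bars(plot_a, plot_b):
    highlight(img, rb[2:])

print([ra[0] for ra, _ in differing_bars(plot_a, plot_b)])  # ['B']
```

&lt;p>Only bar B is flagged: its data value changed, while pixel-level tolerance absorbs rendering noise that would fool a raw image diff.&lt;/p>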
&lt;h1 id="understanding-user-perspectives-on-reproducibility">Understanding User Perspectives on Reproducibility&lt;/h1>
&lt;p>To complement the technical analysis, I created a pilot survey to understand how users perceive reproducibility in data visualizations. The survey evaluates user interpretations of two visualizations and explores which visual parameters impact their decision-making. This user-centered approach is crucial because even minor differences in visual representation can significantly affect how data is interpreted and used.&lt;/p>
&lt;p>Pilot Survey Example:&lt;/p>
&lt;p>Pixel Differences: In one scenario, the height of two bars was altered slightly, introducing a noticeable yet subtle change.&lt;/p>
&lt;p>Label Swapping: In another scenario, the labels of two bars were swapped without changing their positions or heights.&lt;/p>
&lt;p align="center">
&lt;img src="./barchart_labels_swap.png" alt="Label Swapping" style="width: 80%; height: auto;">
&lt;/p>
&lt;p>Participants will be asked to evaluate the reproducibility of these visualizations, considering whether the differences impact their interpretation of the data. The goal is to determine which visual parameters—such as bar height or label positioning—users find most critical when assessing the similarity of visualizations.&lt;/p>
&lt;h1 id="future-work-and-conclusion">Future Work and Conclusion&lt;/h1>
&lt;p>Going forward, I plan to develop a proof of concept based on these findings and implement an extensive survey to further explore the impact of visual parameters on users&amp;rsquo; perceptions of reproducibility. Understanding this will help refine tools and methods for comparing visualizations, ensuring they not only look similar but also accurately represent the same underlying data.&lt;/p></description></item><item><title>ORAssistant - LLM Assistant for OpenROAD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240827-palaniappan-r/</link><pubDate>Tue, 27 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240827-palaniappan-r/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>, an undergraduate student at BITS Pilani, India. Over the past few months, I&amp;rsquo;ve been working as a GSoC contributor on the &lt;a href="https://summerofcode.withgoogle.com/programs/2024/projects/DSo6kvA5" target="_blank" rel="noopener">LLM Assistant for OpenROAD - Model Architecture and Prototype&lt;/a> project, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>. &lt;/p>
&lt;p>The primary objective of my project is to improve the user experience within OpenROAD and OpenROAD-flow-scripts by utilizing Large Language Models (LLMs) to offer fast, relevant answers to FAQs and common issues. The ORAssistant chatbot aims to act as a first line of support, addressing basic queries in domains such as installation and command usage. Its goal is to resolve simple issues before they escalate to public forums, thereby reducing the number of support tickets on platforms like GitHub Issues.&lt;/p>
&lt;h2 id="architecture-overview">Architecture Overview&lt;/h2>
&lt;p>Retrieval-augmented generation (RAG) is a technique that improves the question-answering capabilities and reliability of LLMs by incorporating factual information from external sources. When a user submits a query, the RAG process begins by fetching relevant information from a knowledge base. The retrieved content, combined with the original query, is then provided to the LLM to generate a relevant, informed response.&lt;/p>
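&lt;p>A minimal, self-contained sketch of this retrieve-then-generate flow, with a toy keyword retriever standing in for vector search and the final LLM call omitted:&lt;/p>

```python
import math
from collections import Counter

# Toy knowledge base: snippets standing in for the real documentation corpus
docs = [
    "To install OpenROAD, clone the repository and run the build script.",
    "Global placement is performed by the gpl command inside OpenROAD.",
    "OpenSTA performs static timing analysis on the routed design.",
]

def tokens(text):
    return Counter(w.strip(".,?!").lower() for w in text.split())

def score(query, doc):
    """Bag-of-words cosine similarity, a stand-in for embedding search."""
    q, d = tokens(query), tokens(doc)
    dot = sum(q[w] * d[w] for w in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm

def rag_prompt(query, k=1):
    """Retrieve the top-k snippets and splice them into the LLM prompt."""
    context = sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query

print(rag_prompt("How do I install OpenROAD?"))
```

&lt;p>The assembled prompt grounds the model&amp;rsquo;s answer in retrieved documentation rather than in whatever the LLM memorized during training.&lt;/p>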
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="RAG Architecture" srcset="
/report/osre24/ucsd/openroad/20240827-palaniappan-r/rag_arch_hu8e03f7a9c64923f7711e5a6dbcc7ac36_44482_df391271ecbbb458269da059ad7cf993.webp 400w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/rag_arch_hu8e03f7a9c64923f7711e5a6dbcc7ac36_44482_3c455fc32c6d18b57b31be5f86590e99.webp 760w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/rag_arch_hu8e03f7a9c64923f7711e5a6dbcc7ac36_44482_1200x1200_fit_q75_h2_lanczos_2.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240827-palaniappan-r/rag_arch_hu8e03f7a9c64923f7711e5a6dbcc7ac36_44482_df391271ecbbb458269da059ad7cf993.webp"
width="760"
height="410"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="the-knowledge-base">The Knowledge Base&lt;/h2>
&lt;p>ORAssistant is designed to answer queries about all the major tools in the OR flow. The knowledge base primarily consists of official documentation from OpenROAD, OpenROAD-flow-scripts, and their respective manpages. Instead of scraping these primary sources from their websites, the docs are built to the desired markdown format directly from the respective GitHub repositories, using specific commit hashes for reproducibility. The knowledge base also includes documentation from other essential applications in the EDA flow, such as Yosys and OpenSTA. Additionally, it includes scraped and annotated conversational data from discussions on the OpenROAD and OpenROAD-flow-scripts GitHub pages.&lt;/p>
&lt;p>The entire dataset building process has been automated, allowing for dynamic updates to accommodate any live changes.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Knowledge Base Building" srcset="
/report/osre24/ucsd/openroad/20240827-palaniappan-r/knowledge_base_hu389de4d06f6f5009d6f8a5e32337289b_95686_4f5d36607bb3f3a68d364c4b052d7564.webp 400w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/knowledge_base_hu389de4d06f6f5009d6f8a5e32337289b_95686_774ab20167d5029994bd3450cf9f9627.webp 760w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/knowledge_base_hu389de4d06f6f5009d6f8a5e32337289b_95686_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240827-palaniappan-r/knowledge_base_hu389de4d06f6f5009d6f8a5e32337289b_95686_4f5d36607bb3f3a68d364c4b052d7564.webp"
width="760"
height="424"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="the-tool-based-architecture">The Tool-Based Architecture&lt;/h2>
&lt;p>After experimenting with multiple RAG approaches, a tool-based setup proved to be the most effective solution. Data from various domains are embedded into vector databases, and hybrid search retriever functions are applied to these vector stores. These functions are organized as individual tools that can be called by the chatbot. To maintain context, each query is rephrased while considering the chat history. This ensures a more precise and context-rich query. Please refer to my previous &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240719-palaniappan-r/">blog post&lt;/a> for more information on the retrieval tools.&lt;/p>
&lt;p>As depicted in the flowchart, a preliminary LLM call analyzes the input query, rephrases it based on the chat history and picks the appropriate tools for the rephrased query. Subsequently, documents are retrieved using the tool and sent to the LLM, which produces a relevant, context-aware response.&lt;/p>
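&lt;p>The rephrase-route-retrieve flow can be sketched as follows; the tool names and routing logic here are hypothetical placeholders, not ORAssistant&amp;rsquo;s real implementation:&lt;/p>

```python
# Tool registry: names and bodies are illustrative placeholders, not the
# actual ORAssistant retriever tools
def search_or_docs(q):
    return "[OpenROAD docs matching: " + q + "]"

def search_orfs_docs(q):
    return "[OpenROAD-flow-scripts docs matching: " + q + "]"

TOOLS = {"openroad_docs": search_or_docs, "orfs_docs": search_orfs_docs}

def rephrase(query, history):
    """Stand-in for the preliminary LLM call that folds chat history in."""
    return history[-1] + " // " + query if history else query

def route(query):
    """Stand-in for LLM tool selection: naive keyword routing."""
    return "orfs_docs" if "flow" in query.lower() else "openroad_docs"

def answer(query, history=()):
    q = rephrase(query, list(history))
    context = TOOLS[route(q)](q)
    # A real system would now call the LLM with both `context` and `q`
    return context

print(answer("How do I run the flow?"))
```

&lt;p>In the real system, both the rephrasing and the tool choice are made by an LLM call rather than keyword rules, but the dispatch structure is the same.&lt;/p>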
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Tool Based Architecture" srcset="
/report/osre24/ucsd/openroad/20240827-palaniappan-r/tool_arch_hua38e30b25f21f78f6a933005dd192c89_51518_75dcf9730e30df6c2af5b2e12a33089e.webp 400w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/tool_arch_hua38e30b25f21f78f6a933005dd192c89_51518_7e257ae5876d4a2639c310e21b80ae97.webp 760w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/tool_arch_hua38e30b25f21f78f6a933005dd192c89_51518_1200x1200_fit_q75_h2_lanczos_2.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240827-palaniappan-r/tool_arch_hua38e30b25f21f78f6a933005dd192c89_51518_75dcf9730e30df6c2af5b2e12a33089e.webp"
width="760"
height="546"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="using-orassistant">Using ORAssistant&lt;/h2>
&lt;p>ORAssistant is currently hosted at this &lt;a href="https://orassistant.netlify.app/" target="_blank" rel="noopener">link&lt;/a>.&lt;/p>
&lt;p>To set up ORAssistant locally, find detailed instructions in the &lt;a href="">GitHub Repo&lt;/a>. Both cloud-based LLM providers (Gemini, VertexAI) and local options (Ollama) are supported.&lt;/p>
&lt;p>Here&amp;rsquo;s an example of ORAssistant in action:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Example 1" srcset="
/report/osre24/ucsd/openroad/20240827-palaniappan-r/example1_huc9b9a5dd27909efbfc0d6a5a5532244f_175139_07d9479d1764c9189c0bdd3947bc3a05.webp 400w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/example1_huc9b9a5dd27909efbfc0d6a5a5532244f_175139_ef65593aa1ba677fc24f91d973e5bfc7.webp 760w,
/report/osre24/ucsd/openroad/20240827-palaniappan-r/example1_huc9b9a5dd27909efbfc0d6a5a5532244f_175139_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240827-palaniappan-r/example1_huc9b9a5dd27909efbfc0d6a5a5532244f_175139_07d9479d1764c9189c0bdd3947bc3a05.webp"
width="760"
height="384"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="future-plans">Future Plans&lt;/h2>
&lt;p>To further enhance the usability of ORAssistant, there are plans to add support for flow script generation. This will become possible after adding a dedicated script generation tool to the current tool-based workflow. Support for more tools in the EDA flow, such as KLayout, will also be added in the near future.&lt;/p>
&lt;p>Additionally, ORAssistant is planned to be integrated directly into OpenROAD&amp;rsquo;s CLI and GUI interfaces.&lt;/p>
&lt;p>As I near the end of my GSoC, I&amp;rsquo;d like to thank the GSoC Organizing Committee, UC OSPO and The OpenROAD Project for this incredible opportunity. I&amp;rsquo;m immensely grateful to &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a> for their support and guidance throughout my GSoC journey. Thank You.&lt;/p></description></item><item><title>Final Blogpost: Drift Management Strategies Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/</link><pubDate>Sat, 24 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/</guid><description>&lt;h1 id="background">Background&lt;/h1>
&lt;p>Hello there! I&amp;rsquo;m William and this is my final blog for my proposal &amp;ldquo;Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches&amp;rdquo; under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>, as part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">LAST&lt;/a> project.&lt;/p>
&lt;p>If you&amp;rsquo;re not familiar with it, this project aims to address the issue of model aging, where machine learning (ML) models experience a decline in effectiveness over time due to environmental changes, known as drift. My goal is to design an extensible pipeline that evaluates and benchmarks the robustness of state-of-the-art algorithms in addressing these drifts.&lt;/p>
&lt;h1 id="deliverables">Deliverables&lt;/h1>
&lt;p>You can find my list of deliverables here:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://docs.google.com/document/d/14tSmBndX1RBv_d3luRcqFDmbuMk6XsGOB8G7tzTcHnE/edit" target="_blank" rel="noopener">Final report&lt;/a>, this blog is a summarized version of my final report, so do take a look if you&amp;rsquo;d like to know more!&lt;/li>
&lt;li>&lt;a href="https://github.com/williamnixon20/osre-drift" target="_blank" rel="noopener">Github repository&lt;/a>, contains code as well as the raw experiment results.&lt;/li>
&lt;li>&lt;a href="https://www.chameleoncloud.org/experiment/share/e3ae5f07-4340-48c0-94e8-ba99ee2bf691" target="_blank" rel="noopener">Trovi artifact&lt;/a>&lt;/li>
&lt;/ul>
&lt;h1 id="evaluation">Evaluation&lt;/h1>
&lt;p>Here are some of the graphs that show the performance of every algorithm on the created datasets. For more graphs and figures, you can check out my final report:&lt;/p>
&lt;ul>
&lt;li>CIRCLE: AUE demonstrates stability, maintaining high accuracy even as the data drifts, which may be due to its ensemble nature; it is even more stable than the baseline retraining algorithms. Matchmaker also recovers quickly upon experiencing drift, faster than RetrainWin, which again may stem from its ranking of the highest-performing models for inference. On the other hand, DriftSurf experiences several random drops in accuracy, indicating that it can be somewhat unstable.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Circle" srcset="
/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_1f197cacfff5e655fb7250e99000891a.webp 400w,
/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_f3747fe006d5ab30df409c38f9f518fd.webp 760w,
/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/circle_hubbc4ef0a01f86beba0bcc28be93ed90c_288185_1f197cacfff5e655fb7250e99000891a.webp"
width="760"
height="481"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>SINE: Similar to CIRCLE, AUE demonstrates stability throughout the dataset, maintaining high accuracy even as the data drifts. Matchmaker, however, struggled to adapt to such a sudden drift, needing several windows to recover from the drop. DriftSurf&amp;rsquo;s performance was notably better than the baselines: unlike them, it recovered fairly quickly upon experiencing drift.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sine" srcset="
/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_16a2edab20232bdbb4c23e0fa37398dd.webp 400w,
/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_cc1b5a8e845047c50f95da38dbf5a262.webp 760w,
/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/sine_huc62276117407137b51fc43d9c9e20c37_686961_16a2edab20232bdbb4c23e0fa37398dd.webp"
width="760"
height="500"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>CovCon: On CovCon, Matchmaker achieved the best accuracy, performing comparably to RetrainWindow, as it selects the model most relevant to each incoming batch (the model trained on the most similar features). Most of the other algorithms suffered on this dataset, particularly AUE, whose performance dropped to the level of the remaining algorithms and baselines.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="CovCon" srcset="
/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_658cedab5102506c34376dd9e6fe748e.webp 400w,
/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_9bd5024236a880f53d616f4ed4a7294f.webp 760w,
/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/covcon_hua1f3c6557283d861ec9482d15a04c8d4_668622_658cedab5102506c34376dd9e6fe748e.webp"
width="760"
height="502"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>IOAdmission: Performance on this dataset was led by AUE, which maintained impressive stability among all of the algorithms tested, followed closely by Matchmaker. The other algorithms fluctuate considerably in accuracy.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="IOAdmission" srcset="
/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_6022f696d548c8dea5e78a1144b71820.webp 400w,
/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_b58b4d4753cc231ef4ada6a72fbb773a.webp 760w,
/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240827-williamn/ioadm_hubb89d2e4deb0623f90d37221d48664dd_505735_6022f696d548c8dea5e78a1144b71820.webp"
width="760"
height="500"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
&lt;h1 id="findings--discussion">Findings / Discussion&lt;/h1>
&lt;p>From the experiments conducted, the findings are as follows:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Matchmaker performed particularly well on the CovCon dataset. This may be due to its ability to choose the most relevant trained model from its ensemble at inference time. Its training time is also the best among the algorithms compared, which is notable given that it keeps extra data to train an additional random forest model for ranking the ensemble members. However, its inference time was the longest of all the algorithms, likely because inference requires traversing all of the leaf nodes of the ranking random forest to compute the covariate-shift score.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>AUE performed particularly well on the CIRCLE and IOAdmission datasets, and it is quite competitive on the other datasets too. Its weighting function, which favors highly relevant models and evicts less relevant ones, may be key. Its inference time is decent: slower than most baselines and DriftSurf, but faster than Matchmaker. However, its training time was the longest among the competitors, as it applies an expensive weighting function to weight, evict, or retrain models on every retraining.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>DriftSurf performed very similarly to the RetrainWindow baseline on almost all datasets, except for IOAdmission and SINE, where it did better. This may be because it maintains at most two models per iteration, so its performance was not competitive against the multi-model approaches used in Matchmaker and AUE. On the plus side, its inference time is comparable to the single-model baseline, with almost no inference overhead compared to most competitors. Another plausible explanation for the shortfall is the lack of tuning of parameters such as the number of windows retained, the length of the reactive period, and the reactivity sensitivity threshold; better performance could likely be achieved if these were tuned further.&lt;/p>
&lt;/li>
&lt;/ul>
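&lt;p>The per-batch accuracy curves discussed above come from a windowed, test-then-train (prequential) protocol. The sketch below illustrates that protocol with a toy incremental classifier; it is not the benchmark harness used in this project, and the nearest-centroid model merely stands in for the drift-handling algorithms.&lt;/p>

```python
# Illustrative test-then-train (prequential) evaluation over a batch stream.
# This is a sketch of the protocol, not the project's benchmark harness.
import numpy as np

class NearestCentroid:
    """Tiny incremental classifier: predict the class whose centroid is closest."""
    def __init__(self):
        self.sums, self.counts = {}, {}

    def partial_fit(self, X, y):
        for c in np.unique(y):
            rows = X[y == c]
            self.sums[c] = self.sums.get(c, 0) + rows.sum(axis=0)
            self.counts[c] = self.counts.get(c, 0) + len(rows)

    def score(self, X, y):
        classes = sorted(self.sums)
        cents = np.stack([self.sums[c] / self.counts[c] for c in classes])
        pred = [classes[np.argmin(((cents - x) ** 2).sum(axis=1))] for x in X]
        return float(np.mean(np.array(pred) == y))

def prequential_eval(batches):
    """Test on each incoming batch, then train on it (test-then-train)."""
    model, accuracies = NearestCentroid(), []
    for i, (X, y) in enumerate(batches):
        if i > 0:  # nothing to test before the first fit
            accuracies.append(model.score(X, y))
        model.partial_fit(X, y)
    return accuracies

# Toy stream with an abrupt concept drift (labels flip) halfway through.
rng = np.random.default_rng(0)
def make_batch(flip):
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] > 0).astype(int)
    return X, (1 - y) if flip else y

acc = prequential_eval([make_batch(flip=i >= 10) for i in range(20)])
```

&lt;p>On this toy stream, accuracy is high before the drift and collapses on the first flipped batch, mirroring the drop-and-recover shape seen in the plots above.&lt;/p>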
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>These are some of the potential extensions for this project:&lt;/p>
&lt;ol>
&lt;li>Optimize Matchmaker&amp;rsquo;s inference time: improving its efficiency in covariate-shift ranking, for example by simplifying the random forest traversal, could reduce inference time without impacting accuracy.&lt;/li>
&lt;li>Extend the work to other frameworks such as TensorFlow or PyTorch, as it currently supports only scikit-learn base models.&lt;/li>
&lt;/ol>
&lt;p>Thank you for reading!&lt;/p></description></item><item><title>Hardware Hierarchical Dynamical Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/</link><pubDate>Sat, 24 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/</guid><description>&lt;p>Hi everyone! I am Ujjwal Shekhar, a Computer Science student at the International Institute of Information Technology - Hyderabad. I am excited to share my work on the project titled &lt;strong>&amp;ldquo;Hardware Hierarchical Dynamical Systems&amp;rdquo;&lt;/strong> as part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/osre/">Open Source Research Experience (OSRE) program&lt;/a> and &lt;a href="https://summerofcode.withgoogle.com/" target="_blank" rel="noopener">Google Summer of Code&lt;/a>. This project has been an incredible journey, and I&amp;rsquo;ve had the privilege of working with my mentors, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>.&lt;/p>
&lt;h1 id="project-overview-and-goals">Project Overview and Goals&lt;/h1>
&lt;blockquote>
&lt;p>Abstract Syntax Trees (ASTs) are fundamental to modern compilers, serving as the backbone for parsing and transforming code. When compiling hardware code, the sheer volume of data can make compilation times a significant bottleneck. My project focuses on building a memory-optimized tree data structure specifically tailored for AST-typical queries.&lt;/p>
&lt;/blockquote>
&lt;p>The &lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a> repository, developed by the &lt;a href="https://masc.soe.ucsc.edu" target="_blank" rel="noopener">Micro Architecture Lab&lt;/a> at UCSC, offers a compiler infrastructure optimized for hardware synthesis and simulation. The existing &lt;a href="https://github.com/masc-ucsc/livehd/blob/master/core/lhtree.hpp" target="_blank" rel="noopener">LHTree&lt;/a> data structure provides a foundation, but there was significant potential for further optimization, which I explored throughout this project.&lt;/p>
&lt;h3 id="key-ast-queries">Key AST Queries&lt;/h3>
&lt;p>The core queries that the tree is optimized for include:&lt;/p>
&lt;ul>
&lt;li>Finding the parent of a node.&lt;/li>
&lt;li>Finding the first and last child of a node.&lt;/li>
&lt;li>Locating the previous and next sibling of a node.&lt;/li>
&lt;li>Adding a child to a node.&lt;/li>
&lt;li>Inserting a sibling to a node.&lt;/li>
&lt;li>Performing preorder, postorder, and sibling order traversal.&lt;/li>
&lt;li>Removing a leaf or an entire subtree from the tree.&lt;/li>
&lt;/ul>
&lt;p>The primary goal was to create a tree class that excels at handling these queries efficiently, while still being robust enough to support less frequent operations. The new HHDS tree structure has demonstrated superior performance for specific tree configurations and continues to show potential across other types, particularly in memory consumption and cache efficiency, compared to the current LHTree.&lt;/p>
&lt;p>The benchmarks were run with Google Benchmark to test the tree for scalability and performance, and profiling to find bottlenecks was done with Callgrind and KCachegrind. The new version of the tree is currently being integrated into the LiveHD core repository.&lt;/p>
&lt;h2 id="background-and-motivation">Background and Motivation&lt;/h2>
&lt;h3 id="naive-approach">Naive approach&lt;/h3>
&lt;p>A straightforward method for storing an n-ary tree is to maintain pointers from each node to its parent, children, and immediate siblings. While simple, this approach is memory-intensive and has poor cache efficiency due to the non-contiguous nature of nodes in memory. The variable memory usage per node, depending on the number of children, can also introduce significant overhead.&lt;/p>
&lt;h3 id="enhancements-to-the-naive-approach">Enhancements to the Naive Approach&lt;/h3>
&lt;p>To reduce memory overhead, one optimization is to store only pointers to the first and last child within each node. This reduces memory usage to a constant per node. Additionally, since many AST-related queries focus on the tree&amp;rsquo;s structure rather than the data itself, we can separate the data from the structure. The tree would store only pointers to the data, allowing the tree structure to be optimized independently of the data storage.&lt;/p>
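&lt;p>As a minimal sketch of this separation (illustrative Python with parallel index arrays, not the project&amp;rsquo;s C++ classes), the structure stores only indices, while the payloads live in a separate array:&lt;/p>

```python
# Sketch: tree structure separated from node data. Each node keeps only
# parent / first-child / last-child / next-sibling indices; the payload
# lives in a parallel list, so the structure can be optimized on its own.
class Tree:
    def __init__(self, root_data):
        # Structure arrays (index 0 is the root; -1 means "none")...
        self.parent = [-1]
        self.first_child = [-1]
        self.last_child = [-1]
        self.next_sibling = [-1]
        # ...and the data, stored separately at matching indices.
        self.data = [root_data]

    def add_child(self, p, payload):
        n = len(self.data)
        self.parent.append(p)
        self.first_child.append(-1)
        self.last_child.append(-1)
        self.next_sibling.append(-1)
        self.data.append(payload)
        if self.first_child[p] == -1:
            self.first_child[p] = n      # first child of p
        else:
            self.next_sibling[self.last_child[p]] = n  # link after old last child
        self.last_child[p] = n
        return n

t = Tree("root")
a = t.add_child(0, "a")
b = t.add_child(0, "b")
```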
&lt;blockquote>
&lt;p>While separating the data and the structure may seem like an obvious improvement, we will see that it can be extended to provide greater benefits.&lt;/p>
&lt;/blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Naive and improved methods of storing the tree" srcset="
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig1_hu78c649c062d309c5f78b4b25d06f11c2_90521_7d57a7eca121eafa6de264160253597d.webp 400w,
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig1_hu78c649c062d309c5f78b4b25d06f11c2_90521_ad09a2aa9614ada2d18b11fd703737e7.webp 760w,
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig1_hu78c649c062d309c5f78b4b25d06f11c2_90521_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig1_hu78c649c062d309c5f78b4b25d06f11c2_90521_7d57a7eca121eafa6de264160253597d.webp"
width="760"
height="686"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="improving-the-cache-efficiency">Improving the cache efficiency&lt;/h3>
&lt;p>While reducing memory consumption is beneficial, the tree&amp;rsquo;s cache efficiency can still be suboptimal if the children of a node are scattered in memory. To enhance cache efficiency, storing children in contiguous memory locations is crucial. This improves spatial locality, which in turn boosts cache performance. Additionally, this approach eliminates the need to explicitly store data pointers in the tree, as the data resides at a contiguous memory index aligned with the bookkeeping.&lt;/p>
&lt;p>By storing children contiguously, we can also eliminate the need for previous and next sibling pointers, as siblings are inherently adjacent in memory. Similarly, we can avoid storing the parent pointer for every child, since all children share the same parent.&lt;/p>
&lt;h2 id="optimizations-in-lhtree-old-method">Optimizations in LHTree (Old method)&lt;/h2>
&lt;p>The &lt;a href="https://github.com/masc-ucsc/livehd/blob/master/core/lhtree.hpp" target="_blank" rel="noopener">LHTree&lt;/a> class in LiveHD was designed with these optimizations in mind. It groups siblings into &lt;em>chunks&lt;/em> of four, storing the parent pointer only in the first sibling of each chunk. The last sibling in each chunk points to the next chunk, minimizing the number of pointers required and thus reducing memory overhead.&lt;/p>
&lt;p>LHTree organizes the entire tree as a 2-dimensional array, where the first dimension represents the tree level and the second dimension represents the node index at that level. This structure improves cache efficiency by storing nodes contiguously in memory. Each tree position is a 48-bit ID, with the last 32 bits representing the node&amp;rsquo;s index and the first 16 bits indicating the tree level.&lt;/p>
&lt;p>Maintaining the level explicitly in this way limits the tree&amp;rsquo;s scalability for deeper trees, due to the fixed number of bits allocated for the level.&lt;/p>
&lt;blockquote>
&lt;p>Despite these optimizations, LHTree has some limitations, particularly in cache alignment and flexibility, which the HHDS tree aims to address.&lt;/p>
&lt;/blockquote>
&lt;p>Unfortunately, the number of bits required by each &amp;ldquo;chunk&amp;rdquo; happens to be slightly larger than a single cache line (512 bits), so the cache efficiency of the tree is not optimal.&lt;/p>
&lt;h2 id="hhds-tree--a-new-approach">HHDS Tree : A New Approach&lt;/h2>
&lt;h3 id="eliminating-levels">Eliminating Levels&lt;/h3>
&lt;p>The HHDS tree stores everything in a single vector, removing the need for explicit level information. This simplification not only improves cache efficiency but also eliminates restrictions on the number of nodes per level and the total number of levels.&lt;/p>
&lt;h3 id="enhanced-cache-alignment">Enhanced Cache Alignment&lt;/h3>
&lt;p>In the HHDS tree, each node has a 46-bit ID. Chunks in the HHDS tree contain up to eight children, with the first 43 bits of the absolute ID serving as the chunk ID and the last three bits indicating the node&amp;rsquo;s offset within the chunk.&lt;/p>
&lt;p>For each chunk, which is exactly 64 bytes (or 512 bits) long—matching the size of a cache line—the following information is stored:&lt;/p>
&lt;ul>
&lt;li>A 46-bit parent pointer (absolute ID).&lt;/li>
&lt;li>A 43-bit first child long pointer (chunk ID).&lt;/li>
&lt;li>A 43-bit last child long pointer (chunk ID).&lt;/li>
&lt;li>43-bit previous and next sibling chunk pointers.&lt;/li>
&lt;li>Seven 21-bit short delta pointers for the first child.&lt;/li>
&lt;li>Seven 21-bit short delta pointers for the last child.&lt;/li>
&lt;/ul>
&lt;blockquote>
&lt;p>&lt;strong>NOTE&lt;/strong>: The 0th chunk is an INVALID node, the real nodes start from the 1st chunk, with the node at an absolute ID of 8 (chunk ID of 1) being the root node.&lt;/p>
&lt;/blockquote>
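&lt;p>The ID arithmetic and the cache-line budget can be sanity-checked from the constants above (a sketch based on this description, not code from the HHDS source):&lt;/p>

```python
# Constants taken from the description above (not from the HHDS source).
CHUNK_BITS = 43    # bits in a chunk ID
CHUNK_SHIFT = 3    # log2(8 children per chunk)
SHORT_DELTA = 21   # bits in a short delta pointer

# A 46-bit absolute ID splits into a chunk ID (top 43 bits)
# and an offset within the chunk (low 3 bits).
def chunk_id(abs_id):
    return abs_id >> CHUNK_SHIFT

def chunk_offset(abs_id):
    return abs_id & ((1 << CHUNK_SHIFT) - 1)

# Root node: absolute ID 8 maps to chunk 1, offset 0 (chunk 0 is INVALID).
assert chunk_id(8) == 1 and chunk_offset(8) == 0

# The chunk's fields sum to exactly one 64-byte cache line:
# parent (46) + 4 long/sibling pointers (43 each) + 14 short deltas (21 each).
bits = (CHUNK_BITS + CHUNK_SHIFT) + 4 * CHUNK_BITS + 14 * SHORT_DELTA
assert bits == 512
```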
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Overview of the HHDS tree book-keeping" srcset="
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig2_hucb9a27b986f748027535f10fe0848fa0_79213_30c5181b8def0cc33b1b86e98f51c9db.webp 400w,
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig2_hucb9a27b986f748027535f10fe0848fa0_79213_dbc8dcac70e873bb719beedc7adf4645.webp 760w,
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig2_hucb9a27b986f748027535f10fe0848fa0_79213_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig2_hucb9a27b986f748027535f10fe0848fa0_79213_30c5181b8def0cc33b1b86e98f51c9db.webp"
width="760"
height="359"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;blockquote>
&lt;p>Refer to the next section for more information on the short delta pointers.&lt;/p>
&lt;/blockquote>
&lt;p>The chunk is 512 bits (64 bytes) long, exactly the size of a cache line. Thus the memory required per node is 512 bits in the worst case (a single node occupying the chunk) and 64 bits in the best case (all 8 slots of the chunk in use).&lt;/p>
&lt;blockquote>
&lt;p>We utilized the &lt;code>__attribute__((packed, aligned(64)))&lt;/code> attribute in C++ to ensure that each chunk aligns perfectly with a cache line. Bitfields were employed to pack the data efficiently within the chunk.&lt;/p>
&lt;/blockquote>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="k">class&lt;/span> &lt;span class="nf">__attribute__&lt;/span>&lt;span class="p">((&lt;/span>&lt;span class="n">packed&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">aligned&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="mi">64&lt;/span>&lt;span class="p">)))&lt;/span> &lt;span class="n">Tree_pointers&lt;/span> &lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">private&lt;/span>&lt;span class="o">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">// We only store the exact ID of parent
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nl">parent&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">CHUNK_BITS&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">CHUNK_SHIFT&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nl">next_sibling&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">CHUNK_BITS&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nl">prev_sibling&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">CHUNK_BITS&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">// Long child pointers
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nl">first_child_l&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">CHUNK_BITS&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nl">last_child_l&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">CHUNK_BITS&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">// Short (delta) child pointers
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="c1">// You cannot make an array of bitfields inside a packed
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="c1">// struct, since the compiler will align each bitfield to the
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="c1">// size of the nearest power of two.
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">first_child_s_0&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">first_child_s_1&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">first_child_s_2&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">first_child_s_3&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">first_child_s_4&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">first_child_s_5&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">first_child_s_6&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">last_child_s_0&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">last_child_s_1&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">last_child_s_2&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">last_child_s_3&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">last_child_s_4&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">last_child_s_5&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Short_delta&lt;/span> &lt;span class="nl">last_child_s_6&lt;/span> &lt;span class="p">:&lt;/span> &lt;span class="n">SHORT_DELTA&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="build-append---short-delta-heuristic">Build Append - Short Delta Heuristic&lt;/h3>
&lt;p>Empirical observations show that children are often added to a node shortly after the parent, meaning they are stored close to the parent in memory. This allows children to be stored as a delta from the parent, reducing the need for full chunk IDs.&lt;/p>
&lt;p>When adding a child:&lt;/p>
&lt;ul>
&lt;li>Attempt to store the child as a delta from the parent.&lt;/li>
&lt;li>If not feasible, allocate a new chunk for the parent and store the pointer to the child chunk in the newly created parent chunk.&lt;/li>
&lt;/ul>
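&lt;p>The decision can be sketched as follows; the signed-delta convention and the helper name are assumptions based on this description, not the actual implementation:&lt;/p>

```python
SHORT_DELTA = 21  # width of a short delta pointer, per the description

# Deltas are stored relative to the parent's chunk; assume a signed range.
DELTA_MIN = -(1 << (SHORT_DELTA - 1))
DELTA_MAX = (1 << (SHORT_DELTA - 1)) - 1

def child_pointer(parent_chunk, child_chunk):
    """Return ('short', delta) when the child chunk fits in a short delta
    from the parent chunk, else fall back to ('long', child_chunk)."""
    delta = child_chunk - parent_chunk
    if DELTA_MIN <= delta <= DELTA_MAX:
        return ("short", delta)
    return ("long", child_chunk)
```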
&lt;p>Implementing chunk breaking required careful handling to ensure that when a parent moves to a new chunk, its new chunk can still be referenced efficiently by its parent, potentially requiring recursive adjustments.&lt;/p>
&lt;blockquote>
&lt;p>This is because the grandparent might not be able to store the parent as a delta from itself after the parent moves to a new chunk.&lt;/p>
&lt;/blockquote>
&lt;h2 id="compliance-with-the-livehd-core-repository">Compliance with the LiveHD core repository&lt;/h2>
&lt;p>Since the HHDS tree is an evolution of the LHTree, it was crucial to maintain compatibility with the LiveHD core repository. All necessary methods were implemented in the HHDS tree to ensure seamless integration. Naming conventions and syntax were kept consistent with the LHTree to facilitate a smooth transition.&lt;/p>
&lt;p>Exposed methods in the HHDS tree are:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-cpp" data-lang="cpp">&lt;span class="line">&lt;span class="cl">&lt;span class="cm">/**
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cm"> * Query based API (no updates)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cm"> */&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">get_parent&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">curr_index&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">get_last_child&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">parent_index&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">get_first_child&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">parent_index&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kt">bool&lt;/span> &lt;span class="nf">is_last_child&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">self_index&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kt">bool&lt;/span> &lt;span class="nf">is_first_child&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">self_index&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">get_sibling_next&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">sibling_id&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">get_sibling_prev&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">sibling_id&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kt">bool&lt;/span> &lt;span class="nf">is_leaf&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">leaf_index&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="k">const&lt;/span>&lt;span class="p">;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cm">/**
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cm"> * Update based API (Adds and Deletes from the tree)
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="cm"> */&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">// FREQUENT UPDATES
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">append_sibling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">sibling_id&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="k">const&lt;/span> &lt;span class="n">X&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">add_child&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">parent_index&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="k">const&lt;/span> &lt;span class="n">X&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">add_root&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">X&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kt">void&lt;/span> &lt;span class="nf">delete_leaf&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">leaf_index&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="kt">void&lt;/span> &lt;span class="nf">delete_subtree&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">subtree_root&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1">// INFREQUENT UPDATES
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1">&lt;/span> &lt;span class="n">Tree_pos&lt;/span> &lt;span class="nf">insert_next_sibling&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="k">const&lt;/span> &lt;span class="n">Tree_pos&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">sibling_id&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">const&lt;/span> &lt;span class="n">X&lt;/span>&lt;span class="o">&amp;amp;&lt;/span> &lt;span class="n">data&lt;/span>&lt;span class="p">);&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h1 id="benchmarking-results">Benchmarking Results&lt;/h1>
&lt;p>Preliminary benchmarks indicate that the HHDS tree outperforms the LHTree in memory consumption and, in certain cases (detailed in a later section), in runtime as well. Across a variety of tests it offers a more optimized structure for handling Abstract Syntax Tree (AST) operations.&lt;/p>
&lt;p>I constructed identical trees using both the LHTree and HHDS tree structures and executed a series of queries on each. The benchmarks were performed using Google Benchmark to ensure accurate and consistent results. Below, I detail the specific tests conducted.&lt;/p>
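&lt;p>The measurement pattern can be sketched roughly as follows. This is a simplified &lt;code>std::chrono&lt;/code> stand-in for the actual Google Benchmark harness (which is not shown here); &lt;code>mean_ns&lt;/code> and the workload lambda are illustrative names, not part of either tree's API:&lt;/p>

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for a benchmark loop: run the workload
// `iterations` times and return the mean wall-clock time in ns.
// Google Benchmark does this (and more: warmup, iteration-count
// selection, statistics) automatically.
inline int64_t mean_ns(const std::function<void()>& workload, int iterations) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        workload();  // e.g. build a deep/wide/chip-typical tree
    }
    auto stop = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return total.count() / iterations;
}
```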
&lt;h3 id="benchmark-tests-overview">Benchmark Tests Overview&lt;/h3>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Deep Tree Test&lt;/strong>&lt;br>
This test simulates a line graph by repeatedly adding a child to the last node in the tree. It is designed to assess the tree&amp;rsquo;s performance when handling deep structures, where each node has a single child.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Wide Tree Test&lt;/strong>&lt;br>
In this scenario, a single root node is created, followed by the addition of numerous child nodes directly under the root. This test evaluates the tree&amp;rsquo;s efficiency in managing wide structures with many immediate children.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chip-Typical Tree Test&lt;/strong>&lt;br>
This test models a tree shape common in hardware design. Each node receives a random number of children (between 1 and 7), and the process is applied recursively to the new leaf nodes up to a fixed depth. This test measures the tree&amp;rsquo;s performance under realistic, varied conditions.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chip-Typical (Long) Tree Test&lt;/strong>&lt;br>
Similar to the Chip-Typical Tree Test, but with a broader range of children per node (1 to 20). This test is particularly useful for examining performance when the tree is more complex and chunk splitting is more likely.&lt;/p>
&lt;/li>
&lt;/ol>
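&lt;p>For illustration, the three shapes can be generated along these lines. &lt;code>ToyTree&lt;/code> is a minimal hypothetical stand-in (the real benchmarks call the HHDS and LHTree APIs such as &lt;code>add_root&lt;/code>/&lt;code>add_child&lt;/code>), and the depth/fan-out parameters are placeholders:&lt;/p>

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Minimal stand-in tree: each node stores only its parent's index.
struct ToyTree {
    std::vector<std::size_t> parent;
    std::size_t add_root() { parent.push_back(0); return 0; }
    std::size_t add_child(std::size_t p) { parent.push_back(p); return parent.size() - 1; }
};

// Deep tree: a line graph, each new node a child of the previous one.
ToyTree make_deep(std::size_t n) {
    ToyTree t;
    std::size_t cur = t.add_root();
    for (std::size_t i = 0; i < n; ++i) cur = t.add_child(cur);
    return t;
}

// Wide tree: n children hanging directly off the root.
ToyTree make_wide(std::size_t n) {
    ToyTree t;
    std::size_t root = t.add_root();
    for (std::size_t i = 0; i < n; ++i) t.add_child(root);
    return t;
}

// Chip-typical tree: each node gets a random number of children
// (1..max_kids), applied recursively down to the given depth.
void grow(ToyTree& t, std::size_t node, int depth, int max_kids, std::mt19937& rng) {
    if (depth == 0) return;
    std::uniform_int_distribution<int> kids(1, max_kids);
    int k = kids(rng);
    for (int i = 0; i < k; ++i) grow(t, t.add_child(node), depth - 1, max_kids, rng);
}

ToyTree make_chip_typical(int depth, int max_kids, unsigned seed) {
    ToyTree t;
    std::mt19937 rng(seed);
    grow(t, t.add_root(), depth, max_kids, rng);
    return t;
}
```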
&lt;p>Together, these tests give a broad picture of the HHDS tree&amp;rsquo;s capabilities; as the results below show, its clearest advantage over the LHTree is on deep trees.&lt;/p>
&lt;h2 id="addappend-benchmarks">Add/Append Benchmarks&lt;/h2>
&lt;h3 id="deep-tree-test">Deep Tree Test&lt;/h3>
&lt;blockquote>
&lt;p>&lt;code>test_deep_tree_100_hhds&lt;/code> indicates the time taken to run a benchmark on a deep tree of 100 nodes using the HHDS tree structure. This nomenclature is consistent across all tests.&lt;/p>
&lt;/blockquote>
&lt;h4 id="disabled-compiler-optimizations">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_hhds 11704 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_lh 19541 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_hhds 85317 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_lh 163058 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_hhds 760260 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_lh 1442391 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_hhds 9889199 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_lh 16215232 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_hhds 84650074 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_lh 163255882 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_hhds 877646208 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_lh 1659725904 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_hhds 9256118059 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_lh 1.4431e+10 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_hhds 1443 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_lh 1462 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_hhds 7398 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_lh 17455 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_hhds 79544 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_lh 165656 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_hhds 1337406 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_lh 1494153 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_hhds 12288324 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_lh 14897463 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_hhds 116810846 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_lh 188815892 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_hhds 2338596582 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_lh 2238844395 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Here the HHDS tree outperforms the LHTree at every size without optimizations, and at all but the largest size (10M nodes) with optimizations enabled, showing its efficiency on deep tree structures.&lt;/p>
&lt;h3 id="wide-tree-test">Wide Tree Test&lt;/h3>
&lt;h4 id="disabled-compiler-optimizations-1">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_hhds 6581 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_lh 6235 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_hhds 34911 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_lh 35734 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_hhds 323228 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_lh 312755 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_hhds 3547963 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_lh 2975894 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_hhds 33800125 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_lh 32538424 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_hhds 332509041 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_lh 336261868 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_hhds 3527352810 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_lh 8774024963 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations-1">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_hhds 837 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_lh 512 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_hhds 3394 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_lh 2675 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_hhds 26019 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_lh 20141 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_hhds 319068 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_lh 245964 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_hhds 3369183 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_lh 2910862 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_hhds 39243340 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_lh 26777306 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_hhds 454508781 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_lh 331688046 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In the Wide Tree Test without compiler optimizations, the two structures trade places: the LHTree is slightly faster at most small and medium sizes, while the HHDS tree pulls ahead at the largest sizes (roughly 2.5x faster at 10M nodes). With optimizations enabled, however, the LHTree is faster across the board.&lt;/p>
&lt;blockquote>
&lt;p>The HHDS tree&amp;rsquo;s advantage likely comes from its large chunk size, which improves cache utilization and reduces per-node memory overhead. The LHTree, on the other hand, has seen years of use and tuning, which could explain why it benefits more from compiler optimizations. With further optimization, the HHDS tree could match or exceed the LHTree&amp;rsquo;s performance.&lt;/p>
&lt;/blockquote>
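&lt;p>A toy illustration of the chunking idea (not the actual HHDS layout; &lt;code>ChunkedSiblings&lt;/code> and the chunk size of 8 are assumptions for the sketch): storing siblings in contiguous fixed-size chunks lets a traversal walk an array instead of chasing one heap pointer per node, which is where the cache benefit comes from.&lt;/p>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunked sibling storage: siblings live in fixed-size
// chunks, so iterating over them is mostly a linear array scan.
constexpr std::size_t CHUNK = 8;  // assumed chunk size, for illustration only

struct SiblingChunk {
    std::array<int, CHUNK> data{};
    std::size_t used = 0;
};

struct ChunkedSiblings {
    std::vector<SiblingChunk> chunks;

    // Append a sibling, opening a fresh chunk when the last one is full
    // (this is the "chunk splitting" the Chip-Typical (Long) test stresses).
    void push(int v) {
        if (chunks.empty() || chunks.back().used == CHUNK) chunks.emplace_back();
        chunks.back().data[chunks.back().used++] = v;
    }

    // Cache-friendly traversal: contiguous within each chunk.
    long sum() const {
        long s = 0;
        for (const auto& c : chunks)
            for (std::size_t i = 0; i < c.used; ++i) s += c.data[i];
        return s;
    }
};
```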
&lt;h3 id="chip-typical-tree-test">Chip Typical Tree Test&lt;/h3>
&lt;h4 id="disabled-compiler-optimizations-2">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">--------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_hhds 7109 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_lh 6803 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_hhds 22728 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_lh 22064 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_hhds 75398 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_lh 70910 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_hhds 270062 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_lh 254423 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_hhds 1110254 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_lh 1074439 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_hhds 5024264 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_lh 3900709 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_hhds/iterations:5 13290739 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_lh/iterations:5 22145462 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_hhds/iterations:5 83438683 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_lh/iterations:5 105475664 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations-2">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">--------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_hhds 938 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_lh 387 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_hhds 1877 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_lh 1351 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_hhds 7095 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_lh 5052 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_hhds 35019 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_lh 21569 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_hhds 130915 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_lh 78010 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_hhds 522385 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_lh 278223 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_hhds/iterations:5 4015636 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_lh/iterations:5 1648426 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_hhds/iterations:5 9873724 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_lh/iterations:5 4607773 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For the Chip-Typical Tree Test without optimizations, the HHDS tree is faster at the larger sizes (7 and 8), while the LHTree wins at the smaller ones. With compiler optimizations enabled, the LHTree is faster at every size.&lt;/p>
&lt;h3 id="chip-typical-long-tree-test">Chip Typical (long) Tree test&lt;/h3>
&lt;h4 id="disabled-compiler-optimizations-3">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">-------------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">-------------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_hhds 8875 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_lh 8479 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_hhds 62490 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_lh 64620 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_hhds 625064 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_lh 654787 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_hhds 6128047 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_lh 6528778 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_hhds 71345448 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_lh 77170587 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_hhds/iterations:5 656595039 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_lh/iterations:5 860193491 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations-3">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">-------------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">-------------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_hhds 1139 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_lh 692 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_hhds 8666 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_lh 5238 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_hhds 90856 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_lh 48758 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_hhds 1034346 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_lh 472964 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_hhds 13040238 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_lh 5025192 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_hhds/iterations:3 131143411 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_lh/iterations:3 68739573 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>As in the previous case, the HHDS tree is faster in debug builds (without compiler optimizations), while the LHTree pulls ahead once optimizations are enabled.&lt;/p>
&lt;blockquote>
&lt;p>Overall, the HHDS tree performs better without compiler optimizations, while the LHTree performs better with them; the Deep Tree test is the exception, where the HHDS tree comes out ahead regardless. This points to a genuine trade-off between the two structures. To investigate this behaviour further I conducted some profiling, covered in a later section.&lt;/p>
&lt;/blockquote>
&lt;h2 id="iterators-benchmarks">Iterators Benchmarks&lt;/h2>
&lt;h3 id="deep-tree-test-1">Deep Tree test&lt;/h3>
&lt;h4 id="disabled-compiler-optimizations-4">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">--------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">-------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_hhds 884 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_lh 1356 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_hhds 7987 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_lh 11191 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_hhds 86991 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_lh 105809 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_hhds 894127 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_lh 1076983 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_hhds 7927102 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_lh 11177187 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_hhds/iterations:4 80470145 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_lh/iterations:4 145763040 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_hhds/iterations:3 1055529435 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_lh/iterations:3 995416880 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations-4">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_hhds 202 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10_lh 93.1 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_hhds 1595 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100_lh 1039 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_hhds 15663 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000_lh 11000 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_hhds 164778 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000_lh 107293 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_hhds 1615928 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_100000_lh 1260507 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_hhds 19582402 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_1000000_lh 15954697 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_hhds 214887559 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_deep_tree_10000000_lh 179118729 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="wide-tree-test-1">Wide Tree test&lt;/h3>
&lt;h4 id="disabled-compiler-optimizations-5">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">-------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">-------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_hhds 7171 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_lh 7098 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_hhds 6204 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_lh 10372 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_hhds 62762 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_lh 106132 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_hhds 622999 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_lh 1124283 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_hhds 6118490 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_lh 9550170 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_hhds/iterations:10 59438777 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_lh/iterations:10 97842431 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_hhds/iterations:7 778347697 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_lh/iterations:7 1163215808 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations-5">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_hhds 2103 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10_lh 1284 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_hhds 1563 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100_lh 632 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_hhds 15627 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000_lh 6410 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_hhds 149588 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000_lh 56030 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_hhds 1511278 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_100000_lh 563926 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_hhds 17056051 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_1000000_lh 7754815 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_hhds 143994848 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_wide_tree_10000000_lh 55040231 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="chip-typical-test">Chip typical test&lt;/h3>
&lt;h4 id="disabled-compiler-optimizations-6">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">--------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--------------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_hhds 344 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_lh 892 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_hhds 2192 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_lh 1691 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_hhds 13628 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_lh 14235 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_hhds 34049 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_lh 84096 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_hhds 206482 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_lh 203680 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_hhds 848996 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_lh 708212 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_hhds/iterations:5 3645372 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_lh/iterations:5 6657982 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_hhds/iterations:5 7375050 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_lh/iterations:5 4577351 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations-6">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">-------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">-------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_hhds 93.1 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_1_lh 50.1 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_hhds 149 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_2_lh 212 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_hhds 1166 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_3_lh 554 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_hhds 7385 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_4_lh 3138 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_hhds 54477 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_5_lh 10643 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_hhds 215050 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_6_lh 53043 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_hhds 492555 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_7_lh 577120 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_hhds 2630675 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_tree_8_lh 1278702 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="chip-typical-long-test">Chip typical (long) test&lt;/h3>
&lt;h4 id="disabled-compiler-optimizations-7">Disabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_hhds 911 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_lh 1435 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_hhds 8161 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_lh 8619 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_hhds 76618 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_lh 132467 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_hhds 1644808 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_lh 1962406 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_hhds 7199648 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_lh 9195894 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_hhds 169002499 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_lh 207296570 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="enabled-compiler-optimizations-7">Enabled compiler optimizations&lt;/h4>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Benchmark Time
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">------------------------------------------------
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_hhds 223 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_1_lh 101 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_hhds 2270 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_2_lh 719 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_hhds 38291 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_3_lh 12547 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_hhds 294222 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_4_lh 187010 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_hhds 4721230 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_5_lh 835256 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_hhds 30302468 ns
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">test_chip_typical_long_tree_6_lh 10057136 ns
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Overall, both the add/append and iterator-related benchmarks show an improvement in performance. Without compiler optimizations, the HHDS tree generally performs better than the LH tree; with optimizations enabled, similar differences show up in the traversal benchmarks. We will now look at some profiling that was done to identify the bottlenecks in the HHDS tree.&lt;/p>
&lt;h2 id="exceptions-and-a-reminder-of-why-they-are-slow">Exceptions, and a reminder of why they are slow.&lt;/h2>
&lt;p>When looking at the performance difference between the HHDS tree and the LH tree (after enabling compiler optimizations), I was shocked to see that the HHDS tree performed worse than the LH tree by multiple orders of magnitude once exceptions were used. I had not expected exceptions to have such a large impact on performance.&lt;/p>
&lt;p>This happens because exceptions are expensive on the throwing path. When an exception is thrown, the stack is unwound and control transfers to the matching catch handler; this is a slow process and should be avoided in performance-critical code. Moreover, the compiler cannot optimize code that may throw as aggressively as code that cannot. This is why the HHDS tree performs so much worse than the LH tree when exceptions are enabled. Even so, the HHDS tree still wasn&amp;rsquo;t performing as well as it should have been.&lt;/p>
&lt;h1 id="profiling">Profiling&lt;/h1>
&lt;p>I used &lt;code>callgrind&lt;/code> to profile the HHDS tree and identify potential bottlenecks. The profiling results provided valuable insights into the tree&amp;rsquo;s performance and areas for optimization. I generated a call graph using &lt;code>KCachegrind&lt;/code> and analyzed the function calls to determine the most time-consuming operations.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Profiling results" srcset="
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig3_hubc3fa7f2ca383621c0ea38621e28abe1_254926_06163f8afdc871f89387a8c1724d9e28.webp 400w,
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig3_hubc3fa7f2ca383621c0ea38621e28abe1_254926_731e96b9cf72b9d02381dec918d2530f.webp 760w,
/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig3_hubc3fa7f2ca383621c0ea38621e28abe1_254926_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240824-ujjwalshekhar/fig3_hubc3fa7f2ca383621c0ea38621e28abe1_254926_06163f8afdc871f89387a8c1724d9e28.webp"
width="760"
height="683"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The call graph clearly shows that the bottleneck is the &lt;code>_create_space&lt;/code> call that is tasked with creating space for a new node. This function is called when a new node is added to the tree, and its performance directly impacts the tree&amp;rsquo;s efficiency.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">inline Tree_pos _create_space(const X&amp;amp; data) {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Make space for CHUNK_SIZE number of entries at the end
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> data_stack.emplace_back(data);
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> for (int i = 0; i &amp;lt; CHUNK_MASK; i++) {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> data_stack.emplace_back();
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> }
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Add the single pointer node for all CHUNK_SIZE entries
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> pointers_stack.emplace_back();
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> return pointers_stack.size() - 1;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, the &lt;code>_create_space&lt;/code> function is relatively simple and should not be causing such a significant performance hit. This suggests that the cost lies in the memory allocation itself (each call grows &lt;code>data_stack&lt;/code> by CHUNK_SIZE entries, which can trigger reallocation) or in the data structure. One possible remedy would be to increase the chunk size, or to enable dynamic chunk sizing, which would allow for more efficient memory allocation.&lt;/p>
&lt;p>Another possible bottleneck seems to be the computation needed to find the next vacant slot in a chunk (as in &lt;code>get_last_child()&lt;/code>). Because a chunk has a fixed size, when it is full the program has to search for the next chunk with space; this is a linear operation and can be slow for wide trees. To address this, I tried adding extra bookkeeping to the &lt;code>Tree_pointers&lt;/code> node structure:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">class __attribute__((packed, aligned(64))) Tree_pointers {
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">private:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // We only store the exact ID of parent
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Tree_pos parent : CHUNK_BITS + CHUNK_SHIFT;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Tree_pos next_sibling : CHUNK_BITS;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Tree_pos prev_sibling : CHUNK_BITS;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Long child pointers
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Tree_pos first_child_l : CHUNK_BITS;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Tree_pos last_child_l : CHUNK_BITS;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Storing the last occupied index in the short delta
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // This is to avoid iterating over all short deltas
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // to find the last occupied index
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> unsigned short last_occupied : CHUNK_SHIFT;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Short (delta) child pointers
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Short_delta first_child_s_0 : SHORT_DELTA;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Short_delta first_child_s_1 : SHORT_DELTA;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, the improvement in performance was marginal after making this change, which indicates that the issue may be more complex and require further investigation. This variant of the tree has also been added to the repository so that a future contributor can build on it.&lt;/p>
&lt;p>Other possible bottlenecks may come from storing the short deltas as separate bit-fields instead of reducing the size of the delta and packing all of them into a single large integer type. I plan to implement this idea in the future.&lt;/p>
&lt;h1 id="code-contributions">Code contributions&lt;/h1>
&lt;p>All of my pull requests and code changes were made on the &lt;a href="https://github.com/masc-ucsc/hhds/graphs/contributors" target="_blank" rel="noopener">HHDS repository&lt;/a>. Each contribution has undergone thorough review and has been merged into the main repository:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/masc-ucsc/hhds/pull/32" target="_blank" rel="noopener">https://github.com/masc-ucsc/hhds/pull/32&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/masc-ucsc/hhds/pull/37" target="_blank" rel="noopener">https://github.com/masc-ucsc/hhds/pull/37&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/masc-ucsc/hhds/pull/38" target="_blank" rel="noopener">https://github.com/masc-ucsc/hhds/pull/38&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/masc-ucsc/hhds/pull/41" target="_blank" rel="noopener">https://github.com/masc-ucsc/hhds/pull/41&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/masc-ucsc/hhds/pull/47" target="_blank" rel="noopener">https://github.com/masc-ucsc/hhds/pull/47&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/masc-ucsc/hhds/pull/48" target="_blank" rel="noopener">https://github.com/masc-ucsc/hhds/pull/48&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/masc-ucsc/hhds/pull/54" target="_blank" rel="noopener">https://github.com/masc-ucsc/hhds/pull/54&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Additionally, we are planning to integrate these changes into the LiveHD repository in the near future.&lt;/p>
&lt;h1 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h1>
&lt;p>Working on this project has been a valuable learning experience, particularly in applying core C++ features. I discovered that simple, fundamentally sound optimizations often outperform more complex ones. The greatest challenge was navigating the changes to our original plan of action; thanks to the support and guidance of my mentors, I was able to see it through.&lt;/p>
&lt;p>There are still areas where the HHDS tree can be improved to make it more robust. One area of future exploration is dynamic chunk sizing:&lt;/p>
&lt;blockquote>
&lt;p>Dynamic Chunk Sizing: Instead of using fixed 8-sized chunks as we did, we could implement multiple chunk sizes. This would allow users to &amp;ldquo;hint&amp;rdquo; the HHDS tree to use specific chunk types, potentially reducing memory consumption further.&lt;/p>
&lt;/blockquote>
&lt;p>Overall, the HHDS tree has shown promise in handling deep tree structures efficiently. With further optimization and enhancements, it can become a powerful tool for handling complex tree operations.&lt;/p>
&lt;h1 id="acknowledgements">Acknowledgements&lt;/h1>
&lt;p>I would like to thank my mentors, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> for their guidance and support throughout the project. It would not have been possible without their help. Their insights and mentorship have significantly contributed to my learning and the success of this work.&lt;/p></description></item><item><title>Reproducing and addressing Data Leakage issue : Duplicates in dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/</link><pubDate>Fri, 23 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/</guid><description>&lt;p>Hello!&lt;/p>
&lt;p>In this blog post, I will explore a common issue in machine learning called data leakage, using an example from the paper:&lt;/p>
&lt;blockquote>
&lt;p>Benedetti, P., Perri, D., Simonetti, M., Gervasi, O., Reali, G., Femminella, M. (2020). Skin Cancer Classification Using Inception Network and Transfer Learning. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science(), vol 12249. Springer, Cham. &lt;a href="https://doi.org/10.1007/978-3-030-58799-4_39" target="_blank" rel="noopener">https://doi.org/10.1007/978-3-030-58799-4_39&lt;/a> &lt;a href="https://arxiv.org/pdf/2111.02402v1" target="_blank" rel="noopener">arXiv&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h1 id="overview-of-the-paper">Overview of the Paper&lt;/h1>
&lt;p>In this paper, the authors use transfer learning on a pretrained convolutional neural network (CNN) to classify skin lesions in dermatoscopic images from the HAM10000 (Human Against Machine with 10,000 training images) dataset. The paper reports a final accuracy of 78.9% on the validation set.&lt;/p>
&lt;p>While this reported result appears to be impressive, there are concerns regarding the validity of this performance metric due to data leakage. Data leakage occurs when the model is trained or evaluated on data that it would not have access to during real-world deployment, leading to an overestimation of the model&amp;rsquo;s true performance.&lt;/p>
&lt;h1 id="identifying-data-leakage-in-the-original-paper">Identifying Data Leakage in the Original Paper&lt;/h1>
&lt;p>Upon closer inspection, it appears that the original experiment suffers from data leakage in two significant ways:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Duplicate Images in Training and Validation Sets:&lt;/p>
&lt;p>The HAM10000 dataset contains near-duplicate images of the same lesions in both the training and validation sets. This results in the model seeing very similar images during training and then again during validation. Consequently, the model&amp;rsquo;s performance is artificially inflated because it has already been &amp;ldquo;trained&amp;rdquo; on images similar to those in the validation set, making the task easier than it should be.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Lesions" srcset="
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_dd4057c4bb0e4dc6092a43699881c4f4.webp 400w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_95478285e99e3f18a8b724b5a0a3dbb5.webp 760w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_dd4057c4bb0e4dc6092a43699881c4f4.webp"
width="620"
height="104"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Lesions2" srcset="
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_a0e705292db1f7f683a77cd92f29edc0.webp 400w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_fae997e88feee11018435aebd9ed6c88.webp 760w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_a0e705292db1f7f683a77cd92f29edc0.webp"
width="620"
height="104"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using the Validation Set for Early Stopping and Final Evaluation:&lt;/p>
&lt;p>Another critical issue is the use of the validation set for both early stopping and final model evaluation. Early stopping is a technique where training is halted when the model&amp;rsquo;s performance on a validation set no longer improves, preventing overfitting. However, if this same validation set is later used to evaluate the model&amp;rsquo;s final performance, it can lead to overfitting on the validation data itself, resulting in an overly optimistic estimate of model accuracy.&lt;/p>
&lt;/li>
&lt;/ol>
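&lt;p>The second issue can be avoided by keeping three disjoint sets: the validation set chooses the stopping epoch, and a held-out test set is evaluated exactly once for the reported accuracy. A minimal sketch of the protocol, with invented loss values purely for illustration:&lt;/p>

```python
def select_epoch(val_losses, patience=3):
    """Early stopping: pick the epoch with the best validation loss,
    halting once it has failed to improve for `patience` epochs."""
    best_epoch, best_loss = 0, val_losses[0]
    since_best = 0
    for epoch in range(1, len(val_losses)):
        if best_loss > val_losses[epoch]:
            best_epoch, best_loss, since_best = epoch, val_losses[epoch], 0
        else:
            since_best += 1
            if since_best == patience:
                break
    return best_epoch

# The validation set only chooses the stopping point (here: epoch 3)...
stop = select_epoch([0.9, 0.7, 0.6, 0.55, 0.56, 0.58, 0.60])
# ...while the reported accuracy must come from a separate test set,
# evaluated exactly once with the weights saved at `stop`.
```

&lt;p>The key point is that no number from the validation set is ever quoted as the final result; once the validation set steers any decision (stopping, hyperparameters), it stops being an unbiased estimate.&lt;/p>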
&lt;h1 id="our-reproduction-and-results">Our Reproduction and Results&lt;/h1>
&lt;p>To demonstrate the impact of these data leakage issues, we reproduced the experiment with corrected methodologies:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Corrected Data Split: We ensured that there were no duplicate images between the training and validation sets. This setup is crucial to simulate a realistic scenario where the model encounters completely unseen data during validation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Separate Validation and Test Sets: We introduced a distinct test set to evaluate the final model performance, independent of the data used for early stopping.&lt;/p>
&lt;/li>
&lt;/ul>
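&lt;p>Because images of the same lesion are near-duplicates, the split has to be made at the lesion level rather than the image level; the HAM10000 metadata links each image to a &lt;code>lesion_id&lt;/code>, which makes this possible. A minimal, stdlib-only sketch of such a group-aware split (the function and variable names are mine, not from the original code):&lt;/p>

```python
import random

def group_split(image_to_lesion, val_fraction=0.1, seed=0):
    """Split image ids so that all images of one lesion land in the same split,
    preventing near-duplicates from leaking across train/validation."""
    lesions = sorted(set(image_to_lesion.values()))
    rng = random.Random(seed)
    rng.shuffle(lesions)
    n_val = max(1, int(len(lesions) * val_fraction))
    val_lesions = set(lesions[:n_val])
    train, val = [], []
    for image_id, lesion_id in image_to_lesion.items():
        (val if lesion_id in val_lesions else train).append(image_id)
    return train, val
```

&lt;p>With scikit-learn available, &lt;code>GroupShuffleSplit&lt;/code> (passing the lesion ids as the &lt;code>groups&lt;/code> argument) achieves the same effect.&lt;/p>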
&lt;p>&lt;strong>Results Comparison&lt;/strong>&lt;/p>
&lt;table>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Original results&lt;/td>
&lt;td>Our results&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Accuracy
&lt;/td>
&lt;td>
78.9%
&lt;/td>
&lt;td>
78.6%
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Number of epochs
&lt;/td>
&lt;td>
Approx. 42 epochs
&lt;/td>
&lt;td>
40 epochs
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Training size
&lt;/td>
&lt;td>
Unknown
&lt;/td>
&lt;td>
7000 samples
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Validation size
&lt;/td>
&lt;td>
478 samples
&lt;/td>
&lt;td>
478 samples
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Confusion matrix
&lt;/td>
&lt;td>
&lt;img src="https://raw.githubusercontent.com/kyrillosishak/re-SkinCancer/main/assets/paper's_results.jpeg" />
&lt;/td>
&lt;td>
&lt;img src="https://raw.githubusercontent.com/kyrillosishak/re-SkinCancer/main/assets/Our_results.jpeg" />
&lt;/td>
&lt;/tr>
&lt;/table>
&lt;h1 id="analysis-of-the-results">Analysis of the Results&lt;/h1>
&lt;p>While our reproduced accuracy of 78.6% is close to the original reported accuracy, it is based on a properly separated training and validation set, avoiding the data leakage pitfalls of the original paper. The slight drop in accuracy further highlights the overestimation of the original model&amp;rsquo;s performance due to data leakage.&lt;/p>
&lt;p>Moreover, using a separate test set for final evaluation provides a more reliable measure of the model&amp;rsquo;s ability to generalize to new, unseen data. The confusion matrices show that our model&amp;rsquo;s performance is consistent across different lesion classes, confirming the robustness of the evaluation.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Data leakage is a common and often overlooked problem in applied machine learning, leading to misleading performance metrics and irreproducible results. By carefully examining and correcting these issues in our reproduction, we hope to provide a clearer understanding of the importance of proper data handling and validation practices.&lt;/p>
&lt;p>It is crucial for researchers and practitioners to be vigilant about data leakage and ensure that their models are trained, validated, and tested under realistic conditions. This not only ensures the credibility of their results but also enhances the real-world applicability of their models.&lt;/p>
&lt;p>Thank you for reading, and stay tuned for more insights on machine learning reproducibility!&lt;/p></description></item><item><title>Final blog: Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240822-architd/</link><pubDate>Thu, 22 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240822-architd/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello everyone,&lt;/p>
&lt;p>I&amp;rsquo;m Archit from India, an undergraduate student at the Indian Institute of Technology, Banaras Hindu University (IIT BHU), Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/bsc/ro-crate-compss/">Automatic Reproducibility of COMPSs Experiments through the Integration of RO-Crate in Chameleon&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1qY-uipQZPox144LD4bs05rn3islfcjky/view" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/raul-sirvent/">Raül Sirvent&lt;/a>, aims to develop a service that facilitates the automated replication of COMPSs experiments within the Chameleon infrastructure.&lt;/p>
&lt;h2 id="about-the-project">About the Project&lt;/h2>
&lt;p>The project proposes to create a service that can take a COMPSs crate (an artifact adhering to the RO-Crate specification) and, through analysis of the provided metadata, construct a Chameleon-compatible image for replicating the experiment on the testbed.&lt;/p>
&lt;h2 id="final-product">Final Product&lt;/h2>
&lt;p align="center">
&lt;img src="./logo.png" alt="Logo" style="width: 60%; height: auto;">
&lt;/p>
&lt;p>The basic workflow of the COMPSs Reproducibility Service can be explained as follows:&lt;/p>
&lt;ol>
&lt;li>The service takes the workflow path or link as the first argument from the user.&lt;/li>
&lt;li>The program shifts the execution to a separate sub-directory, &lt;code>reproducibility_service_{timestamp}&lt;/code>, to store the results from the reproducibility process.&lt;/li>
&lt;li>Two main flags are required:
&lt;ul>
&lt;li>&lt;strong>Provenance flag&lt;/strong>: If you want to generate the provenance of the workflow via the runcompss runtime.&lt;/li>
&lt;li>&lt;strong>New Dataset flag&lt;/strong>: If you want to reproduce the experiment with a new dataset instead of the one originally used.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>If there are any remote datasets, they are fetched into the sub-directory.&lt;/li>
&lt;li>The main work begins with parsing the metadata from &lt;code>ro-crate-metadata.json&lt;/code> and verifying the files present inside the dataset, as well as any files downloaded as remote datasets. This step generates a status table for the user to check if any files are missing or have modified sizes.&lt;/li>
&lt;/ol>
&lt;p align="center">
&lt;img src="./status_table.png" alt="Status Table" style="width: 70%; height: auto;">
&lt;/p>
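&lt;p>The file-verification step can be sketched as follows. An RO-Crate describes each file as an entity in the &lt;code>@graph&lt;/code> with an &lt;code>@id&lt;/code> (its path) and, optionally, a &lt;code>contentSize&lt;/code>; the function name and return shape below are illustrative, not the actual API of the service:&lt;/p>

```python
import os

def verify_crate_files(metadata, crate_root):
    """Check each File entity from a parsed ro-crate-metadata.json against
    the files on disk, returning (path, status) rows like a status table."""
    rows = []
    for entity in metadata.get("@graph", []):
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if "File" not in types:
            continue
        path = os.path.join(crate_root, entity["@id"])
        if not os.path.exists(path):
            rows.append((entity["@id"], "missing"))
        elif "contentSize" in entity and os.path.getsize(path) != int(entity["contentSize"]):
            rows.append((entity["@id"], "modified size"))
        else:
            rows.append((entity["@id"], "ok"))
    return rows
```

&lt;p>The metadata dictionary would come from &lt;code>json.load&lt;/code> on the crate&amp;rsquo;s &lt;code>ro-crate-metadata.json&lt;/code>; non-File entities such as the root Dataset are skipped.&lt;/p>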
&lt;ol start="6">
&lt;li>The final step is to transform the &lt;code>compss-command-line.txt&lt;/code> and all the paths specified inside it to match the local environment where the experiment will be reproduced. This includes:
&lt;ul>
&lt;li>Mapping the paths from the old machine to new paths inside the RO-Crate.&lt;/li>
&lt;li>Changing the runtime to &lt;code>runcompss&lt;/code> or &lt;code>enqueue_compss&lt;/code>, depending on whether the environment is a SLURM cluster.&lt;/li>
&lt;li>Detecting if the paths specified in the command line are for results, and redirecting them to new results inside the &lt;code>reproducibility_service_{timestamp}\Results&lt;/code> directory.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>After this, the service prompts the user to add any additional flags to the final command. Upon final verification, the command is executed via Python&amp;rsquo;s subprocess pipe.&lt;/li>
&lt;/ol>
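&lt;p>The path translation in step 6 is essentially a longest-prefix-first string rewrite over the stored command line. A sketch of the idea (the function name and mapping are illustrative, not the actual code of the service):&lt;/p>

```python
def remap_command(command, path_map):
    """Rewrite paths recorded on the original machine to their local equivalents.
    Longer keys are substituted first so nested prefixes are not clobbered."""
    for old, new in sorted(path_map.items(), key=lambda kv: len(kv[0]), reverse=True):
        command = command.replace(old, new)
    return command
```

&lt;p>Substituting the longest prefixes first matters: it prevents a short mapping such as the crate root from rewriting part of a more specific results path before that path gets its own mapping.&lt;/p>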
&lt;p align="center">
&lt;img src="./end.png" alt="End Image" style="width: 50%; height: auto;">
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Logging System&lt;/strong>: All logs related to the Reproducibility Service are stored inside the &lt;code>reproducibility_service_{timestamp}\log&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>You can view the basic &lt;a href="https://github.com/Minimega12121/COMPSs-Reproducibility-Service/blob/main/pseudocode.txt" target="_blank" rel="noopener">pseudocode&lt;/a> of the service.&lt;/p>
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>It&amp;rsquo;s been a long journey since I started this project, and now it&amp;rsquo;s finally coming to an end. I have learned a lot from this experience, from weekly meetings with my mentor to working towards long-term goals—it has all been thrilling. I would like to thank the OSRE community and my mentor for providing me with this learning opportunity.&lt;/p>
&lt;p>This is only version 1.0.0 of the Reproducibility Service. If I have time from my coursework, I would like to fix any bugs or improve the service further to meet user needs.&lt;/p>
&lt;p>However, the following issues still exist with the service and can be improved upon:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Third-party software dependencies&lt;/strong>: Automatic detection and loading of these dependencies on a SLURM cluster are not yet implemented. Currently, these must be handled manually by the user.&lt;/li>
&lt;li>&lt;strong>Support for workflows with &lt;code>data_persistence = False&lt;/code>&lt;/strong>: There is no support for workflows where all datasets are remote files.&lt;/li>
&lt;/ul>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://github.com/Minimega12121/COMPSs-Reproducibility-Service" target="_blank" rel="noopener">Reproducibility Service Repository&lt;/a>: This repository contains the main service along with guidelines on how to use it. The service will be integrated with the COMPSs official distribution in its next release.&lt;/li>
&lt;li>&lt;a href="https://www.chameleoncloud.org/appliances/121/" target="_blank" rel="noopener">Chameleon Appliance&lt;/a> : This is a single-node appliance with COMPSs 3.3.1 installed, so that anyone with access to Chameleon can reproduce experiments.&lt;/li>
&lt;/ul>
&lt;!-- - [Experiments Analysis](https://docs.google.com/spreadsheets/d/1W4CKqiYVPquSwXFRITbb1Hga1xcyv2_3DJIcq7JalZk/edit?gid=0#gid=0) : This report contains details of experiments I have reproduced using the Reproducibility Service on a SLURM cluster, a local machine, and a Chameleon appliance, along with observations. -->
&lt;h2 id="previous-blogs">Previous Blogs&lt;/h2>
&lt;p>Make sure to check out my other blogs to see how I started this project and the challenges I faced along the way:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240612-architd/">First blog&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240729-architd/">Mid-term blog&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Thank you for reading the blog, have a nice day!!&lt;/p></description></item><item><title>Final Blogpost: HDEval's LLM Benchmarking for HDL Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/</link><pubDate>Wed, 21 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Hello everyone! I&amp;rsquo;m Ashwin Bardhwaj, an undergraduate student studying at UC Berkeley. As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Micro Architecture Santa Cruz (MASC)&lt;/a> my &lt;a href="https://drive.google.com/file/d/1Fnr85lqrTs7OBohfHfSZI2K3wZU3zJm0/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> looks to create a suite of benchmark programs for &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a>.&lt;/p>
&lt;p>The goal of this project is to create large-scale Verilog programs in order to benchmark the capability of LLMs to develop HDL code. Over the course of the project, I created three large Verilog benchmarks: a 3-stage RISC-V processor, a GameBoy emulator, and Sorts. These benchmarks would lose their effectiveness if LLMs such as ChatGPT scraped the GitHub repositories and learned from them, so the code itself cannot be made public; instead, this post covers the test report for all three projects.&lt;/p>
&lt;h1 id="3-stage-risc-v-processor">3 Stage RISC V Processor&lt;/h1>
&lt;p>This is a pipelined RISC processor developed to handle RV32I instructions. A 3-stage processor typically contains Fetch, Decode, and Execute stages, so every instruction takes exactly 3 clock cycles. For this processor, instructions can be formatted as R, I (Load), S (Store), B (Cond), and J (Jump and Link) type instructions. Once a 32-bit instruction is fetched from the memory location specified by the pc (Program Counter) register, it is sent to the &amp;ldquo;decode unit&amp;rdquo;. By decoding an instruction, we can determine the exact operation code, the register locations of the two operands (rs1 and rs2), and the destination register (rd) at which to write the calculated result. After decoding, an activation flag is sent to the execute stage, which accesses the register file at addresses rs1 and rs2 to fetch the correct operand data. The data and operation are then sent to the ALU, which computes the result based on the opcode. The result is written back into the register file at the rd address, the program counter is incremented, and the next instruction is fetched.&lt;/p>
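The decode step described above can be illustrated in Python (the actual testbench is in Verilog; this sketch only shows the standard RV32I bit-field layout, and the sample encoding is an ordinary ADD instruction, not taken from the project's code):

```python
def decode_rv32i(instr):
    """Split a 32-bit RV32I instruction word into its fields."""
    return {
        "opcode": instr & 0x7F,          # bits 6:0
        "rd":     (instr >> 7)  & 0x1F,  # bits 11:7
        "funct3": (instr >> 12) & 0x07,  # bits 14:12
        "rs1":    (instr >> 15) & 0x1F,  # bits 19:15
        "rs2":    (instr >> 20) & 0x1F,  # bits 24:20
        "funct7": (instr >> 25) & 0x7F,  # bits 31:25
    }

# ADD x3, x1, x2 encodes to 0x002081B3 (R-type, opcode 0110011)
fields = decode_rv32i(0x002081B3)
```

In hardware this is just wire slicing; the dictionary merely names the slices the decode unit hands to the execute stage.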
&lt;p>The prompts for each module in this processor have been generated and tested against the GPT 3 Turbo and GPT 4o models as examples. In the RISC V tab of my test report, I have provided the exact prompts and results after running on MASC&amp;rsquo;s &lt;a href="https://github.com/masc-ucsc/hdlagent" target="_blank" rel="noopener">HDLAgent&lt;/a> tool, which can access the APIs of many LLMs.&lt;/p>
&lt;h1 id="gameboy-emulator">Gameboy Emulator&lt;/h1>
&lt;p>The Gameboy Emulator is a Verilog implementation of the classic GameBoy console that was widely popular in the 1990s. The main aspects of the GameBoy covered in this project were the Z-80-like CPU, memory objects like RAM, VRAM, and ROM, the PPU (Picture Processing Unit), and other peripherals. Instructions are given to the CISC (variable-length instruction) CPU, where they are decoded and executed based on the details and expectations of each specific instruction. In some cases timing becomes a concern, and significant effort was made to ensure that instructions can be parsed and run predictably and effectively. Instructions from the ROM may take between 1 and 4 clock cycles to run depending on their requirements. For example, the instruction &amp;ldquo;LD B, HL&amp;rdquo;, which loads the data found at the 16-bit address given by registers H and L into register B, is a 2-cycle instruction. The first cycle decodes the HL address and fetches the data at that location, while the second cycle takes the new input data and writes it into register B. This requires accurate timing control between different aspects of the GameBoy.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Gameboy Emulator Top Level Wave File" srcset="
/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_1c31333f2eab882478c68b3e4fe07ef4.webp 400w,
/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_afc571aac140f2cd4e9e117826b4bf3a.webp 760w,
/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_1c31333f2eab882478c68b3e4fe07ef4.webp"
width="760"
height="402"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The Picture Processing Unit is also an integral feature of the GameBoy. Three layers called Background, Window, and Sprite are combined into the classic GameBoy screens we know today. While the Background and Window data are consistently fetched from VRAM at certain clock-cycle intervals, the Sprite data and sprite attributes are accessed using DMA (Direct Memory Access) from OAM (Object Attribute Memory). This reduces the CPU load and speeds up access to sprite data.&lt;/p>
&lt;h1 id="deliverables">Deliverables&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>HDEval Test Report&lt;/strong>: The &lt;a href="https://docs.google.com/spreadsheets/d/1vDh_k75h0sG8JGRDDZcdBM4AprVcw9l1/edit?usp=sharing&amp;amp;ouid=102173779464961795129&amp;amp;rtpof=true&amp;amp;sd=true" target="_blank" rel="noopener">HDEval Test Report&lt;/a> contains the module prompts for each testbench, the results after testing on GPT 3 turbo and 4o, and test cases to ensure code correctness and reliability.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HDEval Repo&lt;/strong>: &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a> contains the encrypted version of the yaml files that encapsulate the code, prompts, and additional data.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>Given these benchmarks, it is important to track the ability of LLMs to generate HDL code. Beyond GPT 3 Turbo and 4o, I would like these benchmarks to be applied to more models so that we can track their growth and stay informed on their effectiveness in HDL and hardware design.&lt;/p>
&lt;h1 id="previous-blogs">Previous Blogs&lt;/h1>
&lt;p>Please feel free to check out my previous blogs!&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240611-ashwinbardhwaj/">First Blog&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240718-ashwinbardhwaj/">Midterm Blog&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Thank you for reading!&lt;/p></description></item><item><title>Deriving Realistic Performance Benchmarks for Python Interpreters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/</link><pubDate>Sat, 17 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/</guid><description>&lt;p>Hi, I am Mrigank. I am one of the &lt;em>Summer of Reproducibility&lt;/em> fellows for 2024, and I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uutah/static-python-perf/">deriving realistic performance benchmarks for Python interpreters&lt;/a> with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a> from the University of Utah.&lt;/p>
&lt;h2 id="background-and-motivation">Background and Motivation&lt;/h2>
&lt;p>Recent work by Meta on a statically typed variant of Python, Static Python, has shown immense promise for moving towards gradually typed languages that do not compromise on performance, at the cost of complete soundness. Lu et al.&lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup> provide an evaluation of Static Python and conclude that the performance enhancement Meta reported on their web servers for Instagram is reasonable and not just the result of refactoring. In fact, the study notes that very little refactoring is typically required to convert existing Python programs to Static Python. However, the study depends on a limited model of the language and does not represent real-world software applications.&lt;/p>
&lt;p>In our project, we aim to create a realistic performance benchmark to reproduce performance improvements reported by Meta and to evaluate the performance of Static Python in real-world software applications. In addition, we will analyze partially-typed code to understand the performance implications of gradual typing in Python.&lt;/p>
&lt;h2 id="key-objectives">Key Objectives&lt;/h2>
&lt;p>We will use widely-used open-sourced applications to derive realistic performance benchmarks for evaluating Static Python. In particular, we will focus on projects that utilize the Python framework &lt;a href="https://www.djangoproject.com/" target="_blank" rel="noopener">Django&lt;/a>, which is also known to power the backend of Instagram. We plan to begin with &lt;a href="https://github.com/wagtail/wagtail" target="_blank" rel="noopener">Wagtail&lt;/a>, a popular CMS built on Django. We have also identified other potential projects like &lt;a href="https://github.com/zulip/zulip" target="_blank" rel="noopener">Zulip&lt;/a>, &lt;a href="https://github.com/makeplane/plane" target="_blank" rel="noopener">Plane&lt;/a> and &lt;a href="https://github.com/LibrePhotos/librephotos" target="_blank" rel="noopener">LibrePhotos&lt;/a>. These are all actively maintained projects with significantly large codebases.&lt;/p>
&lt;p>Further, we will analyze the performance of partially-typed code. This will be of value to the Python community as it will provide confidence in gradually moving towards Static Python for improving performance. We will make our benchmarks publicly available for the community to use, reproduce, and extend.&lt;/p>
&lt;h2 id="methodology">Methodology&lt;/h2>
&lt;h3 id="load-testing">Load Testing&lt;/h3>
&lt;p>For each project that we derive benchmarks from, we will design user pipelines that simulate real-world usage and implement them to create load tests using the open-source &lt;a href="https://github.com/locustio/locust" target="_blank" rel="noopener">Locust&lt;/a> framework. This will allow us to evaluate the performance of Static Python under real-world loads and scenarios. Locust can spawn thousands of users, each of which independently bombards the system with HTTP requests for a range of tasks that are defined in their user pipeline. We will host each project on a server (local or cloud) to run these load tests.&lt;/p>
&lt;p>We will profile each project to ensure that our tests cover different parts of the codebase and to identify performance bottlenecks. We can then focus on these bottlenecks while gradually typing the codebase.&lt;/p>
&lt;h3 id="gradual-typing">Gradual Typing&lt;/h3>
&lt;p>For typing the code in these projects, we will create two typed versions of each project: one with so-called &amp;ldquo;shallow&amp;rdquo; type annotations and another with &amp;ldquo;advanced&amp;rdquo; type annotations. The former is relatively easy to implement, and we can use tools like &lt;a href="https://github.com/Instagram/MonkeyType" target="_blank" rel="noopener">MonkeyType&lt;/a> to generate stubs that can be quickly verified manually. The latter is quite non-trivial and will require manual effort. Together with the original untyped code, this gives three versions of each project, which we will mix-and-match to create different combinations of typed and untyped code. Note that this mix-and-match can be done at the module level as well as at the function or class level.&lt;/p>
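The mix-and-match over versions can be enumerated mechanically; a minimal sketch at the module level, with hypothetical module names:

```python
# Sketch: enumerate typed/untyped configurations for a set of modules.
# Each module can be untyped, shallow-typed, or advanced-typed,
# giving 3**n configurations for n modules.
from itertools import product

VERSIONS = ("untyped", "shallow", "advanced")
modules = ["db.backends", "utils.functional"]  # illustrative names

configs = [dict(zip(modules, choice))
           for choice in product(VERSIONS, repeat=len(modules))]
```

With two modules this yields nine configurations to benchmark; the same idea applies at function or class granularity, just with a longer list.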
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This is my first time working on performance-benchmarking and I am excited to pick up new skills in the process. I am also looking forward to interacting with people from the Python community, people from Meta&amp;rsquo;s Static Python team, and also with the maintainers of the projects we will be working on. I will be posting more updates on this project as we make progress. Stay tuned!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>Kuang-Chen Lu, Ben Greenman, Carl Meyer, Dino Viehland, Aniket Panse, and Shriram Krishnamurthi. Gradual soundness: Lessons from static python. &lt;em>The Art, Science, and Engineering of Programming&lt;/em>.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Midterm Report: Deriving Realistic Performance Benchmarks for Python Interpreters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240909-mrigankpawagi/</link><pubDate>Sat, 17 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240909-mrigankpawagi/</guid><description>&lt;p>Hi, I am Mrigank. As a &lt;em>Summer of Reproducibility 2024&lt;/em> fellow, I am working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uutah/static-python-perf/20240817-mrigankpawagi/">deriving realistic performance benchmarks for Python interpreters&lt;/a> with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a> from the University of Utah. In this post, I will provide an update on the progress we have made so far.&lt;/p>
&lt;h2 id="creating-a-performance-benchmark">Creating a Performance Benchmark&lt;/h2>
&lt;p>We are currently focusing on applications built on top of Django, a widely used Python web framework. For our first benchmark, we chose &lt;a href="https://github.com/wagtail/wagtail" target="_blank" rel="noopener">Wagtail&lt;/a>, a popular content management system. We created a pipeline with Locust to simulate real-world load on the application. All of our work is open-sourced and available on our &lt;a href="https://github.com/utahplt/static-python-perf/blob/main/Benchmark/wagtail/locustfile.py" target="_blank" rel="noopener">GitHub repository&lt;/a>.&lt;/p>
&lt;p>This load-testing pipeline creates hundreds of users who independently create many blog posts on a Wagtail blog site. At the same time, thousands of users are spawned to view these blog posts. Wagtail does not have a built-in API and so it took some initial effort to figure out the endpoints to hit, which I did by inspecting the network logs in the browser while interacting with the Wagtail admin interface.&lt;/p>
&lt;p>A snapshot from a run of the load test with Locust is shown in the featured image above. This snapshot was generated by spawning users from 24 different parallel locust processes. This was done on a local server, and we plan to perform the same experiments on CloudLab soon.&lt;/p>
&lt;h2 id="profiling">Profiling&lt;/h2>
&lt;p>On running the load tests with a profiler, we found that the bottlenecks in the performance arose not from the Wagtail codebase but from the Django codebase. In particular, we identified three modules in Django that consumed the most time during the load tests: &lt;code>django.db.backends.sqlite3._functions&lt;/code>, &lt;code>django.utils.functional&lt;/code>, and &lt;code>django.views.debug&lt;/code>. &lt;a href="https://github.com/dibrinsofor" target="_blank" rel="noopener">Dibri&lt;/a>, a graduate student in Ben&amp;rsquo;s lab, is helping us add types to these modules.&lt;/p>
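A minimal sketch of this style of profiling with Python's built-in cProfile, using a placeholder `handle_request` to stand in for the real server-side work:

```python
# Sketch: profile repeated request handling, then report the functions
# with the highest cumulative time (these point at bottleneck modules).
import cProfile, io, pstats

def handle_request():
    # placeholder for the work done while serving load-test traffic
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    handle_request()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()  # top 5 entries by cumulative time
```

In our actual runs the same ranking is what surfaced the three Django modules above.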
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>Based on these findings, we are now working on typing these modules to see if we can improve the performance of the application by using Static Python. Typing Django is a non-trivial task, and while there have been some efforts to do so, previous attempts like &lt;a href="https://github.com/typeddjango/django-stubs" target="_blank" rel="noopener">django-stubs&lt;/a> are incomplete for our purpose.&lt;/p>
&lt;p>We are also writing scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python file, and run each mixed version several times to obtain a narrow confidence interval for the performance of each version.&lt;/p>
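The confidence-interval step might look like the following sketch, using a normal approximation over repeated runs (the throughput numbers are made up):

```python
# Sketch: 95% confidence interval for mean throughput across runs.
import math
import statistics

runs = [412.0, 398.5, 405.2, 410.8, 401.3, 407.9]  # requests/sec per run

mean = statistics.mean(runs)
sem = statistics.stdev(runs) / math.sqrt(len(runs))  # standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem     # normal-approx 95% CI
```

Adding more repetitions shrinks `sem`, which is how repeated runs narrow the interval for each mixed version.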
&lt;p>We will be posting more updates as we make progress. Thank you for reading!&lt;/p></description></item><item><title>Final Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240816-jaycezhu/</link><pubDate>Fri, 16 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240816-jaycezhu/</guid><description>&lt;h2 id="background">Background&lt;/h2>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>, a 2024 SoR contributor to the FEP-Bench project, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a>. Before we start, let&amp;rsquo;s recap the goal of our project and our progress up to the midterm. The FEP-Bench project addresses significant bottlenecks encountered during the feature engineering and preprocessing phase, particularly the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. To tackle these challenges, we collected basic information on various common datasets for different machine learning tasks, along with their corresponding preprocessing pipelines.&lt;/p>
&lt;h2 id="methodology">Methodology&lt;/h2>
&lt;p>Since our goal is to improve the efficiency of the machine learning preprocessing pipeline and keep the training process of the deep learning model busy, we need to increase the preprocessing throughput: the feed rate from the preprocessing stage to the training stage. Following some previous work, we take a new view of deep learning preprocessing pipelines. A pipeline can be split into two parts. The first part contains steps that are run once (S1-Sm); we call it the &amp;ldquo;offline&amp;rdquo; part. The second part includes all remaining steps, which are run at every iteration of training; we call it the &amp;ldquo;online&amp;rdquo; part. After the offline preprocessing steps, the output data is written back to disk, and the online preprocessing steps then load that data from storage before performing the remaining operations. We can split the pipeline at any step, and each split is a preprocessing strategy. Certain strategies achieve a much higher final preprocessing throughput than others. Our project adopts this method to profile the performance of different strategies, with the goal of maximizing the final preprocessing throughput into training for a given pipeline. We want this to be an automatic process rather than one that asks for extra user instructions or parameters.&lt;/p>
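The offline/online split can be illustrated with a toy cost model. The step names and per-step costs below are invented, and in practice random augmentation steps must stay online, so later split points are not always valid; this sketch only shows the accounting:

```python
# Sketch: cost of each split point. Steps before the split run once
# ("offline"); steps after it run at every epoch ("online").
steps = [("decode", 5.0), ("normalize", 1.0), ("augment", 2.0)]  # cost per pass
epochs = 10

def total_cost(split, steps, epochs):
    offline = sum(cost for _, cost in steps[:split])          # paid once
    online = sum(cost for _, cost in steps[split:]) * epochs  # paid per epoch
    return offline + online

costs = {split: total_cost(split, steps, epochs) for split in range(len(steps) + 1)}
best_split = min(costs, key=costs.get)
```

Here split 0 is the fully online baseline; moving the expensive decode step offline cuts most of the repeated cost, which mirrors what we observe in the experiments below.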
&lt;h2 id="experiment">Experiment&lt;/h2>
&lt;p>Next, we ran the data preprocessing strategy experiment on the LibriSpeech dataset, an audio dataset for ML tasks like Automatic Speech Recognition. The dataset is 6.3 GB with almost 30,000 samples, and each audio file is in the binary FLAC format. As a result, the first step of our preprocessing pipeline is decoding, which converts the binary data into arrays of floats. We then applied some typical audio transformation steps (normalization, padding, extracting the loudest section) and augmentation steps (random cut, random audio shift, random mask, random added noise) to the audio data. Finally, the audio data is converted to a Log-Mel Spectrogram, which is commonly used in audio tasks like speech recognition and speaker identification.&lt;/p>
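A few of these transforms can be sketched in NumPy on a fake decoded waveform (the lengths and parameters are illustrative, not the settings used in our experiments):

```python
# Sketch of typical audio transforms from the pipeline.
import numpy as np

rng = np.random.default_rng(0)
audio = rng.standard_normal(12000).astype(np.float32)  # stand-in for decoded FLAC

def normalize(x):
    return x / np.max(np.abs(x))  # peak-normalize to [-1, 1]

def pad_to(x, length):
    return np.pad(x, (0, max(0, length - len(x))))  # zero-pad to a fixed length

def random_shift(x, max_shift, rng):
    return np.roll(x, int(rng.integers(-max_shift, max_shift + 1)))

out = random_shift(pad_to(normalize(audio), 16000), 1600, rng)
```

Note how normalization and padding are deterministic (candidates for the offline part), while the random shift must be re-drawn every epoch and so stays online.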
&lt;p>We have benchmarked the throughput performance and storage overhead of all possible strategy split points, and have seen some trade-offs between them. Both storage overhead and throughput speed-up use the fully online method as the baseline. What we&amp;rsquo;ve observed from our results is that the speed-up keeps increasing when we put operations into the offline part, and the storage consumption is very low for the strategies after audio decoding. Also, we analysed the performance of individual methods of transformation and augmentation steps. We find that the speed-up performance is quite stable between 1.0 and 1.2 across these methods, but some methods can have a high storage overhead, like normalization and random noise.&lt;/p>
&lt;p>Another thing we observed during our experiments is that the dataset size can influence the preprocessing pipeline throughput. We found that the throughput speed-up at 10,000 samples is almost double the speed-up at 5,000 samples, suggesting that a larger dataset may lead to a higher speed-up. This made us wonder whether every operation follows this pattern or only certain operations gain throughput with increasing dataset size, so we ran experiments on the throughput speed-ups of all operations in the audio preprocessing pipeline across different dataset sizes. The results showed that only the audio decoding step sees a large increase in speed-up for larger datasets, while the throughputs of the transformation, augmentation, and LMS steps stay at a steady level. This indicates that only the audio decoding step becomes faster and faster as the dataset size grows.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In our work, we have built up a collection of common datasets and their preprocessing pipelines for different machine-learning tasks. For the audio dataset LibriSpeech, we have done experiments about the trade-offs between throughput speed-ups and storage overhead, and dataset sizes. We have found that speed-ups keep increasing when more and more operations are divided into the offline part. Only the audio decoding step can become faster and faster when the dataset size grows.&lt;/p>
&lt;h2 id="future-works">Future works&lt;/h2>
&lt;p>In the near future, we want to find the optimal preprocessing strategy by profiling only a small part of the original, enormous dataset. Second, beyond the audio dataset, we plan to expand our experiments to other datasets and ML tasks. Finally, we aim to build an automatic system that decides the optimal strategy for a given preprocessing pipeline.&lt;/p>
&lt;p>Hello! I hope you&amp;rsquo;re enjoying the summer as much as I am. I&amp;rsquo;m excited to join the SOR community as a 2024 contributor. My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xikang-song/">Xikang Song&lt;/a>, and I&amp;rsquo;m thrilled to collaborate with mentors &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a> on the FSA-Benchmark project. This project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. Throughout this journey, we tested a broad range of algorithms, from traditional approaches to state-of-the-art techniques, using a robust evaluation system to compare their effectiveness.&lt;/p>
&lt;p>In the first half of the project, I focused on implementing and testing different machine learning models for detecting disks at high risk of fail-slow anomalies. This involved setting up initial models such as the Cost-Sensitive Ranking Model and Multi-Prediction Models, and beginning to explore LSTM networks for analyzing input disk data.&lt;/p>
&lt;p>In the second half, I built upon this foundation by refining the evaluation processes, exploring advanced models like PatchTST, and investigating the potential of large language models (LLMs) for detecting subtle fail-slow conditions in storage systems. This blog post will summarize the key achievements, findings, and comparisons with baseline models from this phase.&lt;/p>
&lt;h2 id="key-achievements">Key Achievements&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Comprehensive Benchmarking and Evaluation:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>I extended the benchmarking framework to evaluate multiple algorithms across 25 different data clusters on PERSEUS. This process involved generating and analyzing heatmaps that visualized the precision and recall of each model under various settings, providing a clear understanding of each approach&amp;rsquo;s strengths and limitations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Exploration of Advanced Machine Learning Models:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>LSTM Model:&lt;/strong> I implemented the Long Short-Term Memory (LSTM) model, specifically designed for sequential data, to capture temporal dependencies in disk performance metrics. This model was used to predict potential fail-slow anomalies by analyzing historical data. Using Mean Squared Error (MSE) as a risk indicator, the LSTM model outperformed baseline approaches like the Cost-Sensitive Ranking Model and Multi-Prediction Models, especially in clusters where latency patterns between faulty and normal disks were distinct, such as in Cluster_P. This resulted in higher precision and fewer false positives. However, in clusters with more complex and overlapping data distributions, like Cluster_L, the LSTM model&amp;rsquo;s performance diminished, similar to that of the baseline models.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>PatchTST Model:&lt;/strong> I also introduced and evaluated the PatchTST model, which is built on a transformer-based architecture known for its ability to handle sequential data by capturing long-range dependencies and intricate temporal patterns. Unlike traditional models, PatchTST processes time series data in segments or &amp;ldquo;patches,&amp;rdquo; enhancing its ability to predict disk behavior over extended periods. Like the LSTM model, PatchTST uses outlier MSE values to assess disk risk. In clusters with a clear separation between faulty and normal disks, PatchTST outperformed baseline models by effectively identifying faulty patterns. However, similar to the LSTM model, PatchTST encountered difficulties in clusters with significant data overlap, such as Cluster_L.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Investigation into Large Language Models (LLMs):&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>I explored the use of GPT-4o-mini for fail-slow detection. While large language models (LLMs) showed potential, particularly in reducing false positives and improving precision over baseline models, they did not consistently outperform specialized models like LSTM and PatchTST in this context. LLMs struggled with recall, especially as thresholds increased, revealing the challenges of adapting LLMs to time series data. This limitation arises because LLMs are primarily trained for natural language generation tasks, not for analyzing time series data. As a result, their ability to fully capture anomalies is limited. To improve their effectiveness, we need to develop methods that help LLMs better understand time series data. For example, incorporating statistical information about each disk’s performance could enhance LLMs&amp;rsquo; understanding, leading to better precision in fail-slow detection.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>The work in this project demonstrated that while advanced machine learning models like LSTM and PatchTST offer significant potential for detecting fail-slow conditions, challenges remain in ensuring consistent performance across diverse clusters. Compared to baseline models, these advanced approaches generally provided better precision and recall, especially in clusters with distinct data patterns between faulty and normal disk performance time series. However, the persistent difficulties in more complex clusters indicate the need for further refinement.&lt;/p>
&lt;p>Moving forward, future work will focus on refining these models, particularly in improving their performance in challenging clusters like Cluster_L. Additionally, I plan to further explore techniques such as prompt engineering for LLMs to better tailor them for time series analysis and fail-slow detection tasks.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Repository:&lt;/strong> All comprehensive analysis code and source code can be found in the &lt;a href="https://github.com/songxikang/FSA_BENCHMARK" target="_blank" rel="noopener">FSA_BENCHMARK GitHub Repository&lt;/a>.&lt;/li>
&lt;li>&lt;strong>Jupyter Notebook:&lt;/strong> A notebook to reproduce the experiments and benchmarks on Chameleon: &lt;a href="https://chameleoncloud.org/experiment/share/585c1fc0-924c-4501-b143-ad6476339aa8" target="_blank" rel="noopener">Chameleon Experiment Notebook&lt;/a>.&lt;/li>
&lt;li>&lt;strong>Final Report:&lt;/strong> Comprehensive algorithm performance evaluation for all methods in &lt;a href="https://docs.google.com/document/d/1NONl23sXK-qE4Krx3JwG7gCrNiNmaaW1t4WVzMmomLQ/edit?usp=sharing" target="_blank" rel="noopener">FSA-Benchmarking Final Report&lt;/a>.&lt;/li>
&lt;/ul></description></item><item><title>Data Leakage in Applied ML</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/</link><pubDate>Tue, 13 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>I have been working on reproducing the results from &lt;strong>Characterization of Term and Preterm Deliveries using Electrohysterograms Signatures&lt;/strong>. This paper aims to predict preterm birth using a Support Vector Machine with an RBF kernel. However, there is a major flaw in the methodology: &lt;strong>preprocessing on the training and test sets together&lt;/strong>. This leakage happens when preprocessing is performed on the entire dataset before splitting it into training and test sets.&lt;/p>
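The difference between the flawed and correct orderings can be sketched with scikit-learn's StandardScaler on synthetic data (the paper's actual preprocessing, oversampling and feature extraction, is more involved, but the ordering issue is the same):

```python
# Sketch: leaky vs. correct preprocessing.
# Leaky: fit on the whole dataset, so test-set statistics leak into training.
# Correct: fit on the training split only, then apply to both splits.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(100, 4))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky ordering: scaler sees the test rows
X_leaky = StandardScaler().fit(X).transform(X_train)

# Correct ordering: scaler sees only training data
scaler = StandardScaler().fit(X_train)
X_train_ok = scaler.transform(X_train)
X_test_ok = scaler.transform(X_test)
```

Because the two scalers learn different means and variances, the leaky and correct feature matrices differ, and any evaluation on the leaky version is optimistically biased.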
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sample produced from test and training set samples" srcset="
/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_687703a1dee465e80fb3dbe262dd5860.webp 400w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_42051adaf7804083284553c10ca73861.webp 760w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_687703a1dee465e80fb3dbe262dd5860.webp"
width="760"
height="589"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sample produced from training set samples" srcset="
/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_47e6397030251c1681ff92260f687641.webp 400w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_8bad9197813df4344757765d43878a56.webp 760w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_47e6397030251c1681ff92260f687641.webp"
width="760"
height="594"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Reproducing the published results came with its own challenges, including updating EHG-Oversampling to extract meaningful features from EHG signals and finding optimal hyperparameters for the model. Through our work on reproducing the published results and creating toy example notebooks, we have been able to demonstrate that data leakage leads to overly optimistic measures of model performance and that models trained with data leakage fail to generalize to real-world data. In such cases, performance on the test set doesn&amp;rsquo;t translate to performance in the real world.&lt;/p>
&lt;p>Next, I&amp;rsquo;ll be reproducing the results published in &lt;strong>Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches&lt;/strong>.&lt;/p>
&lt;p>You can follow my work on the EHG paper &lt;a href="https://github.com/shaivimalik/medicine_preprocessing-on-entire-dataset" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Stay tuned for more insights on data leakage and updates on our progress!&lt;/p></description></item><item><title>Midterm Report: Halfway through medicinal data visualization using PolyPhy/Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240719-ayushsharma/</link><pubDate>Mon, 12 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240719-ayushsharma/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ayush-sharma/">Ayush Sharma&lt;/a>, a machine learning engineer and researcher based out of Chandigarh, a beautiful city in Northern India known for its modern architecture and green spaces.
For the last month and a half, I have been working closely with my mentors &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kiran-deol/">Kiran Deol&lt;/a> on the project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/">Unveiling Medicine Patterns: 3D Clustering with PolyPhy/Polyglot&lt;/a> as part of GSoC 2024.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;p>The project focuses on developing effective clustering algorithms to visualize medicine data in three dimensions using PolyPhy and Polyglot. My journey began with data preprocessing and cleaning, where unnecessary data points were removed, and missing values were addressed.&lt;/p>
&lt;p>One of the primary techniques we&amp;rsquo;ve employed is UMAP (Uniform Manifold Approximation and Projection). UMAP&amp;rsquo;s ability to preserve the global structure of the data while providing meaningful clusters proved advantageous. Initial experiments with UMAP on datasets of various sizes (ranging from 1,500 to 15,000 medicines) provided valuable insights into the clustering patterns. By iteratively halving the dimensions and refining the parameters, we achieved more accurate clustering results.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="UMAP on a dataset of 15000 medicines" srcset="
/report/osre24/ucsc/polyphy/20240719-ayushsharma/umap_hua68b3da7cb5e27475c0ecf687ad0d87a_123755_48eb545fa0673e23a0ff289b6fdac6cd.webp 400w,
/report/osre24/ucsc/polyphy/20240719-ayushsharma/umap_hua68b3da7cb5e27475c0ecf687ad0d87a_123755_12b5cf998e90e476fdd4e6c9800cc63e.webp 760w,
/report/osre24/ucsc/polyphy/20240719-ayushsharma/umap_hua68b3da7cb5e27475c0ecf687ad0d87a_123755_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240719-ayushsharma/umap_hua68b3da7cb5e27475c0ecf687ad0d87a_123755_48eb545fa0673e23a0ff289b6fdac6cd.webp"
width="679"
height="603"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To complement UMAP, we explored t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE&amp;rsquo;s focus on local relationships helped in understanding finer details within the clusters. By adjusting t-SNE parameters and conducting perturbations, we could better comprehend the data&amp;rsquo;s behavior. Combining UMAP with t-SNE in a loop, halving dimensions iteratively, showed promise, allowing us to leverage the strengths of both techniques to enhance clustering accuracy.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="t-SNE on a dataset of 15000 medicines" srcset="
/report/osre24/ucsc/polyphy/20240719-ayushsharma/t-SNE_hu27c25081a80397a68d5439e1a165b2a0_67619_505feb5f73fb8656ef98cfa71acfb53b.webp 400w,
/report/osre24/ucsc/polyphy/20240719-ayushsharma/t-SNE_hu27c25081a80397a68d5439e1a165b2a0_67619_fc473d7fb06ab1b2e2bafbb3b86db867.webp 760w,
/report/osre24/ucsc/polyphy/20240719-ayushsharma/t-SNE_hu27c25081a80397a68d5439e1a165b2a0_67619_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240719-ayushsharma/t-SNE_hu27c25081a80397a68d5439e1a165b2a0_67619_505feb5f73fb8656ef98cfa71acfb53b.webp"
width="760"
height="527"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
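&lt;p>A rough sketch of the halving loop described above. Note that scikit-learn&amp;rsquo;s PCA stands in for UMAP here so the snippet needs no extra dependencies (the real pipeline uses umap-learn), and the random matrix is a stand-in for our medicine embeddings:&lt;/p>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 64))  # stand-in for 64-dimensional medicine embeddings

# Iteratively halve the dimensionality before the final 2-D projection,
# mirroring the loop described above (PCA substitutes for the UMAP stage).
dims = X.shape[1]
while dims > 8:
    dims //= 2
    X = PCA(n_components=dims).fit_transform(X)

# t-SNE produces the final 2-D embedding used for cluster visualization
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(emb.shape)
```

&lt;p>Reducing dimensionality in stages keeps the expensive final t-SNE step cheap while preserving most of the global structure for it to work with.&lt;/p>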
&lt;p>We also experimented with pre-trained models like BERT and GloVe to create embeddings for the medicines. BERT’s splitting of salt names into subtokens and GloVe’s inability to recognize specific salts led to inaccurate clustering, and we&amp;rsquo;ve been working on improving this.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>Moving forward, I will focus on refining our clustering and embedding techniques to enhance overall accuracy. This involves integrating Jaccard distance alongside other distance measures to improve similarity assessments between medicines and clusters. Additionally, I&amp;rsquo;ll continue experimenting with advanced models like GPT, CLIP, and Gemini for better embeddings, while addressing the limitations of BERT and GloVe by leveraging custom embeddings created with transformers and one-hot encoding. Optimization of UMAP and t-SNE algorithms will also be crucial, ensuring their effectiveness in clustering and visualization. These steps aim to overcome current challenges and further advance the project&amp;rsquo;s goals.&lt;/p></description></item><item><title>Midterm Check-In: Progress on the AutoAppendix Project</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tuwien/autoappendix/20240803-kkrassni/</link><pubDate>Sat, 03 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tuwien/autoappendix/20240803-kkrassni/</guid><description>&lt;p>Hi all,&lt;/p>
&lt;p>I&amp;rsquo;m happy to share a quick update on the AutoAppendix project as we’re about
halfway through. We’ve made some steady progress on evaluating artifacts from SC24 papers, and we&amp;rsquo;re starting
to think about how we can use what we&amp;rsquo;ve learned to
improve the artifact evaluation process in the future.&lt;/p>
&lt;h2 id="what-weve-been-up-to">What We’ve Been Up To&lt;/h2>
&lt;p>As a quick reminder, the goal of our project is to develop a set of guidelines that
researchers can use to improve the reproducibility of their work. We&amp;rsquo;re focusing
on papers from the Supercomputing Conference 2024 that applied for an &amp;ldquo;Artifact Replicable&amp;rdquo; badge, and we&amp;rsquo;re
evaluating their artifacts to see how well the experiments can be replicated. Since it was difficult to predict the project&amp;rsquo;s exact outcomes beyond detailed experiment recreation, the main goal of this
midterm check-in is to share the insights we have gathered so far and to set the stage for the final outcomes.&lt;/p>
&lt;p>Our main task so far has been making a selection of submissions with experiments designed
for Chameleon Cloud, or those that could be easily adapted to run on Chameleon. As there were 45 submissions that applied
for an &amp;ldquo;Artifact Replicable&amp;rdquo; badge, it was not easy
to choose which ones to evaluate, but we managed to narrow
it down to 18 papers that we thought would be a good fit for our project.&lt;/p>
&lt;p>We&amp;rsquo;ve chosen to focus on papers that do not require
special hardware (like a specific supercomputer) or
complex network setups, as it would be difficult to
generalize the insights from these kinds of
experiments. Instead, we&amp;rsquo;ve been looking at those
that require only a &lt;em>single computation node&lt;/em>, and
could theoretically be run with the available hardware
on Chameleon.&lt;/p>
&lt;h2 id="observations-and-learning-points">Observations and Learning Points&lt;/h2>
&lt;p>At the moment, we&amp;rsquo;re about halfway through the
evaluation process. So far, we&amp;rsquo;ve noticed a range of
approaches to documenting and setting up computational
experiments. Even without looking at the appendices in
detail, it&amp;rsquo;s clear that there’s a lot of room for
standardization of the documentation format and software setup, which could make life easier for
everyone involved. This particularly applies to
software setups, which are often daunting to replicate,
especially when there are specific version requirements, version
incompatibilities or outright missing dependencies. Since the main goal of this
project is to develop a set of guidelines that
researchers can use to improve the reproducibility of
their work, suggesting a way to deal with software
versions and dependencies will be a key part of our
results.&lt;/p>
&lt;p>We’ve observed that submissions with well-structured and detailed appendices
tend to fare better in reproducibility checks. This includes those that utilized
containerization solutions like Docker, which encapsulate the computing
environment needed to run the experiments and thus
eliminate the need to install specific software
packages. It’s these kinds of practices that we
think could be encouraged more broadly.&lt;/p>
&lt;h2 id="looking-ahead">Looking Ahead&lt;/h2>
&lt;p>The next steps are pretty exciting! We’re planning to use what we’ve learned to draft some
guidelines that could help future SC conference submissions be more consistent.
This might include templates or checklists that ensure all the necessary details
are covered.&lt;/p>
&lt;p>Additionally, we’re thinking about ways to automate some parts of the artifact
evaluation process. The goal here is to make it less labor-intensive and more
objective. A particularly convenient vehicle
for reproducible artifact evaluation is
Chameleon&amp;rsquo;s JupyterHub interface, which, in combination with the &lt;em>Trovi&lt;/em>
artifact sharing platform makes it easy to share artifacts and allow interested
parties to reproduce the experiments with minimal effort. We are thus looking into ways to
utilize and contribute to these tools in a way that could benefit the broader research community.&lt;/p>
&lt;h2 id="wrapping-up">Wrapping Up&lt;/h2>
&lt;p>That’s it for now! We are working towards getting
as many insights as possible from the rest of the
artifact evaluations, and hopefully, by the end of this project, we’ll have some solid
recommendations and tools to show for it. Thanks for keeping up with our
progress, and I’ll be back with more updates as we move into the final stages of
our work.&lt;/p></description></item><item><title>[MidTerm] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240801-imzahra/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240801-imzahra/</guid><description>&lt;p>Hey there, scalability enthusiasts and fellow researchers! I’m excited to share my progress on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep project&lt;/a> for SoR 2024 under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>. Here’s a glimpse into how we’re tackling scalability bugs in large-scale distributed systems.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Large-scale distributed systems are the backbone of modern computing, powering various applications and services. However, these systems often face challenges related to reliability and performance, particularly scalability bugs. These bugs manifest in large-scale deployments, causing issues such as system downtime, reduced responsiveness, and data loss. Traditional bug-finding methods fall short in detecting these bugs, which are triggered by factors like component count, system load, workload size, recovery protocol reliability, and intermediate failure magnitude.&lt;/p>
&lt;p>Our project, ScaleRep, aims to address these challenges by analyzing recent scalability issues from ten popular open-source large-scale systems. We are providing detailed accounts of bug reproduction experiences, identifying common challenges, and developing protocols for triggering and quantifying the impact of scalability bugs.&lt;/p>
&lt;h2 id="progress-highlights">Progress Highlights&lt;/h2>
&lt;p>So far, I have been working on the following bugs and have successfully uploaded some of them to Trovi. Here’s a brief overview of my progress:&lt;/p>
&lt;h3 id="bugs-worked-on">Bugs Worked On:&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>&lt;/strong>: Uploaded to Trovi &lt;a href="https://www.chameleoncloud.org/experiment/share/9f045059-011e-4089-90d4-0f5845ef3c73" target="_blank" rel="noopener">Trovi Link&lt;/a>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>&lt;/strong>: Uploaded to Trovi &lt;a href="https://www.chameleoncloud.org/experiment/share/9cfd42b7-c7c9-4b6b-a538-b6c496eb1bed" target="_blank" rel="noopener">Trovi Link&lt;/a>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20692" target="_blank" rel="noopener">IGNITE-20692&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>&lt;/strong>&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>&lt;/strong>&lt;/li>
&lt;/ol>
&lt;h2 id="what-is-chameleon-and-trovi">What is Chameleon and Trovi?&lt;/h2>
&lt;p>&lt;strong>&lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>&lt;/strong> is a configurable experimental environment for large-scale cloud research. It provides a platform for running and testing distributed systems at scale, allowing researchers to reproduce and study scalability issues in a controlled setting.&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a>&lt;/strong> is a platform that facilitates the sharing of reproducible artifacts. By uploading our bug reproduction artifacts to Trovi, we enable other researchers to easily reproduce scalability bugs, fostering collaboration and advancing the field of distributed systems research.&lt;/p>
&lt;h2 id="short-description-of-the-bugs">Short Description of the Bugs&lt;/h2>
&lt;ol>
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20614" target="_blank" rel="noopener">IGNITE-20614&lt;/a>
This bug refers to an issue where the Ignite service grid experiences degradation or hangs under specific conditions related to service deployment and node restarts.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is a race condition during the deployment and undeployment of services in the service grid, particularly when nodes are restarted or when there is a significant amount of concurrent service deployment and undeployment activity.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: The impact of this bug includes potential service grid hangs, degraded performance, and possible inability to deploy or undeploy services as expected, which can disrupt the overall operation of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves adding proper synchronization mechanisms to handle concurrent service deployment and undeployment operations more gracefully, ensuring that race conditions are avoided.&lt;/p>
&lt;ol start="2">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-17407" target="_blank" rel="noopener">IGNITE-17407&lt;/a>
This issue pertains to the incorrect behavior of the Ignite thin client protocol, particularly when dealing with binary objects and schema changes.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause lies in the way the thin client handles binary object schema changes. The thin client was not correctly updating the schema cache, leading to inconsistencies and incorrect behavior when deserializing binary objects.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users of the thin client may experience issues with binary object deserialization, leading to potential data corruption, incorrect query results, and overall application instability.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves updating the thin client protocol to properly handle schema changes by ensuring that the schema cache is correctly updated and synchronized with the server.&lt;/p>
&lt;ol start="3">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-20692" target="_blank" rel="noopener">IGNITE-20692&lt;/a>
This bug is related to the performance degradation observed in the Ignite SQL engine when executing certain complex queries.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is identified as inefficient query planning and execution strategies for specific types of complex SQL queries, leading to excessive resource consumption and slow query performance.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users running complex SQL queries may experience significant performance degradation, leading to slower response times, increased CPU and memory usage, and potentially impacting the overall performance of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves optimizing the SQL query planner and executor to handle complex queries more efficiently, including better indexing strategies, improved query plan caching, and more effective resource management during query execution.&lt;/p>
&lt;ol start="4">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16600" target="_blank" rel="noopener">IGNITE-16600&lt;/a>
This bug involves an issue with speed-based throttling in the checkpoint process, leading to possible starvation of the checkpoint thread under heavy load.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is the absence of proper mechanisms to wake up throttled threads when they no longer need to be throttled, resulting in unnecessary waiting and potential starvation of the checkpoint thread.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Under heavy load, the checkpoint process can be significantly delayed, leading to slower checkpoint completion times, increased risk of data loss, and overall degraded performance of the Ignite cluster.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix includes implementing methods to wake up throttled threads when they no longer need to be throttled (tryWakeupThrottledThreads and shouldThrottle), ensuring that the checkpoint process can proceed without unnecessary delays.&lt;/p>
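&lt;p>As an illustration of this fix pattern (a Python sketch, not Ignite&amp;rsquo;s actual Java implementation), a throttled thread can block on a condition variable and be woken the moment throttling is lifted, instead of sleeping for a fixed interval and risking starvation:&lt;/p>

```python
import threading

class CheckpointThrottle:
    """Toy throttle: waiters block on a condition and are woken explicitly."""
    def __init__(self):
        self._cond = threading.Condition()
        self._throttling = False

    def should_throttle(self):
        return self._throttling

    def acquire(self):
        with self._cond:
            while self._throttling:
                self._cond.wait()          # blocks until notified; no busy sleep

    def set_throttling(self, on):
        with self._cond:
            self._throttling = on
            if not on:
                self._cond.notify_all()    # wake all throttled threads at once

throttle = CheckpointThrottle()
done = []
worker = threading.Thread(target=lambda: (throttle.acquire(), done.append(True)))
throttle.set_throttling(True)
worker.start()
throttle.set_throttling(False)  # throttling lifted: worker proceeds immediately
worker.join(timeout=5)
print(done)  # [True]
```

&lt;p>The key point, mirrored in the actual fix, is that lifting the throttle actively notifies waiters rather than leaving them to time out on their own.&lt;/p>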
&lt;ol start="5">
&lt;li>&lt;a href="https://issues.apache.org/jira/browse/IGNITE-16072" target="_blank" rel="noopener">IGNITE-16072&lt;/a>
This issue pertains to the incorrect handling of SQL queries involving NULL values in the Ignite SQL engine, leading to unexpected query results.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Root Causes&lt;/strong>: The root cause is an incorrect implementation of SQL semantics for handling NULL values in certain query conditions, particularly in the presence of complex joins and subqueries.&lt;/p>
&lt;p>&lt;strong>Impact&lt;/strong>: Users may experience incorrect query results when NULL values are involved, leading to potential data inconsistencies and incorrect application behavior.&lt;/p>
&lt;p>&lt;strong>Fix&lt;/strong>: The fix involves correcting the SQL engine&amp;rsquo;s implementation to properly handle NULL values according to the SQL standard, ensuring that queries involving NULL values produce the expected results.&lt;/p>
&lt;h2 id="whats-next">What&amp;rsquo;s Next?&lt;/h2>
&lt;h4 id="continued-bug-reproduction">Continued Bug Reproduction:&lt;/h4>
&lt;ul>
&lt;li>Focus on reproducing more scalability bugs&lt;/li>
&lt;/ul>
&lt;h4 id="documentation-of-challenges">Documentation of Challenges:&lt;/h4>
&lt;ul>
&lt;li>Breakdown specific challenges encountered during attempts to reproduce scalability bugs.&lt;/li>
&lt;li>Categorize challenges, including technical complexities, environmental dependencies, and lack of documentation in bug reports.&lt;/li>
&lt;/ul>
&lt;h4 id="finalizing-project-deliverables">Finalizing Project Deliverables:&lt;/h4>
&lt;ul>
&lt;li>Package artifacts using Jupyter notebook scripts for convenient replay of investigation steps.&lt;/li>
&lt;li>Upload the package to Trovi for replayable artifacts, enabling other researchers to easily reproduce scalability bugs for our benchmark applications.&lt;/li>
&lt;/ul>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>The ScaleRep project has made significant strides in reproducing and benchmarking scalability bugs in large-scale distributed systems. By successfully reproducing and documenting scalability bugs, we are contributing valuable insights to the research community, aiding in the development of more robust distributed systems. The protocols and methodologies devised in this project will serve as valuable tools for researchers exploring similar issues.&lt;/p>
&lt;p>Stay tuned for more updates as we continue to tackle scalability bugs and improve the reliability and performance of large-scale distributed systems.&lt;/p></description></item><item><title>Midway Through GSoC</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240714-jaytau/</link><pubDate>Wed, 31 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240714-jaytau/</guid><description>&lt;p>Hello everyone! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joel-tony/">Joel Tony&lt;/a>, and I&amp;rsquo;m excited to share my progress update on the &lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I&amp;rsquo;ve been diving deep into the world of I/O visualization for scientific applications, and I&amp;rsquo;m thrilled to tell you about the strides we&amp;rsquo;ve made.&lt;/p>
&lt;h2 id="what-is-drishti">What is Drishti?&lt;/h2>
&lt;p>For those unfamiliar with Drishti, it&amp;rsquo;s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using &lt;a href="https://wordpress.cels.anl.gov/darshan/" target="_blank" rel="noopener">Darshan&lt;/a>, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application&amp;rsquo;s I/O patterns, making it easier to identify bottlenecks and optimize performance.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;h3 id="export-directory-feature">Export Directory Feature&lt;/h3>
&lt;p>One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn&amp;rsquo;t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.&lt;/p>
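&lt;p>Conceptually, the feature looks like the following sketch (the flag name and file layout here are illustrative, not necessarily Drishti&amp;rsquo;s exact CLI):&lt;/p>

```python
import argparse
from pathlib import Path

# Hypothetical sketch of an export-directory option: the report is written
# wherever the user chooses instead of next to the (possibly read-only) log.
parser = argparse.ArgumentParser(description="toy I/O-insight reporter")
parser.add_argument("logfile")
parser.add_argument("--export-dir", default=".", help="where to write reports")
args = parser.parse_args(["trace.darshan", "--export-dir", "/tmp/reports"])

# Derive the output path from the chosen directory and the log's base name
out = Path(args.export_dir) / (Path(args.logfile).stem + ".html")
print(out)  # e.g. /tmp/reports/trace.html
```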
&lt;h3 id="ci-improvements-and-cross-project-dependencies">CI Improvements and Cross-Project Dependencies&lt;/h3>
&lt;p>While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don&amp;rsquo;t adequately test the interactions between different branches of these interconnected tools. This is an area we&amp;rsquo;ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.&lt;/p>
&lt;h3 id="refactoring-for-multi-file-support">Refactoring for Multi-File Support&lt;/h3>
&lt;p>The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti&amp;rsquo;s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing performance.&lt;/p>
&lt;p>The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:&lt;/p>
&lt;ol>
&lt;li>Better separation of computation and condition checking&lt;/li>
&lt;li>Easier parallelization of processing multiple traces&lt;/li>
&lt;li>Finer-grained profiling of performance bottlenecks&lt;/li>
&lt;li>More flexibility in data manipulation and memory management&lt;/li>
&lt;/ol>
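&lt;p>A hypothetical sketch of this structure (names invented for illustration, not Drishti&amp;rsquo;s actual classes):&lt;/p>

```python
from dataclasses import dataclass, field

# Per-trace metrics live on one object, and each insight check is a small
# method instead of a branch inside a single thousand-line parser function.
@dataclass
class TraceMetrics:
    total_reads: int
    total_writes: int
    small_request_count: int = 0
    insights: list = field(default_factory=list)

    def check_small_requests(self, threshold=0.1):
        """Condition checking, kept separate from metric computation."""
        total = self.total_reads + self.total_writes
        if total and self.small_request_count / total > threshold:
            self.insights.append("high ratio of small I/O requests")

# Handling multiple Darshan files becomes a loop over independent objects,
# which is also easy to parallelize and to profile per trace.
traces = [TraceMetrics(100, 50, small_request_count=40),
          TraceMetrics(10, 5)]
for t in traces:
    t.check_small_requests()
print([t.insights for t in traces])
```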
&lt;h2 id="learnings-and-skills-gained">Learnings and Skills Gained&lt;/h2>
&lt;p>Through this process, I&amp;rsquo;ve gained valuable insights into:&lt;/p>
&lt;ol>
&lt;li>Refactoring large codebases&lt;/li>
&lt;li>Understanding and improving cross-project dependencies&lt;/li>
&lt;li>Implementing data classes in Python for better code organization&lt;/li>
&lt;li>Balancing performance with code readability and maintainability&lt;/li>
&lt;/ol>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>As I move forward with the project, my focus will be on:&lt;/p>
&lt;ol>
&lt;li>Adding unit tests for individual methods to ensure functionality&lt;/li>
&lt;li>Exploring alternative data frame implementations like Polars for better performance&lt;/li>
&lt;li>Developing aggregation methods for different types of data across multiple Darshan files&lt;/li>
&lt;li>Optimizing memory usage and computational efficiency for large datasets&lt;/li>
&lt;/ol>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Working on Drishti has been an incredible learning experience. I&amp;rsquo;ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I&amp;rsquo;m excited about the potential impact of these improvements on the scientific community&amp;rsquo;s ability to optimize their applications&amp;rsquo; I/O performance.&lt;/p>
&lt;p>I&amp;rsquo;m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!&lt;/p>
&lt;p>If you have any questions or would like to learn more about the project, feel free to &lt;a href="https://www.jaytau.com/#contact?ref=uc-ospo" target="_blank" rel="noopener">reach out to me&lt;/a>. Let&amp;rsquo;s keep pushing the boundaries of scientific computing together!&lt;/p></description></item><item><title>Streaming into the Future: Adding Real-Time Processing to FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/fastensor/20240730-aditya_narayan/</link><pubDate>Tue, 30 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/fastensor/20240730-aditya_narayan/</guid><description>&lt;p>Hey there, HPC enthusiasts and fellow coders! I&amp;rsquo;m excited to share my progress on this summer&amp;rsquo;s Google Summer of Code project under UC OSPO&amp;rsquo;s FasTensor.
Here&amp;rsquo;s a glimpse into how we&amp;rsquo;re pushing the boundaries of real-time data processing.&lt;/p>
&lt;h2 id="the-big-picture-fastensor-and-hpc-challenges">The Big Picture: FasTensor and HPC Challenges&lt;/h2>
&lt;p>First, a quick refresher: FasTensor is our go-to tool for handling dense arrays in scientific computing. It tackles three major HPC challenges:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Optimizing computations&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Distributing data efficiently&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Balancing workloads across computing cores&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>FasTensor excels at these tasks, especially when dealing with data that has structural locality - a common feature in scientific computing. Here, the Stencil computations come in handy, capturing data locality for operations like solving partial differential equations in physical simulations.&lt;/p>
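&lt;p>For readers unfamiliar with stencils, here is a minimal NumPy sketch of a five-point Laplacian, the kind of neighborhood operation a Stencil abstraction expresses (illustrative only, not FasTensor&amp;rsquo;s actual C++ API):&lt;/p>

```python
import numpy as np

# Five-point Laplacian stencil: each interior cell is updated from its four
# immediate neighbors, exploiting the structural locality of the array.
def laplacian(u):
    out = np.zeros_like(u)
    out[1:-1, 1:-1] = (u[:-2, 1:-1] + u[2:, 1:-1] +
                       u[1:-1, :-2] + u[1:-1, 2:] -
                       4.0 * u[1:-1, 1:-1])
    return out

u = np.zeros((5, 5))
u[2, 2] = 1.0          # a single point source
lap = laplacian(u)
print(lap[2, 2], lap[1, 2])  # -4.0 1.0
```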
&lt;h3 id="the-mission-bringing-fastensor-into-real-time">The Mission: Bringing FasTensor into Real-Time&lt;/h3>
&lt;p>While FasTensor is great at processing existing data, the next frontier is handling live data streams from scientific instruments and sensors. That&amp;rsquo;s where my GSoC project comes in: adding stream processing capabilities to FasTensor.&lt;/p>
&lt;h2 id="progress-highlights">Progress Highlights:&lt;/h2>
&lt;h3 id="building-a-stream-simulator">Building a Stream Simulator&lt;/h3>
&lt;p>We&amp;rsquo;ve created FTstream, a nifty tool that simulates data streams. It can generate streams of various sizes and intervals, pushing the limits of what your disk can handle. We&amp;rsquo;re talking speeds up to 2.5 GiB/s on a non-parallel NVMe! This tool is crucial because many scientific instruments, from particle accelerators to radio telescopes, generate massive amounts of data at incredible speeds, and we need to be able to simulate that. For context, that&amp;rsquo;s faster than a 10 MP RGB camera shooting at 35 frames per second, which generates data at ~1 GiB/s.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="FTStream: a stream simulator" srcset="
/report/osre24/lbl/fastensor/20240730-aditya_narayan/ftstream_hu742d341e5ed79c96d79ca4fdb4fe00ee_107361_e1ff5502d16324d112780cafc587c0bb.webp 400w,
/report/osre24/lbl/fastensor/20240730-aditya_narayan/ftstream_hu742d341e5ed79c96d79ca4fdb4fe00ee_107361_9ecceb72d631078c6b5109deaefeb0f5.webp 760w,
/report/osre24/lbl/fastensor/20240730-aditya_narayan/ftstream_hu742d341e5ed79c96d79ca4fdb4fe00ee_107361_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/fastensor/20240730-aditya_narayan/ftstream_hu742d341e5ed79c96d79ca4fdb4fe00ee_107361_e1ff5502d16324d112780cafc587c0bb.webp"
width="760"
height="410"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
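&lt;p>The camera comparison is easy to verify with back-of-the-envelope arithmetic:&lt;/p>

```python
# 10 megapixels x 3 bytes per pixel (8-bit RGB) x 35 frames per second
bytes_per_s = 10_000_000 * 3 * 35
gib_per_s = bytes_per_s / 2**30
print(round(gib_per_s, 2))  # ~0.98, i.e. roughly 1 GiB/s
```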
&lt;h3 id="optimizing-io-strategies">Optimizing I/O Strategies&lt;/h3>
&lt;p>We&amp;rsquo;ve been experimenting with various I/O approaches to optimize high-speed data stream handling.&lt;/p>
&lt;h3 id="exploring-streaming-semantics">Exploring Streaming Semantics&lt;/h3>
&lt;p>We&amp;rsquo;re investigating various ways to express and execute stream transformations, to ensure that FasTensor can handle a wide range of streaming computations.&lt;/p>
&lt;h3 id="developing-io-drivers">Developing I/O Drivers&lt;/h3>
&lt;p>We&amp;rsquo;ve developed two new I/O drivers, based on Linux AIO and MPI-IO, to ingest incoming data smoothly and maintain stream consistency.&lt;/p>
&lt;h2 id="whats-next">What&amp;rsquo;s Next?&lt;/h2>
&lt;h3 id="putting-it-all-together">Putting It All Together&lt;/h3>
&lt;p>We&amp;rsquo;re in the final stretch of integrating all these components into a seamless stream processing system.&lt;/p>
&lt;h3 id="rigorous-testing">Rigorous Testing&lt;/h3>
&lt;p>We&amp;rsquo;ll push our stream processing to its limits, simulating diverse data flows to ensure rock-solid performance in any scientific setting.&lt;/p>
&lt;h3 id="hpc-environment-validation">HPC Environment Validation&lt;/h3>
&lt;p>The ultimate test will be running our new streaming capabilities in real HPC environments, checking how they perform with different I/O setups and computing paradigms.&lt;/p>
&lt;h2 id="wrapping-up">Wrapping Up&lt;/h2>
&lt;p>This summer has been a whirlwind of coding, testing, and learning. We&amp;rsquo;re making significant strides in bringing real-time processing capabilities to FasTensor, which could open up exciting new possibilities in scientific computing and data analysis.
Stay tuned for more updates as we finalize this feature. If you&amp;rsquo;re interested in the nitty-gritty technical details or want to check out the code, feel free to reach out or check our project repository.
Happy coding, and may your computations be ever faster!&lt;/p></description></item><item><title>Mid-term Blog: Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240729-architd/</link><pubDate>Mon, 29 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240729-architd/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello everyone! I&amp;rsquo;m Archit from India, an undergraduate student at the Indian Institute of Technology, Banaras Hindu University (IIT BHU), Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/bsc/ro-crate-compss/">Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1qY-uipQZPox144LD4bs05rn3islfcjky/view" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/raul-sirvent/">Raül Sirvent&lt;/a>, aims to develop a service that facilitates the automated replication of COMPSs experiments within the Chameleon infrastructure.&lt;/p>
&lt;h2 id="about-the-project">About the project:&lt;/h2>
&lt;p>The project proposes to create a service that can take a COMPSs crate (an artifact adhering to the RO-Crate specification) and, through analysis of the provided metadata, construct a Chameleon-compatible image for replicating the experiment on the testbed.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>It has been more than six weeks since the ReproducibilityService project began, and significant progress has been made. You can test the actual service from my GitHub repository: &lt;a href="https://github.com/Minimega12121/COMPSs-Reproducibility-Service" target="_blank" rel="noopener">ReproducibilityService&lt;/a>. Let&amp;rsquo;s break down what the ReproducibilityService is capable of doing now:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Support for Reproducing Basic COMPSs Experiments&lt;/strong>: The RS program is now fully capable of reproducing basic COMPSs experiments with no third-party dependencies on any device with the COMPSs Runtime installed. Here&amp;rsquo;s how it works:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Getting the Crate&lt;/strong>: The RS program can accept the COMPSs workflow from the user either as a path to the crate or as a link from WorkflowHub. In either case, it creates a sub-directory for further execution named &lt;code>reproducibility_service_{timestamp}&lt;/code> and stores the workflow as &lt;code>reproducibility_service_{timestamp}/Workflow&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Address Mapping&lt;/strong>: The ro-crate contains &lt;code>compss_submission_command_line.txt&lt;/code>, which is the command originally used to execute the experiment. This command may include many paths such as &lt;code>runcompss flag1 flag2 ... flagn &amp;lt;main_workflow_file.py&amp;gt; input1 input2 ... inputn output&lt;/code>. The RS program maps all the paths for &lt;code>&amp;lt;main_workflow_file.py&amp;gt; input1 input2 ... inputn output&lt;/code> to paths inside the machine where we want to reproduce the experiment. The flags are dropped as they may be device-specific, and the service asks the user for any new flags they want to add to the COMPSs runtime.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Verifying Files&lt;/strong>: Before reproducing an experiment, it&amp;rsquo;s crucial to check whether the inputs or outputs have been tampered with. The RS program cross-verifies the &lt;code>contentSize&lt;/code> from the &lt;code>ro-crate-metadata.json&lt;/code> and generates warnings in case of any abnormalities.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Error Logging&lt;/strong>: In case of any problems during execution, the &lt;code>std_out&lt;/code> and &lt;code>std_err&lt;/code> are stored inside &lt;code>reproducibility_service_{timestamp}/log&lt;/code>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Results&lt;/strong>: If the experiment generates any results, the RS program stores them inside &lt;code>reproducibility_service_{timestamp}/Results&lt;/code>. If provenance of the workflow is also requested, the ro-crate thus generated is stored here as well.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
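&lt;p>The file-verification step can be sketched roughly as follows (a simplified illustration, not the actual RS code; the metadata handling is abbreviated and &lt;code>contentSize&lt;/code> is assumed to be in bytes):&lt;/p>

```python
import json
import os

def verify_content_sizes(crate_dir):
    """Compare each file's on-disk size against the contentSize recorded in
    ro-crate-metadata.json; return a list of warning strings."""
    with open(os.path.join(crate_dir, "ro-crate-metadata.json")) as f:
        metadata = json.load(f)
    warnings = []
    for entity in metadata.get("@graph", []):
        declared = entity.get("contentSize")
        if declared is None:
            continue                      # not a file entity with a size
        path = os.path.join(crate_dir, entity["@id"])
        if not os.path.isfile(path):
            warnings.append(f"missing file: {entity['@id']}")
        elif os.path.getsize(path) != int(declared):
            warnings.append(f"size mismatch for {entity['@id']}: "
                            f"expected {declared}, found {os.path.getsize(path)}")
    return warnings
```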
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="REPRODUCIBILITY SERVICE FLOWCHART" srcset="
/report/osre24/intel/20240729-architd/RS_chart_hu1a952b7a4697c53cd74822153911f260_56808_4df9e9a771513277aaf5c7a4d8182666.webp 400w,
/report/osre24/intel/20240729-architd/RS_chart_hu1a952b7a4697c53cd74822153911f260_56808_0b96071409b70d8356241465bf214510.webp 760w,
/report/osre24/intel/20240729-architd/RS_chart_hu1a952b7a4697c53cd74822153911f260_56808_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240729-architd/RS_chart_hu1a952b7a4697c53cd74822153911f260_56808_4df9e9a771513277aaf5c7a4d8182666.webp"
width="760"
height="267"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ol start="2">
&lt;li>&lt;strong>Support for Reproducing Remote Datasets&lt;/strong>: If a remote dataset is specified inside the metadata file, the RS program fetches the dataset from the specified link using &lt;code>wget&lt;/code>, stores the remote dataset inside the crate, and updates the path in the new command line it generates.&lt;/li>
&lt;/ol>
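&lt;p>A rough sketch of that remote-dataset step (using Python&amp;rsquo;s urllib in place of the &lt;code>wget&lt;/code> call the service actually uses; the directory layout is illustrative):&lt;/p>

```python
import os
import urllib.request

def fetch_remote_dataset(url, crate_dir, subdir="remote_dataset"):
    """Download a remote dataset referenced in the crate metadata and
    return the local path to substitute into the new command line."""
    dest_dir = os.path.join(crate_dir, subdir)
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, local_path)   # stands in for wget
    return local_path
```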
&lt;h2 id="challenges-and-end-term-goals">Challenges and End-Term Goals&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Support for DATA_PERSISTENCE_FALSE&lt;/strong>: The RS program still needs to support crates with &lt;code>dataPersistence&lt;/code> set to false. After weeks of brainstorming ideas on how to implement this, we recently concluded that, since the majority of &lt;code>DATA_PERSISTENCE_FALSE&lt;/code> crates are run on SLURM clusters and the dataset that must be fetched in such a case resides somewhere inside the cluster, the RS program will support this case for such clusters. Currently, I am working with the Nord3v2 cluster to further enhance the functionality of ReproducibilityService.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Chameleon Cluster Setup&lt;/strong>: I have made some progress towards creating a new COMPSs 3.3 Appliance on Chameleon to test the service. However, creating the cluster setup script needed for the service to run on a COMPSs 3.3.1 cluster to execute large experiments has been challenging.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Integrating with COMPSs Repository&lt;/strong>: After completing the support for &lt;code>dataPersistence&lt;/code> false cases, we aim to launch this service as a tool inside the &lt;a href="https://github.com/bsc-wdc/compss" target="_blank" rel="noopener">COMPSs repository&lt;/a>. This will be a significant milestone in my developer journey as it will be the first real-world project I have worked on, and I hope everything goes smoothly.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for the next blog!!&lt;/p></description></item><item><title>Enhancing h5bench with HDF5 Compression Capability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240731-henryz/</link><pubDate>Sat, 27 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240731-henryz/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench">h5bench&lt;/a> project, my project &lt;a href="https://summerofcode.withgoogle.com/myprojects/details/n0H28Z40" target="_blank" rel="noopener">Enhancing h5bench with HDF5 Compression Capability&lt;/a>, under the mentorship of Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and Dr. Suren Byna, aims to allow users of h5bench to incorporate compression features in their simulations by creating custom benchmarks with common scientific lossless &amp;amp; lossy compression algorithms such as SZ, SZ3, ZFP, and GZIP.&lt;/p>
&lt;p>The problem I am trying to solve is to implement multiple data compression algorithms in h5bench&amp;rsquo;s core access patterns through HDF5 filters. This capability should grant users the flexibility to configure the parameters and methods of compression applied to their datasets according to their specific needs and preferences. My solution primarily involves using a user-defined HDF5 filter mechanism to implement lossless and lossy compression algorithms, such as ZFP, SZ, and cuSZ. Throughout the process, I will deliver one C source file implementing compression configuration settings, one C source file implementing lossless and lossy algorithms, a set of performance reports from before and after data compression in CSV and standard output files, and technical documentation on the h5bench user manual website.&lt;/p>
&lt;h1 id="midterm-blog">Midterm Blog&lt;/h1>
&lt;p>This summer, after completing my junior year, I was honored to have the opportunity to work with Dr. Jean Luca Bez and Dr. Suren Byna on h5bench, an open-source benchmarking project designed to simulate running sync/async HDF5 I/O on HPC machines. This post will cover mostly what I have learned, produced, and planned, along with my thoughts over the first six weeks.&lt;/p>
&lt;p>First of all, let&amp;rsquo;s define some of the terms here. HDF5 stands for Hierarchical Data Format 5. Unlike other data storage formats (JSON, CSV, XML&amp;hellip;), HDF5 is not only a container that manages data similarly to a file system, but also a powerful library that gives you the ability to perform I/O (Input/Output) operations between memory and file. One of the reasons this tool is commonly used by HPC applications is that it also supports MPI I/O, a protocol for parallel computing (you can think of it as the parallel counterpart of POSIX I/O). With exabytes of data and high frequencies of usage for analysis in scientific studies, HDF5 is perfect for the job. Essentially, h5bench is software that benchmarks the hardware&amp;rsquo;s performance through HDF5 (it also provides other benchmark kernels such as AMReX, E3SM-IO, MACSio, and openPMD-api, but my job focuses on using vanilla HDF5 I/O).&lt;/p>
&lt;p>So, what have I done so far? First, my job is to allow users to tune input parameters regarding data compression, and to make sure h5bench prints accurate benchmark results with the intended compression algorithm applied to their datasets. h5bench&amp;rsquo;s frontend is written in Python; it takes a JSON file from the user and parses it into a CFG configuration file that can later be read by the backend, which is written in C. I created a new enum struct that lets the user specify one of a range of compression algorithms (SZ3, ZFP, LZ4, GZIP, and other pre-defined algorithms). I also made it possible to apply these algorithms to the datasets, so the .h5 file (an HDF5 file) contains chunks of compressed data after multiple H5Dwrite calls.&lt;/p>
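&lt;p>As an illustration, a user&amp;rsquo;s JSON input might look something like the following (the key names here are hypothetical, for the sake of the example, and are not h5bench&amp;rsquo;s actual schema):&lt;/p>

```json
{
  "benchmark": "write",
  "file": "test.h5",
  "configuration": {
    "COMPRESS": "YES",
    "COMPRESS_FILTER": "SZ3",
    "NUM_PARTICLES": "8 M",
    "TIMESTEPS": "5"
  }
}
```

&lt;p>The frontend would translate a section like &lt;code>configuration&lt;/code> into the CFG file the C backend reads.&lt;/p>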
&lt;p>Next, the challenges and gains. Throughout the first six weeks, 30% of the time was spent on understanding the newest version of h5bench and HDF5 by reading through C source code and documentation, and asking many dumb questions of my mentors (thanks for their patience and great answers :D). Writing code is fairly easy once I really understood what the program is doing; by that I mean you have to understand every line in almost all functions and how each and every variable changes. 40% of the time was used on debugging and testing the compression algorithm, mainly SZ3. Making the code behave correctly is another level of difficulty. Most of the issues resulted from failing to configure the application and dependent libraries correctly: without the necessary macros enabled during the build process, features like the compression filter plugin will not run. As I was also new to CMake and the HPC environment, I learned that environment variables are reset for every new session, even if you requested a compute node resource. Besides getting used to the standard build sequence &amp;ldquo;cmake ..&amp;rdquo;, &amp;ldquo;make&amp;rdquo;, &amp;ldquo;make install&amp;rdquo;, I also learned to use &amp;ldquo;ccmake ..&amp;rdquo; to examine the flags of the compiled program. The rest of the time I learned more about parallel computing, HDF5, and compression algorithms by reading papers and documentation. A lot of notes were taken (I must say a good note-taking system is a game changer). Last but not least, I also spent time synchronizing online and offline with my mentors to discuss problems. Without their help, I could never have made it this far.&lt;/p>
&lt;p>My next phase will tackle the following problems:&lt;/p>
&lt;ul>
&lt;li>Test applying filter with other compression algorithms, and with different dimension layout of the dataset&lt;/li>
&lt;li>Add decompression capability&lt;/li>
&lt;li>Allow users to tune the auxiliary parameters (&lt;code>cd_nelmts&lt;/code>, &lt;code>cd_values[]&lt;/code>) that control the behavior of a given compression filter, e.g. &lt;code>H5Pset_filter(COMPRESS_INFO.dcpl_id, H5Z_FILTER_SZ3, H5Z_FLAG_MANDATORY, 0, NULL);&lt;/code>&lt;/li>
&lt;li>Print additional benchmark results to indicate what and how the compression filter is applied, and the compression ratio&lt;/li>
&lt;/ul></description></item><item><title>Final Blog: FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240918-peiranqin/</link><pubDate>Sat, 27 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240918-peiranqin/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peiran-qin/">Peiran Qin&lt;/a>, a CS student at the University of Chicago. This summer I worked on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/">FetchPipe: Data Science Pipeline for ML-based Prefetching&lt;/a> under the mentorship of Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction policies and ML-based prefetchers under this simulator. Through this project, we make the following contributions and gain several insights to share with the community:&lt;/p>
&lt;ol>
&lt;li>We built a simulator to evaluate various prefetchers under a unified framework, using production-level traces from Alibaba, Microsoft Research, and Tencent.&lt;/li>
&lt;li>Through the evaluation, we discover several downsides that existing heuristic-based prefetchers encounter.&lt;/li>
&lt;li>We draw several insights that can guide the future prefetchers&amp;rsquo; design.&lt;/li>
&lt;/ol>
&lt;h2 id="methodology">Methodology&lt;/h2>
&lt;p>In the first half of the SoR project, I mainly focused on &lt;strong>building the I/O prefetcher simulator&lt;/strong>. The simulator should mimic real OS-level prefetching as closely as possible. First, we developed a mechanism that mimics users sending I/O requests to the underlying system. Then, we simulated the process of page division and memory management inside the system. Finally, we designed a sleep-based mechanism to mimic the I/O latency of backend storage. The resulting system can simulate the data path of I/O requests and prefetching in real systems, and collect crucial metrics such as hit rate, total prefetched data, bandwidth usage, prefetch accuracy, total cache evictions, etc.&lt;/p>
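&lt;p>In outline, the simulator&amp;rsquo;s core loop looks something like this (a toy sketch under simplifying assumptions: an LRU cache, a next-k readahead prefetcher, and no latency model; the real simulator is far more detailed):&lt;/p>

```python
from collections import OrderedDict

class CacheSim:
    """Tiny LRU page cache with a pluggable prefetcher; tracks hit rate
    and prefetch accuracy in the spirit of the metrics described above."""
    def __init__(self, capacity, prefetcher):
        self.cache = OrderedDict()        # page -> was_prefetched_and_unused
        self.capacity = capacity
        self.prefetcher = prefetcher      # callable: page -> iterable of pages
        self.hits = self.misses = 0
        self.prefetched = self.prefetch_hits = 0

    def _insert(self, page, prefetched):
        self.cache[page] = prefetched
        self.cache.move_to_end(page)
        while len(self.cache) > self.capacity:
            self.cache.popitem(last=False)            # evict LRU page

    def access(self, page):
        if page in self.cache:
            self.hits += 1
            if self.cache[page]:
                self.prefetch_hits += 1               # prefetched data was used
                self.cache[page] = False
            self.cache.move_to_end(page)
        else:
            self.misses += 1
            self._insert(page, prefetched=False)
        for p in self.prefetcher(page):               # issue prefetches
            if p not in self.cache:
                self.prefetched += 1
                self._insert(p, prefetched=True)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

def next_k(k):
    """Simplest readahead-style heuristic: always prefetch the next k pages."""
    return lambda page: range(page + 1, page + 1 + k)
```

&lt;p>Replaying a trace through &lt;code>access()&lt;/code> yields hit rate, prefetch volume, and prefetch accuracy for whichever prefetcher is plugged in.&lt;/p>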
&lt;p>In the second half of the SoR project, I concentrated on the &lt;strong>evaluation of existing prefetchers&lt;/strong>. First, we surveyed existing state-of-the-art prefetchers and divided them into two categories: (1) heuristic-based prefetchers and (2) ML-based prefetchers. Next, for each category, we picked several representative prefetchers and implemented them within our simulator. Then, we evaluated those prefetchers using over 600 production-level traces from Alibaba, Tencent, and Microsoft Research. Finally, we analyzed the performance of those prefetchers and discovered some interesting insights that might guide future prefetchers&amp;rsquo; design.&lt;/p>
&lt;p>Finally, building on the achievements of the SoR project, I will stay involved in this interesting project with Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. We are leveraging the insights we gained to build an I/O prefetcher that mitigates the downsides of existing prefetchers.&lt;/p>
&lt;h2 id="insights">Insights&lt;/h2>
&lt;p>Based on our experiments on the existing prefetchers, we would like to share the following insights:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Heuristic-based prefetchers, including Linux Readahead and the Stride prefetcher, rely on strict pre-defined rules and detect only straightforward access patterns. These prefetchers are too conservative to recognize increasingly complex access patterns. In particular, in real-world applications, sequential accesses are interleaved with random accesses, creating a level of complexity that Linux Readahead and Stride prefetchers struggle to recognize.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Offline learning-based prefetchers learn access patterns by training machine learning models on pre-collected historical access traces. Blessed by the representational power of machine learning, these prefetchers excel at recognizing complex access patterns. However, their effectiveness is constrained by their dependence on the patterns encountered during offline training, making them less adaptable to previously unseen patterns in online scenarios. Moreover, because they do not rely on pre-defined prefetching rules, offline learning-based prefetchers are more prone to prefetching useless data, which causes cache pollution and extra pressure on backend storage.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>We argue that a good prefetcher under today&amp;rsquo;s complex and changing workloads should have three properties: (1) complexity recognition: the prefetcher should be able to recognize the complex access patterns of a complex workload; (2) reliability: the prefetcher should reduce the likelihood of prefetching useless data and causing cache pollution; (3) adaptability: the prefetcher should adapt itself to the changing workload.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="future-works">Future Works&lt;/h2>
&lt;p>Based on the above insights, we are now designing our own prefetchers that can mitigate the downsides of existing prefetchers. We will make our code public after we finalize our design.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Through the SoR project, I delved into the research area of I/O prefetching by reproducing related works, characterizing their performance, and designing our own prefetcher. We contribute to the community a comprehensive simulator, evaluation results for related prefetchers, and insights that can guide future prefetchers&amp;rsquo; design. In the future, I will continue working in the research area of prefetching and keep making contributions.&lt;/p>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peiran-qin/">Peiran Qin&lt;/a>, a CS student at the University of Chicago, currently working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/">FetchPipe: Data Science Pipeline for ML-based Prefetching&lt;/a> under the mentorship of Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction policies and ML-based prefetchers under this simulator.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Existing prefetching algorithms can be categorized into (a) heuristic-based methods, such as the Linux readahead prefetcher, and (b) machine learning-based methods, such as Long Short-Term Memory (LSTM) models. However, there is a research gap in comprehensively comparing all existing ML solutions, such as Leap and the LSTM prefetcher, under a consistent evaluation setup. To ensure the fairness of evaluations, it is essential to integrate all baselines and our prefetcher into a homogeneous evaluation environment. Additionally, there is a need to evaluate cache eviction algorithms under prefetching scenarios.&lt;/p>
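&lt;p>As a concrete example of the heuristic family, a stride prefetcher predicts ahead once it sees a constant-stride access pattern; a toy version (illustrative only, not the implementation used in this project) looks like this:&lt;/p>

```python
class StridePrefetcher:
    """Toy stride prefetcher: once the same non-zero stride is observed
    twice in a row, predict the next `depth` addresses along that stride."""
    def __init__(self, depth=2):
        self.depth = depth
        self.last = None
        self.stride = None
        self.confirmed = False

    def access(self, addr):
        predictions = []
        if self.last is not None:
            stride = addr - self.last
            # the stride is "confirmed" only when seen twice consecutively
            self.confirmed = (stride != 0 and stride == self.stride)
            self.stride = stride
        if self.confirmed:
            predictions = [addr + self.stride * i
                           for i in range(1, self.depth + 1)]
        self.last = addr
        return predictions
```

&lt;p>Its weakness is exactly the gap described above: a single random access resets the detector, so interleaved sequential/random streams defeat it.&lt;/p>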
&lt;p>Therefore, in this project, we aim to build a fair simulator, deploy state-of-the-art prefetchers and cache eviction algorithms onto this platform, and then evaluate them using comprehensive metrics. The state-of-the-art prefetchers we consider include Pythia (MICRO'21), SGDP (arXiv), and the Markov-Chain prefetcher. For cache eviction algorithms, we consider S3FIFO (SOSP'23) and SIEVE (NSDI'24). Our focus is on implementing these algorithms on our simulator and evaluating their performance using block storage datasets from Alibaba, Tencent, and MSR. Besides evaluating the prefetchers and eviction algorithms individually, we also aim to combine prefetchers with cache eviction algorithms to test overall performance.&lt;/p>
&lt;h2 id="current-progress">Current Progress&lt;/h2>
&lt;p>In the past one and a half months, I have focused on (1) implementing our Python simulator and (2) deploying state-of-the-art prefetchers and cache eviction algorithms on this simulator. The implementation phase is now complete. The detailed progress is as follows:&lt;/p>
&lt;ol>
&lt;li>The Python simulator for evaluating both ML-based and heuristic-based prefetchers and cache-eviction algorithms is done.&lt;/li>
&lt;li>Evaluation metrics collection, such as hit rate, total prefetched data, prefetch overhead, and prefetch accuracy, is implemented in the simulator.&lt;/li>
&lt;li>Two ML-based prefetchers, SGDP and Pythia, as well as the Markov-Chain prefetcher, are deployed on the simulator. SGDP is a graph neural network based prefetcher, and Pythia is a reinforcement learning based prefetcher.&lt;/li>
&lt;li>State-of-the-art heuristic-based eviction algorithms are implemented in the simulator, including S3FIFO and SIEVE.&lt;/li>
&lt;/ol>
&lt;p>With the simulator and state-of-the-art ML-based prefetchers and eviction algorithms in place, the next steps are to (1) organize a large-scale dataset (including over 600 traces from real storage servers) for testing performance and (2) evaluate the implemented prefetchers and eviction algorithms on this dataset. Finally, I will analyze the evaluation results and provide insights from the experimental outcomes. For the ML-based prefetchers, I will analyze both ML-related metrics such as accuracy and F1-score, and system metrics such as hit rate and various overheads.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>The biggest challenge is implementing existing prefetchers correctly and fairly. Since some state-of-the-art prefetchers are designed for DRAM prefetching, adapting them for SSD prefetching in the simulator is challenging. Additionally, the lack of source code for some works makes it difficult to reproduce their algorithms accurately based solely on their paper descriptions.&lt;/p></description></item><item><title>Halfway Blog: FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240723-xikangsong/</link><pubDate>Tue, 23 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240723-xikangsong/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hi, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xikang-song/">Xikang Song&lt;/a>, a 2024 SoR contributor to the project, working with mentors &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a>. Our FSA-Benchmark project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. We will benchmark a range of machine learning algorithms, from traditional to advanced methods, and compare the results using a comprehensive evaluation system. This will provide a clear view of how machine learning impacts critical error detection in RAID systems.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Fail-slow issues in storage systems, where a disk operates at a significantly reduced speed without completely failing, are subtle and can manifest as consistently higher latency compared to peer disks or as recurrent abnormal latency spikes. These issues are challenging to detect but can significantly degrade overall system performance over time. Fixed thresholds are ineffective because latency distributions vary across different clusters, leading to thresholds that are either too low or too high, resulting in numerous false alerts. Therefore, we are enthusiastic about using machine learning models to analyze disk performance data. Machine learning algorithms can deeply learn the trends in the data, providing better detection capabilities.&lt;/p>
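&lt;p>The peer-comparison idea can be sketched in a few lines (a toy illustration only; the ratio and window thresholds are made up, and our actual detectors are ML-based):&lt;/p>

```python
from statistics import median

def fail_slow_suspects(latencies, ratio=2.0, min_windows=3):
    """Flag disks whose per-window latency exceeds the peer median by
    `ratio` in at least `min_windows` windows.

    `latencies` maps disk id -> list of per-window average latencies,
    with the same number of windows for every disk."""
    n_windows = len(next(iter(latencies.values())))
    strikes = {disk: 0 for disk in latencies}
    for w in range(n_windows):
        peer_median = median(lat[w] for lat in latencies.values())
        for disk, lat in latencies.items():
            if lat[w] > ratio * peer_median:
                strikes[disk] += 1      # slower than peers in this window
    return [d for d, s in strikes.items() if s >= min_windows]
```

&lt;p>Comparing against the per-window peer median is what sidesteps the fixed-threshold problem: the baseline moves with each cluster&amp;rsquo;s own latency distribution.&lt;/p>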
&lt;h2 id="current-progress-and-challenges">Current Progress and Challenges&lt;/h2>
&lt;h3 id="algorithm-implementation">Algorithm Implementation:&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Cost-Sensitive Ranking Model&lt;/strong>: Inspired by the paper &amp;ldquo;Improving Service Availability of Cloud Systems by Predicting Disk Error&amp;rdquo; presented at the USENIX ATC &amp;lsquo;18 conference, this model ranks disks based on fail-slow risk.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Multi-Prediction Models&lt;/strong>: Drawing from &amp;ldquo;Improving Storage System Reliability with Proactive Error Prediction&amp;rdquo; presented at the USENIX ATC &amp;lsquo;17 conference, this approach uses multiple traditional machine learning models to evaluate disk health using diverse features. Various models were tested, with the Random Forest classifier proving most effective.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>LSTM Model&lt;/strong>: This model employs Long Short-Term Memory (LSTM) networks, trained on the first day&amp;rsquo;s data for each cluster and evaluated on data spanning all days. It captures temporal dependencies to accurately predict fail-slow anomalies over time.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="comprehensive-evaluation">Comprehensive Evaluation:&lt;/h3>
&lt;ol>
&lt;li>Collected outputs from all algorithms on Chameleon for Perseus data A to Y (25 clusters).&lt;/li>
&lt;li>Parsed the outputs through a comprehensive evaluation system, recording the true/false positives/negatives.&lt;/li>
&lt;li>Plotted heat maps to show precision and recall with different look-back days and alert threshold settings.&lt;/li>
&lt;li>Compared the performance across different clusters to draw conclusions.&lt;/li>
&lt;/ol>
&lt;h3 id="packaging-code">Packaging Code:&lt;/h3>
&lt;ul>
&lt;li>Packaged all the code into a Trovi Jupyter notebook, including the Chameleon server setup, to provide clear steps for running the code and reproducing the experiments. All algorithm testing and result parsing can be easily done here.&lt;/li>
&lt;/ul>
&lt;h3 id="challenges">Challenges&lt;/h3>
&lt;p>Initially, I was unsure how to evaluate the performance of different algorithms. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> provided comprehensive guidance on collecting all the results uniformly and parsing them to gather true/false positives/negatives. This approach enabled us to derive meaningful metrics and plot heatmaps for precision and recall. I learned the scientific method of benchmarking performance, and I am grateful for the guidance.&lt;/p>
&lt;h2 id="future-steps">Future Steps&lt;/h2>
&lt;h3 id="further-investigation-of-advanced-algorithms">Further Investigation of Advanced Algorithms&lt;/h3>
&lt;p>We plan to explore advanced algorithms such as PatchTST. This will involve systematically collecting outputs and conducting comprehensive benchmarking to assess their performance in identifying fail-slow anomalies.&lt;/p>
&lt;h3 id="transition-to-large-language-models-llms">Transition to Large Language Models (LLMs)&lt;/h3>
&lt;p>Recognizing the limitations of traditional machine learning methods, we intend to transition to utilizing Large Language Models (LLMs). LLMs have demonstrated superior capabilities in understanding complex patterns and making accurate predictions. We anticipate that incorporating LLMs into our analysis will enhance our ability to detect and predict fail-slow anomalies more accurately, leading to better overall system reliability.&lt;/p></description></item><item><title>Exploring Throttling Bugs in HDFS: Reproducing Developer Fixes</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240722-shuangliang/</link><pubDate>Mon, 22 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240722-shuangliang/</guid><description>&lt;p>Scalability is a critical concern for large-scale distributed systems like the Hadoop Distributed File System (HDFS). Throttling bugs, which affect the system&amp;rsquo;s ability to manage data transfer rates effectively, can lead to performance issues and system instability. In my recent work, I focused on reproducing the effects of two specific throttling bugs in HDFS, which were fixed by developers. This blog provides an overview of these bugs and the process of reproducing their effects to validate the fixes.&lt;/p>
&lt;h1 id="hdfs-17087-missing-throttler-in-dataxceiverreadblock">HDFS-17087: Missing Throttler in DataXceiver#readBlock&lt;/h1>
&lt;p>One of the throttling bugs I explored was HDFS-17087. The DataXceiver#readBlock function in HDFS lacked a throttler, resulting in unregulated data reads. This absence could lead to potential performance degradation under heavy loads. The developer fixed this issue by adding a throttler to regulate the data transfer rate. In my work, I reproduced the bug and observed the system&amp;rsquo;s behavior both before and after applying the developer&amp;rsquo;s patch. The results showed a significant improvement in stability and performance post-fix.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./Figure1.png" alt="Figure 1" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="hdfs-17216-incorrect-data-rate-calculation">HDFS-17216: Incorrect Data Rate Calculation&lt;/h1>
&lt;p>Another crucial bug was HDFS-17216. The issue stemmed from the use of integer division in the getBytesPerSec function, which caused incorrect speed calculations and failed to trigger the throttle, resulting in overspeed. The developer addressed this by switching from integer to float for calculating the elapsed time, ensuring accurate speed measurements. I reproduced the conditions that highlighted the bug&amp;rsquo;s effects and compared the system&amp;rsquo;s performance with and without the fix. The post-fix results confirmed that the throttling mechanism worked correctly, effectively preventing overspeed.&lt;/p>
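&lt;p>To make the bug class concrete, here is a minimal Python sketch (not the actual HDFS Java code; the constant and function names are illustrative) of why integer division defeats a bytes-per-second check over sub-second intervals:&lt;/p>

```python
# Hedged sketch (not HDFS source): why integer division breaks a rate check.
# Assume a throttler that computes bytes-per-second from elapsed nanoseconds.

NANOS_PER_SEC = 1_000_000_000

def bytes_per_sec_buggy(total_bytes, elapsed_nanos):
    # Integer division truncates: for any interval shorter than one second,
    # elapsed_nanos // NANOS_PER_SEC is 0, the rate degenerates to 0, and
    # the throttler never observes an overspeed condition.
    elapsed_secs = elapsed_nanos // NANOS_PER_SEC
    if elapsed_secs == 0:
        return 0
    return total_bytes // elapsed_secs

def bytes_per_sec_fixed(total_bytes, elapsed_nanos):
    # Floating-point division keeps sub-second precision.
    elapsed_secs = elapsed_nanos / NANOS_PER_SEC
    return total_bytes / elapsed_secs

# 64 MiB transferred in half a second is really 128 MiB/s:
print(bytes_per_sec_buggy(64 * 1024 * 1024, NANOS_PER_SEC // 2))  # 0
print(bytes_per_sec_fixed(64 * 1024 * 1024, NANOS_PER_SEC // 2))  # 134217728.0
```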
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./Figure2.png" alt="Figure 2" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Reproducing these throttling bugs and validating the developer fixes was a vital step in understanding their impact on HDFS&amp;rsquo;s scalability. The improvements observed in system stability and performance underscore the importance of accurate throttling mechanisms. This work contributes to the broader effort of maintaining robust and scalable distributed systems, ensuring they can handle increasing loads efficiently.&lt;/p></description></item><item><title>Trovi redesign process and low fidelity prototype in Figma</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleontroviredesign/20240722-aliciaem/</link><pubDate>Mon, 22 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleontroviredesign/20240722-aliciaem/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/alicia-esquivel-morel/">Alicia Esquivel Morel&lt;/a>, and I&amp;rsquo;m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I&amp;rsquo;m working on a project to &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/trovi/">improve user experience reproducibility through a redesign of TROVI&lt;/a>, as part of the Summer of Reproducibility (SoR) program. I&amp;rsquo;m excited to be working with two fabulous mentors, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kate-keahey/">Kate Keahey&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>.&lt;/p>
&lt;h2 id="research-reproducibility-with-a-trovi-redesign">Research Reproducibility with a TROVI Redesign&lt;/h2>
&lt;p>As researchers, we constantly face challenges replicating experiments due to limitations in current tools. &lt;a href="https://chameleoncloud.readthedocs.io/en/latest/technical/sharing.html" target="_blank" rel="noopener">TROVI&lt;/a>, a platform designed to facilitate experiment replication, can be hindered by hard-to-follow interfaces and difficulties integrating code and data. This leads to confusion and frustration.&lt;/p>
&lt;p>My SoR project tackles these issues by redesigning TROVI to enhance user experience reproducibility. Imagine a user-friendly platform where uploading code, sharing data, and collaborating with colleagues become easy and straightforward.&lt;/p>
&lt;h2 id="the-redesigns-goals">The Redesign&amp;rsquo;s Goals&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Enhanced User Experience:&lt;/strong> Inspired by user-friendly platforms like Google Colab, we&amp;rsquo;ll simplify TROVI&amp;rsquo;s interface for intuitive navigation and ease of use.&lt;/li>
&lt;li>&lt;strong>Uploads and Sharing:&lt;/strong> Uploading code and data, as well as collaborating with researchers, are key goals. Integration with platforms like GitHub will further streamline collaboration.&lt;/li>
&lt;li>&lt;strong>Continuous Improvement:&lt;/strong> A built-in feedback loop will allow users to provide input and suggestions, ensuring TROVI constantly evolves based on user needs.&lt;/li>
&lt;/ul>
&lt;h2 id="progress-i-have-made-so-far">Progress I have made so far&lt;/h2>
&lt;p>The first stage of my project began with conducting User Experience (UX) research and identifying user requirements for TROVI. I then conducted a literature review on reproducibility platforms to learn about efficient methodologies and platforms for reproducibility. This helped establish a clearer project scope. Additionally, I analyzed TROVI end-user feedback to understand redesign needs.&lt;/p>
&lt;p>In summary, during the first weeks of the project, I focused on research and requirements gathering, including the literature review on state-of-the-art reproducibility platforms. Before the midterm assessment, my work also involved the redesign process, prioritizing improved usability and user experience. I designed wireframes following requirements and user feedback and later translated them into low-fidelity prototypes. Front-end and back-end considerations were made, such as selecting a front-end framework (Vue.js) and a collaborative design tool (Figma).&lt;/p>
&lt;h2 id="what-do-i-plan-to-do-over-the-next-weeks">What do I plan to do over the next weeks?&lt;/h2>
&lt;p>During the next two weeks, I will address challenges encountered in the design process and make the necessary adjustments to ensure the success of the next steps of the project. A higher-fidelity prototype will be completed, including connections between the different objects and frames. This will facilitate the creation of a front-end with multiple flows in the prototype. Additionally, it will provide a preview of the end-user experience through the design process, without requiring the back-end to be functional or connected yet. I&amp;rsquo;m also investigating design tool API integrations to access TROVI&amp;rsquo;s APIs, which will give us the ability to access and isolate the properties associated with any TROVI artifact.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>I&amp;rsquo;m halfway through the redesign process. Next steps will include the integration of the backend and frontend components to create a cohesive and functional system. We will also facilitate initial user interactions and testing to gather valuable feedback and ensure that the system meets the needs and expectations of end users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>In addition, as I progress, my focus will shift towards enhancing the user experience and refining the final product based on the feedback received. The final two weeks of the program will be dedicated to this critical phase, where I will apply user experience techniques and conduct thorough testing to polish the product. This period will involve close analysis and iteration to address any issues and optimize functionality.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>By the end of the program, I aim to deliver a functional and user-friendly product that not only meets the initial project goals but also exceeds user expectations.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Stay tuned to see how TROVI is built for reproducible research!!&lt;/strong>&lt;/p></description></item><item><title>Data Engineering and Automated Evaluation for OpenROAD's Chat Assistant: Midterm Update</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/</link><pubDate>Sun, 21 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/</guid><description>&lt;p>Hello everyone! We&amp;rsquo;ve reached the halfway point of our Google Summer of Code 2024 journey, and it&amp;rsquo;s time for an update on our project to build a conversational chat assistant for OpenROAD. Under the guidance of our mentors, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>, we&amp;rsquo;re making significant strides in enhancing OpenROAD&amp;rsquo;s user support capabilities.&lt;/p>
&lt;h2 id="project-focus">Project Focus&lt;/h2>
&lt;p>My project focuses on two crucial aspects of our chat assistant:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Engineering&lt;/strong>: Ensuring our assistant has access to comprehensive and relevant information.&lt;/li>
&lt;li>&lt;strong>Evaluation&lt;/strong>: Developing robust methods to assess and improve the assistant&amp;rsquo;s performance.&lt;/li>
&lt;/ol>
&lt;p>The ultimate goal is to create a more responsive and accurate chat assistant capable of aiding users with troubleshooting, installation, and general queries about OpenROAD. I&amp;rsquo;m working in tandem with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>, who is developing the RAG architecture for our assistant.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>Since our initial deployment, I&amp;rsquo;ve been concentrating on implementing automated evaluation systems for our RAG architecture. We&amp;rsquo;ve developed two primary evaluation methods:&lt;/p>
&lt;h3 id="basic-abbreviation-evaluation">Basic Abbreviation Evaluation&lt;/h3>
&lt;p>This method assesses the model&amp;rsquo;s ability to accurately identify and explain common abbreviations used within the OpenROAD community. It ensures that our assistant can effectively communicate using domain-specific terminology.&lt;/p>
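&lt;p>A minimal Python sketch of this check, assuming a small reference glossary of expansions (the entries below are illustrative OpenROAD-domain terms, not our full evaluation set):&lt;/p>

```python
# Hedged sketch of the abbreviation evaluation: compare the assistant's
# expansion of each abbreviation against a reference glossary.
# The glossary entries here are illustrative, not the project's actual set.
GLOSSARY = {
    "CTS": "clock tree synthesis",
    "STA": "static timing analysis",
    "DRC": "design rule check",
}

def abbreviation_accuracy(answers, glossary=GLOSSARY):
    """answers maps each abbreviation to the model's free-text expansion."""
    hits = sum(
        1 for abbr, expansion in glossary.items()
        if expansion.lower() in answers.get(abbr, "").lower()
    )
    return hits / len(glossary)
```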
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Figure 1: Flow Chart of Basic Abbreviation Evaluation" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure1_basic_abbreviation_evaluation_hud808c3411b9bf24258c9c6d4950618ae_122195_7793f2944668d59749f48f3848acfba7.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure1_basic_abbreviation_evaluation_hud808c3411b9bf24258c9c6d4950618ae_122195_c0340ef0448a8f440bce5566986a10ef.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure1_basic_abbreviation_evaluation_hud808c3411b9bf24258c9c6d4950618ae_122195_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure1_basic_abbreviation_evaluation_hud808c3411b9bf24258c9c6d4950618ae_122195_7793f2944668d59749f48f3848acfba7.webp"
width="469"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Examples" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure2_sample_examples_hu78acdf5642b62d8d730a2574d861f211_90900_f04196ec40b94ffced2a574cbd37ad44.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure2_sample_examples_hu78acdf5642b62d8d730a2574d861f211_90900_1a776103bd42be9525343172ad16d2a2.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure2_sample_examples_hu78acdf5642b62d8d730a2574d861f211_90900_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure2_sample_examples_hu78acdf5642b62d8d730a2574d861f211_90900_f04196ec40b94ffced2a574cbd37ad44.webp"
width="760"
height="431"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="llm-judge-based-evaluation">LLM Judge-Based Evaluation&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Figure 2: Flow Chart of LLM Judge-Based Evaluation" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure3_llm_judge_evaluation_hu385b71952e0ff054a9e7c96b25e3d452_269894_8dfc4bba33d8ad8d797f27f1c7a1eaaf.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure3_llm_judge_evaluation_hu385b71952e0ff054a9e7c96b25e3d452_269894_6ef7c0153c7e61298bbf98aa15f5d69d.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure3_llm_judge_evaluation_hu385b71952e0ff054a9e7c96b25e3d452_269894_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure3_llm_judge_evaluation_hu385b71952e0ff054a9e7c96b25e3d452_269894_8dfc4bba33d8ad8d797f27f1c7a1eaaf.webp"
width="689"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>For this more comprehensive evaluation, we:&lt;/p>
&lt;ol>
&lt;li>Prepared a dataset of question-answer pairs relevant to OpenROAD.&lt;/li>
&lt;li>Queried our model with these questions to generate answers.&lt;/li>
&lt;li>Employed LLMs (including GPT-4o and Gemini 1.5 Flash) to act as judges.&lt;/li>
&lt;li>Evaluated our model&amp;rsquo;s responses against ground truth answers.&lt;/li>
&lt;/ol>
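&lt;p>The loop above can be sketched as follows; &lt;code>ask_judge&lt;/code> stands in for whichever judge LLM is queried (GPT-4o or Gemini 1.5 Flash), and the 1-to-5 rubric is illustrative rather than our exact prompt:&lt;/p>

```python
# Hedged sketch of the judge-based loop. `answer_fn` queries our model;
# `ask_judge` wraps whichever LLM API acts as judge. Prompt wording and the
# 1-5 scale are illustrative assumptions, not the project's actual rubric.
JUDGE_TEMPLATE = """You are grading an assistant's answer about OpenROAD.
Question: {question}
Ground-truth answer: {reference}
Assistant's answer: {candidate}
Reply with a single integer from 1 (wrong) to 5 (fully correct)."""

def judge_pairs(pairs, answer_fn, ask_judge):
    """pairs: list of (question, reference) tuples from the QA dataset."""
    scores = []
    for question, reference in pairs:
        candidate = answer_fn(question)
        prompt = JUDGE_TEMPLATE.format(
            question=question, reference=reference, candidate=candidate)
        reply = ask_judge(prompt)
        # Keep the first bare integer in the judge's reply; None if absent.
        digits = [int(tok) for tok in reply.split() if tok.isdigit()]
        scores.append(digits[0] if digits else None)
    return scores
```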
&lt;p>Here&amp;rsquo;s a glimpse of our early benchmark results:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Benchmark" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure4_model_performance_comparison_hu7cf4636aada9c277e08b0256b02e5dd8_206498_06ea37525851a60dad5bd072a03cd329.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure4_model_performance_comparison_hu7cf4636aada9c277e08b0256b02e5dd8_206498_d9a11b8b08e2634c01f9063cc78ab134.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure4_model_performance_comparison_hu7cf4636aada9c277e08b0256b02e5dd8_206498_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure4_model_performance_comparison_hu7cf4636aada9c277e08b0256b02e5dd8_206498_06ea37525851a60dad5bd072a03cd329.webp"
width="760"
height="701"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Example" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure5_sample_examples_hu7826377a5560a22c15aefc27866894bb_575795_f63055fd0281e09d0ef800e1e444c7f9.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure5_sample_examples_hu7826377a5560a22c15aefc27866894bb_575795_91c683a3ebadbf3ce5a21099a81b1836.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure5_sample_examples_hu7826377a5560a22c15aefc27866894bb_575795_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure5_sample_examples_hu7826377a5560a22c15aefc27866894bb_575795_f63055fd0281e09d0ef800e1e444c7f9.webp"
width="577"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="exploratory-data-analysis-eda-on-github-openroad-issues">Exploratory Data Analysis (EDA) on GitHub OpenROAD issues&lt;/h2>
&lt;p>To gather more data, I performed Exploratory Data Analysis (EDA) on GitHub OpenROAD issues using GitHub&amp;rsquo;s GraphQL API. This allowed us to:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Filter data based on parameters such as:&lt;/p>
&lt;ul>
&lt;li>Minimum number of comments&lt;/li>
&lt;li>Date range&lt;/li>
&lt;li>Mentioned PRs&lt;/li>
&lt;li>Open or closed status&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Structure the data, focusing on issues tagged with Build, Query, Installation, and Runtime.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Process the data into JSONL format with key fields including:&lt;/p>
&lt;ul>
&lt;li>&lt;code>url&lt;/code>: URL of the GitHub issue&lt;/li>
&lt;li>&lt;code>id&lt;/code>: Unique issue number&lt;/li>
&lt;li>&lt;code>title&lt;/code>: Issue title&lt;/li>
&lt;li>&lt;code>author&lt;/code>: Username of the issue creator&lt;/li>
&lt;li>&lt;code>description&lt;/code>: Initial issue description&lt;/li>
&lt;li>&lt;code>content&lt;/code>: Array of messages related to the issue&lt;/li>
&lt;li>&lt;code>category&lt;/code>: General category of the issue&lt;/li>
&lt;li>&lt;code>subcategory&lt;/code>: More specific category of the issue&lt;/li>
&lt;li>&lt;code>tool&lt;/code>: Relevant tools or components&lt;/li>
&lt;li>&lt;code>date&lt;/code>: Issue creation timestamp&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Figure 5: Sample structure of our processed JSONL data" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure6_jsonl_data_structure_hua8b5c15add3c3268be11381a14b4e3cd_555437_fd103ea5ef1fa131b8bc806db99a24d1.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure6_jsonl_data_structure_hua8b5c15add3c3268be11381a14b4e3cd_555437_c30d5d185fec144cfca686499f464f19.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure6_jsonl_data_structure_hua8b5c15add3c3268be11381a14b4e3cd_555437_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure6_jsonl_data_structure_hua8b5c15add3c3268be11381a14b4e3cd_555437_fd103ea5ef1fa131b8bc806db99a24d1.webp"
width="692"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>After curating this dataset, I was able to run an analysis of OpenROAD GitHub issues, identifying multiple categories of issues, visualized below as a pie chart.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Figure 6: Distribution of OpenROAD issue types" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure7_issue_type_distribution_hu54d4d8b580cc8ae9261f464a2e9181da_130314_d788906f5395b26ab2030fb056e45941.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure7_issue_type_distribution_hu54d4d8b580cc8ae9261f464a2e9181da_130314_ebae2b4145d035c9521679314911236b.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure7_issue_type_distribution_hu54d4d8b580cc8ae9261f464a2e9181da_130314_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure7_issue_type_distribution_hu54d4d8b580cc8ae9261f464a2e9181da_130314_d788906f5395b26ab2030fb056e45941.webp"
width="760"
height="504"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Figure 7: Breakdown of issues by specific OpenROAD tools" srcset="
/report/osre24/ucsd/openroad/20240621-aviral/figure8_issues_by_openroad_tools_hu9e03c2c37c64c392ac5e783cdb492b5c_174856_3af195a89fadc1379452709cdea50d22.webp 400w,
/report/osre24/ucsd/openroad/20240621-aviral/figure8_issues_by_openroad_tools_hu9e03c2c37c64c392ac5e783cdb492b5c_174856_e171fcc132e7c13ef62f2a192ed18b62.webp 760w,
/report/osre24/ucsd/openroad/20240621-aviral/figure8_issues_by_openroad_tools_hu9e03c2c37c64c392ac5e783cdb492b5c_174856_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240621-aviral/figure8_issues_by_openroad_tools_hu9e03c2c37c64c392ac5e783cdb492b5c_174856_3af195a89fadc1379452709cdea50d22.webp"
width="760"
height="511"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="looking-ahead">Looking Ahead&lt;/h2>
&lt;p>As we move into the second half of the GSOC period, our plans include:&lt;/p>
&lt;ul>
&lt;li>Incorporating GitHub Discussions data into our knowledge base.&lt;/li>
&lt;li>Utilizing this expanded dataset to enhance our RAG architecture.&lt;/li>
&lt;li>Continually refining and improving our model&amp;rsquo;s performance based on evaluation results.&lt;/li>
&lt;/ul>
&lt;p>We&amp;rsquo;re excited about the progress we&amp;rsquo;ve made and look forward to delivering an even more capable and helpful chat assistant for the OpenROAD community. Stay tuned for more updates as we continue this exciting journey!&lt;/p></description></item><item><title>Halfway Through SoR24: Building a Scalable Performance Benchmarking Tool for Genomics Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/</link><pubDate>Sun, 21 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Hi! I&amp;rsquo;m Martin Putra, and I&amp;rsquo;m working on the &amp;ldquo;Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster&amp;rdquo; project under the supervision of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>. We are building GenScale, a scalable benchmarking tool for genomics workloads that leverages an industrial-grade cluster manager and monitoring systems. GenScale will allow us to generate performance data under a setup that is representative of large-scale production settings. Ultimately, we hope GenScale and the datasets it produces will catalyze engagement between the computer systems and bioinformatics communities, thus accelerating the pace of discovery in both fields.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;p>We have built a prototype using Kubernetes as the cluster manager and Prometheus for monitoring. In its current state, the prototype can support an arbitrary number of compute nodes, owing to Kubernetes’ notable scaling capability. This provides a suitable environment for small- to mid-scale experiments. We leverage ChameleonCloud to provide the necessary computational and reproducibility infrastructure. The monitoring system supports cluster-level, node-level, and container-level metrics collection and failure detection. We integrated Grafana dashboards for visualizations.&lt;/p>
&lt;p>The prototype also supports the execution of user-defined workflows. During the design process, we considered integrating one of the existing workflow execution systems, such as &lt;a href="https://github.com/common-workflow-language/cwltool" target="_blank" rel="noopener">cwltool&lt;/a>, &lt;a href="https://www.nextflow.io" target="_blank" rel="noopener">Nextflow&lt;/a>, or &lt;a href="https://github.com/broadinstitute/cromwell" target="_blank" rel="noopener">Cromwell&lt;/a>. Each system has its own pros and cons within the context of how we envision GenScale. However, we ultimately decided to build our own workflow execution system in order to provide maximum flexibility for the capabilities we plan to add in the future. For example, we believe it will be interesting to study how hardware heterogeneity affects the performance of each application in the workflow (a well-known workflow scheduling problem). Studying that problem requires the capability to schedule execution on specific machines. In addition, if we want to study contention, we may need to execute on machines that are currently running specific workflows, too. While there are ways to achieve this with an existing workflow execution system on top of Kubernetes, we believe it will be greatly simplified by building our own workflow execution system.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_8dedba866f2dae2e3c155c6037bb3c4c.webp 400w,
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_0134d07c43c3857435ab5c59f410ed7f.webp 760w,
/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/dnaseq-exec_time_proportion_hu2a99ec14fc56f180a344028699f1df1c_255588_8dedba866f2dae2e3c155c6037bb3c4c.webp"
width="760"
height="497"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 1.&lt;/strong> Proportion of execution time for DNA Alignment applications, executed on Chameleon&amp;rsquo;s &lt;em>cascadelake_r&lt;/em> node with 1500MB paired-end input. &lt;strong>y-axis:&lt;/strong> proportion of application&amp;rsquo;s exec. time out of the whole workflow&amp;rsquo;s exec. time, &lt;strong>x-axis:&lt;/strong> top 10 applications accounting for 97% exec. time, sorted by proportion. Other applications are aggregated.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>We confirmed GenScale’s capability to produce useful data by executing a DNA alignment workflow and capturing its runtime resource usage. We use &lt;a href="https://github.com/NCI-GDC/gdc-dnaseq-cwl" target="_blank" rel="noopener">Genomics Data Commons’ (GDC) DNA alignment workflow&lt;/a> as a reference, which has a total of 27 applications ranging from quality check, read trimming, actual alignment, and indexing to various metrics collection. We wrote our own simplified version of the workflow by first analyzing the execution time &amp;amp; resource usage of each application; we then chose the 10 applications that represent 97% of the workflow execution time. We took into account that containerization is the de-facto standard for workflow execution in the bioinformatics community. Thus, we packaged each application as its own separate container, then hosted their Dockerfiles &amp;amp; containers in a private GitHub Container Registry (GHCR). We plan to make them public in the future. Our monitoring system is able to show resource usage in real time. We also built sidecar containers that use the Unix pidstat utility to generate a CSV of core, memory, and storage utilization throughout each workflow’s execution. This will allow easier analysis and data sharing for GenScale’s users.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th style="text-align:center">
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_7dea08952ec6bc07cee0579c31500d17.webp 400w,
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_0a311fc327ad5f4a739e574c86795b70.webp 760w,
/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240721-martinputra/bwa_picardwgs_picardvalidate-cpu_hu75193f344cb2afdf6e001b1bc5f51540_1054163_7dea08952ec6bc07cee0579c31500d17.webp"
width="760"
height="209"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td style="text-align:center">&lt;strong>Figure 2.&lt;/strong> CPU utilization pattern of &lt;a href="https://github.com/lh3/bwa" target="_blank" rel="noopener">BWA&lt;/a>, &lt;a href="https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard" target="_blank" rel="noopener">Picard&amp;rsquo;s CollectWGSMetrics&lt;/a>, and &lt;a href="https://gatk.broadinstitute.org/hc/en-us/articles/360036854731-ValidateSamFile-Picard" target="_blank" rel="noopener">Picard&amp;rsquo;s ValidateSamFile&lt;/a> collected by &lt;em>GenScale&lt;/em>. &lt;strong>y-axis&lt;/strong>: &lt;em>(num. cores) x 100%&lt;/em>, &lt;strong>x-axis&lt;/strong>: time elapsed in seconds.&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>One technical challenge is automating the creation of the Kubernetes cluster and keeping it alive. We believe GenScale’s users would be interested in the performance of workflows under dynamic cluster sizes, whether due to intentional scaling or machine failures. While the current prototype supports creating a cluster with an arbitrary number of nodes, there are still steps that require a reboot when adding nodes, so cluster creation and horizontal scaling are not fully automated yet. Keeping a cluster alive is also expensive. Since we use ChameleonCloud as our testbed, we have a choice of either keeping the cluster alive at the cost of significant service unit (SU) usage, or saving SUs by terminating our leases at the cost of rebuilding the cluster from scratch later. We chose a middle ground by keeping only Kubernetes’ control plane alive. The approach has worked well so far.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>For the remaining weeks, we plan to work on the second workflow, namely &lt;a href="https://github.com/NCI-GDC/gdc-rnaseq-cwl" target="_blank" rel="noopener">RNA Alignment&lt;/a>. We would also like to add simple user interfaces if time permits. Finally, we plan to package GenScale’s source code, container images, and sample benchmark results for the open-source community. We look forward to the second half of Summer of Reproducibility!&lt;/p></description></item><item><title>Midterm Blog: ML in Detecting and Addressing System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/</link><pubDate>Sun, 21 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/</guid><description>&lt;p>Hello! I&amp;rsquo;m Joanna! Over the past month, I have been contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">ML in Detecting and Addressing System Drift&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>. My project aims to design a pipeline to evaluate drift detection algorithms on system traces. The goal is to characterize different drifts, understand how they affect model performance, and evaluate the performance of state-of-the-art (SOTA) drift detection algorithms.&lt;/p>
&lt;h1 id="progress">Progress&lt;/h1>
&lt;p>Here is some background on my project: model drift, or the degradation of model performance, is typically caused by data drift, a shift in the input distribution, or concept drift, a change in the relationship between input and output. The project aims to detect both data drifts and concept drifts, analyze them, and use the findings to improve model performance in computer systems.&lt;/p>
&lt;p>Over the past month, I’ve primarily been constructing a data drift dataset from the Tencent I/O block trace, which includes both drift and non-drift data. By combining offline drift detection algorithms such as Maximum Mean Discrepancy, Cramér-von Mises, and Kolmogorov-Smirnov, I am developing a dataset that contains segments with and without drifts for features such as IOPS (Input/Output Operations Per Second), read/write size ratio, write size, and other relevant performance metrics. The diagrams below illustrate the data segments identified with and without drifts, respectively.&lt;/p>
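&lt;p>As a simplified illustration of how such offline detectors are applied, here is a Python sketch that flags windows of a metric series (e.g. IOPS) whose distribution differs from the preceding window, using the two-sample Kolmogorov-Smirnov test; the window size and significance level are illustrative choices, not the tuned values used in the project:&lt;/p>

```python
# Hedged sketch of the offline step described above: compare each window of
# a metric series against the preceding window with the two-sample
# Kolmogorov-Smirnov test. Window size and alpha are illustrative.
import numpy as np
from scipy.stats import ks_2samp

def drift_segments(series, window=500, alpha=0.01):
    """Return the start indices of windows flagged as drifted."""
    flagged = []
    for start in range(window, len(series) - window + 1, window):
        reference = series[start - window:start]
        current = series[start:start + window]
        _, p_value = ks_2samp(reference, current)
        if p_value < alpha:  # distributions differ beyond chance
            flagged.append(start)
    return flagged

# A synthetic series with a mean shift halfway through gets flagged there:
rng = np.random.default_rng(42)
series = np.concatenate([rng.normal(0.0, 1.0, 2000), rng.normal(3.0, 1.0, 2000)])
print(drift_segments(series))
```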
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Drift Data" srcset="
/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_3e192a352fe3c303df3195f4c92fe970.webp 400w,
/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_174dedff6f5c0b72b3d10b82cb9d1a86.webp 760w,
/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/drift_hucf4bc0bd843fb60be6a646f8116e435c_702790_3e192a352fe3c303df3195f4c92fe970.webp"
width="760"
height="752"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Non-Drift Data" srcset="
/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_3d133279bb8e32779c8b94396ec0ef5d.webp 400w,
/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_c450f1394af9661bb2d86f0232d340d1.webp 760w,
/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/nondrift_hue8c599cc81a10cc02482d676af5d2cf8_455606_3d133279bb8e32779c8b94396ec0ef5d.webp"
width="760"
height="757"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>In addition to constructing the datasets, I have begun evaluating some online drift detection algorithms and designing metrics to assess their performance. I have tested the performance of online drift detection algorithms such as Online Maximum Mean Discrepancy and Online Cramér-von Mises under various settings, including different window lengths and sensitivity levels. The following diagrams illustrate the drift points detected for the IOPS feature under these different settings.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_51a49130183b2dfa7ad977d297aa0f3b.webp 400w,
/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_a3c8bf9bdc160fc4323a73bc0ac837b7.webp 760w,
/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240721-joanna/evaluation_hu21317d2d7888d01b0cbee9e7a13940af_724895_51a49130183b2dfa7ad977d297aa0f3b.webp"
width="760"
height="584"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
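&lt;p>The role of window length and sensitivity in the online setting can be illustrated with a toy detector (a simplified mean-shift sketch, not one of the SOTA algorithms actually under evaluation):&lt;/p>

```python
from collections import deque

class SlidingWindowDetector:
    """Toy online drift detector: compares the mean of a recent window
    against a reference window and flags drift when the gap exceeds
    `sensitivity` reference standard deviations."""
    def __init__(self, window=50, sensitivity=3.0):
        self.window = window
        self.sensitivity = sensitivity
        self.reference = deque(maxlen=window)
        self.recent = deque(maxlen=window)

    def update(self, x):
        if len(self.reference) < self.window:
            self.reference.append(x)
            return False
        self.recent.append(x)
        if len(self.recent) < self.window:
            return False
        mean_r = sum(self.reference) / self.window
        var_r = sum((v - mean_r) ** 2 for v in self.reference) / self.window
        std_r = max(var_r ** 0.5, 1e-9)
        mean_t = sum(self.recent) / self.window
        if abs(mean_t - mean_r) > self.sensitivity * std_r:
            # Drift: re-anchor the reference on the new regime.
            self.reference = deque(self.recent, maxlen=self.window)
            self.recent.clear()
            return True
        return False

det = SlidingWindowDetector(window=50, sensitivity=3.0)
stream = [100.0] * 200 + [160.0] * 200   # abrupt IOPS shift at t=200
drift_points = [t for t, x in enumerate(stream) if det.update(x)]
print(drift_points)   # → [200, 250]
```

A shorter window or lower sensitivity reacts faster but raises more false alarms — exactly the trade-off the evaluation sweeps over.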
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>Here are my plans for the next month:&lt;/p>
&lt;ul>
&lt;li>Complete the experiments on data drift and generate improved visualizations to summarize the performance of these online drift detection algorithms, including their overhead and accuracy over time.&lt;/li>
&lt;li>Characterize drifts by identifying the types of drifts that lead to model performance degradation.&lt;/li>
&lt;li>Evaluate drift detection algorithms in the context of concept drifts.&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for my future updates on this project!&lt;/p></description></item><item><title>Enabling VAA Execution: Environment and VAA Preparation and/or Reproducibility for Dynamic Bandwidth Allocation (CONCIERGE)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/edgerep/20240720-rafaelsw/</link><pubDate>Sat, 20 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/edgerep/20240720-rafaelsw/</guid><description>&lt;p>Hi there!&lt;/p>
&lt;p>I am Rafael Sinjunatha Wulangsih, a Telecommunication Engineering graduate from the Bandung Institute of Technology (ITB), Bandung, Indonesia. I&amp;rsquo;m currently contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep">&amp;ldquo;EdgeRep: Reproducing and benchmarking edge analytic systems&amp;rdquo;&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> and Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/junchen-jiang/">Junchen Jiang&lt;/a>. You can find more details about the project proposal &lt;a href="https://drive.google.com/file/d/1GUMiglFqezOqEeQiMaL4QVgsXZOHYoEK/view?usp=drive_link" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>This project addresses the challenges posed by the massive deployment of edge devices, such as traffic or security cameras, in smart cities and other environments. In the previous Edgebench project, the team proposed a solution to dynamically allocate bandwidth and compute resources to video analytic applications (VAAs) running on edge devices. However, that project was limited to a single VAA, which may not represent the diverse applications running on edge devices. Therefore, the main goal of this project, &amp;ldquo;EdgeRep,&amp;rdquo; is to diversify the VAAs running on edge devices while utilizing a solution similar to that of the Edgebench project. EdgeRep aims to reproduce state-of-the-art self-adaptive VAAs (with seven candidates) and maintain self-adaptation in these video analytics pipelines. We will implement it ourselves if the video analytics applications do not support self-adaptation.&lt;/p></description></item><item><title>Halfway Through GSOC: Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/</link><pubDate>Sat, 20 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mahdi-banisharifdehkordi/">Mahdi Banisharifdehkordi&lt;/a>, a Ph.D. student in Computer Science at Iowa State University. I&amp;rsquo;m currently working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/">AIIO / Graph Neural Network&lt;/a> project under the guidance of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> and Suren Byna. 
Our project focuses on enhancing the AIIO framework to automatically diagnose I/O performance bottlenecks in high-performance computing (HPC) systems using Graph Neural Networks (GNNs).&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>Our primary goal is to tackle the persistent issue of I/O bottlenecks in HPC applications. Identifying these bottlenecks manually is often labor-intensive and prone to errors. By integrating GNNs into the AIIO framework, we aim to create an automated solution that can diagnose these bottlenecks with high accuracy, ultimately improving the efficiency and reliability of HPC systems.&lt;/p>
&lt;h1 id="progress-and-challenges">Progress and Challenges&lt;/h1>
&lt;p>Over the past few weeks, my work has been centered on developing a robust data pre-processing pipeline. This pipeline is crucial for converting raw I/O log data into a graph format suitable for GNN analysis. The data pre-processing involves extracting relevant features from Darshan I/O logs, which include job-related information and performance metrics. One of the main challenges has been dealing with the heterogeneity and sparsity of the data, which can affect the accuracy of our models. To address this, we&amp;rsquo;ve focused on using correlation analysis to identify and select the most relevant features, ensuring that the dataset is well-structured and informative for GNN processing.&lt;/p>
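&lt;p>The correlation-based feature selection can be sketched roughly as follows (the counter names, synthetic data, and 0.95 cutoff are illustrative, not the actual AIIO configuration):&lt;/p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# Hypothetical stand-ins for Darshan-style counters (names illustrative).
df = pd.DataFrame({"POSIX_READS": rng.poisson(50, n)})
df["POSIX_BYTES_READ"] = df["POSIX_READS"] * 4096 + rng.normal(0, 10, n)  # near-duplicate
df["POSIX_WRITES"] = rng.poisson(30, n)
df["runtime"] = rng.normal(5, 1, n)

corr = df.corr().abs()
# Keep one representative from each highly correlated pair (|r| > 0.95),
# using only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = df.drop(columns=to_drop)
print(sorted(to_drop))   # → ['POSIX_BYTES_READ']
```

Redundant counters are pruned before graph construction, which keeps the node feature set compact and informative.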
&lt;p>We&amp;rsquo;ve also started constructing the GNN model. The model is designed to capture the complex relationships between different I/O operations and their impact on system performance. This involves defining nodes and edges in the graph that represent job IDs, counter types, and their values. We explored different graph structures, including those that focus on counter types and those that incorporate more detailed information. While more detailed graphs offer better accuracy, they also require more computational resources.&lt;/p>
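&lt;p>A minimal sketch of the graph construction idea (the job IDs and counter names are hypothetical; the real pipeline derives them from Darshan logs and converts the result to tensor form for the GNN):&lt;/p>

```python
# Hypothetical per-job Darshan-style counters (names illustrative only).
jobs = {
    "job_17": {"POSIX_READS": 1200, "POSIX_WRITES": 300, "MPIIO_COLL_READS": 40},
    "job_18": {"POSIX_READS": 15, "POSIX_WRITES": 9000},
}

def build_bipartite_graph(jobs):
    """Build an edge-list graph with job nodes on one side and
    counter-type nodes on the other; the counter value becomes the
    edge weight — a heterogeneous structure a GNN can consume once
    converted to tensors."""
    job_nodes = sorted(jobs)
    counter_nodes = sorted({c for counters in jobs.values() for c in counters})
    edges = []   # (job, counter, weight)
    for job, counters in jobs.items():
        for counter, value in counters.items():
            edges.append((job, counter, value))
    return job_nodes, counter_nodes, edges

job_nodes, counter_nodes, edges = build_bipartite_graph(jobs)
print(len(job_nodes), len(counter_nodes), len(edges))   # → 2 3 5
```

Richer variants add more node types or edge attributes, which is the accuracy-versus-compute trade-off described above.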
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Overview" srcset="
/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_efa6bf6f7434ca74fff6a35fcb540861.webp 400w,
/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_de1d11a65f3f46dfd75b1bc00e8e6406.webp 760w,
/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/overview_hu3b8a0374313d077c49f26c894c548b00_437453_efa6bf6f7434ca74fff6a35fcb540861.webp"
width="760"
height="566"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="current-achievements">Current Achievements&lt;/h1>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Data Pre-processing Pipeline&lt;/strong>: We have successfully developed and tested the pipeline to transform Darshan I/O logs into graph-structured data. This was a significant milestone, as it sets the foundation for all subsequent GNN modeling efforts.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>GNN Model Construction&lt;/strong>: The initial version of our GNN model has been implemented. This model is now capable of learning from the graph data and making predictions about I/O performance bottlenecks.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Correlation Analysis for Graph Structure Design&lt;/strong>: We have used correlation analysis on the dataset to understand the relationships between I/O counters. This analysis has been instrumental in designing a more effective graph structure, helping to better capture the dependencies and interactions critical for accurate performance diagnosis.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Correlation Analysis1" srcset="
/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_211eb50374f4febd5aee688644797792.webp 400w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_fd5992e42a60d6cb85be9cd136a5d93b.webp 760w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/correlation1_huf0ba9e5fcd08c89560bf3e668ac22994_763024_211eb50374f4febd5aee688644797792.webp"
width="760"
height="614"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Correlation Analysis2" srcset="
/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_b05324cc90f73bd1b2ff53c9d2d04ecb.webp 400w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_0115179de349c5834c2b3fc2636ecd23.webp 760w,
/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240720-mahdi/correlation2_hu550eeb7f303ef774f36732146058c5a3_277912_b05324cc90f73bd1b2ff53c9d2d04ecb.webp"
width="760"
height="309"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ol start="4">
&lt;li>&lt;strong>Training for Different Graph Structures&lt;/strong>: We are currently training our model using various graph structures to determine the most effective configuration for accurate I/O performance diagnosis. This ongoing process aims to refine our approach and improve the model&amp;rsquo;s predictive accuracy.&lt;/li>
&lt;/ol>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>Looking ahead, we plan to focus on several key areas:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Refinement and Testing&lt;/strong>: We&amp;rsquo;ll continue refining the GNN model, focusing on improving its accuracy and efficiency. This includes experimenting with different graph structures and training techniques.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>SHAP Analysis&lt;/strong>: To enhance the interpretability of our model, we&amp;rsquo;ll incorporate SHAP (SHapley Additive exPlanations) values. This will help us understand the contribution of each feature to the model&amp;rsquo;s predictions, making it easier to identify critical factors in I/O performance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Documentation and Community Engagement&lt;/strong>: As we make progress, we&amp;rsquo;ll document our methods and findings, sharing them with the broader community. This includes contributing to open-source repositories and engaging with other researchers in the field.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>This journey has been both challenging and rewarding, and I am grateful for the support and guidance from my mentors and the community. I look forward to sharing more updates as we continue to advance this exciting project.&lt;/p></description></item><item><title>Hardware Hierarchical Dynamical Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/</link><pubDate>Sat, 20 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/</guid><description>&lt;p>Hi everyone! I am Ujjwal Shekhar, a Computer Engineering student at the International Institute of Information Technology - Hyderabad. I am excited to share my current progress on the project titled &amp;ldquo;Hardware Hierarchical Dynamical Systems&amp;rdquo; as part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/osre/">Open Source Research Experience (OSRE) program&lt;/a> and &lt;a href="https://summerofcode.withgoogle.com/" target="_blank" rel="noopener">Google Summer of Code&lt;/a>. I am working with my mentors, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>, on this project.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>With hardware compilers, it is not uncommon for the size of code that the hardware compilers need to handle to go into millions. We aim to improve the efficiency of the tree data structure to be used for representing the Abstract Syntax Tree (AST) of the input program. The tree data structure is optimized for typical AST traversal and queries. Some queries that are made to this tree are much more frequent than others.&lt;/p>
&lt;p>Thus, the goal of this project is to be able to optimize the tree for frequent queries while still providing support for other infrequent queries. We use Google Bench to benchmark the tree for scalability and performance and expect it to outperform the current version of the tree. Finally, the new version of the tree will be integrated into the LiveHD core repository.&lt;/p>
&lt;h1 id="progress-and-challenges">Progress and Challenges&lt;/h1>
&lt;p>Over the past month and a half, I have successfully finished working on the add/append methods of the tree. Moreover, I have finished writing the iterators on the tree too. There are preliminary tests already in place and the HHDS repository now has a working Bazel build system.&lt;/p>
&lt;p>As shown in the figure, the tree went from storing pointers for every relationship it could to storing pointers only to the nodes that are absolutely necessary. Moreover, by not maintaining multiple levels in the tree, we have reduced its memory footprint. This is a significant improvement over the LHtree that was used earlier.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Gradual improvements from a classical way of storing the tree" srcset="
/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/intro_pic_osre_mid_term_blog_hub4e071864eb1075945911ddf73245a76_149696_ee6f0bd20e8720764ff6513360229a8a.webp 400w,
/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/intro_pic_osre_mid_term_blog_hub4e071864eb1075945911ddf73245a76_149696_3c453b165782d0f857e35219165cfb9c.webp 760w,
/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/intro_pic_osre_mid_term_blog_hub4e071864eb1075945911ddf73245a76_149696_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/intro_pic_osre_mid_term_blog_hub4e071864eb1075945911ddf73245a76_149696_ee6f0bd20e8720764ff6513360229a8a.webp"
width="760"
height="495"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Furthermore, we have improved the cache friendliness of each node of the tree. Observing that new children are usually added soon after their parent, we store children in contiguous memory whenever possible, or reach them via a short delta from the parent node. This lets us pack the bookkeeping for up to 8 children into a single 512-bit word, which has excellent cache-alignment properties.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Bookkeeping in a 512-bit Tree_pointer word" srcset="
/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/tree_pointers_pic_osre_mid_term_blog_hu51661e0f756b9d3138ea612a8422b121_152644_0cacd0b06553a3da1528ebbf73c1d4c5.webp 400w,
/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/tree_pointers_pic_osre_mid_term_blog_hu51661e0f756b9d3138ea612a8422b121_152644_fb9fc4bd787dfbae81d4afd9a5d85401.webp 760w,
/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/tree_pointers_pic_osre_mid_term_blog_hu51661e0f756b9d3138ea612a8422b121_152644_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240720-ujjwalshekhar/tree_pointers_pic_osre_mid_term_blog_hu51661e0f756b9d3138ea612a8422b121_152644_0cacd0b06553a3da1528ebbf73c1d4c5.webp"
width="760"
height="186"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
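&lt;p>The flat-layout idea can be modelled roughly in Python (the actual HHDS tree is written in C++ and packs child bookkeeping into 512-bit words; this sketch only mirrors the array-based structure that replaces pointer chasing):&lt;/p>

```python
class FlatTree:
    """Illustrative model: all nodes live in parallel flat arrays, and a
    child added right after its parent lands in adjacent memory, so a
    preorder traversal walks mostly contiguous indices."""
    def __init__(self, root_data):
        self.data = [root_data]
        self.parent = [-1]        # index of parent node
        self.first_child = [-1]   # index of first child, -1 if leaf
        self.next_sibling = [-1]

    def add_child(self, parent_idx, data):
        idx = len(self.data)
        self.data.append(data)
        self.parent.append(parent_idx)
        self.first_child.append(-1)
        self.next_sibling.append(-1)
        if self.first_child[parent_idx] == -1:
            self.first_child[parent_idx] = idx
        else:
            cur = self.first_child[parent_idx]
            while self.next_sibling[cur] != -1:
                cur = self.next_sibling[cur]
            self.next_sibling[cur] = idx
        return idx

    def preorder(self, idx=0):
        yield self.data[idx]
        child = self.first_child[idx]
        while child != -1:
            yield from self.preorder(child)
            child = self.next_sibling[child]

t = FlatTree("module")
a = t.add_child(0, "always")
t.add_child(a, "assign_x")
t.add_child(a, "assign_y")
t.add_child(0, "wire")
print(list(t.preorder()))   # → ['module', 'always', 'assign_x', 'assign_y', 'wire']
```

Because the common AST traversal is preorder, laying nodes out in roughly that order is what makes the frequent queries cheap.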
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;ul>
&lt;li>Finished working on the add/append methods of the tree.&lt;/li>
&lt;li>Finished writing the iterators on the tree.&lt;/li>
&lt;li>Preliminary tests are in place.&lt;/li>
&lt;li>HHDS repository now has a working Bazel build system.&lt;/li>
&lt;/ul>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;ul>
&lt;li>Working out a new plan: The initial plan was to use a flattening policy to optimize the tree for frequent queries. However, this plan has been revised: instead of a tour-based flattening policy, we flatten the tree while still storing pointers to various nodes, ensuring that it can still support infrequent queries.&lt;/li>
&lt;li>Benchmarking: The benchmarking of the tree is still in progress. I am working on creating a benchmarking suite that will be able to test the tree for scalability and performance. This will allow future developers to test the tree for performance and scalability after they make changes.&lt;/li>
&lt;/ul>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>From here, a lot of testing and benchmarking is still left to be done. Moreover, we need to add the delete methods and make sure that the integration with the LiveHD core repository is smooth. The next steps involve:&lt;/p>
&lt;ul>
&lt;li>Adding the delete methods to the tree.&lt;/li>
&lt;li>Benchmarking the tree for scalability and performance.&lt;/li>
&lt;li>Ensuring that the syntax of the tree is in line with the LiveHD core repository.&lt;/li>
&lt;li>Integrating the tree into the LiveHD core repository.&lt;/li>
&lt;li>Adding documentation to the tree.&lt;/li>
&lt;li>Integrating the testing of the tree into the LiveHD testing suite.&lt;/li>
&lt;/ul>
&lt;h1 id="conclusions">Conclusions&lt;/h1>
&lt;p>My experience so far has been amazing. I have been able to work on a project that is at the intersection of hardware and software. Moreover, I have been able to work with a team that is very supportive and has been able to guide me through the project. I am looking forward to the next steps and am excited to see the final version of the tree in the LiveHD core repository.&lt;/p>
&lt;h1 id="acknowledgements">Acknowledgements&lt;/h1>
&lt;p>I would like to thank my mentors, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> for their guidance and support throughout the project. It would not have been possible without their help.&lt;/p></description></item><item><title>Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240720-kraislaik/</link><pubDate>Sat, 20 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240720-kraislaik/</guid><description>&lt;p>Hello there! I&amp;rsquo;m Acheme and I&amp;rsquo;m thrilled to share the progress on my project, &amp;ldquo;Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream&amp;rdquo; under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a> under the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/scistream/">SciStream&lt;/a> project.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.&lt;/p>
&lt;h1 id="progress">Progress&lt;/h1>
&lt;p>One of the first steps in the project was consulting with SciStream team members at Argonne to identify use cases in scientific streaming applications and the typical traffic profiles they represent. The goal was to simulate these profiles using traffic generator tools and by configuring network resources on the FABRIC/Chameleon testbeds. The following traffic profiles were identified as covering many use cases, including one of ESnet’s broad categorizations, “The Time-Sensitive Pattern”, in integrated research workflows:&lt;/p>
&lt;ol>
&lt;li>Throughput-intensive startup&lt;/li>
&lt;li>Intermittent burst of traffic for a duration of time&lt;/li>
&lt;li>Constant rate traffic&lt;/li>
&lt;li>Latency-sensitive traffic&lt;/li>
&lt;/ol>
&lt;p>Since data streaming applications have some unique requirements for optimum performance, the following metrics were selected as important for testing streaming performance.&lt;/p>
&lt;ol>
&lt;li>Latency&lt;/li>
&lt;li>Jitter&lt;/li>
&lt;li>Packet loss / message loss&lt;/li>
&lt;li>Throughput&lt;/li>
&lt;/ol>
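&lt;p>A sketch of how the constant-rate and bursty profiles can be driven with iperf3 (the flags are standard iperf3 CLI options; the host address, rates, and timings are placeholders, and the commands are printed rather than executed):&lt;/p>

```python
def constant_rate_cmd(server, rate_mbps, duration_s):
    """One iperf3 UDP client invocation pinned at a constant bitrate,
    emitting machine-readable JSON results."""
    return f"iperf3 -c {server} -u -b {rate_mbps}M -t {duration_s} --json"

def bursty_schedule(base_mbps, burst_mbps, period_s, burst_s, total_s):
    """Alternate base-rate and burst-rate intervals; each entry is
    (start_time, rate_mbps, interval_length), run as one iperf3 call."""
    schedule, t = [], 0
    while t < total_s:
        quiet = min(period_s - burst_s, total_s - t)
        schedule.append((t, base_mbps, quiet))
        t += quiet
        if t < total_s:
            burst = min(burst_s, total_s - t)
            schedule.append((t, burst_mbps, burst))
            t += burst
    return schedule

print(constant_rate_cmd("10.0.0.2", 500, 60))
for start, rate, length in bursty_schedule(100, 900, period_s=30, burst_s=5, total_s=60):
    print(f"t={start:>2}s: {constant_rate_cmd('10.0.0.2', rate, length)}")
```

The intermittent-burst profile thus reduces to a sequence of short constant-rate runs at alternating bitrates.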
&lt;p>Subsequently, about seventeen open-source traffic generator applications were identified and compared to select a few that could generate our defined traffic profiles and expose the desired performance metrics.
We ultimately settled on iperf3 and pvaPy (a scientific streaming application developed at Argonne National Lab).
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Traffic generator selection" srcset="
/report/osre24/anl/scistream/20240720-kraislaik/traffic_gen_table_hu1de203c8f0a7ec60f5933544490e9409_166049_e6d6efa6a70c4df7ca728d23dc563c54.webp 400w,
/report/osre24/anl/scistream/20240720-kraislaik/traffic_gen_table_hu1de203c8f0a7ec60f5933544490e9409_166049_26e55280da7d2ab0a44eba5e07d9d657.webp 760w,
/report/osre24/anl/scistream/20240720-kraislaik/traffic_gen_table_hu1de203c8f0a7ec60f5933544490e9409_166049_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240720-kraislaik/traffic_gen_table_hu1de203c8f0a7ec60f5933544490e9409_166049_e6d6efa6a70c4df7ca728d23dc563c54.webp"
width="760"
height="667"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>So far, the first set of benchmarking tools using iperf3 as the traffic generator has been developed for the constant-rate and intermittent-burst profiles. The tools generate traffic, collect the metrics that iperf3 exposes (including throughput, jitter, and datagram loss), and save them to a CSV file for further analysis. A Jupyter notebook is used to set up a FABRIC slice and configure a four-node experiment suitable for benchmarking the SciStream base architecture. After running the experiments on the FABRIC nodes and collecting the results in a CSV file, notebook cells were written to analyze the data.
The analysis includes the average, minimum, maximum, and standard deviation of each metric.&lt;/p>
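&lt;p>The metric-collection step can be sketched as follows (the JSON layout mirrors iperf3&amp;rsquo;s &lt;code>--json&lt;/code> output, trimmed to the fields used here, and the numbers are made up):&lt;/p>

```python
import csv, io, json, statistics

# A trimmed UDP run in iperf3's --json layout (values are made up).
sample = json.loads("""{
  "intervals": [
    {"sum": {"bits_per_second": 9.8e8, "jitter_ms": 0.021}},
    {"sum": {"bits_per_second": 9.5e8, "jitter_ms": 0.034}},
    {"sum": {"bits_per_second": 9.9e8, "jitter_ms": 0.018}}
  ]
}""")

def summarize(report, metric):
    """Average, min, max, and standard deviation of one per-interval metric."""
    values = [iv["sum"][metric] for iv in report["intervals"]]
    return {"avg": statistics.mean(values), "min": min(values),
            "max": max(values), "std": statistics.stdev(values)}

# Write one row per metric, mirroring the CSV the notebook analyzes.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["metric", "avg", "min", "max", "std"])
for metric in ("bits_per_second", "jitter_ms"):
    s = summarize(sample, metric)
    writer.writerow([metric, s["avg"], s["min"], s["max"], s["std"]])
print(buf.getvalue())
```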
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Average performance analysis" srcset="
/report/osre24/anl/scistream/20240720-kraislaik/average_analysis_hub26a1bd8be2bfada2c95935ad89433f1_90557_28252818b6ed679b5bf748d6df51a729.webp 400w,
/report/osre24/anl/scistream/20240720-kraislaik/average_analysis_hub26a1bd8be2bfada2c95935ad89433f1_90557_c45bee687ef4762a87c2455789146763.webp 760w,
/report/osre24/anl/scistream/20240720-kraislaik/average_analysis_hub26a1bd8be2bfada2c95935ad89433f1_90557_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240720-kraislaik/average_analysis_hub26a1bd8be2bfada2c95935ad89433f1_90557_28252818b6ed679b5bf748d6df51a729.webp"
width="760"
height="728"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Minimum performance analysis" srcset="
/report/osre24/anl/scistream/20240720-kraislaik/min_hue760cbb8515e83e802b6292f194d0407_90680_ab7516f0f11772bbbfee6b83998545a1.webp 400w,
/report/osre24/anl/scistream/20240720-kraislaik/min_hue760cbb8515e83e802b6292f194d0407_90680_031545108de74ea71be7a8c9f221286d.webp 760w,
/report/osre24/anl/scistream/20240720-kraislaik/min_hue760cbb8515e83e802b6292f194d0407_90680_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240720-kraislaik/min_hue760cbb8515e83e802b6292f194d0407_90680_ab7516f0f11772bbbfee6b83998545a1.webp"
width="760"
height="739"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Maximum performance analysis" srcset="
/report/osre24/anl/scistream/20240720-kraislaik/max_hud1db05134f576847a7f30efa01c43981_86991_8c9589d9aceb571572d05f6f1c20f03b.webp 400w,
/report/osre24/anl/scistream/20240720-kraislaik/max_hud1db05134f576847a7f30efa01c43981_86991_f2f60074ded7d8cf873fb29edc8ed917.webp 760w,
/report/osre24/anl/scistream/20240720-kraislaik/max_hud1db05134f576847a7f30efa01c43981_86991_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240720-kraislaik/max_hud1db05134f576847a7f30efa01c43981_86991_8c9589d9aceb571572d05f6f1c20f03b.webp"
width="760"
height="725"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="STD performance analysis" srcset="
/report/osre24/anl/scistream/20240720-kraislaik/std_hu03e1d4d764aea6b18d9fe63b88967d71_86805_0dedae6ede701f5dd0913e39862edcbb.webp 400w,
/report/osre24/anl/scistream/20240720-kraislaik/std_hu03e1d4d764aea6b18d9fe63b88967d71_86805_459f5f4ecf29042cd486cd7465f9e6b8.webp 760w,
/report/osre24/anl/scistream/20240720-kraislaik/std_hu03e1d4d764aea6b18d9fe63b88967d71_86805_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240720-kraislaik/std_hu03e1d4d764aea6b18d9fe63b88967d71_86805_0dedae6ede701f5dd0913e39862edcbb.webp"
width="760"
height="719"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="findings">Findings&lt;/h1>
&lt;p>From the experiments conducted so far, the findings are as follows:&lt;/p>
&lt;ul>
&lt;li>We could not properly simulate some of the initially defined traffic profiles: for example, simulating a latency-sensitive profile requires the ability to set timeouts in iperf3, which is not currently available.&lt;/li>
&lt;li>It is not straightforward to implement SciStream on the Chameleon testbed at the moment.&lt;/li>
&lt;li>iperf3 does not expose a latency metric, and its jitter computation is suspect.&lt;/li>
&lt;/ul>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>Building on the iperf3-based benchmarking and analysis tools, I will next focus on pvaPy:&lt;/p>
&lt;ul>
&lt;li>Fully develop traffic generation and metric collection tools for pvaPy, covering the defined traffic profiles and exposing the chosen metrics.&lt;/li>
&lt;li>Perform an initial experiment, as was done for iperf3.&lt;/li>
&lt;li>Repeat both the iperf3- and pvaPy-based benchmarks in multiple scenarios (LAN, METRO, WAN), compare performance, and explain the results.&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for my final blog as I present deeper results and insights!&lt;/p></description></item><item><title>Architecture Updates - LLM Assistant for OpenROAD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240719-palaniappan-r/</link><pubDate>Fri, 19 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240719-palaniappan-r/</guid><description>&lt;p>Hi again! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>, a GSoC contributor working on the OpenROAD chat assistant project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>. My project aims to build an LLM-powered chat assistant designed to provide seamless access to existing online resources, thereby reducing support overhead. Over the past month, I&amp;rsquo;ve been collaborating with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aviral-kaintura/">Aviral Kaintura&lt;/a>, on data engineering to deliver on our common project goal of an OpenROAD assistant and an open-EDA dataset that promotes further research and collaboration.&lt;/p>
&lt;h3 id="progress">Progress&lt;/h3>
&lt;p>The retrieval architecture is at the heart of any retrieval-augmented generation (RAG) setup. Our current setup employs a hybrid-search technique, combining a traditional keyword search method with more advanced vector search methods. As illustrated in the diagram, we combine a simple semantic search, a Maximal Marginal Relevance (MMR) search and a text-based BM25 ranking technique to build our hybrid retriever.&lt;/p>
&lt;div class="mermaid">flowchart LR
id0([Query]) --> id1
id1([Vectorstore]) --- id2([Semantic Retriever])
id1([Vectorstore]) --- id3([MMR Retriever])
id1([Vectorstore]) --- id4([BM25 Retriever])
id2([Semantic Retriever]) -- Retrieved Docs ---> id5([Reranking])
id3([MMR Retriever]) -- Retrieved Docs ---> id5([Reranking])
id4([BM25 Retriever]) -- Retrieved Docs ---> id5([Reranking])
id5([Reranking]) ---> id6(top-n docs)
&lt;/div>
&lt;p>Upon receiving a query, relevant documents are sourced from each retriever, resulting in a broad set of results. We feed these results into a cross-encoder re-ranker model to get the &lt;code>top-n&lt;/code> documents with maximum relevance.&lt;/p>
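&lt;p>In simplified terms (a sketch, not our actual retrieval code), the hybrid retrieval and re-ranking step looks like this, where the &lt;code>retrievers&lt;/code> and &lt;code>rerank&lt;/code> callables stand in for the semantic/MMR/BM25 retrievers and the cross-encoder model:&lt;/p>

```python
def hybrid_retrieve(query, retrievers, rerank, top_n=5):
    """Pool candidate documents from several retrievers, de-duplicate them,
    then keep the top_n documents by cross-encoder relevance score."""
    pool = {}
    for retrieve in retrievers:          # e.g. semantic, MMR, BM25
        for doc in retrieve(query):
            pool[doc] = None             # dict keys de-duplicate, keep order
    ranked = sorted(pool, key=lambda d: rerank(query, d), reverse=True)
    return ranked[:top_n]
```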
&lt;p>After building the retriever, we utilized the LangGraph framework to develop a stateful, multi-agent workflow tailored to our use case. This allows flexibility in servicing a diverse set of user questions in an efficient and accurate manner, given the sparse nature of our dataset.&lt;/p>
&lt;p>Our current dataset can be broadly classified into the following categories:&lt;/p>
&lt;ul>
&lt;li>OpenROAD Documentation&lt;/li>
&lt;li>OpenROAD-flow-scripts Documentation&lt;/li>
&lt;li>OpenSTA Documentation&lt;/li>
&lt;li>OpenROAD Manpages&lt;/li>
&lt;/ul>
&lt;p>These data sources are embedded into separate FAISS vector databases using open-source embeddings models (we&amp;rsquo;ve been working on fine-tuning an embeddings model for better retrieval accuracy). The hybrid search retrievers are then applied to these vector databases, creating internal tools that can be queried by our LLM as needed. Each tool has access to different data sources in various domains. For instance, the &lt;code>retrieve_cmds&lt;/code> tool selectively has access to information detailing the multiple commands in the OpenROAD framework, while the &lt;code>retrieve_install&lt;/code> deals with installation-related documentation. As depicted in the flowchart, a routing LLM call classifies the input query and forwards it to the appropriate retriever tool. Relevant documents are then sent back to the LLM for response generation.&lt;/p>
&lt;div class="mermaid">graph TD
__start__ --> router_agent
router_agent -.-> retrieve_cmds
router_agent -.-> retrieve_general
router_agent -.-> retrieve_install
router_agent -.-> retrieve_opensta
retrieve_cmds --> generate
retrieve_general --> generate
retrieve_install --> generate
retrieve_opensta --> generate
generate --> __end__
&lt;/div>
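&lt;p>The routing step in the graph above can be sketched as a single dispatch on the classifier&amp;rsquo;s label (hypothetical names; the real agent is built with LangGraph):&lt;/p>

```python
def route_query(query, classify, tools):
    """Route a query to one retriever tool based on an LLM classification;
    the retrieved documents are then passed to the generation step."""
    label = classify(query)  # e.g. "cmds", "install", "opensta", "general"
    retrieve = tools.get(label, tools["general"])  # fall back to general docs
    return retrieve(query)
```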
&lt;p>Feel free to try out our chat assistant &lt;a href="https://orassistant.netlify.app/" target="_blank" rel="noopener">here&lt;/a>. Instructions to set up and run our chatbot can be found &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Here&amp;rsquo;s an example of our chatbot in action.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Example" srcset="
/report/osre24/ucsd/openroad/20240719-palaniappan-r/img1_hu0b0a2035cec8d14387553facf446abed_79114_fdd1b5352a0557597dd03559dd46260b.webp 400w,
/report/osre24/ucsd/openroad/20240719-palaniappan-r/img1_hu0b0a2035cec8d14387553facf446abed_79114_2dc96697a7dd37f2c6e4dc350d2f33c6.webp 760w,
/report/osre24/ucsd/openroad/20240719-palaniappan-r/img1_hu0b0a2035cec8d14387553facf446abed_79114_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240719-palaniappan-r/img1_hu0b0a2035cec8d14387553facf446abed_79114_fdd1b5352a0557597dd03559dd46260b.webp"
width="735"
height="655"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="future-plans">Future Plans&lt;/h3>
&lt;p>In the upcoming weeks, we aim to enhance our dataset by incorporating actionable information filtered from GitHub issues and discussions. We’ll be adding support to keep track of the conversation history as well.&lt;/p>
&lt;p>Stay tuned for more updates!&lt;/p></description></item><item><title>Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240719-triveni5/</link><pubDate>Fri, 19 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240719-triveni5/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Triveni, a Master&amp;rsquo;s student in Computer Science at Northern Illinois University (NIU). I&amp;rsquo;m excited to share my progress on the OSRE 2024 project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Categorize Differences in Reproduced Visualizations&lt;/a> focusing on data visualization reproducibility. Working under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>, I&amp;rsquo;ve made some significant strides and faced some interesting challenges.&lt;/p>
&lt;h2 id="initial-approach-and-challenges">Initial Approach and Challenges&lt;/h2>
&lt;p>I began my work by comparing original visualizations with reproduced ones using OpenCV for pixel-level comparison. This method helped highlight structural differences but also brought to light some challenges. Different versions of libraries rendered visualizations slightly differently, causing minor positional changes that didn&amp;rsquo;t affect the overall message but were still flagged as discrepancies.&lt;/p>
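&lt;p>The idea behind the pixel-level comparison can be sketched as follows (a simplified NumPy version rather than the actual OpenCV code); a small tolerance absorbs the rendering jitter between library versions described above:&lt;/p>

```python
import numpy as np

def pixel_diff_report(img_a, img_b, tol=0):
    """Count pixels whose absolute difference exceeds tol; tol > 0 lets
    minor rendering differences pass without being flagged."""
    if img_a.shape != img_b.shape:
        raise ValueError("images must share a shape for pixel comparison")
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    mismatched = diff > tol
    return {
        "mismatched_pixels": int(mismatched.sum()),
        "mismatch_fraction": float(mismatched.mean()),
    }
```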
&lt;p>To address this, I experimented with machine learning models like VGG16, ResNet, and Detectron2. These models are excellent for general image recognition but fell short for our specific needs with charts and visualizations. The results were not as accurate as I had hoped, primarily because these models aren&amp;rsquo;t tailored to handle the unique characteristics of data visualizations.&lt;/p>
&lt;h2 id="shifting-focus-to-chart-specific-models">Shifting Focus to Chart-Specific Models&lt;/h2>
&lt;p>Recognizing the limitations of general ML models, I shifted my focus to chart-specific models like ChartQA, ChartOCR, and ChartReader. These models are designed to understand and summarize chart data, making them more suitable for our goal of comparing visualizations based on the information they convey.&lt;/p>
&lt;h2 id="generating-visualization-variations-and-understanding-human-perception">Generating Visualization Variations and Understanding Human Perception&lt;/h2>
&lt;p>Another exciting development in my work has been generating different versions of visualizations. This will allow me to create a survey to collect human categorizations of visualizations. By understanding how people perceive differences, whether in outliers, shapes, data points, or colors, we can gain insight into which parameters affect human interpretation of visualizations.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>Moving forward, I&amp;rsquo;ll continue to delve into chart-specific models to refine our comparison techniques. Additionally, the survey will provide valuable data on human perception, which can be used to improve our automated comparison methods. By combining these approaches, I hope to create a robust framework for reliable and reproducible data visualizations.&lt;/p>
&lt;p>I&amp;rsquo;m thrilled about the progress made so far and eager to share more updates with you all. Stay tuned for more insights and developments on this exciting journey!&lt;/p></description></item><item><title> Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arya-sarkar/">Arya Sarkar&lt;/a>, a machine learning engineer and researcher based out of Kolkata, a city in Eastern India dubbed the City of Joy.
For the last month and a half, I have been working closely with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a> on the project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Reproducibility in Data Visualization&lt;/a>. I’m thrilled to make my own little mark on this amazing project and to help explore solutions for capturing visualizations, in hopes of making reproducibility easier in this domain.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;p>The last month and a half have mostly been spent exploring the best possible solutions to facilitate the reproducibility of STATIC visualizations from local sources and/or the web.
We have taken inspiration from existing work in the domain and successfully captured the meta-information required to regenerate visualizations. The extracted metadata is saved into the generated .png figure of the visualization, so reproducing it requires only (a) the original dataset and (b) the generated .png of the visualization. All other information is stored inside the .png file as a JSON object and can be used to regenerate the original image with very high accuracy.&lt;/p>
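&lt;p>To give a flavour of the mechanism (a stdlib-only sketch with a hypothetical &lt;code>viz_meta&lt;/code> keyword, not our exact implementation), JSON metadata can be carried inside a .png by inserting a tEXt chunk just before the final IEND chunk:&lt;/p>

```python
import json
import struct
import zlib

KEYWORD = b"viz_meta\x00"  # hypothetical tEXt keyword for our metadata

def embed_metadata(png_bytes, meta):
    """Insert a tEXt chunk carrying JSON metadata before the IEND chunk."""
    payload = KEYWORD + json.dumps(meta).encode("latin-1")
    chunk = struct.pack(">I", len(payload)) + b"tEXt" + payload
    chunk += struct.pack(">I", zlib.crc32(b"tEXt" + payload) & 0xFFFFFFFF)
    iend = png_bytes.rfind(b"IEND") - 4  # IEND's length field starts 4 bytes earlier
    return png_bytes[:iend] + chunk + png_bytes[iend:]

def extract_metadata(png_bytes):
    """Walk the chunk list and decode the tEXt chunk written above."""
    pos = 8  # skip the 8-byte PNG signature
    while pos < len(png_bytes):
        length, ctype = struct.unpack(">I4s", png_bytes[pos:pos + 8])
        data = png_bytes[pos + 8:pos + 8 + length]
        if ctype == b"tEXt" and data.startswith(KEYWORD):
            return json.loads(data[len(KEYWORD):].decode("latin-1"))
        pos += 12 + length  # length + type + data + CRC
    return {}
```

&lt;p>Reading the metadata back is just the reverse walk over the chunk list, so the figure file itself carries everything except the raw dataset.&lt;/p>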
&lt;p>The problem, however, remains with visualizations that involve randomness, such as jitter. Capturing this randomness has not been 100% successful so far, and we are looking into options to ensure the capture of plots that contain randomness.&lt;/p>
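&lt;p>One common way to make such randomness replayable (shown here only as a hypothetical sketch, not the project&amp;rsquo;s chosen solution) is to capture the RNG seed in the same metadata, so that jitter can be regenerated exactly:&lt;/p>

```python
import random

def jittered(values, amount=0.1, seed=None):
    """Apply jitter while recording the seed needed to reproduce it."""
    if seed is None:
        seed = random.randrange(2**32)  # fresh seed, saved into metadata
    rng = random.Random(seed)
    points = [v + rng.uniform(-amount, amount) for v in values]
    return points, {"jitter_seed": seed}  # metadata travels with the figure
```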
&lt;p>The following images can be used to highlight some results from our reproducibility experiments:
Original Histogram using Matplotlib on the iris dataset:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure4" srcset="
/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_4d5ccda2a3e4409f5fb5bfccad4abae9.webp 400w,
/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_3d4477374e3469fd72bbb32675129816.webp 760w,
/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/original_histogram_hua04132746cb0ed26b86c32673b823c8f_29642_4d5ccda2a3e4409f5fb5bfccad4abae9.webp"
width="760"
height="468"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Histogram using metainformation from the original:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4" srcset="
/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_9ca3975509f66dbedf2746a253660ec4.webp 400w,
/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_ca77d573979d523935009285864d087b.webp 760w,
/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240718-aryas/Reproduced_histogram_hub205e2d6c877abb784c35befc8616823_26597_9ca3975509f66dbedf2746a253660ec4.webp"
width="760"
height="490"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="the-next-steps">The next steps&lt;/h2>
&lt;p>We have already started looking into solutions and ways to capture visualizations from the web i.e. from platforms such as ObservableHq and use these experiments to transition into capturing interactive visualizations from the web.&lt;/p>
&lt;p>Capturing user interactions and all states in an interactive visualization can prove to be very useful as it is a very known pain-point in the reproducibility community and has been a challenge that needs to be solved. My next steps involve working on finding a solution to capture these interactive visualizations especially those living on the web and ensuring their reproducibility.&lt;/p></description></item><item><title>Halfway Through GSOC: My Experience and Learnings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240718-qianru/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240718-qianru/</guid><description>&lt;p>Hello there! I&amp;rsquo;m Qianru, and this is my mid-term blog post for the 2024 Google Summer of Code. I am working on the BenchmarkST project, focusing on benchmarking gene imputation methods in spatial transcriptomics. My goal is to create a comprehensive, reproducible platform for evaluating these methods across various datasets and conditions.&lt;/p>
&lt;p>In this post, I will share some of the progress I have made so far, the challenges I have faced, and how I overcame them. I will also highlight some specific accomplishments and what I plan to do next.&lt;/p>
&lt;hr>
&lt;h3 id="achievements">Achievements:&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Developed the Python Package:&lt;/strong> I created the &amp;ldquo;Impeller&amp;rdquo; Python package, which includes tools for downloading example data, processing it, and training models. This package aims to standardize gene imputation tasks in spatial transcriptomics.&lt;/li>
&lt;li>&lt;strong>Example Data Integration:&lt;/strong> Successfully integrated various spatial transcriptomics datasets into the package for benchmarking purposes.&lt;/li>
&lt;li>&lt;strong>Benchmarking Framework:&lt;/strong> Established a framework for objective comparison of different gene imputation methodologies.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Python Package: Installation and Usage&lt;/strong>&lt;/p>
&lt;p>You can install the package using pip:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">pip install Impeller
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Download Example Data&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">from Impeller import download_example_data
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">download_example_data&lt;span class="o">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Load and Process Data&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">from Impeller import load_and_process_example_data, val_mask, test_mask, x, &lt;span class="nv">original_x&lt;/span> &lt;span class="o">=&lt;/span> load_and_process_example_data&lt;span class="o">()&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Train Model&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-bash" data-lang="bash">&lt;span class="line">&lt;span class="cl">from Impeller import create_args, train &lt;span class="nv">args&lt;/span> &lt;span class="o">=&lt;/span> create_args&lt;span class="o">()&lt;/span>,test_l1_distance, test_cosine_sim, &lt;span class="nv">test_rmse&lt;/span> &lt;span class="o">=&lt;/span> train&lt;span class="o">(&lt;/span>args, data, val_mask, test_mask, x, original_x&lt;span class="o">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;hr>
&lt;h3 id="challenges">Challenges:&lt;/h3>
&lt;p>Reproducing the results of various gene imputation methods was not an easy task. I faced several challenges along the way:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Lack of Standardized Data:&lt;/strong> Some methods had incomplete or missing code, making it difficult to reproduce their results accurately.&lt;/li>
&lt;li>&lt;strong>Reproducibility Issues:&lt;/strong> Even when code was available, reproducing the reported results of several methods proved difficult.&lt;/li>
&lt;li>&lt;strong>Resource Limitations:&lt;/strong> Running large-scale experiments required significant computational resources, which posed constraints on the project timeline.&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="future-work">Future Work:&lt;/h3>
&lt;p>Moving forward, I plan to:&lt;/p>
&lt;ol>
&lt;li>Extend the package&amp;rsquo;s functionalities to include more datasets and imputation methods.&lt;/li>
&lt;li>Enhance the benchmarking framework for more comprehensive evaluations.&lt;/li>
&lt;li>Collaborate with other researchers to validate and improve the package&amp;rsquo;s utility in the bioinformatics community.&lt;/li>
&lt;/ol>
&lt;hr>
&lt;p>I hope you found this update informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and support!&lt;/p></description></item><item><title>Mid Term Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240718-jaycezhu/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240718-jaycezhu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello, I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>, a 2024 SoR contributor for the FEP-bench project, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a>. The FEP-Bench project proposes to address the significant bottlenecks encountered during the data preprocessing phase of machine learning, particularly the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the resources required by ML projects.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Our research project is set in the context of deep neural networks (DNNs). To train a DNN, we first need a large amount of data, and all raw data must pass through a preprocessing pipeline specific to the ML task: typically, data is loaded from disk, converted to the correct format, transformed, and augmented before being fed into the training stage. In common ML training tasks and datasets, data preprocessing can consume almost 65% of total training time. Compared with the rapid development of computing hardware such as GPUs and TPUs, however, the speed of data preprocessing pipelines has improved little and cannot keep up with these hardware innovations, which creates a bottleneck in the efficiency of deep neural network training.&lt;/p>
&lt;p>The bottlenecks fall into two categories: the data side and the computation side. Data-side bottlenecks are mainly caused by data transfer within the system, including data fetching, I/O limits, large data volumes, and complex data formats. Computation-side bottlenecks arise during data preprocessing operations and data shuffling; in distributed machine learning training systems, gathering the distributed data can also cause them.&lt;/p>
&lt;h2 id="current-progress">Current Progress&lt;/h2>
&lt;p>In order to improve the efficiency of the machine learning preprocessing pipeline, we first need to understand and document the preprocessing workflows commonly used in machine learning, including pipelines of Natural Language Processing, Computer Vision, and Audio datasets. As a result, for the past month, we have built up a collection of common datasets for different machine learning tasks. The dataset types include NLP, CV, Audio, Linear Regression, Video and LiDAR. The machine learning job types are collected based on the dataset types, such as sentiment analysis for NLP, and image classification for CV. The data has either a structured or unstructured format. In addition, our collection contains the following attributes:&lt;/p>
&lt;ul>
&lt;li>Data/Sample size&lt;/li>
&lt;li>Typical preprocessing operations&lt;/li>
&lt;li>Preprocessing difficulty: hard/easy&lt;/li>
&lt;li>Input splittable&lt;/li>
&lt;li>Output reusable&lt;/li>
&lt;li>CPU/GPU/IO Bound&lt;/li>
&lt;li>Dataset and preprocessing links.&lt;/li>
&lt;/ul>
&lt;p>By collecting all this data, we can gain an overview of all common preprocessing pipelines in the current machine learning research field, and build up a solid basis for the next phase of our project, which requires hard work on benchmark profiling. For example, for the Audio datasets, we focus on the LibriSpeech dataset. It contains 1000 hours of speech sampled at 16kHz, making it one of the largest publicly available datasets for speech recognition tasks. The typical preprocessing steps of the LibriSpeech dataset include feature extraction, label to integer conversion, and padding.&lt;/p>
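&lt;p>As a toy illustration of the last two LibriSpeech steps above, label-to-integer conversion and padding (feature extraction is omitted, and all names here are hypothetical):&lt;/p>

```python
def preprocess_batch(transcripts, features, vocab):
    """Convert transcript characters to integer labels and zero-pad every
    feature sequence to the longest one in the batch."""
    labels = [[vocab[ch] for ch in text] for text in transcripts]
    max_len = max(len(f) for f in features)
    padded = [f + [0.0] * (max_len - len(f)) for f in features]
    return padded, labels
```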
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the first phase of the project, I met a lot of challenges as I had not been exposed to topics similar to this project. The first big problem was that I needed to learn the concepts of some machine learning tasks from scratch, such as NLP, so that I could have a better understanding of the common datasets and pipelines. Also, I needed to deeply review a lot of different preprocessing pipelines for each machine learning task, to make the table more comprehensive.&lt;/p></description></item><item><title>Midterm Blogpost: Drift Management Strategies Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/</guid><description>&lt;p>Hello there! I&amp;rsquo;m William and I&amp;rsquo;m thrilled to share the progress on my project, &amp;ldquo;Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches&amp;rdquo; under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a> under the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">LAST&lt;/a> project.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>If you&amp;rsquo;re not familiar with it, this project aims to address the issue of model aging, where machine learning (ML) models experience a decline in effectiveness over time due to environmental changes, known as drift. My goal is to design an extensible pipeline that evaluates and benchmarks the robustness of state-of-the-art algorithms in addressing these drifts.&lt;/p>
&lt;h1 id="progress">Progress&lt;/h1>
&lt;p>So far, I&amp;rsquo;ve generated various synthetic datasets, which include:&lt;/p>
&lt;ul>
&lt;li>CIRCLE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. Each data point is labeled positive when (x1 − c1)^2 + (x2 − c2)^2 &amp;lt;= r^2, where the center (c1, c2) and radius r of the circular decision boundary change gradually over time, introducing (gradual) concept drift.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Circle" srcset="
/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_f33edae3375218d9e5b25a95221be6a3.webp 400w,
/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_8c85ac190035be7be37c64958f68df3c.webp 760w,
/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/circlestream_hued24f09167f45c96aadcbb5d8e5dff3c_90965_f33edae3375218d9e5b25a95221be6a3.webp"
width="760"
height="315"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>COVCON: This 2-dimensional dataset exhibits both covariate shift and concept drift. The decision boundary at each point is given by α ∗ sin(πx1) &amp;gt; x2. We use 10000 points (100 batches, 1000 points per batch). Covariate shift is introduced by changing the distribution from which x1 and x2 are drawn in each batch, and concept drift is introduced by alternating the value of α.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="CovCon" srcset="
/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_71acf32897baf4ad41a1f28aa45ddbe3.webp 400w,
/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_2b3b3648f56332fa89453ae08fff34a0.webp 760w,
/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/covcon_hu4a191415f686125b6694a5c631bd0a53_84580_71acf32897baf4ad41a1f28aa45ddbe3.webp"
width="760"
height="314"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;li>SINE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. In the first context, all points below the curve y = sin(x) are classified as positive; afterwards the class labels are flipped, introducing abrupt concept drift.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sine" srcset="
/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_9cf339d9d956964ac3129aed212bbbb4.webp 400w,
/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_f77fa8ca9a78666905791587c0ea3cd5.webp 760w,
/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/sinestream_hu58fb0770a1ec970143181b61b6554e14_112143_9cf339d9d956964ac3129aed212bbbb4.webp"
width="760"
height="317"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/li>
&lt;/ul>
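&lt;p>For concreteness, the CIRCLE generator can be sketched as follows (a simplified stdlib version with a hypothetical drift schedule, not the exact script used here):&lt;/p>

```python
import random

def circle_stream(n_batches=100, batch_size=1000, seed=0):
    """Yield (X, y) batches where the circular decision boundary's center
    and radius drift gradually across batches (gradual concept drift)."""
    rng = random.Random(seed)
    for t in range(n_batches):
        frac = t / max(1, n_batches - 1)
        c = 0.3 + 0.4 * frac  # hypothetical: center drifts from 0.3 to 0.7
        r = 0.2 + 0.1 * frac  # hypothetical: radius drifts from 0.2 to 0.3
        X = [(rng.random(), rng.random()) for _ in range(batch_size)]
        y = [int((x1 - c) ** 2 + (x2 - c) ** 2 <= r ** 2) for x1, x2 in X]
        yield X, y
```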
&lt;p>Additionally, I&amp;rsquo;ve also curated drifting data from the Tencent I/O block trace. These datasets will be used to benchmark model performance under different drift conditions.&lt;/p>
&lt;p>The pipeline can receive a base scikit-learn model and evaluate its performance on these datasets prequentially. Here are some initial results for the models on these drifting datasets under never-retraining and retraining strategies, using 1 &amp;amp; 7 past windows. As you can see, model performance degrades upon encountering extreme drift.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Circle" srcset="
/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_cd621ffcf8e0112adfbd5a4b18eed098.webp 400w,
/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_70229554177ef52d038b9d63d9ddea31.webp 760w,
/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/featured_hu314786216917ab73ec11106e3fbdbfd6_47344_cd621ffcf8e0112adfbd5a4b18eed098.webp"
width="760"
height="459"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="CovCon" srcset="
/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_71acdd86c329186013de6ffb21b90cd6.webp 400w,
/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_252db232f127d0cb909594060e49d831.webp 760w,
/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/covconmodel_hud438b5726b48ee443ea38ddb26d42afb_81013_71acdd86c329186013de6ffb21b90cd6.webp"
width="760"
height="482"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sine" srcset="
/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_e28e854413249c9d3e580bd86e496391.webp 400w,
/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_04c9aca01ba0c5693cdfe730cc25bc07.webp 760w,
/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240718-williamn/sinemodel_hufe9ad7a15a34440d5b75d411ad8cc905_51948_e28e854413249c9d3e580bd86e496391.webp"
width="760"
height="479"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
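&lt;p>Prequential (test-then-train) evaluation with a sliding retraining window can be sketched like this, where &lt;code>model&lt;/code> is any object with scikit-learn-style &lt;code>fit&lt;/code>/&lt;code>score&lt;/code> methods (a hypothetical helper, not the pipeline&amp;rsquo;s exact code):&lt;/p>

```python
def prequential_eval(model, stream, retrain_windows=1):
    """Score each incoming batch with the current model, then refit on the
    most recent retrain_windows batches (test-then-train)."""
    history, accuracies = [], []
    for X, y in stream:
        if history:  # score before the model has seen this batch
            accuracies.append(model.score(X, y))
        history.append((X, y))
        recent = history[-retrain_windows:]
        X_train = [x for X_b, _ in recent for x in X_b]
        y_train = [label for _, y_b in recent for label in y_b]
        model.fit(X_train, y_train)
    return accuracies
```

&lt;p>Setting &lt;code>retrain_windows=1&lt;/code> retrains only on the latest window, while larger values mix in older, possibly stale, data.&lt;/p>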
&lt;h1 id="findings">Findings&lt;/h1>
&lt;p>From the experiments conducted so far, the findings are as follows:&lt;/p>
&lt;ul>
&lt;li>A model without retraining struggles to maintain performance when drift occurs.&lt;/li>
&lt;li>Retraining on data from previous drifting windows, whether the drift is abrupt (SINE) or gradual (CIRCLE), leads to poorer performance; this is especially evident for the retraining strategy that incorporates data from up to 7 windows prior.&lt;/li>
&lt;li>However, retraining on previous data proves beneficial in cases of covariate shift (CovCon), allowing the model to better align with the evolving real-world feature distributions.&lt;/li>
&lt;/ul>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>With the base pipeline template and dataset curation complete, my focus moving forward will be on:&lt;/p>
&lt;ul>
&lt;li>Implementing three advanced algorithms: AUE (Accuracy Updated Ensemble), MATCHMAKER, and Driftsurf, then integrating them into the pipeline.&lt;/li>
&lt;li>Enhancing the benchmarking process by adding more metrics and plots, such as training time and inference time, to better evaluate the strategies.&lt;/li>
&lt;li>Packaging the entire experiment into a Chameleon Trovi Artifact, ensuring ease of reproducibility and extension.&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for my final blog as I delve deeper into this project!&lt;/p></description></item><item><title>Midterm Blogpost: HDEval's LLM Benchmarking for HDL Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240718-ashwinbardhwaj/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240718-ashwinbardhwaj/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ashwin-bardhwaj/">Ashwin Bardhwaj&lt;/a>, an electrical engineering and computer science student based in San Diego, CA. For the past 6 weeks, I have been working closely with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> on the &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a> project. The aim of this project is to create multiple project-sized HDL benchmarks to evaluate how well existing LLMs can generate Verilog/Chisel code. These benchmarks will include my own &amp;ldquo;golden&amp;rdquo; HDL implementation of the project as well as respective English prompts to guide the LLM. I am excited to be able to work with these tools that have the potential to become a valuable resource for HDL design. So far, I have been successful in creating the first benchmark, a pipelined 3-stage RISC-V core, as well as working through my second project, a Gameboy Emulator.&lt;/p>
&lt;h2 id="risc-v-implementation">RISC-V Implementation&lt;/h2>
&lt;p>Over this past month and a half, I have successfully completed my first benchmark, which focuses on creating, modeling, and testing a pipelined 3-stage RISC-V core. The core uses the fetch, decode, and execute structure and is functional for most RV32I instructions. I synthesized and simulated my Verilog using Icarus Verilog and displayed the waveforms on GTKWave. After development, a good portion of time was spent creating and tuning the English explanation of each Verilog module. After running these benchmark files through several LLM APIs, we compared the existing &amp;ldquo;golden&amp;rdquo; modules with the generated ones and noticed that more recent versions of LLMs such as GPT-4o and Claude 3 perform much better at creating syntactically correct and efficient code.&lt;/p>
&lt;p>In addition, I have also created a tool that will parse the Verilog and instruction files into the necessary json structure to then test on various models.&lt;/p>
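As an illustration of that kind of packaging step, a minimal sketch might look like the following. The field names here are my own guesses for illustration, not the actual HDEval schema.

```python
import json
from pathlib import Path

def build_entry(verilog_path, prompt_path):
    # Bundle one golden module and its English prompt into a JSON record.
    # The keys are hypothetical; the real benchmark schema may differ.
    return {
        "module": Path(verilog_path).stem,
        "prompt": Path(prompt_path).read_text(),
        "golden_verilog": Path(verilog_path).read_text(),
    }

def build_benchmark(pairs, out_path):
    # pairs: list of (verilog_file, prompt_file) tuples
    entries = [build_entry(v, p) for v, p in pairs]
    Path(out_path).write_text(json.dumps(entries, indent=2))
    return entries
```

Each record pairs the golden implementation with its prompt, so a driver script can feed the prompt to an LLM API and diff the response against the golden module.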
&lt;h2 id="gameboy-emulator">Gameboy Emulator&lt;/h2>
&lt;p>I am also in the process of developing the second benchmark, which targets a Gameboy emulator. This will challenge the LLMs much more than the RISC-V project because apart from the custom CISC CPU, the model should also understand how to handle various other blocks of the hardware system including memory, picture processing unit (PPU), sound processing unit (SPU), various input/output systems like the buttons and cartridge, and interrupt handlers. As a result, it will challenge the model to understand the system as a whole when creating each individual module.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>As we continue into the second half of the project, I will keep working on my Gameboy emulator. I have already completely developed and tested the Z80-esque CPU, DMA, and interrupt handler but need to continue working on the display and sound interfaces. I will also continue to evaluate and run these tests over a wider range of LLMs to get a better picture of what models and versions are best suited for HDL design as well as the direction these models are going in.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>The &amp;ldquo;Reproducible Experiment Workflows in SLICES/pos&amp;rdquo; project is part of the larger SLICES-RI initiative, designed to improve the reproducibility and reusability of large-scale experimental research. The project focuses on integrating the RO-Crate standard into the pos testbed to organize and document experiment results systematically. This integration will enhance the accessibility and comprehensibility of research findings, ensuring they adhere to the FAIR principles. Additionally, the project aims to improve the portability of pos experiments to the Chameleon testbed, facilitating collaboration and seamless execution across different research environments.&lt;/p>
&lt;h1 id="progress-and-challenges">Progress and Challenges&lt;/h1>
&lt;p>The first half of the project is done, marked by significant progress and learnings. My initial focus was on familiarizing myself with the pos framework and the RO-Crate standard. This foundational knowledge was crucial for the subsequent steps of restructuring the results folder and integrating automated RO-Crate generation into the pos framework.&lt;/p>
&lt;h2 id="key-achievements">Key Achievements:&lt;/h2>
&lt;ul>
&lt;li>Restructured Results Folder: The structure of the results folder has been redesigned to streamline navigation and enable systematic storage of result data.&lt;/li>
&lt;li>Automated RO-Crate Generation: Successfully integrated the basics of the RO-Crate standard into the pos framework, enabling the automated generation of comprehensive results documentation.&lt;/li>
&lt;li>Metadata Documentation: Added comprehensive documentation to the results data, including essential metadata such as author details, user scripts, and hardware information, enhancing reproducibility and interpretability.&lt;/li>
&lt;/ul>
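As a rough illustration of what the automated generation produces, a minimal RO-Crate 1.1 metadata skeleton can be assembled with nothing but the json module. The descriptor and root-dataset entries below follow the RO-Crate spec; the author, script, and hardware fields are illustrative placeholders, not the actual pos schema.

```python
import json

def minimal_ro_crate(author_id, scripts, hardware):
    # Skeleton of an ro-crate-metadata.json file: a self-describing
    # descriptor entity plus the root dataset it points to.
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                "@id": "./",
                "@type": "Dataset",
                "author": {"@id": author_id},
                "hasPart": [{"@id": s} for s in scripts],
                "description": "Hardware: " + hardware,
            },
        ],
    }

crate = minimal_ro_crate("#experimenter", ["setup.sh", "measure.sh"], "example x86 node")
print(json.dumps(crate, indent=2))
```

In the pos integration the same idea is applied automatically on every experiment run, so the metadata always matches the result files actually produced.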
&lt;h2 id="challenges-encountered">Challenges Encountered:&lt;/h2>
&lt;ul>
&lt;li>Balancing Automation with Flexibility: Ensuring the automated generation of RO-Crates did not compromise the flexibility required by researchers to customize their experiment documentation, while still coping with the complex requirements of a testbed.&lt;/li>
&lt;li>Complexity of Testbed Systems: Integrating the RO-Crate implementation for a complex system like a testbed has required deep dives into the code base of the testbed.&lt;/li>
&lt;/ul>
&lt;p>Despite these challenges, the progress made has been rewarding, laying a solid foundation for the next phase of the project.&lt;/p>
&lt;h1 id="learnings-and-skills-gained">Learnings and Skills Gained&lt;/h1>
&lt;p>&lt;strong>Understanding the Complexity of Testbeds&lt;/strong>: One of the key learnings from this project has been the realization that testbeds are complex systems. Despite their complexity, the process became manageable thanks to well-documented software and the invaluable support of top mentors who provided detailed answers to in-depth questions. Their guidance was crucial in navigating the challenges of the project.&lt;/p>
&lt;p>&lt;strong>Open Source Development in an Educational Environment&lt;/strong>: My experience in open source development has been enriched by working within an educational context. This skill is particularly important when adapting and simplifying code to ensure that users can follow along and gain a deeper understanding of the experiments, improving the quality of research experiments.&lt;/p>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>As we move into the second half of this project, our primary focus will be on enhancing the portability of pos experiments to the Chameleon testbed. Key tasks include:&lt;/p>
&lt;ul>
&lt;li>Fine-tune RO-Crate Implementation: Continue refining the RO-Crate integration to handle the complexities of testbed systems, such as special edge cases, more effectively.&lt;/li>
&lt;li>Enhance Portability: Refine the integration with Trovi, ensuring seamless upload and retrieval of experiment results across testbeds.&lt;/li>
&lt;li>Develop Introductory Examples: Create examples demonstrating the use of pos in various testbed environments to guide researchers.&lt;/li>
&lt;li>Execute and Analyze Experiments: Design and execute a complex network experiment on both SLICES/pos and Chameleon, validating and refining portability features.&lt;/li>
&lt;/ul>
&lt;p>These steps are crucial to achieving our goal of making pos experiments more accessible and reproducible across different research environments.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Reflecting on the first half of my OSRE24 journey, I am incredibly grateful for the opportunity to work on the &amp;ldquo;Reproducible Experiment Workflows in SLICES/pos&amp;rdquo; project. The experience has been both challenging and rewarding, providing valuable insights into open-source development, machine learning techniques, and the creation of educational resources.&lt;/p>
&lt;p>As we move forward, I am excited about the coming weeks. The completion of the portability enhancements and the execution of complex experiments lie ahead, marking significant milestones in our project. The skills and lessons I have acquired will guide me in future endeavors.&lt;/p></description></item><item><title>Data leakage in applied ML: reproducing examples from genomics, medicine and radiology</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240701-shaivimalik/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240701-shaivimalik/</guid><description>&lt;p>Hello everyone! I&amp;rsquo;m Shaivi Malik, a computer science and engineering student. I am thrilled to announce that I have been selected as a Summer of Reproducibility Fellow. I will be contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/">Data leakage in applied ML: reproducing examples of irreproducibility&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>. You can find my proposal &lt;a href="https://drive.google.com/file/d/1WAsDif61O2fWgtkl75bQAnIcm2hryt8z/view?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>This summer, we will reproduce studies from medicine, radiology and genomics. Through these studies, we&amp;rsquo;ll explore and demonstrate three types of data leakage:&lt;/p>
&lt;ol>
&lt;li>Pre-processing on train and test sets together&lt;/li>
&lt;li>Model uses features that are not legitimate&lt;/li>
&lt;li>Feature selection on training and test sets&lt;/li>
&lt;/ol>
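The third type of leakage is easy to demonstrate in a few lines: with purely random features and labels there is nothing real to learn, yet selecting the best feature on the training and test sets together still yields impressive-looking test accuracy. This toy sketch in plain Python is my own illustration, not one of the actual case studies.

```python
import random

def run_trial(rng, n_features=500, n=40):
    # Random labels and random binary features: there is nothing to learn.
    labels = [rng.randint(0, 1) for _ in range(n)]
    feats = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n_features)]
    train_idx = list(range(n // 2))
    test_idx = list(range(n // 2, n))

    def acc(f, idx):
        # Fraction of samples where the feature value matches the label
        return sum(1 for i in idx if f[i] == labels[i]) / len(idx)

    # Leaky: pick the feature that best matches labels on ALL data (train + test)
    leaky = max(feats, key=lambda f: acc(f, train_idx + test_idx))
    # Correct: pick the feature using the training half only
    clean = max(feats, key=lambda f: acc(f, train_idx))
    return acc(leaky, test_idx), acc(clean, test_idx)

rng = random.Random(0)
trials = [run_trial(rng) for _ in range(20)]
leaky_avg = sum(t[0] for t in trials) / len(trials)
clean_avg = sum(t[1] for t in trials) / len(trials)
print("test accuracy with leaky selection:", round(leaky_avg, 2))
print("test accuracy with clean selection:", round(clean_avg, 2))
```

The leaky pipeline reports test accuracy well above chance purely because the test labels influenced which feature was kept, while the clean pipeline correctly hovers around 50%.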
&lt;p>For each paper, we will replicate the published results with and without the data leakage error, and present performance metrics for comparison. We will also provide explanatory materials and example questions to test understanding. All these resources will be bundled together in a dedicated repository for each paper.&lt;/p>
&lt;p>This project aims to address the need for accessible educational material on data leakage. These materials will be designed to be readily adopted by instructors teaching machine learning in a wide variety of contexts. They will be presented in a clear and easy-to-follow manner, catering to a broad range of backgrounds and raising awareness about the consequences of data leakage.&lt;/p>
&lt;p>Stay tuned for updates on my progress! You can follow me on &lt;a href="https://github.com/shaivimalik" target="_blank" rel="noopener">GitHub&lt;/a> and watch out for my upcoming blog posts.&lt;/p></description></item><item><title>FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240625-peiranqin/</link><pubDate>Tue, 25 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fetchpipe/20240625-peiranqin/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peiran-qin/">Peiran Qin&lt;/a>, a first-year Pre-Doctoral student in Computer Science at the University of Chicago. This summer, I will be
working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/">FetchPipe: Data Science Pipeline for ML-based Prefetching&lt;/a> under the mentorship of Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>. This is my &lt;a href="https://docs.google.com/document/d/1Bq4tulf6bd9HuKyy3mxC-LRKwe9e7YAOVNYQNJTPsys/edit#heading=h.pwfhd8ioumbq" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>Caching and prefetching are integral components of modern storage systems, aimed at reducing I/O latency by utilizing faster but less dense memory for storing data that is accessed frequently. Traditional prefetching strategies, which primarily rely on heuristic-based methods, often fall short in performance, particularly in complex scenarios. To address these complex scenarios, machine learning solutions have emerged in recent years as a promising alternative, offering the ability to learn and predict complicated data access patterns. However, each existing ML prefetcher may be biased toward different scenarios and distinct evaluation metrics. There is still a need to evaluate state-of-the-art machine learning-based prefetchers comprehensively and fairly under an aligned evaluation framework and extensive performance metrics. This motivated me to spend my summer on this interesting project!&lt;/p></description></item><item><title>Developing an Efficient CMS for Polyphy Project</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240621-mohit/</link><pubDate>Fri, 21 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240621-mohit/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>My name is Mohit, and I am currently a sophomore at NIT Jalandhar. As part of the Polyphy project&amp;rsquo;s team, I am determined to make data management much easier for everyone involved. My project also aims to increase the project&amp;rsquo;s social presence.&lt;/p>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/">PolyPhy&lt;/a> project, my &lt;a href="https://docs.google.com/document/d/1BCG6Y-6Usz0hMo11OM5TZY5B8hKTD43wgXSq2s5OcK4/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of MENTOR aims to &amp;hellip;. You might wonder, why create a new CMS when we could use existing solutions like Strapi, Contentful, or WordPress? The answer lies in the specific requirements of our project, which I&amp;rsquo;ll cover in a separate blog post about the selection of the tech stack and code architecture.&lt;/p>
&lt;p>Returning to my programming journey, while the CMS is a significant part of the project, it also includes refactoring existing React code and migrating it to Next.js, among other cool tasks. The first two weeks of my project were primarily focused on this. Now, I am shifting more towards the CMS development.&lt;/p>
&lt;p>How did we start? Initially, we created a curated list of essential features through discussions with my mentors. They ensured that I wouldn&amp;rsquo;t face the burden of unnecessary features, focusing instead on what was truly beneficial for both me and the project. I began by experimenting with various WYSIWYG editors such as React Quill, Tiptap, Draft.js, and Slate.&lt;/p>
&lt;p>By the end of this week, I successfully created a small working prototype of the CMS. As the coding period progresses, things are beginning to take shape, and I am really excited about creating something that will help people in the long run.&lt;/p>
&lt;p>Thank you for reading, and stay tuned for more updates!&lt;/p></description></item><item><title>Causeway: A New Approach to Web Development Teaching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/causeway/20240620-audsostrom/</link><pubDate>Thu, 20 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/causeway/20240620-audsostrom/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/causeway">Causeway&lt;/a> team, my &lt;a href="https://docs.google.com/document/d/e/2PACX-1vRghWCQ1QkuRPh2NDllLgEzXwVXvOXZ-8K3B32ItcrtCY19pFhKGV4x53JHGXoHsEhi1PzsOfs35Uf3/pub" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a> aims to enhance web development education through situated learning.&lt;/p>
&lt;p>&lt;a href="https://tech4good.soe.ucsc.edu/assets/docs/chi-2019-ca.pdf" target="_blank" rel="noopener">Causeway&lt;/a> addresses shortcomings in current online coding tutorials by offering a comprehensive approach to web development using an Angular, RxJS, NgRx, and Firebase stack. By breaking the complex task of creating a website down into discrete chunks (micro-roles) and tracking individual progress, students can be assured they are achieving their desired learning goals. With this project, our team hopes to demonstrate the potential of situated learning – tacit knowledge picked up within a real-world context – instead of the content-based learning approaches used in sites like Khan Academy and Coursera.&lt;/p>
&lt;p>Over the course of this summer, we plan on reinvigorating the pre-existing v1 platform through the addition of new features such as dashboards, quizzes, and in-depth walkthroughs of new potential projects for users to implement. The platform will also leverage the &lt;a href="https://developer.stackblitz.com/platform/api/webcontainer-api" target="_blank" rel="noopener">Stackblitz WebContainer API&lt;/a> and &lt;a href="https://firebase.google.com/docs/functions" target="_blank" rel="noopener">Firebase Cloud Functions&lt;/a> to run full applications in the browser for interactive and secure learning.&lt;/p></description></item><item><title>Unveiling Medicine Patterns: 3D Clustering with Polyphy/Polyglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240619-ayushsharma/</link><pubDate>Wed, 19 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/polyphy/20240619-ayushsharma/</guid><description>&lt;p>Hello! My name is Ayush and this summer I&amp;rsquo;ll be contributing to &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/">Polyphy&lt;/a> and &lt;a href="https://normand-1024.github.io/Bio-inspired-Exploration-of-Language-Embedding/" target="_blank" rel="noopener">Polyglot&lt;/a>, a GPU-oriented agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kiran-deol/">Kiran Deol&lt;/a>.&lt;/p>
&lt;p>For the reference here&amp;rsquo;s my &lt;a href="https://summerofcode.withgoogle.com/media/user/7a1cc1c971c5/proposal/gAAAAABmV3hljjurQ8HAS8PRRRZB2_c5vQ3clWisqad85y-gO7rNvpssnzqGlFeiYQkAb5qY5WDUoRKkxUoTHLLDXLwBvrAjSsRs1qNTYmMrFfsbs1aQrjo=.pdf" target="_blank" rel="noopener">proposal&lt;/a> for this project.&lt;/p>
&lt;p>Polyglot offers an immersive 3D visualization experience, enabling users to zoom, rotate, and delve into complex datasets.
My project aims to harness these capabilities to unlock hidden connections in the realm of medicine, specifically focusing on the relationships between drugs based on their shared salt compositions, rather than just their active ingredients. This approach promises to reveal intricate patterns and relationships that have the potential to revolutionize drug discovery, pharmacology, and personalized medicine.&lt;/p>
&lt;p>In this project, I will create custom embeddings for a vast dataset of over 600,000 medicines, capturing the relationships between their salt compositions. By visualizing these embeddings in Polyglot&amp;rsquo;s 3D space, researchers can identify previously unknown connections between medicines, leading to new insights and breakthroughs. The dynamic and interactive nature of Polyglot will empower researchers to explore these complex relationships in a very efficient and cool way, potentially accelerating the discovery of new drug interactions and therapeutic applications.&lt;/p>
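As a tiny illustration of the idea, a naive multi-hot encoding over a salt vocabulary with cosine similarity already makes medicines that share more salts score as more similar. This sketch uses made-up entries and is not the actual embedding method used in the project.

```python
def salt_vector(salts, vocab):
    # Multi-hot encoding over a fixed salt vocabulary
    return [1.0 if s in salts else 0.0 for s in vocab]

def cosine(a, b):
    # Cosine similarity between two vectors; 0.0 for a zero vector
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical example entries, not real product data
vocab = ["paracetamol", "caffeine", "ibuprofen", "codeine"]
med_a = salt_vector({"paracetamol", "caffeine"}, vocab)
med_b = salt_vector({"paracetamol", "codeine"}, vocab)
med_c = salt_vector({"ibuprofen"}, vocab)

print(round(cosine(med_a, med_b), 2))  # share one salt → 0.5
print(round(cosine(med_a, med_c), 2))  # share none → 0.0
```

The actual project replaces this naive encoding with learned embeddings, but the same similarity structure is what Polyglot then lays out and clusters in 3D space.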
&lt;p>I am really excited to work on this project. Keep following the blogs for further updates!&lt;/p></description></item><item><title>Assessing the Computational Reproducibility of Jupyter Notebooks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/depaul/20240618-nbrewer/</link><pubDate>Tue, 18 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/depaul/20240618-nbrewer/</guid><description>&lt;p>Like so many authors before me, my first reproducibility study and very first academic publication started with the age-old platitude, &amp;ldquo;Reproducibility is a cornerstone of the scientific method.&amp;rdquo; My team and I participated in a competition to replicate the performance improvements promised by a paper presented at last year&amp;rsquo;s Supercomputing conference. We weren&amp;rsquo;t simply re-executing the same experiment with the same cluster; instead, we were trying to confirm that we got similar results on a different cluster with an entirely different architecture. From the very beginning, I struggled to wrap my mind around the many reasons for reproducing computational experiments, their significance, and how to prioritize them. All I knew was that there seemed to be a consensus that reproducibility is important to science and that the experience left me with more questions than answers.&lt;/p>
&lt;p>Not long after that, I started a job as a research software engineer at Purdue University, where I worked heavily with Jupyter Notebooks. I used notebooks and interactive components called widgets to create a web application, which I turned into a reusable template. Our team was enthusiastic about using Jupyter Notebooks to quickly develop web applications because the tools were accessible to the laboratory researchers who ultimately needed to maintain them. I was fortunate to receive the &lt;a href="https://bssw.io/fellows/nicole-brewer" target="_blank" rel="noopener">Better Scientific Software Fellowship&lt;/a> to develop tutorials to teach others how to use notebooks to turn their scientific workflows into web apps. I collected those and other resources and established the &lt;a href="https://www.jupyter4.science" target="_blank" rel="noopener">Jupyter4Science&lt;/a> website, a knowledgebase and blog about Jupyter Notebooks in scientific contexts. That site aims to improve the accessibility of research data and software.&lt;/p>
&lt;p>There seemed to be an important relationship between improved accessibility and reuse of research code and data and computational reproducibility, but I still had trouble articulating it. In pursuit of answers, I moved to sunny Arizona to pursue a History and Philosophy of Science degree. My research falls at the confluence of my prior experiences; I&amp;rsquo;m studying the reproducibility of scientific Jupyter Notebooks. I have learned that questions about reproducibility aren&amp;rsquo;t very meaningful without considering specific aspects such as who is doing the experiment and replication, the nature of the experimental artifacts, and the context in which the experiment takes place.&lt;/p>
&lt;p>I was fortunate to have found a mentor for the Summer of Reproducibility, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tanu-malik/">Tanu Malik&lt;/a>, who shares the philosophy that the burden of reproducibility should not solely rest on domain researchers who must develop other expertise. She and her lab have developed &lt;a href="https://github.com/depaul-dice/Flinc" target="_blank" rel="noopener">FLINC&lt;/a>, an application virtualization tool that improves the portability of computational notebooks. Her prior work demonstrated that FLINC provides efficient reproducibility of notebooks and takes significantly less time and space to execute and repeat notebook execution than Docker containers for the same notebooks. My work will expand the scope of this original experiment to include more notebooks to FLINC&amp;rsquo;s test coverage and show robustness across even more diverse computational tasks. We expect to show that infrastructural tools like FLINC improve the success rate of automated reproducibility.&lt;/p>
&lt;p>I&amp;rsquo;m grateful to both the Summer of Reproducibility program managers and my research mentor for this incredible opportunity to further my dissertation research in the context of meaningful collaboration.&lt;/p></description></item><item><title>Exploring Reproducibility in High-Performance Computing Publications with the Chameleon Cloud</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tuwien/autoappendix/20240615-kkrassni/</link><pubDate>Sat, 15 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tuwien/autoappendix/20240615-kkrassni/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/klaus-kra%C3%9Fnitzer/">Klaus Kraßnitzer&lt;/a> and am currently finishing up my Master&amp;rsquo;s degree at
the Technical University of Vienna. This summer, under the guidance of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sascha-hunold/">Sascha Hunold&lt;/a>,
I&amp;rsquo;m excited to dive into a project that aims to enhance reproducibility in
high-performance computing research.&lt;/p>
&lt;p>Our project, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tuwien/autoappendix/">AutoAppendix&lt;/a>, focuses on the rigorous evaluation and potential
automation of Artifact Description (AD) and Artifact Evaluation (AE) appendices
from publications submitted to this year&amp;rsquo;s &lt;a href="https://supercomputing.org/" target="_blank" rel="noopener">Supercomputing Conference (SC)&lt;/a>. Due to a sizeable
chunk of SC publications utilizing &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon Cloud&lt;/a>, a
platform known for its robust and scalable experiment setups, the project will
be focused on creating guidelines (and
potentially, software tools) that users of the Chameleon Cloud can utilize to
make their research more easily reproducible. You can learn more about the project
and read the full proposal &lt;a href="https://drive.google.com/file/d/1J9-Z0WSIqyJpnmd_uxtEm_m4ZIO87dBH/view?usp=drive_link" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>My fascination with open-source development and research reproducibility was sparked during my undergraduate studies and further nurtured by my role as a teaching assistant. Hands-on projects and academic courses, like those in chemistry emphasizing precise experimental protocols, have deeply influenced my approach to computational science.&lt;/p>
&lt;h2 id="project-objectives">Project Objectives&lt;/h2>
&lt;ol>
&lt;li>&lt;strong>Analyze and Automate&lt;/strong>: Assess current AE/AD appendices submitted for SC24, focusing on their potential for automation.&lt;/li>
&lt;li>&lt;strong>Develop Guidelines&lt;/strong>: Create comprehensive guidelines to aid future SC conferences in artifact submission and evaluation.&lt;/li>
&lt;li>&lt;strong>Build Tools (Conditionally)&lt;/strong>: Develop automation tools to streamline the evaluation process.&lt;/li>
&lt;/ol>
&lt;p>The ultimate aim of the project is to work towards a more efficient, transparent, and
reproducible research environment, and I&amp;rsquo;m committed to making it simpler for
researchers to demonstrate and replicate scientific work. I look forward to
sharing insights and progress as we move forward.&lt;/p>
&lt;p>Thanks for reading, and stay tuned for more updates!&lt;/p></description></item><item><title> Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240614-aryas/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240614-aryas/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arya-sarkar/">Arya Sarkar&lt;/a> and I will be contributing to the research project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Reproducibility in Data Visualization&lt;/a>, with a focus on investigating and coming up with novel solutions to capture both static and dynamic visualizations from different sources. My project is titled Investigate Solutions for Capturing Visualizations and I am mentored by Prof. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>.&lt;/p>
&lt;p>Open-source has always piqued my interest, but I often found it hard to get started as a junior in university. I spent a lot of time working with data visualizations but had never explored the problem of reproducibility before this project. When I saw the plethora of unique and interesting projects during the contribution phase of OSRE-2024, I was unsure where to begin. However, the more I dived into this project and understood the significance of research in this domain for ensuring reproducibility, the more I found myself drawn to it. I am glad to be presented with this amazing opportunity to work in the open-source space as a researcher in reproducibility.&lt;/p>
&lt;p>This project aims to investigate, augment, and/or develop solutions to capture visualizations that appear in formats including websites and Jupyter notebooks. We have a special interest in capturing the state of interactive visualizations and preserving the user interactions required to reach a certain visualization in an interactive environment to ensure reproducibility. &lt;a href="https://drive.google.com/file/d/1SGLd37zBjnAU-eYytr7mYzfselHgxvK1/view?usp=sharing" target="_blank" rel="noopener">My proposal can be viewed here!&lt;/a>&lt;/p></description></item><item><title>Artificial Intelligence Explainability Accountability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/aiealab/20240614-shaburu/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/aiealab/20240614-shaburu/</guid><description>&lt;p>Hey! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sarthak-chowdharyshaburu/">Sarthak Chowdhary (Shaburu)&lt;/a>, and I am thrilled to share my incredible journey with the Open Source Program Office of UC Santa Cruz as part of Google Summer of Code (GSoC) 2024! This experience marks a pivotal milestone in my career, offering me the chance to delve into an intriguing project while learning from the brightest minds in the open-source community. Allow me to guide you through my adventure thus far, from the nerve-wracking wait for results to the exhilarating commencement of the coding period.&lt;/p>
&lt;p>Before we start, here&amp;rsquo;s my &lt;a href="https://drive.google.com/file/d/1BzKi0fXdqCgdK0UEG9zM56W6U5CeuyAP/view?usp=drive_link" target="_blank" rel="noopener">Proposal&lt;/a>.&lt;/p>
&lt;h2 id="pre-gsoc-application">Pre-GSoC Application&lt;/h2>
&lt;p>I had shortlisted three organizations whose projects I was working on:&lt;/p>
&lt;ul>
&lt;li>OSPO UC Santa Cruz - Amplifying Research Impact Through Open Source&lt;/li>
&lt;li>CVAT.AI - Computer Vision Data Annotation for AI&lt;/li>
&lt;li>Emory University - Biomedical Research to Advance Medical Care&lt;/li>
&lt;/ul>
&lt;p>On the 1st of May, like many students eagerly anticipating the results of the Google Summer of Code (GSoC) 2024, I found myself glued to my screen, anxiously waiting for the clock to strike 11:30 PM IST. After what felt like an eternity, I finally received the email that changed everything: I had been selected for GSoC 2024 with the &lt;a href="https://ospo.ucsc.edu" target="_blank" rel="noopener">Open Source Program Office of UC Santa Cruz&lt;/a>!&lt;/p>
&lt;p>The first month of GSoC, known as the community bonding period, is for establishing rapport with the people working on the project. I researched my mentor, Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/leilani-h.-gilpin/">Leilani H. Gilpin&lt;/a>, and built a good rapport with her. She is an Assistant Professor in Computer Science and Engineering and an affiliate of the Science &amp;amp; Justice Research Center at UC Santa Cruz. She is also part of the AI group at UCSC and leads the &lt;a href="https://aiea-lab.github.io/" target="_blank" rel="noopener">AI Explainability and Accountability (AIEA) Lab&lt;/a>. Her research focuses on the design and analysis of methods for autonomous systems to explain themselves, with applications to robust decision-making, system debugging, and accountability. Her current work examines how generative models can be used in iterative XAI stress testing. She guided me through the necessary documentation and explained the project&amp;rsquo;s demands and requirements in detail, which was invaluable for my project.&lt;/p>
&lt;h2 id="project">Project&lt;/h2>
&lt;p>The project aims to build a system capable of taking a student&amp;rsquo;s code as input and explaining their mistakes, from low-level syntax and compilation errors to high-level issues such as overloaded variables.&lt;/p>
&lt;p>My &lt;a href="https://drive.google.com/file/d/1BzKi0fXdqCgdK0UEG9zM56W6U5CeuyAP/view?usp=drive_link" target="_blank" rel="noopener">Proposal&lt;/a> aims to create custom novel basic questions and take it up a notch by creating custom drivers for each problem, common drivers to detect low level errors and give baseline explanations for various error cases, combining these drivers to make a robust system and use third-party open source software (like monaco code editor - the editor of the web) where necessary. Write uniform and consistent feedback/explanations for Each coding problem while covering all the possible edge cases and a pipeline which will iterate the test cases and feedbacks. This benchmark suite will be used for testing the system.&lt;/p>
&lt;p>Additionally, I plan on building an interface with a roadmap from basics such as arrays and hashmaps to advanced topics such as trees, heaps, and backtracking, along with progress bars and confetti on successful unit tests (important). It will use the same benchmark suite under the hood. I will be utilizing Judge0 (an open-source online code execution system) for code execution and Monaco (the open-source editor of the web) as the code editor.&lt;/p>
&lt;p>&lt;strong>Project goals:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Project objective: by the end of the summer, the software should be a novel and robust tool that helps beginner and advanced programmers alike learn programming by hyper-focusing on the mistakes they make and using AI to explain the how, what, and why of their code, providing clear and concise explanations accompanied by actionable suggestions for debugging and improvement.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Expected deliverables: a robust eXplainable AI benchmark suite that will be used extensively in the undergraduate AI courses, possibly the graduate courses as well, and by anyone interested in learning programming with the help of personalized AI.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Future work based on the project: a beautiful gamified interface that gets people excited to learn programming, built on the benchmark suite above, would be awesome to build!&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>When I started my programming journey (before ChatGPT😨), I often encountered problems that were way above my skill set with no way of knowing it, which resulted in countless hours spent without proper feedback on where I was going wrong. This project has a real impact on people in an innovative way that I wish I had access to at the start of my programming journey, so working on it comes from a place of passion. It will also test my own understanding of programming, and spending the summer solidifying that understanding under the guidance of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/leilani-h.-gilpin/">Leilani H. Gilpin&lt;/a> is a dream come true for me.&lt;/p></description></item><item><title>Data leakage in applied ML: reproducing examples of irreproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240614-kyrillosishak/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240614-kyrillosishak/</guid><description>&lt;p>Hello,&lt;/p>
&lt;p>I am Kyrillos Ishak, and I am happy to be part of SoR 2024. I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/">Data leakage in applied ML: reproducing examples of irreproducibility&lt;/a> project, and my &lt;a href="https://drive.google.com/file/d/1u9FGQqxlPMhceKwS_NJxIhkIrQVGIp-0/view" target="_blank" rel="noopener">proposal&lt;/a> was accepted.&lt;/p>
&lt;p>I am excited to work with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> as my mentors. The objective of the project is to develop educational resources that can be adjusted by professors/instructors to explain specific data leakage problems. This involves ensuring the reproducibility of certain research papers that contain data preprocessing issues, then fixing these issues to demonstrate how they can affect the results.&lt;/p>
&lt;p>Data leakage is a problem that occurs when information from outside the training dataset is used to create the model. It can lead to overly optimistic performance estimates and, ultimately, models that do not perform well on new, unseen data.&lt;/p>
&lt;p>Despite the importance of addressing data leakage, many people from fields not closely related to computer science are often unfamiliar with it, even if they are aware of best practices for data preprocessing. Developing educational materials on this topic will greatly benefit them.&lt;/p>
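&lt;p>A minimal sketch of the idea (the numbers below are made up for illustration): even a target-independent step like centering the data leaks information when its statistic is computed before the train/test split.&lt;/p>

```python
# Toy example of preprocessing leakage; the data values are made up.
data = [1.0, 2.0, 3.0, 4.0, 100.0]  # the last value is held out for testing
train, test = data[:4], data[4:]

# Leaky: the centering statistic is computed on ALL data, so the
# extreme test point influences how the training data is scaled.
leaky_mean = sum(data) / len(data)

# Correct: the statistic comes from the training split only.
clean_mean = sum(train) / len(train)

leaky_train = [x - leaky_mean for x in train]
clean_train = [x - clean_mean for x in train]
print(leaky_mean, clean_mean)  # 22.0 vs 2.5 -- the test outlier leaked in
```

&lt;p>The same pattern applies to any fitted preprocessing step (scalers, imputers, feature selection): fit on the training split only, then apply to the test split.&lt;/p>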
&lt;p>I am excited to dive into the topic of data leakage in machine learning. Throughout the summer, I will be sharing regular updates and insightful blog posts on this subject. Stay tuned for more information!&lt;/p></description></item><item><title>Developing Trustworthy Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/aiealab/20240514-nikhilwani/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/aiealab/20240514-nikhilwani/</guid><description>&lt;p>Hi! Thanks for stopping by.&lt;/p>
&lt;p>In this first blog post of a series of three, I’d like to introduce myself, my mentor, and my project.&lt;/p>
&lt;p>My name is Nikhil. I am an ML researcher who works at the intersection of NLP, ML, and HCI. I previously worked as a Machine Learning Engineer II at &lt;a href="https://vmware.com/" target="_blank" rel="noopener">VMware&lt;/a> and spent some wonderful summers interning with ML teams at &lt;a href="https://www.nvidia.com/" target="_blank" rel="noopener">NVIDIA&lt;/a> and &lt;a href="https://www.iitb.ac.in/" target="_blank" rel="noopener">IIT Bombay&lt;/a>. I also recently graduated from the &lt;a href="https://usc.edu/" target="_blank" rel="noopener">University of Southern California (USC)&lt;/a> with &lt;a href="https://www.cs.usc.edu/academic-programs/masters/cs_ms_honors/" target="_blank" rel="noopener">honors&lt;/a> in Computer Science and a master&amp;rsquo;s thesis.&lt;/p>
&lt;p>This year at Google Summer of Code (GSoC 24), I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/aiealab/">developing trustworthy large language models&lt;/a>. I’m very grateful to be mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/leilani-h.-gilpin/">Leilani H. Gilpin&lt;/a> at the &lt;a href="https://aiea-lab.github.io/" target="_blank" rel="noopener">AIEA lab, UC Santa Cruz&lt;/a>. I truly admire the flexibility and ownership she allows me in pursuing my ideas independently within this project. Please feel free to peruse my accepted GSoC proposal &lt;a href="https://drive.google.com/drive/folders/16DHlcHGS7psoFXYc5q2L2-GOsLwIBXl1?usp=drive_link" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Project:&lt;/strong>
My project has a tangible outcome: An open-source, end-to-end, full-stack web app with a hybrid trustworthy LLM in the backend.&lt;/p>
&lt;p>This open-source web app will be a lightweight tool that can take diverse textual prompts, connect with several LLMs and a database, and gather qualitative and quantitative user feedback. Users will be able to see how this feedback affects the LLMs&amp;rsquo; responses and impacts their reasoning and explanations (xAI). The tool will be thoroughly tested to ensure that unit tests pass and code coverage is complete.&lt;/p>
&lt;p>At the moment, we are investigating LLMs and making them more trustworthy in constraint satisfaction tasks like logical reasoning and misinformation detection tasks. However, our work has applicability in other areas of Responsible AI, such as Social Norms (toxicity detection and cultural insensitivity), Reliability (misinformation, hallucination, and inconsistency), Explainability &amp;amp; Reasoning (lack of interpretability, limited logical, and causal reasoning), Safety (privacy violation and violence), and Robustness (prompt attacks and distribution shifts).&lt;/p>
&lt;p>&lt;strong>Impact:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Responsible AI research teams across industry and academia can use this as a boilerplate for their user study projects.&lt;/li>
&lt;li>Diverse PhD students and academic researchers looking to study LLM and user interaction research will find this useful.&lt;/li>
&lt;li>LLM alignment researchers and practitioners can find this resourceful as user feedback affects the inherent rewards model of the internal LLMs.&lt;/li>
&lt;li>Explainable AI (xAI) researchers can find value in the explanations that this tool generates, which reveal interpretable insights into how modern LLMs think and use their memory.
These are just a few use cases; however, there are several others that we look forward to describing in the upcoming posts.&lt;/li>
&lt;/ul>
&lt;p>This was my first blog in the series of three for the UC OSPO. Stay tuned for the upcoming blogs, which will detail my progress at the halfway mark and the final one concluding my work.&lt;/p>
&lt;p>If you find this work interesting and would love to share your thoughts, I am happy to chat! :) Feel free to connect on &lt;a href="https://www.linkedin.com/in/nikhilwani/" target="_blank" rel="noopener">LinkedIn&lt;/a> and mention that you are reaching out from this blog post.&lt;/p>
&lt;p>It is great to meet the UC OSPO community, and thanks for reading. Bye for now.&lt;/p></description></item><item><title>Enhancing Usability and Expandability of the Open Sensing Platform project</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/osp/20240614_ahmedfalah01/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/osp/20240614_ahmedfalah01/</guid><description>&lt;p>Greetings everyone,&lt;/p>
&lt;p>I am Ahmed Falah and I am delighted to be part of the 2024 Google Summer of Code program, where I am contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp">Open Sensing Platform project&lt;/a>.&lt;/p>
&lt;p>My &lt;a href="https://drive.google.com/file/d/1jD2BvRBaCHfiibEcR5sJKr9GWK51RxD7/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> was accepted, and I am fortunate to have &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a> and &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a> as my mentors. The objective of my project is to enhance usability and expandability of the Open Sensing Platform, a hardware solution for deploying sensor networks in outdoor environments. This platform utilizes low-power, long-range communication to transmit data from various sensors to a visualization dashboard. While the platform effectively collects data, its configuration process requires modifying source code to make it more user-friendly. My first steps to enhance usability of the project:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Improve User Interface (UI):&lt;/strong> Develop a user-friendly interface to interact with the platform, enabling researchers to configure the device without modifying code.&lt;/li>
&lt;li>&lt;strong>Conversion of user configuration:&lt;/strong> Convert user configuration data to the Protobuf format for efficient storage and transmission.&lt;/li>
&lt;/ul>
&lt;p>Additionally, I will explore updating the NVRAM functions to interact with Protobuf messages instead of directly writing/reading raw data to NVRAM. I will also implement functions to serialize user configuration data into a Protobuf message and deserialize the message back into a data structure for use within the firmware.&lt;/p>
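&lt;p>As a hypothetical sketch of this approach (the message and field names below are illustrative assumptions, not the platform&amp;rsquo;s actual schema), the user configuration could be described in a .proto file, from which the serialization and deserialization code used by the firmware is generated:&lt;/p>

```proto
// Illustrative schema only -- message names, field names, and field
// numbers are assumptions, not the Open Sensing Platform's real schema.
syntax = "proto3";

message SensorConfig {
  string sensor_id = 1;
  uint32 sample_interval_s = 2;  // seconds between readings
  bool enabled = 3;
}

message UserConfig {
  string device_name = 1;
  uint32 uplink_interval_s = 2;  // low-power, long-range transmit period
  repeated SensorConfig sensors = 3;
}
```

&lt;p>Serializing a &lt;code>UserConfig&lt;/code> message to its compact wire format before writing to NVRAM would keep stored and transmitted configuration data consistent.&lt;/p>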
&lt;p>I will be posting regular updates and informative blogs throughout the summer, so stay tuned!&lt;/p></description></item><item><title>Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240614-mahdi/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/aiio/20240614-mahdi/</guid><description>&lt;p>Hello, I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mahdi-banisharifdehkordi/">Mahdi Banisharifdehkordi&lt;/a>, a Ph.D. student in Computer Science at Iowa State University, specializing in Artificial Intelligence. This summer, I will be working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/">AIIO / Graph Neural Network&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> and Suren Byna.&lt;/p>
&lt;p>High-Performance Computing (HPC) applications often face performance issues due to I/O bottlenecks. Manually identifying these bottlenecks is time-consuming and error-prone. My project aims to enhance the AIIO framework by integrating a Graph Neural Network (GNN) model to automatically diagnose I/O performance bottlenecks at the job level. This involves developing a comprehensive data pre-processing pipeline, constructing and validating a tailored GNN model, and rigorously testing the model&amp;rsquo;s accuracy using test cases from the AIIO dataset.&lt;/p>
&lt;p>Through this project, I seek to provide a sophisticated, AI-driven approach to understanding and improving I/O performance in HPC systems, ultimately contributing to more efficient and reliable HPC applications.&lt;/p></description></item><item><title>StatWrap: Automated Reproducibility Checklists Generation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240614-adi/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/statwrap/20240614-adi/</guid><description>&lt;p>Namaste🙏🏻! I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/adi-akhilesh-singh/">Adi Akhilesh Singh&lt;/a>, currently pursuing a degree in Computer Science and Engineering at IIT(BHU). This summer, I will be working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/statwrap/">StatWrap: Automated Reproducibility Checklists Generation&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>. You can view my &lt;a href="https://drive.google.com/file/d/1xV7eHL9lIWGKueQJxBks6OB_rcXCr8JY/view?usp=sharing" target="_blank" rel="noopener">project proposal&lt;/a> for more details.&lt;/p>
&lt;p>My project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.&lt;/p>
&lt;p>Stay tuned for updates on my progress in the coming weeks! 🚀&lt;/p></description></item><item><title>LLM Assistant for OpenROAD - Data Engineering and Testing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240613-aviral/</link><pubDate>Thu, 13 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240613-aviral/</guid><description>&lt;p>Hello! My name is Aviral Kaintura, and I will be contributing to &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a>, a groundbreaking open-source toolchain for digital integrated circuit automation (RTL to GDSII) during &lt;a href="https://summerofcode.withgoogle.com/" target="_blank" rel="noopener">GSoC 2024&lt;/a>.&lt;/p>
&lt;p>My project, &lt;a href="https://summerofcode.withgoogle.com/programs/2024/projects/J8uAFNCu" target="_blank" rel="noopener">LLM Assistant for OpenROAD - Data Engineering and Testing&lt;/a>, is jointly mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>.&lt;/p>
&lt;p>The aim of this project is to develop a chat assistant to improve the user experience with OpenROAD. My focus will be on developing a well-curated dataset from OpenROAD&amp;rsquo;s knowledge base. This dataset will be fundamental for another project led by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>, which involves building the chatbot&amp;rsquo;s architecture. It will be used for training and validating the model and ensuring efficient context retrieval to generate accurate user responses, aiding in troubleshooting, installation, and other common issues to reduce the maintainers&amp;rsquo; workload.&lt;/p>
&lt;p>In addition to dataset creation, I will be working on testing and evaluation. This includes developing metrics for model evaluation, incorporating both human and automated techniques.&lt;/p>
&lt;p>Our human evaluation framework will utilize chatbot feedback for valuable insights, enhancing the model and dataset. An automated batch testing application is also used to further enhance the evaluation process.&lt;/p>
&lt;p>Here is an early build of the evaluation framework.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Screenshots" srcset="
/report/osre24/ucsd/openroad/20240613-aviral/image_hu3257e6557164f1033894cc91760eaec1_709148_ccb0a69833aa5c774f30b616a038edd6.webp 400w,
/report/osre24/ucsd/openroad/20240613-aviral/image_hu3257e6557164f1033894cc91760eaec1_709148_25ece2ab19d666f60342ed2d6dcb217f.webp 760w,
/report/osre24/ucsd/openroad/20240613-aviral/image_hu3257e6557164f1033894cc91760eaec1_709148_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240613-aviral/image_hu3257e6557164f1033894cc91760eaec1_709148_ccb0a69833aa5c774f30b616a038edd6.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
By leveraging advanced data engineering and testing methodologies, we aim to build an assistant that combines high accuracy with optimal response times. Additionally, we will collaborate with research teams at NYU and ASU to contribute to the research on AI-based chat assistants for electronic design automation.&lt;/p>
&lt;p>I am thrilled to be part of this journey and look forward to making a meaningful impact on the OpenROAD project.&lt;/p>
&lt;p>Stay tuned for more updates on the project!&lt;/p></description></item><item><title>LLM Assistant for OpenROAD - Model Architecture and Prototype</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240613-palaniappan-r/</link><pubDate>Thu, 13 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240613-palaniappan-r/</guid><description>&lt;p>Hi there! &lt;/p>
&lt;p>I’m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>, currently an undergraduate student at the Birla Institute of Technology &amp;amp; Science, Pilani, India.&lt;/p>
&lt;p>I&amp;rsquo;ll be working on the &lt;a href="https://summerofcode.withgoogle.com/programs/2024/projects/DSo6kvA5" target="_blank" rel="noopener">LLM Assistant for OpenROAD - Model Architecture and Prototype&lt;/a> project, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>. &lt;/p>
&lt;p>My project aims to develop the architecture for a chat assistant built for OpenROAD and its native flow, designed to assist beginners and experienced users by giving easy access to existing resources, offering troubleshooting assistance, and providing fast and accurate responses to common questions. I plan to do this by leveraging state-of-the-art retrieval and fine-tuning techniques.&lt;/p>
&lt;p>As part of this project, I will be working alongside another &lt;a href="https://summerofcode.withgoogle.com/programs/2024/projects/J8uAFNCu" target="_blank" rel="noopener">project&lt;/a> to build and test on a valid dataset for training and deployment. We will also be collaborating with other research teams at NYU and ASU, working on similar projects related to OpenROAD chat assistants and flow generation using Generative AI. Our primary objective is to minimize support overhead, improve user experience by reducing response times, and provide access to updated information about OpenROAD.&lt;/p>
&lt;p>Upon completion, my project will offer a viable chat assistant architecture as part of OpenROAD that benefits both the users and tool developers of OpenROAD.&lt;/p>
&lt;p>An &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">early prototype&lt;/a> developed along with a human evaluation framework shows promising results.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Architecture" srcset="
/report/osre24/ucsd/openroad/20240613-palaniappan-r/img2_hud6fda06af55d21d584ae88c38f077b08_207913_30376cbd7d90ae65683883a4dd83751d.webp 400w,
/report/osre24/ucsd/openroad/20240613-palaniappan-r/img2_hud6fda06af55d21d584ae88c38f077b08_207913_f98f5137b9b17acfa41102f49130d427.webp 760w,
/report/osre24/ucsd/openroad/20240613-palaniappan-r/img2_hud6fda06af55d21d584ae88c38f077b08_207913_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240613-palaniappan-r/img2_hud6fda06af55d21d584ae88c38f077b08_207913_30376cbd7d90ae65683883a4dd83751d.webp"
width="760"
height="157"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Here are some responses generated by the prototype,
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Examples" srcset="
/report/osre24/ucsd/openroad/20240613-palaniappan-r/img1_hu1368e4ffa06e7513186f849074288e92_2440307_418c15850b5c6c2573a9082ca1a5a9dc.webp 400w,
/report/osre24/ucsd/openroad/20240613-palaniappan-r/img1_hu1368e4ffa06e7513186f849074288e92_2440307_18f9f9a9254bedf140c7ec005c7cc5b9.webp 760w,
/report/osre24/ucsd/openroad/20240613-palaniappan-r/img1_hu1368e4ffa06e7513186f849074288e92_2440307_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsd/openroad/20240613-palaniappan-r/img1_hu1368e4ffa06e7513186f849074288e92_2440307_418c15850b5c6c2573a9082ca1a5a9dc.webp"
width="760"
height="671"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>I&amp;rsquo;m excited about the potential of ORAssistant as part of the OpenROAD tool suite to accelerate innovation in EDA and chip design by utilizing open-source tools along with Generative AI.&lt;/p>
&lt;p>Stay tuned for more updates!&lt;/p></description></item><item><title>Memory Compiler in OpenROAD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/openroad/20240613-yashkumar3066/</link><pubDate>Thu, 13 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/openroad/20240613-yashkumar3066/</guid><description>&lt;p>Greetings! I&amp;rsquo;m Yash Kumar working on the &lt;a href="project/osre24/openroad/openroad/">OpenROAD Memory Compiler Project&lt;/a> for which my &lt;a href="https://docs.google.com/document/d/1EGxLSYzVWMtBHmT6m3QQTBA_rqJnMB9qfqR51GSb71k/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/austin-rovinski/">Austin&lt;/a> aims to enhance the OpenROAD flow by integrating a DFFRAM generator that extensively uses the OpenDB database to build and layout various memory components like bits, bytes, and 32x32 configurations and more. Taking inspiration from the work of the &lt;a href="https://github.com/AUCOHL/DFFRAM" target="_blank" rel="noopener">AUCOHL repository’s DFFRAM memory compiler&lt;/a>,&lt;/p>
&lt;p>The goal is to develop a DFF/Latch-based RAM that utilizes standard cell libraries. The compiler will generate different views (HDL netlist, functional models, LEF, DEF, Timing, etc.) for specified size configurations, targeting compact design and optimal routing. The compiler should work across various PDKs, starting with Sky130. My initial work tries to test the bit- and byte-level designs.&lt;/p></description></item><item><title>Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240613-kraislaik/</link><pubDate>Thu, 13 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/scistream/20240613-kraislaik/</guid><description>&lt;p>Hello, I am Acheme, currently a PhD student in Computer Engineering at Clemson University. I will be working on &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/project/osre24/anl/scistream/index.md" target="_blank" rel="noopener">SciStream&lt;/a> this summer, mentored by &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/authors/chungmiranda/_index.md" target="_blank" rel="noopener">Joaquin Chung&lt;/a> and &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/authors/fcastro/_index.md" target="_blank" rel="noopener">Flavio Castro&lt;/a>. Here is my &lt;a href="https://docs.google.com/document/d/1w78mE484kfDmWygPCausv6aZxUQwGeI07ohCjvE3TYk" target="_blank" rel="noopener">proposal&lt;/a> for this project.&lt;/p>
&lt;p>This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.&lt;/p>
&lt;p>I am excited to meet everyone and contribute to this project!&lt;/p></description></item><item><title>Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240613-triveni5/</link><pubDate>Thu, 13 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/niu/repro-vis/20240613-triveni5/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Triveni, a Master&amp;rsquo;s student in Computer Science at Northern Illinois University (NIU). When I came across the OSRE 2024 project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/">Categorize Differences in Reproduced Visualizations&lt;/a>, focusing on data visualization reproducibility, I was excited because it aligned with my interest in data visualization. While my initial interest was in geospatial data visualization, the project&amp;rsquo;s goal of ensuring reliable visualizations across all contexts really appealed to me. So I actively worked on understanding the project&amp;rsquo;s key concepts and submitted my &lt;a href="https://drive.google.com/file/d/1R1c23oUC7noZo5NrUzuDbjwo0OqbkrAK/view" target="_blank" rel="noopener">proposal&lt;/a> to join the project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>.&lt;/p>
&lt;h2 id="early-steps-and-challenges">Early Steps and Challenges:&lt;/h2>
&lt;p>I began working on the project on May 27th, three weeks ago. Setting up the local environment initially presented some challenges, but I persevered and successfully completed the setup process. The past few weeks have been spent exploring the complexities of reproducibility in visualizations, particularly focusing on capturing the discrepancies that arise when using different versions of libraries to generate visualizations. Working with Dr. David Koop as my mentor has been an incredible experience. Our weekly report meetings keep me accountable and focused. While exploring different algorithms and tools to compare visualizations can be challenging at times, it&amp;rsquo;s a fantastic opportunity to learn cutting-edge technologies and refine my problem-solving skills.&lt;/p>
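&lt;p>As a first-pass sketch of the kind of comparison involved (my own simplified illustration, not the project&amp;rsquo;s actual tooling): rasterize the same chart with two library versions and measure the fraction of pixels that differ.&lt;/p>

```python
# Simplified illustration: pixel-level diff of two rasterized charts.
# Real tooling would work on rendered image files; here, tiny 2D lists
# of grayscale values (0-255) stand in for the renders.
def diff_fraction(img_a, img_b):
    """Return the fraction of pixels that differ between two images."""
    assert len(img_a) == len(img_b) and len(img_a[0]) == len(img_b[0])
    total = len(img_a) * len(img_a[0])
    changed = sum(
        1
        for row_a, row_b in zip(img_a, img_b)
        for px_a, px_b in zip(row_a, row_b)
        if px_a != px_b
    )
    return changed / total

# The "same" chart rendered by two library versions, with one element moved:
render_v1 = [[255, 255], [0, 255]]
render_v2 = [[255, 255], [255, 0]]
print(diff_fraction(render_v1, render_v2))  # 0.5
```

&lt;p>A raw pixel diff is only a starting signal; categorizing &lt;em>why&lt;/em> pixels differ (moved legends, changed default colors, altered tick labels) is the harder part of the project.&lt;/p>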
&lt;h2 id="looking-ahead">Looking Ahead:&lt;/h2>
&lt;p>I believe this project can make a valuable contribution to the field of reproducible data visualization. By combining automated comparison tools with a user-centric interface, we can empower researchers and data scientists to make informed decisions about the impact of visualization variations. In future blog posts, I&amp;rsquo;ll share more about the specific tools and techniques being used, and how this framework will contribute to a more reliable and trustworthy approach to data visualization reproducibility.&lt;/p>
&lt;p>Stay tuned!&lt;/p>
&lt;p>I&amp;rsquo;m excited to embark on this journey and share my progress with all of you.&lt;/p></description></item><item><title>Stream Processing support for FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/fastensor/20240613-aditya_narayan/</link><pubDate>Thu, 13 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/fastensor/20240613-aditya_narayan/</guid><description>&lt;p>Hi, I&amp;rsquo;m Aditya Narayan,👋&lt;/p>
&lt;p>I&amp;rsquo;m a frequent visitor to the town square of theoretical CS, operations (Ops), and robust high-performance systems. Sometimes I indulge myself with insights on &lt;a href="https://www.science.org/doi/10.1126/science.aam9868" target="_blank" rel="noopener">Computing and Biology&lt;/a>, and other times I enjoy the accounts of minefield experiences in the &lt;a href="https://www.youtube.com/watch?v=tDacjrSCeq4" target="_blank" rel="noopener">systems world&lt;/a>. Luckily, this summer, OSRE offered an opportunity that happened to be at the perfect intersection of my interests.&lt;/p>
&lt;p>This summer, I will be working on a scientific computing library called FasTensor, which offers a parallel computing structure called Stencil, widely used in scientific computing to solve PDEs for physical simulations and to compute convolutions on signals, among many other uses.
I am excited to introduce my mentors, Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a> and Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a> of the &lt;a href="https://crd.lbl.gov/divisions/scidata/sdm/" target="_blank" rel="noopener">Scientific Data Management Group&lt;/a> at Lawrence Berkeley National Laboratory (LBNL). They bring invaluable expertise to the project.&lt;/p>
&lt;p>They recognized the need for a tensor processing library that provided dedicated support for big datasets with inherent structural locality, often found in the scientific computing world, which was lacking in popular open-source MapReduce or Key-Value based frameworks.&lt;/p>
&lt;p>More often than not, the operations performed on these datasets are composed of computations involving neighboring elements. This motivated the development of the FasTensor library.&lt;/p>
&lt;p>I will be working on providing a Stream Processing interface that enables online data processing of large-scale datasets as they arrive from Data Producers. The project focuses on offering rich interfaces for managing and composing streams, supporting common scientific data formats like HDF5, and integrating fault tolerance and reliability mechanisms.&lt;/p>
&lt;p>I am thrilled to work on the FasTensor project because I believe it has the potential to make a significant impact by enabling researchers to implement a rich set of computations on their big datasets in an easy and intuitive manner.&lt;/p>
&lt;p>After all, FasTensor has just one simple paradigm: A -&amp;gt; Transform(F(x), B),&lt;/p>
&lt;p>and it handles all the behind-the-scenes grunt work on big datasets so you can focus on your research.&lt;/p>
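&lt;p>As an illustration only (FasTensor itself is a C++ library, and the names below are hypothetical, not its real API), the Stencil idea behind the A -&amp;gt; Transform(F(x), B) paradigm can be sketched in a few lines of Python:&lt;/p>

```python
# Illustrative Python sketch of the Stencil paradigm (FasTensor itself is a
# C++ library; this is not its real API). Transform applies a user-defined
# function F to each cell's neighborhood of input A, producing output B.

def transform(a, f, halo=1):
    """Apply f to every interior cell's neighborhood in the 2-D list a."""
    rows, cols = len(a), len(a[0])
    b = [[0.0] * cols for _ in range(rows)]
    for i in range(halo, rows - halo):
        for j in range(halo, cols - halo):
            # Gather the stencil: the cell plus its halo of neighbors.
            stencil = [a[i + di][j + dj]
                       for di in range(-halo, halo + 1)
                       for dj in range(-halo, halo + 1)]
            b[i][j] = f(stencil)
    return b

# F here is a 3x3 mean filter, a common convolution-style stencil operation.
mean3 = lambda s: sum(s) / len(s)
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
B = transform(A, mean3)
print(B[1][1])  # 5.0, the mean of all nine values
```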
&lt;p>Stay tuned for updates and feel free to &lt;a href="https://github.com/BinDong314/FasTensor" target="_blank" rel="noopener">collaborate&lt;/a>!&lt;/p></description></item><item><title>Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240612-architd/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/intel/20240612-architd/</guid><description>&lt;p>Hello everyone
I&amp;rsquo;m Archit from India, an undergraduate student at the Indian Institute of Technology (BHU), Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/bsc/ro-crate-compss/">Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1qY-uipQZPox144LD4bs05rn3islfcjky/view" target="_blank" rel="noopener">proposal&lt;/a>, written under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/raul-sirvent/">Raül Sirvent&lt;/a>, aims to develop a service that facilitates the automated replication of COMPSs experiments within the Chameleon infrastructure.&lt;/p>
&lt;h2 id="about-the-project">About the project:&lt;/h2>
&lt;p>The project proposes to create a service that will have the capability to take a COMPSs crate (an artifact adhering to the RO-Crate specification) and, through analysis of the provided metadata construct a Chameleon-compatible image for replicating the experiment on the testbed.&lt;/p>
&lt;h2 id="how-it-all-started">How it all started&lt;/h2>
&lt;p>This journey began amidst our college&amp;rsquo;s cultural fest, in which I was participating, just 15 days before the proposal submission deadline. Many of my friends had been working for months to get selected for GSoC. I didn’t think I could participate this year because I was late, so I thought, &amp;ldquo;Better luck next year.&amp;rdquo; But during the fest, I kept hearing about UC OSPO and that a senior had been selected within a month. So, I was in my room when my friend told me, &amp;ldquo;What&amp;rsquo;s the worst that can happen? Just apply,&amp;rdquo; and so I did. I chose this project and wrote my introduction in Slack without knowing much. After that, it&amp;rsquo;s history. I worked really hard for the next 10 days learning about the project, making the proposal, and got selected.&lt;/p>
&lt;h2 id="first-few-weeks">First few weeks:&lt;/h2>
&lt;p>I started the project a week early, on June 24, and it&amp;rsquo;s been two weeks since. The start was a bit challenging because it required setting up a lot of things on my local machine. For the past few weeks, most of my time has been dedicated to learning about COMPSs, RO-Crate, and Chameleon, the three technologies this project revolves around. The interaction with my mentor has also been great: from the weekly report meetings to my daily barrage of questions, he has been really helpful.
It is my first time working with Chameleon or any cloud computing software, so it can be a bit overwhelming sometimes, but it is getting better with practice.&lt;/p>
&lt;p>Stay tuned for progress in the next blog!!&lt;/p></description></item><item><title>FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240612-jaycezhu/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fep_bench/20240612-jaycezhu/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>, currently pursuing my Master of Science in Computer Science at the University of Chicago. I will be spending my
summer working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/">FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a>
and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1ta-AgK6Dom25OingMkIR1tRzd2Yk78PZa776Wb3oFQ8/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>The landscape of machine learning (ML) is profoundly impacted by the initial stages of feature engineering and data preprocessing. This phase, critical for the success of ML projects, is often the most time-consuming, representing about 80% of the effort in typical ML workflows. The FEP-Bench project proposes to address the significant bottlenecks encountered during this phase, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the required resources of ML projects.&lt;/p></description></item><item><title>First Steps in Enhancing User Experience Reproducibility through TROVI Redesign</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleontroviredesign/20240612-aliciaem/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleontroviredesign/20240612-aliciaem/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/alicia-esquivel-morel/">Alicia Esquivel Morel&lt;/a>, and I&amp;rsquo;m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I&amp;rsquo;m working on a project to &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/trovi/">improve user experience reproducibility through a redesign of TROVI&lt;/a>, as part of the Summer of Reproducibility (SoR) program. Excited to be working with two fabulous mentors; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kate-keahey/">Kate Keahey&lt;/a>, and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Research Reproducibility with a TROVI Redesign&lt;/strong>&lt;/p>
&lt;p>Researchers constantly face challenges replicating experiments due to limitations in current tools. TROVI, a platform designed to facilitate experiment replication, can be hindered by hard-to-follow interfaces and difficulties integrating code and data, which leads to confusion and frustration.&lt;/p>
&lt;p>My SoR project tackles these issues by redesigning TROVI to enhance user experience reproducibility. Imagine a user-friendly platform where uploading code, sharing data, and collaborating with colleagues becomes effortless.&lt;/p>
&lt;p>&lt;strong>The Redesign&amp;rsquo;s Goals&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Enhanced User Experience:&lt;/strong> Inspired by user-friendly platforms like Google Colab, we&amp;rsquo;ll simplify TROVI&amp;rsquo;s interface for intuitive navigation and ease of use.&lt;/li>
&lt;li>&lt;strong>Uploads and Sharing:&lt;/strong> Uploading code and data, as well as collaborating with other researchers, are key goals. Integration with platforms like GitHub will further streamline collaboration.&lt;/li>
&lt;li>&lt;strong>Continuous Improvement:&lt;/strong> A built-in feedback loop will allow users to provide input and suggestions, ensuring TROVI constantly evolves based on user needs.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>The Road Ahead&lt;/strong>&lt;/p>
&lt;p>We&amp;rsquo;re at the beginning of the redesign process. In the next blog post, I&amp;rsquo;ll describe the project&amp;rsquo;s specific goals and the deliverables you can expect.&lt;/p>
&lt;p>&lt;strong>Stay tuned to see how TROVI is built for reproducible research!!&lt;/strong>&lt;/p></description></item><item><title>FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240612-xikangsong/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/fsa-benchmarking/20240612-xikangsong/</guid><description>&lt;p>Hi everyone! I&amp;rsquo;m Xikang, a master&amp;rsquo;s CS student at UChicago. As a part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/">FSA benchmarking Project&lt;/a>, I&amp;rsquo;m thrilled to be a contributor to OSRE 2024, collaborating with Kexin Pei, the assistant Professor of Computer Science at Uchicago and Ruidan, a talented PhD student at UChicago.&lt;/p>
&lt;p>This summer, I will focus on integrating some advanced ML into our RAID slowdown analysis. Our aim is to assess whether LLMs can effectively identify RAID slowdown issues and to benchmark their performance against our current machine learning algorithms. We will test the algorithms on Chameleon Cloud and benchmark them.&lt;/p>
&lt;p>Additionally, we will explore optimization techniques to enhance our pipeline and improve response quality. We hope this research will be a starting point for future work, utilizing LLMs to overcome the limitations of existing algorithms and providing a comprehensive analysis that enhances the performance of RAID and other storage systems.&lt;/p>
&lt;p>I&amp;rsquo;m excited to work with all of you and look forward to your suggestions.
If you are interested, here is my &lt;a href="https://docs.google.com/document/d/1KpodnahgQDNf1-05TF2BdYXiV0lT_oYEnC0oaatHRoc/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20240612-syed/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/chameleoncloud/20240612-syed/</guid><description>&lt;p>Hello, I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/syed-mohammad-qasim/">Syed Mohammad Qasim&lt;/a>, a PhD candidate in Electrical and Computer Engineering at Boston University. I will be spending my
summer working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/">ML-Powered Problem Detection in Chameleon&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ayse-coskun/">Ayse Coskun&lt;/a>
and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>.&lt;/p>
&lt;p>Currently, Chameleon Cloud monitors sites at the Texas Advanced Computing Center (TACC), University of Chicago,
Northwestern University, and Argonne National Lab. They collect metrics using Prometheus at each site and feed them
all to a central Mimir cluster. All the logs go to a central Loki, and Grafana is used to visualize and set alerts.
Chameleon currently collects around 3000 metrics. Manually reviewing and setting alerts on them is time-consuming
and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall
reliability by creating an anomaly detection service that can augment the existing alerting framework.&lt;/p></description></item><item><title>OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/openmlec/202406012-jiajunmao/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uchicago/openmlec/202406012-jiajunmao/</guid><description>&lt;p>Hello, I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiajun-mao/">Jiajun Mao&lt;/a>, a BS/MS student at the University of Chicago studying Computer Science. I will be spending this summer working on the project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/">OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a>
and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>, my &lt;a href="https://docs.google.com/document/d/1nYgNlGdl0jUgW8avpu671oRpMoxaZHZPwlDfBNXRVro/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>How to increase data durability and reliability while decreasing storage cost has always been an interesting research topic. In recent years, erasure-coded storage systems have been seen as strong candidates to replace replication for colder storage tiers. In the paper “Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers”, the authors used theory and simulation to explore how a multi-tiered erasure-coded system can outperform systems using single-level erasure codes in areas such as encoding throughput and network bandwidth consumed for repair, addressing a few pain points in adopting erasure-coded storage systems. I will be implementing the theoretical and simulation results of this paper by building on top of HDFS and ZFS, and benchmarking the system&amp;rsquo;s performance.&lt;/p>
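&lt;p>For readers new to erasure coding, here is the idea in miniature (a single XOR parity block, as in RAID-5; this is an illustration only, not the paper&amp;rsquo;s multi-level scheme): one parity block lets any single lost data block be rebuilt, and MLEC stacks such codes at multiple levels.&lt;/p>

```python
# Erasure coding in miniature (illustrative only, not the paper's MLEC
# scheme): one XOR parity block, as in RAID-5, lets any single lost data
# block be rebuilt. MLEC stacks such codes at multiple levels.
from functools import reduce

def parity(blocks):
    """XOR equal-length byte strings into one parity block."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))

def recover(surviving, parity_block):
    """Rebuild the single missing block from survivors plus parity."""
    return parity(surviving + [parity_block])

data = [b"aaaa", b"bbbb", b"cccc"]
p = parity(data)
# Lose data[1]; XOR-ing the survivors with the parity reconstructs it.
assert recover([data[0], data[2]], p) == b"bbbb"
```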
&lt;p>The project aims to achieve the following:&lt;/p>
&lt;ul>
&lt;li>Make HDFS aware of the underlying characteristics of ZFS as the filesystem&lt;/li>
&lt;li>Enable HDFS to understand failure reports from ZFS and use dedicated MLEC repair logic to execute parity repair&lt;/li>
&lt;li>Enable ZFS to accept repair data from HDFS to repair a suspended pool caused by catastrophic data corruption&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240612-martinputra/</link><pubDate>Wed, 12 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uga/genomicswf/20240612-martinputra/</guid><description>&lt;p>Hi! I&amp;rsquo;m Martin, and I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/">Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>. Our work is driven by the scale of the computing systems that host data commons &amp;ndash; we believe that performance characterization of genomics workloads should be done &lt;em>rapidly&lt;/em> and at a &lt;em>scale&lt;/em> similar to production settings. &lt;a href="https://drive.google.com/file/d/1LmOpCKv09ZGKlkG6VNleWBZ792nIuVOf/view?usp=sharing" target="_blank" rel="noopener">Feel free to check our proposal&lt;/a> for more details!&lt;/p>
&lt;p>We propose &lt;em>GenScale&lt;/em>, a genomics workload benchmarking tool that can achieve both the scale and the speed necessary for characterizing performance in large-scale settings. &lt;em>GenScale&lt;/em> will be built on top of an industrial-grade cluster manager (e.g., Kubernetes) and metrics collection &amp;amp; monitoring systems (e.g., Prometheus), and will support a comprehensive set of applications used in state-of-the-art genomics workflows. The initial version developed during this project will include DNA and RNA alignment workflows.&lt;/p>
&lt;p>Finally, we believe that open access and reproducible research will greatly accelerate the pace of scientific discovery. We aim to package our artefacts and generated datasets in ways that make them easy to replicate, analyze, and build upon. I personally look forward to learning from &amp;amp; contributing to the open source community!&lt;/p></description></item><item><title>LAST: ML in Detecting and Addressing System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240611-joanna/</link><pubDate>Tue, 11 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240611-joanna/</guid><description>&lt;p>Hello! I am Joanna, currently an undergraduate student studying Computer Science and Applied Mathematics and Statistics at Johns Hopkins University. I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">ML in Detecting and Addressing System Drift&lt;/a>, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a> over this summer. Here is my &lt;a href="https://drive.google.com/file/d/10RJhuOBMjIKQSg1PklL3ukDlBHUjdt2i/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> for this project.&lt;/p>
&lt;p>This project aims to build a data analysis pipeline to analyze various datasets, both system and non-system, that have shown notable changes over time. The goal is to understand the characteristics of these datasets (specifically their drifts), evaluate the efficacy of Aging Detection Algorithms, and identify their limitations in computer system tasks.&lt;/p>
&lt;p>I am excited to meet everyone and contribute to this project!&lt;/p></description></item><item><title>Developing a Pipeline to Benchmark Drift Management Strategies</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240610-williamn/</link><pubDate>Mon, 10 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/anl/last/20240610-williamn/</guid><description>&lt;p>With guidance from mentors &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a> under the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last">LAST&lt;/a> project, I aim to develop a pipeline to benchmark the efficacy of various drift management algorithms.&lt;/p>
&lt;p>Despite the abundance of literature on this subject, reproducibility remains a challenge due to the lack of available source code. By crafting this pipeline, I aim to create a standardized platform for researchers and practitioners to compare several state-of-the-art drift management approaches. Through rigorous testing and benchmarking, we seek to identify the most effective algorithms across a spectrum of drift scenarios, including gradual, sudden, and recurring drift.&lt;/p>
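&lt;p>As a rough illustration of those three drift scenarios (toy generators for intuition only; the benchmark itself will use real datasets and algorithms), each one can be described as a change in the mean of a noisy signal over time:&lt;/p>

```python
# Toy generators for the three drift scenarios (illustrative only; the
# benchmark itself will use real datasets and drift management algorithms).
import random

def sudden(n, change_at, base=0.0, shifted=5.0):
    """Mean jumps abruptly at index change_at."""
    return [shifted if i >= change_at else base for i in range(n)]

def gradual(n, base=0.0, slope=0.05):
    """Mean ramps up slowly over the whole stream."""
    return [base + slope * i for i in range(n)]

def recurring(n, period=50, low=0.0, high=5.0):
    """Mean alternates between two regimes every `period` samples."""
    return [high if (i // period) % 2 else low for i in range(n)]

def noisy(stream, sigma=0.5, seed=42):
    """Add Gaussian noise so detectors face realistic variance."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in stream]

# A sudden-drift stream: flat around 0 for 100 samples, then around 5.
stream = noisy(sudden(200, change_at=100))
```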
&lt;p>The final deliverable of this pipeline will be packaged into a Chameleon Trovi Artifact. The pipeline will also be made easily extensible to cater to additional datasets or any custom-made drift-mitigation methods. This is my &lt;a href="https://docs.google.com/document/d/1biPUKMiKrNSegPVFDIyhjKkYeyiyD4hYqQghdsaU4IE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> for the project.&lt;/p>
&lt;p>See you around!&lt;/p></description></item><item><title>Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240610-shuangliang/</link><pubDate>Mon, 10 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240610-shuangliang/</guid><description>&lt;p>Hello there!&lt;/p>
&lt;p>I am Shuang Liang, a third-year student studying Computer and Information Science at The Ohio State University. My passion lies in cloud computing and high-performance computing, areas I have explored extensively during my academic journey. I have participated in various projects and competitions, which have honed my technical skills and deepened my interest in distributed systems.&lt;/p>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep">ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems&lt;/a> project, my &lt;a href="https://threadeater.github.io/files/Understanding_and_Addressing_Scalability_Bugs_in_Large_Scale_Distributed_Systems%20%281%29.pdf" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a>, aims to tackle the critical challenges posed by scalability bugs in systems like Cassandra, HDFS, and Hadoop. These bugs can lead to severe operational issues such as system downtime and data loss, particularly as systems scale up.&lt;/p>
&lt;p>The project goals include systematically analyzing and documenting scalability bugs, developing protocols to effectively trigger and quantify the impact of these bugs, and creating reproducible artifacts and detailed investigation scripts to aid in bug analysis.&lt;/p>
&lt;p>Our project will involve rigorous bug report analysis, reproduction of scalability bugs, and a comparative study of system behaviors before and after bug fixes. We aim to develop methodologies that enhance the reliability and performance of large-scale distributed systems, providing valuable insights and resources to the open-source community.&lt;/p>
&lt;p>Stay tuned to explore the future of reliable and scalable distributed systems!&lt;/p></description></item><item><title>BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240609-qianru/</link><pubDate>Sun, 09 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/uci/benchmarkst/20240609-qianru/</guid><description>&lt;p>Hello! My name is Qianru, and I will be working on a project to improve spatial transcriptomics during Google Summer of Code 2024. My project, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/">Benchmarking Gene Imputation Methods for Spatial Transcriptomics&lt;/a>, is mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> and &lt;a href="https://users.soe.ucsc.edu/~cormac/" target="_blank" rel="noopener">Cormac Flanagan&lt;/a>. The goal is to create a standard platform to evaluate methods for filling in missing gene data, which is a big challenge in spatial transcriptomics. &lt;a href="https://drive.google.com/file/d/1ydqGuuzpNgPpVUBvTiFvF1q7qV9gA_wm/view?usp=sharing" target="_blank" rel="noopener">My proposal can be viewed here!&lt;/a>&lt;/p>
&lt;p>Spatial transcriptomics lets us see where genes are active in tissues, giving us insight into how cells interact in their natural environment. However, current methods often miss some gene data, making it hard to get a complete picture. Gene imputation can help fill in these gaps.&lt;/p>
&lt;p>My project will:&lt;/p>
&lt;ul>
&lt;li>Create a benchmark dataset to standardize gene imputation tasks across different platforms, species, and organs.&lt;/li>
&lt;li>Compare various gene imputation methods to see how well they work in different scenarios.&lt;/li>
&lt;li>Develop a user-friendly Python package with tools for gene imputation to help researchers improve their data.&lt;/li>
&lt;/ul>
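&lt;p>For readers new to gene imputation, here is a toy baseline (not one of the benchmarked methods, just an illustration of the task): fill each missing value in a genes-by-spots expression matrix with the mean of the observed values for that gene.&lt;/p>

```python
# Toy gene-imputation baseline (not one of the benchmarked methods): fill
# each missing value (None) in a genes-by-spots expression matrix with the
# mean of the observed values for that gene.

def mean_impute(matrix):
    """Replace None entries row by row with the row (gene) mean."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        imputed.append([fill if v is None else v for v in row])
    return imputed

# Two genes measured across four spatial spots, with dropouts as None.
expression = [
    [1.0, None, 3.0, 2.0],   # gene A: observed mean is 2.0
    [None, 4.0, None, 6.0],  # gene B: observed mean is 5.0
]
print(mean_impute(expression))  # [[1.0, 2.0, 3.0, 2.0], [5.0, 4.0, 5.0, 6.0]]
```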
&lt;p>I&amp;rsquo;m excited to contribute to this project and help advance the field of spatial transcriptomics by making data analysis more accurate and comprehensive.&lt;/p></description></item><item><title>ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240608-imzahra/</link><pubDate>Sat, 08 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/osu/scalerep/20240608-imzahra/</guid><description>&lt;p>Hi! I&amp;rsquo;m Zahra, an undergraduate at Universitas Dian Nuswantoro, Indonesia.
As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/">ScaleRep&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1Jfk7lRNIWfhFLkVHHN_ZkNQiTg4f8-xp/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>, aims to systematically understand, characterize, and document the challenges associated with scalability bugs in large-scale distributed systems.&lt;/p>
&lt;p>ScaleRep proposes a two-fold strategy to address scalability bugs in large-scale distributed systems. First, Bug Analysis and Documentation involves studying recent scalability issues across popular open-source systems such as Cassandra, Hadoop, HDFS, Ignite, and Spark to understand bug causes, symptoms, and solutions. This includes pinpointing common challenges hindering bug reproduction and devising protocols to trigger and measure scalability bug impacts. Second, Implementation and Artifact Packaging focuses on identifying, reproducing, and documenting scalability bugs, then packaging artifacts with &lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Chameleon Trovi&lt;/a>. This method emphasizes precise bug analysis, establishing reproducible environments, and detailed documentation to ensure artifact reliability and usability.&lt;/p></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240614-jaytau/</link><pubDate>Thu, 06 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240614-jaytau/</guid><description>&lt;p>Namaste everyone! 🙏🏻&lt;/p>
&lt;p>I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joel-tony/">Joel Tony&lt;/a>, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I&amp;rsquo;m truly honored to be part of this year&amp;rsquo;s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I&amp;rsquo;m particularly grateful to be working under the mentorship of Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a>, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. &lt;a href="https://sbyna.github.io" target="_blank" rel="noopener">Suren Byna&lt;/a>, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.&lt;/p>
&lt;p>My project, &amp;ldquo;&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti">Drishti: Visualization and Analysis of AI-based Applications&lt;/a>&amp;rdquo;, aims to extend the &lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer&amp;rsquo;s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.&lt;/p>
&lt;p>Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I&amp;rsquo;m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.&lt;/p>
&lt;p>Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I&amp;rsquo;ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computing often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn&amp;rsquo;t just mean more computation; it means handling vastly more data. This shift makes I/O optimization not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.&lt;/p>
&lt;p>As outlined in my &lt;a href="https://docs.google.com/document/d/1zfQclXYWFswUbHuuwEU7bjjTvzS3gRCyNci08lTR3Rg/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, my tasks are threefold:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Modularize Drishti&amp;rsquo;s codebase&lt;/strong>: Currently, it&amp;rsquo;s a single 1700-line file that handles multiple functionalities. I&amp;rsquo;ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.&lt;/li>
&lt;li>&lt;strong>Enable multi-trace handling&lt;/strong>: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I&amp;rsquo;ll build a layer to aggregate these, providing a comprehensive view of the application&amp;rsquo;s I/O behavior.&lt;/li>
&lt;li>&lt;strong>Craft AI/ML-specific recommendations&lt;/strong>: Current suggestions often involve MPI-IO or HDF5, which aren&amp;rsquo;t typical in ML frameworks like PyTorch or TensorFlow. I&amp;rsquo;ll create targeted recommendations that align with these frameworks&amp;rsquo; data pipelines.&lt;/li>
&lt;/ol>
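&lt;p>As a rough illustration of the second task, here is a minimal Python sketch of what a trace-aggregation layer could look like. The record fields (op, bytes, duration) and the per-rank list format are assumptions for illustration only, not Drishti&amp;rsquo;s actual trace schema.&lt;/p>

```python
# Hypothetical sketch: merging per-process I/O traces into one
# application-wide view. The record format is an assumption, not
# Drishti's actual trace schema.
from collections import defaultdict

def aggregate_traces(traces):
    """Merge per-rank trace records into per-operation totals."""
    totals = defaultdict(lambda: {"bytes": 0, "duration": 0.0, "count": 0})
    for trace in traces:          # one trace per rank/worker
        for rec in trace:
            agg = totals[rec["op"]]
            agg["bytes"] += rec["bytes"]
            agg["duration"] += rec["duration"]
            agg["count"] += 1
    return dict(totals)

# Two illustrative per-rank traces, as an AI data-loading job with
# multiple workers might produce.
rank0 = [{"op": "read", "bytes": 4096, "duration": 0.002}]
rank1 = [{"op": "read", "bytes": 8192, "duration": 0.003},
         {"op": "write", "bytes": 1024, "duration": 0.001}]
summary = aggregate_traces([rank0, rank1])
```

A real aggregation layer would of course parse the framework&amp;rsquo;s native trace files rather than in-memory dicts, but the shape of the problem is the same: fold many per-worker streams into one coherent picture.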
&lt;p>This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it&amp;rsquo;s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.&lt;/p>
&lt;p>From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.&lt;/p>
&lt;p>In today&amp;rsquo;s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we&amp;rsquo;re not just tweaking code. We&amp;rsquo;re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.&lt;/p>
&lt;p>I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I&amp;rsquo;m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.&lt;/p></description></item><item><title>Causeway: Learning Web Development Through Micro-Roles</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/causeway/20240513-rishimondal/</link><pubDate>Mon, 03 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/causeway/20240513-rishimondal/</guid><description>&lt;p>Hello! My name is Rishi and I will be contributing to &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/causeway/">Causeway&lt;/a>, a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack , during Google Summer of Code 2024. My project is &lt;a href="https://summerofcode.withgoogle.com/programs/2024/projects/wTxAXxEz" target="_blank" rel="noopener">Causeway : Improving the Core Infrastructure and Experience ! &lt;/a>, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>. This project aims to modernize the platform by adding various login options (Google, GitHub, email/password, passwordless) using Firebase Authentication, enhancing the landing page with an about section and improved UI, and introducing section quizzes via Firebase Firestore and Cloud Functions. 
It also involves developing user and learning dashboards with Angular Material UI and Firebase Cloud Functions, improving the overall UI design with application walkthroughs, providing an introductory demo for new users, incorporating generative AI features, automating deployment and monitoring with Vercel Bot, and adding contact and feedback options. These enhancements will boost user engagement, usability, and the overall learning experience. &lt;a href="https://drive.google.com/file/d/1WsojAfxLJqU-Jkozbyq-bTJmcTqBkyVy/view?usp=sharing" target="_blank" rel="noopener">My proposal can be viewed here!&lt;/a>&lt;/p>
&lt;p>Causeway is a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack. It aims to bridge the gap in online coding tutorials by providing a holistic approach to web application development, breaking down the process into a hierarchy of micro-roles. This structure offers learners a clear pathway for learning and translates into a clear process for developing an application. In the longer term, this approach will enable learners to contribute to projects by taking on micro-roles for yet-to-be-developed projects. The platform leverages the &lt;a href="https://developer.stackblitz.com/platform/api/webcontainer-api" target="_blank" rel="noopener">Stackblitz WebContainer API&lt;/a> to run full applications in the browser for interactive learning.&lt;/p></description></item><item><title>FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</link><pubDate>Mon, 03 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet often underappreciated phase, consuming approximately 80% of the time in common ML tasks. This extensive time consumption can be attributed to various challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data from data lakes, which are storage repositories that hold a vast amount of raw data in its native format. However, the process of extracting this data can be slow, causing computation cycles to wait for data arrival and leading to delays in the entire preprocessing phase. Furthermore, the size of the data often exceeds the memory capacity of standard computing systems. This is a frequent occurrence in ML, as datasets are typically large and complex. Handling such large datasets requires sophisticated memory management techniques to ensure efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially aggregation, often leads to inefficiencies. These operations may require grouping a large chunk of data as a prerequisite before performing any actual computation. This grouping, without careful configuration and management, can trigger serious data shuffling, leading to extensive remote data movement when the data is distributed across various storage systems. Such data movement is not only time-consuming but also resource-intensive.&lt;/p>
&lt;p>To mitigate these challenges, there is a pressing need to design better caching, prefetching, and heuristic strategies for data preprocessing. The team aims to significantly reduce the time and resources required for preprocessing by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic understanding of the preprocessing workflow is essential. Hence, throughout the program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in the existing frameworks under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul>
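&lt;p>To make the benchmarking task concrete, a micro-benchmark for a single preprocessing operation might be timed along the following lines. This standard-library sketch uses a toy group-by aggregation; the real benchmarks would target frameworks such as pandas, Hadoop, or Spark, and the sizes here are illustrative.&lt;/p>

```python
# Sketch of a micro-benchmark harness for one preprocessing operation
# (a group-by sum), timed with the standard library. Real runs would
# exercise pandas or a distributed framework instead.
import random
import time
from collections import defaultdict

def groupby_sum(rows):
    """Aggregate (key, value) pairs into per-key sums."""
    sums = defaultdict(float)
    for key, value in rows:
        sums[key] += value
    return sums

# Synthetic workload: 100k rows spread over 100 groups.
rows = [(random.randrange(100), random.random()) for _ in range(100_000)]

start = time.perf_counter()
result = groupby_sum(rows)
elapsed = time.perf_counter() - start
print(f"groups={len(result)} elapsed={elapsed:.4f}s")
```

The same harness shape (fixed synthetic input, one timed operation, recorded wall-clock time) extends naturally to the Trovi packages, where only the framework invoked inside the timed region changes.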
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A rolodex of commonly used datasets, their corresponding preprocessing operations, and the expected output formats/types&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets with a single-machine preprocessing framework such as pandas&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets in an existing distributed computation framework such as Hadoop or Spark&lt;/li>
&lt;/ul></description></item><item><title>Enhancing h5bench with HDF5 Compression Capability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240614-henryz/</link><pubDate>Mon, 27 May 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240614-henryz/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench">h5bench&lt;/a> project, my project &lt;a href="https://summerofcode.withgoogle.com/myprojects/details/n0H28Z40" target="_blank" rel="noopener">Enhancing h5bench with HDF5 Compression Capability&lt;/a>, under the mentorship of Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and Dr. Suren Byna, aims to allow h5bench users to incorporate compression features in their simulations by creating custom benchmarks with common scientific lossless &amp;amp; lossy compression algorithms such as SZ, SZ3, ZFP, and GZIP.&lt;/p>
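&lt;p>As a rough sketch of the before/after measurement such benchmarks report: h5bench applies compression through HDF5 filters in C, but the ratio computation itself can be illustrated with Python&amp;rsquo;s zlib (DEFLATE, the algorithm behind GZIP) on synthetic data. The dataset and CSV columns below are illustrative, not h5bench&amp;rsquo;s actual output format.&lt;/p>

```python
# Illustrative only: measure a compression ratio on synthetic data and
# emit a CSV row, mimicking the kind of before/after performance report
# the benchmarks would produce. Real h5bench uses HDF5 filters in C.
import csv
import io
import struct
import zlib

# Synthetic "dataset": a repeating pattern of doubles compresses well.
values = [float(i % 100) for i in range(10_000)]
raw = struct.pack(f"{len(values)}d", *values)

compressed = zlib.compress(raw, level=6)
ratio = len(raw) / len(compressed)

report = io.StringIO()
writer = csv.writer(report)
writer.writerow(["algorithm", "raw_bytes", "compressed_bytes", "ratio"])
writer.writerow(["deflate-6", len(raw), len(compressed), f"{ratio:.2f}"])
print(report.getvalue())
```

Lossy compressors such as ZFP or SZ would add an error-bound parameter to the configuration and an accuracy column to the report, but the measurement loop stays the same.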
&lt;p>The problem I am trying to solve is to implement multiple data compression algorithms in h5bench core access patterns through HDF5 filters. This capability should grant users the flexibility to configure the parameters and methods of compression applied to their datasets according to their specific needs and preferences. My solution primarily involves using a user-defined HDF5 filter mechanism to implement lossless and lossy compression algorithms, such as ZFP, SZ, and cuSZ. Throughout the process, I will deliver one C source file implementing compression configuration settings, one C source file implementing lossless and lossy algorithms, a set of performance reports before and after data compression in CSV and standard output files, and technical documentation on the h5bench user manual website.&lt;/p></description></item><item><title>SLICES/pos: Reproducible Experiment Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tum/slices/20240517-warmuth/</link><pubDate>Fri, 17 May 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/tum/slices/20240517-warmuth/</guid><description>&lt;p>Servus everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Kilian Warmuth, currently pursuing my M.Sc. in Computer Science at the Technical University of Munich (TUM) after completing my B.Sc. in Computer Science at the same institution. Throughout my academic education, I have taken courses in Advanced Computer Networks, which have deepened my understanding and expertise in the field. I was involved in an interdisciplinary project where I created a testing toolchain for the packet generator MoonGen using the SLICES/pos testbed. This experience provided me with extensive hands-on exposure to pos, sparking my interest in reproducible testbeds and the enhancement of pos.&lt;/p>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tum/slices">SLICES/pos: Reproducible Experiment Workflows&lt;/a> project, my &lt;a href="https://1drv.ms/b/s!AkZKU_K5p7iNnQfzdH2eXFsnKfdU?e=skZmXc" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sebastian-gallenmuller/">Sebastian Gallenmüller&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kate-keahey/">Kate Keahey&lt;/a>, and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/georg-carle/">Georg Carle&lt;/a>, aims to address the challenges of managing experiment results within the &lt;a href="https://www.net.in.tum.de/fileadmin/bibtex/publications/papers/gallenmueller_scholz_conext2021.pdf" target="_blank" rel="noopener">pos framework&lt;/a>.&lt;/p>
&lt;p>The project leverages the RO-Crate open standard to organize result data systematically, enhancing accessibility and comprehensibility of research findings. We aim to improve experiment documentation for the pos testbed, providing clear setup and execution instructions to ensure reproducibility. Therefore we need to simplify the dissemination of research findings by automating the creation of RO-Crates, allowing researchers to focus on experiment design without needing to be familiar with RO-Crate standards. Implementing these standards will enhance the sharing of results by automating publication processes for open repositories, promoting transparency and collaboration.&lt;/p>
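&lt;p>For a sense of what automating RO-Crate creation involves, here is a minimal ro-crate-metadata.json generated with only the Python standard library. The file names and dataset description are illustrative; a production tool would likely build on the ro-crate-py package instead of hand-rolling the JSON-LD.&lt;/p>

```python
# Minimal hand-rolled ro-crate-metadata.json following the RO-Crate 1.1
# structure: a metadata descriptor plus a root Dataset entity. File
# names and descriptions are illustrative, not pos's actual layout.
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {   # the metadata file describes itself and points at the root
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # the root data entity: the experiment results as a whole
            "@id": "./",
            "@type": "Dataset",
            "name": "pos experiment results",
            "hasPart": [{"@id": "results/throughput.csv"}],
        },
        {   # one data entity per result file
            "@id": "results/throughput.csv",
            "@type": "File",
            "name": "Measured throughput per run",
        },
    ],
}

metadata = json.dumps(crate, indent=2)
print(metadata)
```

An automated pipeline would walk the pos result directory, emit one File entity per artifact, and attach provenance (experiment scripts, node configuration) before publishing the crate to an open repository.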
&lt;p>We also aim to enhance the portability of experiments across different testbeds, with a particular focus on the Chameleon Testbed. We will develop introductory examples demonstrating how to use pos in various testbed environments. Additionally, we will design and execute a portable complex network experiment based on SLICES/pos. To validate the portability enhancements, we will perform experiments on the Chameleon testbed. Finally, we will refine the portability of pos experiments within Chameleon to ensure seamless execution.&lt;/p>
&lt;p>Stay tuned to explore the future of reproducible testbeds!&lt;/p></description></item><item><title>Hardware Hierarchical Dynamical Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240513-ujjwalshekhar/</link><pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240513-ujjwalshekhar/</guid><description>&lt;p>As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Micro Architecture Santa Cruz (MASC)&lt;/a> my &lt;a href="https://docs.google.com/document/d/1FyQfRVJ2LnPJ9bCBqiylmnc1dOaumed1LQ_N6cK5krw/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> aims to develop a tree data structure under HHDS to replace the current one offered by &lt;a href="https://github.com/masc-ucsc/livehd/blob/34eed40f32669bdab2fbf8fbcc65492660ba40df/core/lhtree.hpp#L526" target="_blank" rel="noopener">LHTree&lt;/a>&lt;/p>
&lt;p>The tree data structure is to be optimized for typical AST traversal and queries. Some queries that are made to this tree are much more frequent than others. Thus a flattening policy will be used to optimize the tree for these queries, at the potential cost of becoming slow for the infrequent queries. The tree will be benchmarked for scalability and performance and is expected to outperform the current version of the tree. Once the implementation is complete, the tree will be integrated into the LiveHD core repository.&lt;/p></description></item><item><title>HDEval: Benchmarking LLMs that Generate Verilog/Chisel Modules From Natural Language</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240611-ashwinbardhwaj/</link><pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240611-ashwinbardhwaj/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Ashwin Bardhwaj, currently pursuing a bachelor&amp;rsquo;s degree in Electrical Engineering and Computer Science at UC Berkeley. I was recently involved in a project to implement a secure hardware encryption enclave in Verilog. That&amp;rsquo;s why I was excited to work with the MASC group to evaluate how existing generalized LLMs (such as ChatGPT 4 or StarCoder) can generate accurate Verilog/Chisel code from English and assist in the hardware development process.&lt;/p>
&lt;p>As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Micro Architecture Santa Cruz (MASC)&lt;/a>, my &lt;a href="https://drive.google.com/file/d/1Fnr85lqrTs7OBohfHfSZI2K3wZU3zJm0/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>, looks to create a suite of benchmark programs for &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a>.&lt;/p>
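&lt;p>A yosys-based logic equivalence check might look roughly like the following script; this is a hedged sketch, and the file and module names (gold.v for the benchmark, gate.v for the LLM-generated code) are illustrative.&lt;/p>

```
# Hypothetical yosys script: formally compare a reference module
# against LLM-generated RTL. File/module names are illustrative.
read_verilog gold.v
prep -top gold
design -stash gold
read_verilog gate.v
prep -top gate
design -stash gate
design -copy-from gold -as gold gold
design -copy-from gate -as gate gate
equiv_make gold gate equiv    # build the combined equivalence circuit
prep -top equiv
equiv_simple                  # try to prove each $equiv cell
equiv_status -assert          # fail if anything remains unproven
```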
&lt;p>The deliverable of this project is to create multiple large HDL benchmarks along with a respective set of prompts. Using yosys to perform a Logic Equivalence Check, we can prove through formal verification that the generated code exhibits the same behavior as the benchmark. In addition, we can consider the performance and resource utilization of the generated code as a metric.&lt;/p></description></item><item><title>(Re)Evaluating Artifacts for Understanding Resource Artifacts</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/reevaluating/</link><pubDate>Wed, 20 Mar 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/reevaluating/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Virtualization, Containerization, Profiling, Reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C, Python, and DevOps experience.&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large; 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tanu-malik/">Tanu Malik&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to characterize computer-science related artifacts that are either submitted to conferences or deposited in reproducibility hubs such as Chameleon. We aim to characterize experiments into different types and understand reproducibility requirements of this rich data set, possibly leading to a benchmark.
We will then examine packaging requirements, especially for distributed experiments, and instrument a package archiver to reproduce a distributed experiment. Finally, we will use the learned experiment characteristics to develop a classifier that determines alternative resources where an experiment can be easily reproduced.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;p>Specific tasks include:&lt;/p>
&lt;ul>
&lt;li>A pipeline consisting of a set of scripts to characterize artifacts.&lt;/li>
&lt;li>Packaged artifacts and an analysis report with open-sourced data on the best guidelines for packaging using Chameleon.&lt;/li>
&lt;li>A classifier system based on artifact and resource characteristics.&lt;/li>
&lt;/ul></description></item><item><title>Auto Appendix</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tuwien/autoappendix/</link><pubDate>Mon, 11 Mar 2024 14:48:10 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tuwien/autoappendix/</guid><description>&lt;p>The SC Conference Series, a leading forum on High Performance Computing (HPC), supports scientific rigor through enhanced reproducibility of accepted papers.
To that end, all manuscripts submitted to the SC Technical Papers program must contain an Artifact Description.
Authors of accepted papers may request reproducibility badges, for which an Appendix describing the
Artifact Evaluation is required.&lt;/p>
&lt;p>In recent years, &lt;a href="https://www.chameleoncloud.org" target="_blank" rel="noopener">Chameleon&lt;/a> has facilitated SC&amp;rsquo;s reproducibility initiative by enabling authors to develop and share computational, reproducible artifacts through the Chameleon cloud.
The Chameleon platform helps authors and reviewers to easily share computational artifacts,
which are included in the papers&amp;rsquo; artifact appendices.&lt;/p>
&lt;p>The proposed project aims to assess all AD/AE appendices submitted for reproducibility badge requests. This evaluation will focus on AD/AE appendices that utilized the Chameleon cloud as the execution platform, examining their potential for automation.
Our aim is to evaluate the feasibility of fully automating various components of the appendices.
Students will engage directly with the chairs of the SC24 Reproducibility Initiative in this effort.&lt;/p>
&lt;h3 id="advancing-sc-conference-artifact-reproducibility-via-automation">&lt;strong>Advancing SC Conference Artifact Reproducibility via Automation&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Reproducibility&lt;/code> &lt;code>Reproducible Research&lt;/code> &lt;code>Artifact Evaluation&lt;/code> &lt;code>Open Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: HPC, Cloud computing, Chameleon, MPI, OpenMP, CUDA&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sascha-hunold/">Sascha Hunold&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Perform an analysis of the current limitations of AD/AE appendices submitted for Artifact Evaluation.&lt;/li>
&lt;li>Re-run the computational artifacts to identify areas for enhancement, with a primary objective of achieving full automation of Artifact Evaluation using the Chameleon cloud.&lt;/li>
&lt;li>Evaluate the existing automation capabilities of the Chameleon cloud.&lt;/li>
&lt;li>Develop a set of recommendations for structuring Computational Artifacts, aimed at benefiting future SC conferences.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</link><pubDate>Wed, 06 Mar 2024 16:33:57 -0600</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</guid><description>&lt;p>Today’s Continuous Integration/Continuous Development (CI/CD) trends encourage
rapid design of software using a wide range of software components, followed by
frequent updates that are immediately deployed on the cloud. The complexity of
cloud systems along with the component diversity and break-neck pace of
development amplify the difficulty in identifying or fixing problems related to
performance, resilience, and security. Furthermore, existing approaches that
rely on human experts—e.g., methods involving manually-written
rules/scripts—have limited applicability to modern CI/CD processes, as they are
fragile, costly, and often not scalable. Consequently, there is growing
interest in applying machine learning (ML) based methods for identifying
vulnerabilities in code, non-compliant or otherwise problematic software, and
resilience problems in systems and networks. However, despite some success
stories in applying AI for cloud operations (e.g., in resource management),
much of cloud operations still rely on human-centric methods, which require
updates as the cloud undergoes CI/CD cycles. The goal of this summer project is
to explore methods of automation for the Chameleon Cloud to enable faster
detection and diagnosis of problems. Overall, the project will contribute to an
overarching vision of building an infrastructure that collects and synthesizes
cross-layer data from large-scale cloud systems, applying ML-powered methods to
automate cloud ops, and, further, making this data available to researchers
through coherent APIs and analytics engines.&lt;/p>
&lt;p>Currently, Chameleon uses runbooks as manual guides for operational tasks,
including routine maintenance and troubleshooting. However, these traditional
runbooks often fall short in dynamic and fast-paced CI/CD environments, as they
lack the flexibility to adapt to changes in software versions, deployment
configurations, and the unique challenges of emerging issues. To overcome these
challenges, the project will leverage ML to automate anomaly detection based on
telemetry data collected from Chameleon Cloud&amp;rsquo;s monitoring frameworks. This
method will not only facilitate rapid identification of performance anomalies
but also enable automated generation of runbooks. These runbooks can then offer
operators actionable steps to resolve issues efficiently, thereby making the
anomaly mitigation process more efficient. Furthermore, this approach supports
the automatic creation of targeted runbooks for newly generated support
tickets, enhancing response times and system reliability.&lt;/p>
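&lt;p>As a minimal illustration of telemetry-based anomaly detection (assuming nothing about Chameleon&amp;rsquo;s actual telemetry schema), a rolling z-score flags points that deviate sharply from recent history; production systems would use richer ML models, but the detection loop has the same shape:&lt;/p>

```python
# Hedged sketch: flag telemetry points deviating from the rolling mean
# by more than k standard deviations. The signal and thresholds are
# illustrative, not Chameleon's actual telemetry or tuning.
import statistics

def rolling_zscore_anomalies(series, window=20, k=3.0):
    """Return indices whose value deviates by more than k sigma from
    the mean of the preceding window."""
    anomalies = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = statistics.fmean(past)
        sigma = statistics.stdev(past)
        if sigma > 0 and abs(series[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies

# Steady synthetic CPU-load signal with one injected spike at index 50.
load = [0.30 + 0.01 * (i % 5) for i in range(100)]
load[50] = 0.95
print(rolling_zscore_anomalies(load))
```

An automated-runbook pipeline would then map each flagged index (and the metric it came from) to a candidate remediation procedure.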
&lt;p>Time permitting, using a collection of automated runbooks (each targeting a specific problem), we will analyze support tickets, common problems, and their frequency to offer insights and suggestions that help Chameleon Cloud roadmap the fixes with the best return on investment.&lt;/p>
&lt;p>A key aspect of this summer project is enhancing the reproducibility of
experiments in the cloud and improving data accessibility. We plan to design
infrastructures and APIs so that the telemetry data that is essential for
anomaly detection and automated runbooks is systematically documented and made
available. We also aim to collect and share insights and modules on applying ML
for cloud operations, including ML pipelines, data labeling strategies, data
preprocessing techniques, and feature engineering. By sharing these insights,
we aim to promote best practices and support reproducible experiments on public
clouds, thus fostering future ML-based practices within the Chameleon Cloud
community and beyond. Time permitting, we will explore applying lightweight
privacy-preserving approaches on telemetry data as well.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Machine Learning&lt;/code>, &lt;code>Anomaly Detection&lt;/code>, &lt;code>Automated Runbooks&lt;/code>, &lt;code>Telemetry Data&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>:
&lt;ul>
&lt;li>Proficiency in Machine Learning: Understanding of ML algorithms for anomaly detection and automation.&lt;/li>
&lt;li>Cloud Computing Knowledge: Familiarity with CI/CD environments and cloud architectures.&lt;/li>
&lt;li>Programming Skills: Proficiency in languages such as Python, especially in cloud and ML contexts.&lt;/li>
&lt;li>Data Analysis: Ability to analyze telemetry data using data analytics tools and libraries.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>ReproNB: Reproducibility of Interactive Notebook Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/</link><pubDate>Mon, 26 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> HPC, MPI, distributed systems&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Difficult&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large; 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tanu-malik/">Tanu Malik&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Notebooks have gained wide popularity in scientific computing. A notebook is both a web-based interactive front-end to program workflows and a lightweight container for sharing code and its output. Reproducing notebooks in different target environments, however, is a challenge. Notebooks do not share the computational environment in which they are executed. Consequently, despite being shareable, they are often not reproducible. We have developed &lt;a href="https://github.com/depaul-dice/Flinc" target="_blank" rel="noopener">FLINC&lt;/a> (see also &lt;a href="https://dice.cs.depaul.edu/pdfs/pubs/C31.pdf" target="_blank" rel="noopener">eScience'22 paper&lt;/a>) to address this problem. However, it currently does not support all forms of experiments, especially HPC experiments. In this project, we will extend FLINC to support HPC experiments. This will involve using recording and replaying mechanisms such as &lt;a href="https://kento.github.io/code/" target="_blank" rel="noopener">ReMPI&lt;/a> and &lt;a href="https://rr-project.org/" target="_blank" rel="noopener">rr&lt;/a> within FLINC.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;p>The project deliverable will be a set of HPC experiments that are packaged with FLINC and available on Chameleon.&lt;/p></description></item><item><title>SciStream-Rep: An Artifact for Reproducible Benchmarks of Scientific Streaming Applications</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/scistream/</link><pubDate>Mon, 26 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/scistream/</guid><description>&lt;p>&lt;a href="https://github.com/scistream/scistream-proto" target="_blank" rel="noopener">SciStream&lt;/a> is a framework and toolkit that attempts to tackle the problem of enabling high-speed (100+ Gbps), memory-to-memory data streaming in scientific environments. This task is particularly challenging because data producers (e.g., data acquisition applications on scientific instruments, simulations on supercomputers) and consumers (e.g., data analysis applications) may be in different security domains and thus require bridging of those domains. Furthermore, either producers, consumers, or both may lack external network connectivity and thus require traffic forwarding proxies. If you want to learn more, please take a look at our &lt;a href="https://dl.acm.org/doi/abs/10.1145/3502181.3531475" target="_blank" rel="noopener">HPDC'22 paper&lt;/a>.&lt;/p>
&lt;h3 id="scistream-rep-an-artifact-for-reproducible-benchmarks-of-scientific-streaming-applications">SciStream-Rep: An Artifact for Reproducible Benchmarks of Scientific Streaming Applications&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Network Performance Testing, Benchmarking, Data Streaming, Reproducibility&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Scripting, Linux, Containers, Networking, benchmark tools&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project focuses on expanding the scope of testing SciStream’s architecture by incorporating a variety of traffic patterns based on real scientific applications. The goal is to understand how different traffic patterns influence the performance of memory-to-memory data streaming in scientific scenarios by creating artifacts for reproducible experiments. Additionally, the project will explore the use of different forwarding elements, such as Nginx and HAProxy, to assess their impact on data streaming efficiency and security.&lt;/p>
&lt;p>Reproducibility is especially difficult in shared network environments such as the Chameleon and FABRIC testbeds: two identical experiments can be expected to yield similar results only when the network conditions (external to our traffic) are similar for both runs. By creating reproducible artifacts for Chameleon and FABRIC, we enable other researchers to repeat the experiments, building statistical confidence in the measured results.&lt;/p>
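&lt;p>As a rough illustration of the kind of traffic-pattern artifacts the project envisions, the sketch below generates send schedules for a constant-rate profile and a bursty on/off profile. The function names and parameters are hypothetical, chosen for illustration; they are not part of SciStream.&lt;/p>

```python
def constant_rate_schedule(duration_s, rate_msgs_per_s):
    """Send times for a constant-rate stream: one message every 1/rate seconds."""
    interval = 1.0 / rate_msgs_per_s
    n = int(duration_s * rate_msgs_per_s)
    return [i * interval for i in range(n)]

def bursty_schedule(duration_s, on_s, off_s, burst_rate_msgs_per_s):
    """Send times for an on/off bursty stream: transmit at burst_rate during
    each 'on' window, stay silent during each 'off' window."""
    times, t = [], 0.0
    interval = 1.0 / burst_rate_msgs_per_s
    n_per_burst = int(on_s * burst_rate_msgs_per_s)
    while t < duration_s:
        for i in range(n_per_burst):
            ts = t + i * interval
            if ts < duration_s:
                times.append(ts)
        t += on_s + off_s  # skip over the silent 'off' window
    return times

# A 10 s constant 100 msg/s stream vs. 1 s bursts at 500 msg/s every 2 s.
constant = constant_rate_schedule(10, 100)
bursty = bursty_schedule(10, on_s=1.0, off_s=1.0, burst_rate_msgs_per_s=500)
```

&lt;p>Both profiles carry the same average load here (1000 messages over 10 s vs. 2500 at 5x the peak rate), which is exactly the kind of contrast a forwarding element's buffers would be stressed by.&lt;/p>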
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Developing a set of benchmarks to measure the performance of scientific streaming applications across a broader range of traffic patterns.&lt;/li>
&lt;li>Creating a set of artifacts for generating traffic patterns typical of data streaming applications.&lt;/li>
&lt;li>Deploying various forwarding elements within the SciStream architecture for the Chameleon and FABRIC testbeds.&lt;/li>
&lt;li>Compiling a best practices document detailing the optimal configurations for SciStream.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-lb-a-dynamic-load-balancing-solution-using-programmable-network-devices">Scistream-LB: A Dynamic Load Balancing Solution Using Programmable network devices&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Network Performance Testing, Data Streaming, Reproducibility, Programmable Data Planes&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python/Scripting, Linux, Docker/Containers, Networking fundamentals, Experience with OpenFlow/P4 programming&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The aim of this project is to create a specialized forwarding element, implemented with the OpenFlow (OF) or P4 programming languages, tailored to enhance the SciStream data plane. This element will offer a flexible, hardware-based (and therefore more efficient) alternative to conventional software-based forwarders such as NGINX or HAProxy, designed specifically for high-performance data streaming in scientific applications. The OF/P4 forwarding elements will be packaged as artifacts for reproducibility experiments on the Chameleon and FABRIC testbeds. Reproducibility is especially difficult in such shared network environments: two identical experiments can be expected to yield similar results only when the network conditions (external to our traffic) are similar for both runs. By creating reproducible artifacts for Chameleon and FABRIC, we enable other researchers to repeat the experiments, building statistical confidence in the measured results.&lt;/p>
&lt;p>Specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Design and implementation of an OF/P4-based forwarding element that can be seamlessly integrated with the data plane of SciStream’s architecture.&lt;/li>
&lt;li>Forwarding logic that supports efficient and secure memory-to-memory data streaming.&lt;/li>
&lt;li>A set of benchmarks for evaluating the new forwarding element against traditional options, focusing on improvements in throughput, latency, and security.&lt;/li>
&lt;li>An investigation into the potential advantages of programmable network elements for detailed control over data streaming paths and security configurations.&lt;/li>
&lt;li>A package of the newly developed forwarding elements as artifacts for reproducibility experiments in Chameleon and FABRIC testbeds.&lt;/li>
&lt;/ul></description></item><item><title>Chameleon Trovi Redesign</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/trovi/</link><pubDate>Wed, 21 Feb 2024 13:43:55 -0600</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/trovi/</guid><description>&lt;p>&lt;a href="https://www.chameleoncloud.org/experiment/share" target="_blank" rel="noopener">Trovi&lt;/a> on
&lt;a href="https://www.chameleoncloud.org" target="_blank" rel="noopener">Chameleon&lt;/a> is an open-source service designed
to significantly enhance the &lt;a href="https://wordpress.cels.anl.gov/nimbusproject/wp-content/uploads/sites/116/2023/08/Reproducibility_On_Chameleon-3.pdf" target="_blank" rel="noopener">practical
reproducibility&lt;/a>
of computer science research. By allowing Chameleon users to upload, share, and
access packaged experiments and other research artifacts, Trovi aims to
streamline the process of replicating and building upon existing studies. This
capability is crucial in the scientific community, where the ability to
accurately reproduce research results is as fundamental to validating,
critiquing, and extending scientific findings as reading the papers themselves. Trovi can serve as a centralized hub that facilitates
the exchange of valuable research outputs, promotes transparency, and fosters
collaboration among researchers. By making experiments easier to replicate and
data easier to share, Trovi supports the advancement of reproducible, robust
computer science research.&lt;/p>
&lt;p>This project will focus on the evolution of Trovi. It will aim to enhance Trovi
as a tool to advance practical reproducibility in CS research. Students will
evaluate the most important use cases and enabling features necessary to
enhance Trovi&amp;rsquo;s functionality and user experience. With these design insights,
students will then create a robust interface that allows researchers to
integrate experiment code and data easily as packaged artifacts, similar to the
user-friendly design of Google Colab, and build off other users&amp;rsquo; artifacts to
create novel experiments, similar to the design of GitHub. Furthermore,
students will create comprehensive documentation with valuable insights into
what works well and what requires improvement, creating a dynamic feedback loop
to guide the ongoing redesign process. Lastly, students will actively
participate in designing webinars, creating and posting video tutorials, and
organizing academic events at the University of Chicago to showcase the work on
Trovi. This multifaceted project ensures a well-rounded experience and fosters
a collaborative learning environment.&lt;/p>
&lt;p>Each of the project ideas below focuses on a different aspect of the overall
goal to enhance Trovi as a tool for advancing practical reproducibility in
CS research. They are designed to offer a comprehensive approach,
from technical development to community engagement, ensuring a well-rounded
enhancement of the service.&lt;/p>
&lt;h3 id="user-interface-redesign-for-experiment-artifacts-sharing">&lt;strong>User Interface Redesign for Experiment Artifacts Sharing&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>User Interface Design&lt;/code> &lt;code>User Experience&lt;/code> &lt;code>Web Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: HTML/CSS, JavaScript, UX design principles&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate to Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium to Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Conduct user research to understand the needs and pain points of current
and potential Trovi users.&lt;/li>
&lt;li>Design wireframes and prototypes that incorporate user feedback and aim to
simplify the process of uploading, sharing, and reusing research artifacts.&lt;/li>
&lt;li>Implement the frontend redesign using a modern web framework to ensure
responsiveness and ease of use.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="packaged-artifacts-integration-system">&lt;strong>Packaged Artifacts Integration System&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Cloud Computing&lt;/code> &lt;code>Data Management&lt;/code> &lt;code>Web APIs&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, RESTful APIs, Docker, Git&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Develop a system that allows users to easily package and upload their
experimental code and data to Trovi.&lt;/li>
&lt;li>Create a standardized format or set of guidelines for packaging experiments
to ensure consistency and ease of use.&lt;/li>
&lt;li>Implement API endpoints that enable automated uploads, downloads, and
integration with other tools like GitHub or Zenodo.&lt;/li>
&lt;li>Test the system with real-world experiments to ensure reliability and ease
of integration.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="community-engagement-and-educational-materials">&lt;strong>Community Engagement and Educational Materials&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Educational Technology&lt;/code> &lt;code>Community Building&lt;/code> &lt;code>Content Creation&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Video Editing, Public Speaking, Event Planning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Design and organize webinars that introduce Trovi and its new features to
the research community.&lt;/li>
&lt;li>Create engaging video tutorials that guide users through the process of
using Trovi for their research needs.&lt;/li>
&lt;li>Develop comprehensive documentation that covers both basic and advanced use
cases, troubleshooting, and tips for effective collaboration using Trovi.&lt;/li>
&lt;li>Organize academic events, such as workshops or hackathons, that encourage
the use of Trovi for collaborative research projects.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="feedback-loop-and-continuous-improvement-system">&lt;strong>Feedback Loop and Continuous Improvement System&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Software Engineering&lt;/code> &lt;code>Data Analysis&lt;/code> &lt;code>User Feedback&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, SQL, Data Visualization, Web Development&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Implement a system within Trovi for collecting, storing, and analyzing user
feedback and usage data.&lt;/li>
&lt;li>Develop dashboards that visualize feedback trends and identify areas for
improvement.&lt;/li>
&lt;li>Create mechanisms for users to easily report bugs, request features, and
offer suggestions for the platform.&lt;/li>
&lt;li>Use the collected data to prioritize development efforts and continuously
update the platform based on user needs and feedback.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Data leakage in applied ML: reproducing examples of irreproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> applied machine learning, data leakage, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Data leakage &lt;a href="https://www.cell.com/patterns/pdfExtended/S2666-3899%2823%2900159-9" target="_blank" rel="noopener">has been identified&lt;/a> as a major cause of irreproducibility of a paper&amp;rsquo;s findings when machine learning techniques are applied to problems in science. Data leakage includes errors such as:&lt;/p>
&lt;ul>
&lt;li>pre-processing before splitting into training/test sets&lt;/li>
&lt;li>feature selection before splitting into training/test sets&lt;/li>
&lt;li>duplicated data points in both training and test sets&lt;/li>
&lt;li>temporal leakage (e.g. shuffled K-fold cross validation with temporal data)&lt;/li>
&lt;li>group leakage (e.g. shuffled K-fold cross validation with data that has group structure)&lt;/li>
&lt;/ul>
&lt;p>and leads to an overly optimistic evaluation of model performance, such that the finding may no longer be the same when the error is corrected.&lt;/p>
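&lt;p>To make the first error above concrete, here is a minimal NumPy sketch (illustrative only, not a project deliverable) of preprocessing before splitting: standardizing with statistics computed on the full dataset leaks test-set information into the training data, whereas the correct approach fits the scaler on the training split only.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=100)
train, test = X[:80], X[80:]

# Leaky: mean/std computed on ALL data, including the test set.
leaky_mean, leaky_std = X.mean(), X.std()
train_leaky = (train - leaky_mean) / leaky_std

# Correct: mean/std computed on the training split only,
# then reused (frozen) to transform the test split.
mean, std = train.mean(), train.std()
train_ok = (train - mean) / std
test_ok = (test - mean) / std
```

&lt;p>The two standardized training sets differ, because the leaky version has absorbed information about the held-out test points; with a more aggressive transform (e.g. feature selection) the resulting optimism in evaluated performance can be dramatic.&lt;/p>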
&lt;p>Despite the seriousness of this problem, data leakage is often not covered in introductory machine learning courses, and many users of machine learning across varied science domains are unaware of it. Even those who have learned &amp;ldquo;rules&amp;rdquo; for avoiding data leakage (e.g. &amp;ldquo;never do feature selection on the test set&amp;rdquo;) may not understand the reasons for these &amp;ldquo;rules&amp;rdquo;, and how important they are for ensuring that the final result is valid and reproducible.&lt;/p>
&lt;p>The goal of this project is to create &lt;em>learning materials&lt;/em> demonstrating how instances of data leakage invalidate a result. These materials should be easily adoptable by instructors teaching machine learning in a wide variety of contexts, including those teaching a non-CS audience. To achieve this, the project proposes to re-implement published results that have been affected by data leakage, and package these implementations along with supporting material in a format suitable for use in classrooms and by independent learners. For each &amp;ldquo;irreproducible result&amp;rdquo;, the &amp;ldquo;package&amp;rdquo; should include -&lt;/p>
&lt;ul>
&lt;li>a re-implementation of the original result&lt;/li>
&lt;li>an explanation of the data leakage problem affecting the result, with an implementation of a &amp;ldquo;toy example&amp;rdquo; on synthetic data&lt;/li>
&lt;li>a re-implementation of the result without the data analysis error, to show how the finding is affected&lt;/li>
&lt;li>and examples of exam or homework questions that an instructor adopting this package may use to assess understanding.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should include, for at least a few &amp;ldquo;types&amp;rdquo; of data leakage mentioned above -&lt;/p>
&lt;ul>
&lt;li>a specific published result that could be used as an exemplar (you may find ideas among the review papers listed &lt;a href="https://reproducible.cs.princeton.edu/#rep-failures" target="_blank" rel="noopener">here&lt;/a>)&lt;/li>
&lt;li>a brief description of the details of the experiment that will reproduce that result (e.g. what data is used, what machine learning technique is used, what are the hyperparameters used for training)&lt;/li>
&lt;li>and an explanation of why this result is suitable for this use (it uses a publicly available dataset, a machine learning technique that is familiar and accessible to students in an introductory course, the paper has sufficient detail to reproduce the result, etc.)&lt;/li>
&lt;/ul>
&lt;p>The contributor will need to create learning materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/p>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p>
&lt;p>To get a sense of the type of code you would be writing, here is an example of a learning module related to data leakage (however, it is not in the format described above): &lt;a href="https://colab.research.google.com/github/ffund/ml-notebooks/blob/master/notebooks/4-linear-regression-case-study-part-2.ipynb" target="_blank" rel="noopener">Beauty in the Classroom&lt;/a>&lt;/p>
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Packages&amp;rdquo; of learning materials for teaching about common types of data leakage&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifacts for &amp;ldquo;playing back&amp;rdquo; each of the &amp;ldquo;packages&amp;rdquo;&lt;/li>
&lt;/ul></description></item><item><title>Evaluating congestion controls past and future</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/congestion-control/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/congestion-control/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computer networks, congestion control, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Bash scripting, Linux, computer network performance evaluation&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ashutosh-srivastava/">Ashutosh Srivastava&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In computer networks, congestion control protocols play an outsize role in determining our experience with networked applications. New congestion control algorithms are regularly proposed by researchers to improve throughput and latency performance, adapt to new types of networks, and align more closely with the needs of new applications.&lt;/p>
&lt;p>However, our understanding of the benefits of a new congestion control protocol depends to a large extent on the evaluation - the network topology, the network delay and throughput, the type of flow, the type of competing traffic - and there is no single standard way to evaluate a congestion control protocol. The &lt;a href="https://pantheon.stanford.edu/static/pantheon/documents/pantheon-paper.pdf" target="_blank" rel="noopener">Pantheon&lt;/a> project (which is no longer supported) sought to fill this gap somewhat and address the problem of reproducibility of congestion control results, but their approach is not easily adapted to evaluation scenarios representative of new types of applications or networks. Nor is it capable of representing the evaluation scenarios in most published results related to congestion control.&lt;/p>
&lt;p>The goal of this project, therefore, is to create an evaluation suite for congestion control protocols that can be used to reproduce existing congestion control results in the academic literature, &lt;em>and&lt;/em> to evaluate new protocols under similar evaluation conditions, &lt;em>and&lt;/em> to be easily extended to new scenarios. An &amp;ldquo;evaluation scenario&amp;rdquo; includes:&lt;/p>
&lt;ul>
&lt;li>a Python notebook to realize the network topology on the FABRIC and/or Chameleon testbed, and configure the network characteristics,&lt;/li>
&lt;li>scripts to generate the data flow(s) needed for the evaluation,&lt;/li>
&lt;li>and scripts to capture data from the experiment and visualize the results.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>To write a good proposal for this project, you should review the most influential papers on TCP congestion control, and especially those related to TCP protocols that are available in the Linux kernel.&lt;/p>
&lt;p>Use your findings to explain what your proposed evaluation suite will include (what network topologies, what flow generators), and justify this with reference to the academic literature. Also indicate which &lt;em>specific results&lt;/em> you expect to be able to reproduce using this suite (e.g. include figures from influential papers showing evaluation results! with citation, of course).&lt;/p>
&lt;p>You can also take advantage of existing open source code that reproduces a congestion control result, e.g. &lt;a href="https://github.com/sdatta97/imcbbrrepro" target="_blank" rel="noopener">Replication: When to Use and When Not to Use BBR&lt;/a>, or &lt;a href="https://github.com/ashutoshs25/bbr-dominance-experiments" target="_blank" rel="noopener">Some of the Internet may be heading towards BBR dominance: an experimental study&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository for this project.&lt;/p>
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Packages&amp;rdquo; of evaluation scenarios that can be used to evaluate a congestion control algorithm implemented in the Linux kernel&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifacts for realizing each evaluation scenario on Chameleon&lt;/li>
&lt;/ul></description></item><item><title>Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/bsc/ro-crate-compss/</link><pubDate>Mon, 19 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/bsc/ro-crate-compss/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Provenance, reproducibility, standards, image creation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, JSON, Bash scripting, Linux, image creation and deployment&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/raul-sirvent/">Raül Sirvent&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>The &lt;a href="https://compss.bsc.es/" target="_blank" rel="noopener">COMPSs programming model&lt;/a> provides an interface for the programming of a
sequential application that is transformed in a workflow that, thanks to the COMPSs runtime, is later
scheduled in the available computing resources. Programming is enabled for different languages through
the use of bindings: Java, C/C++ and Python (named PyCOMPSs).
COMPSs is able to generate &lt;a href="https://compss-doc.readthedocs.io/en/stable/Sections/05_Tools/04_Workflow_Provenance.html" target="_blank" rel="noopener">Workflow Provenance information&lt;/a>
after the execution of an experiment. The generated artifact (code + data + recorded metadata)
enables results to be shared through tools such as the &lt;a href="https://workflowhub.eu/" target="_blank" rel="noopener">WorkflowHub portal&lt;/a>,
which can generate a DOI for the results so that they can be cited as permanent references
in scientific papers.&lt;/p>
&lt;p>The format of the metadata generated in COMPSs experiments follows the &lt;a href="https://www.researchobject.org/ro-crate/" target="_blank" rel="noopener">RO-Crate specification&lt;/a>,
and, more specifically, two &lt;a href="https://www.researchobject.org/ro-crate/profiles.html" target="_blank" rel="noopener">profiles&lt;/a>:
the Workflow and Workflow Run Crate profiles. This metadata enables not only the sharing of results, but also their
reproducibility.&lt;/p>
&lt;p>This project proposes the creation of a service that enables the automatic reproducibility of COMPSs experiments
in the Chameleon infrastructure. The service will take a COMPSs crate (an artifact that follows the RO-Crate
specification) and, by parsing the available metadata, build a Chameleon-compatible image for reproducing the
experiment in the testbed. Small modifications to the COMPSs RO-Crate are foreseen (e.g. the inclusion of third-party
software required by the application).&lt;/p>
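&lt;p>As a sketch of the metadata-parsing step, the snippet below reads a minimal, made-up &lt;code>ro-crate-metadata.json&lt;/code> and pulls out the entity describing the main workflow, roughly the starting point such a service would need before building a Chameleon image. The structure shown follows the general RO-Crate layout but is heavily simplified, and the entity names are illustrative.&lt;/p>

```python
import json

# A minimal, made-up ro-crate-metadata.json for illustration only.
crate_json = """{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset",
     "mainEntity": {"@id": "application.py"}},
    {"@id": "application.py",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
     "name": "PyCOMPSs example workflow"}
  ]
}"""

def main_workflow(crate):
    """Return the entity referenced by the root Dataset's mainEntity."""
    graph = {entity["@id"]: entity for entity in crate["@graph"]}
    root = graph["./"]  # the root Dataset describes the crate's contents
    return graph[root["mainEntity"]["@id"]]

crate = json.loads(crate_json)
wf = main_workflow(crate)
```

&lt;p>A real service would go further: resolving the workflow's software dependencies from the metadata and baking them into the image.&lt;/p>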
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Study the different environments and specifications (COMPSs, RO-Crate, Chameleon, Trovi, &amp;hellip;).&lt;/li>
&lt;li>Design the most appropriate integration, considering all the elements involved.&lt;/li>
&lt;li>Integrate PyCOMPSs basic experiments reproducibility in Chameleon.&lt;/li>
&lt;li>Integrate PyCOMPSs complex experiments reproducibility in Chameleon (i.e. with third-party software dependencies).&lt;/li>
&lt;/ul></description></item><item><title>BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</link><pubDate>Sat, 17 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene imputation, benchmarking, cross-platform/species analysis&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python and/or R, commonly used in bioinformatics.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with statistical data analysis and machine learning models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (not required but preferred):&lt;/strong>
&lt;ul>
&lt;li>Proficiency in bioinformatics and computational biology.&lt;/li>
&lt;li>Familiarity with spatial transcriptomics datasets and platforms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating multi-platform, multi-species datasets and the complexity of benchmarking gene imputation methods, this project is substantial. It requires extensive data preparation, analysis, and validation phases, making it suitable for a larger time investment.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>The orchestration of cellular life is profoundly influenced by the precise control of gene activation and silencing across different spatial and temporal contexts. Understanding these complex spatiotemporal gene expression patterns is vital for advancing our knowledge of biological processes, from development and disease progression to adaptation. While single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to profile gene expression across thousands of cells simultaneously, its requirement for cell dissociation strips away the critical spatial context, limiting our comprehension of cellular interactions within their native environments. Recent strides in spatial transcriptomics have started to bridge this gap by enabling spatially resolved gene expression measurements at single-cell or even sub-cellular resolutions. These advancements offer unparalleled opportunities to delineate the intricate tapestry of gene expression within tissues, shedding light on the dynamic interactions between cells and their surroundings.&lt;/p>
&lt;p>Despite these technological advances, a significant challenge remains: the datasets generated by spatial transcriptomic technologies are often incomplete, marred by missing gene expression values due to various technical and biological constraints. This limitation severely impedes our ability to fully interpret these rich datasets and extract meaningful insights from them. Gene imputation emerges as a pivotal solution to this problem, aiming to fill in these missing data points, thereby enhancing the resolution, quality, and interpretability of spatial transcriptomic datasets.&lt;/p>
&lt;p>Recognizing the critical importance of this task, there is a pressing need for a unified benchmarking platform that can facilitate the evaluation and comparison of gene imputation methods across a diverse array of samples, spanning multiple sampling platforms, species, and organs. Currently, the bioinformatics and spatial transcriptomics fields lack such a standardized framework, hindering progress and innovation. To address this gap, our project aims to establish a comprehensive gene imputation dataset that encompasses a wide range of conditions and parameters. We intend to reproduce known methods and assess their efficacy, providing a solid and reproducible foundation for future advancements in this domain.&lt;/p>
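&lt;p>The benchmarking idea can be sketched as follows (a toy NumPy example, not a project deliverable): mask a fraction of entries in a genes-by-spots expression matrix to simulate missing measurements, impute them with a trivial per-gene-mean baseline, and score the imputation by RMSE on the held-out entries. A real benchmark would simply swap the baseline for each method under evaluation.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(42)
expr = rng.poisson(lam=5.0, size=(50, 200)).astype(float)  # genes x spots

# Hold out ~10% of entries to simulate dropout / missing measurements.
mask = rng.random(expr.shape) < 0.10
observed = expr.copy()
observed[mask] = np.nan

# Trivial baseline: impute each gene's missing spots with that
# gene's mean over its observed spots.
gene_means = np.nanmean(observed, axis=1, keepdims=True)
imputed = np.where(mask, gene_means, observed)

# Score only on the held-out entries, where the ground truth is known.
rmse = np.sqrt(np.mean((imputed[mask] - expr[mask]) ** 2))
```

&lt;p>Scoring only on masked entries is the key design choice: it mirrors how the planned benchmark can compare methods fairly across platforms and species, since the ground truth for the held-out values is known by construction.&lt;/p>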
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A comprehensive, preprocessed benchmark dataset that spans multiple sampling platforms, species, and organs, aimed at standardizing gene imputation tasks in spatial transcriptomics.&lt;/li>
&lt;li>An objective comparison of state-of-the-art gene imputation methodologies, enhancing the understanding of their performance and applicability across diverse biological contexts.&lt;/li>
&lt;li>A user-friendly Python package offering a suite of gene imputation tools, designed to fulfill the research needs of the spatial transcriptomics community by improving data completeness and reproducibility.&lt;/li>
&lt;/ul></description></item><item><title>ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/</link><pubDate>Sat, 10 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/</guid><description>&lt;p>&lt;strong>Topics:&lt;/strong> Distributed systems, Scalability, Bug analysis, Bug reproducibility&lt;br>
&lt;strong>Skills:&lt;/strong> Java, Python, bash scripting, perf, Linux internals&lt;br>
&lt;strong>Difficulty:&lt;/strong> Hard&lt;br>
&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;br>
&lt;strong>Mentors:&lt;/strong> &lt;strong>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> (contact person)&lt;/strong>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>&lt;/p>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Large-scale distributed systems are integral to the infrastructure of a wide range of applications and services.
The continuous evolution of these systems requires ongoing efforts to address inherent faults which span a variety of issues including availability, consistency, concurrency, configuration, durability, error-handling, integrity, performance, and security.
Recent developments in the field and the rise of cloud computing have been marked by a notable increase in the scale at which such systems operate.&lt;/p>
&lt;p>This increase in scale introduces specific challenges, particularly in terms of system reliability and performance.
As distributed systems expand beyond single machines, addressing the growing demands for computation, memory and storage becomes more difficult.
This underlying complexity leads to the emergence of scalability bugs — defects that surface in large-scale deployments, yet do not reveal themselves in a small-scale setting.&lt;/p>
&lt;p>To better understand scalability bugs, we set out to investigate a set of scalability issues documented over the last 5 years from 10 popular open-source large-scale systems.
These bugs have led to significant operational challenges, such as system downtime, reduced responsiveness, data loss, and data corruption.
Moreover, addressing them required extensive collaboration and problem-solving efforts among engineers and bug reporters, with discussions often spanning a month or more.&lt;/p>
&lt;p>We observed that traditional bug-finding techniques are insufficient for detecting scalability bugs, since these defects are triggered by a mixture of scale-related aspects not properly investigated by previous approaches.
These characteristics include the number of components involved, the system load and workload size, the reliability of recovery protocols, and the magnitude of intermediate failures.
Although previous research examined some of these aspects, it has typically done so either in isolation or without providing a comprehensive understanding of the fundamental bug patterns, symptoms, root causes, fixes, and, more importantly, how easily these bugs can be reproduced in-house.&lt;/p>
&lt;p>Therefore, the main goal of this project is to systematically understand, characterize, and document the challenges associated with scalability bugs at large.
Our approach is twofold: first, to analyze scalability bugs in terms of reproducibility, and second, to develop methodologies for triggering them and measuring their impact.
Specifically, we aim to:&lt;/p>
&lt;ol>
&lt;li>Provide detailed accounts of bug reproduction experiences for a diverse set of recently reported scalability bugs from our benchmark applications;&lt;/li>
&lt;li>Identify specific challenges that prevent engineers from reproducing certain scalability bugs and investigate how prevalent these obstacles are;&lt;/li>
&lt;li>Create a suite of protocols to effectively trigger and quantify the impact of scalability bugs, facilitating their investigation in smaller-scale environments.&lt;/li>
&lt;/ol>
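&lt;p>As a minimal sketch of goal (3), the script below sweeps a scale parameter, measures latency, and flags superlinear growth. The quadratic workload is a synthetic stand-in for a real system operation, and the flagging threshold would be tuned per benchmark.&lt;/p>

```python
import time

def simulated_operation(num_nodes):
    # Stand-in for a system operation whose cost grows quadratically
    # with cluster size -- a typical scalability-bug signature.
    total = 0
    for i in range(num_nodes):
        for j in range(num_nodes):
            total += i * j
    return total

def timed_run(n):
    start = time.perf_counter()
    simulated_operation(n)
    return time.perf_counter() - start

def measure(scales, repeats=3):
    # Best-of-N timing per scale to reduce noise.
    return {n: min(timed_run(n) for _ in range(repeats)) for n in scales}

scales = [100, 200, 400, 800]
latency = measure(scales)

# Doubling the scale should at most roughly double latency; a larger
# ratio suggests superlinear (potentially buggy) scaling behavior.
for small, large in zip(scales, scales[1:]):
    ratio = latency[large] / max(latency[small], 1e-9)
    verdict = "SUPERLINEAR" if ratio > 2.5 else "ok"
    print(f"{small} to {large} nodes: latency x{ratio:.1f} ({verdict})")
```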
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A set of Trovi replayable artifacts enabling other researchers to easily reproduce scalability bugs for our benchmark applications;&lt;/li>
&lt;li>A set of Jupyter notebooks that allow each step of our investigation to be conveniently replayed;&lt;/li>
&lt;li>A detailed breakdown of the challenges faced when reproducing scalability bugs and how these obstacles differ from those related to more “traditional” types of bugs.&lt;/li>
&lt;/ul></description></item><item><title>GPEC: An Open Emulation Platform to Evaluate GPU/ML Workloads on Erasure Coding Storage</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Machine Learning, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Erasure Coding, Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Large-scale data centers store immense amounts of user data across a multitude of disks, necessitating redundancy strategies like erasure coding (EC) to safeguard against disk failures. Numerous research efforts have sought to assess the performance and durability of various erasure coding approaches, including single-level erasure coding, locally recoverable coding, and multi-level erasure coding.&lt;/p>
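&lt;p>As a toy illustration of the redundancy idea, the sketch below implements single-parity XOR coding (k data chunks plus one parity chunk). Production systems use Reed-Solomon or locally recoverable codes, which tolerate more simultaneous failures, but the encode/repair cycle is the same in spirit.&lt;/p>

```python
def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    out = bytes(len(chunks[0]))
    for chunk in chunks:
        out = bytes(a ^ b for a, b in zip(out, chunk))
    return out

def encode(data_chunks):
    """Append a single XOR parity chunk to form the stripe."""
    return list(data_chunks) + [xor_chunks(data_chunks)]

def repair(stripe, lost_index):
    """Rebuild one lost chunk from all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost_index]
    return xor_chunks(survivors)

data = [b"chunk-one!", b"chunk-two!", b"chunk-3333"]
stripe = encode(data)
print("rebuilt lost chunk:", repair(stripe, 1))
```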
&lt;p>Despite its widespread adoption, a significant research gap exists regarding the performance of large-scale erasure-coded storage systems when exposed to machine learning (ML) workloads. While conventional practice often leans towards replication for enhanced performance, this project seeks to explore whether cost-effective erasure coding can deliver comparable performance. In this context, several fundamental questions remain unanswered, including:&lt;/p>
&lt;ul>
&lt;li>Can a typical erasure-coded storage system deliver sufficient throughput for ML training tasks?&lt;/li>
&lt;li>Can an erasure-coded storage system maintain low-latency performance for ML training and inference workloads?&lt;/li>
&lt;li>How do disk failures and subsequent repairs impact the throughput and latency of ML workloads?&lt;/li>
&lt;li>What influence do various erasure coding design choices, such as chunk placement strategies and repair methods, have on the aforementioned performance metrics?&lt;/li>
&lt;/ul>
&lt;p>To address these questions, the most straightforward approach would involve running ML workloads on large-scale erasure coded storage systems within HPC data centers. However, this presents challenges for researchers and students due to limited access to expensive GPUs and distributed storage systems, especially when dealing with large-scale evaluations. Consequently, there is a need for a cost-effective evaluation platform.&lt;/p>
&lt;p>The objective of this project is to develop an open-source platform that facilitates cheap and reproducible evaluations of erasure-coded storage systems concerning ML workloads. This platform consists of two key components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>GPU Emulator:&lt;/strong> This emulator is designed to simulate GPU performance for ML workloads. Development of the GPU emulator is near completion.&lt;/li>
&lt;li>&lt;strong>EC Emulator:&lt;/strong> This emulator is designed to simulate the performance characteristics of erasure-coded storage systems. It is still in the exploratory phase and requires further development.&lt;/li>
&lt;/ul>
&lt;p>The student&amp;rsquo;s responsibilities will include documenting the GPU emulator, progressing the development of the EC emulator, and packaging the experiments to ensure easy reproducibility. It is anticipated that this platform will empower researchers and students to conduct cost-effective and reproducible evaluations of large-scale erasure-coded storage systems in the context of ML workloads.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Build an EC emulator to emulate the performance characteristics of large-scale erasure-coded storage systems&lt;/li>
&lt;li>Incorporate the EC emulator into ML workloads and GPU emulator&lt;/li>
&lt;li>Conduct reproducible experiments to evaluate the performance of erasure-coded storage systems in the context of ML workloads&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository with open-source code&lt;/li>
&lt;/ul></description></item><item><title>Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Cross-Platform Time-to-Digital Converters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/turnontunein/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/turnontunein/</guid><description>&lt;p>&lt;a href="https://github.com/KastnerRG/PL-Sensors" target="_blank" rel="noopener">Turn on, Tune in, Listen Up&lt;/a> Is an open-source framework for implementing voltage flucturation sensors in FPGA devices for use in side-channel security research. Side-channels are an ever present hardware security threat. The reconfigurability of FPGAs significantly broadens the side-channel attack surface in many cloud heterogeneous systems. We have developed a highly tunable side-channel sensor, which significantly improves side-channel attack time and resolution in multiple contexts. Concurrent users sharing the same device may attack one another through the power side-channel (&lt;a href="https://dl.acm.org/doi/abs/10.1145/3543622.3573193" target="_blank" rel="noopener">check out our paper&lt;/a>), while consecutive users may attack one another through measurement of the physical wear-out state of the FPGA device (&lt;a href="https://arxiv.org/abs/2303.17881" target="_blank" rel="noopener">check out our paper&lt;/a>). We have demonstrated these attack surfaces on both Intel (Altera) and AMD (Xilinx) platforms. Currently, our open-sourced sensor design and side-channel analysis flow is limited to AMD devices. We are seeking CSE/CS/CE/ECE researchers interested in FPGA design, heterogeneous computing and/or hardware security to combine our Intel and AMD side-channel sensors into a unified attack framework and comparing capabilities between vendors.&lt;/p>
&lt;h3 id="open-source-sensor-repository-updates">Open-source sensor repository updates&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Hardware security&lt;/code>, &lt;code>cloud security&lt;/code>, &lt;code>heterogeneous computing&lt;/code>, &lt;code>temporal and spatial side-channels&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Experience with GitHub, FPGA development (AMD or Intel), and Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:drichmond@ucsc.edu">Dustin Richmond&lt;/a>, &lt;a href="mailto:tsheaves@ucdavis.edu">Tyler Sheaves&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Update existing open-source voltage fluctuation sensor to support both AMD and Intel devices. Currently our repository exclusively supports AMD FPGAs. We have added new features to our sensor and have demonstrated an implementation on Intel. We would like to consolidate this work into a unified repository containing side-channel analysis demonstrations using open-source target benchmark designs.&lt;/p>
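&lt;p>The downstream analysis such a unified flow would automate can be sketched with synthetic data: correlate a leakage model (here, Hamming weight of a secret-dependent byte) against noisy sensor readings and pick the hypothesis with the highest correlation. Everything below is a numpy-only toy, not our sensor interface.&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic experiment: each trace sample leaks the Hamming weight of
# a secret-dependent byte, plus Gaussian noise from the measurement.
secrets = rng.integers(0, 256, size=5000)
hamming = np.array([bin(s).count("1") for s in secrets])
traces = hamming + rng.normal(0.0, 1.0, size=hamming.shape)

# Correlation analysis: the correct hypothesis about the secret
# transformation maximizes the absolute correlation with the traces.
def score(guess_xor):
    model = np.array([bin(s ^ guess_xor).count("1") for s in secrets])
    return abs(np.corrcoef(model, traces)[0, 1])

scores = [score(g) for g in range(16)]
best = int(np.argmax(scores))
print("best key guess:", best, "correlation:", round(scores[best], 3))
```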
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Adapt existing tooling scripts to support multiple vendor tool flows.&lt;/li>
&lt;li>Adapt existing test infrastructure to target multiple SoC-type FPGA platforms (e.g., DE10-Nano, Pynq Z2).&lt;/li>
&lt;li>Evaluate cross-platform sensor architecture on a collection of benchmark designs. Demonstrate each benchmark using a cross-platform unified side-channel analysis framework.&lt;/li>
&lt;li>Draw a comparison between sensor implementations on different architectures.&lt;/li>
&lt;/ul></description></item><item><title>Artificial Intelligence Explainability Accountability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/aiealab/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/aiealab/</guid><description>&lt;h2 id="trustworthy-logical-reasoning-large-language-models--llms">Trustworthy Logical Reasoning Large Language Models (LLMs)&lt;/h2>
&lt;p>Logical LLMs is a project to translate the output of large language models (LLMs) into a logic-based programming language (Prolog) to detect inconsistencies and hallucinations automatically. This project will build a user interface through which users can give feedback that is incorporated into the system. The goal is to create a trustworthy, hybrid, open-source LLM tool that can learn from user feedback and explain its mistakes.&lt;/p>
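&lt;p>A toy version of the consistency check, written in Python rather than Prolog for brevity (the actual system asserts statements as Prolog facts and lets the logic engine derive contradictions; the triple representation below is purely illustrative):&lt;/p>

```python
# Toy consistency checker over (subject, predicate, object) triples
# "extracted" from LLM output. The real system expresses these as
# Prolog facts so that richer inference rules can find contradictions.

facts = set()
contradictions = []

def assert_fact(subj, pred, obj, negated=False):
    key = (subj, pred, obj, negated)
    opposite = (subj, pred, obj, not negated)
    if opposite in facts:
        contradictions.append((key, opposite))
    facts.add(key)

# Statements from two different model responses:
assert_fact("paris", "capital_of", "france")
assert_fact("paris", "capital_of", "france", negated=True)  # hallucination

print("contradictions found:", len(contradictions))
```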
&lt;h3 id="collect-hallucinations-and-facts">Collect Hallucinations and Facts&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: AI/ML, data collection, logic, user interfaces&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: javascript, html, python, bash, git&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Easy/Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/leilani-h.-gilpin/">Leilani H. Gilpin&lt;/a> (and a PhD student TBD).&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks&lt;/h3>
&lt;ul>
&lt;li>Run queries in an LLM API with various prompts.&lt;/li>
&lt;li>Create a user interface system that collects user feedback in a web
browser.&lt;/li>
&lt;li>Create a pipeline for storing the user data in a common format that
can be shared in our database.&lt;/li>
&lt;li>Document the tool for future maintenance.&lt;/li>
&lt;/ul>
&lt;h2 id="explaining-failures-in-autograding">Explaining failures in autograding&lt;/h2>
&lt;p>The eXplainable autograder (XAutograder) is a tool for autograding student coding assignments while providing personalized explanations and feedback. The goal of this project is to create an introductory set of coding assignments with explanations of wrong answers. This benchmark suite will be used for testing our system. The project goal is to create a dynamic autograding system that can learn from students&amp;rsquo; code and explain their mistakes.&lt;/p>
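&lt;p>A minimal sketch of the grading-plus-explanation loop we have in mind (function names and explanation strings are illustrative, not the XAutograder API): run a submission against unit tests and attach a baseline explanation to each failure class.&lt;/p>

```python
# Minimal autograding harness: run a student function against unit
# tests and map each failure class to a baseline explanation.

BASELINE_EXPLANATIONS = {
    IndexError: "Your code read past the end of a list; check loop bounds.",
    TypeError: "A value had an unexpected type; check your return values.",
    "wrong_answer": "The code ran but produced an incorrect result.",
}

def grade(student_fn, test_cases):
    feedback = []
    for args, expected in test_cases:
        try:
            got = student_fn(*args)
        except Exception as exc:
            msg = BASELINE_EXPLANATIONS.get(type(exc), str(exc))
            feedback.append(("error", msg))
            continue
        if got != expected:
            feedback.append(("fail", BASELINE_EXPLANATIONS["wrong_answer"]))
        else:
            feedback.append(("pass", ""))
    return feedback

# Example: a buggy "max of list" submission that indexes out of range.
def student_max(xs):
    return xs[len(xs)]  # off-by-one bug

report = grade(student_max, [(([3, 1, 2],), 3)])
print(report)
```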
&lt;h3 id="design-introductory-questions-and-explanations">Design introductory questions and explanations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: AI/ML, AI for education, XAI (Explainable AI)&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: python, git&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/leilani-h.-gilpin/">Leilani H. Gilpin&lt;/a> (and a PhD student TBD).&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks&lt;/h3>
&lt;ul>
&lt;li>Design 5-10 basic programming questions (aggregated from online,
other courses, etc).&lt;/li>
&lt;li>Create tests of correctness (unit tests), and a testing framework
that can take a set of answers as input and provide a final assessment.&lt;/li>
&lt;li>Create a set of baseline explanations for various error cases, e.g.,
out of bounds error, syntax error, etc.&lt;/li>
&lt;li>Create a pipeline for iterating on the test cases and/or explanation
feedback.&lt;/li>
&lt;li>Document the tool for future maintenance.&lt;/li>
&lt;/ul></description></item><item><title>Causeway: Learning Web Development Through Micro-Roles</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/causeway/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/causeway/</guid><description>&lt;p>&lt;a href="https://tech4good-causeway.web.app/#/tutorial/quarter-goals?c=01-quarter-goals-component&amp;amp;r=01-component-elements&amp;amp;s=01-intro-to-causeway" target="_blank" rel="noopener">Causeway&lt;/a> is a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack. Most online coding tutorials focus on covering the technical syntax or features of a language or framework, which means that new developers don’t have great resources for building a holistic picture of how everything they learn connects to actually developing a complex web application. Causeway breaks down the process of developing a web application into a hierarchy of micro-roles which provides learners with a clear pathway for learning that also translates to a clear process for developing an application. In the longer future, this would also enable learners to easily contribute to projects as they learn through taking on micro-roles for yet-to-be-developed projects. The platform uses the &lt;a href="https://developer.stackblitz.com/platform/api/webcontainer-api" target="_blank" rel="noopener">Stackblitz WebContainer API&lt;/a> to run full applications in the browser for interactive learning.&lt;/p>
&lt;p>Thus far, we have developed a version of the platform that walks learners through the process of developing presentational components of a web application, as well as smart components / containers that contain multiple presentational components and are responsible for fetching data from the backend and handling events and updates to the database. This content still uses Angular 13 and needs to be updated to Angular 17, along with improvements to our use of RxJS, NgRx, and Firebase. We’d also like to extend the content in multiple ways, including: 1) extending the walkthrough to more components and containers besides the single example we have, ideally in a way that covers a complete application, and 2) extending beyond components and containers to cover defining database entities and relationships. We’d also like to develop a learning dashboard where users can see the different micro-roles and lessons that they’ve completed or that are upcoming for the project they are working on.&lt;/p>
&lt;h3 id="causeway--improving-the-core-infrastructure-and-experience">Causeway / Improving the Core Infrastructure and Experience&lt;/h3>
&lt;p>The proposed work includes updating the platform and the example infrastructure within the platform to the latest version of Angular and other associated libraries, implementing and testing logging and analytics, implementing a learning dashboard for users, and time permitting, creating new modules to cover defining database entities and relationships. Both roles will also contribute to running usability studies and documenting the platform so that it can be open-sourced.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="causeway--extend-the-learning-scope-and-experience">Causeway / Extend the Learning Scope and Experience&lt;/h3>
&lt;p>The proposed work includes extending the component and container walkthroughs to cover a complete interactive application. This means writing a separate simple application, and organizing the code required to do so into units of work organized by our micro-role structure. Both roles will also contribute to running usability studies and documenting the platform so that it can be open-sourced.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LAST: Let’s Adapt to System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Computer systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Data Science and Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The performance of computer systems is constantly evolving, a natural outcome of updating hardware, improving software, and encountering hardware quirks over time. At the same time, machine learning (ML) models are becoming increasingly popular. They are being used widely to address various challenges in computer systems, notably in speeding up decision-making. This speed is vital for a quick and flexible response, essential for meeting service-level agreements (SLAs). Yet, an interesting twist has emerged: like the computer systems they aid, ML models also experience a kind of &amp;ldquo;aging.&amp;rdquo; This results in a gradual decline in their effectiveness, a consequence of changes in their operating environment.&lt;/p>
&lt;p>The phenomenon of model &amp;ldquo;aging&amp;rdquo; is a ubiquitous occurrence across various domains, not limited merely to computer systems. This process of aging can significantly impact the performance of a model, emphasizing the critical importance of early detection mechanisms to maintain optimal functionality. In light of this, numerous strategies have been formulated to mitigate the aging of models. However, the generalizability and effectiveness of these strategies across diverse domains, particularly in computer systems, remain largely unexplored. This research aims to bridge this gap by designing and implementing a comprehensive data analysis pipeline. The primary objective is to evaluate the efficacy of various strategies through a comparative analysis, focusing on their performance in detecting and addressing model aging. To achieve a better understanding of this issue, the research will address the following pivotal questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Data-Induced Model Aging&lt;/strong>: What specific variations within the data can precipitate the aging of a model? Understanding the nature and characteristics of data changes that lead to model deterioration is crucial for developing effective prevention and mitigation strategies.&lt;/li>
&lt;li>&lt;strong>Efficacy of Aging Detection Algorithms&lt;/strong>: How proficient are the current algorithms in identifying the signs of model aging? Assessing the accuracy and reliability of these algorithms will provide insights into their practical utility in real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Failure Points in Detection&lt;/strong>: In what scenarios or under what data conditions do the aging detection mechanisms fail? Identifying the limitations and vulnerabilities of these algorithms is vital for refining their robustness and ensuring comprehensive coverage.&lt;/li>
&lt;li>&lt;strong>Scalability and Responsiveness&lt;/strong>: How do these algorithms perform in terms of robustness and speed, particularly when subjected to larger datasets? Evaluating the scalability and responsiveness of the algorithms will determine their feasibility and effectiveness in handling extensive and complex datasets, a common characteristic in computer systems.&lt;/li>
&lt;/ul>
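&lt;p>One of the simpler baselines such a pipeline would include is a two-sample Kolmogorov-Smirnov check on a feature distribution. The numpy-only sketch below flags drift between a training-time sample and a live sample; the data and threshold are illustrative.&lt;/p>

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Max gap between the two empirical CDFs (the K-S statistic)."""
    a, b = np.sort(sample_a), np.sort(sample_b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
train_latency = rng.normal(10.0, 1.0, size=5000)  # distribution at training time
live_latency = rng.normal(12.0, 1.5, size=5000)   # system after an upgrade

stat = ks_statistic(train_latency, live_latency)
print(f"K-S statistic: {stat:.3f}")
if stat > 0.1:  # in practice, derived from the K-S test's significance level
    print("drift detected: consider retraining the model")
```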
&lt;p>To better understand and prevent issues related to model performance, our approach involves analyzing various datasets, both system and non-system, that have shown notable changes over time. We aim to apply machine learning (ML) models to these datasets to assess the effects of these changes on model performance. Our goal is to leverage more advanced ML techniques to create new algorithms that address these challenges effectively. This effort is expected to contribute significantly to the community, enhancing the detection of model aging and improving model performance in computer systems.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Run the pipeline on several computer-system and non-computer-system datasets&lt;/li>
&lt;li>A Trovi artifact for data preprocessing and model training shared on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the pipeline source code&lt;/li>
&lt;/ul></description></item><item><title>Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/</link><pubDate>Tue, 06 Feb 2024 15:00:00 -0500</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/</guid><description>&lt;p>At the heart of evaluating reproducibility is a judgment about whether
two results are indeed
the same. This can be complicated in the context of data visualization due to
rapidly evolving technology and differences in how users perceive the results.
First, due to the rapid evolution of libraries including web technologies,
visualizations created in the past may look different when rendered in the future.
Second, as the goal of data visualization is communicating data to people,
different people may perceive visualizations in a different way.
Thus, when a reproduced visualization does not exactly match the original, judging
whether they are &amp;ldquo;similar enough&amp;rdquo; is complicated by these factors. For example,
changes in a colormap may be deemed minor by a computer but could lead people to different
understandings of the data. The goals of this research are to capture visualizations in a way that
allows their reproducibility to be evaluated and to develop methods to categorize the differences
when a reproduced visualization differs from the original.&lt;/p>
&lt;h3 id="investigate-solutions-for-capturing-visualizations">Investigate Solutions for Capturing Visualizations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Reproducibility, Data Visualization&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python and/or JavaScript, Data Visualization Tools&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to investigate, augment, and/or develop solutions to capture
visualizations that appear in formats including websites and Jupyter notebooks.
In &lt;a href="https://github.com/simprov/simprov" target="_blank" rel="noopener">past work&lt;/a>, we implemented methods
to capture thumbnails as users interacted with visualizations. Other solutions
can be used to capture interactive visualizations. We wish to understand
the feasibility of recording such visualizations and their utility in
evaluating reproducibility in the future.&lt;/p>
&lt;h5 id="specific-tasks">Specific tasks:&lt;/h5>
&lt;ul>
&lt;li>Evaluate tools for capturing static visualizations on the web&lt;/li>
&lt;li>Investigate tools for capturing dynamic visualizations on the web&lt;/li>
&lt;li>Investigate how data including code or metadata can be captured with visualizations&lt;/li>
&lt;li>Augment or develop tools to aid in capturing reproducible visualizations&lt;/li>
&lt;/ul>
&lt;h3 id="categorize-differences-in-reproduced-visualizations">Categorize Differences in Reproduced Visualizations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Reproducibility, Data Visualization&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python and/or JavaScript, Data Visualization Tools&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate/Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to organize types of differences in reproduced visualizations and create tools to detect them. Publications and computational notebooks record renderings of visualizations.
When they also include the code to reproduce the visualization, we can
regenerate them in order to compare them. Often, the reproduced visualization does
not match the original (see examples in this &lt;a href="https://arxiv.org/abs/2308.06894" target="_blank" rel="noopener">manuscript&lt;/a>).
This project seeks to categorize the types of differences
that can occur and to begin understanding how they impact judgments of reproducibility.&lt;/p>
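&lt;p>A first-pass difference detector can start at the pixel level before moving to perceptual measures. The numpy sketch below reports the fraction of pixels that differ beyond a tolerance between an original and a reproduced rendering; synthetic arrays stand in for captured images, and a real tool would add color-space-aware metrics to catch changes such as a swapped colormap.&lt;/p>

```python
import numpy as np

def pixel_diff_report(original, reproduced, tol=8):
    """Fraction of pixels whose RGB values differ by more than tol."""
    delta = np.abs(original.astype(int) - reproduced.astype(int))
    changed = np.any(delta > tol, axis=-1)
    return changed.mean()

rng = np.random.default_rng(0)
orig = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)

# A "reproduction" with a shifted red channel in one quadrant,
# mimicking a colormap or rendering difference.
repro = orig.copy()
repro[:50, :50, 0] = np.clip(repro[:50, :50, 0].astype(int) + 40, 0, 255)

frac = pixel_diff_report(orig, repro)
print(f"{frac:.1%} of pixels differ beyond tolerance")
```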
&lt;h5 id="specific-tasks-1">Specific tasks:&lt;/h5>
&lt;ul>
&lt;li>Evaluate and/or develop tools to compare two visualizations&lt;/li>
&lt;li>Evaluate the utility of artificial intelligence solutions&lt;/li>
&lt;li>Organize and categorize the detected differences&lt;/li>
&lt;li>Develop tools to determine the types or categories of differences present in two visualizations&lt;/li>
&lt;/ul></description></item><item><title>FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</link><pubDate>Tue, 06 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of modern applications, achieving not only low but also predictable response times is a critical requirement. Performance instability, even when it amounts to just a few milliseconds of delay, can result in violations of Service Level Objectives (SLOs). Redundancy at the RAID group level provides a layer of protection; however, the early identification of potential slowdowns or failures is paramount in minimizing their impact on overall system latency.&lt;/p>
&lt;p>Fail-Slow represents a unique type of fault within storage systems, characterized by the system&amp;rsquo;s ability to continue functioning while progressively deteriorating – its performance significantly drops below expected levels. Notably, fail-slow conditions are responsible for a considerable share of latency tails. Detecting fail-slow faults is particularly challenging, as they can be easily masked by the normal fluctuations in performance. Consequently, the identification of fail-slow faults is a critical area of research, demanding meticulous attention.&lt;/p>
&lt;p>Several strategies have been developed to address the fail-slow issue, yet the question of their broad applicability remains. We plan to implement and assess various existing fail-slow detection algorithms, examining their strengths and weaknesses. Our analysis will concentrate on key questions:&lt;/p>
&lt;ul>
&lt;li>How promptly can the algorithm identify a fail-slow symptom?&lt;/li>
&lt;li>What methods does the algorithm employ to accurately distinguish fail-slow incidents, thereby minimizing false negatives?&lt;/li>
&lt;li>Through what approach does the algorithm achieve the right sensitivity level to keep false positives in check?&lt;/li>
&lt;/ul>
&lt;p>This evaluation aims to shed light on the effectiveness of current methodologies in detecting fail-slow faults, crucial for enhancing system reliability and performance.&lt;/p>
&lt;p>Building upon our evaluation of several fail-slow detection algorithms, our objective is to harness advanced machine learning (ML) models to develop a novel algorithm. This initiative seeks to address and potentially compensate for the identified weaknesses in existing methodologies. By focusing on the critical aspects of early detection, accurate differentiation, and optimal sensitivity, we aim to create a solution that reduces both false negatives and false positives, thereby enhancing overall system reliability. This approach represents a strategic effort to not only advance the current state of fail-slow detection but also to contribute significantly to the resilience and performance of storage systems.&lt;/p>
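&lt;p>To make these detection trade-offs concrete, here is a minimal sketch of one possible threshold-style detector (not any specific published algorithm; the window size, multiplier, and patience parameters are illustrative assumptions). Requiring several consecutive elevated windows trades detection promptness for fewer false positives:&lt;/p>

```python
from collections import deque
from statistics import median

def detect_fail_slow(latencies, baseline, window=5, factor=2.0, patience=3):
    """Return the index where a sustained slowdown is confirmed, or None."""
    recent = deque(maxlen=window)
    streak = 0
    for i, lat in enumerate(latencies):
        recent.append(lat)
        if len(recent) == window and median(recent) > factor * baseline:
            streak += 1            # elevation must persist, not just spike
            if streak == patience:
                return i
        else:
            streak = 0
    return None
```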
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A Trovi artifact for the existing Fail-Slow detection algorithms on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the full evaluation result&lt;/li>
&lt;li>A Google Colab notebook for quick replay&lt;/li>
&lt;/ul></description></item><item><title>Open Sensing Platform (OSP)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/</link><pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/</guid><description>&lt;h2 id="open-sensing-platform-i-software-to-enable-large-scale-outdoor-sensor-networks">Open Sensing Platform I: Software to enable large scale outdoor sensor networks&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Data Visualization Dashboard" srcset="
/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp 400w,
/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_baae6484e015277af7b09e866b6869f5.webp 760w,
/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp"
width="760"
height="759"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Backend, Web Development, UI/UX, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> React, Javascript, Python, SQL, Git&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> Flask, Docker, CI/CD, AWS, Authentication&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:awu70@ucsc.edu">Aaron Wu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Open Sensing Platform (OSP) is a new initiative expanding from our prior project DirtViz, a data visualization web platform for monitoring microbial fuel cell sensors (see &lt;a href="https://github.com/jlab-sensing/DirtViz" target="_blank" rel="noopener">GitHub&lt;/a>). The mission is to scale up the current platform to support other researchers or citizen scientists in integrating their novel sensing hardware or microbial fuel cell sensors for monitoring and data analysis. Examples of the types of sensors currently deployed are sensors measuring soil moisture, temperature, current, and voltage in outdoor settings. The focus of the software half of the project involves building upon our existing visualization web platform, and adding additional features to support the mission. A live version of the website is available &lt;a href="https://dirtviz.jlab.ucsc.edu/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deliverables:&lt;/strong>
&lt;ul>
&lt;li>Create a system for remote collaborators/citizen scientists to set up their sensors and upload data securely, e.g., designing a user flow for creating sensors&lt;/li>
&lt;li>Craft an intuitive navigation system so that data from deployment sites around the world can be easily viewed, e.g., designing an experience/system for locating deployment sites.&lt;/li>
&lt;li>Refine our web-based visualization tools with additional features for users to analyze collected data, e.g., lazy loading out-of-range data or caching queried data.&lt;/li>
&lt;li>Document the tool thoroughly for future maintenance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
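&lt;p>As a hedged sketch of the &amp;ldquo;caching queried data&amp;rdquo; deliverable (the class name, TTL value, and fetch callable are assumptions, not part of the DirtViz codebase), a small time-to-live cache keyed by sensor and time range might look like:&lt;/p>

```python
import time

class QueryCache:
    """Tiny TTL cache for sensor time-series queries (illustrative only)."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable that actually hits the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}           # key -> (expiry timestamp, cached rows)

    def get(self, sensor_id, start, end):
        key = (sensor_id, start, end)
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]        # still fresh: serve from cache
        rows = self._fetch(sensor_id, start, end)
        self._store[key] = (self._clock() + self._ttl, rows)
        return rows
```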
&lt;h2 id="open-sensing-platform-ii-hardware-to-enable-large-scale-outdoor-sensor-networks">Open Sensing Platform II: Hardware to enable large scale outdoor sensor networks&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Hardware" srcset="
/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_b844f987d1fd7b63009c6d2a89b9dcf2.webp 400w,
/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_3199ed5510eaff77a8cf1f93ae26f10d.webp 760w,
/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_b844f987d1fd7b63009c6d2a89b9dcf2.webp"
width="760"
height="521"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Embedded system, wireless communication, low-power remote sensing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> C/C++, Git, Github, Platformio&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> PCB design and debugging experience, STM32 HAL, ESP32 Arduino, protobuf, Python, knowledge of standard communication protocols (I2C, SPI, and UART)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:sgtaylor@ucsc.edu">Stephen Taylor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Open Sensing Platform hardware aims to be a general purpose hardware platform for outdoor sensing (e.g. agriculture, ecological monitoring, etc.). The typical use case involves a sensor deployment in an agricultural field, remotely uploading measurements without interfering with farming operations. The current hardware revision (&lt;a href="https://github.com/jlab-sensing/soil_power_sensor" target="_blank" rel="noopener">Soil Power Sensor&lt;/a>) was originally designed for monitoring power output of microbial fuel cells using high fidelity voltage and current measurement channels, as well as auxiliary sensors such as the SDI-12 &lt;a href="https://metergroup.com/products/teros-12/" target="_blank" rel="noopener">TEROS-12 soil moisture sensor&lt;/a>. The primary activities of this project will involve low-level firmware design and implementation, but may also incorporate hardware design revisions if necessary. We are looking to expand functionality to other external sensors, as well as optimize for power consumption, via significant firmware design activities.&lt;/p>
&lt;p>Long-range, low-power wireless communication is achieved through a LoRa-capable STM32 microcontroller, with in-lab experiments using an ESP32 microcontroller to enable the simpler WiFi interface. Both wireless interfaces upload measurements to our data visualization dashboard, &lt;strong>Open Sensing Platform I&lt;/strong>. The combined goal across both of these projects is to create a system that enables researchers to test and evaluate novel sensing solutions. We want the device to be usable by a wide range of researchers who may not have a background in electronics, so we are interested in design activities that enhance user-friendliness.&lt;/p>
&lt;p>In total there will be 2-4 people working on the hardware with progress being tracked on GitHub. Broader project planning is tracked through a Jira board. We intend to have weekly meetings to provide updates on current issue progress along with assigning tasks. Please reach out to &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a> if there are any questions or specific ideas for the project.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deliverables:&lt;/strong> Contribution via commits to the GitHub repository with documentation on completed work. A changelog of contributions to the firmware.&lt;/li>
&lt;/ul></description></item><item><title>OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</link><pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Java, Bash scripting, Linux, HDFS, ZFS, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (&lt;a href="mailto:wangm12@uchicago.edu">Main contact person&lt;/a>) and Anjus George&lt;/li>
&lt;/ul>
&lt;p>Multi-Level Erasure Coding (MLEC), which performs erasure coding at both network and local levels, has seen large deployments in practice. Our recent research work has shown that MLEC can provide high durability with higher encoding throughput and less repair network traffic compared to other erasure coding methods. This makes MLEC particularly appealing for large-scale data centers, especially high-performance computing (HPC) systems.&lt;/p>
&lt;p>However, current MLEC systems often rely on straightforward design choices, such as Clustered/Clustered (C/C) chunk placement and the Repair-All (RALL) method for catastrophic local failures. Our recent simulations [1] have revealed the potential benefits of more complex chunk placement strategies like Clustered/Declustered (C/D), Declustered/Clustered (D/C), and Declustered/Declustered (D/D). Additionally, advanced repair methods such as Repair Failed Chunks Only (RFCO), Repair Hybrid (RHYB), and Repair Minimum (RMIN) have shown promise for improving durability and performance according to our simulations. Despite promising simulation results, these optimized design choices have not been implemented in real systems.&lt;/p>
&lt;p>In this project, we propose to develop open-source MLEC implementations in real systems, offering a range of design choices from simple to complex. Our approach leverages ZFS for local-level erasure coding and HDFS for network-level erasure coding, supporting both clustered and declustered chunk placement at each level. The student&amp;rsquo;s responsibilities include setting up HDFS on top of ZFS, configuring various MLEC chunk placements (e.g., C/D, D/C, D/D), and implementing advanced repair methods within HDFS and ZFS. The project will culminate in reproducible experiments to evaluate the performance of MLEC systems under different design choices.&lt;/p>
&lt;p>We will open-source our code and aim to provide valuable insights to the community on optimizing erasure-coded systems. Additionally, we will provide comprehensive documentation of our work and share Trovi artifacts on Chameleon Cloud to facilitate easy reproducibility of our experiments.&lt;/p>
&lt;p>[1] Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, and Haryadi S. Gunawi. Design Considerations and Analysis of Multi-Level Erasure Coding in Large- Scale Data Centers. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’23), 2023.&lt;/p>
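&lt;p>To illustrate the placement terminology at a single level (a toy sketch with assumed disk counts and stripe widths; real MLEC placement spans both network and local levels), clustered placement pins each stripe to a fixed disk group, while declustered placement rotates stripes across all disks so repair load spreads evenly:&lt;/p>

```python
def clustered_placement(stripe_id, width, disks):
    """Clustered: each stripe lives on one fixed group of `width` disks."""
    groups = len(disks) // width
    g = stripe_id % groups
    return [disks[g * width + i] for i in range(width)]

def declustered_placement(stripe_id, width, disks):
    """Declustered: stripes rotate across all disks, spreading repair I/O."""
    n = len(disks)
    return [disks[(stripe_id + i) % n] for i in range(width)]
```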
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Open-source MLEC implementations with a diverse range of design choices.&lt;/li>
&lt;li>Configuration setup for HDFS on top of ZFS, supporting various MLEC chunk placements.&lt;/li>
&lt;li>Implementation of advanced repair methods within HDFS and ZFS.&lt;/li>
&lt;li>Reproducible experiments to assess the performance of MLEC systems across distinct design choices.&lt;/li>
&lt;li>Comprehensive documentation of the project and the provision of shared Trovi artifacts on Chameleon Cloud for ease of reproducibility.&lt;/li>
&lt;/ul></description></item><item><title>EdgeRep: Reproducing and benchmarking edge analytic systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> video analytics, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/junchen-jiang/">Junchen Jiang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>With the flourishing of ideas like smart cities and smart manufacturing, a
massive number of edge devices (e.g., traffic or security cameras,
thermometers, flood sensors, etc.) are deployed and connected to the network.
These devices collect and analyze data across space and time, aiding
stakeholders like city governments and manufacturers in optimizing their plans
and operations. However, the sheer number of edge devices and the large amount
of communication among the devices and central servers raise significant
challenges in how to manage and schedule resources. This includes network
bandwidth between the devices and computing power on both edge devices and bare
metal servers, all to maintain the reliable service capability of running
applications.&lt;/p>
&lt;p>Moreover, given the limited resources available to edge devices, there&amp;rsquo;s an
emerging trend to reduce average compute and/or bandwidth usage. This is
achieved by leveraging the uneven distribution of interesting events with
respect to both time and space in the input data. This, in turn, introduces
further challenges in provisioning and managing the amount of resources
available to edge devices. The resource demands of running applications can
greatly depend on the input data, which is both dynamic and unpredictable.&lt;/p>
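&lt;p>As a toy illustration of this kind of input-dependent adaptation (the rates and thresholds are assumed values, not the team&amp;rsquo;s actual resource manager), a frame-sampling controller might raise its rate under bursts of detected events and lower it when the scene is idle:&lt;/p>

```python
def next_sampling_rate(events_in_window, current_fps, min_fps=1, max_fps=30):
    """Double the frame rate under event bursts, halve it when idle."""
    if events_in_window > 5:       # burst of interesting events
        return min(max_fps, current_fps * 2)
    if events_in_window == 0:      # nothing happening: save compute/bandwidth
        return max(min_fps, current_fps // 2)
    return current_fps
```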
&lt;p>Keeping these challenges in mind, the team previously designed and implemented
a dynamic resource manager capable of understanding the applications and making
decisions based on this understanding at runtime. However, such a resource
manager has only been tested with a limited number and types of video analytic
applications. Thus, through the OSRE24 project, we aim to:&lt;/p>
&lt;ul>
&lt;li>Collect a wide range of videos to form a comprehensive video dataset&lt;/li>
&lt;li>Reproduce other state-of-the-art self-adaptive video analytic applications&lt;/li>
&lt;li>Package the dataset as well as the application to publish them on Chameleon
Trovi site&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Collect a wide range of videos to form a comprehensive video dataset&lt;/li>
&lt;li>Reproduce other state-of-the-art self-adaptive video analytic applications&lt;/li>
&lt;li>Package the dataset as well as the application to publish them on Chameleon
Trovi site&lt;/li>
&lt;/ul></description></item><item><title>FEP-Bench: Benchmarks for understanding featuring engineering and preprocessing bottlenecks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage system, scheduling, distributed system, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet
often underappreciated phase, consuming approximately 80% of the time in common
ML tasks. This extensive time consumption can be attributed to various
challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data
from data lakes, which are storage repositories that hold a vast amount of raw
data in its native format. However, the process of extracting this data can be
slow, causing computation cycles to wait for data arrival and leading to delays
in the entire preprocessing phase. Furthermore, the size of the data often
exceeds the memory capacity of standard computing systems. This is a frequent
occurrence in ML, as datasets are typically large and complex. Handling such
large datasets requires sophisticated memory management techniques to ensure
efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially
aggregation, often leads to inefficiencies. These operations may require
grouping a large chunk of data as a prerequisite before performing any actual
computation. This grouping, without careful configuration and management, can
trigger serious data shuffling, leading to extensive remote data movement when
the data is distributed across various storage systems. Such data movement is
not only time-consuming but also resource-intensive.&lt;/p>
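&lt;p>A small sketch can make the shuffling cost concrete (a toy model with assumed partitioning, not Hadoop or Spark internals): map-side pre-aggregation sends at most one partial result per key per partition across the network, instead of every raw record:&lt;/p>

```python
def shuffled_records(partitions, combine=False):
    """Return the (key, value) records each partition ships to reducers."""
    if not combine:
        # naive aggregation: every raw record crosses the network
        return [rec for part in partitions for rec in part]
    sent = []
    for part in partitions:
        partial = {}               # map-side combiner: local pre-aggregation
        for key, value in part:
            partial[key] = partial.get(key, 0) + value
        sent.extend(sorted(partial.items()))  # one record per key per partition
    return sent
```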
&lt;p>To mitigate these challenges, there is a pressing need to design better
caching, prefetching, and heuristic strategies for data preprocessing. The team
aims to significantly reduce the time and resources required for preprocessing
by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic
understanding of the preprocessing workflow is essential. Hence, throughout the
program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for
example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance in these operations under the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or
evaluation.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for
example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance in these operations under the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul></description></item><item><title>FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniar-h.-kurniawan/">Daniar H. Kurniawan&lt;/a> (primary contact), Haryadi Gunawi&lt;/li>
&lt;/ul>
&lt;p>The contemporary landscape of high-performance servers, particularly those designed for data centers and AI/ML training, prominently features solid-state drives (SSDs) and spinning disks (HDDs) as primary storage devices. These components play a crucial role in shaping overall system performance, underscoring the importance of addressing and minimizing Input/Output (I/O) latency. This is particularly crucial given the widespread adoption of hybrid storage systems, where caching and prefetching strategies are instrumental in optimizing storage performance. Caching involves using faster but less dense memory to store frequently accessed data, while prefetching aims to reduce latency by fetching data from slower memory to cache before it is needed. Although both caching and prefetching present valid challenges, our primary emphasis is on the prefetching problem due to the inherent difficulty in predicting future access.&lt;/p>
&lt;p>Traditional prefetchers, dating back one to two decades, rely heavily on predefined rules for prefetching based on LBA access sequences, limiting their adaptability to complex scenarios. For instance, the read-ahead prefetcher is confined to prefetching the next data item within a file for faster sequential access. Addressing this limitation, recent advancements include learning-based methods, such as Long Short-Term Memory (LSTM) techniques like DeepPrefetcher and Delta LSTM, which model the LBA delta to cover a broader range of LBAs. However, they still struggle to achieve high accuracy when the workload pattern changes drastically. Although some sophisticated prefetchers are capable of learning complex I/O access patterns using graph structures, they face deployment challenges due to their computational cost.&lt;/p>
&lt;p>In this project, our goal is to provide an end-to-end data science pipeline to empower the research on ML-based prefetchers. We believe that this pipeline is crucial for fostering active collaboration between the ML community and storage systems researchers. This collaboration aims to optimize existing ML-based prefetching solutions. Specifically, we will provide the dataset for training/testing and some samples of ML-based models that can further be developed by the community. Furthermore, we will also provide a setup for evaluating the ML model when deployed in storage systems.&lt;/p>
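&lt;p>As one hypothetical preprocessing step such a pipeline would need (the function names and window length are our assumptions), a raw LBA trace can be turned into the delta sequence that delta-based models train on, plus sliding-window (history, next-delta) training pairs:&lt;/p>

```python
def lba_deltas(lbas):
    """Successive differences of a logical block address sequence."""
    return [b - a for a, b in zip(lbas, lbas[1:])]

def training_pairs(deltas, history=3):
    """Build (past deltas, next delta) examples for a sequence model."""
    pairs = []
    for i in range(len(deltas) - history):
        pairs.append((deltas[i:i + history], deltas[i + history]))
    return pairs
```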
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Compile I/O traces from various open traces and open systems.&lt;/li>
&lt;li>Develop a pipeline for building ML-based prefetching solutions.&lt;/li>
&lt;li>Build a setup to evaluate the model in a real hybrid storage system.&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/</guid><description>&lt;p>&lt;strong>Project Idea description&lt;/strong>&lt;/p>
&lt;p>We aim to characterize the performance of genomic workflows on HPC clusters by conducting two research activities using a broad set of state-of-the-art genomic applications and open-source datasets.&lt;/p>
&lt;p>&lt;strong>Performance Benchmarking and Characterizing Genomic Workflows:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: High Performance Computing (HPC), Data Analysis, Scientific Workflows&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration, Genomics Applications (e.g. BWA, FastQC, Picard, GATK, STAR)&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In this activity, students will perform comprehensive performance measurements of genomic data processing on HPC clusters using state-of-the-art applications, workflows, and real-world datasets. They will collect and package datasets for I/O, memory, and compute utilization using industry-standard tools and best practices. Measurement will be done using Kubernetes container orchestration on a multi-node cluster to achieve scalability, with either a custom-made metrics collection system or integration of existing industry-standard tools (e.g., Prometheus).&lt;/p>
&lt;p>&lt;strong>Quantifying Performance Interference and Assessing Their Impact on Workflow Execution Time:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine Learning, Data Analysis, and Scientific Workflows and Computations&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In this activity, students will measure the slowdown of various applications due to resource contention (e.g., CPU and I/O). Students will analyze whether an application is compute-bound, I/O-bound, or both, then analyze the correlation between resource utilization and execution time. Following that, students will assess the impact of per-application slowdown on the slowdown of a whole workflow. To the best of our knowledge, this will be the first study that systematically quantifies per-application interference when running genomics workflows on an HPC cluster.&lt;/p>
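&lt;p>A minimal sketch of this analysis (the 1.2x sensitivity cutoff and function name are assumptions for illustration) compares solo execution time against runs under CPU and I/O contention and classifies the application accordingly:&lt;/p>

```python
def classify_bound(solo_s, under_cpu_s, under_io_s, threshold=1.2):
    """Label an application by which kind of contention slows it most."""
    cpu_bound = under_cpu_s / solo_s > threshold
    io_bound = under_io_s / solo_s > threshold
    if cpu_bound and io_bound:
        return "both"
    if cpu_bound:
        return "compute-bound"
    if io_bound:
        return "I/O-bound"
    return "insensitive"
```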
&lt;p>For both subprojects, all experiments will also be conducted in a reproducible manner (e.g., as a Trovi package or Chameleon VM images), and all code will be open-sourced (e.g., shared on a public Github repo).&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>:&lt;/p>
&lt;p>A GitHub repository and/or Chameleon VM image containing source code for application executions &amp;amp; metrics collection.
Jupyter notebooks and/or Trovi artifacts containing analysis and mathematical models for application resource utilization &amp;amp; the effects of data quality.&lt;/p></description></item><item><title>LiveHD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/livehd/</link><pubDate>Thu, 01 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/livehd/</guid><description>&lt;p>The goal is to enable a more productive flow where the ASIC/FPGA designer can
work with multiple hardware description languages like CHISEL, Pyrope, or
Verilog.&lt;/p>
&lt;p>There are several projects: some involve compiler infrastructure around
&lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>, others how to interface
LLMs to improve chip design productivity.&lt;/p>
&lt;p>The following projects are available:&lt;/p>
&lt;ul>
&lt;li>Slang with LiveHD&lt;/li>
&lt;li>Hardware Hierarchical Dynamic Structures (hdds)&lt;/li>
&lt;li>HDLEval for LLMs&lt;/li>
&lt;li>C++ Profiler Optimizer with LLMs&lt;/li>
&lt;li>Decompiler from Assembly to C++ with LLMs&lt;/li>
&lt;/ul>
&lt;h2 id="slang-with-livehd">Slang with LiveHD&lt;/h2>
&lt;h3 id="project-idea">Project Idea&lt;/h3>
&lt;p>&lt;a href="https://github.com/MikePopoloski/slang" target="_blank" rel="noopener">slang&lt;/a> is one of the best open source
Verilog front-ends available. &lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>
uses slang, but only a subset of Verilog is supported. The goal is to add more slang features.&lt;/p>
&lt;h3 id="project-deliverable">Project Deliverable&lt;/h3>
&lt;p>The slang/LiveHD interface creates LiveHD IR (LNAST IR). The plan is to keep
extending the translation to support more features. This is a project that
allows small steps. The goal is to support all Verilog 2001, and potentially
some System Verilog features.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> SystemVerilog, Compilers&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Knowledge of Verilog, C++17, some compiler background.&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="hardware-hierarchical-dynamic-structures-hdds">Hardware Hierarchical Dynamic Structures (hdds)&lt;/h2>
&lt;h3 id="project-idea-1">Project Idea&lt;/h3>
&lt;p>&lt;a href="https://github.com/masc-ucsc/hhds" target="_blank" rel="noopener">hdds&lt;/a> aims to build efficient tree and
graph data structures commonly used by hardware compilers. A key difference is
the hierarchical nature, and patterns.&lt;/p>
&lt;h3 id="project-deliverable-1">Project Deliverable&lt;/h3>
&lt;p>There are two main components: Graph and Tree.&lt;/p>
&lt;p>For each, there is a hierarchical implementation that allows connecting trees/graphs in a hierarchy.
For example, a graph can call another graph with inputs and outputs, just as a Verilog module instantiates other Verilog modules.&lt;/p>
&lt;p>Both classes should have iterators for traversing in topological order.&lt;/p>
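&lt;p>As an illustration of the idea (a Python sketch rather than the C++17 the project targets; all names below are hypothetical), a graph whose nodes may expand into child graphs, with a hierarchical topological-order iterator:&lt;/p>

```python
from graphlib import TopologicalSorter  # Python 3.9+

class HierGraph:
    """Hypothetical sketch: a graph whose nodes may contain subgraphs."""
    def __init__(self):
        self.edges = {}       # node -> set of predecessor nodes
        self.children = {}    # node -> HierGraph it instantiates, if any

    def add_edge(self, src, dst):
        self.edges.setdefault(dst, set()).add(src)
        self.edges.setdefault(src, set())

    def add_child(self, node, subgraph):
        self.children[node] = subgraph

    def topo(self):
        # Flat topological order of this level only.
        return list(TopologicalSorter(self.edges).static_order())

    def topo_hier(self):
        # Hierarchical traversal: descend into a node's subgraph before
        # yielding the node itself, like elaborating a Verilog module.
        for node in self.topo():
            sub = self.children.get(node)
            if sub is not None:
                yield from sub.topo_hier()
            yield node

top = HierGraph()
top.add_edge("a", "b")
sub = HierGraph()
sub.add_edge("x", "y")
top.add_child("b", sub)     # node "b" instantiates a subgraph
print(list(top.topo_hier()))  # ['a', 'x', 'y', 'b']
```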
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data structures for compilers&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Data structures, C++17&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="hdleval-for-llms">HDLEval for LLMs&lt;/h2>
&lt;h3 id="project-idea-2">Project Idea&lt;/h3>
&lt;p>LLMs can be used to create new hardware. The goal of this project is to create multiple prompts
so that LLM/compiler designers can have examples to improve their flows.&lt;/p>
&lt;h3 id="project-deliverable-2">Project Deliverable&lt;/h3>
&lt;p>The idea is to create many sample projects where an &amp;ldquo;input&amp;rdquo; specification produces a Verilog artifact. The specification should not assume Verilog as output because other HDLs like Chisel could be used.&lt;/p>
&lt;p>The goal is to create many sample circuits that are realistic and practical.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Verilog, LLMs&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Verilog or Chisel&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Low&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Small or medium&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="c-profiler-optimizer-with-llms">C++ Profiler Optimizer with LLMs&lt;/h2>
&lt;h3 id="project-idea-3">Project Idea&lt;/h3>
&lt;p>Fine-tune an LLM, and/or apply RAG, so that it can leverage profiling tools to provide
code optimization recommendations for C++ and possibly Rust code.&lt;/p>
&lt;h3 id="project-deliverable-3">Project Deliverable&lt;/h3>
&lt;p>Create a Python package (poetry?) called aiprof that analyzes the execution of a C++ or Rust program and
provides code change recommendations to improve performance.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">aiprof ./binary
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>aiprof uses perf tools but also other tools like redspy, zerospy, and loadspy
to find problematic code areas and drive the GPT optimizer.&lt;/p>
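&lt;p>A minimal sketch of the first step such a tool might take: parsing hotspot percentages out of perf-report-style text. The report layout and the function interface here are assumptions for illustration, not an existing aiprof implementation:&lt;/p>

```python
import re

def hotspots(perf_report: str, threshold: float = 5.0):
    """Extract (percent, symbol) pairs above a threshold from
    perf-report-style text (format assumed for illustration)."""
    found = []
    for line in perf_report.splitlines():
        # Lines look roughly like: "  42.10%  binary  binary  [.] hot_loop"
        m = re.match(r"\s*(\d+\.\d+)%.*\[\.\]\s+(\S+)", line)
        if m:
            pct, symbol = float(m.group(1)), m.group(2)
            if pct >= threshold:
                found.append((pct, symbol))
    return sorted(found, reverse=True)

sample = """
    42.10%  binary  binary  [.] hot_loop
    31.05%  binary  libc    [.] memcpy
     1.20%  binary  binary  [.] cold_path
"""
print(hotspots(sample))  # [(42.1, 'hot_loop'), (31.05, 'memcpy')]
```

The symbols surviving the threshold are the candidate code regions a fine-tuned model would be asked to optimize.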
&lt;p>The plan is to collect several examples of transformations into a database so
that a model like CodeLlama or Mixtral can be fine-tuned with code optimization
recommendations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> C++, perf tools&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> C++17, Linux performance counters&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="decompiler-from-assembly-to-c-with-llms">Decompiler from Assembly to C++ with LLMs&lt;/h2>
&lt;h3 id="project-idea-4">Project Idea&lt;/h3>
&lt;p>There are several decompilers from assembly to C like ghidra and retdec. The idea is to enhance
both outputs to feed an LLM to generate nicer C++ code.&lt;/p>
&lt;h3 id="project-deliverable-4">Project Deliverable&lt;/h3>
&lt;p>ghidra and retdec generate C code from assembly. The idea is to start with
these tools as a baseline, but feed their output to an LLM to generate C++ code instead of
plain C.&lt;/p>
&lt;p>Create a Python package (poetry?) called aidecomp that integrates both
decompilers. It allows targeting either C or C++17.&lt;/p>
&lt;p>To check that the generated code is behaviorally equivalent to the translated function, a
fuzzer could be used. This allows aidecomp to iterate on the generation if the
generated code is not equivalent.&lt;/p>
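&lt;p>The fuzzing loop could look like this sketch, where the two functions are Python stand-ins (a real aidecomp would compile and invoke the original binary and the generated C/C++ instead):&lt;/p>

```python
import random

def original(x, y):          # stand-in for the compiled original function
    return (x * 3 + y) % 256

def decompiled(x, y):        # stand-in for the LLM-generated translation
    return (x + x + x + y) % 256

def fuzz_equivalent(f, g, trials=1000, seed=0):
    """Return a counterexample input, or None if all trials agree."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.randrange(65536), rng.randrange(65536)
        if f(x, y) != g(x, y):
            return (x, y)    # feed back to the LLM to retry generation
    return None

print(fuzz_equivalent(original, decompiled))  # None: no mismatch found
```

A returned counterexample would be attached to the next LLM prompt so the generation can be iterated, as described above.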
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> C++, decompilers&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> C++17&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component for end-users to visually inspect their applications&amp;rsquo; I/O behavior, focusing on areas of interest and getting a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, our framework maps numerous common and well-known bottlenecks and their solution recommendations that can be implemented by users.&lt;/p>
&lt;h3 id="drishti--server-side-visualization-service">Drishti / Server-side Visualization Service&lt;/h3>
&lt;p>The proposed work will include investigating and building server-side solutions to support the visualization of larger I/O traces and logs, while integrating with the existing analysis, reports, and recommendations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>visualization&lt;/code>, &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, HTML/CSS, JavaScript&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="drishti--visualization-and-analysis-of-ai-based-applications">Drishti / Visualization and Analysis of AI-based Applications&lt;/h3>
&lt;p>This project extends Drishti to handle metrics from non-MPI applications, specifically AI/ML codes and applications. This work entails adapting the existing framework, heuristics, and recommendations to support metrics collected from AI/ML workloads.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>AI&lt;/code> &lt;code>visualization&lt;/code>, &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, AI, performance profiling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
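&lt;p>The benchmark dimensions above form a configuration matrix that can be enumerated mechanically; a minimal Python sketch (the key names below are illustrative, not h5bench&amp;rsquo;s actual configuration syntax):&lt;/p>

```python
from itertools import product

# Illustrative benchmark dimensions drawn from the description above;
# these names are not h5bench's real configuration keys.
operations = ["read", "write"]
locality = ["basic-type-arrays", "array-of-structures"]
dimensionality = ["1D", "2D", "3D"]
modes = ["sync", "async"]

# Every combination is one benchmark configuration to exercise.
configs = [
    {"op": op, "locality": loc, "dims": dim, "mode": mode}
    for op, loc, dim, mode in product(operations, locality, dimensionality, modes)
]

print(len(configs))  # 24, i.e. 2 x 2 x 3 x 2 combinations
```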
&lt;h3 id="h5bench--reporting-and-enhancing">h5bench / Reporting and Enhancing&lt;/h3>
&lt;p>The proposed work will include standardizing and enhancing the reports generated by the suite, and integrating additional I/O kernels (e.g., HACC-IO).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="h5bench--compression">h5bench / Compression&lt;/h3>
&lt;p>The proposed work will focus on including compression capabilities into the h5bench core access patterns through HDF5 filters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>, &lt;code>compression&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>StatWrap</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/statwrap/</link><pubDate>Wed, 24 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/statwrap/</guid><description>&lt;p>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap&lt;/a> is a free and open-source assistive, non-invasive discovery and inventory tool to document research projects. It inventories project assets (e.g., code files, data files, manuscripts, documentation) and organizes information without additional input from the user. It also provides structure for users to add searchable and filterable notes connected to files to help communicate metadata about intent and analysis steps.&lt;/p>
&lt;p>At its core, StatWrap helps investigators identify and track changes in a research project as it evolves - which may affect reproducibility. For example: (1) people on the project can change over time, so processes may not be consistently executed due to transitions in employment; (2) data changes over time, due to accruing additional cases, adding new variables, or correcting mistakes in existing data; (3) software (e.g. used for data preparation and statistical analysis) evolves as it is edited, improved, and optimized; and (4) software can break or produce different results due to changes &amp;lsquo;under the hood&amp;rsquo; such as updates to statistical packages, compilers, or interpreters. StatWrap passively and actively documents these changes to support reproducibility.&lt;/p>
&lt;p>Additional information:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap home&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/stattag/statwrap" target="_blank" rel="noopener">StatWrap code (GitHub)&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="reproducibility-checklists">Reproducibility Checklists&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>reproducibility&lt;/code>, &lt;code>user interface&lt;/code>, &lt;code>checklists&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: JavaScript, React&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to develop support within StatWrap to generate customizable reproducibility checklists. The developer will use the metadata and user input collected by StatWrap to automatically generate checklists. This functionality will allow investigators to automatically generate a document indicating what practices they&amp;rsquo;ve followed to support reproducibility. Part of the project will involve surveying proposed reproducibility checklists and considering what to implement in StatWrap. This work will take a systematic approach to documenting reproducibility, much like PRISMA checklists for systematic reviews or CONSORT checklists for clinical trials.&lt;/p>
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Identify candidate reproducibility checklists to use as guides&lt;/li>
&lt;li>Create the data structure for configuring reproducibility checklists&lt;/li>
&lt;li>Display the reproducibility checklist in the user interface&lt;/li>
&lt;li>Store responses and comments to the checklist as provided by the user&lt;/li>
&lt;li>Generate a reproducibility checklist report from StatWrap&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for Chip Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/openroad/openroad/</link><pubDate>Mon, 22 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/openroad/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit project, originally funded by DARPA, with the aim of creating open-source EDA tools: an autonomous RTL-GDSII flow that completes in under 24 hours, to lower cost and boost innovation in IC design. The project is now supported by &lt;a href="https://precisioninno.com" target="_blank" rel="noopener">Precision Innovations&lt;/a>.&lt;/p>
&lt;p>OpenROAD scales massively, supports EWD (Education and Workforce Development), and sustains a broad ecosystem, making it a vital tool for a rapidly growing semiconductor industry.&lt;/p>
&lt;p>OpenROAD is the fastest onramp to gain knowledge, skills and create pathways for great career opportunities in chip design. You will develop important software and hardware design skills by contributing to these interesting projects. You will also have the opportunity to work with mentors from the OpenROAD project and other industry experts.&lt;/p>
&lt;p>We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact in the rapidly growing, global Semiconductor Industry.&lt;/p>
&lt;h3 id="create-openroad-tutorials-and-videos">Create OpenROAD Tutorials and Videos&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Documentation&lt;/code>, &lt;code>Tutorials&lt;/code>, &lt;code>Videos&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Video/audio recording and editing, training and education&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create short videos for training and course curriculum highlighting key features and flows in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>.&lt;/p>
&lt;h3 id="improve-the-openroad-autotuner-flow-and-documentation">Improve the OpenROAD AutoTuner Flow and documentation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>AutoTuner&lt;/code>, &lt;code>Design Exploration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of ML for hyperparameter tuning, Cloud-based computation, Basic VLSI design and tools knowledge, python, C/C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test, analyze and enhance the &lt;a href="https://openroad-flow-scripts.readthedocs.io/en/latest/user/InstructionsForAutoTuner.html" target="_blank" rel="noopener">AutoTuner&lt;/a> to improve usability, documentation and QoR. The Autotuner is an important tool in the OpenROAD flow - &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> for Chip design exploration that significantly reduces design time. You will use state-of-the-art ML tools to test the current tool exhaustively for good PPA (performance, power, area) results. You will also update existing documentation to reflect any changes to the tool and flow.&lt;/p>
&lt;h3 id="implement-a-memory-compiler-in-the-openroad-flow">Implement a memory compiler in the OpenROAD Flow&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>Memory Compiler&lt;/code>,&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Basic VLSI design and tools knowledge, python, tcl, C/C++, memory design a plus&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/austin-rovinski/">Austin Rovinski&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement a memory compiler as part of the OpenROAD flow to improve the placement and layout efficiency of large, memory-intensive designs. You will start with an existing code base to develop this feature: &lt;a href="https://github.com/The-OpenROAD-Project-staging/OpenROAD/tree/dffram" target="_blank" rel="noopener">https://github.com/The-OpenROAD-Project-staging/OpenROAD/tree/dffram&lt;/a>.
Another option is &lt;a href="https://github.com/AUCOHL/DFFRAM" target="_blank" rel="noopener">https://github.com/AUCOHL/DFFRAM&lt;/a>.
Enhance the code to support DFFRAM in the native OpenROAD flow, &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>.&lt;/p>
&lt;h3 id="integrate-a-tcl-and-python-linter">Integrate a tcl and python linter&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Linting&lt;/code>, &lt;code>Workflow&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: tcl, python, linting&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Easy&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Small (90 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/austin-rovinski/">Austin Rovinski&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integrate a tcl and python linter for tools in OpenROAD and &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> to enforce error checking, style and best practices.&lt;/p>
&lt;h3 id="llm-assistant-for-openroad---create-model-architecture-and-prototype">LLM assistant for OpenROAD - Create Model Architecture and Prototype&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Model Architecture&lt;/code>, &lt;code>Model Deployment&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, prompt engineering, fine-tuning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project involves the creation of a conversational assistant designed around &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a> to answer user queries. You will be working in tandem with members of the OpenROAD team and other researchers to deliver a final deployable prototype. You will focus on the design and implementation of modular LLM architectures. You will be experimenting through different architectures and justifying which approach works the best on our domain-specific data. Open to proposals from all levels of ML practitioners.&lt;/p>
&lt;h3 id="llm-assistant-for-openroad---data-engineering-and-testing">LLM assistant for OpenROAD - Data Engineering and testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Data Engineering&lt;/code>, &lt;code>Model Deployment&lt;/code>, &lt;code>Testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, prompt engineering, fine-tuning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project involves the creation of a conversational assistant designed around &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a> to answer user queries. You will be working in tandem with members of the OpenROAD team and other researchers to deliver a final deployable prototype. This project will focus on the data engineering portion of the project. This may include: training pipelines specifically tailored for fine-tuning LLM models, data annotation, preprocessing and augmentation. Open to proposals from all levels of ML practitioners.&lt;/p>
&lt;h3 id="create-unit-tests-for-openroad-tools">Create Unit tests for OpenROAD tools&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>unit testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Basic VLSI design and tools knowledge, python, tcl, C/C++, Github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>You will build unit tests to test specific features of the OpenROAD tool which will become part of the regression test. Here is an example of a test for UPF support: &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD/blob/master/test/upf/mpd_aes.upf" target="_blank" rel="noopener">https://github.com/The-OpenROAD-Project/OpenROAD/blob/master/test/upf/mpd_aes.upf&lt;/a>.
This is a great way to learn VLSI flow basics and the art of testing them for practical applications.&lt;/p></description></item><item><title>StatTag: Connecting statistical software to Microsoft Word</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/stattag/</link><pubDate>Mon, 22 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/stattag/</guid><description>&lt;p>StatTag is a free, &lt;a href="https://github.com/stattag" target="_blank" rel="noopener">open-source&lt;/a> software plug-in for conducting reproducible research. It facilitates the creation of dynamic documents using Microsoft Word documents and statistical software, such as Stata, SAS, R, and Python. Users can use StatTag to embed statistical output (estimates, tables and figures) into a Word document and then with one click individually or collectively update output with a call to the statistical program.&lt;/p>
&lt;p>What makes StatTag different from other tools for creating dynamic documents is that it allows for statistical code to be edited directly from Microsoft Word. Using StatTag means that modifications to a dataset or analysis no longer require transcribing or re-copying results into a manuscript or table.&lt;/p>
&lt;p>StatTag works by interpreting specially formatted comments (&amp;ldquo;tags&amp;rdquo;) within a code file. StatTag then reads the code file, executes the code through the corresponding language interpreter, formats the results, and inserts them into the Word document as a field.&lt;/p>
&lt;p>There are versions of StatTag for both Microsoft Windows and macOS. Proposed projects here are specific to the Microsoft Windows version, which is developed in the C# programming language.&lt;/p>
&lt;p>&lt;strong>Additional Information:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://sites.northwestern.edu/stattag/" target="_blank" rel="noopener">StatTag homepage&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/stattag" target="_blank" rel="noopener">StatTag on GitHub&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pubmed.ncbi.nlm.nih.gov/33215069/" target="_blank" rel="noopener">Welty et al., &amp;ldquo;Facilitating reproducible research through direct connection of data analysis with manuscript preparation: StatTag for connecting statistical software to Microsoft Word&amp;rdquo;&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="support-additional-programming-languages">Support Additional Programming Languages&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>reproducibility&lt;/code>, &lt;code>statistics&lt;/code>&lt;/li>
&lt;li>Skills: C# and one of: MATLAB, Octave, SQL, Julia&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Medium or large (175 or 350 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Following the same structure used for other language support in StatTag, develop support for a new programming language (suggested languages are provided, but applicants can propose others). This will include:&lt;/p>
&lt;ul>
&lt;li>Creating a Parser class to support StatTag-specific interpretation of results (e.g., identifying a line of code that is writing to a CSV file, then loading that CSV file)&lt;/li>
&lt;li>Creating an Automation class that manages communication with the supported programming language&amp;rsquo;s interpreter. Python support uses a Jupyter kernel, and both SAS and Stata support invoke DLLs directly.&lt;/li>
&lt;li>Integrating the language into the UI (e.g., allowing it to be a valid code file, adding the icon for the code file to the UI)&lt;/li>
&lt;li>Additional setup/configuration as needed (e.g., SQL support would require secure configuration for connecting to the database server).&lt;/li>
&lt;/ul>
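&lt;p>In Python rather than the C# StatTag is written in, the Parser step above might look like this toy sketch: scan a code file for tag comments and pair each with the code that follows. The tag syntax shown is purely illustrative, not StatTag&amp;rsquo;s actual format:&lt;/p>

```python
import re

TAG = re.compile(r"#\s*TAG:(\w+)")   # hypothetical tag comment syntax

def parse_tags(code: str):
    """Map each tag label to the code line immediately following it."""
    tags = {}
    lines = code.splitlines()
    for line, following in zip(lines, lines[1:]):
        m = TAG.search(line)
        if m:
            tags[m.group(1)] = following.strip()
    return tags

script = """
# TAG:MeanAge
print(df['age'].mean())
# TAG:Figure1
plt.savefig('fig1.png')
"""
print(parse_tags(script))
# {'MeanAge': "print(df['age'].mean())", 'Figure1': "plt.savefig('fig1.png')"}
```

A real Automation class would then hand each tagged snippet to the language interpreter and capture its output, as described in the list above.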
&lt;p>Develop unit tests to demonstrate code is functioning. Create test scripts in the implemented language to exercise and demonstrate end-to-end execution.&lt;/p>
&lt;h3 id="process-tags-in-jupyter-notebooks">Process Tags in Jupyter Notebooks&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>reproducibility&lt;/code>, &lt;code>jupyter&lt;/code>&lt;/li>
&lt;li>Skills: C#, Jupyter Notebooks, Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Medium (175 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>StatTag currently has support for Python, and utilizes the Jupyter kernel to interact with Python. However, we currently do not fully support processing StatTag &amp;lsquo;tags&amp;rsquo; in a Jupyter notebook.&lt;/p>
&lt;p>Following the same structure used for RMarkdown integration in StatTag, develop support for Jupyter Notebooks in StatTag. StatTag should be able to:&lt;/p>
&lt;ul>
&lt;li>Take as input one or more Jupyter Notebooks&lt;/li>
&lt;li>Confirm that the Jupyter Notebook uses Python&lt;/li>
&lt;li>Identify StatTag formatted tags within the notebook&lt;/li>
&lt;li>Pass relevant code to the Python processor already implemented in StatTag&lt;/li>
&lt;/ul>
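&lt;p>The notebook-handling steps listed above can be sketched in Python. The .ipynb structure (cells, kernelspec metadata) is the standard Jupyter format, while the tag marker is an illustrative assumption:&lt;/p>

```python
import json

def find_python_tags(notebook_json: str, marker: str = "ST:"):
    """Confirm a notebook uses Python, then collect code lines carrying
    a StatTag-style marker (marker syntax assumed for illustration)."""
    nb = json.loads(notebook_json)
    lang = nb.get("metadata", {}).get("kernelspec", {}).get("language")
    if lang != "python":
        raise ValueError(f"notebook language is {lang!r}, not python")
    tags = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        for line in cell.get("source", []):
            if marker in line:
                tags.append(line.strip())
    return tags

nb = json.dumps({
    "metadata": {"kernelspec": {"language": "python"}},
    "cells": [
        {"cell_type": "code", "source": ["# ST:Table1\n", "df.describe()\n"]},
        {"cell_type": "markdown", "source": ["no tags here\n"]},
    ],
})
print(find_python_tags(nb))  # ['# ST:Table1']
```

The tagged code would then be passed to the Python processor already implemented in StatTag.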
&lt;p>In addition, develop unit tests to demonstrate code is functioning as intended. Create test Jupyter Notebooks to exercise and demonstrate end-to-end execution.&lt;/p></description></item><item><title>AIIO / Graph Neural Network</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/</link><pubDate>Wed, 17 Jan 2024 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/aiio" target="_blank" rel="noopener">AIIO&lt;/a> revolutionizes the way users automatically tune the I/O performance of applications on HPC systems. It currently relies on linear regression models but has further opportunities to work on heterogeneous data, such as programming information. This requires extending the linear regression model to more complex models, such as heterogeneous graph neural networks. The proposed work will include developing a graph neural network-based model to predict and interpret I/O performance.&lt;/p>
&lt;h3 id="aiio--graph-neural-network">AIIO / Graph Neural Network&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>AIIO/Graph Neural Network&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, GitHub, Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, Suren Byna&lt;/li>
&lt;/ul>
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Develop the data pre-processing pipeline to convert I/O logs into the formats required by the Graph Neural Network.&lt;/li>
&lt;li>Build and test the Graph Neural Network to model the I/O performance of HPC applications.&lt;/li>
&lt;li>Test and evaluate the accuracy of the Graph Neural Network with test cases from AIIO.&lt;/li>
&lt;/ul></description></item><item><title>FasTensor / Stream Processing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/fastensor/</link><pubDate>Wed, 17 Jan 2024 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://github.com/BinDong314/FasTensor" target="_blank" rel="noopener">FasTensor&lt;/a> is a generic tensor processing engine that scales from a single node to thousands of nodes on HPC systems. FasTensor supports applications ranging from traditional SQL queries to complex DFT solvers in scientific computing. It has a 1000X performance advantage over MapReduce and Spark in supporting generic data processing functions on tensor structures. In this project, we propose to extend FasTensor with streaming functionality to support online data processing. Specifically, participants will develop a stream endpoint for retrieving live data output from applications such as DAS. The stream endpoint maintains a pointer into the data, which can be an n-dimensional subset of a tensor.&lt;/p>
&lt;h3 id="fastensor--stream-processing">FasTensor / Stream Processing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>FasTensor/Stream Processing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, GitHub&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Building a mock workflow based on our DAS application (&lt;a href="https://github.com/BinDong314/DASSA" target="_blank" rel="noopener">https://github.com/BinDong314/DASSA&lt;/a>) to test stream processing. The mock workflow comprises a data producer, which generates DAS data, and a data consumer, which processes the data.&lt;/li>
&lt;li>Developing a Stream Endpoint (e.g., an I/O driver) to iteratively read dynamically growing data from a directory. The stream endpoint essentially includes open, read, and write functions, plus a pointer that remembers the current file position.&lt;/li>
&lt;li>Integrating the Stream Endpoint into the FasTensor library.&lt;/li>
&lt;li>Evaluating the performance of the mock workflow with the new Stream Endpoint.&lt;/li>
&lt;li>Documenting the execution mechanism.&lt;/li>
&lt;/ul></description></item><item><title>SLICES/pos: Reproducible Experiment Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tum/slices/</link><pubDate>Sat, 06 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tum/slices/</guid><description>&lt;p>&lt;a href="https://www.slices-ri.eu/" target="_blank" rel="noopener">SLICES-RI&lt;/a> is a European research initiative aiming to create a digital research infrastructure providing an experimental platform for the upcoming decades.
One of the main goals of this initiative is the creation of fully reproducible experiments.
The SLICES research infrastructure will consist of different experiment sites focusing on different research domains such as AI experiments, Cloud and HPC-driven experiments, or investigations on wireless networks.&lt;/p>
&lt;p>To achieve reproducibility, the research group on network architectures and services of the Technical University of Munich develops the &lt;a href="https://dl.acm.org/doi/10.1145/3485983.3494841" target="_blank" rel="noopener">SLICES plain orchestrating service (SLICES/pos)&lt;/a>.
This framework supports a fully automated structured experiment workflow.
The structure of this workflow acts as a template for the design of experiments.
Users that adhere to this template will create inherently reproducible experiments, a feature we call reproducible-by-design.&lt;/p>
&lt;p>The SLICES/pos framework currently exists in two versions:
(1) a fully managed pos deployment that uses the SLICES/pos framework to manage the entire testbed, and (2) a hosted SLICES/pos deployment.
The hosted SLICES/pos deployment is a temporary deployment that runs inside existing testbeds such as &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> or &lt;a href="https://cloudlab.us/" target="_blank" rel="noopener">CloudLab&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Additional Information:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://dl.acm.org/doi/10.1145/3485983.3494841" target="_blank" rel="noopener">plain orchestrating service&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="support-additional-programming-languages">Support Additional Programming Languages&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>reproducibility&lt;/code>, &lt;code>statistics&lt;/code>&lt;/li>
&lt;li>Skills: Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Large (350 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sebastian-gallenmuller/">Sebastian Gallenmüller&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/georg-carle/">Georg Carle&lt;/a>, and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kate-keahey/">Kate Keahey&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Design a set of basic examples that demonstrate the usage of pos that can be executed on the SLICES/pos testbed in Munich and the Chameleon testbed.
This set of basic examples acts as a demonstration of pos&amp;rsquo; capabilities and as a tutorial for new users.
Based on these introductory examples, a more complex experiment shall be designed and executed, demonstrating the portability of the experiments between testbeds.
This experiment involves the entire experiment workflow consisting of the setup and configuration of the testbed infrastructure, the collection of measurement results, and finally, their evaluation and publication.
Multiple results of this experiment shall be created on different testbeds and hardware configurations.
The results will differ depending on the hardware platform on which the experiment was executed.
These results shall be evaluated and analyzed to find common connections between the different result sets.&lt;/p>
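The structured workflow described above (setup and configuration, repeated measurement, then evaluation across testbeds) can be sketched in Python. All names here are illustrative stand-ins, not the actual pos API:

```python
from statistics import mean


def run_experiment(testbed, measure):
    """One structured run: setup -> measurement -> evaluation.

    `testbed` and `measure` are hypothetical stand-ins for the
    pos-managed infrastructure and the measurement step."""
    config = {"testbed": testbed}                       # setup/configuration phase
    samples = [measure(config) for _ in range(3)]       # measurement phase (repeated)
    return {"testbed": testbed, "mean": mean(samples)}  # evaluation phase


def compare_across_testbeds(testbeds, measure):
    """Repeat the identical workflow on every testbed so results stay comparable."""
    return {r["testbed"]: r["mean"]
            for r in (run_experiment(t, measure) for t in testbeds)}
```

Because every testbed runs the exact same workflow template, differences in the collected results can be attributed to the hardware platforms rather than to the experiment procedure, which is the essence of reproducible-by-design.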
&lt;ul>
&lt;li>Create introductory examples demonstrating the usage of pos&lt;/li>
&lt;li>Design and create a portable complex network experiment based on SLICES/pos&lt;/li>
&lt;li>Execute the experiment on different testbeds (Chameleon, SLICES/pos testbed)&lt;/li>
&lt;li>Analyze the reproduced experiments&lt;/li>
&lt;li>Automate the analysis of experimental results&lt;/li>
&lt;li>Deduce a model describing the fundamental connections between different experiment executions&lt;/li>
&lt;/ul></description></item><item><title>Static Python Perf: Measuring the Cost of Sound Gradual Types</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uutah/static-python-perf/</link><pubDate>Sat, 06 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uutah/static-python-perf/</guid><description>&lt;p>Gradual typing is a solution to the longstanding tension between typed and
untyped languages: let programmers write code in any flexible language (such
as Python), equip the language with a suitable type system that can describe
invariants in part of a program, and use run-time checks to ensure soundness.&lt;/p>
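As a toy illustration of such run-time checks, the decorator below verifies an argument's type before the function body runs; sound gradually-typed languages insert far more elaborate checks (fields, return values, higher-order contracts) automatically:

```python
def checked(expected_type):
    """Wrap a single-argument function with a run-time type check.

    A minimal sketch of the checks sound gradual typing inserts;
    the decorator name and shape are illustrative, not a real system's API."""
    def decorate(fn):
        def wrapper(x):
            # The inserted soundness check: reject ill-typed inputs at run time.
            if not isinstance(x, expected_type):
                raise TypeError(
                    f"expected {expected_type.__name__}, got {type(x).__name__}")
            return fn(x)
        return wrapper
    return decorate


@checked(int)
def double(x):
    return x * 2
```

Every such wrapper executes on every call, which is exactly why the next paragraph's cost concern arises.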
&lt;p>For now, though, the cost of run-time checks can be enormous.
Order-of-magnitude slowdowns are common. This high cost is a main reason why
TypeScript is unsound by design &amp;mdash; its types are deliberately left untrustworthy in order
to avoid run-time costs.&lt;/p>
&lt;p>Recently, a team at Meta built a gradually-typed variant of Python called
(&lt;em>drumroll&lt;/em>) Static Python. They report an incredible 4% increase in CPU
efficiency at Instagram thanks to the sound types in Static Python. This
kind of speedup is unprecedented.&lt;/p>
&lt;p>Other languages may want to follow the Static Python approach to gradual types,
but there are big reasons to doubt the Instagram numbers:&lt;/p>
&lt;ul>
&lt;li>the experiment code is closed source, and&lt;/li>
&lt;li>the experiment itself is not easily reproducible (even for Instagram!).&lt;/li>
&lt;/ul>
&lt;p>Static Python needs a rigorous, reproducible performance evaluation to test
whether it is indeed a fundamental advance for gradual typing.&lt;/p>
&lt;p>Related Work:&lt;/p>
&lt;ul>
&lt;li>Gradual Soundness: Lessons from Static Python
&lt;a href="https://programming-journal.org/2023/7/2/" target="_blank" rel="noopener">https://programming-journal.org/2023/7/2/&lt;/a>&lt;/li>
&lt;li>Producing Wrong Data Without Doing Anything Obviously Wrong!
&lt;a href="https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf" target="_blank" rel="noopener">https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf&lt;/a>&lt;/li>
&lt;li>On the Cost of Type-Tag Soundness
&lt;a href="https://users.cs.utah.edu/~blg/resources/pdf/gm-pepm-2018.pdf" target="_blank" rel="noopener">https://users.cs.utah.edu/~blg/resources/pdf/gm-pepm-2018.pdf&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="design-and-run-an-experiment">Design and Run an Experiment&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>performance&lt;/code>, &lt;code>cluster computing&lt;/code>, &lt;code>statistics&lt;/code>&lt;/li>
&lt;li>Skills: Python AST parsing, program generation, scripting, measuring performance&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Medium (175 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Design an experiment that covers the space of gradually-typed Static Python programs
in a fair way. Since every variable in a program can have up to 3 different types,
there are easily 3^20 possibilities in small programs &amp;mdash; far too many to measure
exhaustively.&lt;/p>
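To get a sense of the scale, the sketch below counts the configuration space and draws a small random sample of typing configurations instead of enumerating all of them. The three type labels are illustrative placeholders, not Static Python's actual type lattice:

```python
import random

# Illustrative labels for the up-to-3 type options per variable.
TYPE_CHOICES = ("untyped", "shallow", "deep")


def config_space_size(num_vars: int) -> int:
    """Number of distinct typing configurations for num_vars variables."""
    return len(TYPE_CHOICES) ** num_vars


def sample_configs(num_vars: int, n: int, seed: int = 0):
    """Draw n random configurations; a seeded RNG keeps the sample reproducible."""
    rng = random.Random(seed)
    return [tuple(rng.choice(TYPE_CHOICES) for _ in range(num_vars))
            for _ in range(n)]
```

With 20 variables the space already holds 3^20 = 3,486,784,401 configurations, so a fair sampling design is the only practical way to cover it.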
&lt;p>Run the experiment on an existing set of benchmarks using a cluster such as CloudLab.
Manage the cluster machines across potentially dozens of reservations and combine
the results into one comprehensive view of Static Python performance.&lt;/p>
&lt;h3 id="derive-benchmarks-from-python-applications">Derive Benchmarks from Python Applications&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>types&lt;/code>, &lt;code>optimization&lt;/code>, &lt;code>benchmark design&lt;/code>&lt;/li>
&lt;li>Skills: Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Small to Large&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build or find realistic Python applications, equip them with rich types,
and modify them to run a meaningful performance benchmark. Running a benchmark
should produce timing information, and the timing should not be significantly
influenced by random variables, I/O actions, or system events.&lt;/p></description></item><item><title>PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. PolyPhy is rooted in astronomy and inspired by nature: we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="polyphy-web-presence">PolyPhy Web Presence&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>UX&lt;/code> &lt;code>Social Media&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> full-stack web development, JavaScript, good communication skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The online presentation of a software project is without a doubt one of the core ingredients of its success. This project aims to develop a sustainable web presence for PolyPhy, catering to interested contributors, active collaborators, and users alike.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Closely work with the mentors on understanding the context of the project and its detailed requirements in preparation of the proposal.&lt;/li>
&lt;li>Port the existing &lt;a href="https://polyphy.io" target="_blank" rel="noopener">website&lt;/a> to a more modern JavaScript framework (such as Next.js) that provides a user-friendly CMS and admin interface.&lt;/li>
&lt;li>Update the contents of the website with new information from the &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">repository page&lt;/a> as well as other sources as directed by the mentors.&lt;/li>
&lt;li>Develop a simple functional system for posting updates about the project to selected social media and other communication platforms (LinkedIn, Twitter/X or Mastodon, mailing list) which will also be reflected on the website.&lt;/li>
&lt;li>Optional: improve the UX of the website where needed.&lt;/li>
&lt;li>Optional: implement website analytics (visitor stats, etc.).&lt;/li>
&lt;/ul>
&lt;h3 id="data-visualization-and-analysis-with-polyphypolyglot">Data Visualization and Analysis with PolyPhy/Polyglot&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Data Science&lt;/code> &lt;code>Data Visualization&lt;/code> &lt;code>Point Clustering&lt;/code> &lt;code>3D&lt;/code> &lt;code>Neural Embeddings&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> data science, Python, JavaScript, statistics; familiarity with AI and latent embedding spaces is a big plus&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350+ hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kiran-deol/">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The aim of this project is to explore a novel data-scientific use case using PolyPhy and its associated web visualization interface &lt;a href="https://github.com/PolyPhyHub/PolyGlot" target="_blank" rel="noopener">PolyGlot&lt;/a>. The contributor is expected to identify a dataset they are already familiar with that fits the application scope of the PolyPhy/PolyGlot tooling: a complex point cloud arising from a 3D or higher-dimensional process that will benefit from latent pattern identification and subsequent visual and quantitative analysis. The contributor must have the rights to use the dataset, either by owning the copyright or because the data is openly licensed.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Closely work with the mentors on understanding the context of the project and its detailed requirements in preparation of the proposal.&lt;/li>
&lt;li>Become acquainted with the tooling (PolyPhy, PolyGlot) prior to the start of the project period.&lt;/li>
&lt;li>Document the nature of the target dataset and define the complete data pipeline with assistance of the mentors, including the specific analytic tasks and objectives.&lt;/li>
&lt;li>Implement the data pipeline in PolyPhy and PolyGlot.&lt;/li>
&lt;li>Document the process and resulting findings in a publicly available report.&lt;/li>
&lt;/ul></description></item><item><title>Writing a blog about your OSRE 2024 project</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/admin/20231006-admin/</link><pubDate>Fri, 06 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/admin/20231006-admin/</guid><description>&lt;p>As we did last year, the Organization Admins will ask students and contributors to provide regular status updates, which will help us better highlight the work you are doing and track activities within our OSRE projects. These progress reports will also form the basis of the blog reports students prepare over the course of their summer. Blog reports should include links to proposals, presentations, reports, and an overview of the student&amp;rsquo;s experience.&lt;/p>
&lt;p>Your experience is invaluable for future OSRE candidates and for improving the program every year.&lt;/p>
&lt;h2 id="size-and-content">Size and content&lt;/h2>
&lt;p>Keep it short and crisp. Include a short description of your project, a link to your project proposal, and, later in the program, links to the GSoC reports you provided.&lt;/p>
&lt;h2 id="making-a-pull-request-for-your-blog">Making a pull request for your blog&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Fork the &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io" target="_blank" rel="noopener">git repository&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you haven&amp;rsquo;t already done so, add your profile using &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/osredocs/formentors/#instructions-for-adding-a-mentor">these instructions&lt;/a>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>IMPORTANT&lt;/strong>: Under &lt;code>user_groups:&lt;/code> add &lt;code>- 2024 Contributors&lt;/code> (as opposed to either of the two mentor groups)&lt;/li>
&lt;li>The short bio and any other information goes below the frontmatter&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Post your blog&lt;/p>
&lt;ul>
&lt;li>Add &lt;code>/content/report/osre24/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md&lt;/code>&lt;/li>
&lt;li>Add a frontmatter to &lt;code>index.md&lt;/code>, using the labels below&lt;/li>
&lt;li>Blog text goes below the frontmatter&lt;/li>
&lt;li>In that same directory include a picture and call it &lt;code>featured.png&lt;/code> (also supports &lt;code>.jpg&lt;/code>, &lt;code>.jpeg&lt;/code>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Commit to your fork, make a pull request, and &lt;a href="mailto:ospo-info-group@ucsc.edu">email the OSRE Admins&lt;/a> (currently: Stephanie Lieggi, Carlos Maltzahn).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="example-frontmatter-and-text-body">Example frontmatter and text body&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">title: &amp;#34;YOUR TITLE&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">subtitle: &amp;#34;YOUR SUBTITLE (OPTIONAL)&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">summary:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">authors:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">tags: [&amp;#34;osre24&amp;#34;]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">categories: []
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">date: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lastmod: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">featured: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">draft: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Featured image
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># To use, add an image named `featured.jpg/png` to your page&amp;#39;s folder.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">image:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> caption: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> focal_point: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> preview_only: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">As part of the [PROJECTNAME](/project/osre24/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item></channel></rss>