data science | UCSC OSPO

AI Data Readiness Inspector (AIDRIN)

Fri, 30 Jan 2026 10:15:00 -0700

Garbage In, Garbage Out (GIGO) is a widely accepted quote in computer science across various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.

AIDRIN (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of data readiness for AI processes, covering a broad range of dimensions from the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, to evaluate data. Furthermore, AIDRIN uses metrics specific to assessing AI data, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating data readiness.
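To make a few of the metrics named above concrete, here is a minimal sketch of completeness, duplicate rate, and class imbalance computed over a small in-memory table. It is not AIDRIN's implementation or API; the function names, the row format (a list of dicts), and the sample values are assumptions chosen for illustration.

```python
from collections import Counter

def completeness(rows, columns):
    """Fraction of cells that are not missing (None counts as missing)."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    present = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return present / total

def duplicate_fraction(rows, columns):
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    return sum(count - 1 for count in seen.values()) / len(rows)

def imbalance_ratio(rows, label):
    """Size of the largest class divided by the size of the smallest."""
    counts = Counter(row[label] for row in rows if row.get(label) is not None)
    if not counts:
        raise ValueError("no labelled rows")
    return max(counts.values()) / min(counts.values())

# Made-up sample data.
rows = [
    {"age": 31, "income": 52000, "label": "yes"},
    {"age": 45, "income": None, "label": "no"},
    {"age": 31, "income": 52000, "label": "yes"},
    {"age": 28, "income": 41000, "label": "yes"},
]
columns = ["age", "income", "label"]

print(completeness(rows, columns))        # share of non-missing cells
print(duplicate_fraction(rows, columns))  # share of repeated rows
print(imbalance_ratio(rows, "label"))     # majority/minority class ratio
```

Each function returns a single number, which is what makes a readiness report quantifiable and comparable across datasets.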

AIDRIN Multiple File Formats

The proposed work will include improvements in the AIDRIN framework to (1) add support for new file formats such as Zarr, ROOT, and HDF5 and (2) allow users to provide custom data ingestion mechanisms.
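One possible shape for a custom ingestion mechanism is a registry that maps file extensions to reader functions. The sketch below is an assumption about how such a plug-in point could look, not AIDRIN's design; `READERS`, `register_reader`, and `ingest` are hypothetical names, and CSV and JSON stand in for formats such as Zarr, ROOT, and HDF5, which need third-party libraries.

```python
import csv
import io
import json

# Hypothetical registry: file extension -> reader function.
READERS = {}

def register_reader(extension):
    """Decorator that registers a reader for one file extension."""
    def decorator(func):
        READERS[extension.lower()] = func
        return func
    return decorator

@register_reader(".csv")
def read_csv(stream):
    """Return one dict per CSV row, keyed by the header line."""
    return list(csv.DictReader(stream))

@register_reader(".json")
def read_json(stream):
    """Return the parsed JSON document."""
    return json.load(stream)

def ingest(name, stream):
    """Choose a reader from the file name and return its records."""
    for extension, reader in READERS.items():
        if name.lower().endswith(extension):
            return reader(stream)
    raise ValueError(f"no reader registered for {name!r}")

records = ingest("sample.csv", io.StringIO("a,b\n1,2\n3,4\n"))
print(records)
```

With this layout, supporting a new format means registering one more function, and the rest of the pipeline only ever sees a list of records.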

Topics: data readiness, AI, data analysis
Skills: Python, C/C++, data analysis, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Drishti

Fri, 30 Jan 2026 10:15:00 -0700

Drishti is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component for end-users to visually inspect their applications’ I/O behavior, focusing on areas of interest and getting a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, our framework maps numerous common and well-known bottlenecks and their solution recommendations that can be implemented by users.

Drishti Comparisons and Heatmaps

The proposed work will include investigating and building a solution to allow comparing and finding differences between two I/O trace files (similar to a diff), covering the analysis and visualization components. It will also explore additional metrics and counters such as Darshan heatmaps in the analysis and visualization components of the framework.
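As a rough illustration of what a diff between two traces could report, the sketch below compares two dictionaries of counters and lists only the ones that changed. It assumes each trace has already been reduced to a flat name-to-value mapping; the function name and all counter values are made up for the example.

```python
def diff_counters(before, after, tolerance=0.0):
    """Return {counter: (before, after, delta)} for counters that differ."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = before.get(name, 0)   # a counter missing from a trace counts as 0
        new = after.get(name, 0)
        delta = new - old
        if abs(delta) > tolerance:
            changes[name] = (old, new, delta)
    return changes

# Illustrative counter values for two runs of the same application.
run_a = {"POSIX_READS": 1200, "POSIX_WRITES": 300, "POSIX_BYTES_WRITTEN": 4096000}
run_b = {"POSIX_READS": 1200, "POSIX_WRITES": 75, "POSIX_BYTES_WRITTEN": 4096000,
         "POSIX_SEEKS": 10}

changes = diff_counters(run_a, run_b)
for name, (old, new, delta) in changes.items():
    print(f"{name}: {old} -> {new} ({delta:+d})")
```

Here the second run writes the same number of bytes with a quarter of the write calls, which is the kind of difference a side-by-side view would want to surface.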

Topics: I/O, HPC, data analysis, visualization, profiling, tracing
Skills: Python, data analysis, performance profiling
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Lynx Grader

Tue, 13 Jan 2026 13:00:00 -0800

The EduLinq Lynx Grader (also referred to as “autograder”) is an open source tool used by several courses at UCSC to safely and quickly grade programming assignments. Grading student code may seem simple at first (you just need to run their code!), but it quickly becomes exceedingly complex as you get into the details. Specifically, grading a student’s code securely while providing the “last mile” service of getting code from students and sending results to instructors/TAs and the course’s LMS (e.g., Canvas) can be very difficult. The Lynx Grader provides all of this in a free and open source project. The LINQS Lab has made many contributions to maintain and improve the Lynx Grader.

As an open source project, there are endless opportunities for development, improvements, and collaboration. Here, we highlight some specific projects that will work well in the summer mentorship setting.

All students interested in LINQS projects for OSRE/GSoC 2026 should fill out this form. Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project. The form will stop accepting responses once the application window closes. Do not post on any of the project repositories about OSRE/GSoC (e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2026). Remember, these are active repositories that were not created for OSRE/GSoC.

LLM Detection

Topics: AI/ML, LLM, Research, Backend
Skills: software development, backend, systems, data munging, go, docker
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

As Large Language Model (LLM) tools like ChatGPT become more common and powerful, instructors need tools to help determine if students are the actual authors of the code they submit. More classical instances of plagiarism are often discovered by code similarity tools like MOSS. However, these tools are not sufficient for detecting code written not by a student, but by an AI model like ChatGPT or GitHub Copilot.

The task for this project is to create a system that provides a score indicating the system’s confidence that a given piece of code was written by an AI tool and not a student. This will supplement the existing code analysis tools in the Lynx Grader. Many approaches to this task will be considered. A software development approach could leverage existing systems to create a production-ready system, whereas a research approach could develop a novel method, complete with a paper and experiments.

There has been previous work on this issue, in which a student surveyed existing solutions, collected initial datasets, and ran exploratory experiments on possible directions. This project would build on that work.

See Also:

Code Analysis GUI

Topics: Frontend
Skills: software development, frontend, data munging, js, css, go
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Lynx Grader has existing functionality to analyze the code in a student’s submission for malicious content. Relevant to this project is that the Lynx Grader can run a pairwise similarity analysis against all submitted code. This is how most existing software plagiarism systems detect offending code. The existing infrastructure provides detailed statistics on code similarity, but does not currently have a visual way to display this data.
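For readers unfamiliar with pairwise similarity analysis, the sketch below scores every unordered pair of submissions with a line-based comparison from Python's standard `difflib` module. This is only an illustration of the idea; it is not the Lynx Grader's algorithm or API, and the submission names and code are made up.

```python
import difflib
from itertools import combinations

def pairwise_similarity(submissions):
    """Score every unordered pair of submissions in the range [0, 1]."""
    scores = {}
    for (name_a, code_a), (name_b, code_b) in combinations(sorted(submissions.items()), 2):
        # Compare line by line; 1.0 means the two files have identical lines.
        matcher = difflib.SequenceMatcher(None, code_a.splitlines(), code_b.splitlines())
        scores[(name_a, name_b)] = matcher.ratio()
    return scores

# Made-up submissions.
submissions = {
    "alice": "def add(a, b):\n    return a + b\n",
    "bob": "def add(a, b):\n    return a + b\n",
    "carol": "def add(x, y):\n    total = x + y\n    return total\n",
}

scores = pairwise_similarity(submissions)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

The result is one score per pair, so n submissions produce n(n-1)/2 scores. A table or heatmap of these scores is the kind of output a visual interface would need to present.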

The task for this project is to create a web GUI using the Lynx Grader REST API to display the results of a code analysis. The size of this project depends on how many of the existing features are going to be supported by the web GUI.

See Also:

Web GUI

Topics: Frontend
Skills: software development, frontend, js, css
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Lynx Grader contains dozens of API endpoints, most directly representing a piece of functionality exposed to the user. All of these features are exposed in the Lynx Grader’s Python interface. However, the Python interface is purely a command-line interface. And although command-line interfaces are objectively (read: subjectively) the best, a web GUI would be more accessible to a wider audience. The autograder already has a web GUI, but it does not cover all the features available in the Lynx Grader.

The task for this project is to augment the Lynx Grader’s web GUI with more features. Specifically, add support for more tools used to create and administer courses.

See Also:

AI Data Readiness Inspector (AIDRIN)

Tue, 11 Feb 2025 10:15:00 -0700

Garbage In Garbage Out (GIGO) is a universally agreed quote by computer scientists from various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.

AIDRIN (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of the readiness of data for AI processes, covering a broad range of readiness dimensions available in the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, for data evaluation. Furthermore, AIDRIN uses metrics specific to assessing data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data.

AIDRIN Visualizations and Science Gateway

The proposed work will include improvements in the AIDRIN framework to (1) enhance, extend, and optimize the visualizations of metrics related to all six pillars of AI data readiness and (2) set up a science gateway on NERSC or AWS cloud service.

Topics: data readiness, AI
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

h5bench with AI workloads

Tue, 11 Feb 2025 10:15:00 -0700

h5bench is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead and the observed I/O rate.

Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library’s heavy usage in various scientific applications running on supercomputing systems. The tests in the h5bench suite cover I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and to evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.
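The data locality dimension mentioned above (arrays of basic data types versus arrays of structures) can be illustrated without HDF5. The sketch below is not h5bench code; it uses Python's `struct` and `array` modules and a hypothetical particle record (`id`, `x`, `y`) to show how the two layouts arrange the same values in memory.

```python
import struct
from array import array

# Three particles with hypothetical fields: id (int32), x and y (float32).
particles = [(1, 0.5, 1.5), (2, 2.5, 3.5), (3, 4.5, 5.5)]

# Array of structures: the fields of each record sit next to each other.
record = struct.Struct("<iff")
aos = b"".join(record.pack(*p) for p in particles)

# Arrays of basic types: one contiguous buffer per field.
ids = array("i", [p[0] for p in particles])
xs = array("f", [p[1] for p in particles])
ys = array("f", [p[2] for p in particles])

# Reading only "x" from the array of structures strides over every record ...
x_from_aos = [record.unpack_from(aos, i * record.size)[1] for i in range(len(particles))]
# ... while the per-field layout holds it in a single contiguous buffer.
x_from_soa = list(xs)

print(len(aos), x_from_aos, x_from_soa)
```

Both layouts hold the same data, but an access that needs a single field produces a strided pattern in one and a contiguous pattern in the other, which is why a benchmark exercises both.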

h5bench with AI workloads

The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.

Topics: I/O, HPC, benchmarking
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Autograder

Thu, 06 Feb 2025 13:00:00 -0800

The EduLinq Autograder is an open source tool used by several courses at UCSC to safely and quickly grade programming assignments. Grading student code may seem simple at first (you just need to run their code!), but it quickly becomes exceedingly complex as you get into the details. Specifically, grading a student’s code securely while providing the “last mile” service of getting code from students and sending results to instructors/TAs and the course’s LMS (e.g., Canvas) can be very difficult. The Autograder provides all of this in a free and open source project. The LINQS Lab has made many contributions to maintain and improve the Autograder.

All students interested in LINQS projects for OSRE/GSoC 2025 should fill out this form. Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project. The form will stop accepting responses once the application window closes. Do not post on any of the project repositories about OSRE/GSoC (e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025). Remember, these are active repositories that were not created for OSRE/GSoC.

LLM Detection

Topics: AI/ML, LLM, Research, Backend
Skills: software development, backend, systems, data munging, go, docker
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The task for this project is to create a system that provides a score indicating the system’s confidence that a given piece of code was written by an AI tool and not a student. This will supplement the existing code analysis tools in the Autograder. Many approaches to this task will be considered. A software development approach could leverage existing systems to create a production-ready system, whereas a research approach could develop a novel method, complete with a paper and experiments.

See Also:

Code Analysis GUI

Topics: Frontend
Skills: software development, frontend, data munging, js, css, go
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Autograder has existing functionality to analyze the code in a student’s submission for malicious content. Relevant to this project is that the Autograder can run a pairwise similarity analysis against all submitted code. This is how most existing software plagiarism systems detect offending code. The existing infrastructure provides detailed statistics on code similarity, but does not currently have a visual way to display this data.

The task for this project is to create a web GUI using the Autograder REST API to display the results of a code analysis. The size of this project depends on how many of the existing features are going to be supported by the web GUI.

See Also:

Web GUI

Topics: Frontend
Skills: software development, frontend, js, css
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Autograder contains dozens of API endpoints, most directly representing a piece of functionality exposed to the user. All of these features are exposed in the Autograder’s Python interface. However, the Python interface is purely a command-line interface. And although command-line interfaces are objectively (read: subjectively) the best, a web GUI would be more accessible to a wider audience. The Autograder already has a web GUI, but it does not cover all the features available in the Autograder.

The task for this project is to augment the Autograder’s web GUI with more features. Specifically, add support for more tools used to create and administer courses.

See Also:

Mediglot

Tue, 04 Feb 2025 00:00:00 +0000

PolyPhy is a GPU-oriented agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called Polyphorm to reconstruct the Cosmic web structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our workshop and more details about our research here. Recent projects, such as Polyglot and Mediglot, have focused on using PolyPhy to better visualize language embeddings.

Medicinal Language Embeddings

Topics: Large Language Models, NLP, Embeddings, Medicine
Skills: Python, JavaScript, Data Science, Technical Communication
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Oskar Elek, Kiran Deol

This project aims to refine and enhance Mediglot, a web application for visualizing 3D medicinal embeddings, which extends the Polyglot app and leverages the PolyPhy toolkit for network-inspired data science. Mediglot currently enables users to explore high-dimensional vector representations of medicines (derived from their salt compositions) in a 3D space using UMAP, as well as analyze similarity through the innovative Monte-Carlo Physarum Machine (MCPM) metric. Unlike traditional language data, medicinal embeddings do not have an inherent sequential structure. Instead, we must work with the salt compositions of each medicine to create embeddings that are faithful to the intended purpose of each medicine.
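To show what an order-free embedding built from salt compositions might look like, the sketch below encodes each medicine as a multi-hot vector over a salt vocabulary and compares medicines with cosine similarity. This is a deliberately simple stand-in, not Mediglot's embedding procedure and not the MCPM metric; the medicine and salt names are placeholders.

```python
import math

def embed(salts, vocabulary):
    """Multi-hot vector: 1.0 where the medicine contains the salt."""
    present = set(salts)
    return [1.0 if salt in present else 0.0 for salt in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Placeholder medicines and salts.
medicines = {
    "med_a": ["salt_1", "salt_2"],
    "med_b": ["salt_1"],
    "med_c": ["salt_3"],
}
vocabulary = sorted({salt for salts in medicines.values() for salt in salts})
vectors = {name: embed(salts, vocabulary) for name, salts in medicines.items()}

print(cosine_similarity(vectors["med_a"], vectors["med_b"]))  # share one salt
print(cosine_similarity(vectors["med_a"], vectors["med_c"]))  # share none
```

Because the vector depends only on which salts are present, reordering a composition does not change the embedding, which matches the lack of sequential structure noted above. Swapping `cosine_similarity` for another function is one way to compare the effect of different metrics.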

This year, we would like to focus on exploring and integrating state-of-the-art AI techniques and algorithms to improve Mediglot’s clustering capabilities and its representation of medicinal data in 3D. The contributor will experiment with advanced large language models (LLMs) and cutting-edge AI methods to develop innovative approaches for refining clustering and extracting deeper insights from medicinal embeddings. Beyond LLMs, we would like to experiment with more traditional language processing methods to design novel embedding procedures. Additionally, we would like to experiment with other similarity metrics. While the similarity of two medicines depends on the initial embedding, we would like to examine the effects of different metrics on the kinds of insights a user can extract. Finally, the contributor is expected to evaluate and compare different algorithms for dimensionality reduction to enhance the faithfulness of the visualization and its interpretability.

The ideal contributor for this project has experience with Python (and common scientific toolkits such as NumPy, Pandas, SciPy). They will also need some experience with JavaScript and web development (Mediglot is distributed as a vanilla JS web app). Knowledge of embedding techniques for language processing is highly recommended.

Specific tasks:

Closely work with the mentors to understand the context of the project and its detailed requirements in preparation for the proposal.
Become acquainted with the tooling (PolyPhy, PolyGlot, Mediglot) prior to the start of the project period.
Explore different embedding techniques for medicinal data (including implementing novel embedding procedures).
Explore different dimensionality reduction techniques, with a focus on faithful visualizations.
Document the process and resulting findings in a publicly available report.

Enhancing PolyPhy Web Application

Topics: Web Development, UI/UX Design, Full Stack Development, JavaScript, Next.js, Node.js
Skills: Full Stack Web Development, UI/UX Design, JavaScript, Next.js, Node.js, Technical Communication
Difficulty: Challenging
Size: Medium (175 hours)
Mentors: Oskar Elek, Kiran Deol

This project aims to revamp and enhance the PolyPhy web platform to better support contributors, users, and researchers. The goal is to optimize the website’s UI/UX, improve its performance, and integrate Mediglot to provide users with a seamless experience in visualizing both general network structures and 3D medicinal embeddings.

The contributor will be responsible for improving the website’s overall look, feel, and functionality, ensuring a smooth and engaging experience for both contributors and end-users. This includes addressing front-end and back-end challenges, optimizing the platform for better accessibility, and ensuring seamless integration with Mediglot.

The ideal candidate should have experience in full-stack web development, particularly with Next.js, JavaScript, and Node.js, and should be familiar with UI/UX design principles. A strong ability to communicate effectively, both in writing and through code, is essential for this role.

Specific tasks:

Collaborate with mentors to understand the project’s goals and the specific requirements for the website improvements.
UI/UX Redesign:
- Redesign and enhance the website’s navigation, layout, and visual elements to create an intuitive and visually engaging experience.
- Improve mobile responsiveness for broader accessibility across devices.
Website Performance & Stability:
- Identify and resolve performance bottlenecks, bugs, or issues affecting speed, stability, and usability.
Mediglot Integration:
- Integrate the Mediglot web application with PolyPhy, ensuring seamless functionality and a unified user experience for visualizing medicinal data alongside general network reconstructions.
Documentation:
- Document the development process, challenges, and solutions in a clear and organized manner, ensuring transparent collaboration with mentors and the community.

Midway Through GSoC

Wed, 31 Jul 2024 00:00:00 +0000

Hello everyone! I’m Joel Tony, and I’m excited to share my progress update on the Drishti project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I’ve been diving deep into the world of I/O visualization for scientific applications, and I’m thrilled to tell you about the strides we’ve made.

What is Drishti?

For those unfamiliar with Drishti, it’s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using Darshan, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application’s I/O patterns, making it easier to identify bottlenecks and optimize performance.

Progress and Challenges

Export Directory Feature

One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn’t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.
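The idea behind such a feature can be sketched in a few lines: write next to the log by default, and redirect to a user-chosen directory when one is given. The function below is a hypothetical illustration, not Drishti's actual code or command-line option; the name `resolve_output_path`, the `.html` suffix, and the paths are assumptions.

```python
import tempfile
from pathlib import Path

def resolve_output_path(log_path, export_dir=None, suffix=".html"):
    """Return where a report should go for the given log file."""
    log_path = Path(log_path)
    if export_dir is None:
        # Default behavior: write next to the log file.
        return log_path.parent / (log_path.name + suffix)
    # User-chosen location: create it if needed, e.g. when logs are read-only.
    target_dir = Path(export_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / (log_path.name + suffix)

# Example: the log lives in a (possibly read-only) location, reports go elsewhere.
scratch = Path(tempfile.mkdtemp())
default_path = resolve_output_path("/data/logs/app.darshan")
exported_path = resolve_output_path("/data/logs/app.darshan", scratch / "reports")
print(default_path)
print(exported_path)
```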

CI Improvements and Cross-Project Dependencies

While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don’t adequately test the interactions between different branches of these interconnected tools. This is an area we’ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.

Refactoring for Multi-File Support

The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti’s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing performance.

The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:

Better separation of computation and condition checking
Easier parallelization of processing multiple traces
Finer-grained profiling of performance bottlenecks
More flexibility in data manipulation and memory management
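A minimal sketch of what separating computation from condition checking can look like with a Python data class follows. It is not the actual Drishti code; the class, field names, threshold, and message are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TraceStats:
    """Raw counters for one trace; field names are illustrative."""
    total_reads: int
    small_reads: int
    total_writes: int
    small_writes: int

    @property
    def small_request_ratio(self) -> float:
        """Computation only: share of all requests that are small."""
        total = self.total_reads + self.total_writes
        return (self.small_reads + self.small_writes) / total if total else 0.0

def check_small_requests(stats: TraceStats, threshold: float = 0.1) -> list[str]:
    """Condition checking only: turn a computed value into an insight."""
    if stats.small_request_ratio > threshold:
        return [f"{stats.small_request_ratio:.0%} of requests are small; consider batching them"]
    return []

stats = TraceStats(total_reads=800, small_reads=600, total_writes=200, small_writes=100)
insights = check_small_requests(stats)
print(insights)
```

Keeping the derived value on the data class and the threshold logic in a separate function means each piece can be unit-tested and profiled on its own, and one `TraceStats` can be built per trace when processing several files in parallel.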

Learnings and Skills Gained

Through this process, I’ve gained valuable insights into:

Refactoring large codebases
Understanding and improving cross-project dependencies
Implementing data classes in Python for better code organization
Balancing performance with code readability and maintainability

Next Steps

As I move forward with the project, my focus will be on:

Adding unit tests for individual methods to ensure functionality
Exploring alternative data frame implementations like Polars for better performance
Developing aggregation methods for different types of data across multiple Darshan files
Optimizing memory usage and computational efficiency for large datasets

Conclusion

Working on Drishti has been an incredible learning experience. I’ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I’m excited about the potential impact of these improvements on the scientific community’s ability to optimize their applications’ I/O performance.

I’m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!

If you have any questions or would like to learn more about the project, feel free to reach out to me. Let’s keep pushing the boundaries of scientific computing together!

Drishti

Thu, 06 Jun 2024 00:00:00 +0000

Namaste everyone! 🙏🏻

I’m Joel Tony, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I’m truly honored to be part of this year’s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I’m particularly grateful to be working under the mentorship of Dr. Jean Luca Bez, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. Suren Byna, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.

My project, “Drishti: Visualization and Analysis of AI-based Applications”, aims to extend the Drishti framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer’s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.

Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I’m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.

Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I’ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computing often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn’t just mean more computation; it means handling vastly more data. This shift makes I/O optimisation not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.

As outlined in my proposal, my tasks are threefold:

Modularize Drishti’s codebase: Currently, it’s a single 1700-line file that handles multiple functionalities. I’ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.
Enable multi-trace handling: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I’ll build a layer to aggregate these, providing a comprehensive view of the application’s I/O behavior.
Craft AI/ML-specific recommendations: Current suggestions often involve MPI-IO or HDF5, which aren’t typical in ML frameworks like PyTorch or TensorFlow. I’ll create targeted recommendations that align with these frameworks’ data pipelines.
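The multi-trace aggregation layer described in the second task could start from something like the sketch below, which sums additive counters across traces and tracks the overall time span. It assumes each trace has already been parsed into a plain dictionary; the structure, function name, and values are hypothetical and do not reflect Drishti's internals.

```python
from collections import Counter

def aggregate_traces(traces):
    """Sum additive counters across traces and track the overall time span."""
    totals = Counter()
    start, end = float("inf"), float("-inf")
    for trace in traces:
        totals.update(trace["counters"])   # adds values counter by counter
        start = min(start, trace["start"])
        end = max(end, trace["end"])
    return {"counters": dict(totals), "start": start, "end": end, "files": len(traces)}

# Made-up traces from two processes of the same job.
traces = [
    {"counters": {"reads": 100, "bytes_read": 4096}, "start": 0.0, "end": 12.5},
    {"counters": {"reads": 50, "bytes_read": 2048, "writes": 5}, "start": 1.0, "end": 20.0},
]

summary = aggregate_traces(traces)
print(summary)
```

Summation only makes sense for additive counters such as operation counts and byte totals. Values like maximum request size or timestamps need their own reducers, which is why the time span above is handled with `min` and `max` rather than a sum.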

This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it’s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.

From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.

In today’s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we’re not just tweaking code. We’re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.

I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I’m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.

Drishti

Tue, 30 Jan 2024 10:15:00 -0700

Drishti / Server-side Visualization Service

The proposed work will include investigating and building server-side solutions to support the visualization of larger I/O traces and logs, while integrating with the existing analysis, reports, and recommendations.

Topics: I/O, HPC, visualization, performance analysis
Skills: Python, HTML/CSS, JavaScript
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Drishti / Visualization and Analysis of AI-based Applications

The proposed work will extend Drishti to handle metrics from non-MPI applications, specifically AI/ML codes and applications. This work entails adapting the existing framework, heuristics, and recommendations to support metrics collected from AI/ML workloads.

Topics: I/O, HPC, AI, visualization, performance analysis
Skills: Python, AI, performance profiling
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

h5bench

Tue, 30 Jan 2024 10:15:00 -0700

h5bench / Reporting and Enhancing

The proposed work will include standardizing and enhancing the reports generated by the suite and integrating additional I/O kernels (e.g., HACC-IO).

Topics: I/O, HPC, benchmarking
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

h5bench / Compression

The proposed work will focus on including compression capabilities into the h5bench core access patterns through HDF5 filters.
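HDF5 applies filters such as deflate to each chunk of a chunked dataset. The sketch below imitates that chunk-wise behavior using only Python's standard `zlib` module; it does not call the HDF5 API and is not h5bench code. The chunk size, compression level, and payload are arbitrary choices for illustration.

```python
import zlib

def compress_chunks(data, chunk_size, level=6):
    """Deflate each fixed-size chunk independently."""
    return [zlib.compress(data[i:i + chunk_size], level)
            for i in range(0, len(data), chunk_size)]

def decompress_chunks(chunks):
    """Inflate every chunk and concatenate the results."""
    return b"".join(zlib.decompress(chunk) for chunk in chunks)

payload = bytes(range(256)) * 1024          # 256 KiB of highly repetitive bytes
chunks = compress_chunks(payload, chunk_size=64 * 1024)
ratio = len(payload) / sum(len(c) for c in chunks)

print(len(chunks), "chunks, compression ratio", round(ratio, 1))
```

Because each chunk is compressed independently, a reader can decompress only the chunks it needs. The trade-off a benchmark would measure is the extra CPU time against the reduction in bytes moved to storage.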

Topics: I/O, HPC, benchmarking, compression
Skills: C/C++, Python, HDF5
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

PolyPhy

Mon, 01 Jan 2024 00:00:00 +0000

PolyPhy is a GPU-oriented agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called Polyphorm to reconstruct the Cosmic web structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our workshop and more details about our research here.

Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.

PolyPhy Web Presence

Topics: Web Development UX Social Media
Skills: full stack web development, Javascript, good communicator
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Oskar Elek, Ezra Huscher

The online presentation of a software project is without a doubt one of the core ingredients of its success. This project aims to develop a sustainable web presence for PolyPhy, catering to interested contributors, active collaborators, and users alike.

Specific tasks:

Closely work with the mentors on understanding the context of the project and its detailed requirements in preparation of the proposal.
Port the existing website into a more modern Javascript framework (such as Next.js) that provides a user-friendly CMS and admin interface.
Update the contents of the website with new information from the repository page as well as other sources as directed by the mentors.
Develop a simple functional system for posting updates about the project to selected social media and other communication platforms (LinkedIn, Twitter/X or Mastodon, mailing list) which will also be reflected on the website.
Optional: improve the UX of the website where needed.
Optional: implement website analytics (visitor stats etc).

Data Visualization and Analysis with PolyPhy/Polyglot

Topics: Data Science Data Visualization Point Clustering 3D Neural Embeddings
Skills: data science, Python, Javascript, statistics, familiarity with AI and latent embedding spaces a big plus
Difficulty: Challenging
Size: Large (350+ hours)
Mentors: Oskar Elek, Kiran Deol

The aim of this project is to explore a novel data-scientific use case using PolyPhy and its associated web visualization interface PolyGlot. The contributor is expected to identify a dataset they are already familiar with and that fits the application scope of the PolyPhy/PolyGlot tooling: a complex point cloud arising from a 3D or higher-dimensional process which will benefit from latent pattern identification and a subsequent visual as well as quantitative analysis. The contributor needs to have the rights to use the dataset, either by owning the copyright or via the open-source nature of the data.

Specific tasks:

Closely work with the mentors on understanding the context of the project and its detailed requirements in preparation of the proposal.
Become acquainted with the tooling (PolyPhy, PolyGlot) prior to the start of the project period.
Document the nature of the target dataset and define the complete data pipeline with assistance of the mentors, including the specific analytic tasks and objectives.
Implement the data pipeline in PolyPhy and PolyGlot.
Document the process and resulting findings in a publicly available report.

noWorkflow as an experiment management tool - Final Report

Thu, 14 Sep 2023 00:00:00 +0000

This post describes the final status of our work and the achievements we have made in our project proposal for noWorkflow.

For a friendlier introduction to our work, please refer to this tutorial.

Our final code to merge is available in this repository.

Different ways of managing experiments

From our starting point at the midterm, and from our initial aspirations for the SoR, we kept on track with the goal of adding features to noWorkflow related to managing DS/ML experimental setups focusing on reproducibility.

With the emergence of AI across multiple fields in industry and academia, the subject of reproducibility has become increasingly relevant. In [1] we have an interesting description of the sources of irreproducibility in Machine Learning. All these sources are present at different stages during the project's experimental phases and may even persist in production environments, leading to the accumulation of technical debt [2]. The problem of irreproducibility is also discussed in [3], [4], which point out that speed of delivery usually comes at the expense of reproducibility, among other victims.

The CRISP-DM process as reviewed in [5] demonstrates that Data Science experiments follow a typical path of execution. In the same manner, [3], [6], [7] point out that Machine Learning pipelines are composed of well-defined layers (or stages) throughout their lifecycle. The emergence of AI in real-world applications put pressure on the almost artisanal ways of creating and managing analytical experiments and reinforced that there is room to make things more efficient.

In the search for possible approaches to the problem, we came across several projects that aimed to address these issues. Not surprisingly, multiple authors pursued the same goal, for instance [9], [10]. In these references, and confirmed in our survey, we found everything from targeted solutions for specific modeling steps to services aiming for end-to-end AIOps management. Some are available as software packages, others as SaaS in cloud environments. In general terms, all of them end up offering features in different layers of the workflow (i.e. data, feature, scoring, and evaluation) or with different conceptualizations of reproducibility/replicability/repeatability, as noted by [11]. On one hand, this lack of standards makes any assessment difficult. On the other hand, it suggests a community still exploring a hot topic.

Specifically for this project, our focus is on the initial stages of computational scientific experiments. As studied in [8], in this phase, experiments are i) implemented by people as prototypes, ii) with minor focus on pipeline design and iii) in tools like Notebooks, which mix documentation, visualization and code with no required sequential structure. These three practices hurt reproducibility and efficiency and are prone to create technical debt. However, tools like noWorkflow show huge potential in such scenarios. They are promising because they i) demand minimal setup to be functional, ii) work well with almost nonexistent workflows, iii) require minimal additional intrusive code alongside the experimental code and iv) integrate well with Notebooks, which are the typical artifact in these experiments.

According to its core team, the primary goal of noWorkflow is to "...allow scientists to benefit from provenance data analysis even when they don't use a workflow system.". Unlike other tools, "noWorkflow captures provenance from Python scripts without needing a version control system or any other environment". It is particularly interesting when we are in the scenario described above, where we lack any structured system at the beginning of experiments. In fact, after going through the docs, we can verify that noWorkflow provides:

Command-line accessibility
Seamless integration with Jupyter Notebooks
Minimal setup requirements in your environment
Elimination of the need for virtual machines or containers in its setup
Workflow-free operation
Open source license
Framework-agnostic position

Finally, our research confirmed that there is a gap in the management of scientific experiments that reproducibility tooling needs to fill. Provenance tools can help academia and industry groups reach this goal, and this summer we focused on adding relevant features to move noWorkflow in this direction.

Different tools for different needs

In our research phase, we didn't find any taxonomy that fully accommodated our review of the different categories of tools providing reproducibility and experiment management. Instead, we describe the tools using the following categories (freely adapted from these online references [here] and [here]):

Data and Pipeline Versioning: Platforms dealing with ingestion, processing, and exposing of features for model training and inference. They enable collaboration and discoverability of already existing Feature Sets throughout teams and organizations. They provide provenance and lineage for data at different levels of complexity.

Metadata Stores/Experiment Trackers: They are specifically built to store metadata about ML experiments and expose it to stakeholders. They help with debugging, comparing, and collaborating on experiments. It is possible to divide them into Experiment Trackers and a Model Registry. Moreover, there are projects offering reproducibility features like hyperparameter search, experiment versioning, etc. However, they demand more robust workflows and are better suited for projects in the production/monitoring phases.

Pipeline frameworks: They operate within the realm of production, similar to Data Engineering workflows. Their usual goal is to allow any ML/AI products to be served across a wide range of architectures, and integrate all the low-hanging fruits along the way. For instance, pipelines adding hyperparameter optimization tasks, experiment tracking integrations, boilerplate containerized deployment, etc.

Deployment and Observability: They focus on deploying models for real-time inference and monitoring model quality once they are deployed in production. Their aim is to facilitate post-deployment control tasks such as monitoring feature drifts, conducting A/B testing, facilitating fast model shifts, and more.

The most remarkable aspect of this survey is that there are different tools for different phases in the life cycle of AI products. Tools like DVC and Pachyderm are Metadata Stores, allowing Experiment Tracking with features for tagging variables, as well as Data and Pipeline tracking. They are the most similar tools to noWorkflow in functionality. However, DVC has a more complex framework for dealing with different 'types' of tags, and relies on command-line tools to extract and analyze tagged variables. It also depends strongly on Git and replicates Git's logic. Pachyderm requires a more sophisticated setup at the start, relying on containers and a server. This is an obstacle for small and lean prototypes, since it requires installing a Docker image and brings all the friction of managing it.

There are other tools, like MLFlow and Neptune, that position themselves as Model Experiment Versioning tools with Monitoring and Deployment features. They also have elements of pipeline frameworks, offering boilerplate for seamless integration with cloud platforms.

Pipelines are a vast field. They include AWS SageMaker, Google Vertex, DataRobot and Weights & Biases, among others. All of them offer features across all categories, with a strong focus on automating as much as possible for the final user: automatic parameter tuning, model selection, retraining, data lineage, metadata storing, etc.

Finally, Deployment and Observability frameworks are in the deployment realm, which is another stage far removed from the prototypical phases of experiments. They come onto the scene when all experimental and inferential processes are done, and there is an AI artifact that needs to be deployed and monitored. Tools such as Seldon, H2O, and DataRobot do this job, again with some features of hyperparameter tuning, pipeline frameworks, and data and pipeline tracking.

In light of this, when considering the management and operation of experiments, we have a reduced sample of alternatives. Among them, Notebook integration/management is rare. Some of them rely on other tools like Git or impose overhead on coding and setup through reserved keywords, tags and managerial workflows that hinder the process.

At first sight, our "informal" taxonomy positions noWorkflow as a Data/Pipeline Versioning and Metadata Store/Experiment Tracker. It is not a Pipeline Framework which works like a building block, facilitating the integration of artifacts at production stages. It is not a Deployment and Observability framework, because they are in the post-deployment realm, which is another stage far removed from prototypical phases of experiments.

Desiderata

As mentioned earlier, a typical workflow in DS/ML projects is well described by CRISP-DM [5] and precedes the deployment and production phases in the whole lifecycle of DS/ML projects.

Fig 1: CRISP-DM example of trajectory through a data science project

Briefly speaking, a workflow starts when a user creates a Jupyter Notebook and starts writing code. Usually, they import or select data from a source, explore the features expected to have the highest inference potential, tune some parameters to set up training, then train the model and evaluate its predictive power through different metrics. At this final step, we have delineated a trial. This trial result can suggest further improvements and new hypotheses about data, features, model types and hyperparameters. Then, we have a new experiment in mind that will result in a new trial.

When this process repeats multiple times, a researcher may end up with different notebooks, each storing a different experiment. Each notebook has multiple hyperparameters, modeling choices and modeling hypotheses. Alternatively, the experimenter may have a single notebook where different experiments were executed, with the cells run in a nonlinear order. This latter case is pointed out in [8], where Notebook flexibility makes it difficult to understand which execution order resulted in a specific output.

In a dream space, any researcher or team would benefit most if they could:

a) In a running Notebook, retrieve all the operations that contributed to the result of a variable of interest. In this case, modifications applied to the inputs or to the order of operations would be easily detectable. The same holds for any nonlinear execution that interferes with a result.

b) Compare trials after different experiments. After experimenting with different hypotheses about hyperparameters, features or operation order, the user should easily compare the history of two trials and spot differences.

c) Retrieve a target variable among different trials that were executed in the context of an experiment. After proceeding with multiple experimental trials, users should be able to compare results whether or not they are stored in different Notebooks.

d) Be as much "no workflow" as possible. All the former requirements should be possible with minimal code intervention, tags, reserved words or any active coding effort.

With these goals in mind, we worked on our deliverables and used the experiment carried out by [12] as a guideline to validate the new noWorkflow features.

Deliverables

In this section, we will describe what we have implemented during this summer.

We started by tagging cells and variables and then navigating through their pre-dependencies, that is, all the other variables and function calls that contributed to their final value. This was a fundamental step that allowed us to build features that are really useful in day-to-day practice.
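The idea of navigating pre-dependencies can be pictured as a backward walk over a dependency graph. The sketch below only illustrates that idea and is not noWorkflow's implementation: the `deps` mapping, the toy variable names, and the `collect_backward_deps` helper are all hypothetical.

```python
from collections import deque


def collect_backward_deps(deps, target):
    """Return every name `target` depends on, directly or transitively.

    `deps` maps a variable or call name to the names it was computed from.
    It is a hand-built, hypothetical structure used only for illustration.
    """
    seen = []
    queue = deque(deps.get(target, []))
    while queue:
        name = queue.popleft()
        if name in seen:
            continue  # already visited through another path
        seen.append(name)
        queue.extend(deps.get(name, []))
    return seen


# Toy experiment: the score depends on the model and the test data, and so on.
deps = {
    "score": ["model", "X_test"],
    "model": ["X_train", "n_estimators"],
    "X_train": ["raw_data"],
    "X_test": ["raw_data"],
}
result = collect_backward_deps(deps, "score")
print(result)
```

A breadth-first walk lists the closest contributors first, which is one reasonable ordering when a user wants to inspect what most directly shaped a tagged variable.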

From the features of tagging a cell and tagging a variable, we evolved to the following features (an interactive notebook is available here):

backwards_deps('var_name', glanularity_level) : returns a dictionary storing operations/function calls and their associated values that contributed to the final value of the tagged variable. glanularity_level sets whether the internal operations of the functions are included or not.

global_backwards_deps('var_name', glanularity_level) : does the same as backwards_deps, but across all the different tagging and re-tagging events in the notebook. It allows retrieval of the complete chain of operations behind a tagged variable across all executed cells in the notebook.
store_operations(trial_id, dictionary_ops) : saves the current trial in order to make further comparisons with other experiments. The dictionaries aren't stored in .noworkflow/db.sqlite, but in a shelve object named *ops.db* in the current notebook's local folder.
resume_trials() : to support the management of experiments, the user can see the trial_ids of all experiments stored in ops.db and available for comparison/analysis.
trial_intersection_diff(trial_id1, trial_id2) : all variables/function calls shared by two experiments have their scalar values compared.

trial_diff(trial_id1, trial_id2) : The values of variables and function calls are exhibited in a diff file format, emphasizing the operations' order. The goal here is to show that between the two experiments, the order of operations was different. Again, only scalar values are exhibited. More complex data structures (matrices, vectors, tensors, etc.) are only signaled as 'complex_type'

var_tag_plot('var_name') : Chart the evolution of a given variable across multiple trials in the database. In this case, all experiments stored in ops.db and tagged as *target_var* have their values plotted

var_tag_values('var_name') : Returns a pandas DataFrame with the var_name entries and their corresponding values across different trials.
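To make the storage and comparison features more concrete, here is a minimal standalone sketch of the same ideas using only the Python standard library: trials kept in a shelve file, a comparison of shared names, and a diff that exposes operation order. It is not the noWorkflow code. The helper names (`store_trial`, `list_trials`, `intersection_diff`, `ordered_diff`), the file name `ops_demo`, and the trial values are all hypothetical.

```python
import difflib
import os
import shelve
import tempfile

SCALARS = (int, float, str, bool, type(None))


def scalar_view(value):
    """Keep scalars as they are; flag anything else as 'complex_type'."""
    return value if isinstance(value, SCALARS) else "complex_type"


def store_trial(db_path, trial_id, ops):
    """Save one trial's ordered operations (name -> value) in a shelve file."""
    with shelve.open(db_path) as db:
        db[trial_id] = {name: scalar_view(value) for name, value in ops.items()}


def list_trials(db_path):
    """Return the ids of all stored trials."""
    with shelve.open(db_path) as db:
        return sorted(db.keys())


def load_trial(db_path, trial_id):
    """Read one stored trial back as a dictionary."""
    with shelve.open(db_path) as db:
        return db[trial_id]


def intersection_diff(db_path, id1, id2):
    """Pair up the values of every name that appears in both trials."""
    first, second = load_trial(db_path, id1), load_trial(db_path, id2)
    return {name: (first[name], second[name]) for name in first if name in second}


def ordered_diff(db_path, id1, id2):
    """Unified diff of two trials, so changes in operation order show up."""
    first, second = load_trial(db_path, id1), load_trial(db_path, id2)
    left = [f"{name} = {value}" for name, value in first.items()]
    right = [f"{name} = {value}" for name, value in second.items()]
    return list(
        difflib.unified_diff(left, right, fromfile=id1, tofile=id2, lineterm="")
    )


# Toy data: two trials with the same names, different order and values.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "ops_demo")
    store_trial(db_path, "trial_1",
                {"n_estimators": 100, "max_depth": 3, "weights": [0.2, 0.8]})
    store_trial(db_path, "trial_2",
                {"max_depth": 5, "n_estimators": 100, "weights": [0.5, 0.5]})
    trials = list_trials(db_path)
    shared = intersection_diff(db_path, "trial_1", "trial_2")
    diff_lines = ordered_diff(db_path, "trial_1", "trial_2")

print(trials)
print(shared)
print("\n".join(diff_lines))
```

Replacing non-scalar values with a 'complex_type' marker keeps the comparison simple, at the cost of hiding differences inside lists, matrices, or data frames.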

Challenges

As expected, we had unexpected findings throughout the project. Below, we delve into the most significant challenges we had to face:

Jupyter notebooks allow a nonlinear execution of small parts of code through cells. More than once, we had to agree on how to design functionality for scenarios we had not anticipated. One example was the backwards_deps() and global_backwards_deps() functions. The latter function was born to cover the case where the user wants all dependencies rather than the local cell dependencies.
Despite the high quality of the current version of the package, the project lacks documentation, which slows down the analysis of any new development. In this project, the aid of mentors was crucial at some points where deeper knowledge was needed.
What is the vocation of noWorkflow? At some points in the project, we had to discuss forcing some kind of workflow over the user. And it would go against the philosophy of the project.
When working on comparing results, especially in DS/ML fields, complex types arise. Numerical vectors, matrices, and tensors from NumPy and other frameworks, as well as data frames, can't be properly manipulated based on our current approach.
The dilemma of focusing on graphic visual features versus more sophisticated APIs. More than once, we needed to choose between making a visual add-on to Jupyter or implementing a more complete API.
The current version of Jupyter support in noWorkflow doesn’t integrate well with Jupyter Lab. Also, IPython itself keeps releasing new versions, and noWorkflow needs to adapt to them.

Future Improvements

Given our current achievements and the insights gained along the project, we would highlight the following points as crucial future roadmap improvements:

Add a complex type treatment for comparisons. Today, visualizing and navigating through matrices, data frames, and tensors isn't possible with noWorkflow, although users can do so by their own means.
Migrate the dictionaries storing sequences of operations from shelve objects to a more efficient storage and retrieval mechanism.
Make it easier for users to manage (store, retrieve, and navigate) through different trials.
Add graphical management instead of relying upon API calls only.
Evolve the feature of tagging cells.
When tagging a model, save its binary representation to be recovered in the future.
Add the capability of tracking local dataset reads. Currently, it is possible to track changes in the name/path of the dataset. However, any modification in the integrity of a dataset is not traceable.
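One common way to detect changes in a dataset's content, rather than only in its name or path, is to record a checksum of the file each time it is read. The sketch below shows that general technique with the standard library. It is an assumption about how such a feature could work, not something noWorkflow provides today, and `file_sha256` and the demo file are hypothetical.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: same path, modified content, different fingerprint.
with tempfile.TemporaryDirectory() as tmp:
    dataset = os.path.join(tmp, "data.csv")
    with open(dataset, "w", encoding="utf-8") as handle:
        handle.write("id,label\n1,0\n2,1\n")
    before = file_sha256(dataset)

    with open(dataset, "a", encoding="utf-8") as handle:
        handle.write("3,1\n")  # silent change to the dataset
    after = file_sha256(dataset)

print(before != after)
```

Storing the digest next to the dataset path for each trial would let two trials be flagged as using different data even when the path is identical.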

What I've learned

This was a great summer with two personal discoveries. The first one was my first formal contact with the subject of Reproducibility. The second was to fully contribute to an Open Source project. In the research phase, I could get in touch with the state of the art in reproducibility research and some of its nuances. In the Open Source contributing experience, I was mentored by the core team of noWorkflow and exercised all the skills required to build a high-quality software product.

Acknowledgments

I would like to thank the organization of Summer of Reproducibility for providing this wonderful opportunity for interested people to engage with Open Source software. Also, thanks to the core team of noWorkflow for supporting me in doing this work.

Bibliography

[1] O. E. Gundersen, K. Coakley, C. Kirkpatrick, and Y. Gil, “Sources of irreproducibility in machine learning: A review,” arXiv preprint arXiv:2204.07610.

[2] D. Sculley et al., “Machine Learning: The High Interest Credit Card of Technical Debt,” in SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.

[3] P. Sugimura and F. Hartl, “Building a reproducible machine learning pipeline,” arXiv preprint arXiv:1810.04570, 2018.

[4] D. Sculley et al., “Hidden technical debt in machine learning systems,” Adv. Neural Inf. Process. Syst., vol. 28, 2015.

[5] F. Martínez-Plumed et al., “CRISP-DM twenty years later: From data mining processes to data science trajectories,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 8, pp. 3048–3061, 2019.

[6] N. A. Lynnerup, L. Nolling, R. Hasle, and J. Hallam, “A Survey on Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on Real-World Robots,” in Proceedings of the Conference on Robot Learning, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., in Proceedings of Machine Learning Research, vol. 100. PMLR, 30 Oct–01 Nov 2020, pp. 466–489.

[7] A. Masood and A. Hashmi, “AIOps: predictive analytics & machine learning in operations,” Cognitive Computing Recipes: Artificial Intelligence Solutions Using Microsoft Cognitive Services and TensorFlow, pp. 359–382, 2019.

[8] J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire, “Understanding and improving the quality and reproducibility of Jupyter notebooks,” Empirical Software Engineering, vol. 26, no. 4, p. 65, 2021.

[9] D. Kreuzberger, N. Kühl, and S. Hirschl, “Machine Learning Operations (MLOps): Overview, Definition, and Architecture,” IEEE Access, vol. 11, pp. 31866–31879, 2023.

[10] N. Hewage and D. Meedeniya, “Machine learning operations: A survey on MLOps tool support,” arXiv preprint arXiv:2202.10169, 2022.

[11] H. E. Plesser, “Reproducibility vs. replicability: a brief history of a confused terminology,” Front. Neuroinform., vol. 11, p. 76, 2018.

[12] Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, “The effect of feature extraction and data sampling on credit card fraud detection,” Journal of Big Data, vol. 10, no. 1, pp. 1–17, 2023.

[Mid-term] Capturing provenance into Data Science/Machine Learning workflows

Mon, 31 Jul 2023 00:00:00 +0000

This post describes our midterm work status and some achievements we have made so far in the project for the noWorkflow package.

The initial weeks

I started doing a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms. It was a new subject to me, and I aimed to build a more robust theoretical background in the field. Meanwhile, I took notes in this series of posts.

Then, as planned, I connected with the current noWorkflow supporters in order to get a broader view of the project and their contributions. Additionally, Juliana Freire, João Felipe Pimentel, and I set up a weekly one-hour meeting to keep track of my activities.

Brainstormed opportunities

At the beginning of June, we also met with other project supporters to brainstorm about our initial proposal. From this meeting, we came up with a plan for how to technically approach a new noWorkflow feature for experiment management in Data Science and Machine Learning.

In this brainstorm, we agreed that Jupyter Notebooks are, by far, the most frequent setup in DS/ML computational experiments. They established themselves as the fundamental artifact by embedding code and text and enabling execution and visualization. Entire experiments are created and kept in Jupyter notebooks until they are sent to production. The opportunity at hand is to integrate noWorkflow with Jupyter Notebooks. Our mid-term goal was therefore adapted from the original plan of only selecting and executing a prototypical ML experiment. We added the goal of paving the way for providing a tagging feature for Notebook cells.

More specifically, DS/ML experimental workflows usually have well-defined stages composed of data reading, feature engineering, model scoring, and metrics evaluation. In our dream space, the user would tag a cell in their experiment, enabling the capture of the tagged metadata into a database. This step is part of the ultimate goal of facilitating comparisons, management, and even causal inference across different trials of a DS/ML experiment.

Current deliverables

So, based on our plans, we created a separate table to store the metadata from cell tagging. This table stores the cell hash codes and information to match the code executed within a cell. As a result, we can store tags and the activation ids of the cells, enabling us to identify a cell containing a given stage in a DS/ML experiment.

The second feature implemented was tagging a specific variable. In the same way as for a cell, it is now possible to stamp a given variable with a tag, keeping its name, id, and received value in this separate table.

Finally, we worked on displaying the dependencies of a given variable. In this case, by tagging a given variable, we can display the other variables, values, and cells activated in its construction. Then, we can visualize the dependencies that contributed to its final value.
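As a rough picture of what a separate tag table can look like, the sketch below uses an in-memory SQLite database. The schema, column names, and helper functions are hypothetical and chosen only for illustration; they are not the actual noWorkflow schema.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per tagged cell or tagged variable.
conn.execute(
    """
    CREATE TABLE tag (
        id INTEGER PRIMARY KEY,
        tag_name TEXT NOT NULL,
        kind TEXT NOT NULL CHECK (kind IN ('cell', 'variable')),
        code_hash TEXT,
        activation_id INTEGER,
        variable_name TEXT,
        variable_value TEXT
    )
    """
)


def tag_cell(conn, tag_name, cell_source, activation_id):
    """Store a tag for a cell, identified by a hash of its source code."""
    code_hash = hashlib.sha256(cell_source.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT INTO tag (tag_name, kind, code_hash, activation_id) "
        "VALUES (?, 'cell', ?, ?)",
        (tag_name, code_hash, activation_id),
    )
    return code_hash


def tag_variable(conn, tag_name, name, value, activation_id):
    """Store a tag for a variable together with the value it received."""
    conn.execute(
        "INSERT INTO tag (tag_name, kind, activation_id, variable_name, "
        "variable_value) VALUES (?, 'variable', ?, ?, ?)",
        (tag_name, activation_id, name, repr(value)),
    )


def cells_for_stage(conn, tag_name):
    """Find the cells (hash and activation id) tagged with a given stage."""
    return conn.execute(
        "SELECT code_hash, activation_id FROM tag "
        "WHERE kind = 'cell' AND tag_name = ?",
        (tag_name,),
    ).fetchall()


# Toy usage with made-up cell source, activation ids, and score.
stage_hash = tag_cell(conn, "feature_engineering",
                      "X = df.drop(columns=['label'])", 7)
tag_variable(conn, "target_var", "f1_score", 0.84, 9)
print(cells_for_stage(conn, "feature_engineering"))
```

Keeping tags in their own table means existing provenance tables stay untouched, and a stage such as feature engineering can be looked up by tag name alone.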

For an overview of current developments, please refer to my fork of the main project.

Challenges

During this period, we had to make choices along the way. For instance, capturing the provenance of cells through tags is a different solution than tagging code chunks in scripts. In this case, we decided to stick with tagging Notebook cells at this moment. We also opted to start storing the metadata to enable comparisons between trials rather than focus on a sophisticated graphic and user-friendly cell tagging system. We also opted to keep this metadata info stored in a separate table in the database.

Next steps

In the second half of the summer, our goal is to integrate these features in order to proceed with comparisons among experiments. Such comparisons would use the tagged variables as the hyperparameters of DS/ML experiments or as key variables to assess the experiments, such as errors or scores. As a result, we will be able to compare the results of two trials in a more accurate and easily reproducible way.

Verify the reproducibility of an experiment

Wed, 24 May 2023 00:00:00 +0000

Hello everyone, my name is Jesse and I’m proud to be a fellow in this 2023 Summer of Reproducibility program, contributing to the noWorkflow project.

My proposal was accepted under the mentorship of João Felipe Pimentel and Juliana Freire and aims to map and test the capture of provenance in typical Data Science and Machine Learning experiments.

What…

Although much can be said about what reproducibility means, the ability to replicate results in day-to-day Data Science and Machine Learning experiments can pose a significant challenge for individuals, companies and research centers. This challenge becomes even more pronounced with the emergence of analytics and AI, where scientific methodologies are extensively applied on an industrial scale. Reproducibility then assumes a key role in the productivity and accountability expected from Data Scientists, Machine Learning Engineers, and other roles engaged in ML/AI projects.

How…

In the day-to-day, the pitfalls of non-reproducibility appear at different points of the experiment lifecycle. These challenges arise when multiple experiments need to be managed for an individual or a team of scientists. In a typical experiment workflow, reproducibility appears in different steps of the process:

The need to track the provenance of datasets.
The need to manage changes in hypothesis tests.
Addressing the management of system hardware and OS setups.
Dealing with outputs from multiple experiments, including the results of various model trials.

In academic environments, these issues can result in mistakes and inaccuracies. In companies, they can lead to inefficiencies and technical debts that are difficult to address in the future.

Finally…

I believe this is a great opportunity to explore the emergence of these two hot topics, AI and reproducibility! I will share more updates here throughout this summer and hope we can learn a lot together!

FlashNet: Towards Reproducible Data Science for Storage System

Thu, 02 Feb 2023 00:00:00 +0000

The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers. They must know both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:

How can we encourage data scientists to look into storage problems?
How can we create a transparent platform that allows such decoupling?
Within the storage/ML community can we create two collaborative communities, the storage engineers and the storage data scientists?

In the ML/Deep Learning community, the large ImageNet benchmarks have spurred research in image recognition. Similarly, we would like to provide benchmarks for fostering storage research in ML-based per-IO latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. As a first step toward this larger goal, we use I/O latency prediction as a case study. Thus, FlashNet has been built for I/O latency prediction tasks. With FlashNet, data engineers can collect the IO traces of various devices. The data scientists can then train ML models to predict the IO latency based on those traces. All traces, results, and code will be shared in the FlashNet training ground platform, which utilizes Chameleon Trovi for better reproducibility.

In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary-class and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world's first end-to-end data science platform for storage systems.

Building FlashNet Platform

Topics: Storage systems, reproducibility, machine learning, continual learning
Skills: C++, Python, PyTorch, experience with machine learning pipelines
Difficulty: Medium
Size: Large (350 hours)
Mentors: Haryadi S. Gunawi
Contributor(s): Justin Shin, Maharani Ayu Putri Irawan

Build an open-source platform to enable collaboration between storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models/architecture, dataset patterns, data preprocessing techniques, and various feature engineering strategies.

Specific tasks:

Work with mentors on understanding the context of the project.
Reproduce the FlashNet evaluation results from prior works.
Build and improve FlashNet components based on the existing blueprint.
Collect and analyze the FlashNet evaluation results.

Polyphorm / PolyPhy

Thu, 15 Dec 2022 00:00:00 +0000

PolyPhy infrastructure engineering and practices

Topics: DevOps Code Refactoring CI/CD
Skills: fluency in Python, experience with OOP, experience with building and packaging libraries, understanding GitHub and its tools ecosystem
Difficulty: Challenging
Size: 350+ hours
Mentors: Oskar Elek, Anisha Goel
Contributor(s): Prashant Jha

Your responsibility in this project will be developing new infrastructure for the PolyPhy project as well as maintaining the existing codebases. This is a multifaceted role that will require coordination with the team and an active approach to understanding the technical needs of the community.

Specific tasks:

Work with the technical lead to develop effective interfaces for PolyPhy, providing access to its functionality on the level of both Python/Jupyter code and the command line.
Maintain the existing codebase and configure it according to the team’s needs.
Develop and extend the current CI/CD functionality and related code metrics.
Document the best practices related to the above.

Write PolyPhy’s technical story and content

Topics: Writing, Documentation, Storytelling
Skills: experience writing structured text, well read, technical or scientific education, webdev basics (preferably NodeJS)
Difficulty: Moderate
Size: 350 hours
Mentors: Oskar Elek, Ezra Huscher

Integral to PolyPhy’s presentation is a “story” - a narrative understanding - that the users and the project contributors can relate to. Your responsibility will be to develop the written part of that understanding, as well as major portions of technical documentation that match it.

Specific tasks:

Work with mentors on understanding the context of the project.
Write and edit diverse pages of the project website.
Work with mentors to improve the project’s written community practices (diversity, communication).
Write and edit narrative and explanatory parts of PolyPhy’s documentation.
Create tutorials that present core functionality of the toolkit.

Community engagement and management

Topics: Community Management, Social Media, Networking
Skills: documented experience with current social media landscape, social and well spoken, ability to communicate technical concepts
Difficulty: Moderate
Size: 175 or 350 hours
Mentors: Oskar Elek, Ezra Huscher

Your responsibility will be to build and engage the community around PolyPhy. This includes its standing team and stakeholders, current expert users, potential adopters as well as the general public. The scope (size) of the project depends on the level of commitment during and beyond the Summer and is negotiable upfront.

Specific tasks:

Manage the team’s communication channels (Slack, Zoom, email) and maintain active presence therein.
Develop social media presence for PolyPhy on Twitter, LinkedIn and other selected social media platforms.
Manage and extend the online presence for the project, including its website, mailing list, and other applicable outreach activities.
Research and engage with new communities that would benefit from PolyPhy, both as its expert users and contributors.

Apache AsterixDB

Mon, 07 Nov 2022 10:15:56 -0700

AsterixDB is an open source parallel big-data management system. AsterixDB is a well-established Apache project that has been active in research for more than 10 years. It provides a flexible data model that supports modern NoSQL applications with a powerful query processor that can scale to billions of records and terabytes of data. Users can interact with AsterixDB through a powerful and easy-to-use declarative query language, SQL++, which provides a rich set of data types including timestamps, time intervals, text, and geospatial, in addition to traditional numerical and Boolean data types.

Geospatial Data Science on AsterixDB

Topics: Data science, SQL++, documentation
Skills: SQL, Writing, Spreadsheets
Difficulty: Medium
Size: Medium or Large (175 or 350 hours)
Mentors: Ahmed Eldawy, Akil Sevim

Build a data science project using AsterixDB that analyzes geospatial data among other dimensions. Use Chicago Crimes as the main dataset and combine it with other datasets, including points of interest and ZIP Code boundaries. During this project, we will answer interesting questions about the data and visualize the results, such as:

What is the most common crime type on a specific date or over the weekends?
Where do most of the arrests happen?
How do the crime rates change over time for different regions?
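As a rough illustration of how the first question above might translate into a query, here is a minimal sketch. It runs against an in-memory SQLite database from Python's standard library rather than AsterixDB, so the SQL dialect (for example `strftime`) differs from SQL++. The table name `crimes`, the columns `occurred_on` and `primary_type`, and the function name are assumptions made for this example, not the schema of the Chicago Crimes dataset.

```python
import sqlite3

# Hypothetical schema: crimes(occurred_on TEXT in ISO format, primary_type TEXT).
# SQLite's strftime('%w', ...) returns the weekday as '0' (Sunday) to '6' (Saturday).
WEEKEND_TOP_CRIME = """
SELECT primary_type, COUNT(*) AS n
FROM crimes
WHERE strftime('%w', occurred_on) IN ('0', '6')
GROUP BY primary_type
ORDER BY n DESC, primary_type
LIMIT 1
"""


def most_common_weekend_crime(rows):
    """Return (crime_type, count) for the most frequent weekend crime type.

    `rows` is an iterable of (iso_date, crime_type) pairs. Returns None when
    no row falls on a weekend.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE crimes (occurred_on TEXT, primary_type TEXT)")
        conn.executemany("INSERT INTO crimes VALUES (?, ?)", rows)
        return conn.execute(WEEKEND_TOP_CRIME).fetchone()
    finally:
        conn.close()
```

The same grouping and ordering idea should carry over to a declarative query in AsterixDB, although the date functions and the dataset definition would need to follow the SQL++ documentation.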

The goals of this project are:

Understand how to build a scalable data science project using AsterixDB.
Translate common questions to SQL queries and run them on large data.
Learn how to visualize the results of queries and present them.
Write detailed documentation about the process of building a data science application in AsterixDB.
Improve the documentation of AsterixDB while working on the project to improve the experience for future users.

Machine Learning Integration

As a bonus task, and depending on the progress of the project, we can explore the integration of machine learning with AsterixDB through Python UDFs. We will use AsterixDB's Python integration through user-defined functions to connect the AsterixDB backend with scikit-learn and build unsupervised and supervised models for the data. For example, we can cluster the crimes based on their location and other attributes to find interesting patterns or hotspots.
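To make the hotspot idea concrete, the sketch below counts points per grid cell using only the Python standard library. It is a deliberately simple stand-in for the scikit-learn clustering mentioned above, not the AsterixDB Python UDF interface. The function name, the cell size, and the (latitude, longitude) input format are assumptions for illustration.

```python
import math
from collections import Counter


def hotspot_cells(points, cell_size, top_n=3):
    """Bin (lat, lon) points into square grid cells and return the busiest cells.

    Each cell is identified by the integer pair
    (floor(lat / cell_size), floor(lon / cell_size)). The result is a list of
    (cell, count) pairs sorted by descending count.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    counts = Counter(
        (math.floor(lat / cell_size), math.floor(lon / cell_size))
        for lat, lon in points
    )
    return counts.most_common(top_n)
```

A clustering model such as k-means or DBSCAN would replace the fixed grid with data-driven groups, at the cost of extra parameters to tune.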

FasTensor

Mon, 07 Nov 2022 10:15:56 -0700

FasTensor is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used in scientific computing and are effective for expressing a wide range of data analysis computations, including common aggregation operations from database management systems and advanced machine learning pipelines. The FasTensor execution engine exploits the structural locality of multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, and parallel execution.
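The stencil metaphor can be sketched in a few lines of plain Python. This is only an illustration of the idea, not FasTensor's API (FasTensor is a C++ library): the class `Stencil`, the function `transform`, and the choice to clamp out-of-range neighbors to the array edge are all assumptions made for this example, and the sketch runs serially.

```python
class Stencil:
    """A view centered on one cell that reads neighbors by relative offset."""

    def __init__(self, data, row, col):
        self._data = data
        self._row = row
        self._col = col

    def __call__(self, d_row, d_col):
        # Clamp to the array boundary so edge cells reuse their nearest neighbor.
        row = min(max(self._row + d_row, 0), len(self._data) - 1)
        col = min(max(self._col + d_col, 0), len(self._data[0]) - 1)
        return self._data[row][col]


def transform(data, udf):
    """Apply a user-defined function to a stencil centered on every cell."""
    return [
        [udf(Stencil(data, row, col)) for col in range(len(data[0]))]
        for row in range(len(data))
    ]


def mean_of_row_neighbors(s):
    """Example user-defined function: three-point moving average along a row."""
    return (s(0, -1) + s(0, 0) + s(0, 1)) / 3
```

Because each output cell depends only on a small neighborhood of the input, an engine is free to partition the array and evaluate the function on the pieces in parallel, which is the property the paragraph above refers to.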

Continuous Integration

Topics: Data Management, Analytics
Skills: C++, GitHub
Difficulty: Medium
Size: Large (350 hours)
Mentors: John Wu, Bin Dong, Suren Byna

Develop a test suite for the public API of FasTensor
Automate execution of the test suite
Document the continuous integration process
Develop performance testing suite

FasTensor

Mon, 07 Nov 2022 10:15:56 -0700

Tensor execution engine on GPU

Topics: Data Management, Analytics
Skills: C++, GitHub
Difficulty: Difficult
Size: Large (350 hours)
Mentors: John Wu, Bin Dong, Suren Byna

Tensor-based computing is needed by scientific applications and, now, by advanced AI model training. Most tensor libraries are hand-customized and optimized for GPUs, and most of them serve only one kind of application. For example, TensorFlow is optimized only for AI model training. Optimizing generic tensor computing libraries on GPUs can benefit a wide range of applications. FasTensor, as a generic tensor computing library, currently works efficiently only on CPUs. How to run FasTensor on GPUs is still unexplored. Research and development challenges include, but are not limited to: 1) how to maintain the structural locality of tensor data on GPUs; 2) how to reduce the performance loss when the structural locality of a tensor is broken on GPUs.

Develop a mechanism to move user-defined computing kernels onto GPU
Evaluate the performance of the execution engine
Document the execution mechanism
Develop performance testing suite

Continuous Integration

Topics: Data Management, Analytics
Skills: C++, GitHub
Difficulty: Medium
Size: Large (300 hours)
Mentors: John Wu, Bin Dong, Suren Byna

Develop a test suite for the public API of FasTensor
Automate execution of the test suite
Document the continuous integration process

Polyphorm / PolyPhy

Mon, 07 Nov 2022 10:15:56 -0700

Polyphorm is an agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data. Rooted in astronomy and inspired by nature, we have used Polyphorm to reconstruct the Cosmic web structure, but also to discover network-like patterns in natural language data. You can find more details about our research here. Under the hood, Polyphorm uses a richer 3D scalar field representation of the reconstructed network, instead of a discrete representation like a graph or a mesh.

PolyPhy will be a Python-based redesigned version of Polyphorm, currently at the beginning of its development cycle. PolyPhy will be a multi-platform toolkit meant for a wide audience across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. All of the offered projects focus on PolyPhy, with a variety of topics including design, coding, and even research. Ultimately, PolyPhy will become a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.

Develop website for PolyPhy

Topics: Web Development, Dynamic Updates, UX
Skills: web development experience, good communicator, (HTML/CSS), (JavaScript)
Difficulty: Moderate
Size: Medium or large (175 or 350 hours)
Mentors: Oskar Elek

Develop a clean and welcoming website for the project. The organization needs to reflect the needs of PolyPhy users, but also provide a convenient entry point for interested project contributors. No excessive pop-ups or webjunk.

Specific tasks:

Work with mentors on understanding the context of the project.
Port the contents of the repository page to a dedicated website.
Design the structure of the website according to open-source best practices.
Work with the visual designer (see below) in creating a coherent and organic presentation.
Interactively link important metrics from the project dev environment as well as documentation.

Design visual experience for PolyPhy’s website and presentations

Topics: Design, Art, UX
Skills: vector and bitmap drawing, sense for spatial symmetry and framing, (interactive content creation), (animation)
Difficulty: Moderate
Size: Medium (175 hours)
Mentors: Oskar Elek

Develop visual content for the project using its main themes: nature-inspired computation, biomimetics, interconnected structures. Aid in designing visual structure of the website as well as other public-facing artifacts.

Specific tasks:

Work with mentors on understanding the context of the project.
Design imagery and other graphical elements to visually (re-)present PolyPhy.
Work with the technical writer (see below) in designing a coherent story.
Work with the web developer (see above) in creating a coherent and organic presentation.

Write PolyPhy’s technical story and content

Topics: Writing, Documentation, Storytelling
Skills: experience writing structured text over 10 pages, well read, (technical or scientific education)
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Oskar Elek

Integral to PolyPhy’s presentation is a story that the users and the project contributors can relate to. The objective is to develop the verbal part of that story, as well as major portions of technical documentation that match it. The difficulty of the project is scalable.

Specific tasks:

Work with mentors on understanding the context of the project.
Write different pages of the project website.
Work with mentors to improve the project’s written community practices (diversity, communication).
Write and edit narrative and explanatory parts of PolyPhy’s documentation.
Work with the visual designer (see above) in designing a coherent story.

Video tutorials and presentation for PolyPhy

Topics: Video, Presentation, Tutorials, Didactics
Skills: video editing, creating educational content, communication, (native or fluent in another language)
Difficulty: Easy-Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Oskar Elek, Drew Ehrlich

Create a public face for PolyPhy that reflects its history and context and teaches its functionality to users with different degrees of familiarity.

Specific tasks:

Work with mentors on understanding the context and history of the project.
Interview diverse project contributors.
Create a video documenting PolyPhy’s history, with roots in astronomy, complex systems, fractals.
Create a set of tutorial videos for starting and intermediate PolyPhy users.
Create an accessible template for future tutorials.

Implement heterogeneous data I/O ops

Topics: I/O Operations, File Conversion, Numerics, Testing
Skills: Python, experience working with scientific or statistical data, good debugging skills
Difficulty: Moderate-Challenging
Size: Medium or Large (175 or 350 hours)
Mentors: Oskar Elek, Anisha Goel

By default, PolyPhy operates with an unordered set of points as an input and scalar fields (float ndarrays) as an output, but other modalities are applicable as well. Design and implement interfaces to load and export different data formats (CSV, OBJ, HDF5, FITS…) and modalities (points, meshes, density fields). The difficulty of the project can be scaled based on the contributor’s interest.

Specific tasks:

Research which modalities are used by members of the target communities.
Implement modular loaders for the inputs and an interface to PolyPhy core.
Implement exporters for simulation datasets and visualization captures.
Write testing code for the above.
Integrate external packages as necessary.
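One possible shape for the modular loaders described above is a small registry keyed by file extension. The sketch below is an assumption about how such an interface could look, not PolyPhy's actual code: the names `register_loader`, `load_points`, and `load_csv_points` are hypothetical, and the CSV loader assumes a header-less file with one point per row.

```python
import csv
from pathlib import Path

# Maps a lowercase file suffix (e.g. ".csv") to a function that loads points.
LOADERS = {}


def register_loader(suffix):
    """Decorator that registers a loader function for one file suffix."""
    def decorator(func):
        LOADERS[suffix.lower()] = func
        return func
    return decorator


@register_loader(".csv")
def load_csv_points(path):
    """Read a header-less CSV file into a list of float tuples, one per row."""
    with open(path, newline="") as handle:
        return [tuple(float(v) for v in row) for row in csv.reader(handle) if row]


def load_points(path):
    """Dispatch to the loader registered for the file's suffix."""
    suffix = Path(path).suffix.lower()
    try:
        loader = LOADERS[suffix]
    except KeyError:
        raise ValueError(f"no loader registered for {suffix!r}") from None
    return loader(path)
```

With this layout, supporting another format such as OBJ or FITS would mean adding one decorated function, most likely backed by an external package, without touching the dispatch code.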

Set up CI/CD for PolyPhy

Topics: Continuous Integration, Continuous Deployment, DevOps
Skills: experience with CI/CD, GitHub, Python package deployment
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Oskar Elek, Anisha Goel

The objective is to set up a CI/CD pipeline that automates the building, testing, and deployment of the software. The resulting process needs to be robust to contributor errors and work in the distributed conditions of a diverse contributor base.

Specific tasks:

Automate continuous building, testing, merging and deployment for PolyPhy in GitHub.
Publish the CI/CD metrics and build assets to the project webpage.
Work with other contributors in educating them about the best practices of using the developed CI/CD pipeline.
Add support for automated packaging using common management systems (pip, Anaconda).

Refine PolyPhy’s UI and develop new functional elements

Topics: UI/UX, Visual Experience
Skills: Python programming, UI/UX development experience, (knowledge of graphics)
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Oskar Elek, David Abramov

The key feature of PolyPhy is its interactivity. By interacting with the underlying simulation model, the user can adjust its parameters in real time and respond to its behavior. For instance, an astrophysics expert can load a dataset of 100k galaxies and reconstruct the large-scale structure of the intergalactic medium. A responsive UI combined with real-time visualization allows them to judge the fidelity of the reconstruction and make necessary changes.

Specific tasks:

Implement a platform-agnostic UI to house PolyPhy’s main rendering context as well as secondary analytics.
Work with the visualization developer (see below) to integrate the rendering functionality.
Optimize the UI’s performance.
Test the implementation on different OS platforms.

Create new data visualization regimes

Topics: Interactive Visualization, Data Analytics, 3D Rendering
Skills: basic graphics theory and math, Python, GPU programming, (previous experience visualizing novel datasets)
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Oskar Elek, David Abramov

Data visualization is one of the core components of PolyPhy, as it provides a real-time overview of the underlying MCPM simulation. Through the feedback provided by the visualization, PolyPhy users can adjust the simulation model and make new findings about the dataset. Various operations over the reconstructed data (e.g. spatial searching) as well as important statistical summaries also benefit from clear visual presentation.

Specific tasks:

Develop novel ways of visualizing scientific data in PolyPhy.
Work with diverse data modalities - point clouds, graphs, scalar and vector fields.
Add support for visualizing metadata, such as annotations and labels.
Create UI elements for plotting statistical summaries computed in real-time.

Discrete graph extraction from simulated scalar fields

Topics: Graph Theory, Data Science
Skills: good understanding of discrete math and graph theory, Python, (GPU programming)
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Oskar Elek, Farhanul Hasan

Develop a custom method for graph extraction from scalar field data produced by PolyPhy. Because PolyPhy typically produces network-like structures, representing these structures as weighted discrete graphs is very useful for efficiently navigating the data. The most important property of this abstracted representation is that it preserves the topology of the base scalar field by navigating the 1D ridges of the scalar field.

Specific tasks:

Become familiar with different algorithms for graph growing and skeleton extraction.
Implement the most suitable method in PolyPhy, interpreting the source scalar field as a throughput (transport) network. The weights of the resulting graph need to reflect the source throughputs between the respective node locations.
Implement common graph operations, e.g. hierarchical clustering and reduction, shortest path between two nodes, range queries.
Optimize the runtime of the implemented methods.
Work with the visualization developer (see above) to visualize the resulting graphs.
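The tasks above can be illustrated with a toy version of the pipeline on a 2D field. This sketch is not the ridge-following method the project calls for: it simply keeps cells above a threshold, links 4-connected neighbors, and uses the reciprocal of the lower field value as the edge cost so that high-throughput links are cheaper to traverse. Those choices, along with the function names, are assumptions for illustration.

```python
import heapq


def field_to_graph(field, threshold):
    """Build a weighted graph from the cells of a 2D field at or above threshold.

    Nodes are (row, col) pairs. Edge cost is 1 / min(value_a, value_b), an
    assumed way of making high-throughput connections cheap.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    rows, cols = len(field), len(field[0])
    graph = {}
    for r in range(rows):
        for c in range(cols):
            if field[r][c] < threshold:
                continue
            graph.setdefault((r, c), {})
            # Looking only right and down visits every 4-connected pair once.
            for nr, nc in ((r, c + 1), (r + 1, c)):
                if nr < rows and nc < cols and field[nr][nc] >= threshold:
                    cost = 1.0 / min(field[r][c], field[nr][nc])
                    graph[(r, c)][(nr, nc)] = cost
                    graph.setdefault((nr, nc), {})[(r, c)] = cost
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the node list or None if unreachable."""
    if source not in graph or target not in graph:
        return None
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist[node]:
            continue  # stale heap entry
        for neighbor, cost in graph[node].items():
            candidate = d + cost
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```

A real implementation would likely work on 3D fields, extract a much sparser skeleton along the ridges, and derive weights from the simulated throughput rather than from raw cell values.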

DirtViz 2.0 (2023)

Mon, 07 Feb 2022 00:00:00 +0000

DirtViz is a project to visualize data collected from sensors deployed in sensor networks. We have deployed a number of sensors measuring quantities like soil moisture, temperature, current and voltage in outdoor settings. This project involves extending our existing visualization stack, DirtViz 1.0 (see GitHub), and expanding it to version 2.0. The project goal is to create a fully-fledged dataviz tool tailored to the types of data collected from embedded systems sensor networks.

Visualize Sensor Data

Topics: Data Visualization, Analytics
Skills: JavaScript, Python, Bash, web servers, Git, embedded systems
Difficulty: Easy/Moderate
Size: Large, 350 hours
Mentors: Colleen Josephson, Sonia Naderi, Stephen Taylor, John Madden

Specific tasks:

Refine our web-based visualization tools to easily allow users to zoom in on date ranges, change axes, etc.
Create a system for remote collaborators/citizen scientists to upload their own data in a secure manner
Craft an intuitive navigation system so that data from deployment sites around the world can be easily viewed
Document the tool thoroughly for future maintenance
If interested, we are also open to you investigating correlations between different data streams and doing self-directed data analysis
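For the optional analysis task, one simple starting point is the Pearson correlation between two sensor streams sampled at the same times. The sketch below uses only the Python standard library; the function name and the assumption that both streams are already aligned and equally long are illustrative, not part of DirtViz.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

Real deployments would first need to resample the streams onto a common time base and handle gaps, since field sensors rarely report in lockstep.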