hpc | UCSC OSPO

AIDRIN Privacy-Centric Enhancements: Backend & UX Upgrades

Fri, 25 Jul 2025 00:00:00 +0000

⏱️ Reading time: 5–6 minutes

Hey everyone,

If you’ve ever wondered what it takes to make AI data pipelines not just smarter, but safer and more transparent, you’re in the right place. The last few weeks working on AIDRIN for GSoC have been a deep dive into the engine room of privacy and backend systems that power the AIDRIN project. My focus has been on building out the core privacy infrastructure and backend features that power AIDRIN’s ability to give users real, actionable insights about their data. It’s been challenging, sometimes messy, but incredibly rewarding to see these changes make a tangible difference.

Having Dr. Jean Luca Bez and Prof. Suren Byna as mentors, along with the support of the entire team, has truly made all the difference. Their guidance, encouragement, and collaborative spirit have been a huge part of this journey, whether I’m brainstorming new ideas or just trying to untangle a tricky bug.

Privacy Metrics: Making Data Safer

A major part of my work has been putting data privacy at the front and center in AIDRIN. I focused on integrating essential privacy metrics like k-anonymity, l-diversity, t-closeness, and more, making sure they’re not just theoretical checkboxes, but real tools that users can interact with and understand. Now, these metrics are fully wired up in the backend and visualized in AIDRIN, so privacy risks are no longer just a vague concern. They are something AI data preparers can actually see and act on. Getting these metrics to work seamlessly with different datasets and ensuring their accuracy took some serious backend engineering, but the payoff has been worth it.

Speeding Things Up (So You Don’t Have To Wait Around)

As AIDRIN started handling bigger datasets, some of the calculations can be time-consuming because data has to be accessed every time a metric is computed. To address this, I added caching for previously computed metrics, like class imbalance and privacy checks, and set up asynchronous execution with Celery and Redis. This should make the app super responsive. Rather than waiting for heavy computations to finish, one can start taking notes about other metrics or explore different parts of the app while their results are loading in the background. It’s a small change, but it helps keep the workflow moving smoothly.

Small Touch Ups That (Hopefully) Make a Big Difference

I also spent time on the details that make the app easier to use. Tooltips now explain what the privacy metrics actually mean, error messages are clearer, and there’s a new cache info page where you can see and clear your cached data. The sensitive attribute dropdown is less confusing now, especially if you’re working with quasi-identifiers. These tweaks might seem minor, but they add up and make the app friendlier for everyone.

Docs, Docs, Docs

I’m a big believer that good documentation is just as important as good code. I updated the docs to cover all the new features, added citations for the privacy metrics, and made the install process a bit more straightforward. Hopefully, this means new users and contributors can get up to speed without too much hassle.

Huge Thanks to My Mentors and the Team

I really want to shine a light on Dr. Bez, Prof. Byna, and the entire AIDRIN team here. Their encouragement, practical advice, and collaborative spirit have been a huge part of my progress. Whether I’m stuck on a bug, brainstorming a new feature, or just need a second opinion, there’s always someone ready to help me think things through. Their experience and support have shaped not just the technical side of my work, but also how I approach problem-solving and teamwork.

What’s Next?

Looking ahead, I’m planning to expand AIDRIN’s support for multimodal datasets and keep refining the privacy and fairness modules. There’s always something new to learn or improve, and I’m excited to keep building. If you’re interested in data quality, privacy, or open-source AI tools, I’d love to connect and swap ideas.

Thanks for reading and for following along with my GSoC journey. I’ll be back soon with more updates!

This is the second post in my 3-part GSoC series with AIDRIN. Stay tuned for the final update.

Improving AI Data Pipelines in AIDRIN: A Privacy-Centric and Multimodal Expansion

Thu, 12 Jun 2025 00:00:00 +0000

⏱️ Reading time: 4–5 minutes

Hi 👋

I’m Harish Balaji, a Master’s student at NYU with a focus on Artificial Intelligence, Machine Learning, and Cybersecurity. I’m especially interested in building scalable systems that reflect responsible AI principles. For me, data quality isn’t just a technical detail. It’s a foundational aspect of building models that are reliable, fair, and reproducible in the real world.

This summer, I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of Google Summer of Code 2025. I’m grateful to be working under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL).

AIDRIN is an open-source framework that helps researchers and practitioners evaluate whether a dataset is truly ready to be used in production-level AI workflows. From fairness to privacy, it provides a structured lens through which we can understand the strengths and gaps in our data.

Why this work matters

In machine learning, one principle always holds true:

“Garbage in, garbage out.”

Even the most advanced models can underperform or amplify harmful biases if trained on incomplete, imbalanced, or poorly understood data. This is where AIDRIN steps in. It provides practical tools to assess datasets across key dimensions like privacy, fairness, class balance, interpretability, and support for multiple modalities.

By making these characteristics measurable and transparent, AIDRIN empowers teams to make informed decisions early in the pipeline. It helps ensure that datasets are not only large or complex, but also trustworthy, representative, and purpose-fit.

My focus this summer

As part of my GSoC 2025 project, I’ll be focusing on extending AIDRIN’s evaluation capabilities. A big part of this involves strengthening its support for privacy metrics and designing tools that can handle non-tabular datasets, such as image-based data.

The goal is to expand AIDRIN’s reach without compromising on interpretability or ease of use. More technical insights and updates will follow in the next posts as the summer progresses.

What comes next

As the AI community continues to evolve, there’s a growing shift toward data-centric practices. I believe frameworks like AIDRIN are essential for helping us move beyond the question of “Does the model work?” toward a deeper and more meaningful one: “Was the data ready in the first place?”

Over the next few weeks, I’ll be working on development, testing, and integration. I’m excited to contribute to a tool that emphasizes transparency and reproducibility across the AI lifecycle, and to share lessons and ideas with others who care about responsible AI.

If you’re exploring similar challenges or working in the space of dataset evaluation and readiness, I’d love to connect and exchange thoughts. You can also read my full GSoC 2025 proposal below for more context around the project scope and vision:

👉 Read my GSoC 2025 proposal here

This is the first in a 3-part blog series documenting my GSoC journey with AIDRIN. Stay tuned for technical updates and behind-the-scenes insights as the summer unfolds!

Final Report: Stream processing support for FasTensor

Fri, 30 Aug 2024 00:00:00 +0000

Final Report: Stream processing support for FasTensor

Project Description

FasTensor is a scientific computing library specialized in performing computations over dense matrices that exhibit spatial locality, a characteristic often found in physical phenomena data. Our GSoC'24 project aimed to enhance FasTensor by enabling it to ingest and process live data streams from sensors and scientific equipment.

What is FasTensor?

Imagine you’re working on a physical simulation or solving partial differential equations (PDEs). You’ve discretized your PDE, but now you face a new challenge: you need to run your computations fast and parallelize them across massive compute clusters.

At this point, you find yourself describing a stencil [1] operation. But should you really spend your time tinkering with loop orders, data layouts, and countless other side-quests unrelated to your core problem?

This is where FasTensor comes in: Describe your computation as a stencil, and it takes care of ensuring optimal execution. FasTensor lets you focus on the science, not the implementation details.

Repository Links

FasTensor: https://github.com/BinDong314/FasTensor
My fork: https://github.com/my-name/FasTensor/tree/ftstream

PR(s)

Work done this summer

Develop Streaming simulator: FTStream

I was first entasked by Dr. Bin to develop a stream simulator for testing the streaming capability of FasTensor. For testing purposes, a stream is characterized by file size, count, and arrival interval. FTStream can generate streams of various sizes and intervals, up to the theoretical limits of disk and filesystem. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe!

Writing this tool was an adventure in throughput testing and exploring APIs. I wrote multiple drivers, each for a different whim and hijinks of systems in the HPC world. Here’s a brief journey through the APIs we explored:

HDF5 APIs: Pretty fast in flush-to-disk operation, but the API design strongly binds to file handles, which inhibits high throughput duplication.
HDF5 VFL and VOL: We dabbled in these dark arts, but there be dragons! Keeping a long-term view of maintenance, we dropped the idea.
POSIX O_DIRECT: This involved getting your buffers aligned right and handling remainders correctly. A step up, but not quite at the theoretical limits.
Linux AIO: Streaming is latency sensitive domain, to reach the theoretical limits, every syscall saved matters. Linux AIO allowed us syscall batching with io_submit(). It took a few testing sessions to get the correct combo of queue depth, buffer size, and alignment right.

We settled on O_DIRECT + Linux AIO. Feel free to modify ftstream/fastflush.h to suit your needs.

Stream Support

FasTensor has just one simple paradigm: you give it a data source, an output data store, and your transform, and it handles all the behind-the-scenes grunt work of computing over big datasets so you can focus on your research.

We aimed to achieve the same for streaming: Drop in the STREAM keyword, append a pattern identifying your stream, and use your usual transform.

Voila! Now your previous FasTensor code supports live data streams.

Technical tidbits:

Implements a manager-worker pattern to allow us flexibility in the future to implement different stream semantics such as windowing, CPU-memory based load balancing
Supports streams of indefinite size

Challenges

HPC has its fair share of challenges. Things you take for granted might not be available there, and it takes a while to adjust to paradigms of scale and parallelization.

For example, when developing FTStream, we found O_DIRECT is available on some parallel file systems like GPFS but not supported on Lustre/CFS. We developed a separate MPIO driver for FTStream that will be upstreamed once thoroughly tested on Lustre.

Future Work

Implement windowing and explore more advanced stream semantics.
Implement support for for defining workload policies
Optimize interleaving IO and Compute.

References

[1] Anshu Dubey. 2014. Stencils in Scientific Computations. In Proceedings of the Second Workshop on Optimizing Stencil Computations (WOSC ‘14). Association for Computing Machinery, New York, NY, USA, 57. https://doi.org/10.1145/2686745.2686756

Acknowledgement

I struck gold when it comes to mentors.

Dr. Bin Dong was really kind and supportive throughout the journey. From the very first steps of giving a tour around the codebase to giving me a lot of freedom to experiment, refactor, and refine.

Dr. John Wu was encouraging and nurturing of budding talent. We had great research presentations every Monday apart from usual mentor interactions, where different research groups presented their talks and students were invited to present their progress.

I’ve come across Quantum computing many times in the news, but I never thought I’d get a frontline preview from the researchers working at the bleeding edge at the Lawrence Berkeley National Laboratory (LBL).

This GSoC experience, made possible by Google and UC OSPO, has been invaluable for my growth as a developer and researcher.

For people interested in HPC, ML, Systems, or Reproducibility, I encourage you all to apply to UC OSPO. It’s been an incredible journey, and I’m grateful for every moment of it!

Midway Through GSoC

Wed, 31 Jul 2024 00:00:00 +0000

Hello everyone! I’m Joel Tony, and I’m excited to share my progress update on the Drishti project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I’ve been diving deep into the world of I/O visualization for scientific applications, and I’m thrilled to tell you about the strides we’ve made.

What is Drishti?

For those unfamiliar with Drishti, it’s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using Darshan, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application’s I/O patterns, making it easier to identify bottlenecks and optimize performance.

Progress and Challenges

Export Directory Feature

One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn’t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.

CI Improvements and Cross-Project Dependencies

While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don’t adequately test the interactions between different branches of these interconnected tools. This is an area we’ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.

Refactoring for Multi-File Support

The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti’s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing on performance.

The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:

Better separation of computation and condition checking
Easier parallelization of processing multiple traces
Finer-grained profiling of performance bottlenecks
More flexibility in data manipulation and memory management

Learnings and Skills Gained

Through this process, I’ve gained valuable insights into:

Refactoring large codebases
Understanding and improving cross-project dependencies
Implementing data classes in Python for better code organization
Balancing performance with code readability and maintainability

Next Steps

As I move forward with the project, my focus will be on:

Adding unit tests for individual methods to ensure functionality
Exploring alternative data frame implementations like Polars for better performance
Developing aggregation methods for different types of data across multiple Darshan files
Optimizing memory usage and computational efficiency for large datasets

Conclusion

Working on Drishti has been an incredible learning experience. I’ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I’m excited about the potential impact of these improvements on the scientific community’s ability to optimize their applications’ I/O performance.

I’m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!

If you have any questions or would like to learn more about the project, feel free to reach out to me. Let’s keep pushing the boundaries of scientific computing together!

Streaming into the Future: Adding Real-Time Processing to FasTensor

Tue, 30 Jul 2024 00:00:00 +0000

Hey there, HPC enthusiasts and fellow coders! I’m excited to share my progress on this summer’s Google Summer of Code project under UC OSPO’s FasTensor. Here’s a glimpse into how we’re pushing the boundaries of real-time data processing.

The Big Picture: FasTensor and HPC Challenges

First, a quick refresher: FasTensor is our go-to tool for handling dense arrays in scientific computing. It tackles three major HPC challenges:

Optimizing computations
Distributing data efficiently
Balancing workloads across computing cores

FasTensor excels at these tasks, especially when dealing with data that has structural locality - a common feature in scientific computing. Here, the Stencil computations come in handy, capturing data locality for operations like solving partial differential equations in physical simulations.

The Mission: Bringing FasTensor into Real-Time

While FasTensor is great at processing existing data, the next frontier is handling live data streams from scientific instruments and sensors. That’s where my GSoC project comes in: adding stream processing capabilities to FasTensor.

Progress Highlights:

Building a Stream Simulator

We’ve created FTstream, a nifty tool that simulates data streams. It can generate streams of various sizes and intervals, pushing the limits of what your disk can handle. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe! This tool is crucial because many scientific instruments, from particle accelerators to radio telescopes, generate massive amounts of data at incredible speeds and we need to able to simulate that. For context, that’s faster than a 10MP RGB camera shooting at 35 frames per second that generates data at ~1 GiB/s.

Optimizing I/O Strategies

We’ve been experimenting with various I/O approaches to optimize high-speed data stream handling.

Exploring Streaming Semantics

We’re investigating various ways to express and execute stream transformations, to ensure that FasTensor can handle a wide range of streaming computations.

Developing I/O Drivers

We’ve developed two new I/O drivers based on LinuxAIO and MPI IO to ingest incoming data smoothly and maintain stream consistency.

What’s Next?

Putting It All Together

We’re in the final stretch of integrating all these components into a seamless stream processing system.

Rigorous Testing

We’ll push our stream processing to its limits, simulating diverse data flows to ensure rock-solid performance in any scientific setting.

HPC Environment Validation

The ultimate test will be running our new streaming capabilities in real HPC environments, checking how they perform with different I/O setups and computing paradigms.

Wrapping Up

This summer has been a whirlwind of coding, testing, and learning. We’re making significant strides in bringing real-time processing capabilities to FasTensor, which could open up exciting new possibilities in scientific computing and data analysis. Stay tuned for more updates as we finalize this feature. If you’re interested in the nitty-gritty technical details or want to check out the code, feel free to reach out or check our project repository. Happy coding, and may your computations be ever faster!

Halfway Through GSOC: Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Sat, 20 Jul 2024 00:00:00 +0000

Hello, I’m Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University. I’m currently working on the AIIO / Graph Neural Network project under the guidance of Bin Dong and Suren Byna. Our project focuses on enhancing the AIIO framework to automatically diagnose I/O performance bottlenecks in high-performance computing (HPC) systems using Graph Neural Networks (GNNs).

Project Overview

Our primary goal is to tackle the persistent issue of I/O bottlenecks in HPC applications. Identifying these bottlenecks manually is often labor-intensive and prone to errors. By integrating GNNs into the AIIO framework, we aim to create an automated solution that can diagnose these bottlenecks with high accuracy, ultimately improving the efficiency and reliability of HPC systems.

Progress and Challenges

Over the past few weeks, my work has been centered on developing a robust data pre-processing pipeline. This pipeline is crucial for converting raw I/O log data into a graph format suitable for GNN analysis. The data pre-processing involves extracting relevant features from Darshan I/O logs, which include job-related information and performance metrics. One of the main challenges has been dealing with the heterogeneity and sparsity of the data, which can affect the accuracy of our models. To address this, we’ve focused on using correlation analysis to identify and select the most relevant features, ensuring that the dataset is well-structured and informative for GNN processing.

We’ve also started constructing the GNN model. The model is designed to capture the complex relationships between different I/O operations and their impact on system performance. This involves defining nodes and edges in the graph that represent job IDs, counter types, and their values. We explored different graph structures, including those that focus on counter types and those that incorporate more detailed information. While more detailed graphs offer better accuracy, they also require more computational resources.

Current Achievements

Data Pre-processing Pipeline: We have successfully developed and tested the pipeline to transform Darshan I/O logs into graph-structured data. This was a significant milestone, as it sets the foundation for all subsequent GNN modeling efforts.
GNN Model Construction: The initial version of our GNN model has been implemented. This model is now capable of learning from the graph data and making predictions about I/O performance bottlenecks.
Correlation Analysis for Graph Structure Design: We have used correlation analysis on the dataset to understand the relationships between I/O counters. This analysis has been instrumental in designing a more effective graph structure, helping to better capture the dependencies and interactions critical for accurate performance diagnosis.

Training for Different Graph Structures: We are currently training our model using various graph structures to determine the most effective configuration for accurate I/O performance diagnosis. This ongoing process aims to refine our approach and improve the model’s predictive accuracy.

Next Steps

Looking ahead, we plan to focus on several key areas:

Refinement and Testing: We’ll continue refining the GNN model, focusing on improving its accuracy and efficiency. This includes experimenting with different graph structures and training techniques.
SHAP Analysis: To enhance the interpretability of our model, we’ll incorporate SHAP (SHapley Additive exPlanations) values. This will help us understand the contribution of each feature to the model’s predictions, making it easier to identify critical factors in I/O performance.
Documentation and Community Engagement: As we make progress, we’ll document our methods and findings, sharing them with the broader community. This includes contributing to open-source repositories and engaging with other researchers in the field.

This journey has been both challenging and rewarding, and I am grateful for the support and guidance from my mentors and the community. I look forward to sharing more updates as we continue to advance this exciting project.

Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Fri, 14 Jun 2024 00:00:00 +0000

Hello, I am Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University, specializing in Artificial Intelligence. This summer, I will be working on the project AIIO / Graph Neural Network under the mentorship of Bin Dong and Suren Byna.

High-Performance Computing (HPC) applications often face performance issues due to I/O bottlenecks. Manually identifying these bottlenecks is time-consuming and error-prone. My project aims to enhance the AIIO framework by integrating a Graph Neural Network (GNN) model to automatically diagnose I/O performance bottlenecks at the job level. This involves developing a comprehensive data pre-processing pipeline, constructing and validating a tailored GNN model, and rigorously testing the model’s accuracy using test cases from the AIIO dataset.

Through this project, I seek to provide a sophisticated, AI-driven approach to understanding and improving I/O performance in HPC systems, ultimately contributing to more efficient and reliable HPC applications.

Stream Processing support for FasTensor

Thu, 13 Jun 2024 00:00:00 +0000

Hi, I’m Aditya Narayan,👋

I’m a frequent visitor to the town square of theoretical CS, operations (Ops), and robust high-performance systems. Sometimes I indulge myself with insights on Computing and Biology, and other times I enjoy the accounts of minefield experiences in the systems world. Luckily, this summer, OSRE offered an opportunity that happened to be at the perfect intersection of my interests.

This summer, I will be working on a scientific computing library called FasTensor that offers a parallel computing structure called Stencil, widely popular in the scientific computing world to solve PDEs for Physical Simulations and Convolutions on Signals, among its many uses. I am excited to introduce my mentors, Dr. Bin Dong and Dr. John Wu of the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL). They bring invaluable expertise to the project.

They recognized the need for a tensor processing library that provided dedicated support for big datasets with inherent structural locality, often found in the scientific computing world, which was lacking in popular open-source MapReduce or Key-Value based frameworks.

More often than not, the operations performed on these datasets are composed of computations involving neighboring elements. This motivated the development of the FasTensor library.

I will be working on providing a Stream Processing interface that enables online data processing of large-scale datasets as they arrive from Data Producers. The project focuses on offering rich interfaces for managing and composing streams, supporting common scientific data formats like HDF5, and integrating fault tolerance and reliability mechanisms.

I am thrilled to work on the FasTensor project because I believe it has the potential to make a significant impact by enabling researchers to implement a rich set of computations on their big datasets in an easy and intuitive manner.

After all, FasTensor has just one simple paradigm: A -> Transform(F(x), B),

and it handles all the behind-the-scenes grunt work of handling big datasets so you can focus on your research.

Stay tuned for updates and feel free to collaborate!

Drishti

Thu, 06 Jun 2024 00:00:00 +0000

Namaste everyone! 🙏🏻

I’m Joel Tony, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I’m truly honored to be part of this year’s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I’m particularly grateful to be working under the mentorship of Dr. Jean Luca Bez, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. Suren Byna, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.

My project, “Drishti: Visualization and Analysis of AI-based Applications”, aims to extend the Drishti framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer’s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.

Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I’m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.

Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I’ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computing often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn’t just mean more computation; it means handling vastly more data. This shift makes I/O optimisation not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.

As outlined in my proposal, my tasks are threefold:

Modularize Drishti’s codebase: Currently, it’s a single 1700-line file that handles multiple functionalities. I’ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.
Enable multi-trace handling: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I’ll build a layer to aggregate these, providing a comprehensive view of the application’s I/O behavior.
Craft AI/ML-specific recommendations: Current suggestions often involve MPI-IO or HDF5, which aren’t typical in ML frameworks like PyTorch or TensorFlow. I’ll create targeted recommendations that align with these frameworks’ data pipelines.

This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it’s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.

From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.

In today’s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we’re not just tweaking code. We’re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.

I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I’m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.

ReproNB: Reproducibility of Interactive Notebook Systems

Mon, 26 Feb 2024 00:00:00 +0000

Project Idea Description

Topics: HPC, MPI, distributed systems
Skills: C++, Python
Difficulty: Difficult
Size: Large; 350 hours
Mentors: Tanu Malik

Notebooks have gained wide popularity in scientific computing. A notebook is both a web-based interactive front- end to program workflows and a lightweight container for sharing code and its output. Reproducing notebooks in different target environments, however, is a challenge. Notebooks do not share the computational environment in which they are executed. Consequently, despite being shareable they are often not reproducible. We have developed FLINC (see also eScience'22 paper) to address this problem. However, it currently does not support all forms of experiments, especially those relating to HPC experiments. In this project we will extend FLINC to HPC experiments. This will involve using recording and replaying mechanisms such as ReMPI and rr within FLINC.

Project Deliverable

The project deliverable will be a set of HPC experiments that are packaged with FLINC and available on Chamaeleon.

ScaleBugs: Reproducible Scalability Bugs

Tue, 07 Feb 2023 00:00:00 +0000

Scalable systems lay essential foundations of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. As its name suggests, scalability bugs may be presented depending on the scale of a run, and thus, symptoms may only be observable in large-scale deployments, but not in small or median deployments. For example, Cassandra-6127 is a scalability bug detected in the popular distributed database Cassandra. The scalability bug causes unnecessary CPU usage, however, the symptom is not observed unless ~1000 nodes are deployed. This demonstrates the main challenge of studying scalability bugs: it is extremely challenging to reproduce without deploying the system at a large scale.

In this project, our goal is to build a dataset of reproducible scalability bugs. To achieve this, we will go through the existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. For each bug report, we determine if the reported bug depends on the scale of the run, such as the number of nodes utilized. With the collected scale-dependent bugs, we then will craft the workload to reproduce those scalability bugs. Our workloads will be designed to trigger some functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance. For example, a successful reproduction should be able to show the performance drop along with an increasing number of nodes.

Building a Dataset of Reproducible Scalability Bugs

Topics: Scalability systems, bug patterns, reproducibility, bug dataset
Skills: Linux Shell, Docker, Java, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Cindy Rubio González, Haryadi S. Gunawi, Hao-Nan Zhu
Contributor(s): Goodness Ayinmode, Zahra Nabila Maharani

The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalability system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that could demonstrate the symptoms of the bug under different scales.

Specific Tasks

Work with the mentors to understand the context of the project.
Learn the background of scalability systems.
Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.
Craft shell scripts to trigger the exact scalability bug described by the bug report.
Organize the reproducible scalability bugs and write documentation to build the code and trigger the bug.