LBNL | UCSC OSPO

AI Data Readiness Inspector (AIDRIN)

Fri, 30 Jan 2026 10:15:00 -0700

Garbage In, Garbage Out (GIGO) is a widely accepted quote in computer science across various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.

AIDRIN (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of data readiness for AI processes, covering a broad range of dimensions from the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, to evaluate data. Furthermore, AIDRIN uses metrics specific to assessing AI data, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating data readiness.
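To make the traditional data quality metrics concrete, here is a minimal sketch of completeness, duplicates, and outliers using only the Python standard library. It is not AIDRIN's implementation or API: the function names, the sample rows, and the choice of Tukey's 1.5 × IQR rule for outliers are all assumptions made for illustration.

```python
from statistics import quantiles


def completeness(rows, column):
    """Fraction of rows that have a non-missing value in `column`."""
    if not rows:
        return 0.0
    present = sum(1 for row in rows if row.get(column) is not None)
    return present / len(rows)


def duplicate_fraction(rows):
    """Fraction of rows that exactly repeat an earlier row."""
    if not rows:
        return 0.0
    seen = set()
    repeats = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(rows)


def iqr_outliers(values):
    """Values more than 1.5 * IQR beyond the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]


# Hypothetical sample data, for illustration only.
rows = [
    {"age": 34, "city": "Oakland"},
    {"age": None, "city": "Berkeley"},
    {"age": 34, "city": "Oakland"},
    {"age": 51, "city": "Davis"},
]

print(completeness(rows, "age"))               # 0.75
print(duplicate_fraction(rows))                # 0.25
print(iqr_outliers([10, 11, 12, 13, 14, 100])) # [100]
```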

AIDRIN Multiple File Formats

The proposed work will include improvements in the AIDRIN framework to (1) add support for new file formats such as Zarr, ROOT, and HDF5, and (2) allow users to provide custom data ingestion mechanisms.

Topics: data readiness, AI, data analysis
Skills: Python, C/C++, data analysis, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Drishti

Fri, 30 Jan 2026 10:15:00 -0700

Drishti is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component that lets end-users visually inspect their applications’ I/O behavior, focus on areas of interest, and get a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, our framework maps numerous common, well-known bottlenecks to solution recommendations that users can implement.

Drishti Comparisons and Heatmaps

The proposed work will include investigating and building a solution to allow comparing and finding differences between two I/O trace files (similar to a diff), covering the analysis and visualization components. It will also explore additional metrics and counters such as Darshan heatmaps in the analysis and visualization components of the framework.
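As a rough illustration of what a diff between two traces could look like at the counter level, the sketch below compares two dictionaries of counter values and reports what changed. This is an assumption about one possible approach, not Drishti's design: the function name and the sample values are hypothetical, and the counter names are used only as familiar-looking examples.

```python
def diff_counters(before, after):
    """Return {counter: (old, new)} for every counter whose value differs.

    A counter missing from one side is reported with None on that side.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old = before.get(name)
        new = after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical counter values from two runs of the same application.
run_a = {"POSIX_READS": 1200, "POSIX_WRITES": 300, "POSIX_OPENS": 16}
run_b = {"POSIX_READS": 450, "POSIX_WRITES": 300, "POSIX_SEEKS": 8}

for name, (old, new) in diff_counters(run_a, run_b).items():
    print(f"{name}: {old} -> {new}")
```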

Topics: I/O, HPC, data analysis, visualization, profiling, tracing
Skills: Python, data analysis, performance profiling
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

AIDRIN Privacy-Centric Enhancements: Backend & UX Upgrades

Fri, 25 Jul 2025 00:00:00 +0000

⏱️ Reading time: 5–6 minutes

Hey everyone,

If you’ve ever wondered what it takes to make AI data pipelines not just smarter, but safer and more transparent, you’re in the right place. The last few weeks working on AIDRIN for GSoC have been a deep dive into the engine room of the project: its privacy and backend systems. My focus has been on building out the core privacy infrastructure and backend features that let AIDRIN give users real, actionable insights about their data. It’s been challenging, sometimes messy, but incredibly rewarding to see these changes make a tangible difference.

Having Dr. Jean Luca Bez and Prof. Suren Byna as mentors, along with the support of the entire team, has truly made all the difference. Their guidance, encouragement, and collaborative spirit have been a huge part of this journey, whether I’m brainstorming new ideas or just trying to untangle a tricky bug.

Privacy Metrics: Making Data Safer

A major part of my work has been putting data privacy at the front and center in AIDRIN. I focused on integrating essential privacy metrics like k-anonymity, l-diversity, t-closeness, and more, making sure they’re not just theoretical checkboxes, but real tools that users can interact with and understand. Now, these metrics are fully wired up in the backend and visualized in AIDRIN, so privacy risks are no longer just a vague concern. They are something AI data preparers can actually see and act on. Getting these metrics to work seamlessly with different datasets and ensuring their accuracy took some serious backend engineering, but the payoff has been worth it.
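For readers new to these metrics, here is a minimal sketch of two of them, using only the standard library. It follows the textbook definitions (k-anonymity as the size of the smallest group of records sharing the same quasi-identifier values, and distinct l-diversity as the smallest number of distinct sensitive values within any such group). It is not AIDRIN's code, and the function names and sample records are made up for illustration.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Bucket records by their combination of quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in quasi_identifiers)
        groups[key].append(record)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return min(len(members) for members in groups.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any equivalence class."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return min(
        len({member[sensitive] for member in members})
        for members in groups.values()
    )


# Hypothetical, already-generalized records.
records = [
    {"zip": "947**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "947**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "946**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "946**", "age": "40-49", "diagnosis": "flu"},
]

print(k_anonymity(records, ["zip", "age"]))               # 2
print(l_diversity(records, ["zip", "age"], "diagnosis"))  # 1
```

In this toy dataset every record shares its quasi-identifiers with one other record (k = 2), but the second group contains a single diagnosis, so its members' sensitive value can still be inferred (l = 1).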

Speeding Things Up (So You Don’t Have To Wait Around)

As AIDRIN started handling bigger datasets, some of the calculations became time-consuming because the data had to be accessed every time a metric was computed. To address this, I added caching for previously computed metrics, like class imbalance and privacy checks, and set up asynchronous execution with Celery and Redis. This should make the app much more responsive. Rather than waiting for heavy computations to finish, you can take notes on other metrics or explore different parts of the app while the results load in the background. It’s a small change, but it helps keep the workflow moving smoothly.
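The general pattern can be sketched with the standard library alone. To be clear, this is not AIDRIN's code: `functools.lru_cache` stands in for the metric cache, `ThreadPoolExecutor` stands in for the Celery and Redis setup, and the `imbalance_ratio` function (majority class count divided by minority class count) is a hypothetical metric chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


@lru_cache(maxsize=128)
def imbalance_ratio(labels):
    """Majority-to-minority class ratio; `labels` must be a hashable tuple."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


labels = ("spam", "ham", "ham", "ham")

with ThreadPoolExecutor(max_workers=2) as pool:
    # Submit the metric and keep going; the caller is not blocked here.
    pending = pool.submit(imbalance_ratio, labels)
    # ... other work could happen while the metric is computed ...
    result = pending.result()

# A second request for the same input is answered from the cache.
cached = imbalance_ratio(labels)
print(result, cached, imbalance_ratio.cache_info().hits)
```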

Small Touch Ups That (Hopefully) Make a Big Difference

I also spent time on the details that make the app easier to use. Tooltips now explain what the privacy metrics actually mean, error messages are clearer, and there’s a new cache info page where you can see and clear your cached data. The sensitive attribute dropdown is less confusing now, especially if you’re working with quasi-identifiers. These tweaks might seem minor, but they add up and make the app friendlier for everyone.

Docs, Docs, Docs

I’m a big believer that good documentation is just as important as good code. I updated the docs to cover all the new features, added citations for the privacy metrics, and made the install process a bit more straightforward. Hopefully, this means new users and contributors can get up to speed without too much hassle.

Huge Thanks to My Mentors and the Team

I really want to shine a light on Dr. Bez, Prof. Byna, and the entire AIDRIN team here. Their encouragement, practical advice, and collaborative spirit have been a huge part of my progress. Whether I’m stuck on a bug, brainstorming a new feature, or just need a second opinion, there’s always someone ready to help me think things through. Their experience and support have shaped not just the technical side of my work, but also how I approach problem-solving and teamwork.

What’s Next?

Looking ahead, I’m planning to expand AIDRIN’s support for multimodal datasets and keep refining the privacy and fairness modules. There’s always something new to learn or improve, and I’m excited to keep building. If you’re interested in data quality, privacy, or open-source AI tools, I’d love to connect and swap ideas.

Thanks for reading and for following along with my GSoC journey. I’ll be back soon with more updates!

This is the second post in my 3-part GSoC series with AIDRIN. Stay tuned for the final update.

Improving AI Data Pipelines in AIDRIN: A Privacy-Centric and Multimodal Expansion

Thu, 12 Jun 2025 00:00:00 +0000

⏱️ Reading time: 4–5 minutes

Hi 👋

I’m Harish Balaji, a Master’s student at NYU with a focus on Artificial Intelligence, Machine Learning, and Cybersecurity. I’m especially interested in building scalable systems that reflect responsible AI principles. For me, data quality isn’t just a technical detail. It’s a foundational aspect of building models that are reliable, fair, and reproducible in the real world.

This summer, I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of Google Summer of Code 2025. I’m grateful to be working under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL).

AIDRIN is an open-source framework that helps researchers and practitioners evaluate whether a dataset is truly ready to be used in production-level AI workflows. From fairness to privacy, it provides a structured lens through which we can understand the strengths and gaps in our data.

Why this work matters

In machine learning, one principle always holds true:

“Garbage in, garbage out.”

Even the most advanced models can underperform or amplify harmful biases if trained on incomplete, imbalanced, or poorly understood data. This is where AIDRIN steps in. It provides practical tools to assess datasets across key dimensions like privacy, fairness, class balance, interpretability, and support for multiple modalities.

By making these characteristics measurable and transparent, AIDRIN empowers teams to make informed decisions early in the pipeline. It helps ensure that datasets are not only large or complex, but also trustworthy, representative, and purpose-fit.

My focus this summer

As part of my GSoC 2025 project, I’ll be focusing on extending AIDRIN’s evaluation capabilities. A big part of this involves strengthening its support for privacy metrics and designing tools that can handle non-tabular datasets, such as image-based data.

The goal is to expand AIDRIN’s reach without compromising on interpretability or ease of use. More technical insights and updates will follow in the next posts as the summer progresses.

What comes next

As the AI community continues to evolve, there’s a growing shift toward data-centric practices. I believe frameworks like AIDRIN are essential for helping us move beyond the question of “Does the model work?” toward a deeper and more meaningful one: “Was the data ready in the first place?”

Over the next few weeks, I’ll be working on development, testing, and integration. I’m excited to contribute to a tool that emphasizes transparency and reproducibility across the AI lifecycle, and to share lessons and ideas with others who care about responsible AI.

If you’re exploring similar challenges or working in the space of dataset evaluation and readiness, I’d love to connect and exchange thoughts. You can also read my full GSoC 2025 proposal below for more context around the project scope and vision:

👉 Read my GSoC 2025 proposal here

This is the first in a 3-part blog series documenting my GSoC journey with AIDRIN. Stay tuned for technical updates and behind-the-scenes insights as the summer unfolds!

AI Data Readiness Inspector (AIDRIN)

Tue, 11 Feb 2025 10:15:00 -0700

Garbage In, Garbage Out (GIGO) is a widely accepted adage among computer scientists in various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.

AIDRIN (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of the readiness of data for AI processes, covering a broad range of readiness dimensions available in the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, for data evaluation. Furthermore, AIDRIN uses metrics specific to assessing data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data.

AIDRIN Visualizations and Science Gateway

The proposed work will include improvements in the AIDRIN framework to (1) enhance, extend, and optimize the visualizations of metrics related to all six pillars of AI data readiness and (2) set up a science gateway on NERSC or AWS cloud service.

Topics: data readiness, AI
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

h5bench with AI workloads

Tue, 11 Feb 2025 10:15:00 -0700

h5bench is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead and observed I/O rate.

Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library’s heavy usage in various scientific applications running on supercomputing systems. The tests in the h5bench suite cover I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and to evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.

h5bench with AI workloads

The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.

Topics: I/O, HPC, benchmarking
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Final Report: Stream processing support for FasTensor

Fri, 30 Aug 2024 00:00:00 +0000

Project Description

FasTensor is a scientific computing library specialized in performing computations over dense matrices that exhibit spatial locality, a characteristic often found in physical phenomena data. Our GSoC'24 project aimed to enhance FasTensor by enabling it to ingest and process live data streams from sensors and scientific equipment.

What is FasTensor?

Imagine you’re working on a physical simulation or solving partial differential equations (PDEs). You’ve discretized your PDE, but now you face a new challenge: you need to run your computations fast and parallelize them across massive compute clusters.

At this point, you find yourself describing a stencil [1] operation. But should you really spend your time tinkering with loop orders, data layouts, and countless other side-quests unrelated to your core problem?

This is where FasTensor comes in: Describe your computation as a stencil, and it takes care of ensuring optimal execution. FasTensor lets you focus on the science, not the implementation details.

Repository Links

FasTensor: https://github.com/BinDong314/FasTensor
My fork: https://github.com/my-name/FasTensor/tree/ftstream

PR(s)

Work done this summer

Develop Streaming simulator: FTStream

Dr. Bin Dong first tasked me with developing a stream simulator to test the streaming capability of FasTensor. For testing purposes, a stream is characterized by file size, count, and arrival interval. FTStream can generate streams of various sizes and intervals, up to the theoretical limits of disk and filesystem. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe!
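To show what those three parameters mean in practice, here is a small Python sketch of a stream emitter. It is only an illustration of the concept, not FTStream itself (which is part of the FasTensor codebase and tuned for throughput); the `StreamSpec` and `emit_stream` names and the file naming scheme are assumptions.

```python
import os
import tempfile
import time
from dataclasses import dataclass


@dataclass
class StreamSpec:
    file_size: int   # bytes per file
    count: int       # number of files in the stream
    interval: float  # seconds between arrivals


def emit_stream(spec, directory):
    """Write `count` files of `file_size` bytes, one every `interval` seconds."""
    payload = bytes(spec.file_size)  # zero-filled buffer, reused for every file
    paths = []
    for index in range(spec.count):
        path = os.path.join(directory, f"chunk_{index:05d}.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        paths.append(path)
        if index + 1 < spec.count:
            time.sleep(spec.interval)
    return paths


with tempfile.TemporaryDirectory() as directory:
    spec = StreamSpec(file_size=1024, count=3, interval=0.01)
    written = emit_stream(spec, directory)
    sizes = [os.path.getsize(path) for path in written]

print(sizes)  # [1024, 1024, 1024]
```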

Writing this tool was an adventure in throughput testing and exploring APIs. I wrote multiple drivers, each catering to a different quirk of systems in the HPC world. Here’s a brief journey through the APIs we explored:

HDF5 APIs: Pretty fast in flush-to-disk operation, but the API design strongly binds to file handles, which inhibits high throughput duplication.
HDF5 VFL and VOL: We dabbled in these dark arts, but there be dragons! Keeping a long-term view of maintenance, we dropped the idea.
POSIX O_DIRECT: This involved getting your buffers aligned right and handling remainders correctly. A step up, but not quite at the theoretical limits.
Linux AIO: Streaming is a latency-sensitive domain, and to reach the theoretical limits, every syscall saved matters. Linux AIO allowed us to batch syscalls with io_submit(). It took a few testing sessions to get the combination of queue depth, buffer size, and alignment right.

We settled on O_DIRECT + Linux AIO. Feel free to modify ftstream/fastflush.h to suit your needs.
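The "aligned buffers and remainders" part of O_DIRECT comes down to simple arithmetic, sketched below in Python. This is not code from ftstream/fastflush.h: the 4096-byte alignment is an assumed value (the real requirement depends on the device and filesystem), and the helper names are hypothetical.

```python
ALIGNMENT = 4096  # assumed block size; query the device for the real value


def split_aligned(length, alignment=ALIGNMENT):
    """Split a payload into an aligned bulk part and a leftover tail."""
    bulk = (length // alignment) * alignment
    return bulk, length - bulk


def padded_length(length, alignment=ALIGNMENT):
    """Smallest multiple of `alignment` that can hold `length` bytes."""
    return -(-length // alignment) * alignment


bulk, tail = split_aligned(10_000)
print(bulk, tail)             # 8192 1808
print(padded_length(10_000))  # 12288
```

The bulk part can be written directly with aligned I/O, while the tail has to be handled separately, for example by padding it to a full block or writing it through a regular buffered path.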

Stream Support

FasTensor has just one simple paradigm: you give it a data source, an output data store, and your transform, and it handles all the behind-the-scenes grunt work of computing over big datasets so you can focus on your research.

We aimed to achieve the same for streaming: Drop in the STREAM keyword, append a pattern identifying your stream, and use your usual transform.

Voila! Now your previous FasTensor code supports live data streams.

Technical tidbits:

Implements a manager-worker pattern, which gives us the flexibility to implement different stream semantics in the future, such as windowing and CPU-memory based load balancing
Supports streams of indefinite size
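For readers unfamiliar with the manager-worker pattern, here is a generic Python sketch of it. It only illustrates the idea and says nothing about how FasTensor implements it in practice; the `run_manager` function and its parameters are hypothetical. The manager pulls items from an iterator of any length and hands them to workers through a bounded queue, so the whole stream never needs to be in memory at once.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker to exit


def run_manager(stream, transform, workers=2):
    """Feed items from `stream` to worker threads and collect their results."""
    tasks = queue.Queue(maxsize=workers * 2)  # bounded: applies backpressure
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is _STOP:
                break
            output = transform(item)
            with lock:
                results.append(output)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for item in stream:        # works for generators of unknown length
        tasks.put(item)
    for _ in threads:          # one stop signal per worker
        tasks.put(_STOP)
    for thread in threads:
        thread.join()
    return results


squares = run_manager((n for n in range(1, 6)), lambda n: n * n)
print(sorted(squares))  # [1, 4, 9, 16, 25]
```

Results are sorted before printing because workers may finish in any order.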

Challenges

HPC has its fair share of challenges. Things you take for granted might not be available there, and it takes a while to adjust to paradigms of scale and parallelization.

For example, when developing FTStream, we found O_DIRECT is available on some parallel file systems like GPFS but not supported on Lustre/CFS. We developed a separate MPIO driver for FTStream that will be upstreamed once thoroughly tested on Lustre.

Future Work

Implement windowing and explore more advanced stream semantics.
Implement support for defining workload policies.
Optimize interleaving of I/O and compute.

References

[1] Anshu Dubey. 2014. Stencils in Scientific Computations. In Proceedings of the Second Workshop on Optimizing Stencil Computations (WOSC ‘14). Association for Computing Machinery, New York, NY, USA, 57. https://doi.org/10.1145/2686745.2686756

Acknowledgement

I struck gold when it comes to mentors.

Dr. Bin Dong was really kind and supportive throughout the journey, from the very first steps of giving me a tour of the codebase to giving me a lot of freedom to experiment, refactor, and refine.

Dr. John Wu was encouraging and nurturing of budding talent. Apart from the usual mentor interactions, we had great research presentations every Monday, where different research groups presented their talks and students were invited to present their progress.

I’ve come across quantum computing many times in the news, but I never thought I’d get a frontline preview from the researchers working at the bleeding edge at the Lawrence Berkeley National Laboratory (LBNL).

This GSoC experience, made possible by Google and UC OSPO, has been invaluable for my growth as a developer and researcher.

For people interested in HPC, ML, Systems, or Reproducibility, I encourage you all to apply to UC OSPO. It’s been an incredible journey, and I’m grateful for every moment of it!

Midway Through GSoC

Wed, 31 Jul 2024 00:00:00 +0000

Hello everyone! I’m Joel Tony, and I’m excited to share my progress update on the Drishti project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I’ve been diving deep into the world of I/O visualization for scientific applications, and I’m thrilled to tell you about the strides we’ve made.

What is Drishti?

For those unfamiliar with Drishti, it’s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using Darshan, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application’s I/O patterns, making it easier to identify bottlenecks and optimize performance.

Progress and Challenges

Export Directory Feature

One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn’t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.

CI Improvements and Cross-Project Dependencies

While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don’t adequately test the interactions between different branches of these interconnected tools. This is an area we’ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.

Refactoring for Multi-File Support

The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti’s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing performance.

The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:

Better separation of computation and condition checking
Easier parallelization of processing multiple traces
Finer-grained profiling of performance bottlenecks
More flexibility in data manipulation and memory management
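To give a flavor of this kind of structure, here is a simplified sketch, not the actual Drishti code. The class name, field names, and the 10% threshold are all hypothetical. The point is that the derived values live on the data class, the condition check is a separate function, and per-file summaries can be aggregated before any check runs.

```python
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Counters extracted from one trace file (field names are illustrative)."""
    total_reads: int = 0
    small_reads: int = 0
    total_writes: int = 0

    @property
    def small_read_ratio(self):
        """Computation: share of read operations that were small."""
        if self.total_reads == 0:
            return 0.0
        return self.small_reads / self.total_reads


def aggregate(summaries):
    """Combine per-file summaries by summing their counters."""
    combined = TraceSummary()
    for summary in summaries:
        combined.total_reads += summary.total_reads
        combined.small_reads += summary.small_reads
        combined.total_writes += summary.total_writes
    return combined


def has_small_read_issue(summary, threshold=0.1):
    """Condition check, kept separate from the computation above."""
    return summary.small_read_ratio > threshold


traces = [TraceSummary(1000, 50, 200), TraceSummary(500, 400, 100)]
combined = aggregate(traces)
print(combined.small_read_ratio)        # 0.3
print(has_small_read_issue(traces[0]))  # False
print(has_small_read_issue(combined))   # True
```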

Learnings and Skills Gained

Through this process, I’ve gained valuable insights into:

Refactoring large codebases
Understanding and improving cross-project dependencies
Implementing data classes in Python for better code organization
Balancing performance with code readability and maintainability

Next Steps

As I move forward with the project, my focus will be on:

Adding unit tests for individual methods to ensure functionality
Exploring alternative data frame implementations like Polars for better performance
Developing aggregation methods for different types of data across multiple Darshan files
Optimizing memory usage and computational efficiency for large datasets

Conclusion

Working on Drishti has been an incredible learning experience. I’ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I’m excited about the potential impact of these improvements on the scientific community’s ability to optimize their applications’ I/O performance.

I’m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!

If you have any questions or would like to learn more about the project, feel free to reach out to me. Let’s keep pushing the boundaries of scientific computing together!

Streaming into the Future: Adding Real-Time Processing to FasTensor

Tue, 30 Jul 2024 00:00:00 +0000

Hey there, HPC enthusiasts and fellow coders! I’m excited to share my progress on this summer’s Google Summer of Code project under UC OSPO’s FasTensor. Here’s a glimpse into how we’re pushing the boundaries of real-time data processing.

The Big Picture: FasTensor and HPC Challenges

First, a quick refresher: FasTensor is our go-to tool for handling dense arrays in scientific computing. It tackles three major HPC challenges:

Optimizing computations
Distributing data efficiently
Balancing workloads across computing cores

FasTensor excels at these tasks, especially when dealing with data that has structural locality - a common feature in scientific computing. Here, the Stencil computations come in handy, capturing data locality for operations like solving partial differential equations in physical simulations.

The Mission: Bringing FasTensor into Real-Time

While FasTensor is great at processing existing data, the next frontier is handling live data streams from scientific instruments and sensors. That’s where my GSoC project comes in: adding stream processing capabilities to FasTensor.

Progress Highlights:

Building a Stream Simulator

We’ve created FTStream, a nifty tool that simulates data streams. It can generate streams of various sizes and intervals, pushing the limits of what your disk can handle. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe! This tool is crucial because many scientific instruments, from particle accelerators to radio telescopes, generate massive amounts of data at incredible speeds, and we need to be able to simulate that. For context, a 10MP RGB camera shooting at 35 frames per second generates data at ~1 GiB/s, so FTStream is well over twice as fast.
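The camera figure is easy to check with a few lines of Python. The calculation assumes uncompressed frames with 8 bits per color channel (3 bytes per pixel), which is an assumption on top of the numbers quoted above.

```python
pixels_per_frame = 10_000_000  # 10 megapixels
bytes_per_pixel = 3            # assumed: 8 bits each for R, G, and B
frames_per_second = 35

bytes_per_second = pixels_per_frame * bytes_per_pixel * frames_per_second
gib_per_second = bytes_per_second / 2**30

print(bytes_per_second)          # 1050000000
print(round(gib_per_second, 2))  # 0.98
```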

Optimizing I/O Strategies

We’ve been experimenting with various I/O approaches to optimize high-speed data stream handling.

Exploring Streaming Semantics

We’re investigating various ways to express and execute stream transformations, to ensure that FasTensor can handle a wide range of streaming computations.

Developing I/O Drivers

We’ve developed two new I/O drivers, based on Linux AIO and MPI-IO, to ingest incoming data smoothly and maintain stream consistency.

What’s Next?

Putting It All Together

We’re in the final stretch of integrating all these components into a seamless stream processing system.

Rigorous Testing

We’ll push our stream processing to its limits, simulating diverse data flows to ensure rock-solid performance in any scientific setting.

HPC Environment Validation

The ultimate test will be running our new streaming capabilities in real HPC environments, checking how they perform with different I/O setups and computing paradigms.

Wrapping Up

This summer has been a whirlwind of coding, testing, and learning. We’re making significant strides in bringing real-time processing capabilities to FasTensor, which could open up exciting new possibilities in scientific computing and data analysis. Stay tuned for more updates as we finalize this feature. If you’re interested in the nitty-gritty technical details or want to check out the code, feel free to reach out or check our project repository. Happy coding, and may your computations be ever faster!

Stream Processing support for FasTensor

Thu, 13 Jun 2024 00:00:00 +0000

Hi, I’m Aditya Narayan 👋

I’m a frequent visitor to the town square of theoretical CS, operations (Ops), and robust high-performance systems. Sometimes I indulge myself with insights on Computing and Biology, and other times I enjoy the accounts of minefield experiences in the systems world. Luckily, this summer, OSRE offered an opportunity that happened to be at the perfect intersection of my interests.

This summer, I will be working on a scientific computing library called FasTensor that offers a parallel computing structure called Stencil, widely popular in the scientific computing world to solve PDEs for Physical Simulations and Convolutions on Signals, among its many uses. I am excited to introduce my mentors, Dr. Bin Dong and Dr. John Wu of the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL). They bring invaluable expertise to the project.

They recognized the need for a tensor processing library that provided dedicated support for big datasets with inherent structural locality, often found in the scientific computing world, which was lacking in popular open-source MapReduce or Key-Value based frameworks.

More often than not, the operations performed on these datasets are composed of computations involving neighboring elements. This motivated the development of the FasTensor library.

I will be working on providing a Stream Processing interface that enables online data processing of large-scale datasets as they arrive from Data Producers. The project focuses on offering rich interfaces for managing and composing streams, supporting common scientific data formats like HDF5, and integrating fault tolerance and reliability mechanisms.

I am thrilled to work on the FasTensor project because I believe it has the potential to make a significant impact by enabling researchers to implement a rich set of computations on their big datasets in an easy and intuitive manner.

After all, FasTensor has just one simple paradigm: A -> Transform(F(x), B),

and it handles all the behind-the-scenes grunt work of handling big datasets so you can focus on your research.
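To make the paradigm a little more tangible, here is a toy Python sketch of a stencil-style transform over a 1D array. FasTensor itself is a C++ library with its own API, so nothing below is FasTensor code: the `transform` helper, the clamped boundary handling, and the three-point mean are assumptions chosen to show the idea that the user function F only sees a cell and its neighbors.

```python
def transform(source, stencil_fn):
    """Apply `stencil_fn` at every cell of `source` and return the output array.

    `stencil_fn` receives an accessor: cell(offset) returns the value at a
    relative offset from the current cell, clamped at the array boundaries.
    """
    size = len(source)

    def accessor(center):
        def cell(offset):
            index = min(max(center + offset, 0), size - 1)
            return source[index]
        return cell

    return [stencil_fn(accessor(i)) for i in range(size)]


def three_point_mean(cell):
    """F(x): average of a cell and its two immediate neighbors."""
    return (cell(-1) + cell(0) + cell(1)) / 3


A = [3.0, 6.0, 9.0, 12.0]
B = transform(A, three_point_mean)
print(B)  # [4.0, 6.0, 9.0, 11.0]
```

Because each output cell depends only on nearby input cells, a library is free to split the array into chunks and process them in parallel, which is the kind of grunt work described above.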

Stay tuned for updates and feel free to collaborate!

Drishti

Thu, 06 Jun 2024 00:00:00 +0000

Namaste everyone! 🙏🏻

I’m Joel Tony, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I’m truly honored to be part of this year’s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I’m particularly grateful to be working under the mentorship of Dr. Jean Luca Bez, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. Suren Byna, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.

My project, “Drishti: Visualization and Analysis of AI-based Applications”, aims to extend the Drishti framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer’s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.

Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I’m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.

Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I’ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computing often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn’t just mean more computation; it means handling vastly more data. This shift makes I/O optimisation not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.

As outlined in my proposal, my tasks are threefold:

Modularize Drishti’s codebase: Currently, it’s a single 1700-line file that handles multiple functionalities. I’ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.
Enable multi-trace handling: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I’ll build a layer to aggregate these, providing a comprehensive view of the application’s I/O behavior.
Craft AI/ML-specific recommendations: Current suggestions often involve MPI-IO or HDF5, which aren’t typical in ML frameworks like PyTorch or TensorFlow. I’ll create targeted recommendations that align with these frameworks’ data pipelines.

This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it’s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.

From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.

In today’s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we’re not just tweaking code. We’re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.

I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I’m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.

Drishti

Tue, 30 Jan 2024 10:15:00 -0700

Drishti / Server-side Visualization Service

The proposed work will include investigating and building server-side solutions to support the visualization of larger I/O traces and logs, while integrating with the existing analysis, reports, and recommendations.

Topics: I/O, HPC, visualization, performance analysis
Skills: Python, HTML/CSS, JavaScript
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Drishti / Visualization and Analysis of AI-based Applications

The proposed work will extend Drishti to handle metrics from non-MPI applications, specifically AI/ML codes and applications. This work entails adapting the existing framework, heuristics, and recommendations to support metrics collected from AI/ML workloads.

Topics: I/O, HPC, AI, visualization, performance analysis
Skills: Python, AI, performance profiling
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

h5bench

Tue, 30 Jan 2024 10:15:00 -0700

h5bench / Reporting and Enhancing

The proposed work will include standardizing and enhancing the reports generated by the suite, and integrating additional I/O kernels (e.g., HACC-IO).

Topics: I/O, HPC, benchmarking
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

h5bench / Compression

The proposed work will focus on incorporating compression capabilities into the h5bench core access patterns through HDF5 filters.

Topics: I/O, HPC, benchmarking, compression
Skills: C/C++, Python, HDF5
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna