GSoC'24 | UCSC OSPO

[Final Report] Automated Reproducibility Checklist support within StatWrap

Sat, 02 Nov 2024 00:00:00 +0000

Namaste🙏🏻! I’m Adi Akhilesh Singh, and I’m excited to share my final updates on the Reproducibility Checklists project by StatWrap, under the mentorship of Luke Rasmussen.

Project Overview

This project introduces customizable reproducibility checklists in StatWrap, enabling metadata-driven and user-guided generation of checklists. The goal is to enhance the reproducibility of research projects by providing researchers with a structured and comprehensive checklist to ensure their work is reproducible.

Project Links

Explore the StatWrap project repository and my contributions during GSoC ‘24:

Progress And Achievements

During the timeline of this project, I worked on designing the interface for the checklist page and the data structure to support the project needs.

The interface was designed with user needs in mind, featuring components such as:

URLs component to manage external links or file URIs attached to the project.
Images component to display project image files.
Checklist Notes component to manage user-added notes.

All these assets (Files, URLs, Images) can be added to each checklist statement using the existing assets and external resources (URLs) present in the project.

Additionally, for each checklist item, StatWrap runs relevant scans to provide meaningful data based on its requirements. For example, for the item, “All the software dependencies for the project are documented,” StatWrap scans project files to list the languages and dependencies detected. For each checklist statement supported in StatWrap, we implement methods to retrieve specific information by scanning project data. StatWrap currently supports six such checklist statements identified as foundational for ensuring research reproducibility. Additionally, the checklist can be exported as a PDF summary, generated by StatWrap using the checklist data, with options to include notes.
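To make the dependency scan concrete, here is a minimal sketch of the idea in Python. It is not StatWrap's actual implementation: the function name scan_project, the extension table, and the regular expressions are all assumptions chosen for illustration, and they only cover simple top-level imports in Python and R files.

```python
import re
from pathlib import PurePosixPath

# Hypothetical extension-to-language table; StatWrap's real mapping may differ.
LANGUAGES = {".py": "Python", ".r": "R"}

# Illustrative patterns only: they cover simple top-level imports.
PATTERNS = {
    "Python": re.compile(r"^\s*(?:import|from)\s+([A-Za-z_]\w*)", re.MULTILINE),
    "R": re.compile(r"^\s*(?:library|require)\(\s*[\"']?([\w.]+)", re.MULTILINE),
}


def scan_project(files):
    """Map each detected language to the sorted dependencies found in its files.

    `files` maps a relative path to that file's text content.
    """
    found = {}
    for path, text in files.items():
        language = LANGUAGES.get(PurePosixPath(path).suffix.lower())
        if language is None:
            continue  # not a source file this sketch knows how to scan
        deps = found.setdefault(language, set())
        deps.update(PATTERNS[language].findall(text))
    return {language: sorted(deps) for language, deps in found.items()}


project = {
    "analysis/model.py": "import numpy as np\nfrom pandas import DataFrame\n",
    "plots/figures.R": "library(ggplot2)\nrequire('dplyr')\n",
    "README.md": "# Notes\n",
}
print(scan_project(project))
```

The result of such a scan is what a researcher would review next to the checklist statement before marking it as satisfied.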

Future Prospects

As the project concludes, several areas for growth have emerged:

Expanding language support within StatWrap. While StatWrap already includes key languages used in research, there is always scope to extend compatibility to cover even more technologies.
Options to export a data-extensive report that includes checklist items and their associated scan results.

These and other enhancements, like adding new checklist statements with their scanning methods, will extend StatWrap’s impact on reproducibility in research.

Earlier Blogs

If you’re interested in seeing the project’s evolution, check out my earlier posts:

Thank you for reading!

Final Report: Stream processing support for FasTensor

Fri, 30 Aug 2024 00:00:00 +0000

Project Description

FasTensor is a scientific computing library specialized in performing computations over dense matrices that exhibit spatial locality, a characteristic often found in physical phenomena data. Our GSoC'24 project aimed to enhance FasTensor by enabling it to ingest and process live data streams from sensors and scientific equipment.

What is FasTensor?

Imagine you’re working on a physical simulation or solving partial differential equations (PDEs). You’ve discretized your PDE, but now you face a new challenge: you need to run your computations fast and parallelize them across massive compute clusters.

At this point, you find yourself describing a stencil [1] operation. But should you really spend your time tinkering with loop orders, data layouts, and countless other side-quests unrelated to your core problem?

This is where FasTensor comes in: Describe your computation as a stencil, and it takes care of ensuring optimal execution. FasTensor lets you focus on the science, not the implementation details.

Repository Links

FasTensor: https://github.com/BinDong314/FasTensor
My fork: https://github.com/my-name/FasTensor/tree/ftstream

PR(s)

Work done this summer

Develop Streaming simulator: FTStream

I was first tasked by Dr. Bin with developing a stream simulator for testing the streaming capability of FasTensor. For testing purposes, a stream is characterized by file size, count, and arrival interval. FTStream can generate streams of various sizes and intervals, up to the theoretical limits of disk and filesystem. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe!
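The three stream parameters can be illustrated with a small Python sketch. This is not FTStream itself (which uses O_DIRECT and Linux AIO): the function simulate_stream and its arguments are hypothetical names, and plain buffered writes followed by fsync stand in for the real I/O drivers.

```python
import os
import tempfile
import time


def simulate_stream(directory, file_size, count, interval, prefix="chunk"):
    """Write `count` files of `file_size` bytes, one every `interval` seconds."""
    payload = os.urandom(file_size)  # reuse one buffer so generation is not the bottleneck
    paths = []
    for index in range(count):
        started = time.monotonic()
        path = os.path.join(directory, f"{prefix}_{index:06d}.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to disk
        paths.append(path)
        # Sleep only for whatever is left of the interval after the write.
        remaining = interval - (time.monotonic() - started)
        if remaining > 0 and index < count - 1:
            time.sleep(remaining)
    return paths


with tempfile.TemporaryDirectory() as stream_dir:
    written = simulate_stream(stream_dir, file_size=4096, count=3, interval=0.01)
    print([os.path.basename(p) for p in written])
```

When a single write takes longer than the requested interval, the sketch simply writes as fast as it can, which is the regime where disk and filesystem limits become visible.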

Writing this tool was an adventure in throughput testing and exploring APIs. I wrote multiple drivers, each catering to a different quirk of systems in the HPC world. Here’s a brief journey through the APIs we explored:

HDF5 APIs: Pretty fast in flush-to-disk operations, but the API design binds strongly to file handles, which inhibits high-throughput duplication.
HDF5 VFL and VOL: We dabbled in these dark arts, but there be dragons! Keeping a long-term view of maintenance, we dropped the idea.
POSIX O_DIRECT: This involved getting your buffers aligned right and handling remainders correctly. A step up, but not quite at the theoretical limits.
Linux AIO: Streaming is a latency-sensitive domain; to reach the theoretical limits, every syscall saved matters. Linux AIO allowed us to batch syscalls with io_submit(). It took a few testing sessions to get the combination of queue depth, buffer size, and alignment right.

We settled on O_DIRECT + Linux AIO. Feel free to modify ftstream/fastflush.h to suit your needs.
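The alignment and remainder bookkeeping that O_DIRECT requires can be sketched as plain arithmetic. The helper names below are hypothetical, and the 4096-byte alignment is an assumption: the real requirement depends on the device and filesystem, and this is not the code in ftstream/fastflush.h.

```python
def split_for_direct_io(length, alignment=4096):
    """Split a write of `length` bytes into an aligned body and a leftover tail."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    aligned = length & ~(alignment - 1)  # largest multiple of alignment <= length
    return aligned, length - aligned


def padded_length(length, alignment=4096):
    """Round `length` up to the next multiple of `alignment`."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (length + alignment - 1) & ~(alignment - 1)


body, tail = split_for_direct_io(10_000)
print(body, tail)              # aligned part and remainder of a 10,000-byte write
print(padded_length(10_000))   # buffer size needed if the tail is padded instead
```

A driver can either write the aligned body directly and handle the tail through a buffered path, or pad the final block and truncate the file afterwards; which of these FTStream does is not stated in this post.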

Stream Support

FasTensor has just one simple paradigm: you give it a data source, an output data store, and your transform, and it handles all the behind-the-scenes grunt work of computing over big datasets so you can focus on your research.

We aimed to achieve the same for streaming: Drop in the STREAM keyword, append a pattern identifying your stream, and use your usual transform.

Voila! Now your previous FasTensor code supports live data streams.

Technical tidbits:

Implements a manager-worker pattern, which gives us the flexibility to implement different stream semantics in the future, such as windowing and CPU- and memory-based load balancing
Supports streams of indefinite size
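For readers unfamiliar with the manager-worker pattern, here is a minimal Python sketch using threads and a queue. FasTensor's implementation is not shown in this post; run_manager_worker and its parameters are hypothetical names, and the threads merely stand in for whatever processes or ranks do the work in practice.

```python
import queue
import threading


def run_manager_worker(chunks, transform, workers=2):
    """Manager hands stream chunks to workers; results come back in arrival order."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:  # sentinel: the manager says the stream has ended
                return
            index, chunk = item
            value = transform(chunk)
            with lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    # Manager loop: `chunks` may be any iterator, so its length need not be known up front.
    total = 0
    for index, chunk in enumerate(chunks):
        tasks.put((index, chunk))
        total += 1
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return [results[i] for i in range(total)]


stream = ([i, i + 1, i + 2] for i in range(5))
print(run_manager_worker(stream, sum))
```

Because the manager is the only place that decides which chunk goes where, policies such as windowing or load balancing can be added there without touching the workers. Note that this sketch collects all results in memory, so a truly unbounded stream would need to write results out as they arrive.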

Challenges

HPC has its fair share of challenges. Things you take for granted might not be available there, and it takes a while to adjust to paradigms of scale and parallelization.

For example, when developing FTStream, we found O_DIRECT is available on some parallel file systems like GPFS but not supported on Lustre/CFS. We developed a separate MPIO driver for FTStream that will be upstreamed once thoroughly tested on Lustre.

Future Work

Implement windowing and explore more advanced stream semantics.
Implement support for defining workload policies.
Optimize interleaving IO and Compute.

References

[1] Anshu Dubey. 2014. Stencils in Scientific Computations. In Proceedings of the Second Workshop on Optimizing Stencil Computations (WOSC ‘14). Association for Computing Machinery, New York, NY, USA, 57. https://doi.org/10.1145/2686745.2686756

Acknowledgement

I struck gold when it comes to mentors.

Dr. Bin Dong was really kind and supportive throughout the journey, from the very first steps of giving me a tour around the codebase to giving me a lot of freedom to experiment, refactor, and refine.

Dr. John Wu was encouraging and nurturing of budding talent. We had great research presentations every Monday apart from usual mentor interactions, where different research groups presented their talks and students were invited to present their progress.

I’ve come across quantum computing many times in the news, but I never thought I’d get a frontline preview from the researchers working at the bleeding edge at the Lawrence Berkeley National Laboratory (LBL).

This GSoC experience, made possible by Google and UC OSPO, has been invaluable for my growth as a developer and researcher.

For people interested in HPC, ML, Systems, or Reproducibility, I encourage you all to apply to UC OSPO. It’s been an incredible journey, and I’m grateful for every moment of it!

Midterm Report: Halfway through medicinal data visualization using PolyPhy/Polyglot

Mon, 12 Aug 2024 00:00:00 +0000

Introduction

Hello! My name is Ayush Sharma, a machine learning engineer and researcher based out of Chandigarh, a beautiful city in Northern India known for its modern architecture and green spaces. For the last month and a half I have been working closely with my mentors Oskar Elek and Kiran Deol on the project titled Unveiling Medicine Patterns: 3D Clustering with Polyphy/Polyglot as part of GSoC 2024.

Progress and Challenges

The project focuses on developing effective clustering algorithms to visualize medicine data in three dimensions using PolyPhy and Polyglot. My journey began with data preprocessing and cleaning, where unnecessary data points were removed, and missing values were addressed.

One of the primary techniques we’ve employed is UMAP (Uniform Manifold Approximation and Projection). UMAP’s ability to preserve the global structure of the data while providing meaningful clusters proved advantageous. Initial experiments with UMAP on datasets of various sizes (ranging from 1,500 to 15,000 medicines) provided valuable insights into the clustering patterns. By iteratively halving the dimensions and refining the parameters, we achieved more accurate clustering results.

To complement UMAP, we explored t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE’s focus on local relationships helped in understanding finer details within the clusters. By adjusting t-SNE parameters and conducting perturbations, we could better comprehend the data’s behavior. Combining UMAP with t-SNE in a loop, halving dimensions iteratively, showed promise, allowing us to leverage the strengths of both techniques to enhance clustering accuracy.
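The "halving dimensions iteratively" idea can be sketched as a schedule of target dimensions. This post does not say how each reduction step was run or what the starting dimension was, so the function below is only an assumption about the schedule; halving_schedule and its parameters are hypothetical names, and the UMAP or t-SNE call at each step is deliberately left out.

```python
def halving_schedule(start_dim, target_dim=3):
    """List the dimensions visited when halving from start_dim down to target_dim."""
    if target_dim < 1:
        raise ValueError("target_dim must be at least 1")
    dims = []
    current = start_dim
    while current > target_dim:
        current = max(target_dim, current // 2)  # never go below the target
        dims.append(current)
    return dims


# Each entry would be the output dimension of one UMAP or t-SNE pass.
print(halving_schedule(64))
print(halving_schedule(100))
```

Stopping at three dimensions matches the goal of visualizing the result in 3D; any other target works the same way.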

We also experimented with pre-trained models like BERT and GloVe to create embeddings for the medicines. BERT’s splitting of salts into subparts and GloVe’s limitations in recognizing specific salts led to inaccurate clustering, which we are currently working to improve.

Next Steps

Moving forward, I will focus on refining our clustering and embedding techniques to enhance overall accuracy. This involves integrating Jaccard distance alongside other distance measures to improve similarity assessments between medicines and clusters. Additionally, I’ll continue experimenting with advanced models like GPT, CLIP, and Gemini for better embeddings while addressing the limitations of BERT and GloVe by leveraging custom embeddings created with transformers and one-hot encoding. Optimization of UMAP and t-SNE algorithms will also be crucial, ensuring their effectiveness in clustering and visualization. These steps aim to overcome current challenges and further advance the project’s goals.
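As a small illustration of the Jaccard distance mentioned above, the sketch below compares two medicines by their sets of salts. The salt names are made-up example data, and treating each medicine as a plain set of salt names is an assumption about the representation, not the project's actual data model.

```python
def jaccard_distance(salts_a, salts_b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of salt names."""
    a, b = set(salts_a), set(salts_b)
    union = a | b
    if not union:
        return 0.0  # two empty compositions are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical compositions, for illustration only.
medicine_x = {"paracetamol", "caffeine"}
medicine_y = {"paracetamol", "ibuprofen"}
print(round(jaccard_distance(medicine_x, medicine_y), 3))
```

A distance of 0 means identical salt sets and 1 means no salts in common, which makes the measure easy to combine with other distances on a common scale.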

Midway Through GSoC

Wed, 31 Jul 2024 00:00:00 +0000

Hello everyone! I’m Joel Tony, and I’m excited to share my progress update on the Drishti project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I’ve been diving deep into the world of I/O visualization for scientific applications, and I’m thrilled to tell you about the strides we’ve made.

What is Drishti?

For those unfamiliar with Drishti, it’s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using Darshan, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application’s I/O patterns, making it easier to identify bottlenecks and optimize performance.

Progress and Challenges

Export Directory Feature

One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn’t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.

CI Improvements and Cross-Project Dependencies

While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don’t adequately test the interactions between different branches of these interconnected tools. This is an area we’ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.

Refactoring for Multi-File Support

The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti’s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing performance.

The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:

Better separation of computation and condition checking
Easier parallelization of processing multiple traces
Finer-grained profiling of performance bottlenecks
More flexibility in data manipulation and memory management
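To show what separating computation from condition checking can look like with data classes, here is a minimal sketch. The class names, fields, and the 10% threshold are all hypothetical and are not taken from Drishti's codebase.

```python
from dataclasses import dataclass, field


@dataclass
class TraceStats:
    """Hypothetical per-trace summary; field names are illustrative only."""
    path: str
    total_reads: int = 0
    small_reads: int = 0

    @property
    def small_read_ratio(self) -> float:
        # Computation lives on the data class...
        return self.small_reads / self.total_reads if self.total_reads else 0.0


@dataclass
class Report:
    traces: list = field(default_factory=list)

    def combined(self) -> TraceStats:
        """Aggregate counters across all traces into a single summary."""
        return TraceStats(
            path="<aggregate>",
            total_reads=sum(t.total_reads for t in self.traces),
            small_reads=sum(t.small_reads for t in self.traces),
        )


def has_small_read_problem(stats: TraceStats, threshold: float = 0.1) -> bool:
    # ...while condition checks stay in small, separately testable functions.
    return stats.small_read_ratio > threshold


report = Report([TraceStats("a.darshan", 100, 5), TraceStats("b.darshan", 100, 35)])
print(has_small_read_problem(report.combined()))
```

Because each trace is an independent object, the per-trace work can be profiled or parallelized on its own, and the aggregation step is the only place that needs to know about multiple files.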

Learnings and Skills Gained

Through this process, I’ve gained valuable insights into:

Refactoring large codebases
Understanding and improving cross-project dependencies
Implementing data classes in Python for better code organization
Balancing performance with code readability and maintainability

Next Steps

As I move forward with the project, my focus will be on:

Adding unit tests for individual methods to ensure functionality
Exploring alternative data frame implementations like Polars for better performance
Developing aggregation methods for different types of data across multiple Darshan files
Optimizing memory usage and computational efficiency for large datasets

Conclusion

Working on Drishti has been an incredible learning experience. I’ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I’m excited about the potential impact of these improvements on the scientific community’s ability to optimize their applications’ I/O performance.

I’m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!

If you have any questions or would like to learn more about the project, feel free to reach out to me. Let’s keep pushing the boundaries of scientific computing together!

Streaming into the Future: Adding Real-Time Processing to FasTensor

Tue, 30 Jul 2024 00:00:00 +0000

Hey there, HPC enthusiasts and fellow coders! I’m excited to share my progress on this summer’s Google Summer of Code project under UC OSPO’s FasTensor. Here’s a glimpse into how we’re pushing the boundaries of real-time data processing.

The Big Picture: FasTensor and HPC Challenges

First, a quick refresher: FasTensor is our go-to tool for handling dense arrays in scientific computing. It tackles three major HPC challenges:

Optimizing computations
Distributing data efficiently
Balancing workloads across computing cores

FasTensor excels at these tasks, especially when dealing with data that has structural locality, a common feature in scientific computing. Here, stencil computations come in handy, capturing data locality for operations like solving partial differential equations in physical simulations.

The Mission: Bringing FasTensor into Real-Time

While FasTensor is great at processing existing data, the next frontier is handling live data streams from scientific instruments and sensors. That’s where my GSoC project comes in: adding stream processing capabilities to FasTensor.

Progress Highlights:

Building a Stream Simulator

We’ve created FTStream, a nifty tool that simulates data streams. It can generate streams of various sizes and intervals, pushing the limits of what your disk can handle. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe! This tool is crucial because many scientific instruments, from particle accelerators to radio telescopes, generate massive amounts of data at incredible speeds and we need to be able to simulate that. For context, that’s faster than a 10 MP RGB camera shooting at 35 frames per second, which generates data at ~1 GiB/s.
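The camera figure can be checked with a few lines of arithmetic. The sketch assumes uncompressed frames with 3 bytes per pixel (8 bits per RGB channel), which the post does not state explicitly.

```python
PIXELS_PER_FRAME = 10_000_000   # 10 MP sensor
BYTES_PER_PIXEL = 3             # assumption: 8 bits each for R, G, and B
FRAMES_PER_SECOND = 35

bytes_per_second = PIXELS_PER_FRAME * BYTES_PER_PIXEL * FRAMES_PER_SECOND
gib_per_second = bytes_per_second / 2**30

print(f"{bytes_per_second:,} bytes/s is about {gib_per_second:.2f} GiB/s")
```

Under these assumptions the camera produces just under 1 GiB/s, so a 2.5 GiB/s simulated stream is roughly two and a half times that rate.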

Optimizing I/O Strategies

We’ve been experimenting with various I/O approaches to optimize high-speed data stream handling.

Exploring Streaming Semantics

We’re investigating various ways to express and execute stream transformations, to ensure that FasTensor can handle a wide range of streaming computations.

Developing I/O Drivers

We’ve developed two new I/O drivers based on Linux AIO and MPI-IO to ingest incoming data smoothly and maintain stream consistency.

What’s Next?

Putting It All Together

We’re in the final stretch of integrating all these components into a seamless stream processing system.

Rigorous Testing

We’ll push our stream processing to its limits, simulating diverse data flows to ensure rock-solid performance in any scientific setting.

HPC Environment Validation

The ultimate test will be running our new streaming capabilities in real HPC environments, checking how they perform with different I/O setups and computing paradigms.

Wrapping Up

This summer has been a whirlwind of coding, testing, and learning. We’re making significant strides in bringing real-time processing capabilities to FasTensor, which could open up exciting new possibilities in scientific computing and data analysis. Stay tuned for more updates as we finalize this feature. If you’re interested in the nitty-gritty technical details or want to check out the code, feel free to reach out or check our project repository. Happy coding, and may your computations be ever faster!

Enhancing h5bench with HDF5 Compression Capability

Sat, 27 Jul 2024 00:00:00 +0000

Introduction

My project, Enhancing h5bench with HDF5 Compression Capability, is part of h5bench and is carried out under the mentorship of Dr. Jean Luca Bez and Dr. Suren Byna. It aims to allow users of h5bench to incorporate compression features in their simulations by creating custom benchmarks with common scientific lossless and lossy compression algorithms such as SZ, SZ3, ZFP, and GZIP.

The problem I am trying to solve is to implement multiple data compression algorithms in h5bench core access patterns through HDF5 filters. This capability should grant users the flexibility to configure the parameters and methods of compression applied to their datasets according to their specific needs and preferences. My solution primarily involves using a user-defined HDF5 filter mechanism to implement lossless and lossy compression algorithms, such as ZFP, SZ, and cuSZ. Throughout the process, I will deliver one C source file implementing compression configuration settings, one C source file implementing lossless and lossy algorithms, a set of performance reports before and after data compression in CSV and standard output files, and technical documentation on the h5bench user manual website.

Midterm Blog

This summer, after completing my junior year, I was honored to have the opportunity to work with Dr. Jean Luca Bez and Dr. Suren Byna on h5bench, an open-source benchmarking project designed to simulate running sync/async HDF5 I/O on HPC machines. This post mostly covers what I have learned, produced, and planned over the first six weeks, along with my thoughts.

First of all, let’s define some of the terms here. HDF5 stands for Hierarchical Data Format 5. Unlike other data storage formats (JSON, CSV, XML…), HDF5 is not only a container that manages data similar to a file system, but also a powerful library that gives you the ability to perform I/O (input/output) operations between memory and file. One of the reasons this tool is commonly used by HPC applications is that it also supports MPI I/O, which is an interface for parallel I/O (you can think of it as the parallel version of POSIX I/O). With exabytes of data and high frequencies of usage for analysis in scientific studies, HDF5 is perfect for the job. Essentially, h5bench is software that tests the hardware’s performance through HDF5 (it also provides other benchmark kernels such as AMReX, E3SM-IO, MACSio, and openPMD-api, but my job focuses on using vanilla HDF5 I/O).

So, what have I done so far? First, my job is to allow users to tune input parameters regarding data compression, and to make sure h5bench prints accurate benchmark results with the intended compression algorithm applied to their datasets. h5bench’s frontend is written in Python; it takes a JSON file from the user as input and parses it into a CFG configuration file that is later read by the backend, which is written in C. I created a new enum type and enabled users to specify one of a range of compression algorithms (SZ3, ZFP, LZ4, GZIP, and other pre-defined algorithms). I also made it possible to apply these algorithms to the datasets, so the .h5 file (an HDF5 file) contains chunks of compressed data after multiple H5Dwrite calls.
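The JSON-to-CFG step can be pictured with the following Python sketch. It is not h5bench's actual frontend code: the key names, the KEY=VALUE output layout, and the Compression enum are assumptions made for illustration, with only the algorithm names taken from this post.

```python
import json
from enum import Enum


class Compression(Enum):
    # Algorithm names come from the post; the enum itself is hypothetical.
    NONE = "NONE"
    GZIP = "GZIP"
    LZ4 = "LZ4"
    SZ3 = "SZ3"
    ZFP = "ZFP"


def json_to_cfg(json_text):
    """Turn a flat JSON object of settings into KEY=VALUE lines for a backend."""
    settings = json.loads(json_text)
    # Reject unknown algorithms early, before the backend ever sees them.
    algorithm = Compression(str(settings.get("compression", "NONE")).upper())
    settings["compression"] = algorithm.value
    return "".join(f"{key.upper()}={value}\n" for key, value in settings.items())


print(json_to_cfg('{"dim_1": 1024, "compression": "sz3"}'), end="")
```

Validating the algorithm name in the frontend means that a typo in the user's JSON fails with a clear error instead of surfacing later in the C backend.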

Next, the challenges and gains. Throughout the first six weeks, 30% of the time was spent on understanding the newest version of h5bench and HDF5 by reading through C source code and documentation, and asking my mentors many dumb questions (thanks to them for their patience and great answers :D). Writing code is fairly easy once I really understand what the program is doing. By that I mean you have to understand every line in almost all functions and how every variable changes. 40% of the time was spent on debugging and testing the compression algorithms, mainly SZ3. Making the code behave correctly is another level of difficulty. Most of the issues resulted from failing to configure the application and dependent libraries correctly. Without the necessary macros enabled during the build process, features like the compression filter plugin will not run. As I was also new to CMake and the HPC environment, I learned that environment variables are reset for every new session, even if you have requested a compute node resource. Besides getting used to the standard build sequence (“cmake ..”, “make”, “make install”), I also learned to use “ccmake ..” to examine the flags of the compiled program. The rest of the time I learned more about parallel computing, HDF5, and compression algorithms by reading papers and documentation. A lot of notes were taken (I must say a good note-taking system is a game changer). Last but not least, I also spent time syncing online and offline with my mentors to discuss problems. Without their help, I could never have made it this far.

My next phase will tackle the following problems:

Test applying the filter with other compression algorithms and with different dimension layouts of the dataset
Add decompression capability
Allow users to tune the auxiliary parameters that control the behavior of a given compression filter, i.e. the cd_nelmts and cd_values[] arguments that are currently passed as 0 and NULL in H5Pset_filter(COMPRESS_INFO.dcpl_id, H5Z_FILTER_SZ3, H5Z_FLAG_MANDATORY, 0, NULL);
Print additional benchmark results to indicate what and how the compression filter is applied, and the compression ratio
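As a reference point for the compression ratio mentioned in the last item, here is a minimal Python sketch. It uses the standard-library zlib module (DEFLATE, the algorithm behind GZIP) as a stand-in; it does not go through HDF5 filters, and the function name compression_ratio is hypothetical.

```python
import zlib


def compression_ratio(raw, level=6):
    """Return uncompressed size divided by compressed size for lossless DEFLATE."""
    compressed = zlib.compress(raw, level)
    # Lossless check: decompressing must give back exactly the original bytes.
    assert zlib.decompress(compressed) == raw
    return len(raw) / len(compressed)


smooth = bytes(1_000_000)  # all zeros: highly compressible
print(f"ratio for constant data: {compression_ratio(smooth):.0f}x")
```

For lossy compressors such as SZ3 or ZFP the same ratio applies, but the round-trip check has to be replaced by an error bound on the reconstructed values.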

Unveiling Medicine Patterns: 3D Clustering with Polyphy/Polyglot

Wed, 19 Jun 2024 00:00:00 +0000

Hello! My name is Ayush and this summer I’ll be contributing to Polyphy and Polyglot, a GPU-oriented agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data, under the mentorship of Oskar Elek and Kiran Deol.

For reference, here’s my proposal for this project.

Polyglot offers an immersive 3D visualization experience, enabling users to zoom, rotate, and delve into complex datasets. My project aims to harness these capabilities to unlock hidden connections in the realm of medicine, specifically focusing on the relationships between drugs based on their shared salt compositions, rather than just their active ingredients. This approach promises to reveal intricate patterns and relationships that have the potential to revolutionize drug discovery, pharmacology, and personalized medicine.

In this project, I will create custom embeddings for a vast dataset of over 600,000 medicines, capturing the relationships between their salt compositions. By visualizing these embeddings in Polyglot’s 3D space, researchers can identify previously unknown connections between medicines, leading to new insights and breakthroughs. The dynamic and interactive nature of Polyglot will empower researchers to explore these complex relationships in a very efficient and cool way, potentially accelerating the discovery of new drug interactions and therapeutic applications.

I am really excited to work on this project. Keep following the blogs for further updates!

Artificial Intelligence Explainability Accountability

Fri, 14 Jun 2024 00:00:00 +0000

Hey! I’m Sarthak Chowdhary (Shaburu), and I am thrilled to share my incredible journey with the Open Source Program Office of UC Santa Cruz as part of Google Summer of Code (GSoC) 2024. This experience marks a pivotal milestone in my career, offering me the chance to delve into an intriguing project while learning from the brightest minds in the open-source community. Allow me to guide you through my adventure thus far, from the nerve-wracking wait for results to the exhilarating commencement of the coding period.

Before we start, here’s my proposal.

Pre-GSoC Application

I had shortlisted three organizations that I was working on:

OSPO UC Santa Cruz - Amplifying Research Impact Through Open Source
CVAT.AI - Computer Vision Data Annotation for AI
Emory University - Biomedical Research to Advance Medical Care

On the 1st of May, like many students eagerly anticipating the results of the Google Summer of Code (GSoC) 2024, I found myself glued to my screen, anxiously awaiting the clock to strike 11:30 PM IST. After what felt like an eternity of waiting, I finally received the email that changed everything: I had been selected for GSoC 2024 with the Open Source Program Office of UC Santa Cruz!

The first month of GSoC, known as the community bonding period, is for establishing rapport with the people working on the project. I read up on my mentor, Dr. Leilani H. Gilpin, and built a good rapport with her. She is an Assistant Professor in Computer Science and Engineering and an affiliate of the Science & Justice Research Center at UC Santa Cruz. She is also a part of the AI group @ UCSC and leads the AI Explainability and Accountability (AIEA) Lab. Her research focuses on the design and analysis of methods for autonomous systems to explain themselves. Her work has applications to robust decision-making, system debugging, and accountability. Her current work examines how generative models can be used in iterative XAI stress testing. She guided me through the necessary documentation and explained the project demands and requirements in detail, which was invaluable for my project.

Project

The project aims to build a system that takes a student’s code as input and explains their mistakes to them, from low-level syntax and compilation errors to high-level issues such as overloaded variables.

My proposal aims to create novel, custom basic questions and take them up a notch by creating custom drivers for each problem and common drivers that detect low-level errors and give baseline explanations for various error cases, combining these drivers into a robust system, and using third-party open-source software (like the Monaco code editor, “the editor of the web”) where necessary. I will also write uniform and consistent feedback/explanations for each coding problem, covering all the possible edge cases, and a pipeline that iterates over the test cases and feedback. This benchmark suite will be used for testing the system.
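A common driver for low-level errors could start out as small as the following Python sketch. It is only an illustration of the idea, not the project's design: explain_syntax is a hypothetical name, it handles Python submissions only, and the wording of the baseline explanation is made up.

```python
import ast


def explain_syntax(source):
    """Return a beginner-friendly note for the first syntax error, or None."""
    try:
        ast.parse(source)
    except SyntaxError as error:
        return (
            f"Line {error.lineno}: {error.msg}. "
            "Check this line for a missing colon, bracket, or quote."
        )
    return None  # the code parses; higher-level checks can run next


print(explain_syntax("if True\n    print('hi')\n"))
print(explain_syntax("x = 1\n"))
```

Problem-specific drivers would then run only on code that passes this first check, so each layer of feedback can stay focused on one kind of mistake.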

Additionally, I plan on building an interface that has a roadmap from basics such as arrays and hashmaps to advanced topics such as trees, heaps, and backtracking, along with progress bars, and that throws confetti on successful unit tests (important). It will use the same benchmark suite under the hood. I will be utilizing Judge0 (an open-source online code execution system) for code execution and Monaco (the open-source “Editor of the Web”) as the code editor.

Project goals:

Project objective: By the end of summer, the software should be a novel and robust tool that helps beginner and advanced programmers alike learn programming by hyper-focusing on the mistakes they make and using AI to explain the how, what, and why of their code. It should provide clear and concise explanations accompanied by actionable suggestions for debugging and improvement.
Expected deliverables: A robust eXplainable AI benchmark suite that will be used extensively in the undergraduate AI courses, and possibly the graduate courses as well, and by anyone interested in learning programming with the help of personalized AI.
Future work based on the project: A beautiful gamified interface that uses the above benchmark suite and gets people excited to learn programming would be awesome to build!

When I started my programming journey (before ChatGPT😨), I often encountered problems that were way above my skill level without realizing it, which resulted in countless hours spent without proper feedback as to where I was going wrong. This project has a real impact on people in an innovative way, one I wish I had had access to at the start of my programming journey, so working on it comes from a place of passion. This project will also test my own understanding of programming, and spending the summer solidifying it under the guidance of Leilani H. Gilpin is a dream come true for me.

Developing Trustworthy Large Language Models

Fri, 14 Jun 2024 00:00:00 +0000

Hi! Thanks for stopping by.

In this first blog post of a series of three, I’d like to introduce myself, my mentor, and my project.

My name is Nikhil. I am an ML researcher who works at the intersection of NLP, ML, and HCI. I previously worked as a Machine Learning Engineer II at VMware and spent some wonderful summers interning with ML teams at NVIDIA and IIT Bombay. I also recently graduated from the University of Southern California (USC) with honors in Computer Science and a master’s thesis.

This year at Google Summer of Code (GSoC 24), I will be working on developing trustworthy large language models. I’m very grateful to be mentored by Leilani H. Gilpin at the AIEA lab, UC Santa Cruz. I truly admire the flexibility and ownership she allows me in pursuing my ideas independently within this project. Please feel free to peruse my accepted GSoC proposal here.

Project: My project has a tangible outcome: An open-source, end-to-end, full-stack web app with a hybrid trustworthy LLM in the backend.

This open-source web app will be a lightweight tool that not only has the ability to take diverse textual prompts and connect with several LLMs and a database but also the capability to gather qualitative and quantitative user feedback. Users will be able to see how this feedback affects the LLMs’ responses and impacts its reasoning and explanations (xAI). The tool will be thoroughly tested to ensure that the unit tests are passing and there is complete code coverage.

At the moment, we are investigating how to make LLMs more trustworthy in constraint satisfaction tasks such as logical reasoning and misinformation detection. However, our work has applicability in other areas of Responsible AI, such as Social Norms (toxicity detection and cultural insensitivity), Reliability (misinformation, hallucination, and inconsistency), Explainability & Reasoning (lack of interpretability, limited logical and causal reasoning), Safety (privacy violation and violence), and Robustness (prompt attacks and distribution shifts).

Impact:

Responsible AI research teams across industry and academia can use this as a boilerplate for their user study projects.
Diverse PhD students and academic researchers looking to study LLM and user interaction research will find this useful.
LLM alignment researchers and practitioners can find this useful, as user feedback affects the reward model of the internal LLMs.
Explainable AI (xAI) researchers can find value in the explanations that this tool generates, which reveal interpretable insights into how modern LLMs think and use their memory.

These are just a few use cases; however, there are several others that we look forward to describing in the upcoming posts.

This was my first blog in the series of three for the UC OSPO. Stay tuned for the upcoming blogs, which will detail my progress at the halfway mark and the final one concluding my work.

If you find this work interesting and would love to share your thoughts, I am happy to chat! :) Feel free to connect on LinkedIn and mention that you are reaching out from this blog post.

It is great to meet the UC OSPO community, and thanks for reading. Bye for now.

Enhancing Usability and Expandability of the Open Sensing Platform project

Fri, 14 Jun 2024 00:00:00 +0000

Greetings everyone,

I am Ahmed Falah and I am delighted to be part of the 2024 Google Summer of Code program, where I am contributing to the Open Sensing Platform project.

My proposal was accepted, and I am fortunate to have Colleen Josephson and John Madden as my mentors. The objective of my project is to enhance the usability and expandability of the Open Sensing Platform, a hardware solution for deploying sensor networks in outdoor environments. This platform utilizes low-power, long-range communication to transmit data from various sensors to a visualization dashboard. While the platform collects data effectively, its configuration process currently requires modifying source code, which makes it hard to use for researchers who are not firmware developers. My first steps to enhance the usability of the project are:

Improve User Interface (UI): Develop a user-friendly interface to interact with the platform, enabling researchers to configure the device without modifying code.
Convert User Configuration: Convert user configuration data to the Protobuf format for efficient storage and transmission.

Additionally, I will explore updating the NVRAM functions to interact with Protobuf messages instead of directly writing/reading raw data to NVRAM. I will also implement functions to serialize user configuration data into a Protobuf message and deserialize the message back into a data structure for use within the firmware.
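
As a rough illustration of the serialize/deserialize round trip described above, the sketch below hand-encodes a small configuration using the Protobuf wire format (varint and length-delimited fields) in Python. The `UserConfig` fields, the field numbers, and the function names are hypothetical and are not taken from the Open Sensing Platform firmware; a real implementation would more likely rely on code generated from a `.proto` schema rather than hand-written encoding.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass
class UserConfig:
    # Hypothetical fields for illustration only.
    logger_id: int
    upload_interval_s: int
    wifi_ssid: str


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def serialize(cfg: UserConfig) -> bytes:
    """Pack the config as tag/value pairs (tag = field_number << 3 | wire_type)."""
    ssid = cfg.wifi_ssid.encode("utf-8")
    out = bytearray()
    out += encode_varint((1 << 3) | 0) + encode_varint(cfg.logger_id)
    out += encode_varint((2 << 3) | 0) + encode_varint(cfg.upload_interval_s)
    out += encode_varint((3 << 3) | 2) + encode_varint(len(ssid)) + ssid
    return bytes(out)


def deserialize(buf: bytes) -> UserConfig:
    """Rebuild the config structure from the encoded bytes."""
    fields: Dict[int, Union[int, bytes]] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = decode_varint(buf, pos)
            fields[number] = value
        elif wire_type == 2:  # length-delimited
            length, pos = decode_varint(buf, pos)
            fields[number] = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
    return UserConfig(
        logger_id=fields.get(1, 0),
        upload_interval_s=fields.get(2, 0),
        wifi_ssid=fields.get(3, b"").decode("utf-8"),
    )


config = UserConfig(logger_id=7, upload_interval_s=300, wifi_ssid="field-lab")
blob = serialize(config)       # bytes that could be written to NVRAM
restored = deserialize(blob)   # structure the firmware would read back
```

Storing a tagged message like `blob` instead of raw structure bytes means the storage layer only deals with an opaque byte buffer, and fields can be added later without changing how the buffer is read or written.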

I will be posting regular updates and informative blogs throughout the summer, so stay tuned!

Stream Processing support for FasTensor

Thu, 13 Jun 2024 00:00:00 +0000

Hi, I’m Aditya Narayan,👋

I’m a frequent visitor to the town square of theoretical CS, operations (Ops), and robust high-performance systems. Sometimes I indulge myself with insights on Computing and Biology, and other times I enjoy the accounts of minefield experiences in the systems world. Luckily, this summer, OSRE offered an opportunity that happened to be at the perfect intersection of my interests.

This summer, I will be working on a scientific computing library called FasTensor that offers a parallel computing structure called Stencil, widely popular in the scientific computing world to solve PDEs for Physical Simulations and Convolutions on Signals, among its many uses. I am excited to introduce my mentors, Dr. Bin Dong and Dr. John Wu of the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL). They bring invaluable expertise to the project.

They recognized the need for a tensor processing library that provided dedicated support for big datasets with inherent structural locality, often found in the scientific computing world, which was lacking in popular open-source MapReduce or Key-Value based frameworks.

More often than not, the operations performed on these datasets are composed of computations involving neighboring elements. This motivated the development of the FasTensor library.

I will be working on providing a Stream Processing interface that enables online data processing of large-scale datasets as they arrive from Data Producers. The project focuses on offering rich interfaces for managing and composing streams, supporting common scientific data formats like HDF5, and integrating fault tolerance and reliability mechanisms.

I am thrilled to work on the FasTensor project because I believe it has the potential to make a significant impact by enabling researchers to implement a rich set of computations on their big datasets in an easy and intuitive manner.

After all, FasTensor has just one simple paradigm: A -> Transform(F(x), B),

and it handles all the behind-the-scenes grunt work of handling big datasets so you can focus on your research.
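
To give a feel for the paradigm, here is a minimal single-process sketch of a stencil-style transform in Python. It is not FasTensor's actual API: the `Stencil` class, the `transform` function, and the clamped boundary handling are assumptions made for illustration, and the real library distributes this work in parallel across large datasets.

```python
from typing import Callable, List, Sequence


class Stencil:
    """A view of the input centered on one cell; offsets are relative to it."""

    def __init__(self, data: Sequence[float], center: int) -> None:
        self._data = data
        self._center = center

    def __call__(self, offset: int) -> float:
        # Clamp at the edges so neighbors outside the array reuse the border cell.
        index = min(max(self._center + offset, 0), len(self._data) - 1)
        return self._data[index]


def transform(a: Sequence[float], f: Callable[[Stencil], float]) -> List[float]:
    """Apply f to the neighborhood of every cell of a and collect the results in b."""
    return [f(Stencil(a, i)) for i in range(len(a))]


def moving_average(s: Stencil) -> float:
    """A user-defined F(x): the mean of a cell and its two neighbors."""
    return (s(-1) + s(0) + s(1)) / 3.0


a = [1.0, 2.0, 3.0, 4.0]
b = transform(a, moving_average)
```

The user only writes `moving_average`; how the input is chunked, traversed, and written out is the part a library like FasTensor takes care of.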

Stay tuned for updates and feel free to collaborate!

Drishti

Thu, 06 Jun 2024 00:00:00 +0000

Namaste everyone! 🙏🏻

I’m Joel Tony, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I’m truly honored to be part of this year’s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I’m particularly grateful to be working under the mentorship of Dr. Jean Luca Bez, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. Suren Byna, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.

My project, “Drishti: Visualization and Analysis of AI-based Applications”, aims to extend the Drishti framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer’s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, I/O has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads, which makes efficient I/O management essential.

Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I’m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.

Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I’ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computation often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn’t just mean more computation; it means handling vastly more data. This shift makes I/O optimization not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.

As outlined in my proposal, my tasks are threefold:

Modularize Drishti’s codebase: Currently, it’s a single 1700-line file that handles multiple functionalities. I’ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.
Enable multi-trace handling: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce several. I’ll build a layer to aggregate these, providing a comprehensive view of the application’s I/O behavior.
Craft AI/ML-specific recommendations: Current suggestions often involve MPI-IO or HDF5, which aren’t typical in ML frameworks like PyTorch or TensorFlow. I’ll create targeted recommendations that align with these frameworks’ data pipelines.
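
The multi-trace aggregation idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the counter names (`bytes_read`, `read_ops`, `bytes_written`, `write_ops`) and the functions are hypothetical, each trace is assumed to have already been reduced to a dictionary of counters, and none of this reflects Drishti's or Darshan's real data model.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_traces(traces: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-trace counters into one job-wide view."""
    totals: Counter = Counter()
    for counters in traces:
        totals.update(counters)  # adds values key by key
    return dict(totals)


def mean_read_size(totals: Dict[str, int]) -> float:
    """Average bytes per read; small values hint at many tiny requests."""
    ops = totals.get("read_ops", 0)
    return totals.get("bytes_read", 0) / ops if ops else 0.0


# Hypothetical summaries from two trace files of the same training job.
per_trace = [
    {"bytes_read": 100, "read_ops": 4},
    {"bytes_read": 50, "read_ops": 1, "bytes_written": 10, "write_ops": 2},
]
job_view = aggregate_traces(per_trace)
avg_read = mean_read_size(job_view)
```

Once the counters live in a single view, a recommendation rule can look at the job as a whole instead of at one file at a time.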

This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it’s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.

From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.

In today’s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we’re not just tweaking code. We’re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.

I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I’m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.

Enhancing h5bench with HDF5 Compression Capability

Mon, 27 May 2024 00:00:00 +0000