<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Jean Luca Bez | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/index.xml" rel="self" type="application/rss+xml"/><description>Jean Luca Bez</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/avatar_hu7f5524663ca74c94691994ffb50a02af_53180_270x270_fill_q75_lanczos_center.jpg</url><title>Jean Luca Bez</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/</link></image><item><title>AI Data Readiness Inspector (AIDRIN)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/aidrin/</link><pubDate>Fri, 30 Jan 2026 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/aidrin/</guid><description>&lt;p>Garbage In, Garbage Out (GIGO) is a widely accepted quote in computer science across various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/pdf/2406.19256" target="_blank" rel="noopener">AIDRIN&lt;/a> (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of data readiness for AI processes, covering a broad range of dimensions from the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, to evaluate data. Furthermore, AIDRIN uses metrics specific to assessing AI data, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating data readiness.&lt;/p>
&lt;h3 id="aidrin-multiple-file-formats">AIDRIN Multiple File Formats&lt;/h3>
&lt;p>The proposed work will include improvements in the AIDRIN framework to (1) add support for new file formats such as Zarr, ROOT, and HDF5; and (2) to allow providing custom data ingestion mechanisms.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>data readiness&lt;/code>, &lt;code>AI&lt;/code>, &lt;code>data analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, data analysis, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/drishti/</link><pubDate>Fri, 30 Jan 2026 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/drishti/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component for end-users to visually inspect their applications&amp;rsquo; I/O behavior, focusing on areas of interest and getting a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, our framework maps numerous common and well-known bottlenecks and their solution recommendations that can be implemented by users.&lt;/p>
&lt;h3 id="drishti-comparisons-and-heatmaps">Drishti Comparisons and Heatmaps&lt;/h3>
&lt;p>The proposed work will include investigating and building a solution to allow comparing and finding differences between two I/O trace files (similar to a &lt;code>diff&lt;/code>), covering the analysis and visualization components. It will also explore additional metrics and counters such as Darshan heatmaps in the analysis and visualization components of the framework.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code>, &lt;code>HPC&lt;/code>, &lt;code>data analysis&lt;/code>, &lt;code>visualization&lt;/code>, &lt;code>profiling&lt;/code>, &lt;code>tracing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, data analysis, performance profiling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>AIDRIN Privacy-Centric Enhancements: Backend &amp; UX Upgrades</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250725-harish_balaji/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250725-harish_balaji/</guid><description>&lt;p>⏱️ Reading time: 5–6 minutes&lt;/p>
&lt;p>Hey everyone,&lt;/p>
&lt;p>If you’ve ever wondered what it takes to make AI data pipelines not just smarter, but safer and more transparent, you’re in the right place. The last few weeks working on AIDRIN for GSoC have been a deep dive into the engine room of privacy and backend systems that power the AIDRIN project. My focus has been on building out the core privacy infrastructure and backend features that power AIDRIN’s ability to give users real, actionable insights about their data. It’s been challenging, sometimes messy, but incredibly rewarding to see these changes make a tangible difference.&lt;/p>
&lt;p>Having Dr. Jean Luca Bez and Prof. Suren Byna as mentors, along with the support of the entire team, has truly made all the difference. Their guidance, encouragement, and collaborative spirit have been a huge part of this journey, whether I’m brainstorming new ideas or just trying to untangle a tricky bug.&lt;/p>
&lt;h2 id="privacy-metrics-making-data-safer">Privacy Metrics: Making Data Safer&lt;/h2>
&lt;p>A major part of my work has been putting data privacy at the front and center in AIDRIN. I focused on integrating essential privacy metrics like k-anonymity, l-diversity, t-closeness, and more, making sure they’re not just theoretical checkboxes, but real tools that users can interact with and understand. Now, these metrics are fully wired up in the backend and visualized in AIDRIN, so privacy risks are no longer just a vague concern. They are something AI data preparers can actually see and act on. Getting these metrics to work seamlessly with different datasets and ensuring their accuracy took some serious backend engineering, but the payoff has been worth it.&lt;/p>
&lt;h2 id="speeding-things-up-so-you-dont-have-to-wait-around">Speeding Things Up (So You Don’t Have To Wait Around)&lt;/h2>
&lt;p>As AIDRIN started handling bigger datasets, some of the calculations can be time-consuming because data has to be accessed every time a metric is computed. To address this, I added caching for previously computed metrics, like class imbalance and privacy checks, and set up asynchronous execution with Celery and Redis. This should make the app super responsive. Rather than waiting for heavy computations to finish, one can start taking notes about other metrics or explore different parts of the app while their results are loading in the background. It’s a small change, but it helps keep the workflow moving smoothly.&lt;/p>
&lt;h2 id="small-touch-ups-that-hopefully-make-a-big-difference">Small Touch Ups That (Hopefully) Make a Big Difference&lt;/h2>
&lt;p>I also spent time on the details that make the app easier to use. Tooltips now explain what the privacy metrics actually mean, error messages are clearer, and there’s a new cache info page where you can see and clear your cached data. The sensitive attribute dropdown is less confusing now, especially if you’re working with quasi-identifiers. These tweaks might seem minor, but they add up and make the app friendlier for everyone.&lt;/p>
&lt;h2 id="docs-docs-docs">Docs, Docs, Docs&lt;/h2>
&lt;p>I’m a big believer that good documentation is just as important as good code. I updated the docs to cover all the new features, added citations for the privacy metrics, and made the install process a bit more straightforward. Hopefully, this means new users and contributors can get up to speed without too much hassle.&lt;/p>
&lt;h2 id="huge-thanks-to-my-mentors-and-the-team">Huge Thanks to My Mentors and the Team&lt;/h2>
&lt;p>I really want to shine a light on Dr. Bez, Prof. Byna, and the entire AIDRIN team here. Their encouragement, practical advice, and collaborative spirit have been a huge part of my progress. Whether I’m stuck on a bug, brainstorming a new feature, or just need a second opinion, there’s always someone ready to help me think things through. Their experience and support have shaped not just the technical side of my work, but also how I approach problem-solving and teamwork.&lt;/p>
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Looking ahead, I’m planning to expand AIDRIN’s support for multimodal datasets and keep refining the privacy and fairness modules. There’s always something new to learn or improve, and I’m excited to keep building. If you’re interested in data quality, privacy, or open-source AI tools, I’d love to connect and swap ideas.&lt;/p>
&lt;p>Thanks for reading and for following along with my GSoC journey. I’ll be back soon with more updates!&lt;/p>
&lt;p>&lt;em>This is the second post in my 3-part GSoC series with AIDRIN. Stay tuned for the final update.&lt;/em>&lt;/p></description></item><item><title>Improving AI Data Pipelines in AIDRIN: A Privacy-Centric and Multimodal Expansion</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</link><pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</guid><description>&lt;p>⏱️ Reading time: 4–5 minutes&lt;/p>
&lt;p>Hi 👋&lt;/p>
&lt;p>I’m Harish Balaji, a Master’s student at NYU with a focus on Artificial Intelligence, Machine Learning, and Cybersecurity. I’m especially interested in building scalable systems that reflect responsible AI principles. For me, data quality isn’t just a technical detail. It’s a foundational aspect of building models that are reliable, fair, and reproducible in the real world.&lt;/p>
&lt;p>This summer, I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of Google Summer of Code 2025. I’m grateful to be working under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the &lt;a href="https://crd.lbl.gov/divisions/scidata/sdm/" target="_blank" rel="noopener">Scientific Data Management Group&lt;/a> at Lawrence Berkeley National Laboratory (LBNL).&lt;/p>
&lt;p>AIDRIN is an open-source framework that helps researchers and practitioners evaluate whether a dataset is truly ready to be used in production-level AI workflows. From fairness to privacy, it provides a structured lens through which we can understand the strengths and gaps in our data.&lt;/p>
&lt;h2 id="why-this-work-matters">Why this work matters&lt;/h2>
&lt;p>In machine learning, one principle always holds true:&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;Garbage in, garbage out.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;p>Even the most advanced models can underperform or amplify harmful biases if trained on incomplete, imbalanced, or poorly understood data. This is where AIDRIN steps in. It provides practical tools to assess datasets across key dimensions like privacy, fairness, class balance, interpretability, and support for multiple modalities.&lt;/p>
&lt;p>By making these characteristics measurable and transparent, AIDRIN empowers teams to make informed decisions early in the pipeline. It helps ensure that datasets are not only large or complex, but also trustworthy, representative, and purpose-fit.&lt;/p>
&lt;h2 id="my-focus-this-summer">My focus this summer&lt;/h2>
&lt;p>As part of my GSoC 2025 project, I’ll be focusing on extending AIDRIN’s evaluation capabilities. A big part of this involves strengthening its support for privacy metrics and designing tools that can handle non-tabular datasets, such as image-based data.&lt;/p>
&lt;p>The goal is to expand AIDRIN’s reach without compromising on interpretability or ease of use. More technical insights and updates will follow in the next posts as the summer progresses.&lt;/p>
&lt;h2 id="what-comes-next">What comes next&lt;/h2>
&lt;p>As the AI community continues to evolve, there’s a growing shift toward data-centric practices. I believe frameworks like AIDRIN are essential for helping us move beyond the question of &lt;em>&amp;ldquo;Does the model work?&amp;rdquo;&lt;/em> toward a deeper and more meaningful one: &lt;em>&amp;ldquo;Was the data ready in the first place?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>Over the next few weeks, I’ll be working on development, testing, and integration. I’m excited to contribute to a tool that emphasizes transparency and reproducibility across the AI lifecycle, and to share lessons and ideas with others who care about responsible AI.&lt;/p>
&lt;p>If you’re exploring similar challenges or working in the space of dataset evaluation and readiness, I’d love to connect and exchange thoughts. You can also read my full GSoC 2025 proposal below for more context around the project scope and vision:&lt;/p>
&lt;p>👉 &lt;a href="https://drive.google.com/file/d/1RUyU2fHkc8GZ9vTj5SUr6jj84ZaRUvNt/view" target="_blank" rel="noopener">Read my GSoC 2025 proposal here&lt;/a>&lt;/p>
&lt;p>&lt;em>This is the first in a 3-part blog series documenting my GSoC journey with AIDRIN. Stay tuned for technical updates and behind-the-scenes insights as the summer unfolds!&lt;/em>&lt;/p></description></item><item><title>AI Data Readiness Inspector (AIDRIN)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</guid><description>&lt;p>Garbage In Garbage Out (GIGO) is a universally agreed quote by computer scientists from various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/pdf/2406.19256" target="_blank" rel="noopener">AIDRIN&lt;/a> (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of the readiness of data for AI processes, covering a broad range of readiness dimensions available in the literature. AIDRIN uses metrics in traditional data quality assessment, such as completeness, outliers, and duplicates, for data evaluation. Furthermore, AIDRIN uses metrics specific to assess data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and FAIR (Findability, Accessibility, Interoperability, and Reusability) principle compliance. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data.&lt;/p>
&lt;h3 id="aidrin-visualizations-and-science-gateway">AIDRIN Visualizations and Science Gateway&lt;/h3>
&lt;p>The proposed work will include improvements in the AIDRIN framework to (1) enhance, extend, and optimize the visualizations of metrics related to all six pillars of AI data readiness and (2) set up a science gateway on NERSC or AWS cloud service.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>data readiness&lt;/code> &lt;code>AI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench with AI workloads</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performant parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench-with-ai-workloads">h5bench with AI workloads&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Midway Through GSoC</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240714-jaytau/</link><pubDate>Wed, 31 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240714-jaytau/</guid><description>&lt;p>Hello everyone! I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joel-tony/">Joel Tony&lt;/a>, and I&amp;rsquo;m excited to share my progress update on the &lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I&amp;rsquo;ve been diving deep into the world of I/O visualization for scientific applications, and I&amp;rsquo;m thrilled to tell you about the strides we&amp;rsquo;ve made.&lt;/p>
&lt;h2 id="what-is-drishti">What is Drishti?&lt;/h2>
&lt;p>For those unfamiliar with Drishti, it&amp;rsquo;s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using &lt;a href="https://wordpress.cels.anl.gov/darshan/" target="_blank" rel="noopener">Darshan&lt;/a>, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application&amp;rsquo;s I/O patterns, making it easier to identify bottlenecks and optimize performance.&lt;/p>
&lt;h2 id="progress-and-challenges">Progress and Challenges&lt;/h2>
&lt;h3 id="export-directory-feature">Export Directory Feature&lt;/h3>
&lt;p>One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn&amp;rsquo;t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.&lt;/p>
&lt;h3 id="ci-improvements-and-cross-project-dependencies">CI Improvements and Cross-Project Dependencies&lt;/h3>
&lt;p>While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don&amp;rsquo;t adequately test the interactions between different branches of these interconnected tools. This is an area we&amp;rsquo;ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.&lt;/p>
&lt;h3 id="refactoring-for-multi-file-support">Refactoring for Multi-File Support&lt;/h3>
&lt;p>The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti&amp;rsquo;s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing on performance.&lt;/p>
&lt;p>The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:&lt;/p>
&lt;ol>
&lt;li>Better separation of computation and condition checking&lt;/li>
&lt;li>Easier parallelization of processing multiple traces&lt;/li>
&lt;li>Finer-grained profiling of performance bottlenecks&lt;/li>
&lt;li>More flexibility in data manipulation and memory management&lt;/li>
&lt;/ol>
&lt;h2 id="learnings-and-skills-gained">Learnings and Skills Gained&lt;/h2>
&lt;p>Through this process, I&amp;rsquo;ve gained valuable insights into:&lt;/p>
&lt;ol>
&lt;li>Refactoring large codebases&lt;/li>
&lt;li>Understanding and improving cross-project dependencies&lt;/li>
&lt;li>Implementing data classes in Python for better code organization&lt;/li>
&lt;li>Balancing performance with code readability and maintainability&lt;/li>
&lt;/ol>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>As I move forward with the project, my focus will be on:&lt;/p>
&lt;ol>
&lt;li>Adding unit tests for individual methods to ensure functionality&lt;/li>
&lt;li>Exploring alternative data frame implementations like Polars for better performance&lt;/li>
&lt;li>Developing aggregation methods for different types of data across multiple Darshan files&lt;/li>
&lt;li>Optimizing memory usage and computational efficiency for large datasets&lt;/li>
&lt;/ol>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Working on Drishti has been an incredible learning experience. I&amp;rsquo;ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I&amp;rsquo;m excited about the potential impact of these improvements on the scientific community&amp;rsquo;s ability to optimize their applications&amp;rsquo; I/O performance.&lt;/p>
&lt;p>I&amp;rsquo;m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!&lt;/p>
&lt;p>If you have any questions or would like to learn more about the project, feel free to &lt;a href="https://www.jaytau.com/#contact?ref=uc-ospo" target="_blank" rel="noopener">reach out to me&lt;/a>. Let&amp;rsquo;s keep pushing the boundaries of scientific computing together!&lt;/p></description></item><item><title>Enhancing h5bench with HDF5 Compression Capability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240731-henryz/</link><pubDate>Sat, 27 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240731-henryz/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench">h5bench&lt;/a> project my &lt;a href="https://summerofcode.withgoogle.com/myprojects/details/n0H28Z40" target="_blank" rel="noopener">Enhencing h5bench with HDF5 Compression Capability&lt;/a> under the mentorship of Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and Dr. Suren Byna aims to allow users of h5bench to incoporate compression features in their simulations by creating custom benchmarks with common scientific lossless &amp;amp; lossy compression algorithms such as SZ, SZ3, ZFP, and GZIP.&lt;/p>
&lt;p>The problem I am trying to solve is to implement multiple data compression algorithms in h5bench core access patterns through HDF5 filters. This capability should grant users the flexibility to configure the parameters and methods of compression applied to their datasets according to their specific needs and preferences. My solution primarily involves using a user-defined HDF5 filter mechanism to implement lossless and lossy compression algorithms, such as ZFP, SZ, and cuSZ. Throughout the process, I will deliver one C source code implementing compression configuration settings, one C source code implementing lossless and lossy algorithms, a set of performance reports before and after data compression in CSV and standard output files, and a technical documentation on h5bench user manual website.&lt;/p>
&lt;h1 id="midterm-blog">Midterm Blog&lt;/h1>
&lt;p>This summer, after completing my junior year, I was honored to have the opportunity working with Dr. Jean Luca Bez and Dr. Suren Byna on the h5bench, an open-source benchmarking project designed to simulate runnning sync/async HDF5 I/O on HPC machines. This post will cover mostly what I have learned, produced, planned, and thoughts over the first six weeks.&lt;/p>
&lt;p>First of all, let&amp;rsquo;s define some of the terms here. HDF5 stands for Hierarchical Data Format 5. Unlike other data storage formats (JSON, CSV, XML&amp;hellip;), HDF5 is not only a container that manages data similar to a file system, but also a powerful library that gives you the ability to perform I/O (Inputs/Outputs) operations between memory and file. One of the reasons this tool is commonly used by HPC applications is that it also supports MPI I/O, which is a protocol for parallel computing (you can think of it as the parallel version of POSIX). With exabytes of data and high frequencies of usage for analysis in scientific studies, HDF5 is perfect for the job. Essentially, h5bench is a software that tests the hardware&amp;rsquo;s performance through HDF5 (it also provides other benchmark kernels such as AMReX, E3SM-IO, MACSio, and openPMD-api, but my job focuses on using vanilla HDF5 I/O).&lt;/p>
&lt;p>So, what I have done so far? Frist, my job is to allow users to tune input parameters regarding data compression, and make sure h5bench prints accurate benchmark results with the intended compression algorithm applied to their datasets. h5bench&amp;rsquo;s frondend is written in Python, which takes an input of a JSON file from user and parses it into a CFG configuration file that can be read by the backend later, which is written in C. I created a new enum struct and made user able to specify one from a range of compression algorithms (SZ3, ZFP, LZ4, GZIP, and other pre-defined algorithms). I also made it possible to apply these algorithms to the datasets, so the .h5 (an HDF5 file) would contain chunks of compressed data after multiple H5Dwrite calls.&lt;/p>
&lt;p>Next, the challenges and gains. Throughout the first six weeks, 30% of the time was spent on understanding the newest version of h5bench and HDF5 by reading through C source codes and documentations, and asking many dumb questions to my mentors (thanks to their patience and great answers :D). Writing code is fairly easy after I really understood what the program is doing. By that I mean you have to understand every line in almost all functions and how each and every variables change. 40% of the time was used on debugging and testing the compression algorithm, mainly SZ3. To make code behaves correctly is another level of difficulty. Most of the issues resulted from failing to configure the application and dependent libraries correctly. Without necessary macros enabled during the build process, features like compression filter plugin will not run. As I was also new to CMake and HPC environment, I learned that new envrionment variables will be reset for every new session, even if you requested a compute node resource. Besides getting used to the standard build sequence: &amp;ldquo;cmake ..&amp;rdquo;, &amp;ldquo;make&amp;rdquo;, &amp;ldquo;make install&amp;rdquo;, I also learned to use &amp;ldquo;ccmake ..&amp;rdquo; to examine the flags of the compiled program. The rest of time I learned more about parallel computing, HDF5, compression algorithms, by reading some papers and documentations. A lot of notes were taken (I must say a good note taking system is the game changer). Last but not the least, I also spent times synchronizing online and offline with my mentors to discuess problems. Without their help, I can never make this far.&lt;/p>
&lt;p>My next phase will tackle these problems, here I am just offering a list:&lt;/p>
&lt;ul>
&lt;li>Test applying filter with other compression algorithms, and with different dimension layout of the dataset&lt;/li>
&lt;li>Add decompression capability&lt;/li>
&lt;li>Allow users to tune the auxiliary parameters for controlling the behavior of a certain compression filter H5Pset_filter(COMPRESS_INFO.dcpl_id, H5Z_FILTER_SZ3, H5Z_FLAG_MANDATORY, 0, NULL); cd_nelmts cd_values[]&lt;/li>
&lt;li>Print additional benchmark results to indicate what and how the compression filter is applied, and the compression ratio&lt;/li>
&lt;/ul></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240614-jaytau/</link><pubDate>Thu, 06 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/drishti/20240614-jaytau/</guid><description>&lt;p>Namaste everyone! 🙏🏻&lt;/p>
&lt;p>I&amp;rsquo;m &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joel-tony/">Joel Tony&lt;/a>, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I&amp;rsquo;m truly honored to be part of this year&amp;rsquo;s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I&amp;rsquo;m particularly grateful to be working under the mentorship of Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a>, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. &lt;a href="https://sbyna.github.io" target="_blank" rel="noopener">Suren Byna&lt;/a>, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.&lt;/p>
&lt;p>My project, &amp;ldquo;&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti">Drishti: Visualization and Analysis of AI-based Applications&lt;/a>&amp;rdquo;, aims to extend the &lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer&amp;rsquo;s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.&lt;/p>
&lt;p>Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I&amp;rsquo;m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.&lt;/p>
&lt;p>Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I&amp;rsquo;ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computing often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn&amp;rsquo;t just mean more computation; it means handling vastly more data. This shift makes I/O optimisation not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.&lt;/p>
&lt;p>As outlined in my &lt;a href="https://docs.google.com/document/d/1zfQclXYWFswUbHuuwEU7bjjTvzS3gRCyNci08lTR3Rg/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, my tasks are threefold:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Modularize Drishti&amp;rsquo;s codebase&lt;/strong>: Currently, it&amp;rsquo;s a single 1700-line file that handles multiple functionalities. I&amp;rsquo;ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.&lt;/li>
&lt;li>&lt;strong>Enable multi-trace handling&lt;/strong>: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I&amp;rsquo;ll build a layer to aggregate these, providing a comprehensive view of the application&amp;rsquo;s I/O behavior.&lt;/li>
&lt;li>&lt;strong>Craft AI/ML-specific recommendations&lt;/strong>: Current suggestions often involve MPI-IO or HDF5, which aren&amp;rsquo;t typical in ML frameworks like PyTorch or TensorFlow. I&amp;rsquo;ll create targeted recommendations that align with these frameworks&amp;rsquo; data pipelines.&lt;/li>
&lt;/ol>
&lt;p>This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it&amp;rsquo;s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.&lt;/p>
&lt;p>From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.&lt;/p>
&lt;p>In today&amp;rsquo;s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we&amp;rsquo;re not just tweaking code. We&amp;rsquo;re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.&lt;/p>
&lt;p>I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I&amp;rsquo;m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.&lt;/p></description></item><item><title>Enhancing h5bench with HDF5 Compression Capability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240614-henryz/</link><pubDate>Mon, 27 May 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/lbl/h5bench/20240614-henryz/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench">h5bench&lt;/a> project my &lt;a href="https://summerofcode.withgoogle.com/myprojects/details/n0H28Z40" target="_blank" rel="noopener">Enhencing h5bench with HDF5 Compression Capability&lt;/a> under the mentorship of Dr. &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and Dr. Suren Byna aims to allow users of h5bench to incoporate compression features in their simulations by creating custom benchmarks with common scientific lossless &amp;amp; lossy compression algorithms such as SZ, SZ3, ZFP, and GZIP.&lt;/p>
&lt;p>The problem I am trying to solve is to implement multiple data compression algorithms in h5bench core access patterns through HDF5 filters. This capability should grant users the flexibility to configure the parameters and methods of compression applied to their datasets according to their specific needs and preferences. My solution primarily involves using a user-defined HDF5 filter mechanism to implement lossless and lossy compression algorithms, such as ZFP, SZ, and cuSZ. Throughout the process, I will deliver one C source code implementing compression configuration settings, one C source code implementing lossless and lossy algorithms, a set of performance reports before and after data compression in CSV and standard output files, and a technical documentation on h5bench user manual website.&lt;/p></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component for end-users to visually inspect their applications&amp;rsquo; I/O behavior, focusing on areas of interest and getting a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, our framework maps numerous common and well-known bottlenecks and their solution recommendations that can be implemented by users.&lt;/p>
&lt;h3 id="drishti--server-side-visualization-service">Drishti / Server-side Visualization Service&lt;/h3>
&lt;p>The proposed work will include investigating and building server-side solutions to support the visualization of larger I/O traces and logs, while integrating with the existing analysis, reports, and recommendations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>visualization&lt;/code>, &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, HTML/CSS, JavaScript&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="drishti--visualization-and-analysis-of-ai-based-applications">Drishti / Visualization and Analysis of AI-based Applications&lt;/h3>
&lt;p>Drishti to handle metrics from non-MPI applications, specifically, AI/ML codes and applications. This work entails adapting the existing framework, heuristics, and recommendations to support metrics collected from AI/ML workloads.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>AI&lt;/code> &lt;code>visualization&lt;/code>, &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, AI, performance profiling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performant parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench--reporting-and-enhancing">h5bench / Reporting and Enhancing&lt;/h3>
&lt;p>The proposed work will include standardizing and enhancing the reports generated by the suite, and integrate additional I/O kernels (e.g., HACC-IO).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="h5bench--compression">h5bench / Compression&lt;/h3>
&lt;p>The proposed work will focus on including compression capabilities into the h5bench core access patterns through HDF5 filters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>, &lt;code>compression&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>