<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LBL | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/lbl/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/lbl/index.xml" rel="self" type="application/rss+xml"/><description>LBL</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 05 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>LBL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/lbl/</link></image><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-2">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 2&lt;/h1>
&lt;p>Hello all! 👋 This is Tharit again. I want to share this blog post about Part 2 of my Google Summer of Code project. In case you missed it, you can take a look at &lt;a href="https://ucsc-ospo.github.io/report/osre25/lbl/pylops-mpi/20250723-tharit/" target="_blank" rel="noopener">Part 1&lt;/a> as well. Without further ado, the following features have been added since last time.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="complex-number-support-pr-148httpsgithubcompylopspylops-mpipull148">Complex Number Support &lt;a href="https://github.com/PyLops/pylops-mpi/pull/148" target="_blank" rel="noopener">PR #148&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Between this PR and the previous one, there was a lot of debugging and testing to make sure that every existing &lt;code>MPILinearOperator&lt;/code> works under NCCL as it does with &lt;code>mpi4py&lt;/code>: PRs &lt;a href="https://github.com/PyLops/pylops-mpi/pull/141" target="_blank" rel="noopener">#141&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/142" target="_blank" rel="noopener">#142&lt;/a>, and &lt;a href="https://github.com/PyLops/pylops-mpi/pull/145" target="_blank" rel="noopener">#145&lt;/a>&lt;/em>&lt;/p>
&lt;p>Most PyLops-MPI users are scientists and engineers working on scientific problems, and many scientific problems involve complex numbers (the Fourier transform touches many things). &lt;em>NCCL does not support complex numbers out of the box&lt;/em>.&lt;/p>
&lt;p>It turned out that adding complex-number support was not a big issue. A complex number is simply a contiguous array of, say, &lt;code>float64&lt;/code>: unlike a typical &lt;code>float64&lt;/code>, one &lt;code>complex128&lt;/code> element is represented by two &lt;code>float64&lt;/code> values. Things get more complicated once we start talking about complex-number arithmetic. Luckily, &lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclRedOp_t" target="_blank" rel="noopener">NCCL semantics&lt;/a> only supports the &lt;em>element-wise&lt;/em> reductions &lt;code>ncclSum&lt;/code>, &lt;code>ncclProd&lt;/code>, &lt;code>ncclMin&lt;/code>, &lt;code>ncclMax&lt;/code>, and &lt;code>ncclAvg&lt;/code>, and wrapping element-wise operations for complex numbers is straightforward.&lt;/p>
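&lt;p>To see why element-wise reductions carry over, note that a complex array can be reinterpreted as a float array of twice the size, and summing the float views element-wise is identical to summing the complex values. A minimal NumPy sketch of the idea (illustrative only, not PyLops-MPI code):&lt;/p>

```python
import numpy as np

# A complex128 array can be reinterpreted as a float64 array of twice the
# size: each complex element contributes one real and one imaginary float.
a = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)
b = np.array([5 + 6j, 7 + 8j], dtype=np.complex128)

# Element-wise sum of the float views equals the float view of the complex
# sum, so a sum-reduction over float buffers is valid for complex data.
summed_view = a.view(np.float64) + b.view(np.float64)
assert np.array_equal(summed_view, (a + b).view(np.float64))
assert a.view(np.float64).size == 2 * a.size
```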
&lt;p>The change to PyLops-MPI &lt;code>_nccl.py&lt;/code> itself is minimal. We simply added the function below, which hides the complexity of buffer-size management from users.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">_nccl_buf_size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">count&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34; Get an appropriate buffer size according to the dtype of buf
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> if buf.dtype in [&amp;#39;complex64&amp;#39;, &amp;#39;complex128&amp;#39;]:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> return 2 * count if count else 2 * buf.size
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> else:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> return count if count else buf.size
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The conceptual is quite simple. But mechanically, to get it right in the general case required some extensive bug fixing, particularly in the call to &lt;code>_allgather &lt;/code>as noted earlier in the &amp;ldquo;Core Change&amp;rdquo; section. The array needs some preprocessing (to align with NCCL semantics) and post-processing so that the result from Pylops-MPI’s NCCL allgather matches with the PyLops-MPI allgather. This is because Pylops-MPI must be able to switch between &lt;code>mpi4py&lt;/code> and NCCL seamlessly from the user&amp;rsquo;s perspective. To make it concrete, here is how we do the &lt;code>_allgather()&lt;/code> with NCCL&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="nf">_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Allgather operation
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">deps&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">nccl_enabled&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="nb">isinstance&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nb">tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">send_shapes&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">(&lt;/span>&lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">_prepare_nccl_allgather_inputs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">raw_recv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">_unroll_nccl_allgather_recv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">raw_recv&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt; snip - MPI allgather &amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>After this feature was added, PyLops-MPI with NCCL caught up with the original MPI implementation, i.e., the test coverage is now the same: 306 tests passed!&lt;/strong>&lt;/p>
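&lt;p>The pad-and-unroll round trip in &lt;code>_allgather&lt;/code> can be pictured with a small NumPy sketch. The helpers below are hypothetical, simplified stand-ins for &lt;code>_prepare_nccl_allgather_inputs&lt;/code> and &lt;code>_unroll_nccl_allgather_recv&lt;/code> (the real code operates on GPU buffers): every rank pads its array to the largest shape, the padded buffers are gathered, and each chunk is then sliced back to its true size.&lt;/p>

```python
import numpy as np

def pad_to_max(send_buf, send_shapes):
    # hypothetical stand-in: pad the local array to the largest gathered size
    max_size = max(int(np.prod(s)) for s in send_shapes)
    padded = np.zeros(max_size, dtype=send_buf.dtype)
    padded[:send_buf.size] = send_buf.ravel()
    return padded

def unroll(raw_recv, chunk_size, send_shapes):
    # hypothetical stand-in: slice each rank's chunk and drop the padding
    out = []
    for i, shape in enumerate(send_shapes):
        n = int(np.prod(shape))
        chunk = raw_recv[i * chunk_size : i * chunk_size + n]
        out.append(chunk.reshape(shape))
    return out

# Simulate a 2-rank allgather where ranks hold differently sized arrays.
bufs = [np.arange(3.0), np.arange(5.0)]
shapes = [b.shape for b in bufs]
padded = [pad_to_max(b, shapes) for b in bufs]
raw = np.concatenate(padded)  # what a real allgather would produce
gathered = unroll(raw, padded[0].size, shapes)
assert [g.tolist() for g in gathered] == [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0, 3.0, 4.0]]
```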
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-instrumentation-pr-157httpsgithubcompylopspylops-mpipull157">Benchmark Instrumentation &lt;a href="https://github.com/PyLops/pylops-mpi/pull/157" target="_blank" rel="noopener">PR #157&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>Profiling distributed GPU operations is critical to understanding performance bottlenecks. To make this easier, we added a &lt;em>lightweight benchmark instrumentation&lt;/em> framework in PyLops-MPI. The goal was to allow developers to mark execution points in a function and collect timing information for these markers.&lt;/p>
&lt;p>The core of the implementation is a &lt;code>@benchmark&lt;/code> decorator. Inside a decorated function, developers can call &lt;code>mark(label)&lt;/code> to record the time at specific points. After the function completes, the timings are reported in a human-readable format. This design is inspired by C++-style instrumentation, letting developers place markers directly in the code where they are most informative.&lt;/p>
&lt;p>But because we are in Python, to handle nested function calls we collect the timing information as a stack (a bottom-up call graph) and parse the result at the end of the decorated function. Here is an illustration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="nd">@benchmark&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">outer_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func start&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">inner_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># &amp;lt;- this does `dot` and is also decorated&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">global_shape&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;global_shape&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;partition&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;dtype&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;axis&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">dist_arr&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func ends&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The text output is&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[decorator]outer_func_with_mark: total runtime: 0.001206 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> [decorator]inner_func_with_mark: total runtime: 0.000351 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin array constructor--&amp;gt;Begin dot: 0.000026 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin dot--&amp;gt;Finish dot: 0.000322 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Outer func start--&amp;gt;Outer func ends: 0.001202 s
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking is controlled via the environment variable &lt;code>BENCH_PYLOPS_MPI&lt;/code>. It defaults to &lt;code>1&lt;/code> (enabled) but can be set to &lt;code>0&lt;/code> to skip benchmarking for clean output. &lt;strong>This means users can leave the decorated code unchanged and disable the benchmark through the environment variable&lt;/strong>. This is inspired by the C++ debug flags set at compile time. Moreover, careful attention had to be paid to concurrency issues in benchmarking, because time is recorded by the CPU while NCCL issues operations asynchronously to a CUDA stream; &lt;a href="https://github.com/PyLops/pylops-mpi/pull/163" target="_blank" rel="noopener">PR #163&lt;/a> is an example of this.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-result">Benchmark Result&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>This was the moment of truth. Our 12 weeks of hard work would be judged by a set of cold, hard numbers. Our expectations were:&lt;/p>
&lt;ul>
&lt;li>If the system does not have proprietary NVLink for GPU-GPU communication but is NCCL-compatible, communication using &lt;code>CuPy + NCCL&lt;/code> should still be faster than &lt;code>NumPy + MPI&lt;/code> (and possibly &lt;code>CuPy + MPI&lt;/code>) in PyLops-MPI, i.e., there should be a benefit from the communication-related optimizations enabled by this project.&lt;/li>
&lt;/ul>
&lt;p>The result below was from the NCSA UIUC Delta system, a &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">4-way NVIDIA A40 GPU&lt;/a> node (no NVLink), with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b139e63d-11ed-47f4-95f8-5e86bed26312" />
&lt;/p>
&lt;p>That meets our expectation. One thing to note here: &lt;code>CuPy + MPI&lt;/code> communication is actually slower than &lt;code>NumPy + MPI&lt;/code>. This is because the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> - see details &lt;a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html" target="_blank" rel="noopener">here&lt;/a>. The choice was made for its simplicity, as it allows sending and receiving generic Python objects wrapped in a &lt;code>list&lt;/code> and thus a fast development process. However, these calls require copying memory from GPU to CPU, communicating, and then copying from CPU back to GPU (the pickle protocol) - see our discussion with the &lt;code>mpi4py&lt;/code> community &lt;a href="https://github.com/mpi4py/mpi4py/discussions/657" target="_blank" rel="noopener">here&lt;/a>. This leads us to the &amp;ldquo;Things left to do&amp;rdquo; section below.&lt;/p>
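&lt;p>The overhead comes from the generic-object protocol itself: mpi4py&amp;rsquo;s lowercase methods (e.g. &lt;code>comm.allreduce&lt;/code>) pickle arbitrary Python objects, while the uppercase methods (e.g. &lt;code>comm.Allreduce&lt;/code>) operate directly on memory buffers. The pickle round trip below simulates what the lowercase path does to an array (a standalone sketch, no MPI required):&lt;/p>

```python
import pickle
import numpy as np

a = np.ones(1_000_000, dtype=np.float64)

# Lowercase mpi4py methods serialize the object with pickle, send the bytes,
# and deserialize on the receiver: a full extra copy on each side (and, for
# GPU arrays, a device-to-host transfer first).
wire_bytes = pickle.dumps(a)
received = pickle.loads(wire_bytes)

# The round trip yields an equal array but a brand-new memory buffer.
assert np.array_equal(received, a)
assert received.ctypes.data != a.ctypes.data

# Uppercase methods (e.g. comm.Allreduce) instead read and write the array's
# existing buffer directly, with no serialization step.
```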
&lt;ul>
&lt;li>If the system has NVLink for GPU-GPU communication, we will see a significant performance gain in PyLops-MPI with NCCL.&lt;/li>
&lt;/ul>
&lt;p>The result below is also from the NCSA UIUC Delta system, this time an &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">8-way NVIDIA H200 GPU&lt;/a> node (with NVLink), but we used only 4 GPUs to compare with the previous result. This is also with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b3d83547-b9af-4b1c-87c0-ace2302eb140" />
&lt;/p>
&lt;p>Here we unleash the true power of NCCL and its infrastructure: &lt;strong>the bandwidth of PyLops-MPI with NCCL is 800x that of the MPI implementation!&lt;/strong> It may not make much sense to compare this number with &lt;code>NumPy + MPI&lt;/code>, because a drastic hardware infrastructure upgrade is involved.&lt;/p>
&lt;p>To top things off, we also ran an experiment trying to saturate the communication, with the array size going up to 32 GB in total. We can see linear scaling, i.e., time vs. data size grows linearly.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/e5a95fdc-8db7-4caf-925f-256f504603bc" />
&lt;/p>
&lt;p>Finally, we ran an experiment with the application of &lt;a href="https://wiki.seg.org/wiki/Least-squares_migration" target="_blank" rel="noopener">Least-squares Migration&lt;/a>, which is an iterative inversion scheme:&lt;/p>
&lt;ul>
&lt;li>Each iteration applies a forward &lt;code>A&lt;/code> and an adjoint &lt;code>A.T&lt;/code> operation to form residuals and gradients.&lt;/li>
&lt;li>Gradient accumulation requires a global reduction across processes with &lt;code>allreduce&lt;/code>.
Note that the computation is not trivial, so the total run-times of CPU and GPU are not fairly comparable (notice that on the H200, &lt;code>CuPy + MPI&lt;/code> is no longer the slowest). But we want to give an idea of how things piece together in a real application.&lt;/li>
&lt;/ul>
&lt;div align="center">
&lt;img width="400" height="300" alt="kirchA40"
src="https://gist.github.com/user-attachments/assets/46c3a76a-20a3-40c3-981e-6e1c4acecb49" />
&lt;img width="400" height="300" alt="kirchhoff_h200"
src="https://gist.github.com/user-attachments/assets/1439304a-8f78-4640-a78b-ba37238b26e6" />
&lt;/div>
&lt;h3 id="the-impact-of-this-gsoc-project-is-clear">The impact of this GSoC project is clear:&lt;/h3>
&lt;p>With our NCCL-enabled PyLops-MPI,&lt;/p>
&lt;ul>
&lt;li>if you don&amp;rsquo;t have access to state-of-the-art infrastructure, PyLops-MPI with NCCL can still deliver 10x the communication bandwidth (the A40 case);&lt;/li>
&lt;li>if you do, we let you get the most out of the system (the H200 case).&lt;/li>
&lt;/ul>
&lt;p>And best of all, using NCCL with PyLops-MPI requires minimal code changes, as shown in this &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/tutorials_nccl/lsm_nccl.py" target="_blank" rel="noopener">LSM Tutorial&lt;/a> and illustrated below. Only two changes are required relative to the code that runs on MPI: the arrays must be allocated on the GPU, and the NCCL communicator has to be passed to the &lt;code>DistributedArray&lt;/code>. And that&amp;rsquo;s it!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_comm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">_nccl&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">initialize_nccl_comm&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - same set-up as running with MPI&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LSM&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">wav&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x0&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">VStack&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">BROADCAST&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">nccl_comm&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Explicitly pass nccl communicator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Must use CuPy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - the rest is the same&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="things-left-to-do">Things left to do&lt;/h3>
&lt;ul>
&lt;li>CUDA-aware MPI: As we pointed out in the A40 experiment, the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> and thus introduces memory copying from GPU to CPU. We aim to optimize this by introducing buffered calls. However, this is not a trivial task, because some of the MPI-related code was developed around the semantics that a communication returns a &lt;code>list&lt;/code> object, while a buffered call returns an array instead.&lt;/li>
&lt;/ul></description></item><item><title>AIDRIN Privacy-Centric Enhancements: Backend &amp; UX Upgrades</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250725-harish_balaji/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250725-harish_balaji/</guid><description>&lt;p>⏱️ Reading time: 5–6 minutes&lt;/p>
&lt;p>Hey everyone,&lt;/p>
&lt;p>If you’ve ever wondered what it takes to make AI data pipelines not just smarter, but safer and more transparent, you’re in the right place. The last few weeks working on AIDRIN for GSoC have been a deep dive into the engine room of privacy and backend systems that power the AIDRIN project. My focus has been on building out the core privacy infrastructure and backend features that power AIDRIN’s ability to give users real, actionable insights about their data. It’s been challenging, sometimes messy, but incredibly rewarding to see these changes make a tangible difference.&lt;/p>
&lt;p>Having Dr. Jean Luca Bez and Prof. Suren Byna as mentors, along with the support of the entire team, has truly made all the difference. Their guidance, encouragement, and collaborative spirit have been a huge part of this journey, whether I’m brainstorming new ideas or just trying to untangle a tricky bug.&lt;/p>
&lt;h2 id="privacy-metrics-making-data-safer">Privacy Metrics: Making Data Safer&lt;/h2>
&lt;p>A major part of my work has been putting data privacy at the front and center in AIDRIN. I focused on integrating essential privacy metrics like k-anonymity, l-diversity, t-closeness, and more, making sure they’re not just theoretical checkboxes, but real tools that users can interact with and understand. Now, these metrics are fully wired up in the backend and visualized in AIDRIN, so privacy risks are no longer just a vague concern. They are something AI data preparers can actually see and act on. Getting these metrics to work seamlessly with different datasets and ensuring their accuracy took some serious backend engineering, but the payoff has been worth it.&lt;/p>
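&lt;p>For readers unfamiliar with these metrics, k-anonymity is the size of the smallest group of records that share a combination of quasi-identifier values. A toy computation (illustrative only, not AIDRIN&amp;rsquo;s implementation):&lt;/p>

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    # k is the size of the smallest equivalence class: the minimum number of
    # records sharing the same quasi-identifier combination
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"age": "30-40", "zip": "95060", "diagnosis": "A"},
    {"age": "30-40", "zip": "95060", "diagnosis": "B"},
    {"age": "40-50", "zip": "95064", "diagnosis": "A"},
]
# The (40-50, 95064) class contains a single record, so k = 1: that
# individual is re-identifiable from age bracket and zip code alone.
assert k_anonymity(records, ["age", "zip"]) == 1
```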
&lt;h2 id="speeding-things-up-so-you-dont-have-to-wait-around">Speeding Things Up (So You Don’t Have To Wait Around)&lt;/h2>
&lt;p>As AIDRIN started handling bigger datasets, some of the calculations became time-consuming because the data has to be accessed every time a metric is computed. To address this, I added caching for previously computed metrics, like class imbalance and privacy checks, and set up asynchronous execution with Celery and Redis. This should make the app super responsive. Rather than waiting for heavy computations to finish, one can start taking notes about other metrics or explore different parts of the app while the results load in the background. It&amp;rsquo;s a small change, but it helps keep the workflow moving smoothly.&lt;/p>
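&lt;p>Conceptually, such a cache keys results by dataset and metric so repeated requests skip recomputation. A simplified in-memory sketch (illustrative only; AIDRIN uses Celery and Redis for the real asynchronous and caching machinery):&lt;/p>

```python
_cache = {}
calls = []

def compute_metric(dataset_id, metric, compute_fn):
    # return a cached result when available; otherwise compute and store it
    key = (dataset_id, metric)
    if key not in _cache:
        calls.append(key)  # track which requests triggered real computation
        _cache[key] = compute_fn()
    return _cache[key]

r1 = compute_metric("d1", "class_imbalance", lambda: 0.42)
r2 = compute_metric("d1", "class_imbalance", lambda: 0.42)
assert r1 == r2 == 0.42
assert len(calls) == 1  # the second request was served from the cache
```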
&lt;h2 id="small-touch-ups-that-hopefully-make-a-big-difference">Small Touch Ups That (Hopefully) Make a Big Difference&lt;/h2>
&lt;p>I also spent time on the details that make the app easier to use. Tooltips now explain what the privacy metrics actually mean, error messages are clearer, and there’s a new cache info page where you can see and clear your cached data. The sensitive attribute dropdown is less confusing now, especially if you’re working with quasi-identifiers. These tweaks might seem minor, but they add up and make the app friendlier for everyone.&lt;/p>
&lt;h2 id="docs-docs-docs">Docs, Docs, Docs&lt;/h2>
&lt;p>I’m a big believer that good documentation is just as important as good code. I updated the docs to cover all the new features, added citations for the privacy metrics, and made the install process a bit more straightforward. Hopefully, this means new users and contributors can get up to speed without too much hassle.&lt;/p>
&lt;h2 id="huge-thanks-to-my-mentors-and-the-team">Huge Thanks to My Mentors and the Team&lt;/h2>
&lt;p>I really want to shine a light on Dr. Bez, Prof. Byna, and the entire AIDRIN team here. Their encouragement, practical advice, and collaborative spirit have been a huge part of my progress. Whether I’m stuck on a bug, brainstorming a new feature, or just need a second opinion, there’s always someone ready to help me think things through. Their experience and support have shaped not just the technical side of my work, but also how I approach problem-solving and teamwork.&lt;/p>
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Looking ahead, I’m planning to expand AIDRIN’s support for multimodal datasets and keep refining the privacy and fairness modules. There’s always something new to learn or improve, and I’m excited to keep building. If you’re interested in data quality, privacy, or open-source AI tools, I’d love to connect and swap ideas.&lt;/p>
&lt;p>Thanks for reading and for following along with my GSoC journey. I’ll be back soon with more updates!&lt;/p>
&lt;p>&lt;em>This is the second post in my 3-part GSoC series with AIDRIN. Stay tuned for the final update.&lt;/em>&lt;/p></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-1">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 1&lt;/h1>
&lt;p>Hello all! 👋 My name is Tharit, and I&amp;rsquo;m a computer science student at the University of Texas at Austin. This summer, I am fortunate to participate in the Google Summer of Code (GSoC) 2025 program, hosted by &lt;a href="https://ucsc-ospo.github.io/" target="_blank" rel="noopener">UC OSPO&lt;/a> and the &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> team. My project focuses on enabling NCCL GPU-to-GPU communication in &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, under the guidance of mentors Matteo Ravasi and Yuxi Hong.&lt;/p>
&lt;p>You might have come across this post if you&amp;rsquo;re a PyLops user interested in scaling &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> with GPU/NCCL support, or if you&amp;rsquo;re exploring GSoC projects and wondering what we are up to. Either way, I hope this post gives you useful insights.&lt;/p>
&lt;h2 id="what-is-pylops-mpi">What is PyLops-MPI?&lt;/h2>
&lt;p>If you&amp;rsquo;ve worked with inverse problems, you&amp;rsquo;ve likely come across &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a>. It&amp;rsquo;s a Python library that provides an extensive suite of linear operators and solvers. Operators are designed with a clear focus on the forward and adjoint pair (A and A.T), whilst solvers take operators and data to solve the associated inverse problem. In fields such as geophysics, astrophysics, or medical imaging, inverse problems are routinely solved to &lt;a href="https://www.ae.utexas.edu/news/inverse-problem-solving-bui-than" target="_blank" rel="noopener">image the Earth, space, or the human body from remote measurements&lt;/a>. In all cases, real-life problems tend to demand substantial compute and memory. PyLops allows users to express these problems in an abstract manner that is reminiscent of the underlying equations whilst not compromising on efficiency.&lt;/p>
&lt;p>&lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> is the distributed extension of PyLops, introduced during &lt;a href="https://summerofcode.withgoogle.com/archive/2023/projects/eNJTJO25" target="_blank" rel="noopener">GSoC 2023&lt;/a>. It enables users to scale their computations over CPU and GPU clusters via MPI. However, up until now, even GPU-based communications were routed through MPI, introducing potential performance bottlenecks.&lt;/p>
&lt;h2 id="the-goal-of-the-project">The Goal of the Project&lt;/h2>
&lt;p>Our goal is to take PyLops-MPI to the next level by enabling GPU-to-GPU collective communications directly using NVIDIA NCCL. This allows full utilization of high-bandwidth interconnects like &lt;a href="https://www.nvidia.com/en-us/data-center/nvlink/" target="_blank" rel="noopener">NVLink&lt;/a>, and avoids unnecessary memory transfers through the host CPU.
This blog marks the midpoint of the program (week 6 of 12), and I’d like to reflect on the progress so far, challenges faced, and what&amp;rsquo;s coming next.&lt;/p>
&lt;h2 id="what-is-a-collective-communication-anyway">What is a Collective Communication anyway?&lt;/h2>
&lt;p>In PyLops-MPI, distributed computations require nodes to exchange information, for example, during gradient computations or reductions in iterative solvers. A naive implementation (useful for a thought experiment) would involve each node taking turns broadcasting data, which can be quite slow.
NVIDIA’s NCCL abstracts away the complexity of topology-aware communication. For example, in the image below, if the GPUs communicate most efficiently in a ring for an all-reduce operation, NCCL will automatically pick that layout and avoid the GPU 01-GPU 04 and GPU 02-GPU 03 communication links.&lt;/p>
&lt;p align="center">
&lt;img src="network.png" alt="network" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a compute node with 4 GPUs attached, directly connected to each other with NVLink&lt;/em>&lt;/p>
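&lt;p>To make the ring idea concrete, here is a small pure-NumPy simulation of a ring all-reduce (illustrative only: real NCCL runs these phases in parallel on device buffers, and the function name is mine). Each rank forwards one segment per step during a reduce-scatter phase, then the fully reduced segments circulate back around the ring in an all-gather phase.&lt;/p>

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate a ring all-reduce over a list of equal-length per-rank arrays.

    After the reduce-scatter and all-gather phases, every rank holds the
    element-wise sum of all input buffers.
    """
    n = len(buffers)
    # Every rank splits its buffer into n segments in the same way.
    segs = [list(np.array_split(b.astype(float), n)) for b in buffers]
    # Phase 1: reduce-scatter. In step s, rank r forwards segment (r - s) % n
    # to its ring neighbour (r + 1) % n, accumulating partial sums.
    for step in range(n - 1):
        for r in range(n):
            k = (r - step) % n
            dst = (r + 1) % n
            segs[dst][k] = segs[dst][k] + segs[r][k]
    # After n-1 steps, rank r holds the fully reduced segment (r + 1) % n.
    # Phase 2: all-gather. The reduced segments circulate around the ring.
    for step in range(n - 1):
        for r in range(n):
            k = (r + 1 - step) % n
            dst = (r + 1) % n
            segs[dst][k] = segs[r][k]
    return [np.concatenate(s) for s in segs]
```

&lt;p>Note how each rank only ever talks to its immediate neighbour, which is exactly why the diagonal links in the figure above go unused.&lt;/p>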
&lt;h2 id="what-we-achieved-so-far">What we achieved, so far&lt;/h2>
&lt;p>It is probably best to tell the story through the sequence of pull requests.&lt;/p>
&lt;h3 id="core-changes-in-distributedarray-pr-130httpsgithubcompylopspylops-mpipull130">Core Changes in DistributedArray (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/130" target="_blank" rel="noopener">PR #130&lt;/a>)&lt;/h3>
&lt;p>This PR introduces NCCL support into the &lt;code>DistributedArray&lt;/code> class. The design allows users to optionally pass both a &lt;code>NcclCommunicator&lt;/code> and a &lt;code>MPI.Comm&lt;/code>. By doing so, small control data (e.g., shape, dtype) is still exchanged via MPI, leveraging Python&amp;rsquo;s flexibility and minimizing performance impact. As you will see, this decision to keep two communicators turns out to be a good call.
This is how the &lt;code>__init__&lt;/code> method of &lt;code>DistributedArray&lt;/code> looks with the new addition marked by a comment:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">global_shape&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Comm&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">COMM_WORLD&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">NcclCommunicatorType&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Added to this line&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Partition&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">SCATTER&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">int&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">local_shapes&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">]]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mask&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Integral&lt;/span>&lt;span class="p">]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="the-cupys-nccl-api">The CuPy&amp;rsquo;s NCCL API&lt;/h3>
&lt;p>NCCL&amp;rsquo;s API (&lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html" target="_blank" rel="noopener">mirroring its C++ origins&lt;/a>) is minimalistic and requires manual memory management. One prominent example is the implementation of &lt;code>allGather()&lt;/code>. Previously, with &lt;code>mpi4py&lt;/code>, we could leverage Python&amp;rsquo;s dynamic typing (everything is an object, so one simply sends the object), which allows different ranks to send arrays of different sizes. NCCL, by contrast, requires every rank in the communicator to send the same number of elements. To work around this, we implemented padding and reshaping logic for multi-dimensional arrays. NCCL treats arrays as contiguous byte streams, so the padding must be handled carefully &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>.&lt;/p>
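&lt;p>A rough sketch of the pad-then-trim workaround (NumPy standing in for CuPy; the helper name is mine, not the PyLops-MPI API): each rank pads its flattened buffer up to the largest local size, the equal-size gather runs, and the padding is trimmed away afterwards using the true local sizes, which are exchanged beforehand over MPI.&lt;/p>

```python
import numpy as np

def allgather_unequal(local_arrays):
    """Emulate an NCCL-style allGather when ranks hold unequal local sizes.

    ncclAllGather assumes every rank contributes the same element count,
    so each rank zero-pads its flattened buffer to the maximum local size;
    the padding is then trimmed using the true sizes.
    """
    sizes = [a.size for a in local_arrays]
    pad_to = max(sizes)
    # Each rank flattens and zero-pads its contribution to a common size.
    padded = [np.concatenate([a.ravel(), np.zeros(pad_to - a.size)])
              for a in local_arrays]
    # This concatenation is what the equal-size gather would produce.
    gathered = np.concatenate(padded)
    # Trim: recover each rank's true contribution from the byte stream.
    out = [gathered[r * pad_to : r * pad_to + sizes[r]]
           for r in range(len(local_arrays))]
    return np.concatenate(out)
```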
&lt;p>Moreover, we had to accommodate NCCL’s lower-level API, which lacks conveniences like communicator split variants. Internally, we introduced unified abstractions such as &lt;code>_allgather()&lt;/code>, &lt;code>_allreduce()&lt;/code>, &lt;code>send()&lt;/code>, &lt;code>recv()&lt;/code>, etc. in &lt;code>DistributedArray&lt;/code> and modified the communication model to work seamlessly whether MPI or NCCL is used. This way, other developers can focus on developing new operators (that suit their needs) without worrying about which communicator is underneath.&lt;/p>
&lt;p align="center">
&lt;img src="partition.png" alt="partition" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a challenge coming from having an unevenly distributed array&lt;/em>&lt;/p>
&lt;h3 id="keep-things-small-dependency-management-pr-132httpsgithubcompylopspylops-mpipull132-and-pr-135httpsgithubcompylopspylops-mpipull135">Keep things small: Dependency management (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/132" target="_blank" rel="noopener">PR #132&lt;/a> and &lt;a href="https://github.com/PyLops/pylops-mpi/pull/135" target="_blank" rel="noopener">PR #135&lt;/a>)&lt;/h3>
&lt;p>Despite adding this new capability, we are fully aware that not every user has access to a cluster of GPUs, and therefore we don’t make NCCL and CuPy mandatory dependencies. The first time someone installs and experiments with PyLops-MPI, they are likely to run it on a single-node desktop, and we don’t want to introduce such complexity early on. This means that our code has to accommodate an “optional dependency” through a “protected import”. If we had &lt;code>import cupy as cp&lt;/code> at the beginning of &lt;code>DistributedArray&lt;/code>, users without a GPU would hit an error before doing anything useful at all.
In other words, our library should import CuPy and NCCL only when the system supports them and the user asks for them. The pattern looks like &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/pylops_mpi/utils/deps.py" target="_blank" rel="noopener">this&lt;/a>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">util&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">find_spec&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="kc">None&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">getenv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;NCCL_PYLOPS_MPI&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">nccl_test&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># try import CuPy and then check for NCCL&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">nccl&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="n">available&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># success&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># unable to import but the package is installed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># package is not installed or the environment variable disables it&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Finally, set nccl_enabled flag for other module to use for protected import&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This preserves PyLops-MPI’s minimal base installation, but it required carefully isolating imports and adapting the module resolution logic with a backend-dispatching mechanism.
This is something I had never had to consider before.&lt;/p>
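&lt;p>A minimal, generic version of this protected-import pattern might look like the following (the helper and flag names here are illustrative; the real logic lives in &lt;code>pylops_mpi/utils/deps.py&lt;/code>):&lt;/p>

```python
import os
from importlib import util

def optional_backend_enabled(package, env_var):
    """Return True when `package` is installed and the environment-variable
    switch (default "1") has not disabled it.

    This mirrors the protected-import idea: probe with find_spec() instead
    of importing, so no ImportError can fire at module load time.
    """
    requested = int(os.getenv(env_var, 1)) == 1
    installed = util.find_spec(package) is not None
    return requested and installed

# Other modules then guard their imports on the resulting flag:
nccl_enabled = optional_backend_enabled("cupy", "NCCL_PYLOPS_MPI")
if nccl_enabled:
    # import cupy as cp  (safe: only runs when available and requested)
    pass
```

&lt;p>Setting &lt;code>NCCL_PYLOPS_MPI=0&lt;/code> lets a user opt out even on a machine where CuPy is installed.&lt;/p>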
&lt;h3 id="the-basic-operator-with-nccl-pr-137httpsgithubcompylopspylops-mpipull137">The Basic Operator with NCCL &lt;a href="https://github.com/PyLops/pylops-mpi/pull/137" target="_blank" rel="noopener">PR 137&lt;/a>&lt;/h3>
&lt;p>We chose &lt;code>MPIVStack&lt;/code> as the first operator to implement NCCL support due to its simplicity. Several design choices emerged:&lt;/p>
&lt;h4 id="implicit-communicator-propagation">Implicit Communicator Propagation&lt;/h4>
&lt;p>We updated forward and adjoint calls to propagate the &lt;code>base_comm_nccl&lt;/code> from input to output automatically. This way, if &lt;code>x&lt;/code> is NCCL-enabled, then &lt;code>y = A @ x&lt;/code> or &lt;code>A.H @ x&lt;/code> will also be NCCL-enabled. This avoids mismatches and keeps operator pipelines consistent.&lt;/p>
&lt;p>Interestingly, and contrary to our initial expectation, the operator itself did not need to take &lt;code>base_comm_nccl&lt;/code> as an argument the way &lt;code>DistributedArray&lt;/code> does. This is good news: it reduces the number of communication cases other developers have to handle when adding new operators, lowering the barrier to extending PyLops-MPI.&lt;/p>
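&lt;p>A toy sketch of the propagation idea (these classes are illustrative stand-ins, not the PyLops-MPI API): the operator never stores a communicator itself; the output array simply inherits whatever communicators the input carries.&lt;/p>

```python
import numpy as np

class ToyDistributedArray:
    """Toy stand-in for DistributedArray: it carries its communicators."""
    def __init__(self, data, base_comm=None, base_comm_nccl=None):
        self.data = np.asarray(data)
        self.base_comm = base_comm
        self.base_comm_nccl = base_comm_nccl

class ToyOperator:
    """Toy diagonal operator: note it never takes base_comm_nccl itself."""
    def __init__(self, diag):
        self.diag = np.asarray(diag)

    def __matmul__(self, x):
        # The output inherits the input's communicators, so an
        # NCCL-enabled x automatically yields an NCCL-enabled y.
        return ToyDistributedArray(self.diag * x.data,
                                   base_comm=x.base_comm,
                                   base_comm_nccl=x.base_comm_nccl)
```

&lt;p>With this pattern, &lt;code>y = A @ x&lt;/code> stays on the same communication backend as &lt;code>x&lt;/code> throughout an operator pipeline.&lt;/p>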
&lt;h4 id="optional-dual-communicator-design">Optional Dual-Communicator Design&lt;/h4>
&lt;p>As with DistributedArray, the ability to pass both an MPI communicator and an NCCL communicator proved to be a sound decision. By maintaining NCCL as an optional backend, we gain fine-grained control over which communication paths use NCCL versus MPI. This flexibility allowed us to optimize performance-critical paths while retaining MPI for control messages and small metadata transfers.&lt;/p>
&lt;p>In particular, when communicating ghost cells, which are used for computations around boundaries (such as derivative calculations), small metadata like cell_fronts (typically a list with one integer per rank) continues to be transmitted via MPI. This metadata is needed to allocate send/receive buffers, and sending it as a plain Python &lt;code>list[int]&lt;/code> leverages Python&amp;rsquo;s object serialization without incurring GPU synchronization costs. The actual cell arrays, which can be large, are communicated with NCCL.&lt;/p>
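&lt;p>A simplified single-process illustration of why ghost cells are needed (NumPy only, and the function is my own sketch; in the real code the neighbour&amp;rsquo;s cells arrive over NCCL while the buffer sizes travel over MPI): a forward difference at a rank&amp;rsquo;s right edge needs one cell owned by the next rank.&lt;/p>

```python
import numpy as np

def derivative_with_ghosts(chunks):
    """First-order forward difference over an array split across 'ranks'.

    Each rank appends one ghost cell taken from its right neighbour before
    differencing, so the result matches np.diff on the global array.
    """
    outs = []
    for r, local in enumerate(chunks):
        if r + 1 == len(chunks):
            ext = local                       # last rank: no right neighbour
        else:
            ghost = chunks[r + 1][:1]         # one cell 'received' from rank r+1
            ext = np.concatenate([local, ghost])
        outs.append(np.diff(ext))
    return np.concatenate(outs)
```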
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Aside from enabling NCCL support for the remaining operators (with full test coverage), other exciting upcoming updates are:&lt;/p>
&lt;ul>
&lt;li>Complex-number type support for NCCL&lt;/li>
&lt;li>Benchmarking results on a real HPC system&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for Part 2, and thanks for reading!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>For the best performance, mpi4py would require buffer memory allocation as well. The mpi4py package provides two interfaces: buffered and non-buffered. Currently, PyLops-MPI takes the non-buffered approach, which suggests room for optimization.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Improving AI Data Pipelines in AIDRIN: A Privacy-Centric and Multimodal Expansion</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</link><pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</guid><description>&lt;p>⏱️ Reading time: 4–5 minutes&lt;/p>
&lt;p>Hi 👋&lt;/p>
&lt;p>I’m Harish Balaji, a Master’s student at NYU with a focus on Artificial Intelligence, Machine Learning, and Cybersecurity. I’m especially interested in building scalable systems that reflect responsible AI principles. For me, data quality isn’t just a technical detail. It’s a foundational aspect of building models that are reliable, fair, and reproducible in the real world.&lt;/p>
&lt;p>This summer, I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of Google Summer of Code 2025. I’m grateful to be working under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the &lt;a href="https://crd.lbl.gov/divisions/scidata/sdm/" target="_blank" rel="noopener">Scientific Data Management Group&lt;/a> at Lawrence Berkeley National Laboratory (LBNL).&lt;/p>
&lt;p>AIDRIN is an open-source framework that helps researchers and practitioners evaluate whether a dataset is truly ready to be used in production-level AI workflows. From fairness to privacy, it provides a structured lens through which we can understand the strengths and gaps in our data.&lt;/p>
&lt;h2 id="why-this-work-matters">Why this work matters&lt;/h2>
&lt;p>In machine learning, one principle always holds true:&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;Garbage in, garbage out.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;p>Even the most advanced models can underperform or amplify harmful biases if trained on incomplete, imbalanced, or poorly understood data. This is where AIDRIN steps in. It provides practical tools to assess datasets across key dimensions like privacy, fairness, class balance, interpretability, and support for multiple modalities.&lt;/p>
&lt;p>By making these characteristics measurable and transparent, AIDRIN empowers teams to make informed decisions early in the pipeline. It helps ensure that datasets are not only large or complex, but also trustworthy, representative, and purpose-fit.&lt;/p>
&lt;h2 id="my-focus-this-summer">My focus this summer&lt;/h2>
&lt;p>As part of my GSoC 2025 project, I’ll be focusing on extending AIDRIN’s evaluation capabilities. A big part of this involves strengthening its support for privacy metrics and designing tools that can handle non-tabular datasets, such as image-based data.&lt;/p>
&lt;p>The goal is to expand AIDRIN’s reach without compromising on interpretability or ease of use. More technical insights and updates will follow in the next posts as the summer progresses.&lt;/p>
&lt;h2 id="what-comes-next">What comes next&lt;/h2>
&lt;p>As the AI community continues to evolve, there’s a growing shift toward data-centric practices. I believe frameworks like AIDRIN are essential for helping us move beyond the question of &lt;em>&amp;ldquo;Does the model work?&amp;rdquo;&lt;/em> toward a deeper and more meaningful one: &lt;em>&amp;ldquo;Was the data ready in the first place?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>Over the next few weeks, I’ll be working on development, testing, and integration. I’m excited to contribute to a tool that emphasizes transparency and reproducibility across the AI lifecycle, and to share lessons and ideas with others who care about responsible AI.&lt;/p>
&lt;p>If you’re exploring similar challenges or working in the space of dataset evaluation and readiness, I’d love to connect and exchange thoughts. You can also read my full GSoC 2025 proposal below for more context around the project scope and vision:&lt;/p>
&lt;p>👉 &lt;a href="https://drive.google.com/file/d/1RUyU2fHkc8GZ9vTj5SUr6jj84ZaRUvNt/view" target="_blank" rel="noopener">Read my GSoC 2025 proposal here&lt;/a>&lt;/p>
&lt;p>&lt;em>This is the first in a 3-part blog series documenting my GSoC journey with AIDRIN. Stay tuned for technical updates and behind-the-scenes insights as the summer unfolds!&lt;/em>&lt;/p></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</link><pubDate>Sun, 08 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</guid><description>&lt;h1 id="google-summer-of-code-25-optimizing-and-benchmarking-gpu-collective-communication-of-pylops-mpi-with-nccl">Google Summer of Code ‘25: Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL&lt;/h1>
&lt;p>My project aims to introduce GPU-to-GPU collective communication calls using Nvidia&amp;rsquo;s NCCL to &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, an extension of the powerful &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> library.&lt;/p>
&lt;p>I&amp;rsquo;m incredibly grateful for this opportunity and excited to be mentored by two HPC experts, Yuxi Hong from Lawrence Berkeley National Laboratory and Matteo Ravasi from ShearWater GeoServices.&lt;/p>
&lt;p>Here&amp;rsquo;s also the link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/C2XSZp2E" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;h2 id="what-is-pylops-mpi-and-nccl-">What is PyLops-MPI and NCCL ?&lt;/h2>
&lt;p>PyLops is a Python library that provides a rich collection of linear operators to solve inverse problems. Its MPI extension, PyLops-MPI, takes this a step further by enabling these operations to run on large-scale, distributed computing systems like HPC using the Message-Passing Interface (MPI).&lt;/p>
&lt;p>Where does NCCL fit in? The NVIDIA Collective Communication Library (NCCL) is a library of highly optimized routines for collective communication between GPUs. It offers the opportunity to close a performance gap in PyLops-MPI: as we offload more and more computationally intensive tasks to GPUs, the communication between them can become a bottleneck. NCCL offers a powerful solution to this problem, enabling high-bandwidth, low-latency communication that can significantly boost performance.&lt;/p>
&lt;h2 id="motivation-and-what-was-missing">Motivation and What was Missing&lt;/h2>
&lt;p>As a student with a background in geophysics (B.Sc.) and now pursuing computer science (M.Sc.), I&amp;rsquo;ve experienced firsthand the challenges of scaling scientific computing research from a personal desktop to a high-performance computing (HPC) cluster. It can be a significant hurdle. My project aims to ease this transition for PyLops-MPI users. PyLops-MPI is something I wish existed while I was doing my undergraduate research!&lt;/p>
&lt;p>Currently, PyLops-MPI is &amp;ldquo;CUDA-aware,&amp;rdquo; meaning it can offload computations to GPUs. However, the communication between those GPUs is still handled by the underlying MPI implementation, which isn&amp;rsquo;t always optimal. This project addresses that gap by integrating NCCL to handle GPU-to-GPU communication directly: if the computation happens on the GPU, data shouldn&amp;rsquo;t be copied to the CPU, transferred with MPI, and then copied back to the GPU.&lt;/p>
&lt;p>This will be especially impactful for memory-bound problems where high-bandwidth communication is critical. By the end of this project, we&amp;rsquo;ll have a clear, quantifiable understanding of the performance gains achieved.&lt;/p>
&lt;h2 id="my-best-laid-plan">My Best-Laid Plan&lt;/h2>
&lt;p>My approach is grounded in good software engineering practices to ensure that this new feature is both robust and genuinely useful. I was impressed by the code quality of the repository (an enjoyable read), and I am committed to not breaking that.&lt;/p>
&lt;p>First and foremost, the goal is to seamlessly integrate NCCL without breaking what already works. A significant part of my effort will be dedicated to rigorous testing. This means not only ensuring that all existing tests pass but also developing a new, comprehensive test suite to validate the correctness of the GPU-to-GPU communication across different hardware setups.&lt;/p>
&lt;p>Once we&amp;rsquo;re confident that the integration is solid, the exciting part begins: benchmarking (or you may call it &amp;ldquo;Moment of Truth&amp;rdquo;)! The plan is to measure the performance of end-to-end iterative solvers. These solvers are a perfect test case because they involve a mix of intensive gradient computations on the GPU and frequent AllReduce calls to sync up processes. This will give us a clear picture of the speedup and efficiency gains from using NCCL.&lt;/p>
&lt;p>Finally, to make sure this work benefits the entire community, I will create clear documentation and tutorials. The goal is to make it easy for any user to leverage this new GPU-accelerated communication in their own research and applications.&lt;/p></description></item><item><title>Exploration of I/O Reproducibility with HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</link><pubDate>Wed, 19 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</guid><description>&lt;p>Parallel I/O is a critical component in high-performance computing (HPC), allowing multiple processes to read and write data concurrently from a shared storage system. &lt;a href="https://github.com/HDFGroup/hdf5" target="_blank" rel="noopener">HDF5&lt;/a>—a widely adopted data model and library for managing complex scientific data—supports parallel I/O but introduces challenges in I/O reproducibility, where repeated executions do not always produce identical results. This lack of reproducibility can stem from non-deterministic execution orders, variations in collective buffering strategies, and race conditions in metadata and dataset chunking operations within HDF5’s parallel I/O hierarchy. Moreover, many HDF5 operations that leverage &lt;a href="%28https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf%29">MPI I/O&lt;/a> require collective communication; that is, all processes within a communicator must participate in operations such as metadata creation, chunk allocation, and data aggregation. These collective calls ensure that the file structure and data layout remain consistent across processes, but they also introduce additional synchronization complexity that can impact reproducibility if not properly managed. 
In HPC scientific workflows, consistent I/O reproducibility is essential for accurate debugging, validation, and benchmarking, ensuring that scientific results are both verifiable and trustworthy. Tools such as &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a>—a suite of I/O kernels designed to exercise HDF5 I/O on parallel file systems—play an important role in identifying these reproducibility challenges, tuning performance, and ultimately supporting the overall robustness of large-scale scientific applications.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing parallel I/O operations in &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">HDF5&lt;/a> with &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> miniapps, (2) exploring and validating potential reproducibility challenges within the parallel I/O hierarchy (e.g., MPI I/O), and (3) implementing solutions to address parallel I/O reproducibility.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Parallel I/O&lt;/code> &lt;code>MPI-I/O&lt;/code> &lt;code>Reproducibility&lt;/code> &lt;code>HPC&lt;/code> &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/wei-zhang/">Wei Zhang&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>