<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>NCCL | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/nccl/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/nccl/index.xml" rel="self" type="application/rss+xml"/><description>NCCL</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 05 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/nccl/</link></image><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-2">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 2&lt;/h1>
&lt;p>Hello all! 👋 This is Tharit again. I want to share this blog post about Part 2 of my Google Summer of Code project. In case you missed it, you can take a look at &lt;a href="https://ucsc-ospo.github.io/report/osre25/lbl/pylops-mpi/20250723-tharit/" target="_blank" rel="noopener">Part 1&lt;/a> as well. Without further ado, the following support was added since last time.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="complex-number-support-pr-148httpsgithubcompylopspylops-mpipull148">Complex Number Support &lt;a href="https://github.com/PyLops/pylops-mpi/pull/148" target="_blank" rel="noopener">PR #148&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Between this PR and the previous one, there was a lot of debugging and testing to make sure that every existing &lt;code>MPILinearOperator&lt;/code> works under NCCL as it does with &lt;code>mpi4py&lt;/code>: PRs &lt;a href="https://github.com/PyLops/pylops-mpi/pull/141" target="_blank" rel="noopener">#141&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/142" target="_blank" rel="noopener">#142&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/145" target="_blank" rel="noopener">#145&lt;/a>&lt;/em>&lt;/p>
&lt;p>Most PyLops-MPI users are scientists and engineers working on scientific problems, and many scientific problems involve complex numbers (the Fourier transform touches many things). &lt;em>NCCL does not support complex numbers out of the box&lt;/em>.&lt;/p>
&lt;p>It turned out that adding complex-number support was not a big issue. A complex number is simply a contiguous array of, say, &lt;code>float64&lt;/code>: unlike a typical &lt;code>float64&lt;/code>, one element of &lt;code>complex128&lt;/code> is represented by two &lt;code>float64&lt;/code> values. Things get more complicated once we talk about complex-number arithmetic. Luckily, &lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclRedOp_t" target="_blank" rel="noopener">NCCL semantics&lt;/a> only support the &lt;em>element-wise&lt;/em> operations &lt;code>ncclSum&lt;/code>, &lt;code>ncclProd&lt;/code>, &lt;code>ncclMin&lt;/code>, &lt;code>ncclMax&lt;/code>, and &lt;code>ncclAvg&lt;/code>, and wrapping element-wise operations for complex numbers is straightforward.&lt;/p>
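To illustrate this reinterpretation, here is a plain NumPy sketch of the idea (PyLops-MPI applies the same trick to CuPy arrays on the GPU):

```python
import numpy as np

# A complex128 array occupies the same memory as twice as many float64
# values, with real and imaginary parts interleaved.
a = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)
b = np.array([5 + 6j, 7 + 8j], dtype=np.complex128)

a_view = a.view(np.float64)  # no copy: [1., 2., 3., 4.]

# An element-wise sum over the float views equals the complex element-wise
# sum, which is why an element-wise reduction such as ncclSum can be applied
# to the reinterpreted buffer directly.
summed = (a.view(np.float64) + b.view(np.float64)).view(np.complex128)
```

Note that `view` reinterprets the existing memory without copying, which is exactly what makes this approach cheap.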
&lt;p>The change to PyLops-MPI&amp;rsquo;s &lt;code>_nccl.py&lt;/code> itself is minimal. We simply added the function below, which hides the complexity of buffer-size management from users.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">_nccl_buf_size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">count&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Get an appropriate buffer size according to the dtype of buf&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">if&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dtype&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;complex64&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;complex128&amp;#39;&lt;/span>&lt;span class="p">]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">return&lt;/span> &lt;span class="mi">2&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="mi">2&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">size&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">return&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">size&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The concept is quite simple. Mechanically, though, getting it right in the general case required some extensive bug fixing, particularly in the call to &lt;code>_allgather&lt;/code>, as noted in the &amp;ldquo;Core Changes&amp;rdquo; section of Part 1. The array needs some preprocessing (to align with NCCL semantics) and post-processing so that the result of PyLops-MPI&amp;rsquo;s NCCL allgather matches that of its MPI allgather: PyLops-MPI must be able to switch between &lt;code>mpi4py&lt;/code> and NCCL seamlessly from the user&amp;rsquo;s perspective. To make this concrete, here is how we do the &lt;code>_allgather()&lt;/code> with NCCL:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="nf">_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Allgather operation
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">deps&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">nccl_enabled&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="nb">isinstance&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nb">tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">send_shapes&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">(&lt;/span>&lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">_prepare_nccl_allgather_inputs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">raw_recv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">_unroll_nccl_allgather_recv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">raw_recv&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt; snip - MPI allgather &amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>After this feature was added, PyLops-MPI with NCCL caught up with the original MPI implementation: the test coverage is now the same, with all 306 tests passing!&lt;/strong>&lt;/p>
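The pre/post-processing around the NCCL allgather above can be sketched in plain NumPy (a simplified single-process illustration of what the PyLops-MPI helpers do on device buffers): NCCL's allgather requires every rank to contribute an equal-sized buffer, so ranks pad to the largest shape, gather, and trim afterwards.

```python
import numpy as np

def pad_to(arr, size):
    """Zero-pad a 1-D array to `size` elements; NCCL's allgather requires
    every rank to contribute a buffer of the same size."""
    out = np.zeros(size, dtype=arr.dtype)
    out[:arr.size] = arr
    return out

# Pretend three ranks hold 1-D chunks of different sizes.
chunks = [np.array([1., 2., 3.]), np.array([4.]), np.array([5., 6.])]
sizes = [c.size for c in chunks]
max_size = max(sizes)

# The allgather result is the concatenation of the equal-sized padded buffers.
raw_recv = np.concatenate([pad_to(c, max_size) for c in chunks])

# Post-processing ("unrolling"): trim each rank's slot back to its true size.
recv = np.concatenate(
    [raw_recv[r * max_size: r * max_size + n] for r, n in enumerate(sizes)]
)
```

The padding waste is bounded by the largest per-rank shape, and the trim step is what lets the NCCL path return the same result as the MPI path.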
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-instrumentation-pr-157httpsgithubcompylopspylops-mpipull157">Benchmark Instrumentation &lt;a href="https://github.com/PyLops/pylops-mpi/pull/157" target="_blank" rel="noopener">PR #157&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>Profiling distributed GPU operations is critical to understanding performance bottlenecks. To make this easier, we added a &lt;em>lightweight benchmark instrumentation&lt;/em> framework in PyLops-MPI. The goal was to allow developers to mark execution points in a function and collect timing information for these markers.&lt;/p>
&lt;p>The core of the implementation is a &lt;code>@benchmark&lt;/code> decorator. Inside a decorated function, developers can call &lt;code>mark(label)&lt;/code> to record the time at specific points. After the function completes, the timings are reported in a human-readable format. This design is inspired by C++-style instrumentation, letting developers place markers directly in the code where they are most informative.&lt;/p>
&lt;p>But because we are in Python, to handle nested function calls we collect the timing information on a stack (forming a bottom-up call graph) and parse the result at the end of the decorated function. Here is an illustration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="nd">@benchmark&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">outer_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func start&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">inner_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># &amp;lt;- this does `dot` and is also decorated&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">global_shape&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;global_shape&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;partition&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;dtype&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;axis&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">dist_arr&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func ends&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The text output is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[decorator]outer_func_with_mark: total runtime: 0.001206 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> [decorator]inner_func_with_mark: total runtime: 0.000351 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin array constructor--&amp;gt;Begin dot: 0.000026 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin dot--&amp;gt;Finish dot: 0.000322 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Outer func start--&amp;gt;Outer func ends: 0.001202 s
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking is controlled via the environment variable &lt;code>BENCH_PYLOPS_MPI&lt;/code>. It defaults to &lt;code>1&lt;/code> (enabled) but can be set to &lt;code>0&lt;/code> to skip benchmarking for clean output. &lt;strong>This means users can leave the decorated code unchanged and disable the benchmark through the environment variable&lt;/strong>, much like a C++ debug flag set at compile time. Moreover, careful attention had to be paid to concurrency issues in benchmarking, because time is recorded by the CPU while NCCL issues operations asynchronously to a CUDA stream; &lt;a href="https://github.com/PyLops/pylops-mpi/pull/163" target="_blank" rel="noopener">PR #163&lt;/a> is an example of this.&lt;/p>
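A minimal sketch of how such a decorator can be built (my own simplified illustration, not the actual PyLops-MPI implementation, which additionally handles MPI ranks and CUDA-stream synchronization, and buffers the report so that outer functions print first):

```python
import os
import time
from functools import wraps

# Mirrors the BENCH_PYLOPS_MPI switch: defaults to enabled.
_enabled = os.environ.get("BENCH_PYLOPS_MPI", "1") == "1"
_depth = 0
_marks = []  # stack of (depth, label, timestamp) records

def mark(label):
    """Record a labeled timestamp inside a @benchmark-decorated function."""
    if _enabled:
        _marks.append((_depth, label, time.perf_counter()))

def benchmark(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        global _depth
        if not _enabled:  # skip all instrumentation for clean output
            return func(*args, **kwargs)
        _depth += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            total = time.perf_counter() - start
            indent = "  " * (_depth - 1)
            print(f"{indent}[decorator]{func.__name__}: total runtime: {total:.6f} s")
            # Pop the marks recorded at this nesting depth and report the
            # elapsed time between consecutive markers.
            level = [(lbl, t) for d, lbl, t in _marks if d == _depth]
            for (l0, t0), (l1, t1) in zip(level, level[1:]):
                print(f"{indent}  {l0}-->{l1}: {t1 - t0:.6f} s")
            _marks[:] = [m for m in _marks if m[0] != _depth]
            _depth -= 1
    return wrapper
```

In this sketch an inner decorated function reports before the outer one (its `finally` runs first); buffering the lines per depth and flushing at the outermost call yields the top-down report shown above.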
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-result">Benchmark Result&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>This was the moment of truth. Our 12 weeks of hard work would be judged by a set of cold, hard numbers. Our expectations were as follows.&lt;/p>
&lt;ul>
&lt;li>If the system does not have proprietary NVLink for GPU-GPU communication but is NCCL-compatible, communication using &lt;code>CuPy + NCCL&lt;/code> should still be faster than &lt;code>NumPy + MPI&lt;/code> (and possibly &lt;code>CuPy + MPI&lt;/code>) in PyLops-MPI, i.e., there should be a benefit from the communication-related optimizations enabled by this project.&lt;/li>
&lt;/ul>
&lt;p>The result below was from the NCSA UIUC Delta system&amp;rsquo;s &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">4-way NVIDIA A40 GPU&lt;/a> node (no NVLink) with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b139e63d-11ed-47f4-95f8-5e86bed26312" />
&lt;/p>
&lt;p>That meets our expectation. One thing to note here: &lt;code>CuPy + MPI&lt;/code> communication is actually slower than &lt;code>NumPy + MPI&lt;/code>. This is because the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> - see details &lt;a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html" target="_blank" rel="noopener">here&lt;/a>. The choice was made for simplicity: it allows sending and receiving generic Python objects wrapped in a &lt;code>list&lt;/code>, which enabled a fast development process. However, these calls require copying memory from GPU to CPU, communicating, and copying back from CPU to GPU (via the pickle protocol) - see our discussion with the &lt;code>mpi4py&lt;/code> community &lt;a href="https://github.com/mpi4py/mpi4py/discussions/657" target="_blank" rel="noopener">here&lt;/a>. This leads us to the “Things left to do” section (later).&lt;/p>
&lt;ul>
&lt;li>If the system has an NVLink for GPU-GPU communication, we will be able to see a significant gain in performance of PyLops-MPI with NCCL.&lt;/li>
&lt;/ul>
&lt;p>The result below is also from the NCSA UIUC Delta system, this time an &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">8-way NVIDIA H200 GPU&lt;/a> node (with NVLink), but we only use 4 GPUs to compare with the previous result. This is also with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b3d83547-b9af-4b1c-87c0-ace2302eb140" />
&lt;/p>
&lt;p>Here we unleash the true power of NCCL and its infrastructure: as you can see, &lt;strong>the bandwidth of PyLops-MPI with NCCL is 800x that of the MPI implementation!&lt;/strong> It may not make much sense to compare this number with &lt;code>NumPy+MPI&lt;/code>, because a drastic hardware infrastructure upgrade is involved.&lt;/p>
&lt;p>To top things off, we also ran an experiment trying to saturate the communication, with the array size going up to 32 GB in total. We observe linear scaling, i.e., time grows linearly with data size.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/e5a95fdc-8db7-4caf-925f-256f504603bc" />
&lt;/p>
&lt;p>Finally, we ran an experiment with the application of &lt;a href="https://wiki.seg.org/wiki/Least-squares_migration" target="_blank" rel="noopener">Least-squares Migration&lt;/a>, which is an iterative inversion scheme:&lt;/p>
&lt;ul>
&lt;li>Each iteration applies a forward &lt;code>A&lt;/code> and an adjoint &lt;code>A.T&lt;/code> operation to form residuals and gradients.&lt;/li>
&lt;li>Gradient accumulation requires a global reduction across processes with &lt;code>allreduce&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>Note that the computation is not trivial, so the total runtimes on CPU and GPU are not fairly comparable (notice that on the H200, CuPy+MPI is no longer the slowest). But we want to give an idea of how the pieces fit together in a real application.&lt;/p>
&lt;div align="center">
&lt;img width="400" height="300" alt="kirchA40"
src="https://gist.github.com/user-attachments/assets/46c3a76a-20a3-40c3-981e-6e1c4acecb49" />
&lt;img width="400" height="300" alt="kirchhoff_h200"
src="https://gist.github.com/user-attachments/assets/1439304a-8f78-4640-a78b-ba37238b26e6" />
&lt;/div>
&lt;h3 id="the-impact-of-this-gsoc-project-is-clear">The impact of this GSoC project is clear:&lt;/h3>
&lt;p>With our NCCL-enabled PyLops-MPI,&lt;/p>
&lt;ul>
&lt;li>if you don&amp;rsquo;t have access to state-of-the-art infrastructure, PyLops-MPI with NCCL can still 10x the communication bandwidth (the A40 case)&lt;/li>
&lt;li>if you do, we allow you to get the most out of the system (H200 case).&lt;/li>
&lt;/ul>
&lt;p>And the best part is that using NCCL with PyLops-MPI requires minimal code changes, as shown in this &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/tutorials_nccl/lsm_nccl.py" target="_blank" rel="noopener">LSM Tutorial&lt;/a> and illustrated below. Only two changes are required relative to the code that runs on MPI: the arrays must be allocated on the GPU, and the NCCL communicator has to be passed to the &lt;code>DistributedArray&lt;/code>. And that&amp;rsquo;s it!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_comm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">_nccl&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">initialize_nccl_comm&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - same set-up as running with MPI&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LSM&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">wav&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x0&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">VStack&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">BROADCAST&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">nccl_comm&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Explicitly pass nccl communicator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Must use CuPy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - the rest is the same&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="things-left-to-do">Things left to do&lt;/h3>
&lt;ul>
&lt;li>CUDA-aware MPI: As we pointed out in the A40 experiment, the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> and thus introduces memory copies from GPU to CPU. We aim to optimize this by introducing the buffered calls. However, this is not a trivial task, because some of the MPI-related code was developed around the semantics that communication returns a &lt;code>list&lt;/code> object, while a buffered call returns an array instead.&lt;/li>
&lt;/ul></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-1">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 1&lt;/h1>
&lt;p>Hello all! 👋 My name is Tharit, and I&amp;rsquo;m a computer science student at the University of Texas at Austin. This summer, I am fortunate to participate in the Google Summer of Code (GSoC) 2025 program, hosted by &lt;a href="https://ucsc-ospo.github.io/" target="_blank" rel="noopener">UC OSPO&lt;/a> and the &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> team. My project focuses on enabling NCCL GPU-to-GPU communication in &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, under the guidance of mentors Matteo Ravasi and Yuxi Hong.&lt;/p>
&lt;p>You might have come across this post if you&amp;rsquo;re a PyLops user interested in scaling &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> with GPU/NCCL support, or if you&amp;rsquo;re exploring GSoC projects and wondering what we are up to. Either way, I hope this post gives you useful insights.&lt;/p>
&lt;h2 id="what-is-pylops-mpi">What is PyLops-MPI?&lt;/h2>
&lt;p>If you&amp;rsquo;ve worked with inverse problems, you&amp;rsquo;ve likely come across &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a>. It&amp;rsquo;s a Python library that provides an extensive suite of linear operators and solvers. Operators are designed with a clear focus on the forward and adjoint pair (A and A.T), whilst solvers take operators and data to solve the associated inverse problem. In fields such as geophysics, astrophysics, or medical imaging, inverse problems are solved routinely to &lt;a href="https://www.ae.utexas.edu/news/inverse-problem-solving-bui-than" target="_blank" rel="noopener">image the Earth, space, or the human body from remote measurements&lt;/a>. In all cases, real-life problems tend to consume a lot of compute and require a lot of memory. PyLops allows users to express these problems in an abstract manner that is reminiscent of the underlying equations whilst not compromising on efficiency.&lt;/p>
&lt;p>&lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> is the distributed extension of PyLops, introduced during &lt;a href="https://summerofcode.withgoogle.com/archive/2023/projects/eNJTJO25" target="_blank" rel="noopener">GSoC 2023&lt;/a>. It enables users to scale their computations over CPU and GPU clusters via MPI. However, up until now, even GPU-based communications were routed through MPI, introducing potential performance bottlenecks.&lt;/p>
&lt;h2 id="the-goal-of-the-project">The Goal of the Project&lt;/h2>
&lt;p>Our goal is to take PyLops-MPI to the next level by enabling GPU-to-GPU collective communications directly using NVIDIA NCCL. This allows full utilization of high-bandwidth interconnects like &lt;a href="https://www.nvidia.com/en-us/data-center/nvlink/" target="_blank" rel="noopener">NVLink&lt;/a>, and avoids unnecessary memory transfers through the host CPU.
This blog marks the midpoint of the program (week 6 of 12), and I’d like to reflect on the progress so far, challenges faced, and what&amp;rsquo;s coming next.&lt;/p>
&lt;h2 id="what-is-a-collective-communication-anyway">What is a Collective Communication anyway?&lt;/h2>
&lt;p>In PyLops-MPI, distributed computations require nodes to exchange information, for example, during gradient computations or reductions in iterative solvers. A naive implementation (useful as a thought experiment) would involve each node taking turns broadcasting data, which can be quite slow.
NVIDIA’s NCCL abstracts away the complexity of topology-aware communication. For example, in the image below, if the GPUs communicate most effectively in a ring fashion for an all-reduce operation, NCCL will automatically pick that layout and not use the GPU 01-GPU 04 and GPU 02-GPU 03 communication links.&lt;/p>
&lt;p align="center">
&lt;img src="network.png" alt="network" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a compute node with 4 GPUs attached, directly connected to each other with NVLink&lt;/em>&lt;/p>
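To build intuition for why a ring layout is attractive, here is a didactic single-process simulation of ring all-reduce (my own sketch for illustration, not NCCL's actual implementation): each array is split into N chunks, a reduce-scatter phase circulates partial sums around the ring, and an all-gather phase circulates the finished chunks, so each link carries roughly 2(N-1)/N of the data rather than N full copies.

```python
import numpy as np

def ring_allreduce(arrays):
    """Simulate a ring all-reduce (sum) over n equal-length rank arrays."""
    n = len(arrays)
    # Each rank splits its array into n chunks.
    chunks = [np.array_split(a.astype(float), n) for a in arrays]

    # Phase 1 - reduce-scatter: in step s, rank i sends chunk (i - s) % n to
    # rank (i + 1) % n, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # Now rank i owns the fully reduced chunk (i + 1) % n.

    # Phase 2 - all-gather: circulate the reduced chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    # Every rank now holds the full reduced array.
    return [np.concatenate(chunks[i]) for i in range(n)]
```

The snapshot of `sends` before applying each step mimics all ranks communicating simultaneously; only neighbor-to-neighbor links are ever used.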
&lt;h2 id="what-we-achieved-so-far">What we achieved, so far&lt;/h2>
&lt;p>It is probably best to tell the story through the sequence of pull requests.&lt;/p>
&lt;h3 id="core-changes-in-distributedarray-pr-130httpsgithubcompylopspylops-mpipull130">Core Changes in DistributedArray (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/130" target="_blank" rel="noopener">PR #130&lt;/a>)&lt;/h3>
&lt;p>This PR introduces NCCL support into the &lt;code>DistributedArray&lt;/code> class. The design allows users to optionally pass both a &lt;code>NcclCommunicator&lt;/code> and a &lt;code>MPI.Comm&lt;/code>. By doing so, small control data (e.g., shape, dtype) is still exchanged via MPI, leveraging Python&amp;rsquo;s flexibility and minimizing performance impact. As you will see, this decision to keep two communicators turns out to be a good call.
This is how the &lt;code>__init__&lt;/code> method of &lt;code>DistributedArray&lt;/code> looks with the new addition marked by a comment:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">global_shape&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Comm&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">COMM_WORLD&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">NcclCommunicatorType&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Added to this line&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Partition&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">SCATTER&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">int&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">local_shapes&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">]]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mask&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Integral&lt;/span>&lt;span class="p">]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="the-cupys-nccl-api">CuPy&amp;rsquo;s NCCL API&lt;/h3>
&lt;p>NCCL&amp;rsquo;s API (&lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html" target="_blank" rel="noopener">mirroring its C++ origins&lt;/a>) is minimalistic and requires manual memory management. One prominent example is the implementation of &lt;code>allGather()&lt;/code>. Previously, using &lt;code>mpi4py&lt;/code>, we could leverage Python&amp;rsquo;s dynamic typing (everything is an object, so one just sends the object), which means &lt;code>mpi4py&lt;/code> allows different ranks to send arrays of different sizes. NCCL, by contrast, requires every rank in the communicator to send the same size. To work around this, we implemented padding and reshaping logic for multi-dimensional arrays. NCCL treats arrays as contiguous byte streams, so padding must be handled carefully &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>.&lt;/p>
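&lt;p>The padding workaround can be sketched as follows, simulating the equal-size requirement with plain lists; the helper name is hypothetical and this is not the actual PyLops-MPI implementation:&lt;/p>

```python
def padded_allgather(local_arrays):
    """Gather unequal 1-D arrays via equal-size (NCCL-style) chunks."""
    sizes = [len(a) for a in local_arrays]   # small metadata, sent via MPI
    pad_to = max(sizes)
    # Pad every send buffer to the same length, as ncclAllGather expects.
    padded = [list(a) + [0] * (pad_to - len(a)) for a in local_arrays]
    # ncclAllGather delivers the concatenation of all send buffers.
    gathered = [x for buf in padded for x in buf]
    # Trim the padding back off using the exchanged sizes.
    out = []
    for rank, size in enumerate(sizes):
        out.extend(gathered[rank * pad_to: rank * pad_to + size])
    return out


# e.g. three ranks holding 3, 1 and 2 elements respectively
print(padded_allgather([[1, 2, 3], [4], [5, 6]]))  # [1, 2, 3, 4, 5, 6]
```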
&lt;p>Moreover, we had to accommodate NCCL’s lower-level API, which lacks conveniences such as communicator split variants. Internally, we introduced unified abstractions such as &lt;code>_allgather()&lt;/code>, &lt;code>_allreduce()&lt;/code>, &lt;code>send()&lt;/code>, &lt;code>recv()&lt;/code>, etc. in &lt;code>DistributedArray&lt;/code> and modified the communication model to work seamlessly whether MPI or NCCL is used. This lets other developers focus on developing new operators that suit their needs, while the existence of different communicators is abstracted away.&lt;/p>
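&lt;p>A minimal sketch of the dual-communicator dispatch idea, with illustrative names rather than PyLops-MPI&amp;rsquo;s actual API: the backend branch lives in one internal method, so operator code never has to care which communicator is in use.&lt;/p>

```python
class FakeComm:
    """Stand-in communicator: 'reduces' by summing preset rank values."""
    def __init__(self, rank_values):
        self.rank_values = rank_values

    def allreduce(self, _send):
        return sum(self.rank_values)


class DistributedArrayLike:
    """Toy array holding both communicators, one optional."""
    def __init__(self, local, base_comm, base_comm_nccl=None):
        self.local = local
        self.base_comm = base_comm            # MPI path: always present
        self.base_comm_nccl = base_comm_nccl  # NCCL path: optional

    def _allreduce(self, x):
        # The only place that branches on the backend; callers just
        # invoke _allreduce() and stay communicator-agnostic.
        if self.base_comm_nccl is not None:
            return self.base_comm_nccl.allreduce(x)
        return self.base_comm.allreduce(x)


arr = DistributedArrayLike(local=3, base_comm=FakeComm([3, 4, 5]))
print(arr._allreduce(arr.local))  # 12
```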
&lt;p align="center">
&lt;img src="partition.png" alt="partition" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a challenge coming from having an unevenly distributed array&lt;/em>&lt;/p>
&lt;h3 id="keep-things-small-dependency-management-pr-132httpsgithubcompylopspylops-mpipull132-and-pr-135httpsgithubcompylopspylops-mpipull135">Keep things small: Dependency management (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/132" target="_blank" rel="noopener">PR #132&lt;/a> and &lt;a href="https://github.com/PyLops/pylops-mpi/pull/135" target="_blank" rel="noopener">PR #135&lt;/a>)&lt;/h3>
&lt;p>Despite adding this new capability, we are fully aware that not every user has access to a cluster of GPUs, so we do not make NCCL and CuPy mandatory dependencies. Someone installing and experimenting with PyLops-MPI for the first time is likely to run it on a single-node desktop, and we don’t want to introduce such complexity early on. This means our code has to accommodate an “optional dependency” through a “protected import”. If we had &lt;code>import cupy as cp&lt;/code> at the top of &lt;code>DistributedArray&lt;/code>, users without a GPU would encounter an error before doing anything useful at all.
In other words, our library should import CuPy and NCCL only when both the system supports it and the user asks for it. The pattern looks like &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/pylops_mpi/utils/deps.py" target="_blank" rel="noopener">this&lt;/a>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">util&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">find_spec&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="kc">None&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">getenv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;NCCL_PYLOPS_MPI&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">nccl_test&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># try import CuPy and then check for NCCL&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">nccl&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="n">available&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># success&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># unable to import but the package is installed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># package is not installed or the environment variable disables it&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Finally, set nccl_enabled flag for other module to use for protected import&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This preserves PyLops-MPI’s minimal base installation. It required carefully isolating imports and adapting the module resolution logic through a backend dispatching mechanism, something I had never had to consider before.&lt;/p>
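&lt;p>Here is a runnable distillation of that protected-import pattern, assuming only the environment variable name from the snippet above; the returned status strings are illustrative:&lt;/p>

```python
import os
import importlib
from importlib import util


def check_optional_backend(pkg="cupy", env="NCCL_PYLOPS_MPI"):
    """Probe an optional dependency without importing it at module load."""
    if int(os.getenv(env, "1")) != 1:
        return False, "disabled by environment variable"
    if util.find_spec(pkg) is None:
        return False, "package not installed"
    try:
        importlib.import_module(pkg)  # deferred import; may still fail
        return True, "available"
    except Exception:
        return False, "installed but failed to import"


# On a machine without CuPy this typically reports it as not installed.
print(check_optional_backend(pkg="cupy"))
```

&lt;p>Other modules then check the resulting flag before touching any GPU code path, keeping the base installation importable everywhere.&lt;/p>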
&lt;h3 id="the-basic-operator-with-nccl-pr-137httpsgithubcompylopspylops-mpipull137">The Basic Operator with NCCL &lt;a href="https://github.com/PyLops/pylops-mpi/pull/137" target="_blank" rel="noopener">PR 137&lt;/a>&lt;/h3>
&lt;p>We chose &lt;code>MPIVStack&lt;/code> as the first operator to implement NCCL support due to its simplicity. Several design choices emerged:&lt;/p>
&lt;h4 id="implicit-communicator-propagation">Implicit Communicator Propagation&lt;/h4>
&lt;p>We updated forward and adjoint calls to propagate the &lt;code>base_comm_nccl&lt;/code> from input to output automatically. This way, if &lt;code>x&lt;/code> is NCCL-enabled, then &lt;code>y = A @ x&lt;/code> or &lt;code>A.H @ x&lt;/code> will also be NCCL-enabled. This avoids mismatches and keeps operator pipelines consistent.&lt;/p>
&lt;p>Interestingly, and contrary to our initial expectation, the operator itself did not need to explicitly take &lt;code>base_comm_nccl&lt;/code> as an argument the way &lt;code>DistributedArray&lt;/code> does. This is good news: it reduces the chance that developers adding new operators will have to handle different communication cases, and keeps extending PyLops-MPI simple.&lt;/p>
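&lt;p>The propagation idea can be sketched like this, with toy class names standing in for &lt;code>DistributedArray&lt;/code> and an operator; note that the operator takes no NCCL argument at all:&lt;/p>

```python
class DistArray:
    """Toy DistributedArray: carries data plus both communicators."""
    def __init__(self, data, base_comm="mpi", base_comm_nccl=None):
        self.data = data
        self.base_comm = base_comm
        self.base_comm_nccl = base_comm_nccl


class ScaleOperator:
    """Toy operator; note __init__ takes no communicator at all."""
    def __init__(self, alpha):
        self.alpha = alpha

    def __matmul__(self, x):
        y = [self.alpha * v for v in x.data]
        # Propagate both communicators from input to output, so
        # y = A @ x is NCCL-enabled whenever x is.
        return DistArray(y, base_comm=x.base_comm,
                         base_comm_nccl=x.base_comm_nccl)


x = DistArray([1.0, 2.0], base_comm_nccl="nccl-comm")
y = ScaleOperator(3.0) @ x
print(y.data, y.base_comm_nccl)  # [3.0, 6.0] nccl-comm
```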
&lt;h4 id="optional-dual-communicator-design">Optional Dual-Communicator Design&lt;/h4>
&lt;p>As with DistributedArray, the ability to pass both an MPI communicator and an NCCL communicator proved to be a sound decision. By maintaining NCCL as an optional backend, we gain fine-grained control over which communication paths use NCCL versus MPI. This flexibility allowed us to optimize performance-critical paths while retaining MPI for control messages and small metadata transfers.&lt;/p>
&lt;p>In particular, consider the communication of ghost cells, which are used for computation around boundaries, as in derivative calculations. Small metadata such as &lt;code>cell_fronts&lt;/code> (typically a list with one integer per rank), needed to allocate the send/receive buffers, continues to be transmitted via MPI. This leverages Python&amp;rsquo;s object serialization model (&lt;code>list[int]&lt;/code>) without incurring GPU synchronization costs. The actual cell arrays, which can be large, are communicated with NCCL.&lt;/p>
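&lt;p>A toy sketch of that split, with hypothetical names: size metadata travels first (the MPI-style path) so receive buffers can be sized, then the boundary slices themselves move (the NCCL-style path):&lt;/p>

```python
def exchange_ghost_cells(local_chunks, ncells=1):
    """Give each chunk a halo of its left neighbour's trailing cells."""
    # Step 1 (MPI-style): exchange tiny list[int] metadata so every rank
    # knows how many cells to expect and can allocate receive buffers.
    fronts = [min(ncells, len(c)) for c in local_chunks]

    # Step 2 (NCCL-style): ship the actual boundary arrays, which can
    # be large, from rank r to rank r + 1.
    halos = [None] * len(local_chunks)
    for r in range(len(local_chunks) - 1):
        n = fronts[r]
        halos[r + 1] = local_chunks[r][len(local_chunks[r]) - n:]
    return halos


print(exchange_ghost_cells([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# [None, [3], [6]]
```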
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Aside from enabling NCCL support for the remaining operators and their full test coverage, some exciting upcoming updates are:&lt;/p>
&lt;ul>
&lt;li>Complex-number type support for NCCL&lt;/li>
&lt;li>Benchmarking results on a real HPC system&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for Part 2, and thanks for reading!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>For the best performance, mpi4py would require buffer memory allocation as well. The mpi4py package provides two interfaces: buffered and non-buffered. Currently, PyLops-MPI takes the non-buffered approach, which suggests room for optimization.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</link><pubDate>Sun, 08 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</guid><description>&lt;h1 id="google-summer-of-code-25-optimizing-and-benchmarking-gpu-collective-communication-of-pylops-mpi-with-nccl">Google Summer of Code ‘25: Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL&lt;/h1>
&lt;p>My project aims to introduce GPU-to-GPU collective communication calls using Nvidia&amp;rsquo;s NCCL to &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, an extension of the powerful &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> library.&lt;/p>
&lt;p>I&amp;rsquo;m incredibly grateful for this opportunity and excited to be mentored by two HPC experts, Yuxi Hong from Lawrence Berkeley National Laboratory and Matteo Ravasi from ShearWater GeoServices.&lt;/p>
&lt;p>Here&amp;rsquo;s also the link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/C2XSZp2E" target="_blank" rel="noopener">proposal&lt;/a>&lt;/p>
&lt;h2 id="what-is-pylops-mpi-and-nccl-">What is PyLops-MPI and NCCL ?&lt;/h2>
&lt;p>PyLops is a Python library that provides a rich collection of linear operators to solve inverse problems. Its MPI extension, PyLops-MPI, takes this a step further by enabling these operations to run on large-scale, distributed computing systems like HPC using the Message-Passing Interface (MPI).&lt;/p>
&lt;p>Where does NCCL fit in? The NVIDIA Collective Communication Library (NCCL) is a library of highly optimized routines for collective communication between GPUs. It offers the opportunity to close a performance gap in PyLops-MPI: as we offload more and more computationally intensive tasks to GPUs, the communication between them can become a bottleneck. NCCL offers a powerful solution to this problem, enabling high-bandwidth, low-latency communication that can significantly boost performance.&lt;/p>
&lt;h2 id="motivation-and-what-was-missing">Motivation and What was Missing&lt;/h2>
&lt;p>As a student with a background in geophysics (B.Sc.) and now pursuing computer science (M.Sc.), I&amp;rsquo;ve experienced firsthand the challenges of scaling scientific computing research from a personal desktop to a high-performance computing (HPC) cluster. It can be a significant hurdle. My project aims to ease this transition for PyLops-MPI users. PyLops-MPI is something I wish had existed while I was doing my undergraduate research!&lt;/p>
&lt;p>Currently, PyLops-MPI is &amp;ldquo;CUDA-aware,&amp;rdquo; meaning it can offload computations to GPUs. However, the communication between those GPUs is still handled by the underlying MPI implementation, which isn&amp;rsquo;t always optimal. This project will address this gap by integrating NCCL to handle GPU-to-GPU communication directly: if the computation is done on the GPU, there shouldn&amp;rsquo;t be a copy from GPU to CPU, a transfer with MPI, and a copy back to the GPU again.&lt;/p>
&lt;p>This will be especially impactful for memory-bound problems where high-bandwidth communication is critical. By the end of this project, we&amp;rsquo;ll have a clear, quantifiable understanding of the performance gains achieved.&lt;/p>
&lt;h2 id="my-best-laid-plan">My Best-Laid Plan&lt;/h2>
&lt;p>My approach is grounded in good software engineering practices to ensure that this new feature is both robust and genuinely useful. I was impressed by the code quality of the repository (an enjoyable read), and I am committed to not breaking it.&lt;/p>
&lt;p>First and foremost, the goal is to seamlessly integrate NCCL without breaking what already works. A significant part of my effort will be dedicated to rigorous testing. This means not only ensuring that all existing tests pass but also developing a new, comprehensive test suite to validate the correctness of the GPU-to-GPU communication across different hardware setups.&lt;/p>
&lt;p>Once we&amp;rsquo;re confident that the integration is solid, the exciting part begins: benchmarking (or you may call it &amp;ldquo;Moment of Truth&amp;rdquo;)! The plan is to measure the performance of end-to-end iterative solvers. These solvers are a perfect test case because they involve a mix of intensive gradient computations on the GPU and frequent AllReduce calls to sync up processes. This will give us a clear picture of the speedup and efficiency gains from using NCCL.&lt;/p>
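&lt;p>As a preview of why these solvers stress AllReduce, here is a toy distributed gradient descent on &lt;code>||Ax - b||^2&lt;/code>: each &amp;ldquo;rank&amp;rdquo; computes a partial gradient over its rows, the partials are summed across ranks every iteration, and all ranks apply the same update. These are pure-Python stand-ins, not PyLops-MPI code:&lt;/p>

```python
def allreduce_sum(partials):
    # Stand-in for an MPI/NCCL AllReduce: elementwise sum of the
    # per-rank vectors, with the result visible to every rank.
    return [sum(col) for col in zip(*partials)]


def distributed_gradient_descent(row_blocks, b_blocks, x, lr=0.1, iters=60):
    """Minimize ||Ax - b||^2 with rows of A split across 'ranks'."""
    for _ in range(iters):
        partials = []
        for rows, bs in zip(row_blocks, b_blocks):  # loop body = one rank
            g = [0.0] * len(x)
            for row, bi in zip(rows, bs):
                resid = sum(a * xi for a, xi in zip(row, x)) - bi
                for j, a in enumerate(row):
                    g[j] += 2.0 * a * resid
            partials.append(g)
        g = allreduce_sum(partials)  # the per-iteration sync point
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Two ranks each hold one row of a 2x2 identity system; solution is [1, 2].
x = distributed_gradient_descent([[[1.0, 0.0]], [[0.0, 1.0]]],
                                 [[1.0], [2.0]], [0.0, 0.0])
print([round(v, 3) for v in x])  # [1.0, 2.0]
```

&lt;p>Because the AllReduce fires every iteration, its latency and bandwidth directly shape end-to-end solver time, which is exactly what the benchmarks will measure.&lt;/p>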
&lt;p>Finally, to make sure this work benefits the entire community, I will create clear documentation and tutorials. The goal is to make it easy for any user to leverage this new GPU-accelerated communication in their own research and applications.&lt;/p></description></item></channel></rss>