<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>NCCL | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/nccl/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/nccl/index.xml" rel="self" type="application/rss+xml"/><description>NCCL</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 05 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/nccl/</link></image><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-2">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 2&lt;/h1>
&lt;p>Hello all! 👋 This is Tharit again. I want to share this blog post about Part 2 of my Google Summer of Code project. In case you missed it, you can take a look at &lt;a href="https://ucsc-ospo.github.io/report/osre25/lbl/pylops-mpi/20250723-tharit/" target="_blank" rel="noopener">Part 1&lt;/a> as well. Without further ado, the following support was added since last time.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="complex-number-support-pr-148httpsgithubcompylopspylops-mpipull148">Complex Number Support &lt;a href="https://github.com/PyLops/pylops-mpi/pull/148" target="_blank" rel="noopener">PR #148&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Between this PR and the previous one, there was a lot of debugging and testing to make sure that every existing &lt;code>MPILinearOperator&lt;/code> works under NCCL as it does with &lt;code>mpi4py&lt;/code>: PRs &lt;a href="https://github.com/PyLops/pylops-mpi/pull/141" target="_blank" rel="noopener">#141&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/142" target="_blank" rel="noopener">#142&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/145" target="_blank" rel="noopener">#145&lt;/a>&lt;/em>&lt;/p>
&lt;p>Most PyLops-MPI users are scientists and engineers working on scientific problems, and many scientific problems involve complex numbers (the Fourier transform touches many things). &lt;em>NCCL does not support complex numbers out of the box&lt;/em>.&lt;/p>
&lt;p>It turned out that adding complex-number support was not a big issue. A complex number is simply a contiguous array of, say, &lt;code>float64&lt;/code>: unlike a typical &lt;code>float64&lt;/code>, one element of &lt;code>complex128&lt;/code> is represented by two &lt;code>float64&lt;/code> values. Things get more complicated once we talk about complex-number arithmetic. Luckily, &lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclRedOp_t" target="_blank" rel="noopener">NCCL semantics&lt;/a> only support the &lt;em>element-wise&lt;/em> operations &lt;code>ncclSum&lt;/code>, &lt;code>ncclProd&lt;/code>, &lt;code>ncclMin&lt;/code>, &lt;code>ncclMax&lt;/code>, and &lt;code>ncclAvg&lt;/code>, and wrapping element-wise operations for complex numbers is straightforward.&lt;/p>
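To illustrate this reinterpretation, here is a plain NumPy sketch of the idea (PyLops-MPI applies the same trick to CuPy arrays on the GPU):

```python
import numpy as np

# A complex128 array occupies the same memory as twice as many float64
# values, with real and imaginary parts interleaved.
a = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)
b = np.array([5 + 6j, 7 + 8j], dtype=np.complex128)

a_view = a.view(np.float64)  # no copy: [1., 2., 3., 4.]

# An element-wise sum over the float views equals the complex element-wise
# sum, which is why an element-wise reduction such as ncclSum can be applied
# to the reinterpreted buffer directly.
summed = (a.view(np.float64) + b.view(np.float64)).view(np.complex128)
```

Note that `view` reinterprets the existing memory without copying, which is exactly what makes this approach cheap.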
&lt;p>The change to PyLops-MPI&amp;rsquo;s &lt;code>_nccl.py&lt;/code> itself is minimal. We simply added the function below, which hides the complexity of buffer-size management from users.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">_nccl_buf_size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">count&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Get an appropriate buffer size according to the dtype of buf&amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">if&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dtype&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;complex64&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;complex128&amp;#39;&lt;/span>&lt;span class="p">]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">return&lt;/span> &lt;span class="mi">2&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="mi">2&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">size&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">return&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">size&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The concept is quite simple. Mechanically, though, getting it right in the general case required some extensive bug fixing, particularly in the call to &lt;code>_allgather&lt;/code>, as noted in the &amp;ldquo;Core Changes&amp;rdquo; section of Part 1. The array needs some preprocessing (to align with NCCL semantics) and post-processing so that the result of PyLops-MPI&amp;rsquo;s NCCL allgather matches that of its MPI allgather: PyLops-MPI must be able to switch between &lt;code>mpi4py&lt;/code> and NCCL seamlessly from the user&amp;rsquo;s perspective. To make this concrete, here is how we do the &lt;code>_allgather()&lt;/code> with NCCL:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="nf">_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Allgather operation
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">deps&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">nccl_enabled&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="nb">isinstance&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nb">tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">send_shapes&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">(&lt;/span>&lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">_prepare_nccl_allgather_inputs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">raw_recv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">_unroll_nccl_allgather_recv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">raw_recv&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt; snip - MPI allgather &amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>After this feature was added, PyLops-MPI with NCCL caught up with the original MPI implementation: the test coverage is now the same, with all 306 tests passing!&lt;/strong>&lt;/p>
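The pre/post-processing around the NCCL allgather above can be sketched in plain NumPy (a simplified single-process illustration of what the PyLops-MPI helpers do on device buffers): NCCL's allgather requires every rank to contribute an equal-sized buffer, so ranks pad to the largest shape, gather, and trim afterwards.

```python
import numpy as np

def pad_to(arr, size):
    """Zero-pad a 1-D array to `size` elements; NCCL's allgather requires
    every rank to contribute a buffer of the same size."""
    out = np.zeros(size, dtype=arr.dtype)
    out[:arr.size] = arr
    return out

# Pretend three ranks hold 1-D chunks of different sizes.
chunks = [np.array([1., 2., 3.]), np.array([4.]), np.array([5., 6.])]
sizes = [c.size for c in chunks]
max_size = max(sizes)

# The allgather result is the concatenation of the equal-sized padded buffers.
raw_recv = np.concatenate([pad_to(c, max_size) for c in chunks])

# Post-processing ("unrolling"): trim each rank's slot back to its true size.
recv = np.concatenate(
    [raw_recv[r * max_size: r * max_size + n] for r, n in enumerate(sizes)]
)
```

The padding waste is bounded by the largest per-rank shape, and the trim step is what lets the NCCL path return the same result as the MPI path.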
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-instrumentation-pr-157httpsgithubcompylopspylops-mpipull157">Benchmark Instrumentation &lt;a href="https://github.com/PyLops/pylops-mpi/pull/157" target="_blank" rel="noopener">PR #157&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>Profiling distributed GPU operations is critical to understanding performance bottlenecks. To make this easier, we added a &lt;em>lightweight benchmark instrumentation&lt;/em> framework in PyLops-MPI. The goal was to allow developers to mark execution points in a function and collect timing information for these markers.&lt;/p>
&lt;p>The core of the implementation is a &lt;code>@benchmark&lt;/code> decorator. Inside a decorated function, developers can call &lt;code>mark(label)&lt;/code> to record the time at specific points. After the function completes, the timings are reported in a human-readable format. This design is inspired by C++-style instrumentation, letting developers place markers directly in the code where they are most informative.&lt;/p>
&lt;p>But because we are in Python, to handle nested function calls we collect the timing information on a stack (forming a bottom-up call graph) and parse the result at the end of the decorated function. Here is an illustration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="nd">@benchmark&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">outer_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func start&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">inner_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># &amp;lt;- this does `dot` and is also decorated&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">global_shape&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;global_shape&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;partition&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;dtype&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;axis&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">dist_arr&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func ends&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The text output is:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[decorator]outer_func_with_mark: total runtime: 0.001206 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> [decorator]inner_func_with_mark: total runtime: 0.000351 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin array constructor--&amp;gt;Begin dot: 0.000026 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin dot--&amp;gt;Finish dot: 0.000322 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Outer func start--&amp;gt;Outer func ends: 0.001202 s
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking is controlled via the environment variable &lt;code>BENCH_PYLOPS_MPI&lt;/code>. It defaults to &lt;code>1&lt;/code> (enabled) but can be set to &lt;code>0&lt;/code> to skip benchmarking for clean output. &lt;strong>This means users can leave the decorated code unchanged and disable the benchmark through the environment variable&lt;/strong>, much like a C++ debug flag set at compile time. Moreover, careful attention had to be paid to concurrency issues in benchmarking, because time is recorded by the CPU while NCCL issues operations asynchronously to a CUDA stream; &lt;a href="https://github.com/PyLops/pylops-mpi/pull/163" target="_blank" rel="noopener">PR #163&lt;/a> is an example of this.&lt;/p>
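A minimal sketch of how such a decorator can be built (my own simplified illustration, not the actual PyLops-MPI implementation, which additionally handles MPI ranks and CUDA-stream synchronization, and buffers the report so that outer functions print first):

```python
import os
import time
from functools import wraps

# Mirrors the BENCH_PYLOPS_MPI switch: defaults to enabled.
_enabled = os.environ.get("BENCH_PYLOPS_MPI", "1") == "1"
_depth = 0
_marks = []  # stack of (depth, label, timestamp) records

def mark(label):
    """Record a labeled timestamp inside a @benchmark-decorated function."""
    if _enabled:
        _marks.append((_depth, label, time.perf_counter()))

def benchmark(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        global _depth
        if not _enabled:  # skip all instrumentation for clean output
            return func(*args, **kwargs)
        _depth += 1
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            total = time.perf_counter() - start
            indent = "  " * (_depth - 1)
            print(f"{indent}[decorator]{func.__name__}: total runtime: {total:.6f} s")
            # Pop the marks recorded at this nesting depth and report the
            # elapsed time between consecutive markers.
            level = [(lbl, t) for d, lbl, t in _marks if d == _depth]
            for (l0, t0), (l1, t1) in zip(level, level[1:]):
                print(f"{indent}  {l0}-->{l1}: {t1 - t0:.6f} s")
            _marks[:] = [m for m in _marks if m[0] != _depth]
            _depth -= 1
    return wrapper
```

In this sketch an inner decorated function reports before the outer one (its `finally` runs first); buffering the lines per depth and flushing at the outermost call yields the top-down report shown above.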
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-result">Benchmark Result&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>This was the moment of truth. Our 12 weeks of hard work would be judged by a set of cold, hard numbers. Our expectations were as follows.&lt;/p>
&lt;ul>
&lt;li>If the system does not have proprietary NVLink for GPU-GPU communication but is NCCL-compatible, communication using &lt;code>CuPy + NCCL&lt;/code> should still be faster than &lt;code>NumPy + MPI&lt;/code> (and possibly &lt;code>CuPy + MPI&lt;/code>) in PyLops-MPI, i.e., there should be a benefit from the communication-related optimizations enabled by this project.&lt;/li>
&lt;/ul>
&lt;p>The result below was from the NCSA UIUC Delta system&amp;rsquo;s &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">4-way NVIDIA A40 GPU&lt;/a> node (no NVLink) with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b139e63d-11ed-47f4-95f8-5e86bed26312" />
&lt;/p>
&lt;p>That meets our expectation. One thing to note here: &lt;code>CuPy + MPI&lt;/code> communication is actually slower than &lt;code>NumPy + MPI&lt;/code>. This is because the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> - see details &lt;a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html" target="_blank" rel="noopener">here&lt;/a>. The choice was made for simplicity: it allows sending and receiving generic Python objects wrapped in a &lt;code>list&lt;/code>, which enabled a fast development process. However, these calls require copying memory from GPU to CPU, communicating, and copying back from CPU to GPU (via the pickle protocol) - see our discussion with the &lt;code>mpi4py&lt;/code> community &lt;a href="https://github.com/mpi4py/mpi4py/discussions/657" target="_blank" rel="noopener">here&lt;/a>. This leads us to the “Things left to do” section (later).&lt;/p>
&lt;ul>
&lt;li>If the system has an NVLink for GPU-GPU communication, we will be able to see a significant gain in performance of PyLops-MPI with NCCL.&lt;/li>
&lt;/ul>
&lt;p>The result below is also from the NCSA UIUC Delta system, this time an &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">8-way NVIDIA H200 GPU&lt;/a> node (with NVLink), but we only use 4 GPUs to compare with the previous result. This is also with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b3d83547-b9af-4b1c-87c0-ace2302eb140" />
&lt;/p>
&lt;p>Here we unleash the true power of NCCL and its infrastructure: as you can see, &lt;strong>the bandwidth of PyLops-MPI with NCCL is 800x that of the MPI implementation!&lt;/strong> It may not make much sense to compare this number with &lt;code>NumPy+MPI&lt;/code>, because a drastic hardware infrastructure upgrade is involved.&lt;/p>
&lt;p>To top things off, we also ran an experiment trying to saturate the communication, with the array size going up to 32 GB in total. We observe linear scaling, i.e., time grows linearly with data size.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/e5a95fdc-8db7-4caf-925f-256f504603bc" />
&lt;/p>
&lt;p>Finally, we ran an experiment with the application of &lt;a href="https://wiki.seg.org/wiki/Least-squares_migration" target="_blank" rel="noopener">Least-squares Migration&lt;/a>, which is an iterative inversion scheme:&lt;/p>
&lt;ul>
&lt;li>Each iteration applies a forward &lt;code>A&lt;/code> and an adjoint &lt;code>A.T&lt;/code> operation to form residuals and gradients.&lt;/li>
&lt;li>Gradient accumulation requires a global reduction across processes with &lt;code>allreduce&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>Note that the computation is not trivial, so the total runtimes on CPU and GPU are not fairly comparable (notice that on the H200, CuPy+MPI is no longer the slowest). But we want to give an idea of how the pieces fit together in a real application.&lt;/p>
&lt;div align="center">
&lt;img width="400" height="300" alt="kirchA40"
src="https://gist.github.com/user-attachments/assets/46c3a76a-20a3-40c3-981e-6e1c4acecb49" />
&lt;img width="400" height="300" alt="kirchhoff_h200"
src="https://gist.github.com/user-attachments/assets/1439304a-8f78-4640-a78b-ba37238b26e6" />
&lt;/div>
&lt;h3 id="the-impact-of-this-gsoc-project-is-clear">The impact of this GSoC project is clear:&lt;/h3>
&lt;p>With our NCCL-enabled PyLops-MPI,&lt;/p>
&lt;ul>
&lt;li>if you don&amp;rsquo;t have access to state-of-the-art infrastructure, PyLops-MPI with NCCL can still 10x the communication bandwidth (the A40 case)&lt;/li>
&lt;li>if you do, we allow you to get the most out of the system (H200 case).&lt;/li>
&lt;/ul>
&lt;p>And the best part is that using NCCL with PyLops-MPI requires minimal code changes, as shown in this &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/tutorials_nccl/lsm_nccl.py" target="_blank" rel="noopener">LSM Tutorial&lt;/a> and illustrated below. Only two changes are required relative to the code that runs on MPI: the arrays must be allocated on the GPU, and the NCCL communicator has to be passed to the &lt;code>DistributedArray&lt;/code>. And that&amp;rsquo;s it!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_comm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">_nccl&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">initialize_nccl_comm&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - same set-up as running with MPI&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LSM&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">wav&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x0&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">VStack&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">BROADCAST&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">nccl_comm&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Explicitly pass nccl communicator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Must use CuPy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - the rest is the same&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="things-left-to-do">Things left to do&lt;/h3>
&lt;ul>
&lt;li>CUDA-aware MPI: As we pointed out in the A40 experiment, the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> and thus introduces memory copies from GPU to CPU. We aim to optimize this by introducing the buffered calls. However, this is not a trivial task, because some of the MPI-related code was developed around the semantics that communication returns a &lt;code>list&lt;/code> object, while a buffered call returns an array instead.&lt;/li>
&lt;/ul></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-1">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 1&lt;/h1>
&lt;p>Hello all! 👋 My name is Tharit, and I&amp;rsquo;m a computer science student at the University of Texas at Austin. This summer, I am fortunate to participate in the Google Summer of Code (GSoC) 2025 program, hosted by &lt;a href="https://ucsc-ospo.github.io/" target="_blank" rel="noopener">UC OSPO&lt;/a> and the &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> team. My project focuses on enabling NCCL GPU-to-GPU communication in &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, under the guidance of mentors Matteo Ravasi and Yuxi Hong.&lt;/p>
&lt;p>You might have come across this post if you&amp;rsquo;re a PyLops user interested in scaling &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> with GPU/NCCL support, or if you&amp;rsquo;re exploring GSoC projects and wondering what we are up to. Either way, I hope this post gives you useful insights.&lt;/p>
&lt;h2 id="what-is-pylops-mpi">What is PyLops-MPI?&lt;/h2>
&lt;p>If you&amp;rsquo;ve worked with inverse problems, you&amp;rsquo;ve likely come across &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a>. It&amp;rsquo;s a Python library that provides an extensive suite of linear operators and solvers. Operators are designed with a clear focus on the forward and adjoint pair (A and A.T), whilst solvers take operators and data to solve the associated inverse problem. In fields such as geophysics, astrophysics, or medical imaging, inverse problems are solved routinely to &lt;a href="https://www.ae.utexas.edu/news/inverse-problem-solving-bui-than" target="_blank" rel="noopener">image the Earth, space, or the human body from remote measurements&lt;/a>. In all cases, real-life problems tend to consume a lot of compute and require a lot of memory. PyLops allows users to express these problems in an abstract manner that is reminiscent of the underlying equations whilst not compromising on efficiency.&lt;/p>
&lt;p>&lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> is the distributed extension of PyLops, introduced during &lt;a href="https://summerofcode.withgoogle.com/archive/2023/projects/eNJTJO25" target="_blank" rel="noopener">GSoC 2023&lt;/a>. It enables users to scale their computations over CPU and GPU clusters via MPI. However, up until now, even GPU-based communications were routed through MPI, introducing potential performance bottlenecks.&lt;/p>
&lt;h2 id="the-goal-of-the-project">The Goal of the Project&lt;/h2>
&lt;p>Our goal is to take PyLops-MPI to the next level by enabling GPU-to-GPU collective communications directly using NVIDIA NCCL. This allows full utilization of high-bandwidth interconnects like &lt;a href="https://www.nvidia.com/en-us/data-center/nvlink/" target="_blank" rel="noopener">NVLink&lt;/a>, and avoids unnecessary memory transfers through the host CPU.
This blog marks the midpoint of the program (week 6 of 12), and I’d like to reflect on the progress so far, challenges faced, and what&amp;rsquo;s coming next.&lt;/p>
&lt;h2 id="what-is-a-collective-communication-anyway">What is a Collective Communication anyway?&lt;/h2>
&lt;p>In PyLops-MPI, distributed computations require nodes to exchange information, for example, during gradient computations or reductions in iterative solvers. A naive implementation (useful as a thought experiment) would involve each node taking turns broadcasting data, which can be quite slow.
NVIDIA’s NCCL abstracts away the complexity of topology-aware communication. For example, in the image below, if the GPUs communicate most effectively in a ring fashion for an all-reduce operation, NCCL will automatically pick that layout and not use the GPU 01-GPU 04 and GPU 02-GPU 03 communication links.&lt;/p>
&lt;p align="center">
&lt;img src="network.png" alt="network" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a compute node with 4 GPUs attached, directly connected to each other with NVLink&lt;/em>&lt;/p>
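To build intuition for why a ring layout is attractive, here is a didactic single-process simulation of ring all-reduce (my own sketch for illustration, not NCCL's actual implementation): each array is split into N chunks, a reduce-scatter phase circulates partial sums around the ring, and an all-gather phase circulates the finished chunks, so each link carries roughly 2(N-1)/N of the data rather than N full copies.

```python
import numpy as np

def ring_allreduce(arrays):
    """Simulate a ring all-reduce (sum) over n equal-length rank arrays."""
    n = len(arrays)
    # Each rank splits its array into n chunks.
    chunks = [np.array_split(a.astype(float), n) for a in arrays]

    # Phase 1 - reduce-scatter: in step s, rank i sends chunk (i - s) % n to
    # rank (i + 1) % n, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(i, (i - s) % n, chunks[i][(i - s) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # Now rank i owns the fully reduced chunk (i + 1) % n.

    # Phase 2 - all-gather: circulate the reduced chunks around the ring.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data

    # Every rank now holds the full reduced array.
    return [np.concatenate(chunks[i]) for i in range(n)]
```

The snapshot of `sends` before applying each step mimics all ranks communicating simultaneously; only neighbor-to-neighbor links are ever used.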
&lt;h2 id="what-we-achieved-so-far">What we achieved, so far&lt;/h2>
&lt;p>It is probably best to tell the story through the sequence of pull requests.&lt;/p>
&lt;h3 id="core-changes-in-distributedarray-pr-130httpsgithubcompylopspylops-mpipull130">Core Changes in DistributedArray (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/130" target="_blank" rel="noopener">PR #130&lt;/a>)&lt;/h3>
&lt;p>This PR introduces NCCL support into the &lt;code>DistributedArray&lt;/code> class. The design allows users to optionally pass both a &lt;code>NcclCommunicator&lt;/code> and a &lt;code>MPI.Comm&lt;/code>. By doing so, small control data (e.g., shape, dtype) is still exchanged via MPI, leveraging Python&amp;rsquo;s flexibility and minimizing performance impact. As you will see, this decision to keep two communicators turns out to be a good call.
This is how the &lt;code>__init__&lt;/code> method of &lt;code>DistributedArray&lt;/code> looks with the new addition marked by a comment:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">global_shape&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Comm&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">COMM_WORLD&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">NcclCommunicatorType&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Added to this line&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Partition&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">SCATTER&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">int&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">local_shapes&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">]]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mask&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Integral&lt;/span>&lt;span class="p">]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="the-cupys-nccl-api">CuPy&amp;rsquo;s NCCL API&lt;/h3>
&lt;p>NCCL&amp;rsquo;s API (&lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html" target="_blank" rel="noopener">mirroring its C++ origins&lt;/a>) is minimalistic and requires manual memory management. One prominent example is the implementation of &lt;code>allGather()&lt;/code>. Previously, using &lt;code>mpi4py&lt;/code>, we could leverage Python&amp;rsquo;s dynamic typing (everything is an object, so one just sends the object), which means &lt;code>mpi4py&lt;/code> allows different ranks to send arrays of different sizes. NCCL, by contrast, requires every rank in the communicator to send the same size. To work around this, we implemented padding and reshaping logic for multi-dimensional arrays. NCCL treats arrays as contiguous byte streams, so padding must be handled carefully &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>.&lt;/p>
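&lt;p>The padding workaround can be sketched as follows, simulating the equal-size requirement with plain lists; the helper name is hypothetical and this is not the actual PyLops-MPI implementation:&lt;/p>

```python
def padded_allgather(local_arrays):
    """Gather unequal 1-D arrays via equal-size (NCCL-style) chunks."""
    sizes = [len(a) for a in local_arrays]   # small metadata, sent via MPI
    pad_to = max(sizes)
    # Pad every send buffer to the same length, as ncclAllGather expects.
    padded = [list(a) + [0] * (pad_to - len(a)) for a in local_arrays]
    # ncclAllGather delivers the concatenation of all send buffers.
    gathered = [x for buf in padded for x in buf]
    # Trim the padding back off using the exchanged sizes.
    out = []
    for rank, size in enumerate(sizes):
        out.extend(gathered[rank * pad_to: rank * pad_to + size])
    return out


# e.g. three ranks holding 3, 1 and 2 elements respectively
print(padded_allgather([[1, 2, 3], [4], [5, 6]]))  # [1, 2, 3, 4, 5, 6]
```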
&lt;p>Moreover, we had to accommodate NCCL’s lower-level API, which lacks conveniences such as communicator split variants. Internally, we introduced unified abstractions such as &lt;code>_allgather()&lt;/code>, &lt;code>_allreduce()&lt;/code>, &lt;code>send()&lt;/code>, &lt;code>recv()&lt;/code>, etc. in &lt;code>DistributedArray&lt;/code> and modified the communication model to work seamlessly whether MPI or NCCL is used. This lets other developers focus on developing new operators that suit their needs, while the existence of different communicators is abstracted away.&lt;/p>
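&lt;p>A minimal sketch of the dual-communicator dispatch idea, with illustrative names rather than PyLops-MPI&amp;rsquo;s actual API: the backend branch lives in one internal method, so operator code never has to care which communicator is in use.&lt;/p>

```python
class FakeComm:
    """Stand-in communicator: 'reduces' by summing preset rank values."""
    def __init__(self, rank_values):
        self.rank_values = rank_values

    def allreduce(self, _send):
        return sum(self.rank_values)


class DistributedArrayLike:
    """Toy array holding both communicators, one optional."""
    def __init__(self, local, base_comm, base_comm_nccl=None):
        self.local = local
        self.base_comm = base_comm            # MPI path: always present
        self.base_comm_nccl = base_comm_nccl  # NCCL path: optional

    def _allreduce(self, x):
        # The only place that branches on the backend; callers just
        # invoke _allreduce() and stay communicator-agnostic.
        if self.base_comm_nccl is not None:
            return self.base_comm_nccl.allreduce(x)
        return self.base_comm.allreduce(x)


arr = DistributedArrayLike(local=3, base_comm=FakeComm([3, 4, 5]))
print(arr._allreduce(arr.local))  # 12
```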
&lt;p align="center">
&lt;img src="partition.png" alt="partition" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a challenge coming from having an unevenly distributed array&lt;/em>&lt;/p>
&lt;h3 id="keep-things-small-dependency-management-pr-132httpsgithubcompylopspylops-mpipull132-and-pr-135httpsgithubcompylopspylops-mpipull135">Keep things small: Dependency management (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/132" target="_blank" rel="noopener">PR #132&lt;/a> and &lt;a href="https://github.com/PyLops/pylops-mpi/pull/135" target="_blank" rel="noopener">PR #135&lt;/a>)&lt;/h3>
&lt;p>Despite adding this new capability, we are fully aware that not every user has access to a cluster of GPUs, so we do not make NCCL and CuPy mandatory dependencies. Someone installing and experimenting with PyLops-MPI for the first time is likely to run it on a single-node desktop, and we don’t want to introduce such complexity early on. This means our code has to accommodate an “optional dependency” through a “protected import”. If we had &lt;code>import cupy as cp&lt;/code> at the top of &lt;code>DistributedArray&lt;/code>, users without a GPU would encounter an error before doing anything useful at all.
In other words, our library should import CuPy and NCCL only when both the system supports it and the user asks for it. The pattern looks like &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/pylops_mpi/utils/deps.py" target="_blank" rel="noopener">this&lt;/a>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">util&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">find_spec&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="kc">None&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">getenv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;NCCL_PYLOPS_MPI&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">nccl_test&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># try import CuPy and then check for NCCL&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">nccl&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="n">available&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># success&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># unable to import but the package is installed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># package is not installed or the environment variable disables it&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Finally, set nccl_enabled flag for other module to use for protected import&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This preserves PyLops-MPI’s minimal base installation. It required carefully isolating imports and adapting the module resolution logic through a backend dispatching mechanism, something I had never had to consider before.&lt;/p>
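&lt;p>Here is a runnable distillation of that protected-import pattern, assuming only the environment variable name from the snippet above; the returned status strings are illustrative:&lt;/p>

```python
import os
import importlib
from importlib import util


def check_optional_backend(pkg="cupy", env="NCCL_PYLOPS_MPI"):
    """Probe an optional dependency without importing it at module load."""
    if int(os.getenv(env, "1")) != 1:
        return False, "disabled by environment variable"
    if util.find_spec(pkg) is None:
        return False, "package not installed"
    try:
        importlib.import_module(pkg)  # deferred import; may still fail
        return True, "available"
    except Exception:
        return False, "installed but failed to import"


# On a machine without CuPy this typically reports it as not installed.
print(check_optional_backend(pkg="cupy"))
```

&lt;p>Other modules then check the resulting flag before touching any GPU code path, keeping the base installation importable everywhere.&lt;/p>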
&lt;h3 id="the-basic-operator-with-nccl-pr-137httpsgithubcompylopspylops-mpipull137">The Basic Operator with NCCL &lt;a href="https://github.com/PyLops/pylops-mpi/pull/137" target="_blank" rel="noopener">PR 137&lt;/a>&lt;/h3>
&lt;p>We chose &lt;code>MPIVStack&lt;/code> as the first operator to implement NCCL support due to its simplicity. Several design choices emerged:&lt;/p>
&lt;h4 id="implicit-communicator-propagation">Implicit Communicator Propagation&lt;/h4>
&lt;p>We updated forward and adjoint calls to propagate the &lt;code>base_comm_nccl&lt;/code> from input to output automatically. This way, if &lt;code>x&lt;/code> is NCCL-enabled, then &lt;code>y = A @ x&lt;/code> or &lt;code>A.H @ x&lt;/code> will also be NCCL-enabled. This avoids mismatches and keeps operator pipelines consistent.&lt;/p>
&lt;p>Interestingly, and contrary to our initial expectation, the operator itself did not need to explicitly take &lt;code>base_comm_nccl&lt;/code> as an argument the way &lt;code>DistributedArray&lt;/code> does. This is good news: it reduces the chance that developers adding new operators will have to handle different communication cases, and keeps extending PyLops-MPI simple.&lt;/p>
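&lt;p>The propagation idea can be sketched like this, with toy class names standing in for &lt;code>DistributedArray&lt;/code> and an operator; note that the operator takes no NCCL argument at all:&lt;/p>

```python
class DistArray:
    """Toy DistributedArray: carries data plus both communicators."""
    def __init__(self, data, base_comm="mpi", base_comm_nccl=None):
        self.data = data
        self.base_comm = base_comm
        self.base_comm_nccl = base_comm_nccl


class ScaleOperator:
    """Toy operator; note __init__ takes no communicator at all."""
    def __init__(self, alpha):
        self.alpha = alpha

    def __matmul__(self, x):
        y = [self.alpha * v for v in x.data]
        # Propagate both communicators from input to output, so
        # y = A @ x is NCCL-enabled whenever x is.
        return DistArray(y, base_comm=x.base_comm,
                         base_comm_nccl=x.base_comm_nccl)


x = DistArray([1.0, 2.0], base_comm_nccl="nccl-comm")
y = ScaleOperator(3.0) @ x
print(y.data, y.base_comm_nccl)  # [3.0, 6.0] nccl-comm
```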
&lt;h4 id="optional-dual-communicator-design">Optional Dual-Communicator Design&lt;/h4>
&lt;p>As with DistributedArray, the ability to pass both an MPI communicator and an NCCL communicator proved to be a sound decision. By maintaining NCCL as an optional backend, we gain fine-grained control over which communication paths use NCCL versus MPI. This flexibility allowed us to optimize performance-critical paths while retaining MPI for control messages and small metadata transfers.&lt;/p>
&lt;p>In particular, consider the communication of ghost cells, which are used for computation around boundaries, as in derivative calculations. Small metadata such as &lt;code>cell_fronts&lt;/code> (typically a list with one integer per rank), needed to allocate the send/receive buffers, continues to be transmitted via MPI. This leverages Python&amp;rsquo;s object serialization model (&lt;code>list[int]&lt;/code>) without incurring GPU synchronization costs. The actual cell arrays, which can be large, are communicated with NCCL.&lt;/p>
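&lt;p>A toy sketch of that split, with hypothetical names: size metadata travels first (the MPI-style path) so receive buffers can be sized, then the boundary slices themselves move (the NCCL-style path):&lt;/p>

```python
def exchange_ghost_cells(local_chunks, ncells=1):
    """Give each chunk a halo of its left neighbour's trailing cells."""
    # Step 1 (MPI-style): exchange tiny list[int] metadata so every rank
    # knows how many cells to expect and can allocate receive buffers.
    fronts = [min(ncells, len(c)) for c in local_chunks]

    # Step 2 (NCCL-style): ship the actual boundary arrays, which can
    # be large, from rank r to rank r + 1.
    halos = [None] * len(local_chunks)
    for r in range(len(local_chunks) - 1):
        n = fronts[r]
        halos[r + 1] = local_chunks[r][len(local_chunks[r]) - n:]
    return halos


print(exchange_ghost_cells([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# [None, [3], [6]]
```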
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Aside from enabling NCCL support for the remaining operators and their full test coverage, some exciting upcoming updates are:&lt;/p>
&lt;ul>
&lt;li>Complex-number type support for NCCL&lt;/li>
&lt;li>Benchmarking results on a real HPC system&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for Part 2, and thanks for reading!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>For the best performance, mpi4py would require buffer memory allocation as well. The mpi4py package provides two interfaces: buffered and non-buffered. Currently, PyLops-MPI takes the non-buffered approach, which suggests room for optimization.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</link><pubDate>Sun, 08 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</guid><description>&lt;h1 id="google-summer-of-code-25-optimizing-and-benchmarking-gpu-collective-communication-of-pylops-mpi-with-nccl">Google Summer of Code ‘25: Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL&lt;/h1>
&lt;p>My project aims to introduce GPU-to-GPU collective communication calls using Nvidia&amp;rsquo;s NCCL to &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, an extension of the powerful &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> library.&lt;/p>
&lt;p>I&amp;rsquo;m incredibly grateful for this opportunity and excited to be mentored by two HPC experts, Yuxi Hong from Lawrence Berkeley National Laboratory and Matteo Ravasi from ShearWater GeoServices.&lt;/p>
&lt;p>Here&amp;rsquo;s also the link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/C2XSZp2E" target="_blank" rel="noopener">proposal&lt;/a>&lt;/p>
&lt;h2 id="what-is-pylops-mpi-and-nccl-">What is PyLops-MPI and NCCL ?&lt;/h2>
&lt;p>PyLops is a Python library that provides a rich collection of linear operators to solve inverse problems. Its MPI extension, PyLops-MPI, takes this a step further by enabling these operations to run on large-scale, distributed computing systems like HPC using the Message-Passing Interface (MPI).&lt;/p>
&lt;p>Where does NCCL fit in? The NVIDIA Collective Communication Library (NCCL) is a library of highly optimized routines for collective communication between GPUs. It offers the opportunity to close a performance gap in PyLops-MPI: as we offload more and more computationally intensive tasks to GPUs, the communication between them can become a bottleneck. NCCL offers a powerful solution to this problem, enabling high-bandwidth, low-latency communication that can significantly boost performance.&lt;/p>
&lt;h2 id="motivation-and-what-was-missing">Motivation and What was Missing&lt;/h2>
&lt;p>As a student with a background in geophysics (B.Sc.) and now pursuing computer science (M.Sc.), I&amp;rsquo;ve experienced firsthand the challenges of scaling scientific computing research from a personal desktop to a high-performance computing (HPC) cluster. It can be a significant hurdle. My project aims to ease this transition for PyLops-MPI users. PyLops-MPI is something I wish had existed while I was doing my undergraduate research!&lt;/p>
&lt;p>Currently, PyLops-MPI is &amp;ldquo;CUDA-aware,&amp;rdquo; meaning it can offload computations to GPUs. However, the communication between those GPUs is still handled by the underlying MPI implementation, which isn&amp;rsquo;t always optimal. This project will address this gap by integrating NCCL to handle GPU-to-GPU communication directly: if the computation is done on the GPU, there shouldn&amp;rsquo;t be a copy from GPU to CPU, a transfer with MPI, and a copy back to the GPU again.&lt;/p>
&lt;p>This will be especially impactful for memory-bound problems where high-bandwidth communication is critical. By the end of this project, we&amp;rsquo;ll have a clear, quantifiable understanding of the performance gains achieved.&lt;/p>
&lt;h2 id="my-best-laid-plan">My Best-Laid Plan&lt;/h2>
&lt;p>My approach is grounded in good software engineering practices to ensure that this new feature is both robust and genuinely useful. I was impressed by the code quality of the repository (an enjoyable read), and I am committed to not breaking it.&lt;/p>
&lt;p>First and foremost, the goal is to seamlessly integrate NCCL without breaking what already works. A significant part of my effort will be dedicated to rigorous testing. This means not only ensuring that all existing tests pass but also developing a new, comprehensive test suite to validate the correctness of the GPU-to-GPU communication across different hardware setups.&lt;/p>
&lt;p>Once we&amp;rsquo;re confident that the integration is solid, the exciting part begins: benchmarking (or you may call it &amp;ldquo;Moment of Truth&amp;rdquo;)! The plan is to measure the performance of end-to-end iterative solvers. These solvers are a perfect test case because they involve a mix of intensive gradient computations on the GPU and frequent AllReduce calls to sync up processes. This will give us a clear picture of the speedup and efficiency gains from using NCCL.&lt;/p>
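&lt;p>As a preview of why these solvers stress AllReduce, here is a toy distributed gradient descent on &lt;code>||Ax - b||^2&lt;/code>: each &amp;ldquo;rank&amp;rdquo; computes a partial gradient over its rows, the partials are summed across ranks every iteration, and all ranks apply the same update. These are pure-Python stand-ins, not PyLops-MPI code:&lt;/p>

```python
def allreduce_sum(partials):
    # Stand-in for an MPI/NCCL AllReduce: elementwise sum of the
    # per-rank vectors, with the result visible to every rank.
    return [sum(col) for col in zip(*partials)]


def distributed_gradient_descent(row_blocks, b_blocks, x, lr=0.1, iters=60):
    """Minimize ||Ax - b||^2 with rows of A split across 'ranks'."""
    for _ in range(iters):
        partials = []
        for rows, bs in zip(row_blocks, b_blocks):  # loop body = one rank
            g = [0.0] * len(x)
            for row, bi in zip(rows, bs):
                resid = sum(a * xi for a, xi in zip(row, x)) - bi
                for j, a in enumerate(row):
                    g[j] += 2.0 * a * resid
            partials.append(g)
        g = allreduce_sum(partials)  # the per-iteration sync point
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Two ranks each hold one row of a 2x2 identity system; solution is [1, 2].
x = distributed_gradient_descent([[[1.0, 0.0]], [[0.0, 1.0]]],
                                 [[1.0], [2.0]], [0.0, 0.0])
print([round(v, 3) for v in x])  # [1.0, 2.0]
```

&lt;p>Because the AllReduce fires every iteration, its latency and bandwidth directly shape end-to-end solver time, which is exactly what the benchmarks will measure.&lt;/p>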
&lt;p>Finally, to make sure this work benefits the entire community, I will create clear documentation and tutorials. The goal is to make it easy for any user to leverage this new GPU-accelerated communication in their own research and applications.&lt;/p></description></item></channel></rss>