<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LBL | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/lbl/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/lbl/index.xml" rel="self" type="application/rss+xml"/><description>LBL</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 05 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>LBL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/lbl/</link></image><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-2">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 2&lt;/h1>
&lt;p>Hello all! 👋 This is Tharit again. I want to share this blog post about Part 2 of my Google Summer of Code project. In case you missed it, you can take a look at &lt;a href="https://ucsc-ospo.github.io/report/osre25/lbl/pylops-mpi/20250723-tharit/" target="_blank" rel="noopener">Part 1&lt;/a> as well. Without further ado, the following features have been added since last time.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="complex-number-support-pr-148httpsgithubcompylopspylops-mpipull148">Complex Number Support &lt;a href="https://github.com/PyLops/pylops-mpi/pull/148" target="_blank" rel="noopener">PR #148&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Between this PR and the previous one, there was a lot of debugging and testing to make sure that every existing &lt;code>MPILinearOperator&lt;/code> works under NCCL as it does with &lt;code>mpi4py&lt;/code>: PRs &lt;a href="https://github.com/PyLops/pylops-mpi/pull/141" target="_blank" rel="noopener">#141&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/142" target="_blank" rel="noopener">#142&lt;/a>, and &lt;a href="https://github.com/PyLops/pylops-mpi/pull/145" target="_blank" rel="noopener">#145&lt;/a>&lt;/em>&lt;/p>
&lt;p>Most PyLops-MPI users are scientists and engineers working on scientific problems, and many scientific problems involve complex numbers (the Fourier transform touches many things). &lt;em>NCCL does not support complex numbers out of the box&lt;/em>.&lt;/p>
&lt;p>It turned out that adding complex-number support was not a big issue. A complex number is simply a contiguous array of, say, &lt;code>float64&lt;/code>: unlike a typical &lt;code>float64&lt;/code>, one &lt;code>complex128&lt;/code> element is represented by two &lt;code>float64&lt;/code> values. Things get more complicated once we start talking about complex-number arithmetic. Luckily, &lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclRedOp_t" target="_blank" rel="noopener">NCCL semantics&lt;/a> only supports the &lt;em>element-wise&lt;/em> reductions &lt;code>ncclSum&lt;/code>, &lt;code>ncclProd&lt;/code>, &lt;code>ncclMin&lt;/code>, &lt;code>ncclMax&lt;/code>, and &lt;code>ncclAvg&lt;/code>, and wrapping element-wise operations for complex numbers is straightforward.&lt;/p>
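&lt;p>To see why element-wise reductions carry over, note that a complex array can be reinterpreted as a float array of twice the size, and summing the float views element-wise is identical to summing the complex values. A minimal NumPy sketch of the idea (illustrative only, not PyLops-MPI code):&lt;/p>

```python
import numpy as np

# A complex128 array can be reinterpreted as a float64 array of twice the
# size: each complex element contributes one real and one imaginary float.
a = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)
b = np.array([5 + 6j, 7 + 8j], dtype=np.complex128)

# Element-wise sum of the float views equals the float view of the complex
# sum, so a sum-reduction over float buffers is valid for complex data.
summed_view = a.view(np.float64) + b.view(np.float64)
assert np.array_equal(summed_view, (a + b).view(np.float64))
assert a.view(np.float64).size == 2 * a.size
```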
&lt;p>The change to PyLops-MPI &lt;code>_nccl.py&lt;/code> itself is minimal. We simply added the function below, which hides the complexity of buffer-size management from users.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">_nccl_buf_size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">count&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34; Get an appropriate buffer size according to the dtype of buf
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> if buf.dtype in [&amp;#39;complex64&amp;#39;, &amp;#39;complex128&amp;#39;]:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> return 2 * count if count else 2 * buf.size
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> else:
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> return count if count else buf.size
&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The conceptual is quite simple. But mechanically, to get it right in the general case required some extensive bug fixing, particularly in the call to &lt;code>_allgather &lt;/code>as noted earlier in the &amp;ldquo;Core Change&amp;rdquo; section. The array needs some preprocessing (to align with NCCL semantics) and post-processing so that the result from Pylops-MPI’s NCCL allgather matches with the PyLops-MPI allgather. This is because Pylops-MPI must be able to switch between &lt;code>mpi4py&lt;/code> and NCCL seamlessly from the user&amp;rsquo;s perspective. To make it concrete, here is how we do the &lt;code>_allgather()&lt;/code> with NCCL&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="nf">_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Allgather operation
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">deps&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">nccl_enabled&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="nb">isinstance&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nb">tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">send_shapes&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">(&lt;/span>&lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">_prepare_nccl_allgather_inputs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">raw_recv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">_unroll_nccl_allgather_recv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">raw_recv&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt; snip - MPI allgather &amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>After this feature was added, PyLops-MPI with NCCL caught up with the original MPI implementation, i.e., the test coverage is now the same: 306 tests passed!&lt;/strong>&lt;/p>
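&lt;p>The pad-and-unroll round trip in &lt;code>_allgather&lt;/code> can be pictured with a small NumPy sketch. The helpers below are hypothetical, simplified stand-ins for &lt;code>_prepare_nccl_allgather_inputs&lt;/code> and &lt;code>_unroll_nccl_allgather_recv&lt;/code> (the real code operates on GPU buffers): every rank pads its array to the largest shape, the padded buffers are gathered, and each chunk is then sliced back to its true size.&lt;/p>

```python
import numpy as np

def pad_to_max(send_buf, send_shapes):
    # hypothetical stand-in: pad the local array to the largest gathered size
    max_size = max(int(np.prod(s)) for s in send_shapes)
    padded = np.zeros(max_size, dtype=send_buf.dtype)
    padded[:send_buf.size] = send_buf.ravel()
    return padded

def unroll(raw_recv, chunk_size, send_shapes):
    # hypothetical stand-in: slice each rank's chunk and drop the padding
    out = []
    for i, shape in enumerate(send_shapes):
        n = int(np.prod(shape))
        chunk = raw_recv[i * chunk_size : i * chunk_size + n]
        out.append(chunk.reshape(shape))
    return out

# Simulate a 2-rank allgather where ranks hold differently sized arrays.
bufs = [np.arange(3.0), np.arange(5.0)]
shapes = [b.shape for b in bufs]
padded = [pad_to_max(b, shapes) for b in bufs]
raw = np.concatenate(padded)  # what a real allgather would produce
gathered = unroll(raw, padded[0].size, shapes)
assert [g.tolist() for g in gathered] == [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0, 3.0, 4.0]]
```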
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-instrumentation-pr-157httpsgithubcompylopspylops-mpipull157">Benchmark Instrumentation &lt;a href="https://github.com/PyLops/pylops-mpi/pull/157" target="_blank" rel="noopener">PR #157&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>Profiling distributed GPU operations is critical to understanding performance bottlenecks. To make this easier, we added a &lt;em>lightweight benchmark instrumentation&lt;/em> framework in PyLops-MPI. The goal was to allow developers to mark execution points in a function and collect timing information for these markers.&lt;/p>
&lt;p>The core of the implementation is a &lt;code>@benchmark&lt;/code> decorator. Inside a decorated function, developers can call &lt;code>mark(label)&lt;/code> to record the time at specific points. After the function completes, the timings are reported in a human-readable format. This design is inspired by C++-style instrumentation, letting developers place markers directly in the code where they are most informative.&lt;/p>
&lt;p>But because we are in Python, to handle nested function calls we collect the timing information as a stack (a bottom-up call graph) and parse the result at the end of the decorated function. Here is an illustration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="nd">@benchmark&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">outer_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func start&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">inner_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># &amp;lt;- this does `dot` and is also decorated&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">global_shape&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;global_shape&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;partition&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;dtype&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;axis&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">dist_arr&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func ends&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The text output is&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[decorator]outer_func_with_mark: total runtime: 0.001206 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> [decorator]inner_func_with_mark: total runtime: 0.000351 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin array constructor--&amp;gt;Begin dot: 0.000026 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin dot--&amp;gt;Finish dot: 0.000322 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Outer func start--&amp;gt;Outer func ends: 0.001202 s
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking is controlled via the environment variable &lt;code>BENCH_PYLOPS_MPI&lt;/code>. It defaults to &lt;code>1&lt;/code> (enabled) but can be set to &lt;code>0&lt;/code> to skip benchmarking for clean output. &lt;strong>This means users can leave the decorated code unchanged and disable the benchmark through the environment variable&lt;/strong>. This is inspired by the C++ debug flags set at compile time. Moreover, careful attention had to be paid to concurrency issues in benchmarking, because time is recorded by the CPU while NCCL issues operations asynchronously to a CUDA stream; &lt;a href="https://github.com/PyLops/pylops-mpi/pull/163" target="_blank" rel="noopener">PR #163&lt;/a> is an example of this.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-result">Benchmark Result&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>This was the moment of truth. Our 12 weeks of hard work would be judged by a set of cold, hard numbers. Our expectations were:&lt;/p>
&lt;ul>
&lt;li>If the system does not have proprietary NVLink for GPU-GPU communication but is NCCL-compatible, communication using &lt;code>CuPy + NCCL&lt;/code> should still be faster than &lt;code>NumPy + MPI&lt;/code> (and possibly &lt;code>CuPy + MPI&lt;/code>) in PyLops-MPI, i.e., there should be a benefit from the communication-related optimizations enabled by this project.&lt;/li>
&lt;/ul>
&lt;p>The result below was from the NCSA UIUC Delta system, a &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">4-way NVIDIA A40 GPU&lt;/a> node (no NVLink), with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b139e63d-11ed-47f4-95f8-5e86bed26312" />
&lt;/p>
&lt;p>That meets our expectation. One thing to note here: &lt;code>CuPy + MPI&lt;/code> communication is actually slower than &lt;code>NumPy + MPI&lt;/code>. This is because the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> - see details &lt;a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html" target="_blank" rel="noopener">here&lt;/a>. The choice was made for its simplicity, as it allows sending and receiving generic Python objects wrapped in a &lt;code>list&lt;/code> and thus a fast development process. However, these calls require copying memory from GPU to CPU, communicating, and then copying from CPU back to GPU (the pickle protocol) - see our discussion with the &lt;code>mpi4py&lt;/code> community &lt;a href="https://github.com/mpi4py/mpi4py/discussions/657" target="_blank" rel="noopener">here&lt;/a>. This leads us to the &amp;ldquo;Things left to do&amp;rdquo; section below.&lt;/p>
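&lt;p>The overhead comes from the generic-object protocol itself: mpi4py&amp;rsquo;s lowercase methods (e.g. &lt;code>comm.allreduce&lt;/code>) pickle arbitrary Python objects, while the uppercase methods (e.g. &lt;code>comm.Allreduce&lt;/code>) operate directly on memory buffers. The pickle round trip below simulates what the lowercase path does to an array (a standalone sketch, no MPI required):&lt;/p>

```python
import pickle
import numpy as np

a = np.ones(1_000_000, dtype=np.float64)

# Lowercase mpi4py methods serialize the object with pickle, send the bytes,
# and deserialize on the receiver: a full extra copy on each side (and, for
# GPU arrays, a device-to-host transfer first).
wire_bytes = pickle.dumps(a)
received = pickle.loads(wire_bytes)

# The round trip yields an equal array but a brand-new memory buffer.
assert np.array_equal(received, a)
assert received.ctypes.data != a.ctypes.data

# Uppercase methods (e.g. comm.Allreduce) instead read and write the array's
# existing buffer directly, with no serialization step.
```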
&lt;ul>
&lt;li>If the system has NVLink for GPU-GPU communication, we will see a significant performance gain in PyLops-MPI with NCCL.&lt;/li>
&lt;/ul>
&lt;p>The result below is also from the NCSA UIUC Delta system, this time an &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">8-way NVIDIA H200 GPU&lt;/a> node (with NVLink), but we used only 4 GPUs to compare with the previous result. This is also with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b3d83547-b9af-4b1c-87c0-ace2302eb140" />
&lt;/p>
&lt;p>Here we unleash the true power of NCCL and its infrastructure: &lt;strong>the bandwidth of PyLops-MPI with NCCL is 800x that of the MPI implementation!&lt;/strong> It may not make much sense to compare this number with &lt;code>NumPy + MPI&lt;/code>, because a drastic hardware infrastructure upgrade is involved.&lt;/p>
&lt;p>To top things off, we also ran an experiment trying to saturate the communication, with the array size going up to 32 GB in total. We can see linear scaling, i.e., time vs. data size grows linearly.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/e5a95fdc-8db7-4caf-925f-256f504603bc" />
&lt;/p>
&lt;p>Finally, we ran an experiment with the application of &lt;a href="https://wiki.seg.org/wiki/Least-squares_migration" target="_blank" rel="noopener">Least-squares Migration&lt;/a>, which is an iterative inversion scheme:&lt;/p>
&lt;ul>
&lt;li>Each iteration applies a forward &lt;code>A&lt;/code> and an adjoint &lt;code>A.T&lt;/code> operation to form residuals and gradients.&lt;/li>
&lt;li>Gradient accumulation requires a global reduction across processes with &lt;code>allreduce&lt;/code>.
Note that the computation is not trivial, so the total run-times of CPU and GPU are not fairly comparable (notice that on the H200, &lt;code>CuPy + MPI&lt;/code> is no longer the slowest). But we want to give an idea of how things piece together in a real application.&lt;/li>
&lt;/ul>
&lt;div align="center">
&lt;img width="400" height="300" alt="kirchA40"
src="https://gist.github.com/user-attachments/assets/46c3a76a-20a3-40c3-981e-6e1c4acecb49" />
&lt;img width="400" height="300" alt="kirchhoff_h200"
src="https://gist.github.com/user-attachments/assets/1439304a-8f78-4640-a78b-ba37238b26e6" />
&lt;/div>
&lt;h3 id="the-impact-of-this-gsoc-project-is-clear">The impact of this GSoC project is clear:&lt;/h3>
&lt;p>With our NCCL-enabled PyLops-MPI,&lt;/p>
&lt;ul>
&lt;li>if you don&amp;rsquo;t have access to state-of-the-art infrastructure, PyLops-MPI with NCCL can still deliver 10x the communication bandwidth (the A40 case);&lt;/li>
&lt;li>if you do, we let you get the most out of the system (the H200 case).&lt;/li>
&lt;/ul>
&lt;p>And best of all, using NCCL with PyLops-MPI requires minimal code changes, as shown in this &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/tutorials_nccl/lsm_nccl.py" target="_blank" rel="noopener">LSM Tutorial&lt;/a> and illustrated below. Only two changes are required relative to the code that runs on MPI: the arrays must be allocated on the GPU, and the NCCL communicator has to be passed to the &lt;code>DistributedArray&lt;/code>. And that&amp;rsquo;s it!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_comm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">_nccl&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">initialize_nccl_comm&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - same set-up as running with MPI&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LSM&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">wav&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x0&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">VStack&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">BROADCAST&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">nccl_comm&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Explicitly pass nccl communicator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Must use CuPy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - the rest is the same&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="things-left-to-do">Things left to do&lt;/h3>
&lt;ul>
&lt;li>CUDA-aware MPI: As we pointed out in the A40 experiment, the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> and thus introduces memory copying from GPU to CPU. We aim to optimize this by introducing buffered calls. However, this is not a trivial task, because some of the MPI-related code was developed around the semantics that a communication returns a &lt;code>list&lt;/code> object, while a buffered call returns an array instead.&lt;/li>
&lt;/ul></description></item><item><title>AIDRIN Privacy-Centric Enhancements: Backend &amp; UX Upgrades</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250725-harish_balaji/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250725-harish_balaji/</guid><description>&lt;p>⏱️ Reading time: 5–6 minutes&lt;/p>
&lt;p>Hey everyone,&lt;/p>
&lt;p>If you’ve ever wondered what it takes to make AI data pipelines not just smarter, but safer and more transparent, you’re in the right place. The last few weeks working on AIDRIN for GSoC have been a deep dive into the engine room of privacy and backend systems that power the AIDRIN project. My focus has been on building out the core privacy infrastructure and backend features that power AIDRIN’s ability to give users real, actionable insights about their data. It’s been challenging, sometimes messy, but incredibly rewarding to see these changes make a tangible difference.&lt;/p>
&lt;p>Having Dr. Jean Luca Bez and Prof. Suren Byna as mentors, along with the support of the entire team, has truly made all the difference. Their guidance, encouragement, and collaborative spirit have been a huge part of this journey, whether I’m brainstorming new ideas or just trying to untangle a tricky bug.&lt;/p>
&lt;h2 id="privacy-metrics-making-data-safer">Privacy Metrics: Making Data Safer&lt;/h2>
&lt;p>A major part of my work has been putting data privacy at the front and center in AIDRIN. I focused on integrating essential privacy metrics like k-anonymity, l-diversity, t-closeness, and more, making sure they’re not just theoretical checkboxes, but real tools that users can interact with and understand. Now, these metrics are fully wired up in the backend and visualized in AIDRIN, so privacy risks are no longer just a vague concern. They are something AI data preparers can actually see and act on. Getting these metrics to work seamlessly with different datasets and ensuring their accuracy took some serious backend engineering, but the payoff has been worth it.&lt;/p>
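&lt;p>For readers unfamiliar with these metrics, k-anonymity is the size of the smallest group of records that share a combination of quasi-identifier values. A toy computation (illustrative only, not AIDRIN&amp;rsquo;s implementation):&lt;/p>

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    # k is the size of the smallest equivalence class: the minimum number of
    # records sharing the same quasi-identifier combination
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

records = [
    {"age": "30-40", "zip": "95060", "diagnosis": "A"},
    {"age": "30-40", "zip": "95060", "diagnosis": "B"},
    {"age": "40-50", "zip": "95064", "diagnosis": "A"},
]
# The (40-50, 95064) class contains a single record, so k = 1: that
# individual is re-identifiable from age bracket and zip code alone.
assert k_anonymity(records, ["age", "zip"]) == 1
```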
&lt;h2 id="speeding-things-up-so-you-dont-have-to-wait-around">Speeding Things Up (So You Don’t Have To Wait Around)&lt;/h2>
&lt;p>As AIDRIN started handling bigger datasets, some of the calculations became time-consuming because the data has to be accessed every time a metric is computed. To address this, I added caching for previously computed metrics, like class imbalance and privacy checks, and set up asynchronous execution with Celery and Redis. This should make the app super responsive. Rather than waiting for heavy computations to finish, one can start taking notes about other metrics or explore different parts of the app while the results load in the background. It&amp;rsquo;s a small change, but it helps keep the workflow moving smoothly.&lt;/p>
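&lt;p>Conceptually, such a cache keys results by dataset and metric so repeated requests skip recomputation. A simplified in-memory sketch (illustrative only; AIDRIN uses Celery and Redis for the real asynchronous and caching machinery):&lt;/p>

```python
_cache = {}
calls = []

def compute_metric(dataset_id, metric, compute_fn):
    # return a cached result when available; otherwise compute and store it
    key = (dataset_id, metric)
    if key not in _cache:
        calls.append(key)  # track which requests triggered real computation
        _cache[key] = compute_fn()
    return _cache[key]

r1 = compute_metric("d1", "class_imbalance", lambda: 0.42)
r2 = compute_metric("d1", "class_imbalance", lambda: 0.42)
assert r1 == r2 == 0.42
assert len(calls) == 1  # the second request was served from the cache
```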
&lt;h2 id="small-touch-ups-that-hopefully-make-a-big-difference">Small Touch Ups That (Hopefully) Make a Big Difference&lt;/h2>
&lt;p>I also spent time on the details that make the app easier to use. Tooltips now explain what the privacy metrics actually mean, error messages are clearer, and there’s a new cache info page where you can see and clear your cached data. The sensitive attribute dropdown is less confusing now, especially if you’re working with quasi-identifiers. These tweaks might seem minor, but they add up and make the app friendlier for everyone.&lt;/p>
&lt;h2 id="docs-docs-docs">Docs, Docs, Docs&lt;/h2>
&lt;p>I’m a big believer that good documentation is just as important as good code. I updated the docs to cover all the new features, added citations for the privacy metrics, and made the install process a bit more straightforward. Hopefully, this means new users and contributors can get up to speed without too much hassle.&lt;/p>
&lt;h2 id="huge-thanks-to-my-mentors-and-the-team">Huge Thanks to My Mentors and the Team&lt;/h2>
&lt;p>I really want to shine a light on Dr. Bez, Prof. Byna, and the entire AIDRIN team here. Their encouragement, practical advice, and collaborative spirit have been a huge part of my progress. Whether I’m stuck on a bug, brainstorming a new feature, or just need a second opinion, there’s always someone ready to help me think things through. Their experience and support have shaped not just the technical side of my work, but also how I approach problem-solving and teamwork.&lt;/p>
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Looking ahead, I’m planning to expand AIDRIN’s support for multimodal datasets and keep refining the privacy and fairness modules. There’s always something new to learn or improve, and I’m excited to keep building. If you’re interested in data quality, privacy, or open-source AI tools, I’d love to connect and swap ideas.&lt;/p>
&lt;p>Thanks for reading and for following along with my GSoC journey. I’ll be back soon with more updates!&lt;/p>
&lt;p>&lt;em>This is the second post in my 3-part GSoC series with AIDRIN. Stay tuned for the final update.&lt;/em>&lt;/p></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-1">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 1&lt;/h1>
&lt;p>Hello all! 👋 My name is Tharit, and I&amp;rsquo;m a computer science student at the University of Texas at Austin. This summer, I am fortunate to participate in the Google Summer of Code (GSoC) 2025 program, hosted by &lt;a href="https://ucsc-ospo.github.io/" target="_blank" rel="noopener">UC OSPO&lt;/a> and the &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> team. My project focuses on enabling NCCL GPU-to-GPU communication in &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, under the guidance of mentors Matteo Ravasi and Yuxi Hong.&lt;/p>
&lt;p>You might have come across this post if you&amp;rsquo;re a PyLops user interested in scaling &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> with GPU/NCCL support, or if you&amp;rsquo;re exploring GSoC projects and wondering what we are up to. Either way, I hope this post gives you useful insights.&lt;/p>
&lt;h2 id="what-is-pylops-mpi">What is PyLops-MPI?&lt;/h2>
&lt;p>If you&amp;rsquo;ve worked with inverse problems, you&amp;rsquo;ve likely come across &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a>. It&amp;rsquo;s a Python library that provides an extensive suite of linear operators and solvers. Operators are designed with a clear focus on the forward and adjoint pair (A and A.T), whilst solvers take operators and data to solve the associated inverse problem. In fields such as geophysics, astrophysics, or medical imaging, inverse problems are routinely solved to &lt;a href="https://www.ae.utexas.edu/news/inverse-problem-solving-bui-than" target="_blank" rel="noopener">image the Earth, space, or the human body from remote measurements&lt;/a>. In all cases, real-life problems tend to demand substantial compute and memory. PyLops allows users to express these problems in an abstract manner that is reminiscent of the underlying equations whilst not compromising on efficiency.&lt;/p>
&lt;p>&lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> is the distributed extension of PyLops, introduced during &lt;a href="https://summerofcode.withgoogle.com/archive/2023/projects/eNJTJO25" target="_blank" rel="noopener">GSoC 2023&lt;/a>. It enables users to scale their computations over CPU and GPU clusters via MPI. However, up until now, even GPU-based communications were routed through MPI, introducing potential performance bottlenecks.&lt;/p>
&lt;h2 id="the-goal-of-the-project">The Goal of the Project&lt;/h2>
&lt;p>Our goal is to take PyLops-MPI to the next level by enabling GPU-to-GPU collective communications directly using NVIDIA NCCL. This allows full utilization of high-bandwidth interconnects like &lt;a href="https://www.nvidia.com/en-us/data-center/nvlink/" target="_blank" rel="noopener">NVLink&lt;/a>, and avoids unnecessary memory transfers through the host CPU.
This blog marks the midpoint of the program (week 6 of 12), and I’d like to reflect on the progress so far, challenges faced, and what&amp;rsquo;s coming next.&lt;/p>
&lt;h2 id="what-is-a-collective-communication-anyway">What is a Collective Communication anyway?&lt;/h2>
&lt;p>In PyLops-MPI, distributed computations require nodes to exchange information, for example, during gradient computations or reductions in iterative solvers. A naive implementation (useful for a thought experiment) would involve each node taking turns broadcasting data, which can be quite slow.
NVIDIA’s NCCL abstracts away the complexity of topology-aware communication. For example, in the image below, if the GPUs communicate most efficiently in a ring for an all-reduce operation, NCCL will automatically pick that layout and avoid the GPU 01-GPU 04 and GPU 02-GPU 03 communication links.&lt;/p>
&lt;p align="center">
&lt;img src="network.png" alt="network" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a compute node with 4 GPUs attached, directly connected to each other with NVLink&lt;/em>&lt;/p>
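&lt;p>To make the ring idea concrete, here is a small pure-NumPy simulation of a ring all-reduce (illustrative only: real NCCL runs these phases in parallel on device buffers, and the function name is mine). Each rank forwards one segment per step during a reduce-scatter phase, then the fully reduced segments circulate back around the ring in an all-gather phase.&lt;/p>

```python
import numpy as np

def ring_allreduce(buffers):
    """Simulate a ring all-reduce over a list of equal-length per-rank arrays.

    After the reduce-scatter and all-gather phases, every rank holds the
    element-wise sum of all input buffers.
    """
    n = len(buffers)
    # Every rank splits its buffer into n segments in the same way.
    segs = [list(np.array_split(b.astype(float), n)) for b in buffers]
    # Phase 1: reduce-scatter. In step s, rank r forwards segment (r - s) % n
    # to its ring neighbour (r + 1) % n, accumulating partial sums.
    for step in range(n - 1):
        for r in range(n):
            k = (r - step) % n
            dst = (r + 1) % n
            segs[dst][k] = segs[dst][k] + segs[r][k]
    # After n-1 steps, rank r holds the fully reduced segment (r + 1) % n.
    # Phase 2: all-gather. The reduced segments circulate around the ring.
    for step in range(n - 1):
        for r in range(n):
            k = (r + 1 - step) % n
            dst = (r + 1) % n
            segs[dst][k] = segs[r][k]
    return [np.concatenate(s) for s in segs]
```

&lt;p>Note how each rank only ever talks to its immediate neighbour, which is exactly why the diagonal links in the figure above go unused.&lt;/p>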
&lt;h2 id="what-we-achieved-so-far">What we achieved, so far&lt;/h2>
&lt;p>It is probably best to tell the story through the sequence of pull requests.&lt;/p>
&lt;h3 id="core-changes-in-distributedarray-pr-130httpsgithubcompylopspylops-mpipull130">Core Changes in DistributedArray (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/130" target="_blank" rel="noopener">PR #130&lt;/a>)&lt;/h3>
&lt;p>This PR introduces NCCL support into the &lt;code>DistributedArray&lt;/code> class. The design allows users to optionally pass both a &lt;code>NcclCommunicator&lt;/code> and a &lt;code>MPI.Comm&lt;/code>. By doing so, small control data (e.g., shape, dtype) is still exchanged via MPI, leveraging Python&amp;rsquo;s flexibility and minimizing performance impact. As you will see, this decision to keep two communicators turns out to be a good call.
This is how the &lt;code>__init__&lt;/code> method of &lt;code>DistributedArray&lt;/code> looks with the new addition marked by a comment:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">global_shape&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Comm&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">COMM_WORLD&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">NcclCommunicatorType&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Added to this line&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Partition&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">SCATTER&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">int&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">local_shapes&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">]]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mask&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Integral&lt;/span>&lt;span class="p">]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="the-cupys-nccl-api">The CuPy&amp;rsquo;s NCCL API&lt;/h3>
&lt;p>NCCL&amp;rsquo;s API (&lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html" target="_blank" rel="noopener">mirroring its C++ origins&lt;/a>) is minimalistic and requires manual memory management. One prominent example is the implementation of &lt;code>allGather()&lt;/code>. Previously, with &lt;code>mpi4py&lt;/code>, we could leverage Python&amp;rsquo;s dynamic typing (everything is an object, so one simply sends the object), which allows different ranks to send arrays of different sizes. NCCL, by contrast, requires every rank in the communicator to send the same number of elements. To work around this, we implemented padding and reshaping logic for multi-dimensional arrays. NCCL treats arrays as contiguous byte streams, so the padding must be handled carefully &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>.&lt;/p>
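&lt;p>A rough sketch of the pad-then-trim workaround (NumPy standing in for CuPy; the helper name is mine, not the PyLops-MPI API): each rank pads its flattened buffer up to the largest local size, the equal-size gather runs, and the padding is trimmed away afterwards using the true local sizes, which are exchanged beforehand over MPI.&lt;/p>

```python
import numpy as np

def allgather_unequal(local_arrays):
    """Emulate an NCCL-style allGather when ranks hold unequal local sizes.

    ncclAllGather assumes every rank contributes the same element count,
    so each rank zero-pads its flattened buffer to the maximum local size;
    the padding is then trimmed using the true sizes.
    """
    sizes = [a.size for a in local_arrays]
    pad_to = max(sizes)
    # Each rank flattens and zero-pads its contribution to a common size.
    padded = [np.concatenate([a.ravel(), np.zeros(pad_to - a.size)])
              for a in local_arrays]
    # This concatenation is what the equal-size gather would produce.
    gathered = np.concatenate(padded)
    # Trim: recover each rank's true contribution from the byte stream.
    out = [gathered[r * pad_to : r * pad_to + sizes[r]]
           for r in range(len(local_arrays))]
    return np.concatenate(out)
```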
&lt;p>Moreover, we had to accommodate NCCL’s lower-level API, which lacks conveniences like communicator split variants. Internally, we introduced unified abstractions such as &lt;code>_allgather()&lt;/code>, &lt;code>_allreduce()&lt;/code>, &lt;code>send()&lt;/code>, &lt;code>recv()&lt;/code>, etc. in &lt;code>DistributedArray&lt;/code> and modified the communication model to work seamlessly whether MPI or NCCL is used. This way, other developers can focus on developing new operators (that suit their needs) without worrying about which communicator is underneath.&lt;/p>
&lt;p align="center">
&lt;img src="partition.png" alt="partition" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a challenge coming from having an unevenly distributed array&lt;/em>&lt;/p>
&lt;h3 id="keep-things-small-dependency-management-pr-132httpsgithubcompylopspylops-mpipull132-and-pr-135httpsgithubcompylopspylops-mpipull135">Keep things small: Dependency management (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/132" target="_blank" rel="noopener">PR #132&lt;/a> and &lt;a href="https://github.com/PyLops/pylops-mpi/pull/135" target="_blank" rel="noopener">PR #135&lt;/a>)&lt;/h3>
&lt;p>Despite adding this new capability, we are fully aware that not every user has access to a cluster of GPUs, and therefore we don’t make NCCL and CuPy mandatory dependencies. The first time someone installs and experiments with PyLops-MPI, they are likely to run it on a single-node desktop, and we don’t want to introduce such complexity early on. This means that our code has to accommodate an “optional dependency” through a “protected import”. If we had &lt;code>import cupy as cp&lt;/code> at the beginning of &lt;code>DistributedArray&lt;/code>, users without a GPU would hit an error before doing anything useful at all.
In other words, our library should import CuPy and NCCL only when the system supports them and the user asks for them. The pattern looks like &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/pylops_mpi/utils/deps.py" target="_blank" rel="noopener">this&lt;/a>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">util&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">find_spec&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="kc">None&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">getenv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;NCCL_PYLOPS_MPI&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">nccl_test&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># try import CuPy and then check for NCCL&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">nccl&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="n">available&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># success&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># unable to import but the package is installed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># package is not installed or the environment variable disables it&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Finally, set nccl_enabled flag for other module to use for protected import&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This preserves PyLops-MPI’s minimal base installation, but it required carefully isolating imports and adapting the module resolution logic with a backend-dispatching mechanism.
This is something I had never had to consider before.&lt;/p>
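&lt;p>A minimal, generic version of this protected-import pattern might look like the following (the helper and flag names here are illustrative; the real logic lives in &lt;code>pylops_mpi/utils/deps.py&lt;/code>):&lt;/p>

```python
import os
from importlib import util

def optional_backend_enabled(package, env_var):
    """Return True when `package` is installed and the environment-variable
    switch (default "1") has not disabled it.

    This mirrors the protected-import idea: probe with find_spec() instead
    of importing, so no ImportError can fire at module load time.
    """
    requested = int(os.getenv(env_var, 1)) == 1
    installed = util.find_spec(package) is not None
    return requested and installed

# Other modules then guard their imports on the resulting flag:
nccl_enabled = optional_backend_enabled("cupy", "NCCL_PYLOPS_MPI")
if nccl_enabled:
    # import cupy as cp  (safe: only runs when available and requested)
    pass
```

&lt;p>Setting &lt;code>NCCL_PYLOPS_MPI=0&lt;/code> lets a user opt out even on a machine where CuPy is installed.&lt;/p>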
&lt;h3 id="the-basic-operator-with-nccl-pr-137httpsgithubcompylopspylops-mpipull137">The Basic Operator with NCCL &lt;a href="https://github.com/PyLops/pylops-mpi/pull/137" target="_blank" rel="noopener">PR 137&lt;/a>&lt;/h3>
&lt;p>We chose &lt;code>MPIVStack&lt;/code> as the first operator to implement NCCL support due to its simplicity. Several design choices emerged:&lt;/p>
&lt;h4 id="implicit-communicator-propagation">Implicit Communicator Propagation&lt;/h4>
&lt;p>We updated forward and adjoint calls to propagate the &lt;code>base_comm_nccl&lt;/code> from input to output automatically. This way, if &lt;code>x&lt;/code> is NCCL-enabled, then &lt;code>y = A @ x&lt;/code> or &lt;code>A.H @ x&lt;/code> will also be NCCL-enabled. This avoids mismatches and keeps operator pipelines consistent.&lt;/p>
&lt;p>Interestingly, and contrary to our initial expectation, the operator itself did not need to take &lt;code>base_comm_nccl&lt;/code> as an argument the way &lt;code>DistributedArray&lt;/code> does. This is good news: it reduces the number of communication cases other developers have to handle when adding new operators, lowering the barrier to extending PyLops-MPI.&lt;/p>
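&lt;p>A toy sketch of the propagation idea (these classes are illustrative stand-ins, not the PyLops-MPI API): the operator never stores a communicator itself; the output array simply inherits whatever communicators the input carries.&lt;/p>

```python
import numpy as np

class ToyDistributedArray:
    """Toy stand-in for DistributedArray: it carries its communicators."""
    def __init__(self, data, base_comm=None, base_comm_nccl=None):
        self.data = np.asarray(data)
        self.base_comm = base_comm
        self.base_comm_nccl = base_comm_nccl

class ToyOperator:
    """Toy diagonal operator: note it never takes base_comm_nccl itself."""
    def __init__(self, diag):
        self.diag = np.asarray(diag)

    def __matmul__(self, x):
        # The output inherits the input's communicators, so an
        # NCCL-enabled x automatically yields an NCCL-enabled y.
        return ToyDistributedArray(self.diag * x.data,
                                   base_comm=x.base_comm,
                                   base_comm_nccl=x.base_comm_nccl)
```

&lt;p>With this pattern, &lt;code>y = A @ x&lt;/code> stays on the same communication backend as &lt;code>x&lt;/code> throughout an operator pipeline.&lt;/p>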
&lt;h4 id="optional-dual-communicator-design">Optional Dual-Communicator Design&lt;/h4>
&lt;p>As with DistributedArray, the ability to pass both an MPI communicator and an NCCL communicator proved to be a sound decision. By maintaining NCCL as an optional backend, we gain fine-grained control over which communication paths use NCCL versus MPI. This flexibility allowed us to optimize performance-critical paths while retaining MPI for control messages and small metadata transfers.&lt;/p>
&lt;p>In particular, when communicating ghost cells, which are used for computations around boundaries (such as derivative calculations), small metadata like cell_fronts (typically a list with one integer per rank) continues to be transmitted via MPI. This metadata is needed to allocate send/receive buffers, and sending it as a plain Python &lt;code>list[int]&lt;/code> leverages Python&amp;rsquo;s object serialization without incurring GPU synchronization costs. The actual cell arrays, which can be large, are communicated with NCCL.&lt;/p>
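&lt;p>A simplified single-process illustration of why ghost cells are needed (NumPy only, and the function is my own sketch; in the real code the neighbour&amp;rsquo;s cells arrive over NCCL while the buffer sizes travel over MPI): a forward difference at a rank&amp;rsquo;s right edge needs one cell owned by the next rank.&lt;/p>

```python
import numpy as np

def derivative_with_ghosts(chunks):
    """First-order forward difference over an array split across 'ranks'.

    Each rank appends one ghost cell taken from its right neighbour before
    differencing, so the result matches np.diff on the global array.
    """
    outs = []
    for r, local in enumerate(chunks):
        if r + 1 == len(chunks):
            ext = local                       # last rank: no right neighbour
        else:
            ghost = chunks[r + 1][:1]         # one cell 'received' from rank r+1
            ext = np.concatenate([local, ghost])
        outs.append(np.diff(ext))
    return np.concatenate(outs)
```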
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Aside from enabling NCCL support for the remaining operators (with full test coverage), other exciting upcoming updates are:&lt;/p>
&lt;ul>
&lt;li>Complex-number type support for NCCL&lt;/li>
&lt;li>Benchmarking results on a real HPC system&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for Part 2, and thanks for reading!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>For the best performance, mpi4py would require buffer memory allocation as well. The mpi4py package provides two interfaces: buffered and non-buffered. Currently, PyLops-MPI takes the non-buffered approach, which suggests room for optimization.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>Improving AI Data Pipelines in AIDRIN: A Privacy-Centric and Multimodal Expansion</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</link><pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</guid><description>&lt;p>⏱️ Reading time: 4–5 minutes&lt;/p>
&lt;p>Hi 👋&lt;/p>
&lt;p>I’m Harish Balaji, a Master’s student at NYU with a focus on Artificial Intelligence, Machine Learning, and Cybersecurity. I’m especially interested in building scalable systems that reflect responsible AI principles. For me, data quality isn’t just a technical detail. It’s a foundational aspect of building models that are reliable, fair, and reproducible in the real world.&lt;/p>
&lt;p>This summer, I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of Google Summer of Code 2025. I’m grateful to be working under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the &lt;a href="https://crd.lbl.gov/divisions/scidata/sdm/" target="_blank" rel="noopener">Scientific Data Management Group&lt;/a> at Lawrence Berkeley National Laboratory (LBNL).&lt;/p>
&lt;p>AIDRIN is an open-source framework that helps researchers and practitioners evaluate whether a dataset is truly ready to be used in production-level AI workflows. From fairness to privacy, it provides a structured lens through which we can understand the strengths and gaps in our data.&lt;/p>
&lt;h2 id="why-this-work-matters">Why this work matters&lt;/h2>
&lt;p>In machine learning, one principle always holds true:&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;Garbage in, garbage out.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;p>Even the most advanced models can underperform or amplify harmful biases if trained on incomplete, imbalanced, or poorly understood data. This is where AIDRIN steps in. It provides practical tools to assess datasets across key dimensions like privacy, fairness, class balance, interpretability, and support for multiple modalities.&lt;/p>
&lt;p>By making these characteristics measurable and transparent, AIDRIN empowers teams to make informed decisions early in the pipeline. It helps ensure that datasets are not only large or complex, but also trustworthy, representative, and purpose-fit.&lt;/p>
&lt;h2 id="my-focus-this-summer">My focus this summer&lt;/h2>
&lt;p>As part of my GSoC 2025 project, I’ll be focusing on extending AIDRIN’s evaluation capabilities. A big part of this involves strengthening its support for privacy metrics and designing tools that can handle non-tabular datasets, such as image-based data.&lt;/p>
&lt;p>The goal is to expand AIDRIN’s reach without compromising on interpretability or ease of use. More technical insights and updates will follow in the next posts as the summer progresses.&lt;/p>
&lt;h2 id="what-comes-next">What comes next&lt;/h2>
&lt;p>As the AI community continues to evolve, there’s a growing shift toward data-centric practices. I believe frameworks like AIDRIN are essential for helping us move beyond the question of &lt;em>&amp;ldquo;Does the model work?&amp;rdquo;&lt;/em> toward a deeper and more meaningful one: &lt;em>&amp;ldquo;Was the data ready in the first place?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>Over the next few weeks, I’ll be working on development, testing, and integration. I’m excited to contribute to a tool that emphasizes transparency and reproducibility across the AI lifecycle, and to share lessons and ideas with others who care about responsible AI.&lt;/p>
&lt;p>If you’re exploring similar challenges or working in the space of dataset evaluation and readiness, I’d love to connect and exchange thoughts. You can also read my full GSoC 2025 proposal below for more context around the project scope and vision:&lt;/p>
&lt;p>👉 &lt;a href="https://drive.google.com/file/d/1RUyU2fHkc8GZ9vTj5SUr6jj84ZaRUvNt/view" target="_blank" rel="noopener">Read my GSoC 2025 proposal here&lt;/a>&lt;/p>
&lt;p>&lt;em>This is the first in a 3-part blog series documenting my GSoC journey with AIDRIN. Stay tuned for technical updates and behind-the-scenes insights as the summer unfolds!&lt;/em>&lt;/p></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</link><pubDate>Sun, 08 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</guid><description>&lt;h1 id="google-summer-of-code-25-optimizing-and-benchmarking-gpu-collective-communication-of-pylops-mpi-with-nccl">Google Summer of Code ‘25: Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL&lt;/h1>
&lt;p>My project aims to introduce GPU-to-GPU collective communication calls using Nvidia&amp;rsquo;s NCCL to &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, an extension of the powerful &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> library.&lt;/p>
&lt;p>I&amp;rsquo;m incredibly grateful for this opportunity and excited to be mentored by two HPC experts, Yuxi Hong from Lawrence Berkeley National Laboratory and Matteo Ravasi from ShearWater GeoServices.&lt;/p>
&lt;p>Here&amp;rsquo;s also the link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/C2XSZp2E" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;h2 id="what-is-pylops-mpi-and-nccl-">What is PyLops-MPI and NCCL ?&lt;/h2>
&lt;p>PyLops is a Python library that provides a rich collection of linear operators to solve inverse problems. Its MPI extension, PyLops-MPI, takes this a step further by enabling these operations to run on large-scale, distributed computing systems like HPC using the Message-Passing Interface (MPI).&lt;/p>
&lt;p>Where does NCCL fit in? The NVIDIA Collective Communication Library (NCCL) is a library of highly optimized routines for collective communication between GPUs. It offers the opportunity to close a performance gap in PyLops-MPI: as we offload more and more computationally intensive tasks to GPUs, the communication between them can become a bottleneck. NCCL offers a powerful solution to this problem, enabling high-bandwidth, low-latency communication that can significantly boost performance.&lt;/p>
&lt;h2 id="motivation-and-what-was-missing">Motivation and What was Missing&lt;/h2>
&lt;p>As a student with a background in geophysics (B.Sc.) and now pursuing computer science (M.Sc.), I&amp;rsquo;ve experienced firsthand the challenges of scaling scientific computing research from a personal desktop to a high-performance computing (HPC) cluster. It can be a significant hurdle. My project aims to ease this transition for PyLops-MPI users. PyLops-MPI is something I wish existed while I was doing my undergraduate research!&lt;/p>
&lt;p>Currently, PyLops-MPI is &amp;ldquo;CUDA-aware,&amp;rdquo; meaning it can offload computations to GPUs. However, the communication between those GPUs is still handled by the underlying MPI implementation, which isn&amp;rsquo;t always optimal. This project addresses that gap by integrating NCCL to handle GPU-to-GPU communication directly: if the computation happens on the GPU, data shouldn&amp;rsquo;t be copied to the CPU, transferred with MPI, and then copied back to the GPU.&lt;/p>
&lt;p>This will be especially impactful for memory-bound problems where high-bandwidth communication is critical. By the end of this project, we&amp;rsquo;ll have a clear, quantifiable understanding of the performance gains achieved.&lt;/p>
&lt;h2 id="my-best-laid-plan">My Best-Laid Plan&lt;/h2>
&lt;p>My approach is grounded in good software engineering practices to ensure that this new feature is both robust and genuinely useful. I was impressed by the code quality of the repository (an enjoyable read), and I am committed to not breaking that.&lt;/p>
&lt;p>First and foremost, the goal is to seamlessly integrate NCCL without breaking what already works. A significant part of my effort will be dedicated to rigorous testing. This means not only ensuring that all existing tests pass but also developing a new, comprehensive test suite to validate the correctness of the GPU-to-GPU communication across different hardware setups.&lt;/p>
&lt;p>Once we&amp;rsquo;re confident that the integration is solid, the exciting part begins: benchmarking (or you may call it &amp;ldquo;Moment of Truth&amp;rdquo;)! The plan is to measure the performance of end-to-end iterative solvers. These solvers are a perfect test case because they involve a mix of intensive gradient computations on the GPU and frequent AllReduce calls to sync up processes. This will give us a clear picture of the speedup and efficiency gains from using NCCL.&lt;/p>
&lt;p>Finally, to make sure this work benefits the entire community, I will create clear documentation and tutorials. The goal is to make it easy for any user to leverage this new GPU-accelerated communication in their own research and applications.&lt;/p></description></item><item><title>Exploration of I/O Reproducibility with HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</link><pubDate>Wed, 19 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</guid><description>&lt;p>Parallel I/O is a critical component in high-performance computing (HPC), allowing multiple processes to read and write data concurrently from a shared storage system. &lt;a href="https://github.com/HDFGroup/hdf5" target="_blank" rel="noopener">HDF5&lt;/a>—a widely adopted data model and library for managing complex scientific data—supports parallel I/O but introduces challenges in I/O reproducibility, where repeated executions do not always produce identical results. This lack of reproducibility can stem from non-deterministic execution orders, variations in collective buffering strategies, and race conditions in metadata and dataset chunking operations within HDF5’s parallel I/O hierarchy. Moreover, many HDF5 operations that leverage &lt;a href="%28https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf%29">MPI I/O&lt;/a> require collective communication; that is, all processes within a communicator must participate in operations such as metadata creation, chunk allocation, and data aggregation. These collective calls ensure that the file structure and data layout remain consistent across processes, but they also introduce additional synchronization complexity that can impact reproducibility if not properly managed. 
In HPC scientific workflows, consistent I/O reproducibility is essential for accurate debugging, validation, and benchmarking, ensuring that scientific results are both verifiable and trustworthy. Tools such as &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a>—a suite of I/O kernels designed to exercise HDF5 I/O on parallel file systems—play an important role in identifying these reproducibility challenges, tuning performance, and ultimately supporting the overall robustness of large-scale scientific applications.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing parallel I/O operations in &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">HDF5&lt;/a> with &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> miniapps, (2) exploring and validating potential reproducibility challenges within the parallel I/O hierarchy (e.g., MPI I/O), and (3) implementing solutions to address parallel I/O reproducibility.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Parallel I/O&lt;/code> &lt;code>MPI-I/O&lt;/code> &lt;code>Reproducibility&lt;/code> &lt;code>HPC&lt;/code> &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/wei-zhang/">Wei Zhang&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>