<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>osre25 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/osre25/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/osre25/index.xml" rel="self" type="application/rss+xml"/><description>osre25</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Thu, 25 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>osre25</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/osre25/</link></image><item><title>Final Update: Building Intelligent Observability for NRP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</link><pubDate>Thu, 25 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</guid><description>&lt;p>I&amp;rsquo;m excited to share the completion of my OSRE 2025 project, &amp;ldquo;&lt;em>Intelligent Observability for NRP: A GenAI Approach&lt;/em>&amp;rdquo; and the significant learning journey it has been. We&amp;rsquo;ve successfully developed a novel InfoAgent architecture that delivers on our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.&lt;/p>
&lt;h2 id="how-our-novel-infoagent-architecture-advances-the-observability-mission">How Our Novel InfoAgent Architecture Advances the Observability Mission&lt;/h2>
&lt;p>Through extensive development and testing, I&amp;rsquo;ve learned tremendously about building production-ready AI systems and have implemented a novel InfoAgent architecture that orchestrates our specialized agents:&lt;/p>
&lt;h3 id="1-prometheus-metrics-analysis-agent">1. Prometheus Metrics Analysis Agent&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Continuously ingests and processes NRP&amp;rsquo;s Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Fully implemented data pipelines handling multiple metric types with optimized latency&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Provides the foundation for anomaly detection by establishing normal behavior baselines&lt;/li>
&lt;/ul>
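&lt;p>As an illustration of baseline-driven detection, here is a minimal sketch (in Python; not the project&amp;rsquo;s actual agent code) that flags metric samples deviating from a rolling baseline:&lt;/p>

```python
# Hypothetical sketch, not the InfoAgent implementation: flag points whose
# z-score against a rolling baseline exceeds a threshold.
import statistics

def detect_anomalies(series, window=30, threshold=3.0):
    """Return indices of samples that deviate from the rolling baseline."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]          # the "normal behavior" window
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev == 0:                           # flat baseline: skip
            continue
        z = abs(series[i] - mean) / stdev
        if z > threshold:
            anomalies.append(i)
    return anomalies
```

A real deployment would pull the series from the Prometheus HTTP API rather than a Python list, but the baseline logic is the same.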
&lt;h3 id="2-query-refinement-agent-croq">2. Query Refinement Agent (CROQ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Clarifies ambiguous metrics or patterns before generating explanations&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Completed implementation of Conformal Revision of Questions for disambiguation&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Ensures explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Successfully improved accuracy of GenAI explanations by eliminating misinterpretations&lt;/li>
&lt;/ul>
&lt;h3 id="3-explanation-generation-agent-ais">3. Explanation Generation Agent (AIS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Creates human-readable explanations and root-cause analysis&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Finalized the Automated Information Seeker with a complete Plan→Validate→Execute→Assess→Revise cycle&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Transforms technical anomalies into actionable insights for operators&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Delivers GenAI explanations with uncertainty quantification&lt;/li>
&lt;/ul>
&lt;h2 id="completed-integration-the-novel-infoagent-pipeline">Completed Integration: The Novel InfoAgent Pipeline&lt;/h2>
&lt;p>We&amp;rsquo;ve successfully integrated all agents into a unified observability pipeline that represents our novel contribution:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Collection&lt;/strong>: Prometheus metrics → Analysis Agent (comprehensive metrics support)&lt;/li>
&lt;li>&lt;strong>Anomaly Detection&lt;/strong>: With statistical confidence bounds using conformal prediction&lt;/li>
&lt;li>&lt;strong>Query Refinement&lt;/strong>: Resolving ambiguities before explanation&lt;/li>
&lt;li>&lt;strong>Explanation Generation&lt;/strong>: Human-readable analysis with uncertainty awareness&lt;/li>
&lt;li>&lt;strong>Feedback Loop&lt;/strong>: System learning from operator interactions (implemented and tested)&lt;/li>
&lt;/ol>
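&lt;p>The five stages above can be sketched as a chain of pluggable callables. This is a hypothetical outline of the control flow only; the stage implementations shown are stubs, not the actual InfoAgent code:&lt;/p>

```python
# Hypothetical sketch of the pipeline's control flow; each stage is injected
# as a callable so it can be swapped for the real agent.
def run_pipeline(metrics, detect, refine, explain, record_feedback):
    """Detection -> query refinement -> explanation -> feedback loop."""
    anomalies = detect(metrics)                    # with confidence bounds
    queries = [refine(a) for a in anomalies]       # disambiguate first
    explanations = [explain(q) for q in queries]   # human-readable output
    for e in explanations:
        record_feedback(e)                         # learn from operators
    return explanations
```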
&lt;h2 id="hardware-testing-results">Hardware Testing Results&lt;/h2>
&lt;p>This project taught me valuable lessons about optimizing AI workloads on specialized hardware. We successfully tested our observability framework on Qualcomm Cloud AI 100 Ultra hardware:&lt;/p>
&lt;ul>
&lt;li>Achieved significant performance improvements over the baseline CPU implementation&lt;/li>
&lt;li>Successfully ported and optimized GLM-4.5 for observability-specific tasks&lt;/li>
&lt;li>Validated that specialized AI hardware significantly enhances real-time anomaly detection&lt;/li>
&lt;/ul>
&lt;h2 id="learning-journey-and-novel-contributions">Learning Journey and Novel Contributions&lt;/h2>
&lt;p>Throughout OSRE 2025, I&amp;rsquo;ve learned extensively about:&lt;/p>
&lt;ol>
&lt;li>Building hierarchical agent coordination systems for complex reasoning&lt;/li>
&lt;li>Implementing conformal prediction for trustworthy AI outputs&lt;/li>
&lt;li>Creating self-correcting explanation pipelines&lt;/li>
&lt;li>Developing adaptive learning systems from operator feedback&lt;/li>
&lt;/ol>
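&lt;p>As a concrete illustration of point 2, split conformal prediction derives an interval from held-out calibration residuals. The following is a generic sketch of the technique, not the project&amp;rsquo;s implementation:&lt;/p>

```python
# Generic split conformal prediction sketch (illustrative, not project code):
# the interval width is the conformal quantile of calibration residuals.
import math

def conformal_interval(calibration_residuals, prediction, alpha=0.1):
    """Return a (lo, hi) interval with roughly 1 - alpha coverage."""
    n = len(calibration_residuals)
    scores = sorted(abs(r) for r in calibration_residuals)
    # Conformal quantile rank: ceil((n + 1) * (1 - alpha)), clipped to n.
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    q = scores[k - 1]
    return prediction - q, prediction + q
```

The same score-then-quantile recipe underlies both the detection bounds and the uncertainty-aware explanations described above.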
&lt;p>The novel InfoAgent architecture demonstrates promising results in our testing environment; evaluation metrics and benchmarks are still being refined.&lt;/p>
&lt;h2 id="ongoing-work-continuing-beyond-osre">Ongoing Work: Continuing Beyond OSRE&lt;/h2>
&lt;p>While OSRE 2025 is concluding, I&amp;rsquo;m actively continuing to contribute to this project:&lt;/p>
&lt;ol>
&lt;li>Preparing the InfoAgent framework for open-source release with comprehensive documentation&lt;/li>
&lt;li>Running extended evaluation tests on the Nautilus platform (work in progress)&lt;/li>
&lt;li>Writing a research paper detailing our novel architecture&lt;/li>
&lt;li>Creating tutorials to help others implement intelligent observability&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Project Updates and Code&lt;/strong>: You can follow my ongoing contributions and access the latest code at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I&amp;rsquo;m deeply grateful to my lead mentor &lt;strong>Mohammad Firas Sada&lt;/strong> for his exceptional guidance throughout this transformative learning experience. His insights have been invaluable in helping me develop the novel InfoAgent architecture and navigate the complexities of building production-ready AI systems.&lt;/p>
&lt;p>The OSRE 2025 program has been an incredible journey of growth and discovery. I&amp;rsquo;ve learned not just how to build AI systems, but how to make them trustworthy, explainable, and genuinely useful for real-world operations. The novel InfoAgent architecture we&amp;rsquo;ve developed serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I&amp;rsquo;m excited to continue contributing to this project and look forward to seeing how the community adopts and extends these ideas. Check out my contributions and ongoing updates at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>!&lt;/p></description></item><item><title>Final Report: MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250901-rohan-babbar/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250901-rohan-babbar/</guid><description>&lt;p>Hi Everyone, This is my final report for the project I completed during my summer as a &lt;a href="https://ucsc-ospo.github.io/sor/" target="_blank" rel="noopener">Summer of Reproducibility (SOR)&lt;/a> student.
The project, titled &amp;ldquo;&lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">MPI Appliance for HPC Research in Chameleon&lt;/a>,&amp;rdquo; was undertaken in collaboration with Argonne National Laboratory
and the Chameleon Cloud community, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ken-raffenetti/">Ken Raffenetti&lt;/a>.
This blog details the work and outcomes of the project.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Message Passing Interface (MPI) is the backbone of high-performance computing (HPC), enabling efficient scaling across thousands of
processing cores. However, reproducing MPI-based experiments remains challenging due to dependencies on specific library versions,
network configurations, and multi-node setups.&lt;/p>
&lt;p>To address this, we introduce a reproducibility initiative that provides standardized MPI environments on the Chameleon testbed.
This is set up as a master–worker MPI cluster. The master node manages tasks and communication, while the worker nodes do the computations.
All nodes have the same MPI libraries, software, and network settings, making experiments easier to scale and reproduce.&lt;/p>
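&lt;p>To make the master–worker split concrete, here is a minimal sketch of how a fixed set of tasks might be partitioned across process ranks. This is illustrative logic only; real runs use MPI libraries such as MPICH or OpenMPI for communication:&lt;/p>

```python
# Hypothetical sketch of master-worker task division by rank; not the
# appliance's code, and no MPI calls are made here.
def tasks_for_rank(tasks, rank, size):
    """Round-robin assignment: rank r takes tasks r, r + size, r + 2*size, ..."""
    return tasks[rank::size]

def partition(tasks, size):
    """Full assignment map the master would hand to a cluster of `size` ranks."""
    return {rank: tasks_for_rank(tasks, rank, size) for rank in range(size)}
```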
&lt;h2 id="objectives">Objectives&lt;/h2>
&lt;p>The aim of this project is to create an MPI cluster that is reproducible, easily deployable, and efficiently configurable.&lt;/p>
&lt;p>The key objectives of this project were:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Pre-built MPI Images: Create ready-to-use images with MPI and all dependencies installed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automated Cluster Configuration: Develop Ansible playbooks to configure master–worker communication, including host setup, SSH key distribution, and MPI configuration across nodes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Cluster Orchestration: Develop an orchestration template to provision resources and invoke Ansible playbooks for automated cluster setup.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="implementation-strategy-and-deliverables">Implementation Strategy and Deliverables&lt;/h2>
&lt;h3 id="openstack-image-creation">Openstack Image Creation&lt;/h3>
&lt;p>The first step was to create a standardized pre-built image, which serves as the base image for all nodes in the cluster.&lt;/p>
&lt;p>Some important features of the image include:&lt;/p>
&lt;ol>
&lt;li>Built on Ubuntu 22.04 for a stable base environment.&lt;/li>
&lt;li>&lt;a href="https://spack.io/" target="_blank" rel="noopener">Spack&lt;/a> + Lmod integration:
&lt;ul>
&lt;li>Spack handles reproducible, version-controlled installations of software packages.&lt;/li>
&lt;li>Lmod (Lua Modules) provides a user-friendly way to load/unload software environments dynamically.&lt;/li>
&lt;li>Together, they allow users to easily switch between MPI versions, libraries, and GPU toolkits.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/pmodels/mpich" target="_blank" rel="noopener">MPICH&lt;/a> and &lt;a href="https://github.com/open-mpi/ompi" target="_blank" rel="noopener">OpenMPI&lt;/a> pre-installed for standard MPI support and can be loaded/unloaded.&lt;/li>
&lt;li>Three image variants for various HPC workloads: CPU-only, NVIDIA GPU (CUDA 12.8), and AMD GPU (ROCm 6.4.2).&lt;/li>
&lt;/ol>
&lt;p>These images have been published and are available in the Chameleon Cloud Appliance Catalog:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://chameleoncloud.org/appliances/127/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04)&lt;/a> - CPU Only&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/appliances/130/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - CUDA)&lt;/a> - NVIDIA GPU (CUDA 12.8)&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/appliances/131/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - ROCm)&lt;/a> - AMD GPU (ROCm 6.4.2)&lt;/li>
&lt;/ul>
&lt;h3 id="cluster-configuration-using-ansible">Cluster Configuration using Ansible&lt;/h3>
&lt;p>The next step is to create scripts/playbooks to configure these nodes and set up an HPC cluster.
We assigned specific roles to different nodes in the cluster and combined them into a single playbook to configure the entire cluster automatically.&lt;/p>
&lt;p>Some key steps the playbook performs:&lt;/p>
&lt;ol>
&lt;li>Configure /etc/hosts entries for all nodes.&lt;/li>
&lt;li>Mount Manila NFS shares on each node.&lt;/li>
&lt;li>Generate an SSH key pair on the master node and add the master’s public key to the workers’ authorized_keys.&lt;/li>
&lt;li>Scan worker node keys and update known_hosts on the master.&lt;/li>
&lt;li>(Optional) Manage software:
&lt;ul>
&lt;li>Install new compilers with Spack&lt;/li>
&lt;li>Add new Spack packages&lt;/li>
&lt;li>Update environment modules to recognize them&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Create a hostfile at /etc/mpi/hostfile.&lt;/li>
&lt;/ol>
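&lt;p>A hypothetical playbook fragment illustrating steps 1 and 3 (the module names are real Ansible modules, but the task layout and variable names are illustrative, not the published playbook):&lt;/p>

```yaml
# Illustrative fragment only; variable names such as master_pubkey are
# placeholders, not taken from the project's repository.
- name: Configure /etc/hosts entries for all nodes
  hosts: all
  become: true
  tasks:
    - name: Add each cluster node to /etc/hosts
      ansible.builtin.lineinfile:
        path: /etc/hosts
        line: "{{ hostvars[item].ansible_host }} {{ item }}"
      loop: "{{ groups['all'] }}"

- name: Distribute the master key to workers
  hosts: workers
  become: true
  tasks:
    - name: Authorize the master's public key
      ansible.posix.authorized_key:
        user: cc
        key: "{{ hostvars[groups['master'][0]].master_pubkey }}"
```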
&lt;p>The code is publicly available and can be found on the GitHub repository: &lt;a href="https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact" target="_blank" rel="noopener">https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact&lt;/a>&lt;/p>
&lt;h3 id="orchestration">Orchestration&lt;/h3>
&lt;p>With the image now created and deployed, and the Ansible scripts ready for cluster configuration, we put everything
together to orchestrate the cluster deployment.&lt;/p>
&lt;p>This can be done in two primary ways:&lt;/p>
&lt;h4 id="python-chijupyter--ansible">Python CHI(Jupyter) + Ansible&lt;/h4>
&lt;p>&lt;a href="https://github.com/ChameleonCloud/python-chi" target="_blank" rel="noopener">Python-CHI&lt;/a> is a python library designed to facilitate interaction with the Chameleon testbed. Often used within environments like Jupyter notebooks.&lt;/p>
&lt;p>The setup proceeds as follows:&lt;/p>
&lt;ol>
&lt;li>Create leases, launch instances, and set up shared storage using python-chi commands.&lt;/li>
&lt;li>Automatically generate inventory.ini for Ansible based on launched instances.&lt;/li>
&lt;li>Run Ansible playbook programmatically using &lt;code>ansible_runner&lt;/code>.&lt;/li>
&lt;li>Outcome: fully configured, ready-to-use HPC cluster; SSH into master to run examples.&lt;/li>
&lt;/ol>
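&lt;p>Step 2 can be sketched as follows. This is a hypothetical helper; the group names and file layout are illustrative, not the published notebook&amp;rsquo;s code:&lt;/p>

```python
# Illustrative sketch of generating inventory.ini from launched-instance IPs;
# the [master]/[workers] grouping is an assumption, not the project's exact file.
def build_inventory(master_ip, worker_ips):
    lines = ["[master]", master_ip, "", "[workers]"]
    lines.extend(worker_ips)
    return "\n".join(lines) + "\n"

# Typically written to disk before invoking the playbook via ansible_runner:
# pathlib.Path("inventory.ini").write_text(build_inventory(master, workers))
```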
&lt;p>If you would like to see a working example, you can view it in the &lt;a href="https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17" target="_blank" rel="noopener">Trovi example&lt;/a>.&lt;/p>
&lt;h4 id="heat-orchestration-template">Heat Orchestration Template&lt;/h4>
&lt;p>A Heat Orchestration Template (HOT) is a YAML-based configuration file that defines a stack to automate
the deployment and configuration of OpenStack cloud resources.&lt;/p>
&lt;p>&lt;strong>Challenges&lt;/strong>&lt;/p>
&lt;p>We faced some challenges while working with Heat templates and stacks, in particular on Chameleon Cloud:&lt;/p>
&lt;ol>
&lt;li>&lt;code>OS::Nova::Keypair&lt;/code>(new version): In the latest OpenStack version, the stack fails to launch if the &lt;code>public_key&lt;/code> parameter is not provided for the keypair,
as auto-generation is no longer supported.&lt;/li>
&lt;li>&lt;code>OS::Heat::SoftwareConfig&lt;/code>: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heat Approach" srcset="
/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_05fca9fb65271d31e3fd79f2e7b58a53.webp 400w,
/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_19399eb0dbf598de84852723f8d60783.webp 760w,
/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_05fca9fb65271d31e3fd79f2e7b58a53.webp"
width="760"
height="235"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To tackle these challenges, we designed an approach that is both easy to implement and reproducible. First, we launch instances
by provisioning master and worker nodes using the HOT template in OpenStack. Next, we set up a bootstrap node, install Git and Ansible,
and run an Ansible playbook from the bootstrap node to configure the master and worker nodes, including SSH, host communication, and
MPI setup. The outcome is a fully configured, ready-to-use HPC cluster, where users can simply SSH into the master node to run examples.&lt;/p>
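&lt;p>A hypothetical HOT fragment for the bootstrap node is shown below; the resource name, image, flavor, and script contents are illustrative, not the published template:&lt;/p>

```yaml
# Illustrative fragment only: a bootstrap server whose user_data installs
# Git and Ansible so it can configure the rest of the cluster.
heat_template_version: 2018-08-31

resources:
  bootstrap_node:
    type: OS::Nova::Server
    properties:
      image: MPI and Spack for HPC (Ubuntu 22.04)
      flavor: baremetal
      user_data: |
        #!/bin/bash
        apt-get update
        apt-get install -y git ansible
        # clone the playbooks, then configure master/worker nodes from here
```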
&lt;p>Users can view/use the template published in the Appliance Catalog: &lt;a href="https://chameleoncloud.org/appliances/132/" target="_blank" rel="noopener">MPI+Spack Bare Metal Cluster&lt;/a>.
For example, a demonstration of how to pass parameters is available on &lt;a href="https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17" target="_blank" rel="noopener">Trovi&lt;/a>.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In conclusion, this work demonstrates a reproducible approach to building and configuring MPI clusters on the Chameleon testbed. By using standardized images,
Ansible automation, and orchestration templates, we ensure that every node is consistently set up, reducing manual effort and errors. The artifact, published on Trovi,
makes the entire process transparent, reusable, and easy to implement, enabling users and researchers to reliably recreate and extend the cluster environment for their own
experiments.&lt;/p>
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Future work includes maintaining these images and possibly creating a script to reproduce the MPI and Spack setup on a different base image environment.&lt;/p></description></item><item><title>Final Update(Mid-Term -> Final): MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250831-rohan-babbar/</link><pubDate>Sun, 31 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250831-rohan-babbar/</guid><description>&lt;p>Hi everyone! This is my final update, covering the progress made every two weeks from the midterm to the end of the
project &lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">MPI Appliance for HPC Research on Chameleon&lt;/a>, developed
in collaboration with Argonne National Laboratory and the Chameleon Cloud community.
This blog follows up on my earlier post, which you can find &lt;a href="https://ucsc-ospo.github.io/report/osre25/uchicago/mpi/20250803-rohan-babbar/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h3 id="-july-29--august-11-2025">🔧 July 29 – August 11, 2025&lt;/h3>
&lt;p>With the CPU-only and CUDA-based MPI–Spack appliances published, we considered releasing another image variant (ROCm-based) for AMD GPUs.
It is primarily intended for CHI@TACC, which provides AMD GPUs. We successfully published a new image on Chameleon titled &lt;a href="https://chameleoncloud.org/appliances/131/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - ROCm)&lt;/a>,
and we also added an example to demonstrate its usage.&lt;/p>
&lt;h3 id="-august-12--august-25-2025">🔧 August 12 – August 25, 2025&lt;/h3>
&lt;p>With the examples now available on Trovi for creating an MPI cluster using Ansible and Python-CHI, my next step was to experiment with stack orchestration using Heat Orchestration Templates (HOT) on OpenStack Chameleon Cloud.
This turned out to be more challenging due to a few restrictions:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>OS::Nova::Keypair (new version)&lt;/strong>: In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.&lt;/li>
&lt;li>&lt;strong>OS::Heat::SoftwareConfig&lt;/strong>: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.&lt;/li>
&lt;/ol>
&lt;p>To address these issues, we adopted a new strategy for configuring and creating the MPI cluster: using a temporary bootstrap node.&lt;/p>
&lt;p>In simple terms, the workflow of the Heat template is:&lt;/p>
&lt;ol>
&lt;li>Provision master and worker nodes via the HOT template on OpenStack.&lt;/li>
&lt;li>Launch a bootstrap node, install Git and Ansible on it, and then run an Ansible playbook from the bootstrap node to configure the master and worker nodes. This includes setting up SSH, host communication, and the MPI environment.&lt;/li>
&lt;/ol>
&lt;p>This provides an alternative method for creating an MPI cluster.&lt;/p>
&lt;p>We presented this work on August 26, 2025, to the Chameleon Team and the Argonne MPICH Team. The project was very well received.&lt;/p>
&lt;p>Stay tuned for my final report on this work, which I’ll be sharing in my next blog post.&lt;/p></description></item><item><title>End-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/</guid><description>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heading" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_f9e5e16b2001b9950ad995b2c786abc9.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_27bc4379277ab462935158b3db96d992.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_f9e5e16b2001b9950ad995b2c786abc9.webp"
width="760"
height="392"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="introduction">&lt;strong>Introduction&lt;/strong>&lt;/h1>
&lt;p>Hello everyone!&lt;br>
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project, and under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.&lt;/p>
&lt;h1 id="about-the-project">&lt;strong>About the Project&lt;/strong>&lt;/h1>
&lt;p>As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for researchers to discover relevant projects, notes, and assets across both current and archived work, using information that is either user-entered or passively collected by StatWrap.&lt;/p>
&lt;p>Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Evaluating open-source search libraries&lt;/strong> suitable for local indexing and retrieval&lt;/li>
&lt;li>&lt;strong>Building the full-text search functionality&lt;/strong> directly into the StatWrap UI to allow seamless querying across projects&lt;/li>
&lt;li>&lt;strong>Ensuring reliability&lt;/strong> through the development of unit tests and comprehensive system testing&lt;/li>
&lt;li>&lt;strong>Implementing a classification system&lt;/strong> to label projects as “Active,” “Pinned,” or “Past” within the user interface&lt;/li>
&lt;/ul>
&lt;p>This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.&lt;/p>
&lt;h1 id="deliverables">&lt;strong>Deliverables&lt;/strong>&lt;/h1>
&lt;p>The project has reached the end of its scope after 12 weeks of work. Here&amp;rsquo;s a breakdown:&lt;/p>
&lt;h2 id="1-descriptive-comparison-of-open-source-libraries">&lt;strong>1. Descriptive Comparison of Open-Source Libraries&lt;/strong>&lt;/h2>
&lt;p>I compared various open-source search libraries based on evaluation criteria such as &lt;strong>indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation&lt;/strong>, and &lt;strong>developer experience&lt;/strong>. We then decided on the weight to assign to each feature and identified the best library to use. According to the assigned weights:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_4b5e863d88146124b333878508147eff.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_c2220a56c480048842e8b750cc2ca56f.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_4b5e863d88146124b333878508147eff.webp"
width="760"
height="603"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>These results were obtained after tuning the hyperparameters to give the best set of results.
For large datasets, FlexSearch has the lowest memory usage, followed by MiniSearch; because the examples we used were limited in size, MiniSearch showed the better memory-usage results here.
Along with this research and evaluation, I also examined the Performance Benchmark of Full-Text Search Libraries (Stress Test), available &lt;a href="https://nextapps-de.github.io/flexsearch/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Stress Test" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_407cb964e7e05c64834433b6a84182ff.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_167223f62fbaf30991601d7745fad9f5.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_407cb964e7e05c64834433b6a84182ff.webp"
width="760"
height="384"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The benchmark was measured in operations per second; higher values are better (except for the &amp;ldquo;Memory&amp;rdquo; test). The memory value refers to the amount of memory additionally allocated during search.&lt;/p>
&lt;p>According to its documentation, FlexSearch performs queries up to 1,000,000 times faster than comparable libraries, while also providing powerful search capabilities like multi-field (document) search, phonetic transformations, partial matching, tag search, result highlighting, and suggestions.
Bigger workloads can be scaled through workers, which perform updates or queries on the index in parallel across dedicated, balanced threads.&lt;/p>
&lt;h2 id="2-the-search-user-interface">&lt;strong>2. The Search User Interface&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_5c88d9d2587c54c50da97d6c489519dc.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_82065ca30e98bced61362bca45765215.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_5c88d9d2587c54c50da97d6c489519dc.webp"
width="760"
height="428"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui2" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_7a3499ad0fc3cd06919fcdd17194742a.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_5840b85d48a6e608855c8e0d96b4fe49.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_7a3499ad0fc3cd06919fcdd17194742a.webp"
width="760"
height="652"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="3-complete-search-execution-pipeline">&lt;strong>3. Complete Search Execution Pipeline&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui2" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_bd4ac2fa5efb17e2b237cf8d78278398.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_a0e8f31fdbdc656a2886def3dca3410b.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_bd4ac2fa5efb17e2b237cf8d78278398.webp"
width="513"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="4-flexsearch-features">&lt;strong>4. FlexSearch Features&lt;/strong>&lt;/h2>
&lt;h4 id="1-persistent-indexing-with-automatic-loading">1. &lt;strong>Persistent Indexing with Automatic Loading&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Index persistence&lt;/strong>: Search index automatically saves to disk and loads on startup&lt;/li>
&lt;li>&lt;strong>Fast restoration&lt;/strong>: Rebuilds FlexSearch indices from saved document store without re-scanning files&lt;/li>
&lt;li>&lt;strong>Incremental updates&lt;/strong>: Detects project changes and updates only modified content&lt;/li>
&lt;li>&lt;strong>Background processing&lt;/strong>: Index updates happen asynchronously without blocking the User Interface.&lt;/li>
&lt;/ul>
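As a rough illustration of the persistence pattern above (not StatWrap's actual code: the class, file layout, and a plain dict standing in for FlexSearch are all invented here), an index can be rebuilt from a saved document store like so:

```python
import json
import os
import tempfile

class PersistentDocStore:
    """Minimal sketch of the persistence pattern described above: the
    document store is saved to disk as JSON, and the in-memory index is
    rebuilt from it on startup instead of re-scanning project files."""

    def __init__(self, path):
        self.path = path
        self.docs = {}    # doc_id -> {"title": ..., "body": ...}
        self.index = {}   # term -> set of doc_ids (stand-in for FlexSearch)
        if os.path.exists(path):
            with open(path) as f:
                self.docs = json.load(f)
            for doc_id, doc in self.docs.items():
                self._index_doc(doc_id, doc)  # fast restore, no file re-scan

    def _index_doc(self, doc_id, doc):
        for term in f"{doc['title']} {doc['body']}".lower().split():
            self.index.setdefault(term, set()).add(doc_id)

    def add(self, doc_id, title, body):
        self.docs[doc_id] = {"title": title, "body": body}
        self._index_doc(doc_id, self.docs[doc_id])

    def save(self):
        with open(self.path, "w") as f:
            json.dump(self.docs, f)

    def search(self, term):
        return sorted(self.index.get(term.lower(), ()))

# Round trip: index once, save, then "restart" by loading from disk.
store_path = os.path.join(tempfile.mkdtemp(), "store.json")
store = PersistentDocStore(store_path)
store.add("p1", "Climate project", "temperature data and notebooks")
store.save()
restored = PersistentDocStore(store_path)
```

Only the document store needs to be serialized; the search structure itself is cheap to rebuild, which is what makes restoration fast.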
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="indexing" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_23074ee37edbb0f6abbd289ef211f756.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_993d6a1363d2cddf66632c4102acb8f5.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_23074ee37edbb0f6abbd289ef211f756.webp"
width="494"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="2-multi-document-type-support">2. &lt;strong>Multi-Document Type Support&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Unified search&lt;/strong>: Single search interface for projects, files, people, notes, and assets&lt;/li>
&lt;li>&lt;strong>Type-specific indices&lt;/strong>: Separate FlexSearch indices optimized for each document type&lt;/li>
&lt;li>&lt;strong>Cross-reference capabilities&lt;/strong>: Documents can reference and link to each other&lt;/li>
&lt;li>&lt;strong>Flexible schema&lt;/strong>: Each document type has tailored fields for optimal search performance&lt;/li>
&lt;/ul>
&lt;h4 id="3-intelligent-file-content-indexing">3. &lt;strong>Intelligent File Content Indexing&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Configurable file size limits&lt;/strong>: Admin-controlled maximum file size for content indexing&lt;/li>
&lt;li>&lt;strong>Smart file detection&lt;/strong>: Automatically identifies text files by extension and filename patterns&lt;/li>
&lt;li>&lt;strong>Content extraction&lt;/strong>: Full-text indexing with snippet generation for search results&lt;/li>
&lt;li>&lt;strong>Performance optimization&lt;/strong>: Skips binary files and respects size constraints to maintain speed&lt;/li>
&lt;/ul>
&lt;h4 id="4-advanced-query-processing">4. &lt;strong>Advanced Query Processing&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Multi-strategy search&lt;/strong>: Combines exact matches, fuzzy search, partial matches, and contextual search&lt;/li>
&lt;li>&lt;strong>Query preprocessing&lt;/strong>: Removes stop words and applies linguistic filters&lt;/li>
&lt;li>&lt;strong>Relevance scoring&lt;/strong>: Custom scoring algorithm considering multiple factors:
&lt;ul>
&lt;li>Exact phrase matches (highest weight)&lt;/li>
&lt;li>Individual word matches&lt;/li>
&lt;li>Term frequency with logarithmic capping&lt;/li>
&lt;li>Position-based scoring (earlier matches rank higher)&lt;/li>
&lt;li>Proximity bonuses for terms appearing near each other&lt;/li>
&lt;li>Completeness penalties for missing query terms&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
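A toy version of such a multi-factor scorer might look like the following; the weights and exact formulas here are assumptions for illustration, not StatWrap's actual algorithm:

```python
import math

def relevance_score(query, text):
    """Toy multi-factor scorer mirroring the factors listed above; weights
    and formulas are assumptions, not StatWrap's actual code."""
    q_words = query.lower().split()
    t_words = text.lower().split()
    score = 0.0
    if query.lower() in text.lower():
        score += 10.0                    # exact phrase match: highest weight
    first_pos = {}
    for w in q_words:
        idxs = [i for i, t in enumerate(t_words) if t == w]
        if idxs:
            first_pos[w] = idxs[0]
            score += 1.0 + math.log1p(len(idxs))  # word match + log-capped frequency
            score += 1.0 / (1 + idxs[0])          # earlier matches rank higher
    if len(first_pos) >= 2:              # proximity bonus for nearby terms
        span = max(first_pos.values()) - min(first_pos.values())
        score += 2.0 / (1 + span)
    missing = len(q_words) - len(first_pos)
    score -= 2.0 * missing               # completeness penalty
    return score
```

Logarithmic capping keeps a document that repeats one query word many times from outranking a document that matches the whole phrase.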
&lt;h4 id="5-real-time-search-suggestions">5. &lt;strong>Real-Time Search Suggestions&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Autocomplete support&lt;/strong>: Dynamic suggestions based on indexed document titles&lt;/li>
&lt;li>&lt;strong>Search history&lt;/strong>: Maintains recent searches for quick re-execution&lt;/li>
&lt;li>&lt;strong>Debounced input&lt;/strong>: Prevents excessive API calls during typing&lt;/li>
&lt;li>&lt;strong>Contextual suggestions&lt;/strong>: Suggestions adapt based on current filters and context&lt;/li>
&lt;/ul>
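The debounced-input idea above can be sketched as a small decorator; the 50 ms window and function names are illustrative only:

```python
import threading
import time

def debounce(wait_seconds):
    """The wrapped function fires only after `wait_seconds` of inactivity,
    so rapid keystrokes trigger one search call instead of one per character."""
    def decorator(fn):
        timer = None
        lock = threading.Lock()
        def wrapped(*args, **kwargs):
            nonlocal timer
            with lock:
                if timer is not None:
                    timer.cancel()  # a newer keystroke supersedes the pending call
                timer = threading.Timer(wait_seconds, fn, args, kwargs)
                timer.start()
        return wrapped
    return decorator

search_calls = []

@debounce(0.05)
def run_search(query):
    search_calls.append(query)

# Simulate fast typing: only the final query should reach the backend.
for q in ("f", "fl", "fle", "flex"):
    run_search(q)
time.sleep(0.2)  # wait out the debounce window
```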
&lt;h4 id="6-comprehensive-filtering-system">6. &lt;strong>Comprehensive Filtering System&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Type filtering&lt;/strong>: Filter by document type (projects, files, people, etc.)&lt;/li>
&lt;li>&lt;strong>Project scoping&lt;/strong>: Limit searches to specific projects&lt;/li>
&lt;li>&lt;strong>File type filtering&lt;/strong>: Filter files by extension&lt;/li>
&lt;li>&lt;strong>Advanced search panel&lt;/strong>: Collapsible interface for power users&lt;/li>
&lt;li>&lt;strong>Filter persistence&lt;/strong>: Maintains filter state across searches&lt;/li>
&lt;/ul>
&lt;h4 id="7-performance-monitoring--analytics">7. &lt;strong>Performance Monitoring &amp;amp; Analytics&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Real-time metrics&lt;/strong>: Track search times, cache hit rates, and index statistics&lt;/li>
&lt;li>&lt;strong>Performance dashboard&lt;/strong>: Visual indicators for system health&lt;/li>
&lt;li>&lt;strong>Cache management&lt;/strong>: LRU cache with configurable size and TTL&lt;/li>
&lt;li>&lt;strong>Search analytics&lt;/strong>: Historical data on search patterns and performance&lt;/li>
&lt;/ul>
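A minimal sketch of an LRU cache with a TTL and hit/miss counters follows; the sizes, TTL, and API are assumptions, not the project's actual implementation:

```python
import time
from collections import OrderedDict

class LRUCacheTTL:
    """Sketch of an LRU cache with a time-to-live and hit/miss counters."""

    def __init__(self, max_size=128, ttl=60.0, clock=time.monotonic):
        self.max_size, self.ttl, self.clock = max_size, ttl, clock
        self.data = OrderedDict()        # key -> (expires_at, value)
        self.hits = self.misses = 0      # feeds the performance dashboard

    def get(self, key):
        entry = self.data.get(key)
        if entry is None or entry[0] < self.clock():
            self.data.pop(key, None)     # drop expired entries lazily
            self.misses += 1
            return None
        self.data.move_to_end(key)       # mark as most recently used
        self.hits += 1
        return entry[1]

    def put(self, key, value):
        self.data[key] = (self.clock() + self.ttl, value)
        self.data.move_to_end(key)
        if len(self.data) > self.max_size:
            self.data.popitem(last=False)  # evict the least recently used

# Usage with an injected clock so expiry is deterministic:
now = [0.0]
cache = LRUCacheTTL(max_size=2, ttl=10.0, clock=lambda: now[0])
cache.put("a", 1)
cache.put("b", 2)
first = cache.get("a")    # hit
cache.put("c", 3)         # evicts "b", the least recently used key
evicted = cache.get("b")  # miss
now[0] = 11.0             # advance past the TTL
expired = cache.get("a")  # miss: entry expired
```

The injected clock is a testing convenience; in production `time.monotonic` is the sensible default.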
&lt;h4 id="8-index-management-tools">8. &lt;strong>Index Management Tools&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Export/Import functionality&lt;/strong>: Backup and restore search indices&lt;/li>
&lt;li>&lt;strong>Full reindexing&lt;/strong>: Complete index rebuild with progress tracking&lt;/li>
&lt;li>&lt;strong>Index deletion&lt;/strong>: Clean slate functionality for troubleshooting&lt;/li>
&lt;li>&lt;strong>File size adjustment&lt;/strong>: Modify indexing constraints and rebuild affected content&lt;/li>
&lt;li>&lt;strong>Index statistics&lt;/strong>: Detailed breakdown of indexed content by type and project&lt;/li>
&lt;/ul>
&lt;h4 id="9-robust-error-handling--resilience">9. &lt;strong>Robust Error Handling &amp;amp; Resilience&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Graceful degradation&lt;/strong>: System continues operating even with partial index corruption&lt;/li>
&lt;li>&lt;strong>File system error handling&lt;/strong>: Handles missing files, permission issues, and path changes&lt;/li>
&lt;li>&lt;strong>Memory management&lt;/strong>: Prevents memory leaks during large indexing operations&lt;/li>
&lt;li>&lt;strong>Recovery mechanisms&lt;/strong>: Automatic fallback to basic search if advanced features fail&lt;/li>
&lt;/ul>
&lt;h4 id="10-user-experience-enhancements">10. &lt;strong>User Experience Enhancements&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Keyboard shortcuts&lt;/strong>: Ctrl+K to focus search, Escape to clear&lt;/li>
&lt;li>&lt;strong>Result highlighting&lt;/strong>: Visual emphasis on matching terms in results&lt;/li>
&lt;li>&lt;strong>Expandable results&lt;/strong>: Drill down into detailed information for each result&lt;/li>
&lt;li>&lt;strong>Loading states&lt;/strong>: Clear feedback during indexing and search operations&lt;/li>
&lt;li>&lt;strong>Responsive tabs&lt;/strong>: Organized results by type with badge counts&lt;/li>
&lt;/ul>
&lt;h2 id="5-classification-of-active-and-past-projects">&lt;strong>5. Classification of Active and Past Projects&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Active Pinned" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1d3344ebb95180438d54893a9b5683e4.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_a0f8ee7f62445c2f5f806022268d0821.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1d3344ebb95180438d54893a9b5683e4.webp"
width="733"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Past" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_76660a0dce9ac0ba1fa91c959db2773c.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_cc2abd1a6a3019f703ca3e656e55f920.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_76660a0dce9ac0ba1fa91c959db2773c.webp"
width="740"
height="542"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>A classification system was added to the user interface, similar to the &lt;strong>&amp;ldquo;Add to Favorites&amp;rdquo;&lt;/strong> option. A newly added project is placed in the &lt;strong>&amp;ldquo;Active&amp;rdquo;&lt;/strong> section by default, unless explicitly marked as &lt;strong>&amp;ldquo;Past&amp;rdquo;&lt;/strong>. Similarly, when a project is unpinned from Favorites, it returns to the &amp;ldquo;Active&amp;rdquo; section.&lt;/p>
&lt;h1 id="conclusion-and-future-scope">&lt;strong>Conclusion and future Scope&lt;/strong>&lt;/h1>
&lt;p>Building a comprehensive search system requires careful attention to performance, user experience, and maintainability. FlexSearch provided the foundation, but the real value came from thoughtful implementation of persistent indexing, advanced scoring, and robust error handling. The result is a search system that feels instant to users while handling complex queries across diverse document types.&lt;/p>
&lt;p>The key to success was treating search not as a single feature, but as a complete subsystem with its own data management, performance monitoring, and user interface considerations. By investing in these supporting systems, the search functionality became a central, reliable part of the application that users can depend on.&lt;/p>
&lt;p>The future scope would include:&lt;/p>
&lt;ol>
&lt;li>Using a database such as SQLite instead of JSON, which would offer more efficient queries and atomic (CRUD) operations for this use case.&lt;/li>
&lt;li>Integrating any suggestions from my mentors, as well as improvements we feel are necessary.&lt;/li>
&lt;li>Developing unit tests for further functionalities and improvements.&lt;/li>
&lt;/ol>
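As a rough sketch of the first point, Python's built-in sqlite3 module already provides atomic transactions and targeted queries; the schema below is hypothetical, meant only to illustrate the JSON-to-SQLite move:

```python
import sqlite3

# Hypothetical schema: one row per indexed document. The real StatWrap
# schema would be designed later.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    id TEXT PRIMARY KEY, doc_type TEXT, title TEXT, body TEXT)""")

with conn:  # the connection as context manager commits this block atomically
    conn.execute("INSERT INTO documents VALUES (?, ?, ?, ?)",
                 ("p1", "project", "Climate study", "temperature trends"))
    conn.execute("UPDATE documents SET title = ? WHERE id = ?",
                 ("Climate trend study", "p1"))

# A targeted query replaces loading and scanning an entire JSON file:
row = conn.execute("SELECT title FROM documents WHERE doc_type = ?",
                   ("project",)).fetchone()
```

If an exception is raised inside the `with conn:` block, the whole transaction rolls back, which is exactly the atomicity JSON files lack.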
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Thank You!" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_f70985a589ad6b79f8c95b36c5279852.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_b28b9dbb6c70c33ca845fda461a64fcf.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_f70985a589ad6b79f8c95b36c5279852.webp"
width="760"
height="235"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p></description></item><item><title>[Final] Reproducibility of Interactive Notebooks in Distributed Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/08202025-rahmad/</link><pubDate>Wed, 20 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/08202025-rahmad/</guid><description>&lt;p>I am sharing an overview of my project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad">Reproducibility of Interactive Notebooks in Distributed Environments&lt;/a> and the work that I did this summer.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>This project aims to improve the reproducibility of interactive notebooks executed in distributed environments. Notebooks such as those in the &lt;a href="https://jupyter.org/" target="_blank" rel="noopener">Jupyter&lt;/a> environment have become increasingly popular and are widely used in the scientific community due to their ease of use and portability. Reproducing these notebooks, however, is a challenging task, especially in a distributed cluster environment.&lt;/p>
&lt;p>In the distributed environments we consider, the notebook code is divided into manager and worker code. The manager code is the main entry point of the program; it divides the task at hand into one or more worker tasks which run in a parallel, distributed fashion. We utilize several open source tools to package and containerize the application code so that it can be reproduced across different machines and environments. They include &lt;a href="https://github.com/radiant-systems-lab/sciunit" target="_blank" rel="noopener">Sciunit&lt;/a>, &lt;a href="https://github.com/radiant-systems-lab/Flinc" target="_blank" rel="noopener">FLINC&lt;/a>, and &lt;a href="https://cctools.readthedocs.io/en/stable/taskvine/" target="_blank" rel="noopener">TaskVine&lt;/a>. These are the high-level goals of this project:&lt;/p>
&lt;ol>
&lt;li>Generate execution logs for a notebook program.&lt;/li>
&lt;li>Generate code and data dependencies for notebook programs in an automated manner.&lt;/li>
&lt;li>Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.&lt;/li>
&lt;li>Audit and package the notebook code running in a distributed environment.&lt;/li>
&lt;li>Overall, support efficient reproducibility of notebook programs.&lt;/li>
&lt;/ol>
&lt;h1 id="progress-highlights">Progress Highlights&lt;/h1>
&lt;p>Here are the details of the work that I did during this summer.&lt;/p>
&lt;h2 id="generation-of-execution-logs">Generation of Execution Logs&lt;/h2>
&lt;p>We generate execution logs for the notebook programs in a distributed environment using the Linux utility &lt;a href="https://man7.org/linux/man-pages/man1/strace.1.html" target="_blank" rel="noopener">strace&lt;/a>, which records every system call made by the notebook, including every file accessed during its execution. We collect separate logs for the manager and the worker code, since they are executed on different machines and have different dependencies. By recording the entire notebook execution, we capture all libraries, packages, and data files referenced during the run in the form of execution logs. These logs are then used for further analysis.&lt;/p>
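To make this concrete, the snippet below shows representative log lines and one way to extract successfully opened paths. The parsing regex is an assumption: real strace output varies between versions and the project's actual parser may differ.

```python
import re

# A log like this would be produced by running, for example:
#   strace -f -e trace=openat -o trace.log python manager.py
# The lines below are representative of strace's output format:
log = '''openat(AT_FDCWD, "/usr/lib/python3.10/site-packages/numpy/__init__.py", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/home/user/data/climate.csv", O_RDONLY) = 4
openat(AT_FDCWD, "/missing.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)'''

# Keep only successful opens (non-negative return value) and record the path.
accessed = [m.group(1)
            for m in re.finditer(r'openat\([^,]+, "([^"]+)"[^)]*\) = \d+', log)]
```

Failed opens (the `-1 ENOENT` line) are skipped, since a file that was never actually read is not a dependency.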
&lt;h2 id="extracting-software-dependencies">Extracting Software Dependencies&lt;/h2>
&lt;p>When a library such as the Python package &lt;em>Numpy&lt;/em> is used by the notebook program, an entry is made in the execution log containing the complete path of the accessed library file(s) along with additional information. We analyze the execution logs for both manager and workers to find and list all dependencies. So far, we are limited to Python packages, though the methodology is general and can be used to find dependencies for any programming language. For Python packages, version numbers are also obtained by querying package managers like &lt;em>pip&lt;/em> or &lt;em>Conda&lt;/em> on the local system.&lt;/p>
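One plausible way to turn such log entries into package names and versions is sketched below; the path heuristic is illustrative and the project's actual rules may differ:

```python
import re
from importlib import metadata

def package_from_path(path):
    """Infer a Python package name from a site-packages path found in the log.
    (Illustrative heuristic, not the project's actual rules.)"""
    m = re.search(r"/(?:site|dist)-packages/([^/]+)/", path)
    return m.group(1) if m else None

def installed_version(name):
    """Mirror 'querying the package manager': importlib.metadata reads the
    same installed-package records that pip consults."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

pkg = package_from_path(
    "/usr/lib/python3.10/site-packages/numpy/core/multiarray.py")
```

Paths outside `site-packages` (e.g. user data files) yield `None` and fall through to the data-dependency analysis instead.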
&lt;h2 id="extracting-data-dependencies">Extracting Data Dependencies&lt;/h2>
&lt;p>We utilize similar execution logs to identify which data files were used by the notebook program. The list of logged files also contains various configuration or settings files used by certain packages and libraries. These files are removed from the list of data dependencies through a post-processing step that analyzes file paths.&lt;/p>
&lt;h2 id="testing-the-pipeline">Testing the Pipeline&lt;/h2>
&lt;p>We have conducted our experiments on three use cases from different domains, using between 5 and 10 workers: distributed image convolution, climate trend analysis, and high-energy physics experiment analysis. The results so far are promising, with good accuracy and only a slight running-time overhead.&lt;/p>
&lt;h2 id="processing-at-cell-level">Processing at Cell-level&lt;/h2>
&lt;p>We also perform the same steps of log generation and software and data dependency extraction at the level of individual cells, rather than once for the whole notebook. This is achieved by interrupting control flow before and after the execution of each cell to write special marker instructions into the execution log, delimiting the boundaries of each cell&amp;rsquo;s execution. We then analyze the intervals between these markers to identify which files and Python packages are accessed by each specific cell, and use this information to generate the list of software dependencies used by that cell alone.&lt;/p>
&lt;p>We also capture data dependencies by analyzing the execution logs generated by overriding the &lt;em>open&lt;/em> function call used to access various files.&lt;/p>
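The two mechanisms just described, cell-boundary markers and an overridden `open`, can be sketched together in pure Python. All names here are hypothetical stand-ins for the actual implementation:

```python
import builtins
import os
import tempfile

access_log = []            # stand-in for the marker-annotated execution log
_real_open = builtins.open

def logged_open(file, *args, **kwargs):
    access_log.append(("open", str(file)))   # record every file access
    return _real_open(file, *args, **kwargs)

def run_cell(cell_id, cell_fn):
    """Write boundary markers around a cell's execution, logging opens in between."""
    access_log.append(("cell_start", cell_id))
    builtins.open = logged_open
    try:
        cell_fn()
    finally:
        builtins.open = _real_open
        access_log.append(("cell_end", cell_id))

def deps_for_cell(cell_id):
    """Collect the file accesses between one cell's start and end markers."""
    inside, deps = False, []
    for kind, val in access_log:
        if (kind, val) == ("cell_start", cell_id):
            inside = True
        elif (kind, val) == ("cell_end", cell_id):
            inside = False
        elif inside and kind == "open":
            deps.append(val)
    return deps

# Usage: a fake "cell" that reads one data file.
data_path = os.path.join(tempfile.mkdtemp(), "input.csv")
with open(data_path, "w") as f:      # setup happens outside any cell, not logged
    f.write("x,y\n1,2\n")

def cell_1():
    with open(data_path) as f:
        f.read()

run_cell("cell-1", cell_1)
```

Only accesses between a cell's markers are attributed to it, which is exactly the interval analysis described above.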
&lt;h2 id="distributed-notebook-auditing">Distributed Notebook Auditing&lt;/h2>
&lt;p>In order to execute and audit workloads in parallel, we use &lt;a href="https://github.com/radiant-systems-lab/parallel-sciunit" target="_blank" rel="noopener">Sciunit Parallel&lt;/a>, which uses GNU Parallel for efficient parallel execution of tasks. The user specifies the number of tasks or machines, and the workload is then distributed across them. Once the execution completes, the containerized executions need to be gathered at the host location.&lt;/p>
&lt;h2 id="efficient-reproducibility-with-checkpointing">Efficient Reproducibility with Checkpointing&lt;/h2>
&lt;p>An important challenge with Jupyter notebooks is that re-executing them is often unnecessarily time-consuming and resource-intensive, especially when most cells remain unchanged. We worked on &lt;a href="https://github.com/talha129/NBRewind/tree/master" target="_blank" rel="noopener">NBRewind&lt;/a>, a lightweight tool that accelerates notebook re-execution by avoiding redundant computation. It integrates checkpointing, application virtualization, and content-based deduplication, and supports two kinds of checkpoints: incremental and full-state. With incremental checkpoints, notebook state and dependencies are stored once across multiple cells, and only their deltas are stored afterwards; with full-state checkpoints, the complete state is stored after each cell. During restore, outputs for unchanged cells are recovered directly, enabling efficient re-execution. Our empirical evaluation demonstrates that NBRewind can significantly reduce both notebook audit and repeat times with incremental checkpoints.&lt;/p>
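The incremental idea can be illustrated with a tiny content-addressed store, in the spirit of NBRewind's deduplication; the layout and names here are assumed, not NBRewind's actual design:

```python
import hashlib

class CheckpointStore:
    """Sketch of incremental checkpointing with content-based deduplication."""

    def __init__(self):
        self.blobs = {}        # sha256 -> serialized cell state, stored once
        self.checkpoints = []  # one list of blob keys per notebook run

    def _put(self, data):
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)  # unchanged content is never stored twice
        return key

    def checkpoint(self, cell_states):
        """cell_states: serialized per-cell outputs from one notebook run."""
        self.checkpoints.append([self._put(s) for s in cell_states])

store = CheckpointStore()
store.checkpoint([b"out of cell 1", b"out of cell 2", b"out of cell 3"])
# Re-run after editing only cell 2: just one new blob is added.
store.checkpoint([b"out of cell 1", b"EDITED cell 2", b"out of cell 3"])
```

Restoring an unchanged cell then amounts to looking up its blob key instead of re-executing it.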
&lt;p>I am very happy about the experience I have had in this project and I would encourage other students to join this program in the future.&lt;/p></description></item><item><title>Mid-Term Update: MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250803-rohan-babbar/</link><pubDate>Sun, 03 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250803-rohan-babbar/</guid><description>&lt;p>Hi everyone! This is my mid-term blog update for the project &lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">MPI Appliance for HPC Research on Chameleon&lt;/a>, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community.
This blog follows up on my earlier post, which you can find &lt;a href="https://ucsc-ospo.github.io/report/osre25/uchicago/mpi/20250614-rohan-babbar/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h3 id="-june-15--june-29-2025">🔧 June 15 – June 29, 2025&lt;/h3>
&lt;p>Worked on creating and configuring images on Chameleon Cloud for the following three sites:
CHI@UC, CHI@TACC, and KVM@TACC.&lt;/p>
&lt;p>Key features of the images:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Spack&lt;/strong>: Pre-installed and configured for easy package management of HPC software.&lt;/li>
&lt;li>&lt;strong>Lua Modules (LMod)&lt;/strong>: Installed and configured for environment module management.&lt;/li>
&lt;li>&lt;strong>MPI Support&lt;/strong>: Both MPICH and Open MPI are pre-installed, enabling users to run distributed applications out-of-the-box.&lt;/li>
&lt;/ul>
&lt;p>These images are now publicly available and can be seen directly on the Chameleon Appliance Catalog, titled &lt;a href="https://chameleoncloud.org/appliances/127/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04)&lt;/a>.&lt;/p>
&lt;p>I also worked on example Jupyter notebooks showing how to get started with these images.&lt;/p>
&lt;h3 id="-june-30--july-13-2025">🔧 June 30 – July 13, 2025&lt;/h3>
&lt;p>With the MPI Appliance now published on Chameleon Cloud, the next step was to automate the setup of an MPI-Spack cluster.&lt;/p>
&lt;p>To achieve this, I developed a set of Ansible playbooks that:&lt;/p>
&lt;ol>
&lt;li>Configure both master and worker nodes with site-specific settings&lt;/li>
&lt;li>Set up seamless access to Chameleon NFS shares&lt;/li>
&lt;li>Allow users to easily install Spack packages, compilers, and dependencies across all nodes&lt;/li>
&lt;/ol>
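A hypothetical excerpt of such a playbook might look like the following. This is not the project's actual code: the host group, paths, and variables are invented, and only the general shape of an Ansible task list is shown.

```yaml
# Hypothetical excerpt: configure worker nodes and mount a shared NFS tree.
- hosts: workers
  become: true
  tasks:
    - name: Ensure the NFS client is installed
      apt:
        name: nfs-common
        state: present

    - name: Mount the shared Spack install tree from Chameleon NFS
      ansible.posix.mount:
        src: "{{ nfs_server }}:/exports/spack"
        path: /software/spack
        fstype: nfs
        state: mounted
```

Keeping the Spack tree on a shared mount is one common way to make packages installed on the master node visible to every worker.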
&lt;p>These playbooks aim to simplify the deployment of reproducible HPC environments and reduce the time required to get a working cluster up and running.&lt;/p>
&lt;h3 id="-july-14--july-28-2025">🔧 July 14 – July 28, 2025&lt;/h3>
&lt;p>This period began with me fixing some issues in python-chi, the official Python client for the Chameleon testbed.
We also discussed adding support for CUDA-based packages, which would make it easier to work with NVIDIA GPUs.
We successfully published a new image on Chameleon, titled &lt;a href="https://chameleoncloud.org/appliances/130/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - CUDA)&lt;/a>, and added an example to demonstrate its usage.&lt;/p>
&lt;p>We compiled the artifact containing the Jupyter notebooks and Ansible playbooks and published it on Chameleon Trovi.
Feel free to check it out &lt;a href="https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17" target="_blank" rel="noopener">here&lt;/a>. The documentation still needs some work.&lt;/p>
&lt;p>📌 That’s it for now! I’m currently working on the documentation, a ROCm-based image for AMD GPUs, and some container-based examples.
Stay tuned for more updates in the next blog.&lt;/p></description></item><item><title>Halfway Blog - WildBerryEye: Mechanical Design &amp; Weather-Resistant Enclosure</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250725-teolangan/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250725-teolangan/</guid><description>&lt;p>Hi everyone! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/teolangan">Teodor Langan&lt;/a>, and I am an undergraduate studying Robotics Engineering at the University of California, Santa Cruz. I’m happy to share the progress I have been able to make over the last six weeks on my GSoC 2025 project. I have been developing the hardware for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/">WildBerryEye&lt;/a> project, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The WildBerryEye project enables AI-powered ecological monitoring using Raspberry Pi cameras and computer vision models. However, achieving this requires a reliable enclosure that can support long-term deployment in the wild. The goal for my project is to address this need by designing a modular, 3D-printable camera casing that protects WildBerryEye’s electronics from outside factors such as rain, dust, and bugs, while remaining easy to print and assemble. To achieve this, my main responsibilities for this project include:&lt;/p>
&lt;ul>
&lt;li>Implementing a modular design and development-friendly features for ease of assembly and flexible use across hardware setups&lt;/li>
&lt;li>Prototyping and testing enclosures outdoors to assess durability, water resistance, and ventilation—then iterating based on results&lt;/li>
&lt;li>Developing clear documentation, assembly instructions, and designing with open-source tools&lt;/li>
&lt;li>Exploring material options and print techniques to improve outdoor lifespan and environmental resilience&lt;/li>
&lt;/ul>
&lt;p>Designed largely with FreeCAD and tested in real outdoor conditions, the open-source enclosure will ensure WildBerryEye hardware can be deployed in natural environments for continuous, low-maintenance data collection.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>Over the past 6 weeks, great progress has been made on the design of the WildBerryEye camera enclosure. Some key accomplishments include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Full 3D Assembly Model of Electronics:&lt;/strong> Modeled all core components used in the WildBerryEye system to serve as a reference for enclosure design. For parts without existing CAD models, accurate measurements were taken and custom models were created in FreeCAD.&lt;/li>
&lt;li>&lt;strong>Initial Enclosure Prototype:&lt;/strong> Designed and 3D-printed a first full prototype featuring a hinge-latch mechanism to allow tool-free easy access to internal electronics for development and maintenance.&lt;/li>
&lt;li>&lt;strong>Design Iteration Based on Testing:&lt;/strong> Based on the results of the first print, created an improved version with better electronics integration, port alignment, and more functionality.&lt;/li>
&lt;/ul>
&lt;h2 id="challenges--next-steps">Challenges &amp;amp; Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Field-Ready Integration:&lt;/strong> Preparing for field testing with upcoming prototypes by making sure that all internal electronics are securely mounted and fully accessible within the enclosure.&lt;/li>
&lt;li>&lt;strong>Latch Mechanism Refinement:&lt;/strong> Finalizing a reliable hinge-latch design that can keep the enclosure sealed during outdoor use while remaining easy to open for maintenance.&lt;/li>
&lt;li>&lt;strong>Balancing Modularity, Size, and Weatherproofing:&lt;/strong> Maintaining a compact form factor without compromising on modularity or weather resistance—especially when routing cables and mounting components.&lt;/li>
&lt;li>&lt;strong>Material Experimentation:&lt;/strong> Beginning test prints with TPU, a flexible filament that may provide improved seals or gaskets for added protection.&lt;/li>
&lt;li>&lt;strong>Ventilation Without Exposure:&lt;/strong> Exploring airflow solutions such as labyrinth-style vents to enable heat dissipation without letting in moisture or debris.&lt;/li>
&lt;/ul>
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>These past 6 weeks have helped me immensely to grow my skills in mechanical design, CAD modeling, and field-focused prototyping. The WildBerryEye system can help researchers monitor pollinators and other wildlife in their natural habitats without requiring constant in-person observation or high-maintenance setups. By enabling long-term, autonomous data collection in outdoor environments, it opens new possibilities for low-cost, scalable ecological monitoring.&lt;/p>
&lt;p>I’m especially grateful to my mentor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a> and the WildBerryEye team for their ongoing support. Excited for the second half, where the design will face real-world testing and help bring this impactful system one step closer to field deployment!&lt;/p></description></item><item><title>Midterm Blog: Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/</guid><description>&lt;p>Hello! I&amp;rsquo;m Panji Sri Kuncara Wisma and I want to share my midterm progress on the &amp;ldquo;Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&amp;rdquo; project under the mentorship of Fadhil I. Kurnia.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The goal of our project is to create an open testbed that enables fair, reproducible evaluation of different consensus protocols (Paxos variants, EPaxos, Raft, etc.) when deployed at network edges. Currently, researchers struggle to compare these systems because they lack standardized evaluation environments and often rely on mock implementations of proprietary systems.&lt;/p>
&lt;p>XDN (eXtensible Distributed Network) is one of the important consensus systems we plan to evaluate in our benchmarking testbed. Built on GigaPaxos, it allows deployment of replicated stateful services across edge locations. As part of preparing our benchmarking framework, we need to ensure that the systems we evaluate, including XDN, are robust for fair comparison.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>As part of preparing our benchmarking tool, I have been working on refactoring XDN&amp;rsquo;s FUSE filesystem from C++ to Rust. This work is essential for creating a stable and reliable XDN platform.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="System Architecture" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_5600401ae6570bf38b96fa89a080f4f7.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_6d3b555dbec3bdb305839eda9b227acf.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_5600401ae6570bf38b96fa89a080f4f7.webp"
width="760"
height="439"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The diagram above illustrates how the FUSE filesystem integrates with XDN&amp;rsquo;s distributed architecture. On the left, we see the standard FUSE setup where applications interact with the filesystem through the kernel&amp;rsquo;s VFS layer. On the right, the distributed replication flow is shown: Node 1 runs &lt;code>fuselog_core&lt;/code> which captures filesystem operations and generates statediffs, while Nodes 2 and 3 run &lt;code>fuselog_apply&lt;/code> to receive and apply these statediffs, maintaining replica consistency across the distributed system.&lt;/p>
&lt;p>This FUSE component is critical for XDN&amp;rsquo;s operation as it enables transparent state capture and replication across edge nodes. By refactoring this core component from C++ to Rust, we aim to strengthen the foundation for fair benchmarking comparisons in our testbed.&lt;/p>
&lt;h3 id="core-work-c-to-rust-fuse-filesystem-migration">Core Work: C++ to Rust FUSE Filesystem Migration&lt;/h3>
&lt;p>XDN relies on a FUSE (Filesystem in Userspace) component to capture filesystem operations and generate &amp;ldquo;statediffs&amp;rdquo;: records of changes that are replicated across edge nodes. The original C++ implementation worked, but it had memory-safety concerns and limited room for optimization.&lt;/p>
&lt;p>I worked on refactoring from C++ to Rust, implementing several improvements:&lt;/p>
&lt;p>&lt;strong>New Features Added:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Zstd Compression&lt;/strong>: Reduces statediff payload sizes&lt;/li>
&lt;li>&lt;strong>Adaptive Compression&lt;/strong>: Intelligently chooses compression strategies&lt;/li>
&lt;li>&lt;strong>Advanced Pruning&lt;/strong>: Removes redundant operations (duplicate chmod/chown, created-then-deleted files)&lt;/li>
&lt;li>&lt;strong>Bincode Serialization&lt;/strong>: Helps avoid manual serialization code and reduces the risk of related bugs&lt;/li>
&lt;li>&lt;strong>Extended Operations&lt;/strong>: Added support for additional filesystem operations (mkdir, symlink, hardlinks, etc.)&lt;/li>
&lt;/ul>
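&lt;p>As a rough illustration of the pruning idea (duplicate chmod/chown collapsed, created-then-deleted files dropped), here is a Python sketch; the real implementation operates on typed Rust operations, so the (kind, path) tuples are only a stand-in:&lt;/p>

```python
# Sketch of the statediff pruning pass: keep only the last chmod/chown
# per path, and drop every operation for files created and then deleted
# within the same statediff.
def prune(ops: list) -> list:
    # Paths created and later deleted contribute nothing to final state.
    created, doomed = set(), set()
    for kind, path in ops:
        if kind == "create":
            created.add(path)
        elif kind == "unlink" and path in created:
            doomed.add(path)

    pruned, last_meta = [], set()
    # Walk backwards so only the final chmod/chown per path survives.
    for kind, path in reversed(ops):
        if path in doomed:
            continue
        if kind in ("chmod", "chown"):
            if (kind, path) in last_meta:
                continue  # an earlier (superseded) metadata change
            last_meta.add((kind, path))
        pruned.append((kind, path))
    pruned.reverse()
    return pruned
```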
&lt;p>&lt;strong>Architectural Improvements:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Memory Safety&lt;/strong>: Rust&amp;rsquo;s ownership system helps prevent common memory management issues&lt;/li>
&lt;li>&lt;strong>Type Safety&lt;/strong>: Using Rust enums instead of integer constants for better type checking&lt;/li>
&lt;/ul>
&lt;h2 id="findings">Findings&lt;/h2>
&lt;p>The optimizations performed as expected:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Database Performance Comparison" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_cb1ea5caaa82d543dfeabd0c97f7c4fe.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_d65f44ef3f769dddda7f0211b94ad6b6.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_cb1ea5caaa82d543dfeabd0c97f7c4fe.webp"
width="760"
height="433"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Statediff Size Reductions:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>MySQL workload&lt;/strong>: 572MB → 29.6MB (95% reduction)&lt;/li>
&lt;li>&lt;strong>PostgreSQL workload&lt;/strong>: 76MB → 11.9MB (84% reduction)&lt;/li>
&lt;li>&lt;strong>SQLite workload&lt;/strong>: 4MB → 29KB (99% reduction)&lt;/li>
&lt;/ul>
&lt;p>The combination of write coalescing, pruning, and compression proves especially effective for database workloads, where many operations involve small changes to large files.&lt;/p>
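&lt;p>The reported percentages follow directly from the before/after sizes; a quick Python check:&lt;/p>

```python
# Sanity-check of the statediff size reductions reported above.
def reduction(before_bytes: float, after_bytes: float) -> int:
    """Percent reduction, rounded to the nearest whole percent."""
    return round(100 * (1 - after_bytes / before_bytes))


MB, KB = 1024 * 1024, 1024
print(reduction(572 * MB, 29.6 * MB))  # MySQL: 95
print(reduction(76 * MB, 11.9 * MB))   # PostgreSQL: 84
print(reduction(4 * MB, 29 * KB))      # SQLite: 99
```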
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Rust vs C&amp;#43;&amp;#43; Performance Comparison" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_2adee964972897a04e60327dcfe9675e.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_dd86a6fc0dabbac3beb17266f1f49002.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_2adee964972897a04e60327dcfe9675e.webp"
width="760"
height="470"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Performance Comparison:&lt;/strong>
Remarkably, the Rust implementation matches or exceeds C++ performance:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>POST operations&lt;/strong>: 30% faster (10.5ms vs 15ms)&lt;/li>
&lt;li>&lt;strong>DELETE operations&lt;/strong>: 33% faster (10ms vs 15ms)&lt;/li>
&lt;li>&lt;strong>Overall latency&lt;/strong>: Consistently better (9ms vs 11ms)&lt;/li>
&lt;/ul>
&lt;h2 id="current-challenges">Current Challenges&lt;/h2>
&lt;p>While the core implementation is complete and functional, I&amp;rsquo;m currently debugging occasional latency spikes that occur under specific workload patterns. These edge cases need to be resolved before moving on to the benchmarking phase, as inconsistent performance could compromise the reliability of the evaluation.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>With the FUSE filesystem foundation nearly complete, next steps include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Resolve latency spike issues&lt;/strong> and complete XDN stabilization&lt;/li>
&lt;li>&lt;strong>Build benchmarking framework&lt;/strong> - a comparison tool that can systematically evaluate different consensus protocols with standardized metrics.&lt;/li>
&lt;li>&lt;strong>Run systematic evaluation&lt;/strong> across protocols&lt;/li>
&lt;/ol>
&lt;p>The optimized filesystem will hopefully provide a stable base for reproducible performance comparisons between distributed consensus protocols.&lt;/p></description></item><item><title>Midterm Blog - WildBerryEye User Interface</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250731-sophietao127/</link><pubDate>Wed, 16 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250731-sophietao127/</guid><description>&lt;p>Hi, my name is Sophie Tao. I am an alumna of the University of Washington, where I majored in Electrical and Computer Engineering.
I’m happy to share the progress I have made over the last six weeks on my GSoC 2025 project, WildBerryEye, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a>.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>WildBerryEye is an open-source initiative to support ecological monitoring of pollinators such as bees and hummingbirds using edge computing and computer vision. The project leverages a Raspberry Pi and YOLO for object detection and aims to provide an accessible, responsive, and real-time web interface for researchers, ecologists, and citizen scientists.&lt;/p>
&lt;p>This project specifically focuses on building the frontend and backend infrastructure for WildBerryEye’s user interface, enabling:&lt;/p>
&lt;ul>
&lt;li>Real-time pollinator detection preview
&lt;ul>
&lt;li>Real-time image capture&lt;/li>
&lt;li>Real-time video capture&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Responsive, user-friendly UI&lt;/li>
&lt;li>Object detection&lt;/li>
&lt;li>Researcher-friendly configuration and usability&lt;/li>
&lt;/ul>
&lt;h1 id="progress-so-far">Progress So Far&lt;/h1>
&lt;p>✅ Phase 1: Setup&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Frontend: Completed React + TypeScript project initialization with routing and base components. Pages include:&lt;/p>
&lt;ul>
&lt;li>Home page (with image preview)&lt;/li>
&lt;li>Dashboard page (pollinator image &amp;amp; video)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Backend: Flask server initialized with modular structure. Basic API endpoints stubbed as per the proposal.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>✅ Phase 2: Core Features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Real-Time Communication:
Frontend successfully receives image stream using WebSocket.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>UI Components:&lt;/p>
&lt;ul>
&lt;li>Implemented image carousel preview on homepage.&lt;/li>
&lt;li>Image Capture (Image download)&lt;/li>
&lt;li>Video Capture (Video Preview, Video Recording)&lt;/li>
&lt;li>Sidebar-based navigation and page structure fully integrated.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>API Development:&lt;/p>
&lt;ul>
&lt;li>Implemented core endpoints such as the /home and /dashboard routes.&lt;/li>
&lt;li>Backend handlers structured for image and video capture.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
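&lt;p>As a sketch of how a captured frame could travel over the WebSocket stream, here is a minimal Python encoder/decoder pair; the JSON message schema is an assumption for illustration, not WildBerryEye&amp;rsquo;s actual protocol:&lt;/p>

```python
# Minimal sketch of packaging a camera frame for a WebSocket image
# stream. The "type"/"timestamp"/"frame" field names are illustrative
# assumptions, not the project's real message format.
import base64
import json
import time


def frame_message(jpeg_bytes: bytes, ts=None) -> str:
    """Encode one JPEG frame as a JSON text message that a React
    frontend could decode into a data URL for the live preview."""
    return json.dumps({
        "type": "frame",
        "timestamp": ts if ts is not None else time.time(),
        "frame": base64.b64encode(jpeg_bytes).decode("ascii"),
    })


def decode_frame(message: str) -> bytes:
    """Inverse of frame_message, recovering the raw JPEG bytes."""
    return base64.b64decode(json.loads(message)["frame"])
```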
&lt;h1 id="challenges-encountered">Challenges Encountered&lt;/h1>
&lt;p>⚠️ Real-time image testing: the lack of a consistent live camera input made local testing unreliable. &lt;br>
⚠️ Sharing the camera module between image capture and video capture. &lt;br>
⚠️ Obtaining the proper video output format.&lt;/p>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;ul>
&lt;li>Enable more features for video capture&lt;/li>
&lt;li>Integrate the machine learning model&lt;/li>
&lt;li>Conduct at least one usability test (self + external user) and incorporate feedback.&lt;/li>
&lt;li>Final Testing &amp;amp; Docs&lt;/li>
&lt;/ul>
&lt;h1 id="summary">Summary&lt;/h1>
&lt;p>At this midterm stage, the WildBerryEye UI project is on track with core milestones completed, including real-time communication, component setup, and backend API structure. The remaining work focuses on refinement, visualizations, testing, and documentation to ensure a polished final product by the end of GSoC 2025.&lt;/p></description></item><item><title>Mid-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/</link><pubDate>Tue, 15 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello everyone!&lt;br>
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.&lt;/p>
&lt;h2 id="about-the-project">&lt;strong>About the Project&lt;/strong>&lt;/h2>
&lt;p>As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for investigators to discover relevant projects, notes, and assets—across both current and archived work—using information that is either user-entered or passively collected by StatWrap.&lt;/p>
&lt;p>Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Evaluating open-source search libraries&lt;/strong> suitable for local indexing and retrieval&lt;/li>
&lt;li>&lt;strong>Building the full-text search functionality&lt;/strong> directly into the StatWrap UI to allow seamless querying across projects&lt;/li>
&lt;li>&lt;strong>Ensuring reliability&lt;/strong> through the development of unit tests and comprehensive system testing&lt;/li>
&lt;li>&lt;strong>Implementing a classification system&lt;/strong> to label projects as “Active,” “Pinned,” or “Past” within the user interface&lt;/li>
&lt;/ul>
&lt;p>This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>It has been more than six weeks since the project began, and significant progress has been made. Here&amp;rsquo;s a breakdown:&lt;/p>
&lt;h3 id="1-descriptive-comparison-of-open-source-libraries">1. &lt;strong>Descriptive Comparison of Open-Source Libraries&lt;/strong>&lt;/h3>
&lt;p>Compared various open-source search libraries based on evaluation criteria such as &lt;strong>indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation&lt;/strong>, and &lt;strong>developer experience&lt;/strong>.&lt;/p>
&lt;h3 id="2-the-libraries">2. &lt;strong>The Libraries&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Lunr.js&lt;/strong>&lt;br>
A small, client-side full-text search engine that mimics Solr capabilities.&lt;/p>
&lt;ul>
&lt;li>Field-based search, boosting&lt;/li>
&lt;li>Supports TF-IDF, inverted index&lt;/li>
&lt;li>No built-in fuzzy search (only basic wildcards)&lt;/li>
&lt;li>Can serialize/deserialize index&lt;/li>
&lt;li>Not designed for large datasets&lt;/li>
&lt;li>Moderate memory usage and indexing speed&lt;/li>
&lt;li>Good documentation&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Static websites or SPAs needing simple in-browser search&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ElasticLunr.js&lt;/strong>&lt;br>
A lightweight, more flexible alternative to Lunr.js.&lt;/p>
&lt;ul>
&lt;li>Dynamic index (add/remove docs)&lt;/li>
&lt;li>Field-based and weighted search&lt;/li>
&lt;li>No advanced fuzzy matching&lt;/li>
&lt;li>Faster and more customizable than Lunr&lt;/li>
&lt;li>Smaller footprint&lt;/li>
&lt;li>Easy to use and maintain&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Developers wanting Lunr-like features with simpler customization&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fuse.js&lt;/strong>&lt;br>
A fuzzy search library ideal for small to medium datasets.&lt;/p>
&lt;ul>
&lt;li>Fuzzy search with typo tolerance&lt;/li>
&lt;li>Deep key/path searching&lt;/li>
&lt;li>No need to build index&lt;/li>
&lt;li>Highly configurable (threshold, distance, etc.)&lt;/li>
&lt;li>Linear scan = slower on large datasets&lt;/li>
&lt;li>Not full-text search (scoring-based match)&lt;/li>
&lt;li>Extremely easy to set up and use&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Fuzzy search in small in-memory arrays (e.g., auto-suggest, dropdown filters)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FlexSearch&lt;/strong>&lt;br>
A blazing-fast, modular search engine with advanced indexing options.&lt;/p>
&lt;ul>
&lt;li>Extremely fast search and indexing&lt;/li>
&lt;li>Supports phonetic, typo-tolerant, and partial matching&lt;/li>
&lt;li>Asynchronous support&lt;/li>
&lt;li>Multi-language + Unicode-friendly&lt;/li>
&lt;li>Low memory footprint&lt;/li>
&lt;li>Configuration can be complex for beginners&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: High-performance search in large/multilingual datasets&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>MiniSearch&lt;/strong>&lt;br>
A small, full-text search engine with balanced performance and simplicity.&lt;/p>
&lt;ul>
&lt;li>Fast indexing and searching&lt;/li>
&lt;li>Fuzzy search, stemming, stop words&lt;/li>
&lt;li>Field boosting and prefix search&lt;/li>
&lt;li>Compact, can serialize index&lt;/li>
&lt;li>Clean and modern API&lt;/li>
&lt;li>Lightweight and easy to maintain&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Balanced, in-browser full-text search for moderate datasets&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Search-Index&lt;/strong>&lt;br>
A persistent, full-featured search engine for Node.js and browsers.&lt;/p>
&lt;ul>
&lt;li>Persistent storage with LevelDB&lt;/li>
&lt;li>Real-time indexing&lt;/li>
&lt;li>Fielded queries, faceting, filtering&lt;/li>
&lt;li>Advanced queries (Boolean, range, etc.)&lt;/li>
&lt;li>Slightly heavier setup&lt;/li>
&lt;li>Good for offline/local-first apps&lt;/li>
&lt;li>Browser usage more complex than others&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Node.js apps, &lt;strong>not directly compatible with the Electron + React environment of StatWrap&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3-developer-experience-and-maintenance">3. Developer Experience and Maintenance&lt;/h3>
&lt;p>We analyzed the download trends of the search libraries using npm trends, and also reviewed their maintenance statistics to assess how frequently they are updated.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="DOWNLOADS" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_2981b0e25cc7e6da71dd1af69f1ab499.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_52b5a1c87803e2c8a2f59ad52703cd75.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_2981b0e25cc7e6da71dd1af69f1ab499.webp"
width="760"
height="362"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Maintenance" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_50f35746c2224661759e3d1f68308f5c.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_1f83a8585ae086eae8ad16a0d18c8fff.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_50f35746c2224661759e3d1f68308f5c.webp"
width="760"
height="261"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="4-comparative-analysis-after-testing">4. Comparative Analysis After Testing&lt;/h3>
&lt;p>Each search library was benchmarked against a predefined set of queries based on the same evaluation criteria.&lt;br>
We have yet to finalize the weights for each criterion; this will be done during the end-term evaluation.&lt;/p>
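&lt;p>Once the weights are finalized, combining the per-criterion scores is a straightforward weighted sum. The weights and scores below are placeholders for illustration, not our actual benchmark numbers:&lt;/p>

```python
# Sketch of the weighted scoring planned for the final library selection.
# Criterion weights and per-library scores are placeholder values.
def weighted_score(scores: dict, weights: dict) -> float:
    total = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total


weights = {"search_speed": 3, "fuzzy": 2, "memory": 1}  # hypothetical
candidates = {
    "MiniSearch": {"search_speed": 8, "fuzzy": 7, "memory": 8},
    "FlexSearch": {"search_speed": 10, "fuzzy": 8, "memory": 9},
    "Fuse.js":    {"search_speed": 5, "fuzzy": 9, "memory": 6},
}
best = max(candidates, key=lambda lib: weighted_score(candidates[lib], weights))
print(best)  # FlexSearch, under these placeholder numbers
```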
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="COMPARATIVE ANALYSIS" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_cf08ab4466e54fc0970dac451ab583d2.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_4d08ea843125818ade4b1288b2ed91fd.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_cf08ab4466e54fc0970dac451ab583d2.webp"
width="760"
height="578"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="5-the-user-interface">5. The User Interface&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="User Interface" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_ad72fdc47d934ea42f989055b49d88aa.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_51decc3c2ce6793ca567153dd67113d0.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_ad72fdc47d934ea42f989055b49d88aa.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Debug Tools" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_e86edc8fa7aba824f1fd8a90948c619c.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_ba6358e5089040847a0e39704677cc12.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_e86edc8fa7aba824f1fd8a90948c619c.webp"
width="760"
height="482"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The user interface includes options to search using three search modes (Basic, Advanced, Boolean operators) with configurable parameters. Results are sorted based on relevance score (highest first), and also grouped by category.&lt;/p>
&lt;h3 id="6-overall-functioning">6. Overall Functioning&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Indexing Workflow&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Projects are processed sequentially&lt;/li>
&lt;li>Metadata, files, people, and notes are indexed (larger files are queued for later)&lt;/li>
&lt;li>Uses a &amp;ldquo;brute-force&amp;rdquo; recursive approach to walk through project directories
&lt;ul>
&lt;li>Skips directories like &lt;code>node_modules&lt;/code>, &lt;code>.git&lt;/code>, &lt;code>.statwrap&lt;/code>&lt;/li>
&lt;li>Identifies eligible text files for indexing&lt;/li>
&lt;li>Logs progress every 10 files&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Document Creation Logic&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Reads file content as UTF-8 text&lt;/li>
&lt;li>Builds searchable documents with filename, content, and metadata&lt;/li>
&lt;li>Auto-generates tags based on content and file type&lt;/li>
&lt;li>Adds documents to the search index and document store&lt;/li>
&lt;li>Handles errors gracefully with debug logging&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Search Functionality&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Uses field-weighted search&lt;/li>
&lt;li>Enriches results with document metadata&lt;/li>
&lt;li>Supports filtering by type or project&lt;/li>
&lt;li>Groups results by category (files, projects, people, etc.)&lt;/li>
&lt;li>Implements caching for improved performance&lt;/li>
&lt;li>Search statistics are generated to monitor performance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
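&lt;p>The directory walk above can be sketched in Python as follows; the eligibility check here is a simple extension filter for illustration, not StatWrap&amp;rsquo;s actual file-type logic:&lt;/p>

```python
# Sketch of the recursive indexing walk, pruning node_modules, .git,
# and .statwrap so they are never descended into.
import os

SKIP_DIRS = {"node_modules", ".git", ".statwrap"}
TEXT_EXTS = {".txt", ".md", ".py", ".r", ".csv", ".json"}  # assumed set


def eligible_files(root: str) -> list:
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk skips them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if os.path.splitext(name)[1].lower() in TEXT_EXTS:
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```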
&lt;h2 id="challenges-and-end-term-goals">Challenges and End-Term Goals&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>In-memory Indexing Metadata Storing&lt;/strong>&lt;br>
Most JavaScript search libraries (like Fuse.js, Lunr, MiniSearch) store indexes entirely in memory, which can become problematic for large-scale datasets. A key challenge is designing a scalable solution that allows for disk persistence or lazy loading to prevent memory overflows.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deciding the Weights Accordingly&lt;/strong>&lt;br>
An important challenge is tuning the relevance scoring by assigning appropriate weights to different aspects of the search, such as exact word matches, prefix matches, and typo tolerance. For instance, we prefer exact matches to be ranked higher than fuzzy or partial matches.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementing the Selected Library&lt;/strong>&lt;br>
Once a library is selected (based on speed, features, and compatibility with Electron + React), the next challenge is integrating it into StatWrap efficiently—ensuring local indexing, accurate search results, and smooth performance even with large projects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Classifying Active and Past Projects in the User Interface&lt;/strong>&lt;br>
To improve navigation and search scoping, we plan to introduce three project sections in the interface: &lt;strong>Pinned&lt;/strong>, &lt;strong>Active&lt;/strong>, and &lt;strong>Past&lt;/strong> projects. This classification will help users prioritize relevant content while enabling smarter indexing strategies.&lt;/p>
&lt;/li>
&lt;/ul>
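&lt;p>The intended ranking from the weighting challenge above (exact matches above prefix matches, prefix above fuzzy) can be sketched with placeholder weights; the actual weights and fuzzy-matching logic will come from the selected library and be tuned during the end-term evaluation:&lt;/p>

```python
# Sketch of match-type relevance weighting: exact outranks prefix,
# which outranks fuzzy. Weights are placeholders; the fuzzy check is a
# crude one-character-difference test for illustration only.
WEIGHTS = {"exact": 3.0, "prefix": 1.5, "fuzzy": 0.5}  # hypothetical


def match_type(term: str, token: str):
    if token == term:
        return "exact"
    if token.startswith(term):
        return "prefix"
    if abs(len(token) - len(term)) > 1:
        return None
    if sum(a != b for a, b in zip(token, term)) > 1:
        return None
    return "fuzzy"


def score(term: str, tokens: list) -> float:
    """Sum the weight of every token that matches the query term."""
    return sum(WEIGHTS[m] for t in tokens if (m := match_type(term, t)))
```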
&lt;p>Stay tuned for the next blog!&lt;/p></description></item><item><title>From Friction to Flow: Why I'm Building Widgets for Reproducible Research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/jupyter-widgets/20250624-nbrewer/</link><pubDate>Tue, 24 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/jupyter-widgets/20250624-nbrewer/</guid><description>&lt;blockquote>
&lt;p>This summer, I’m building Jupyter Widgets to reduce friction in reproducible workflows on Chameleon. Along the way, I’m reflecting on what usability teaches us about the real meaning of reproducibility.&lt;/p>
&lt;/blockquote>
&lt;h2 id="supercomputing-competition-reproducibility-reality-check">Supercomputing Competition: Reproducibility Reality Check&lt;/h2>
&lt;p>My first reproducibility experience threw me into the deep end—trying to recreate a tsunami simulation with a GitHub repository, a scientific paper, and a lot of assumptions. I was part of a student cluster competition at the Supercomputing Conference, where one of our challenges was to reproduce the results of a prior-year paper. I assumed “reproduce” meant something like “re-run the code and get the same numbers.” But what we actually had to do was rebuild the entire computing environment from scratch—on different hardware, with different software versions, and vague documentation. I remember thinking: &lt;em>If all these conditions are so different, what are we really trying to learn by conducting reproducibility experiments?&lt;/em> That experience left me with more questions than answers, and those questions have stayed with me. In fact, they’ve become central to my PhD research.&lt;/p>
&lt;h2 id="summer-of-reproducibility-lessons-from-100-experiments-on-chameleon">Summer of Reproducibility: Lessons from 100+ Experiments on Chameleon&lt;/h2>
&lt;p>I’m currently a PhD student and research software engineer exploring questions around what computational reproducibility really means, and when and why it matters. I also participated in the &lt;strong>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/">Summer of Reproducibility 2024&lt;/a>&lt;/strong>, where I helped assess over 100 public experiments on the Chameleon platform. &lt;a href="https://doi.org/10.1109/e-Science62913.2024.10678673" target="_blank" rel="noopener">Our analysis&lt;/a> revealed key friction points—especially around usability—that don’t necessarily prevent reproducibility in the strictest sense, but introduce barriers in terms of time, effort, and clarity. These issues may not stop an expert from reproducing an experiment, but they can easily deter others from even trying. This summer’s project is about reducing that friction—some of which I experienced firsthand—by improving the interface between researchers and the infrastructure they rely on.&lt;/p>
&lt;h2 id="from-psychology-labs-to-jupyter-notebooks-usability-is-central-to-reproducibility">From Psychology Labs to Jupyter Notebooks: Usability is Central to Reproducibility&lt;/h2>
&lt;p>My thinking shifted further when I was working as a research software engineer at Purdue, supporting a psychology lab that relied on a complex statistical package. For most researchers in the lab, using the tool meant wrestling with cryptic scripts and opaque parameters. So I built a simple Jupyter-based interface to help them visualize input matrices, validate settings, and run analyses without writing code. The difference was immediate: suddenly, people could actually use the tool. It wasn’t just more convenient—it made the research process more transparent and repeatable. That experience was a turning point for me. I realized that usability isn’t a nice-to-have; it’s critical for reproducibility.&lt;/p>
&lt;h2 id="teaching-jupyter-widget-tutorials-at-scipy">Teaching Jupyter Widget Tutorials at SciPy&lt;/h2>
&lt;p>Since that first experience, I’ve leaned into building better interfaces for research workflows—especially using Jupyter Widgets. Over the past few years, I’ve developed and taught tutorials on how to turn scientific notebooks into interactive web apps, including at the &lt;strong>SciPy conference&lt;/strong> in &lt;a href="https://github.com/Jupyter4Science/scipy23-jupyter-web-app-tutorial" target="_blank" rel="noopener">2023&lt;/a> and &lt;a href="https://github.com/Jupyter4Science/scipy2024-jupyter-widgets-tutorial" target="_blank" rel="noopener">2024&lt;/a>. These tutorials go beyond the basics: I focus on building real, multi-tab applications that reflect the complexity of actual research tools. Teaching others how to do this has deepened my own knowledge of the widget ecosystem and reinforced my belief that good interfaces can dramatically reduce the effort it takes to reproduce and reuse scientific code. That’s exactly the kind of usability work I’m continuing this summer—this time by improving the interface between researchers and the Chameleon platform itself.&lt;/p>
&lt;h2 id="making-chameleon-even-more-reproducible-with-widgets">Making Chameleon Even More Reproducible with Widgets&lt;/h2>
&lt;p>This summer, I’m returning to Chameleon with a more focused goal: reducing some of the friction I encountered during last year’s reproducibility project. One of Chameleon’s standout features is its Jupyter-based interface, which already goes a long way toward making reproducibility more achievable. My work builds on that strong foundation by improving and extending interactive widgets in the &lt;strong>Python-chi&lt;/strong> library — making tasks like provisioning resources, managing leases, and tracking experiment progress on Chameleon even more intuitive. For example, instead of manually digging through IDs to find an existing lease, a widget could present your current leases in a dropdown or table, making it easier to pick up where you left off and avoid unintentionally reserving unnecessary resources. It’s a small feature, but smoothing out this kind of interaction can make the difference between someone giving up or trying again. That’s what this project is about.&lt;/p>
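&lt;p>The logic behind such a lease picker is simple; the sketch below maps lease records to dropdown options, with field names that are my assumptions rather than python-chi&amp;rsquo;s actual lease objects:&lt;/p>

```python
# Sketch of the lease-picker idea: turn lease records into (label, id)
# pairs, active leases first, so users never copy IDs by hand. The dict
# fields are illustrative assumptions.
def lease_options(leases: list) -> list:
    """Return (label, lease_id) pairs suitable for a dropdown widget's
    options, e.g. an ipywidgets Dropdown."""
    order = {"ACTIVE": 0, "PENDING": 1}
    ranked = sorted(leases, key=lambda l: (order.get(l["status"], 2), l["name"]))
    return [
        (f'{l["name"]} ({l["status"]}, ends {l["end"]})', l["id"])
        for l in ranked
    ]
```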
&lt;h2 id="looking-ahead-building-for-people-not-just-platforms">Looking Ahead: Building for People, Not Just Platforms&lt;/h2>
&lt;p>I’m excited to spend the next few weeks digging into these questions—not just about what we can build, but how small improvements in usability can ripple outward to support more reproducible, maintainable, and accessible research. Reproducibility isn’t just about rerunning code; it’s about supporting the people who do the work. I’ll be sharing updates as the project progresses, and I’m looking forward to learning (and building) along the way. I’m incredibly grateful to once again take part in this paid experience, made possible by the 2025 Open Source Research Experience team and my mentors.&lt;/p></description></item><item><title>EnvGym – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/envgym/</link><pubDate>Mon, 16 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/envgym/</guid><description>&lt;p>Hello, My name is Yiming Cheng. I am a Pre-doc researcher in Computer Science at University of Chicago. I&amp;rsquo;m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community as a project leader. My project is &lt;a href="https://github.com/eaminc/envgym" target="_blank" rel="noopener">EnvGym&lt;/a> that focuses on developing an AI-driven system to automatically generate and configure reproducible computing environments based on natural language descriptions from artifact descriptions, Trovi artifacts, and research papers.&lt;/p>
&lt;p>The complexity of environment setup often hinders reproducibility in scientific computing. My project aims to bridge the knowledge gap between experiment authors and reviewers by translating natural language requirements into actionable, reproducible configurations using AI and NLP techniques.&lt;/p>
&lt;h3 id="project-overview">Project Overview&lt;/h3>
&lt;p>EnvGym addresses fundamental reproducibility barriers by:&lt;/p>
&lt;ul>
&lt;li>Using AI to translate natural language environment requirements into actionable configurations&lt;/li>
&lt;li>Automatically generating machine images deployable on bare metal and VM instances&lt;/li>
&lt;li>Bridging the knowledge gap between experiment authors and reviewers&lt;/li>
&lt;li>Standardizing environment creation across different hardware platforms&lt;/li>
&lt;/ul>
&lt;h3 id="june-10--june-16-2025">June 10 – June 16, 2025&lt;/h3>
&lt;p>Getting started with the project setup and initial development:&lt;/p>
&lt;ul>
&lt;li>I began designing the NLP pipeline architecture to parse plain-English descriptions (e.g., &amp;ldquo;I need Python 3.9, CUDA 11, and scikit-learn&amp;rdquo;) into structured environment &amp;ldquo;recipes&amp;rdquo;&lt;/li>
&lt;li>I set up the initial project repository and development environment&lt;/li>
&lt;li>I met with my mentor Prof. Kexin Pei to discuss the project roadmap and technical approach&lt;/li>
&lt;li>I started researching existing artifact descriptions from conferences and Trovi to understand common patterns in environment requirements&lt;/li>
&lt;li>I began prototyping the backend environment builder logic that will convert parsed requirements into machine-image definitions&lt;/li>
&lt;li>I explored Chameleon&amp;rsquo;s APIs for provisioning servers and automated configuration&lt;/li>
&lt;/ul>
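&lt;p>As a toy illustration of that parsing step (this is not EnvGym&amp;rsquo;s implementation; the regexes and the flat name-to-version recipe format are assumptions for the sketch), a first pass might pull package/version pairs out of a plain-English request:&lt;/p>

```python
import re

# Illustrative sketch only: a real pipeline would use NLP/LLM techniques,
# not just regexes, and a richer recipe schema than a flat dict.
NAME_VERSION = re.compile(r"([A-Za-z][\w.-]*)\s*([0-9][\w.]*)?")

def parse_requirements(text):
    """Turn a request like 'I need Python 3.9, CUDA 11, and scikit-learn'
    into a recipe dict mapping package name to version (or None)."""
    recipe = {}
    # Drop the conversational lead-in, then split on commas and "and".
    body = re.sub(r"^\s*I need\s+", "", text, flags=re.IGNORECASE)
    for part in re.split(r",|\band\b", body):
        part = part.strip().rstrip(".")
        if not part:
            continue
        match = NAME_VERSION.match(part)
        if match:
            recipe[match.group(1)] = match.group(2)
    return recipe

print(parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn"))
# → {'Python': '3.9', 'CUDA': '11', 'scikit-learn': None}
```

&lt;p>The interesting part of the real system is everything this sketch ignores: ambiguous phrasing, implicit dependencies, and turning the recipe into an actual machine image.&lt;/p>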
&lt;h3 id="next-steps">Next Steps&lt;/h3>
&lt;ul>
&lt;li>Continue developing the NLP component for requirement parsing&lt;/li>
&lt;li>Implement the core backend logic for environment generation&lt;/li>
&lt;li>Begin integration with Chameleon Cloud APIs&lt;/li>
&lt;li>Start building the user interface for environment specification&lt;/li>
&lt;/ul>
&lt;p>This is an exciting and challenging project that combines my interests in AI systems and reproducible research. I&amp;rsquo;m looking forward to building a system that will help researchers focus on their science rather than struggling with environment setup issues.&lt;/p>
&lt;p>Thanks for reading! I will keep you updated as I make progress on EnvGym.&lt;/p></description></item><item><title>Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250616-zahratm/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250616-zahratm/</guid><description>&lt;p>Hello, my name is Zahra Temori. I am a rising senior in Computer Science at the University of Delaware. I’m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community. My project, &lt;a href="https://github.com/ChameleonCloud/cc-snapshot" target="_blank" rel="noopener">cc-snapshot&lt;/a>, focuses on enhancing features that help researchers capture and share reproducible experimental environments within the Chameleon Cloud testbed.&lt;/p>
&lt;p>For detailed information about my project and my plans for the summer, see my &lt;a href="https://docs.google.com/document/d/1kFOFL-H4WrXF7EUuXzcHLZ2p5w_DxbbWOGi-IGx39LM/edit?tab=t.0" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;h3 id="june-10--june-14-2025">June 10 – June 14, 2025&lt;/h3>
&lt;p>Getting started with the first milestone and beginning to explore the Chameleon Cloud and the project:&lt;/p>
&lt;ul>
&lt;li>I began familiarizing myself with the Chameleon Cloud platform. I created an account and successfully accessed a project.&lt;/li>
&lt;li>I learned how to launch an instance and create a lease for using computing resources.&lt;/li>
&lt;li>I met with my mentor to discuss the project goals and outline the next steps.&lt;/li>
&lt;li>I experimented with the environment and captured a snapshot to understand the process.&lt;/li>
&lt;/ul>
&lt;p>It has been less than a week and I have already learned a lot, especially about the Chameleon Cloud and how it differs from other clouds like AWS. I am excited to learn more and make progress.&lt;/p>
&lt;p>Thanks for reading! I will keep you updated as I work :)&lt;/p></description></item><item><title>MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250614-rohan-babbar/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250614-rohan-babbar/</guid><description>&lt;p>Hi everyone,&lt;/p>
&lt;p>I’m Rohan Babbar from Delhi, India. This summer, I’m excited to be working with Argonne National Laboratory and the Chameleon Cloud community. My &lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">project&lt;/a> focuses on developing an MPI Appliance to support reproducible High-Performance Computing (HPC) research on the Chameleon testbed.&lt;/p>
&lt;p>For more details about the project and the planned work for the summer, you can read my proposal &lt;a href="https://docs.google.com/document/d/1iOx95-IcEOSVxpOkL20-jT5SSDOwBiP78ysSUNpRwXs/edit?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h3 id="-community-bonding-period">👥 Community Bonding Period&lt;/h3>
&lt;p>Although the project officially started on June 2, 2025, I made good use of the community bonding period beforehand.&lt;/p>
&lt;ul>
&lt;li>I began by getting access to the Chameleon testbed, familiarizing myself with its features and tools.&lt;/li>
&lt;li>I experimented with different configurations to understand the ecosystem.&lt;/li>
&lt;li>My mentor, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ken-raffenetti/">Ken Raffenetti&lt;/a>, and I had regular check-ins to align our vision and finalize our milestones, many of which were laid out in my proposal.&lt;/li>
&lt;/ul>
&lt;h3 id="-june-2--june-14-2025">🔧 June 2 – June 14, 2025&lt;/h3>
&lt;p>Our first milestone was to build a base image with MPI pre-installed. For this:&lt;/p>
&lt;ul>
&lt;li>We decided to use &lt;a href="https://spack.io/" target="_blank" rel="noopener">Spack&lt;/a>, a flexible package manager tailored for HPC environments.&lt;/li>
&lt;li>The image includes multiple MPI implementations, allowing users to choose the one that best suits their needs and switch between them using simple &lt;a href="https://lmod.readthedocs.io/en/latest/" target="_blank" rel="noopener">Lmod&lt;/a> module commands.&lt;/li>
&lt;/ul>
&lt;p>📌 That’s all for now! Stay tuned for more updates in the next blog.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250614-debangi29/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250614-debangi29/</guid><description>&lt;p>Hello👋! I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/debangi-ghosh/">Debangi Ghosh&lt;/a>, currently pursuing a degree in Mathematics and Computing at IIT (BHU) Varanasi, India. This summer, I will be working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>. You can view my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">project proposal&lt;/a> for more details.&lt;/p>
&lt;p>My project aims to address the challenges in project navigation and discoverability by integrating a robust full-text search capability within the user interface. Instead of relying on basic keyword-based search—where remembering exact terms can be difficult—we plan to implement a natural language-based full-text search. This approach involves two main stages: indexing, which functions like creating a searchable map of the content, and searching, which retrieves relevant information from that map. We will evaluate and compare available open-source libraries to choose and implement the most effective one.
In addition, my project aims to enhance project organization by introducing a new classification system that clearly distinguishes between “Active” and “Past” projects in the user interface. This will improve clarity, reduce clutter, and provide a more streamlined experience as the number of projects grows.&lt;/p>
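&lt;p>As a minimal illustration of those two stages (this is not StatWrap&amp;rsquo;s code; the real project will adopt an evaluated open-source library rather than this toy), indexing builds an inverted map from terms to projects, and searching ranks projects by how many query terms they contain:&lt;/p>

```python
import re
from collections import defaultdict

# Illustrative sketch of the indexing and searching stages.
def build_index(docs):
    """docs: {doc_id: text}. Returns an inverted index {term: set(doc_ids)}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Rank doc_ids by the number of query terms each one contains."""
    scores = defaultdict(int)
    for term in re.findall(r"[a-z0-9]+", query.lower()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "heart-study": "logistic regression on heart disease cohort",
    "bird-counts": "mixed effects model for seasonal bird counts",
}
print(search(build_index(docs), "regression on heart data"))
# → ['heart-study']
```

&lt;p>A production library adds what this sketch omits: stemming, relevance scoring such as TF-IDF or BM25, and incremental re-indexing as project files change.&lt;/p>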
&lt;p>Stay tuned for updates on my progress in the coming weeks! 🚀&lt;/p></description></item><item><title>WildBerryEye: Mechanical Design &amp; Weather-Resistant Enclosure</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250614-teolangan/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250614-teolangan/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/teolangan">Teodor Langan&lt;/a>, an undergraduate student currently pursuing a Robotics Engineering degree at the University of California, Santa Cruz. This summer, I&amp;rsquo;ll be working on developing the hardware for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/">WildBerryEye&lt;/a> project, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a>. Here is my &lt;a href="https://drive.google.com/file/d/1DfZLWl3ccZk3ss9yMP6oL9dpsyypRBDA/view?usp=sharing" target="_blank" rel="noopener">project proposal&lt;/a>!&lt;/p>
&lt;p>My project focuses on tackling the hardware challenge for WildBerryEye, an open-source ecological monitoring platform built on Raspberry Pi. To reliably support the system&amp;rsquo;s real-time object detection, it needs a robust, weather-resistant camera enclosure that protects its electronics in the field. To address this, I will be designing and prototyping a modular, 3D-printable camera case in FreeCAD this summer. The case will protect electrical components from rain and dust while incorporating proper ventilation and heat-dissipation features. Because the entire model is designed in FreeCAD, it will be fully open source, allowing easy adoption and modification by the community. This work will include multiple rounds of field testing to refine the design under real field conditions. Ultimately, my project aims to deliver a detailed open-source FreeCAD model, full assembly documentation, and a user guide.&lt;/p>
&lt;p>I&amp;rsquo;m excited to see what we can learn throughout the development of my project!&lt;/p></description></item></channel></rss>