<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>osre25 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre25/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre25/index.xml" rel="self" type="application/rss+xml"/><description>osre25</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 05 Nov 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>osre25</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/osre25/</link></image><item><title>Final Report for Smart Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20251105-sam_huang/</link><pubDate>Wed, 05 Nov 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20251105-sam_huang/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Creating the software environment needed for code to run is a significant challenge in software development. Given a piece of open-source software intended for research, setting up its dependencies can take substantial manual effort, and existing automation methods struggle with the complexity of managing diverse languages, dependencies, and hardware. In Smart Environments, I created EnvAgent, a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, environment setup remains time-consuming, error-prone, and often poorly documented.&lt;/p>
&lt;p>To assess this capability, a new benchmark, EnvEval, was created, containing 54 popular projects across seven languages. Results show that EnvAgent dramatically improves environment construction compared to current agents (+16.2%). Furthermore, the system shows initial promise in dynamically adjusting cloud-based hardware resources to match the code’s needs.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="EnvGym Cover" srcset="
/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp 400w,
/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_d39b2369b5df80ffa715197c993f0681.webp 760w,
/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20251105-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="method">Method&lt;/h2>
&lt;h3 id="envagent">EnvAgent&lt;/h3>
&lt;p>The EnvAgent I created during my time at OSRE utilizes a multi-agent workflow to automatically build software execution environments. The process is structured into three phases: preparation, construction, and refinement.&lt;/p>
&lt;p>Phase 1 (Preparation): Specialized agents collect information about the software repository – its structure, relevant files, and the host system’s hardware specifications (CPU, memory, etc.). This data is then used by a planning agent to generate a detailed, step-by-step instruction set for creating a functional Dockerfile.&lt;/p>
&lt;p>Phase 2 (Construction): Two agents work in tandem: one generates or modifies the Dockerfile based on the plan, while the other executes the Dockerfile within an isolated container, capturing any errors.&lt;/p>
&lt;p>Phase 3 (Refinement): A final agent analyzes the container execution data, identifying areas for improvement in the Dockerfile. This process repeats until a stable, executable environment is achieved.&lt;/p>
&lt;p>To improve efficiency, EnvAgent incorporates rule-based tools for predictable tasks like directory setup and log management, reducing the need for complex agent reasoning. This combination of intelligent agents and automated routines (&amp;ldquo;scaffolding&amp;rdquo;) ensures a robust and adaptive system.&lt;/p>
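&lt;p>As a rough sketch, the three phases above can be expressed as a plan/build/refine retry loop. The function names below (plan_agent, write_agent, run_in_container, refine_agent) are illustrative toy stand-ins, not EnvAgent&amp;rsquo;s actual API:&lt;/p>

```python
# Toy sketch of EnvAgent's three-phase loop; the agents here are simple
# stand-in functions, not the LLM-backed agents of the real system.

def plan_agent(context):
    # Phase 1 (Preparation): turn repo + hardware facts into a build plan.
    steps = ["FROM python:3.11-slim", "WORKDIR /app", "COPY . /app"]
    if "requirements.txt" in context["files"]:
        steps.append("RUN pip install -r requirements.txt")
    return steps

def write_agent(plan, previous_dockerfile):
    # Phase 2a (Construction): render or revise the Dockerfile from the plan.
    return "\n".join(plan)

def run_in_container(dockerfile):
    # Phase 2b: stand-in for an isolated docker build + run capturing errors.
    ok = dockerfile.startswith("FROM")
    return {"ok": ok, "errors": [] if ok else ["no base image"]}

def refine_agent(plan, errors):
    # Phase 3 (Refinement): adjust the plan based on the captured errors.
    return ["FROM python:3.11-slim"] + plan

def build_environment(repo_info, max_rounds=5):
    context = {"files": repo_info["files"], "hardware": repo_info["hardware"]}
    plan = plan_agent(context)
    dockerfile = None
    for _ in range(max_rounds):
        dockerfile = write_agent(plan, dockerfile)
        result = run_in_container(dockerfile)
        if result["ok"]:
            return dockerfile  # stable, executable environment reached
        plan = refine_agent(plan, result["errors"])
    return dockerfile
```

&lt;p>The real system replaces these stand-ins with LLM-backed agents and rule-based tools, but the control flow follows this plan/build/refine pattern.&lt;/p>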
&lt;h3 id="enveval-benchmark">EnvEval Benchmark&lt;/h3>
&lt;p>In addition to the agent, a significant contribution is the manual curation of a benchmark that measures the quality of generated environments. EnvEval is designed to assess environment setup quality across 54 carefully curated open-source repositories, chosen from both Chameleon reproducible artifacts and the Multi-SWE-bench dataset. EnvEval includes JSON rubrics that can automatically determine the quality of constructed environments.&lt;/p>
&lt;p>Each rubric is divided into three parts, corresponding to three major objectives that a successfully constructed environment should have:&lt;/p>
&lt;ol>
&lt;li>Structure: Checks for basic directory structure, file presence, and environment variables.&lt;/li>
&lt;li>Configuration: Asks &amp;ldquo;Is this configured?&amp;rdquo; and checks whether dependencies have been correctly installed and configured.&lt;/li>
&lt;li>Functionality: Asks &amp;ldquo;Is this usable?&amp;rdquo; and runs actual tests to verify that the expected functionality is present.&lt;/li>
&lt;/ol>
&lt;p>Each category contains many tests, with weights adjusted to reflect their importance.&lt;/p>
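&lt;p>A weighted rubric of this kind can be scored in a few lines of Python. The category names, fields, and weights below are assumptions for illustration, not EnvEval&amp;rsquo;s actual schema:&lt;/p>

```python
# Illustrative weighted-rubric scorer; field names and weights are assumed,
# not taken from the EnvEval benchmark itself.

rubric = {
    "structure": [
        {"check": "working directory exists", "weight": 1, "passed": True},
        {"check": "env variables set",        "weight": 1, "passed": True},
    ],
    "configuration": [
        {"check": "dependencies resolve",     "weight": 2, "passed": True},
    ],
    "functionality": [
        {"check": "unit tests run",           "weight": 3, "passed": False},
    ],
}

def score(rubric):
    # Weighted fraction of passing checks, reported as a 0-100 score.
    checks = [c for category in rubric.values() for c in category]
    total = sum(c["weight"] for c in checks)
    earned = sum(c["weight"] for c in checks if c["passed"])
    return 100.0 * earned / total
```

&lt;p>Weighting functionality checks most heavily reflects the intuition that an environment which passes real tests matters more than one that merely has the right directory layout.&lt;/p>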
&lt;h2 id="evaluation">Evaluation&lt;/h2>
&lt;h3 id="baseline-systems">Baseline Systems&lt;/h3>
&lt;p>The study compared EnvAgent to two established automated code generation systems: one utilizing Anthropic’s advanced reasoning models and the other employing OpenAI’s code-focused models. These systems were chosen for their strong performance in creating software code and their prevalence in automated engineering processes. Both baselines were given full access to the target software repositories and complete details about the host system’s hardware.&lt;/p>
&lt;h3 id="evaluation-metrics">Evaluation Metrics&lt;/h3>
&lt;p>The performance of EnvAgent was assessed using three key metrics: the ability to create working environments, the quality of those environments, and a single combined score. EnvAgent significantly outperformed the baselines, achieving a 33.91% improvement in the final overall score and reaching 74.01 against the best baseline score of 30.10. This suggests EnvAgent both produced more functional environments and ensured greater accuracy through extensive testing.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Creating the necessary software environments for code is a major hurdle in scaling up research and development, and the task still relies heavily on manual labor. To address this, a new system, EnvAgent, was created to build these environments automatically, using intelligent agents that understand a project’s dependencies. A new benchmark, EnvEval, was also developed to assess the system’s effectiveness. Preliminary results demonstrate a significant improvement: EnvAgent achieved a 33.91% increase in success rates compared to existing automated agents, representing a substantial step towards more efficient and reproducible research.&lt;/p>
&lt;h1 id="thank-you">Thank you!&lt;/h1>
&lt;p>20251105-Sam_Huang&lt;/p></description></item><item><title>Final Blog: Rectilinear Floorplans in OpenROAD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20251022-gschaitanya/</link><pubDate>Wed, 22 Oct 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20251022-gschaitanya/</guid><description>&lt;h1 id="final-progress-enabling-rectilinear-floorplanning-in-openroad">Final Progress: Enabling Rectilinear Floorplanning in OpenROAD&lt;/h1>
&lt;p>Hello! I&amp;rsquo;m excited to share my final progress on implementing &lt;strong>rectilinear (polygonal) die support&lt;/strong> in OpenROAD&amp;rsquo;s floorplanning flow as part of Google Summer of Code 2025. Under the guidance of my mentors Eder Monteiro and Augusto Berndt, we&amp;rsquo;ve made significant strides in extending OpenROAD to handle non-rectangular die shapes.&lt;/p>
&lt;p>Here&amp;rsquo;s a link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/mcv3Hbgk" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>This project aims to add support for rectilinear floorplans in OpenROAD, an open-source EDA tool used for digital chip design. Currently, OpenROAD only supports rectangular floorplans, which limits its use in modern designs that often require more complex shapes, especially in advanced packaging, chiplet architectures, and 3D ICs.&lt;/p>
&lt;p>The project enables users to define floorplans using arbitrary rectilinear shapes made of $90^{\circ}$ corners. It involves three main components:&lt;/p>
&lt;ol>
&lt;li>Accepting polygonal input during floorplan setup&lt;/li>
&lt;li>Generating standard cell rows and routing tracks that follow the shape boundaries&lt;/li>
&lt;li>Updating pin placement logic to work with irregular outlines&lt;/li>
&lt;/ol>
&lt;p>By enabling these capabilities, OpenROAD becomes more flexible and suitable for real-world designs where blocks may need to fit together like puzzle pieces. This can lead to better area utilization and potentially shorter interconnects.&lt;/p>
&lt;p>The core challenge is maintaining robustness and backward compatibility while introducing this major new feature that touches multiple aspects of the design flow.&lt;/p>
&lt;h2 id="pull-requests-made">Pull Requests made&lt;/h2>
&lt;ol>
&lt;li>&lt;strong>Support for Rectilinear dies in PPL (Pin Placement)&lt;/strong> - &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD/pull/8182" target="_blank" rel="noopener">https://github.com/The-OpenROAD-Project/OpenROAD/pull/8182&lt;/a>&lt;/li>
&lt;li>&lt;strong>Support for Rectilinear dies in IFP (Init Floorplan)&lt;/strong> - &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD/pull/7893" target="_blank" rel="noopener">https://github.com/The-OpenROAD-Project/OpenROAD/pull/7893&lt;/a>&lt;/li>
&lt;/ol>
&lt;h2 id="key-contributions">Key Contributions&lt;/h2>
&lt;h3 id="phase-1-init-floorplan-ifp-module-support">Phase 1: Init Floorplan (IFP) Module Support&lt;/h3>
&lt;p>The first half of the project focused on adding support for rectilinear floorplans in the IFP (Init Floorplan) module. This foundational work established the infrastructure for handling non-rectangular die shapes.&lt;/p>
&lt;h4 id="1-polygonal-die-definition">1. Polygonal Die Definition&lt;/h4>
&lt;ul>
&lt;li>Implemented support for accepting polygon vertices as input to define rectilinear die shapes.&lt;/li>
&lt;li>Modified the TCL interfaces to accept a list of vertices of the rectilinear die in the &lt;code>-die_area&lt;/code> and &lt;code>-core_area&lt;/code> parameters and automatically switch to rectilinear flow.&lt;/li>
&lt;li>Developed validation logic to ensure polygons are rectilinear and valid.&lt;/li>
&lt;/ul>
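&lt;p>A minimal version of such a validation check, written here in Python purely for illustration (not the OpenROAD implementation), verifies that every edge is axis-aligned and that consecutive edges alternate between horizontal and vertical:&lt;/p>

```python
# Illustrative rectilinear-polygon validity check (not OpenROAD code).
# A rectilinear polygon has an even number of vertices, only axis-aligned
# edges, and 90-degree corners, i.e. alternating H/V edge directions.

def is_rectilinear(vertices):
    n = len(vertices)
    if n % 2 != 0 or 4 > n:
        return False
    directions = []
    for i in range(n):
        (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
        if x0 == x1 and y0 != y1:
            directions.append("V")      # vertical edge
        elif y0 == y1 and x0 != x1:
            directions.append("H")      # horizontal edge
        else:
            return False                # diagonal or zero-length edge
    # Each corner must turn 90 degrees, so edge directions must alternate.
    return all(directions[i] != directions[(i + 1) % n] for i in range(n))
```

&lt;p>An L-shaped outline passes this check, while any polygon with a diagonal edge or an odd vertex count is rejected.&lt;/p>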
&lt;h4 id="2-standard-cell-row-generation">2. Standard Cell Row Generation&lt;/h4>
&lt;ul>
&lt;li>Developed a scanline-based algorithm to generate standard cell rows that conform to complex polygonal boundaries.&lt;/li>
&lt;li>The algorithm sweeps horizontally across the die area and identifies valid row regions within the polygon.&lt;/li>
&lt;li>The existing routing track generation logic could be reused directly for rectilinear die shapes.&lt;/li>
&lt;/ul>
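&lt;p>The scanline idea can be sketched as follows. This is an illustrative Python version, not the C++ used in OpenROAD; it assumes the row height steps evenly through the die and that row midlines never coincide with horizontal polygon edges:&lt;/p>

```python
# Illustrative scanline row generation for a rectilinear die (not OpenROAD
# code). At each row's horizontal midline, the x positions of the vertical
# edges it crosses alternate between entering and leaving the polygon, so
# pairing them up yields the valid row segments.

def generate_rows(vertices, row_height):
    n = len(vertices)
    ys = [y for _, y in vertices]
    y = min(ys)
    rows = []
    while max(ys) >= y + row_height:
        mid = y + row_height / 2.0
        xs = []
        for i in range(n):
            (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
            if x0 == x1 and max(y0, y1) > mid > min(y0, y1):
                xs.append(x0)           # vertical edge crossing the midline
        xs.sort()
        for lo, hi in zip(xs[0::2], xs[1::2]):  # inside/outside pairing
            rows.append((lo, y, hi, y + row_height))
        y += row_height
    return rows
```

&lt;p>On an L-shaped die, the lower rows span the full width while the upper rows are clipped to the narrower arm, matching the behavior described above.&lt;/p>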
&lt;h4 id="3-testing-and-validation">3. Testing and Validation&lt;/h4>
&lt;ul>
&lt;li>Created comprehensive test cases for L-shaped, T-shaped, and other rectilinear configurations for floorplan creation and row generation.&lt;/li>
&lt;li>Ensured backward compatibility with rectangular floorplans.&lt;/li>
&lt;li>Added error handling for edge cases like invalid polygon specifications.&lt;/li>
&lt;/ul>
&lt;h3 id="demo-u-shaped-die-row-generation">Demo: U-Shaped Die Row Generation&lt;/h3>
&lt;p>One of our test cases involved generating rows for a U-shaped die. Here is a snapshot from the OpenROAD GUI displaying perfectly laid out rows:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="U-shaped die with rows" srcset="
/report/osre25/openroad/20251022-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_f3394946aa9b2cf0185fac9d5ffdf4fd.webp 400w,
/report/osre25/openroad/20251022-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_b6ce6605882583b9dc669d52f2b53a54.webp 760w,
/report/osre25/openroad/20251022-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20251022-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_f3394946aa9b2cf0185fac9d5ffdf4fd.webp"
width="760"
height="495"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h3 id="phase-2-pin-placement-ppl-module-support">Phase 2: Pin Placement (PPL) Module Support&lt;/h3>
&lt;h4 id="1-core-data-structure-migration">1. Core Data Structure Migration&lt;/h4>
&lt;ul>
&lt;li>Leveraged the &lt;code>odb::Line&lt;/code> class instead of the simple &lt;code>Edge&lt;/code> enum (which only handled 4 rectangular edges) to store edge data of rectilinear dies. This allows the system to handle an arbitrary number of polygon edges.&lt;/li>
&lt;li>This required refactoring nearly every function in the pin placement pipeline, as the &lt;code>Edge&lt;/code> enum was deeply embedded throughout the codebase.&lt;/li>
&lt;li>The new representation is more flexible and can handle N-sided polygons while maintaining clean abstractions.&lt;/li>
&lt;/ul>
&lt;h4 id="2-pin-slot-calculation">2. Pin Slot Calculation&lt;/h4>
&lt;ul>
&lt;li>Rewrote the &lt;code>defineSlots()&lt;/code> function family to work with polygon edges while maintaining compatibility with existing die shapes.&lt;/li>
&lt;li>Ensured slots are generated only within valid polygon boundaries.&lt;/li>
&lt;/ul>
&lt;h4 id="3-pin-orientation-algorithm">3. Pin Orientation Algorithm&lt;/h4>
&lt;ul>
&lt;li>One of the most challenging aspects was determining the correct orientation for pins on polygon edges. For rectangular dies this is trivial; for complex, concave polygons it is not.&lt;/li>
&lt;li>Leveraged the ray tracing algorithm to determine the correct pin orientations. The algorithm casts rays from edge midpoints to determine which side faces the interior of the polygon.&lt;/li>
&lt;li>This ensures pins are correctly oriented and handles complex cases like concave polygons.&lt;/li>
&lt;/ul>
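&lt;p>The ray-casting test can be illustrated in Python (a sketch, not the OpenROAD code; the &amp;ldquo;east&amp;rdquo;/&amp;ldquo;west&amp;rdquo; labels are hypothetical). A probe point nudged off one side of a vertical edge&amp;rsquo;s midpoint lies inside the polygon exactly when a ray from it crosses the boundary an odd number of times, which reveals which side faces the die interior:&lt;/p>

```python
# Illustrative ray casting for pin orientation (not the OpenROAD code).

def point_in_polygon(pt, vertices):
    # Standard even-odd rule: cast a ray in the +x direction and count
    # how many polygon edges it crosses.
    px, py = pt
    inside = False
    n = len(vertices)
    for i in range(n):
        (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
        if (y0 > py) != (y1 > py):              # edge spans the ray's y
            x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
            if x_cross > px:
                inside = not inside
    return inside

def interior_side_of_vertical_edge(edge, vertices, eps=1e-6):
    # Probe a point nudged slightly to the right of the edge midpoint;
    # if it is inside, the interior (and thus the pin's facing) is east.
    (x, y0), (_, y1) = edge
    mid_y = (y0 + y1) / 2.0
    if point_in_polygon((x + eps, mid_y), vertices):
        return "east"
    return "west"
```

&lt;p>This handles concave outlines naturally: on an L-shaped die, the right-hand edge of the full-width base faces west, while the outer left edge faces east.&lt;/p>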
&lt;h4 id="4-hungarian-matching-and-simulated-annealing-for-polygons">4. Hungarian Matching and Simulated Annealing for Polygons&lt;/h4>
&lt;ul>
&lt;li>Successfully extended both Hungarian Matching and Simulated Annealing to work with rectilinear dies as well as regular dies.&lt;/li>
&lt;li>The flow now checks if the provided die is rectilinear and intelligently switches the flow accordingly.&lt;/li>
&lt;/ul>
&lt;h3 id="demo-t-shaped-die-pin-placement">Demo: T-Shaped Die Pin Placement&lt;/h3>
&lt;p>The image below shows pins placed on a rectilinear, T-shaped die:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="T-shaped die with pins" srcset="
/report/osre25/openroad/20251022-gschaitanya/image2_hu0558c466ce478ba1d69fc985aeae3534_180817_26615c6c73756a52625bd26c654c700b.webp 400w,
/report/osre25/openroad/20251022-gschaitanya/image2_hu0558c466ce478ba1d69fc985aeae3534_180817_1981460f13704d5672a3f98115464651.webp 760w,
/report/osre25/openroad/20251022-gschaitanya/image2_hu0558c466ce478ba1d69fc985aeae3534_180817_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20251022-gschaitanya/image2_hu0558c466ce478ba1d69fc985aeae3534_180817_26615c6c73756a52625bd26c654c700b.webp"
width="760"
height="563"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;hr>
&lt;h2 id="code-quality">Code Quality&lt;/h2>
&lt;ul>
&lt;li>Followed OpenROAD coding standards and conventions&lt;/li>
&lt;li>Comprehensive error handling and validation&lt;/li>
&lt;li>Extensive code reviews with multiple rounds of refinements&lt;/li>
&lt;li>Well-documented functions and algorithms&lt;/li>
&lt;/ul>
&lt;h2 id="testing-and-validation">Testing and Validation&lt;/h2>
&lt;ul>
&lt;li>Created multiple test cases covering various rectilinear shapes&lt;/li>
&lt;li>Regression testing to ensure backward compatibility&lt;/li>
&lt;li>Edge case handling (concave polygons, tight geometries, etc.)&lt;/li>
&lt;li>Integration testing with downstream OpenROAD flows&lt;/li>
&lt;/ul>
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;ul>
&lt;li>Supporting constraints for pin placement in PPL - currently in progress&lt;/li>
&lt;li>Improved GUI support for viewing and editing polygonal floorplans&lt;/li>
&lt;li>Further optimizing algorithms for very large polygons&lt;/li>
&lt;/ul>
&lt;h2 id="acknowledgements">Acknowledgements&lt;/h2>
&lt;p>I would like to thank my mentors, Eder Monteiro and Augusto Berndt for their patience, support and guidance throughout the project. Thanks to Stephanie and the entire UC OSPO team as well as Google Summer of Code for providing me with this incredible opportunity.&lt;/p></description></item><item><title>Scenic-RoboSuite Integration: Building the First Working Prototype</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250929-sahil-tgs/</link><pubDate>Mon, 29 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250929-sahil-tgs/</guid><description>&lt;p>I&amp;rsquo;m &lt;a href="https://sahiltgs.super.site/" target="_blank" rel="noopener">Sahil&lt;/a>, presenting the first working prototype of the Scenic-RoboSuite integration. This &lt;a href="https://sahiltgs.super.site/gsoc/uc-ospo-proposal" target="_blank" rel="noopener">project&lt;/a> is being mentored by &lt;a href="https://ucsc-ospo.github.io/author/daniel-fremont/" target="_blank" rel="noopener">Daniel Fremont&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/eric-vin/" target="_blank" rel="noopener">Eric Vin&lt;/a>.&lt;/p>
&lt;p>After months of development, we have achieved a functional prototype of the &lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a>-&lt;a href="https://robosuite.ai/" target="_blank" rel="noopener">RoboSuite&lt;/a> interface. Researchers can now write basic declarative robotic manipulation scenarios in Scenic that execute with physics simulation in RoboSuite. While still in development, the prototype demonstrates the feasibility and potential of bridging probabilistic scenario generation with detailed robot control.&lt;/p>
&lt;h2 id="major-achievements">Major Achievements&lt;/h2>
&lt;h3 id="mjcf-xml-injection">MJCF XML Injection&lt;/h3>
&lt;p>The interface introduces direct MJCF XML support, allowing Scenic to build RoboSuite-native manipulable objects from raw XML definitions. Users can define custom objects with complex mesh geometries, textures, and physics properties directly in their Scenic scenarios:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">dragon_xml = &amp;#39;&amp;#39;&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;mujoco&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;asset&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;mesh file=&amp;#34;dragon.stl&amp;#34; scale=&amp;#34;0.01 0.01 0.01&amp;#34;/&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;texture file=&amp;#34;dragon_texture.png&amp;#34;/&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;/asset&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;worldbody&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;body name=&amp;#34;object&amp;#34;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;geom mesh=&amp;#34;dragon_mesh&amp;#34; type=&amp;#34;mesh&amp;#34;/&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;/body&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;/worldbody&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;/mujoco&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;&amp;#39;&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">dragon = new CustomObject with mjcfXml dragon_xml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The system automatically handles collision geometry generation, joint creation for physics, and asset file resolution.&lt;/p>
&lt;h3 id="complex-mesh-object-support">Complex Mesh Object Support&lt;/h3>
&lt;p>Import and manipulate arbitrary 3D models (STL, OBJ) with automatic mesh repair and texture mapping. The interface resolves file paths relative to Scenic files, copies assets to temporary directories for MuJoCo, and converts textures (JPG to PNG) when needed. This enables using custom robotic tools, industrial parts, or any 3D model in manipulation scenarios.&lt;/p>
&lt;h3 id="custom-arena-definition">Custom Arena Definition&lt;/h3>
&lt;p>Define complete custom environments using MJCF XML, extending beyond RoboSuite&amp;rsquo;s built-in arenas:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">custom_arena = new CustomArena with arenaXml localPath(&amp;#34;warehouse.xml&amp;#34;)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This allows creating specialized workspaces, factory floors, or research-specific environments while maintaining full physics simulation.&lt;/p>
&lt;h3 id="multi-robot-support">Multi-Robot Support&lt;/h3>
&lt;p>The interface handles multiple robots operating in the same workspace:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">robot1 = new Panda at (-0.5, 0, 0)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">robot2 = new UR5e at (0.5, 0, 0)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">table = new Table at (0, 0, 0.425)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Each robot maintains independent control and can execute coordinated or individual behaviors.&lt;/p>
&lt;h3 id="built-in-manipulation-behaviors">Built-in Manipulation Behaviors&lt;/h3>
&lt;p>Ready-to-use behaviors for immediate testing and development:&lt;/p>
&lt;ul>
&lt;li>&lt;code>MoveToPosition&lt;/code> - Precise end-effector positioning&lt;/li>
&lt;li>&lt;code>PickObject&lt;/code> - Automated grasping with approach and closure&lt;/li>
&lt;li>&lt;code>LiftToHeight&lt;/code> - Controlled lifting to target heights&lt;/li>
&lt;li>&lt;code>PickAndLift&lt;/code> - Complete pick-and-place sequence&lt;/li>
&lt;/ul>
&lt;p>These behaviors use Operational Space Control (OSC) for intuitive 3D movement commands.&lt;/p>
&lt;h3 id="extended-environment-configuration">Extended Environment Configuration&lt;/h3>
&lt;p>The interface extends RoboSuite&amp;rsquo;s configurability through Scenic&amp;rsquo;s parameter system:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">param controller_config = {&amp;#39;type&amp;#39;: &amp;#39;OSC_POSITION&amp;#39;, &amp;#39;impedance&amp;#39;: &amp;#39;low&amp;#39;}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">param camera_view = &amp;#39;robot0_eye_in_hand&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">param lite_physics = True # Faster simulation for testing
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="example-probabilistic-pick-and-place">Example: Probabilistic Pick-and-Place&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">model scenic.simulators.robosuite.model
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Randomly position cube on table
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">table = new Table at (0.6, 0, 0.425)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">cube = new Box on table,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> with color (1, 0, 0, 1),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> with position (Uniform(-0.2, 0.2), Uniform(-0.2, 0.2), _)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Robot adapts to random cube position
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">behavior AdaptivePickup():
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> do PickAndLift(cube, height=1.1)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">ego = new Panda at (0, 0, 0),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> with behavior AdaptivePickup()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Each scenario run generates a different cube position, testing the robot&amp;rsquo;s adaptive capabilities.&lt;/p>
&lt;h2 id="challenges-overcome">Challenges Overcome&lt;/h2>
&lt;h3 id="understanding-dual-architecture-paradigms">Understanding Dual Architecture Paradigms&lt;/h3>
&lt;p>RoboSuite and Scenic operate on fundamentally different principles. RoboSuite builds environments imperatively through MuJoCo XML composition, expecting complete scene specification upfront. Scenic generates scenes probabilistically through constraint solving, requiring geometric knowledge before simulation. Bridging these required developing a two-pass system where we first extract geometry from a temporary RoboSuite environment, update Scenic&amp;rsquo;s understanding, then create the final simulation. This architectural mismatch touched every aspect of the integration, from object creation to property updates.&lt;/p>
&lt;h3 id="discovering-and-extending-manipulationenv">Discovering and Extending ManipulationEnv&lt;/h3>
&lt;p>RoboSuite&amp;rsquo;s documentation focuses on using pre-built tasks, not creating custom environments. Through extensive source code analysis, we discovered that &lt;code>ManipulationEnv&lt;/code> was the key - it accepts robots as configuration while allowing customizable arenas and objects as components. This class became our foundation, but required significant extension. We implemented &lt;code>ScenicManipulationEnv&lt;/code> to intercept Scenic&amp;rsquo;s object configurations, handle dynamic arena selection (EmptyArena vs MultiTableArena based on scene content), and manage the complex initialization sequence where robots, arenas, and objects must be assembled in specific order for MuJoCo compilation.&lt;/p>
&lt;h3 id="xml-to-3d-mesh-pipeline">XML to 3D Mesh Pipeline&lt;/h3>
&lt;p>Converting MJCF XML to usable 3D meshes proved complex. MuJoCo uses XML to describe geometry, but Scenic needs actual mesh data for collision checking. We built a multi-stage pipeline: First, &lt;code>ElementTree&lt;/code> parses the XML to extract mesh references and primitive definitions. Then, we handle two paths - for mesh files, we load STL/OBJ files with trimesh and apply XML-specified transformations; for primitives (boxes, cylinders), we generate meshes programmatically. The challenge intensified with composite objects - a table might have a box tabletop and four cylinder legs. We developed &lt;code>ComponentExtractor&lt;/code> to analyze the MuJoCo scene graph, identify related geometries through naming patterns and hierarchy, and export each component as a separate GLB file with proper world transforms preserved.&lt;/p>
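&lt;p>The first stage of that pipeline can be sketched with the standard library. The snippet below builds a small MJCF-like tree programmatically (only so the example is self-contained) and extracts mesh file references and scales; it illustrates the approach rather than reproducing the project&amp;rsquo;s code:&lt;/p>

```python
# Sketch of the ElementTree stage: pull mesh references out of MJCF-style XML.
import xml.etree.ElementTree as ET

# Build a minimal MJCF-like document in memory (stands in for a parsed file).
root = ET.Element("mujoco")
asset = ET.SubElement(root, "asset")
ET.SubElement(asset, "mesh", name="dragon_mesh", file="dragon.stl",
              scale="0.01 0.01 0.01")
body = ET.SubElement(ET.SubElement(root, "worldbody"), "body", name="object")
ET.SubElement(body, "geom", mesh="dragon_mesh", type="mesh")

def mesh_references(mjcf_root):
    # Collect each mesh's source file and its XML-specified scale factors,
    # which a later stage would apply when loading the STL/OBJ with trimesh.
    refs = {}
    for mesh in mjcf_root.iter("mesh"):
        scale = tuple(float(s) for s in mesh.get("scale", "1 1 1").split())
        refs[mesh.get("name")] = {"file": mesh.get("file"), "scale": scale}
    return refs
```

&lt;p>For primitive geoms (boxes, cylinders), a parallel path would generate the meshes programmatically instead of loading them from files.&lt;/p>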
&lt;h3 id="file-path-resolution-discrepancies">File Path Resolution Discrepancies&lt;/h3>
&lt;p>Scenic and RoboSuite handle file paths completely differently. Scenic uses &lt;code>localPath()&lt;/code> for paths relative to the scenario file, while RoboSuite expects paths relative to its package structure or absolute paths. MJCF XML compounds this - mesh references can be relative to the XML file location, not the calling code. We implemented a sophisticated path resolution system: detect whether paths come from embedded XML (relative to Scenic file) or external XML files (relative to XML location), copy all referenced assets (meshes, textures) to temporary directories accessible to MuJoCo, and handle texture format conversion (JPG to PNG) when needed. This system transparently manages assets whether they&amp;rsquo;re in the Scenic project, RoboSuite package, or absolute paths, making the interface truly portable.&lt;/p>
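&lt;p>A simplified version of that resolution order might look like this (illustrative function and argument names, not the interface&amp;rsquo;s actual code): try the reference relative to the XML file, then relative to the Scenic scenario, then as an absolute path, and stage the first match in a temporary directory that MuJoCo can read:&lt;/p>

```python
# Illustrative asset path resolution and staging (not the project's code).
import os
import shutil
import tempfile

def resolve_asset(ref, xml_dir, scenic_dir):
    if os.path.isabs(ref):
        candidates = [ref]
    else:
        # XML-relative first (MJCF convention), then Scenic-file-relative.
        candidates = [os.path.join(xml_dir, ref), os.path.join(scenic_dir, ref)]
    for path in candidates:
        if os.path.exists(path):
            # Copy into a temp dir so MuJoCo sees a stable, readable location.
            staged = os.path.join(tempfile.mkdtemp(), os.path.basename(path))
            shutil.copy(path, staged)
            return staged
    raise FileNotFoundError(ref)
```

&lt;p>The real system additionally converts texture formats (JPG to PNG) during staging when MuJoCo requires it.&lt;/p>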
&lt;h2 id="impact-and-applications">Impact and Applications&lt;/h2>
&lt;p>This bridge enables:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Research&lt;/strong>: Generate diverse manipulation scenarios for robot learning algorithms&lt;/li>
&lt;li>&lt;strong>Testing&lt;/strong>: Validate robotic systems against probabilistic task variations&lt;/li>
&lt;li>&lt;strong>Development&lt;/strong>: Rapid prototyping of manipulation tasks without manual scene setup&lt;/li>
&lt;li>&lt;strong>Education&lt;/strong>: Teach robotics concepts through declarative scenario specification&lt;/li>
&lt;/ul>
&lt;p>The integration makes complex robotic simulations accessible through Scenic&amp;rsquo;s intuitive language while preserving RoboSuite&amp;rsquo;s detailed physics and control capabilities.&lt;/p>
&lt;h2 id="documentation-and-resources">Documentation and Resources&lt;/h2>
&lt;p>The project includes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Example scenarios&lt;/strong> demonstrating all features&lt;/li>
&lt;li>&lt;strong>Comprehensive STATUS.md&lt;/strong> tracking working features and known issues&lt;/li>
&lt;li>&lt;strong>Technical documentation&lt;/strong> in &lt;code>docs/&lt;/code> covering architecture and troubleshooting&lt;/li>
&lt;li>&lt;strong>Mesh extraction utilities&lt;/strong> for pre-processing and caching&lt;/li>
&lt;/ul>
&lt;h2 id="current-status-and-future-work">Current Status and Future Work&lt;/h2>
&lt;p>This prototype demonstrates that the Scenic-RoboSuite bridge is viable and functional. Basic features are working reliably:&lt;/p>
&lt;ul>
&lt;li>Single-robot manipulation scenarios execute successfully&lt;/li>
&lt;li>MJCF XML injection creates custom objects&lt;/li>
&lt;li>Pick-and-place behaviors operate consistently&lt;/li>
&lt;li>Multi-robot support functions in controlled scenarios&lt;/li>
&lt;/ul>
&lt;p>However, significant work remains:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Stability improvements&lt;/strong>: Some features work intermittently and need refinement&lt;/li>
&lt;li>&lt;strong>Velocity tracking&lt;/strong>: Full implementation awaits framework updates&lt;/li>
&lt;li>&lt;strong>Multi-robot coordination&lt;/strong>: Advanced synchronization primitives needed&lt;/li>
&lt;li>&lt;strong>Performance optimization&lt;/strong>: Mesh extraction and caching can be streamlined&lt;/li>
&lt;li>&lt;strong>Extended testing&lt;/strong>: More diverse scenarios and edge cases need validation&lt;/li>
&lt;/ul>
&lt;p>The prototype serves as a proof of concept, showing that probabilistic scenario specification can successfully drive physics-based robot simulation. The architecture is sound, the core features function, and the path forward is clear.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This working prototype of the Scenic-RoboSuite integration represents significant progress toward bridging probabilistic programming with robotic simulation. We&amp;rsquo;ve successfully demonstrated that declarative scenario specification can control detailed physics simulation, opening new possibilities for robotic system development and testing.&lt;/p>
&lt;p>While not yet production-ready, the prototype provides a solid foundation for future development. Researchers can begin experimenting with basic manipulation scenarios, developers can test the interface with their use cases, and the community can contribute to making this bridge more robust and feature-complete.&lt;/p>
&lt;p>The challenges overcome - from understanding dual architectures to implementing XML-to-mesh pipelines - have resulted in a functional system that validates our approach. This prototype proves that Scenic&amp;rsquo;s elegant scenario language and RoboSuite&amp;rsquo;s detailed physics can work together, setting the stage for a powerful new tool in robotics research and development.&lt;/p></description></item><item><title>Final Report — RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/09302025-zeyu/</link><pubDate>Sun, 28 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/09302025-zeyu/</guid><description>&lt;p>Hello! I’m Zeyu Zou! I have been contributing to the &lt;strong>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics&lt;/strong> project under the mentorship of Ziheng Duan. My project focuses on developing a framework that predicts spatial gene expression from histology images by combining vision encoders with single-cell RNA-seq references. The goal is to make spatial transcriptomics more affordable, interpretable, and scalable for the research community.&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>RAG-ST is designed to reduce the cost and complexity of spatial transcriptomics by leveraging existing histology images and scRNA-seq priors. This work integrates computer vision with retrieval-augmented generation to improve prediction accuracy and interpretability.&lt;/p>
&lt;h2 id="methods">Methods&lt;/h2>
&lt;p>The project used a two-stage pipeline:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Vision encoder&lt;/strong> (ResNet50/ViT) to map histology patches to cell type distributions.&lt;/li>
&lt;li>&lt;strong>Retrieval-augmented generation&lt;/strong> guided by scRNA-seq profiles to predict gene expression.&lt;/li>
&lt;/ol>
&lt;p>Datasets included &lt;strong>HEST-1K&lt;/strong> (paired histology and expression) and &lt;strong>CellxGene Census&lt;/strong> as the reference database. Training and evaluation pipelines were implemented in PyTorch.&lt;/p>
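&lt;p>The retrieval-augmented step can be sketched with numpy. The shapes and the similarity-weighted averaging below are an illustrative simplification; the actual pipeline runs ResNet50/ViT encoders in PyTorch against the CellxGene reference:&lt;/p>

```python
import numpy as np

def predict_expression(patch_embedding, ref_embeddings, ref_profiles, k=5):
    """Retrieve the k most similar scRNA-seq reference profiles (cosine
    similarity) and combine them, weighted by similarity, to predict a
    spot's gene expression. Shapes are illustrative only."""
    # Normalize so the dot product equals cosine similarity.
    q = patch_embedding / np.linalg.norm(patch_embedding)
    refs = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    sims = refs @ q                      # (n_refs,)
    top = np.argsort(sims)[-k:]          # indices of the k nearest references
    weights = np.exp(sims[top])
    weights = weights / weights.sum()    # softmax-style retrieval weights
    return weights @ ref_profiles[top]   # (n_genes,) predicted expression

rng = np.random.default_rng(0)
ref_emb = rng.normal(size=(200, 64))     # 200 reference cells, 64-dim embeddings
ref_expr = rng.random(size=(200, 1000))  # 1000 genes per reference cell
pred = predict_expression(rng.normal(size=64), ref_emb, ref_expr)
```

&lt;p>Keeping the retrieval weights around is also what makes the predictions interpretable: they are the retrieval traces mentioned in the results.&lt;/p>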
&lt;h2 id="results">Results&lt;/h2>
&lt;ul>
&lt;li>Implemented a complete pipeline from histology preprocessing to expression prediction.&lt;/li>
&lt;li>Achieved higher correlation scores (Pearson/Spearman) and lower errors (MSE/MAE) compared to baseline models.&lt;/li>
&lt;li>Produced spatial gene expression maps with interpretable retrieval traces and attention weights.&lt;/li>
&lt;li>Released open-source code, preprocessing scripts, and analysis notebooks for reproducibility.&lt;/li>
&lt;/ul>
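&lt;p>The correlation and error metrics above are standard; a self-contained numpy version (no tie handling in the Spearman ranks) looks like this:&lt;/p>

```python
import numpy as np

def evaluate(pred, true):
    """Compute the report's evaluation metrics - Pearson and Spearman
    correlation plus MSE/MAE - with numpy only."""
    pearson = np.corrcoef(pred, true)[0, 1]
    # Spearman = Pearson correlation of the ranks (ties not handled here).
    rank_p = np.argsort(np.argsort(pred))
    rank_t = np.argsort(np.argsort(true))
    spearman = np.corrcoef(rank_p, rank_t)[0, 1]
    mse = np.mean((pred - true) ** 2)
    mae = np.mean(np.abs(pred - true))
    return {"pearson": pearson, "spearman": spearman, "mse": mse, "mae": mae}

m = evaluate(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.1, 1.9, 3.2, 3.8]))
```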
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;ul>
&lt;li>Extend experiments to additional tissues (lung, liver, tumor samples).&lt;/li>
&lt;li>Test cross-dataset generalization and robustness.&lt;/li>
&lt;li>Explore integration into clinical pathology workflows for affordable spatial inference.&lt;/li>
&lt;/ul>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>Thanks to my mentor Ziheng Duan, the UC OSPO team, the HEST-1K dataset contributors, and the CellxGene Census project. This work was conducted under OSRE 2025.&lt;/p>
&lt;h2 id="links">Links&lt;/h2>
&lt;ul>
&lt;li>Repository: &lt;a href="https://github.com/ZeyuZou/rag-st" target="_blank" rel="noopener">https://github.com/ZeyuZou/rag-st&lt;/a>&lt;/li>
&lt;li>Preprint: &lt;em>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics&lt;/em> (bioRxiv, 2025)&lt;/li>
&lt;/ul></description></item><item><title>Final Update: Building Intelligent Observability for NRP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</link><pubDate>Thu, 25 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250925-manish-reddy/</guid><description>&lt;p>I&amp;rsquo;m excited to share the completion of my OSRE 2025 project, &amp;ldquo;&lt;em>Intelligent Observability for NRP: A GenAI Approach&lt;/em>&amp;rdquo; and the significant learning journey it has been. We&amp;rsquo;ve successfully developed a novel InfoAgent architecture that delivers on our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.&lt;/p>
&lt;h2 id="how-our-novel-infoagent-architecture-advances-the-observability-mission">How Our Novel InfoAgent Architecture Advances the Observability Mission&lt;/h2>
&lt;p>Through extensive development and testing, I&amp;rsquo;ve learned tremendously about building production-ready AI systems and have implemented a novel InfoAgent architecture that orchestrates our specialized agents:&lt;/p>
&lt;h3 id="1-prometheus-metrics-analysis-agent">1. Prometheus Metrics Analysis Agent&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Continuously ingests and processes NRP&amp;rsquo;s Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Fully implemented data pipelines handling multiple metric types with optimized latency&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Provides the foundation for anomaly detection by establishing normal behavior baselines&lt;/li>
&lt;/ul>
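&lt;p>A rolling-baseline z-score is one simple way to turn such baselines into anomaly scores. This sketch is illustrative, not the agent&amp;rsquo;s actual detector:&lt;/p>

```python
import numpy as np

def anomaly_scores(series, window=60):
    """Score each sample by its deviation from a rolling baseline
    (mean/std of the previous `window` samples) - the kind of
    normal-behavior baseline the metrics agent establishes."""
    scores = np.zeros(len(series))
    for i in range(window, len(series)):
        base = series[i - window:i]
        std = base.std() or 1e-9        # guard against a flat baseline
        scores[i] = abs(series[i] - base.mean()) / std
    return scores

rng = np.random.default_rng(1)
cpu = rng.normal(0.30, 0.02, size=121)  # steady CPU usage with noise
cpu[-1] = 0.95                          # inject a spike
s = anomaly_scores(cpu)
```

&lt;p>In a production pipeline the threshold on these scores would come from the conformal calibration step rather than a hand-picked constant.&lt;/p>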
&lt;h3 id="2-query-refinement-agent-croq">2. Query Refinement Agent (CROQ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Clarifies ambiguous metrics or patterns before generating explanations&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Completed implementation of Conformal Revision of Questions for disambiguation&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Ensures explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Successfully improved accuracy of GenAI explanations by eliminating misinterpretations&lt;/li>
&lt;/ul>
&lt;h3 id="3-explanation-generation-agent-ais">3. Explanation Generation Agent (AIS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Creates human-readable explanations and root-cause analysis&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: Finalized the Automated Information Seeker with a complete Plan→Validate→Execute→Assess→Revise cycle&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Transforms technical anomalies into actionable insights for operators&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Delivers GenAI explanations with uncertainty quantification&lt;/li>
&lt;/ul>
&lt;h2 id="completed-integration-the-novel-infoagent-pipeline">Completed Integration: The Novel InfoAgent Pipeline&lt;/h2>
&lt;p>We&amp;rsquo;ve successfully integrated all agents into a unified observability pipeline that represents our novel contribution:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Collection&lt;/strong>: Prometheus metrics → Analysis Agent (comprehensive metrics support)&lt;/li>
&lt;li>&lt;strong>Anomaly Detection&lt;/strong>: With statistical confidence bounds using conformal prediction&lt;/li>
&lt;li>&lt;strong>Query Refinement&lt;/strong>: Resolving ambiguities before explanation&lt;/li>
&lt;li>&lt;strong>Explanation Generation&lt;/strong>: Human-readable analysis with uncertainty awareness&lt;/li>
&lt;li>&lt;strong>Feedback Loop&lt;/strong>: System learning from operator interactions (implemented and tested)&lt;/li>
&lt;/ol>
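&lt;p>The statistical confidence bounds in step 2 rest on split conformal prediction: hold out calibration errors, take a finite-sample-corrected quantile, and flag new errors that exceed it. A minimal sketch (illustrative, not the pipeline&amp;rsquo;s code):&lt;/p>

```python
import numpy as np

def conformal_band(calib_errors, alpha=0.05):
    """Split conformal prediction: the finite-sample-corrected
    (1 - alpha) quantile of held-out calibration errors gives a bound
    with valid coverage on new data."""
    n = len(calib_errors)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(calib_errors, level)

rng = np.random.default_rng(0)
# |forecast - actual| on a held-out calibration split.
calib = np.abs(rng.normal(0.0, 1.0, size=500))
q = conformal_band(calib, alpha=0.1)
# A new observation is flagged anomalous when its error exceeds q.
```

&lt;p>The appeal is that the roughly 90% coverage guarantee holds regardless of the underlying model, which is what makes the GenAI explanations trustworthy to attach bounds to.&lt;/p>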
&lt;h2 id="hardware-testing-results">Hardware Testing Results&lt;/h2>
&lt;p>This project taught me valuable lessons about optimizing AI workloads on specialized hardware. We successfully tested our observability framework on Qualcomm Cloud AI 100 Ultra hardware:&lt;/p>
&lt;ul>
&lt;li>Achieved significant performance improvements over baseline CPU implementation&lt;/li>
&lt;li>Successfully ported and optimized GLM-4.5 for observability-specific tasks&lt;/li>
&lt;li>Validated that specialized AI hardware significantly enhances real-time anomaly detection&lt;/li>
&lt;/ul>
&lt;h2 id="learning-journey-and-novel-contributions">Learning Journey and Novel Contributions&lt;/h2>
&lt;p>Throughout OSRE 2025, I&amp;rsquo;ve learned extensively about:&lt;/p>
&lt;ol>
&lt;li>Building hierarchical agent coordination systems for complex reasoning&lt;/li>
&lt;li>Implementing conformal prediction for trustworthy AI outputs&lt;/li>
&lt;li>Creating self-correcting explanation pipelines&lt;/li>
&lt;li>Developing adaptive learning systems from operator feedback&lt;/li>
&lt;/ol>
&lt;p>The novel InfoAgent architecture demonstrates promising results in our testing environment, with evaluation metrics and benchmarks still being refined as work in progress.&lt;/p>
&lt;h2 id="ongoing-work-continuing-beyond-osre">Ongoing Work: Continuing Beyond OSRE&lt;/h2>
&lt;p>While OSRE 2025 is concluding, I&amp;rsquo;m actively continuing to contribute to this project:&lt;/p>
&lt;ol>
&lt;li>Preparing the InfoAgent framework for open-source release with comprehensive documentation&lt;/li>
&lt;li>Running extended evaluation tests on the Nautilus platform (work in progress)&lt;/li>
&lt;li>Writing a research paper detailing our novel architecture&lt;/li>
&lt;li>Creating tutorials to help others implement intelligent observability&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Project Updates and Code&lt;/strong>: You can follow my ongoing contributions and access the latest code at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>&lt;/p>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I&amp;rsquo;m deeply grateful to my lead mentor &lt;strong>Mohammad Firas Sada&lt;/strong> for his exceptional guidance throughout this transformative learning experience. His insights have been invaluable in helping me develop the novel InfoAgent architecture and navigate the complexities of building production-ready AI systems.&lt;/p>
&lt;p>The OSRE 2025 program has been an incredible journey of growth and discovery. I&amp;rsquo;ve learned not just how to build AI systems, but how to make them trustworthy, explainable, and genuinely useful for real-world operations. The novel InfoAgent architecture we&amp;rsquo;ve developed serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I&amp;rsquo;m excited to continue contributing to this project and look forward to seeing how the community adopts and extends these ideas. Check out my contributions and ongoing updates at &lt;a href="https://mreddy10.pages.nrp-nautilus.io/gsocnrp/" target="_blank" rel="noopener">https://mreddy10.pages.nrp-nautilus.io/gsocnrp/&lt;/a>!&lt;/p></description></item><item><title>[Final] Building PeerSky’s Extensions System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/peersky/2025-09-22-6cobi/</link><pubDate>Tue, 23 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/peersky/2025-09-22-6cobi/</guid><description>&lt;p>Hi everyone, I’m Hanzhong Liu. Over the summer I worked on building the &lt;code>peersky://extensions&lt;/code> system for &lt;a href="https://github.com/p2plabsxyz/peersky-browser" target="_blank" rel="noopener">PeerSky browser&lt;/a>, a decentralized and privacy-first browser built on Electron.&lt;/p>
&lt;p>This post is my final GSoC 2025 update — covering how the extensions manager was designed, the security model behind IPC, the UI for managing extensions, and what’s next for PeerSky.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The new extensions system makes PeerSky behave like a modern browser: you can install extensions from the Chrome Web Store or from local files, enable/disable them, update or uninstall, and interact with their toolbar actions through a puzzle-menu UI.&lt;/p>
&lt;h3 id="key-design-goals">Key Design Goals&lt;/h3>
&lt;ul>
&lt;li>Secure preload-based API exposure via &lt;code>contextBridge&lt;/code>&lt;/li>
&lt;li>Support for preinstalled, Web Store, and local packages&lt;/li>
&lt;li>Toolbar integration with pin/unpin support (up to six)&lt;/li>
&lt;li>Robust validation: MV3-only, size caps, zip-slip prevention&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="./peersky-extensions-management.png" alt="Extensions Management" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;h3 id="preinstalled-mv3s">Preinstalled MV3s&lt;/h3>
&lt;p>PeerSky now ships with three trusted extensions out of the box:&lt;/p>
&lt;ul>
&lt;li>Dark Reader&lt;/li>
&lt;li>Linguist (web page translator)&lt;/li>
&lt;li>uBlock Origin Lite&lt;/li>
&lt;/ul>
&lt;p>They remain installed by default but can be disabled at any time. This ensures users always have a working baseline without needing to browse an extension store.&lt;/p>
&lt;h3 id="electron-integration">Electron Integration&lt;/h3>
&lt;p>Instead of injecting scripts, the system uses &lt;strong>preload + IPC&lt;/strong>. Each operation is routed through validated IPC channels:&lt;/p>
&lt;ul>
&lt;li>&lt;code>listExtensions&lt;/code>, &lt;code>installFromWebStore&lt;/code>, &lt;code>toggleExtension&lt;/code>, etc.&lt;/li>
&lt;li>All methods are scoped to &lt;code>peersky://extensions&lt;/code> only.&lt;/li>
&lt;li>Rate limiting and size caps are enforced per renderer.&lt;/li>
&lt;/ul>
&lt;p>This design makes the surface auditable and prevents privilege leaks.&lt;/p>
&lt;h3 id="toolbar--puzzle-menu">Toolbar &amp;amp; Puzzle Menu&lt;/h3>
&lt;p>Browser actions appear in a puzzle menu and can be pinned for quick access:&lt;/p>
&lt;ul>
&lt;li>Up to six pins are allowed.&lt;/li>
&lt;li>Pinned state persists across sessions.&lt;/li>
&lt;li>Popups (e.g., for translators or wallets) open in isolated windows, with OAuth flows preserved via popup guards.&lt;/li>
&lt;/ul>
&lt;h3 id="security-highlights">Security Highlights&lt;/h3>
&lt;ul>
&lt;li>Installs capped at &lt;strong>60 MB&lt;/strong>, with early rejection on oversized payloads&lt;/li>
&lt;li>&lt;strong>5 installs/minute&lt;/strong> per renderer to prevent abuse&lt;/li>
&lt;li>ZIP/CRX extraction hardened against path traversal&lt;/li>
&lt;li>MV3 required; permissions validated at install with warnings for risky hosts&lt;/li>
&lt;li>Web Store installs use Google-signed CRX verification via &lt;code>electron-chrome-web-store&lt;/code>&lt;/li>
&lt;/ul>
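&lt;p>The rate-limit and size-cap checks are simple to express. PeerSky enforces them in the Electron main process in JavaScript; this Python sketch only illustrates the logic:&lt;/p>

```python
import time
from collections import defaultdict, deque

MAX_INSTALLS_PER_MIN = 5
MAX_BYTES = 60 * 1024 * 1024  # 60 MB cap

class InstallGuard:
    """Per-renderer sliding-window rate limit plus payload size cap
    (illustrative Python, not PeerSky's actual implementation)."""
    def __init__(self):
        self.history = defaultdict(deque)  # renderer id -> install timestamps

    def allow(self, renderer_id, payload_size, now=None):
        now = time.monotonic() if now is None else now
        window = self.history[renderer_id]
        # Drop timestamps older than 60 seconds.
        while window and now - window[0] > 60:
            window.popleft()
        if payload_size > MAX_BYTES:
            return False          # early rejection of oversized payloads
        if len(window) >= MAX_INSTALLS_PER_MIN:
            return False          # rate limit hit for this renderer
        window.append(now)
        return True

guard = InstallGuard()
```

&lt;p>Keeping the window per renderer is what stops one misbehaving page from exhausting the budget of every other view.&lt;/p>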
&lt;h2 id="example-installing-from-the-web-store">Example: Installing from the Web Store&lt;/h2>
&lt;p>Adding a new extension is simple:&lt;/p>
&lt;ol>
&lt;li>Paste a Chrome Web Store URL or ID into the install bar.&lt;/li>
&lt;li>PeerSky downloads and validates the CRX.&lt;/li>
&lt;li>On success, the extension appears in the grid with toggle, update, and remove options.&lt;/li>
&lt;/ol>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>This project was both challenging and rewarding. Designing an extension system meant grappling with security, IPC design, and user experience at the same time. I learned to think carefully about security management and UI/UX positioning, and to design APIs that are auditable.&lt;/p>
&lt;p>I’m grateful to my mentor Akhilesh Thite and the UC OSPO team for their guidance and feedback. Their support pushed me to make deliberate technical decisions and communicate them clearly.&lt;/p>
&lt;p>You can explore the project here:
&lt;a href="https://github.com/p2plabsxyz/peersky-browser" target="_blank" rel="noopener">https://github.com/p2plabsxyz/peersky-browser&lt;/a>&lt;/p></description></item><item><title>Final Report : Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/</link><pubDate>Thu, 18 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/</guid><description>&lt;h1 id="final-report-applying-mlops-to-overcome-reproducibility-barriers-in-ml">Final Report: Applying MLOps to Overcome Reproducibility Barriers in ML&lt;/h1>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Generating project" srcset="
/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_8793561656181f829e3597ae957831b0.webp 400w,
/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_c1f605866d28e52418a2120d1e90b899.webp 760w,
/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/image1_hu9510d428e5a70e6f0fb80fd1f824e093_949203_8793561656181f829e3597ae957831b0.webp"
width="760"
height="447"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Hello! I’m Ahmed Alghali, and this is my final report for the project &lt;a href="https://ucsc-ospo.github.io/project/osre25/nyu/mlops/" target="_blank" rel="noopener">&lt;strong>Applying MLOps to Overcome Reproducibility Barriers in ML&lt;/strong>&lt;/a> under the mentorship of Professor &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a>.&lt;/p>
&lt;p>This project aims to address the &lt;strong>reproducibility problem&lt;/strong> in machine learning—both in core ML research and in applications to other areas of science.&lt;/p>
&lt;p>The focus is on making large-scale ML experiments &lt;strong>reproducible on &lt;a href="https://www.chameleoncloud.org/" target="_blank" rel="noopener">Chameleon Cloud&lt;/a>&lt;/strong>. To do this, we developed &lt;a href="https://github.com/A7med7x7/ReproGen" target="_blank" rel="noopener">&lt;strong>ReproGen&lt;/strong>&lt;/a>, a template generator that produces ready-to-use, reproducible ML training workflows. The goal is to make the cloud easy for researchers setting up experiments, without worrying about the complexity of stitching everything together.&lt;/p>
&lt;hr>
&lt;h2 id="progress-since-mid-report">Progress Since Mid-Report&lt;/h2>
&lt;h3 id="migration-from-cookiecutter-to-copier">Migration from Cookiecutter to Copier&lt;/h3>
&lt;p>We initially used &lt;a href="https://www.cookiecutter.io/" target="_blank" rel="noopener">Cookiecutter&lt;/a> as our templating engine, but it lacked features we needed (e.g., conditional questions). We switched to &lt;a href="https://copier.readthedocs.io/en/stable/" target="_blank" rel="noopener">Copier&lt;/a>, which provides more flexibility and better matches our use case.&lt;/p>
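&lt;p>Conditional questions are the Copier feature that drove the migration. A sketch of a &lt;code>copier.yml&lt;/code> (the question names here are hypothetical; &lt;code>type&lt;/code>, &lt;code>help&lt;/code>, &lt;code>choices&lt;/code>, and &lt;code>when&lt;/code> are Copier&amp;rsquo;s actual question keys):&lt;/p>

```yaml
# copier.yml sketch - question names are illustrative, not ReproGen's.
setup_mode:
  type: str
  help: Choose a setup mode
  choices: [basic, advanced]
  default: basic

gpu_type:
  type: str
  help: GPU type for the compute node
  choices: [nvidia, amd]
  # Conditional question: only asked in Advanced Mode - the kind of
  # prompt flow Cookiecutter could not express.
  when: "{{ setup_mode == 'advanced' }}"
```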
&lt;h3 id="support-for-multiple-setup-modes">Support for Multiple Setup Modes&lt;/h3>
&lt;p>We now offer &lt;strong>two setup modes&lt;/strong>, designed to serve both beginners and users who want advanced options/customization:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Basic Mode&lt;/strong> – minimal prompts (project name, repository link, framework).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Advanced Mode&lt;/strong> – detailed control (compute site, GPU type, CUDA version, storage site, etc.).&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>This ensures accessibility for new users while still enabling fine-grained control for advanced users.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="prompting" srcset="
/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_c0169673360dadfbcd30a72263676479.webp 400w,
/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_416b0bbcc859df3cd794d760ce0308c8.webp 760w,
/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/09182025-alghali/image2_hu192805bf2f3285f5d80677238d9527e7_1255822_c0169673360dadfbcd30a72263676479.webp"
width="760"
height="448"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="automated-credential-generation">Automated Credential Generation&lt;/h3>
&lt;p>Previously, users had to manually generate application credentials (via the Horizon OpenStack UI). Now, we provide scripts that can generate two types of credentials programmatically—&lt;strong>Swift&lt;/strong> and &lt;strong>EC2&lt;/strong>—using &lt;strong>Chameleon JupyterHub credentials&lt;/strong> with &lt;code>python-chi&lt;/code> and the &lt;code>openstack-sdk&lt;/code> client.&lt;/p>
&lt;h3 id="automatic-readmemd-generation">Automatic README.md Generation&lt;/h3>
&lt;p>Each generated project includes a &lt;strong>customized README.md&lt;/strong>, containing setup guidance and commands tailored to the user’s configuration.&lt;/p>
&lt;h3 id="bug-fixes-and-ux-enhancements">Bug Fixes and UX Enhancements&lt;/h3>
&lt;p>Alongside major features, we implemented numerous smaller changes and fixes to improve the reliability and user experience of the tool.&lt;/p>
&lt;hr>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;a href="https://github.com/A7med7x7/ReproGen" target="_blank" rel="noopener">&lt;strong>ReproGen GitHub Repository&lt;/strong>&lt;/a>: source code for the template generator.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/A7med7x7/ReproGen/tree/mlflow-replay" target="_blank" rel="noopener">&lt;strong>mlflow-replay branch&lt;/strong>&lt;/a>: explore a past experiment, artifacts, and logged insights.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/A7med7x7/ReproGen/tree/training-demo" target="_blank" rel="noopener">&lt;strong>LLM-Demo branch&lt;/strong>&lt;/a>: hands-on demo to track fine-tuning of an LLM using infrastructure generated by ReproGen.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Compatibility Matrix&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>The tool and the generated setup both depend on software whose compatibility must be managed at every level: hardware, OS, drivers, computing platforms, and core and third-party libraries. Writing documentation is a first step toward easier debugging and toward adding pieces without breaking what is already there.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Maintain Docker Images&lt;/strong>&lt;/p>
&lt;p>So far we have CPU and GPU Docker images for the most frequently used frameworks:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>CPU image&lt;/strong>: for data science workloads (scikit-learn)&lt;/li>
&lt;li>&lt;strong>GPU NVIDIA variant&lt;/strong>: for deep learning workloads on NVIDIA machines (PyTorch, Lightning, TensorFlow)&lt;/li>
&lt;li>&lt;strong>GPU AMD variant&lt;/strong>: for deep learning workloads on AMD machines (PyTorch, Lightning, TensorFlow). Adding variants for more frameworks and enhancing the existing images is recommended.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>When I first joined SoR 2025, I had trouble crystallizing how to practically achieve reproducibility and how to package a tool that would maximize the chance of reproducing an experiment built with it. Throughout the journey my mentors took me under their wings and helped me understand the &lt;strong>reproducibility challenges in ML&lt;/strong>. My mentor, Professor &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a>, wrote materials that saved me a lot of time in familiarizing myself with the &lt;a href="https://www.chameleoncloud.org/" target="_blank" rel="noopener">testbed&lt;/a> and important Linux tools and commands, and even gave me hands-on practice with how &lt;a href="https://teaching-on-testbeds.github.io/mltrain-chi/" target="_blank" rel="noopener">large model training&lt;/a> with an MLflow tracking server is done in the cloud. &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a> took the time to review my presentation and pushed me to do my best. I&amp;rsquo;m forever thankful for the way they shaped the project and my personal growth. This hands-on experience helped me see &lt;strong>MLOps, cloud APIs, and workflow design&lt;/strong> through different lenses, and I’m proud to have contributed a tool that can simplify reproducible research for others.&lt;/p></description></item><item><title>Final Report: CarbonCast — An end-to-end consumption-based Carbon Intensity Forecasting service</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250915-tanushsavadi/</link><pubDate>Mon, 15 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250915-tanushsavadi/</guid><description>&lt;p>Hi everyone—this is my final report for &lt;strong>CarbonCast&lt;/strong>, mentored by &lt;strong>Professor Abel Souza&lt;/strong>. 
Back in June, my goal was simple to say and harder to pull off: help people &lt;strong>see&lt;/strong> when the grid is cleaner and make it easy to act on that information. Over the summer I turned CarbonCast from a research prototype into something you can open, click, and rely on: a containerized backend, a clean API, and a fast, friendly map UI.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>CarbonCast forecasts the &lt;strong>carbon intensity&lt;/strong> of electricity (gCO₂e/kWh) using grid data and weather. Earlier versions were accurate but difficult to run and even harder to use outside a research context. My OSRE focus was to make CarbonCast usable for real people: provide a standard API, build a web UI that feels responsive, and package everything so it starts quickly and keeps itself healthy.&lt;/p>
&lt;h2 id="goals">Goals&lt;/h2>
&lt;p>I centered the work around four goals. First, I wanted to &lt;strong>ship an end-to-end containerized stack&lt;/strong>—data collection, validation, storage, API, and UI—that someone else could run without digging through my notes. Second, I aimed to &lt;strong>expand coverage&lt;/strong> beyond a handful of regions so the map would be genuinely useful. Third, I needed to &lt;strong>make it reliable&lt;/strong>, with retries, monitoring, and graceful fallbacks so the system could run for weeks without babysitting. Finally, I wanted to &lt;strong>lay the groundwork for a consumption-based signal&lt;/strong>, because imports from neighboring regions also shape a region’s true emissions picture.&lt;/p>
&lt;h2 id="what-i-built">What I built&lt;/h2>
&lt;p>By the end of the program, CarbonCast runs as a &lt;strong>containerized backend + API + web app&lt;/strong> that you can bring up with Docker. The pipelines now reach &lt;strong>85+ regions&lt;/strong>, and the UI currently exposes &lt;strong>58+&lt;/strong> while we finish integrating the rest. The API offers straightforward endpoints for current conditions and multi-day views, plus region metadata so clients can discover what’s available. The UI presents an &lt;strong>interactive choropleth map&lt;/strong> with a side panel for the &lt;strong>energy mix&lt;/strong> and a simple &lt;strong>timeline&lt;/strong> to move between past, now, and the next few days. To keep things feeling snappy, I tuned caching so “now” data updates quickly while historical and forecast views load instantly from cache. I also added a small &lt;strong>“mission control” dashboard&lt;/strong> that shows what updated, what failed, and how the system recovered, which makes maintenance far less mysterious.&lt;/p>
&lt;h2 id="how-it-works">How it works&lt;/h2>
&lt;p>Fresh weather and grid data arrive on a regular schedule. The system checks each file for sanity, stores it, and serves it through a clean API. The React app calls that API and paints the map. Hovering reveals regional details; clicking opens a richer panel with the energy mix and trends; the timeline lets you scrub through hours naturally. In short, the path is &lt;strong>fresh data → API → map&lt;/strong>, and each step is designed to be obvious and quick.&lt;/p>
&lt;p>Behind the scenes, I extended the existing Django backend with a &lt;strong>SQLite path&lt;/strong> so the UI works out of the box on a laptop. For production, you can point the same code at Postgres or MySQL without changing the UI. This choice made local testing easy while leaving room for scale later.&lt;/p>
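&lt;p>In Django terms, that switch is just environment-driven settings. A sketch under assumed names (&lt;code>CARBONCAST_DB&lt;/code> and the other variables are hypothetical, not the project&amp;rsquo;s actual configuration):&lt;/p>

```python
# settings.py sketch: pick the database engine from the environment so
# the same code runs on SQLite locally and Postgres in production.
# (Variable names are illustrative, not CarbonCast's actual settings.)
import os
from pathlib import Path

BASE_DIR = Path(".").resolve()  # in a real settings.py: Path(__file__).resolve().parent

if os.environ.get("CARBONCAST_DB") == "postgres":
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": os.environ.get("POSTGRES_DB", "carboncast"),
            "HOST": os.environ.get("POSTGRES_HOST", "localhost"),
            "USER": os.environ.get("POSTGRES_USER", "carboncast"),
            "PASSWORD": os.environ.get("POSTGRES_PASSWORD", ""),
        }
    }
else:
    # Out-of-the-box laptop path: a zero-config SQLite file.
    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.sqlite3",
            "NAME": BASE_DIR / "carboncast.sqlite3",
        }
    }
```

&lt;p>Because the UI only talks to the API, nothing above the database layer has to change when the engine does.&lt;/p>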
&lt;h2 id="highlights">Highlights&lt;/h2>
&lt;p>A few moments stand out. The first time the dashboard flipped from red to green on its own—after the system retried through a wave of timeouts—was a turning point. Clicking across the map and getting instant responses because the right data was cached felt great too. And packaging everything so another person can run it without asking me for help might be the biggest quality-of-life win for future contributors.&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>The first big hurdle was &lt;strong>refactoring the old vanilla-JS interface&lt;/strong>. The original UI worked, but it was dated and hard to extend. I rebuilt it as a modern React + TypeScript app with a cleaner component structure and a fresh look—think &lt;strong>glassmorphic panels&lt;/strong>, readable color scales, and a layout that feels consistent on both laptops and smaller screens. Moving to this design system made the codebase far easier to maintain, theme, and iterate on.&lt;/p>
&lt;p>The next challenge was &lt;strong>performance under real-time load&lt;/strong>. With dozens of regions updating, it was easy to hit API limits and make the UI feel jittery. I solved this by adding a smart &lt;strong>caching layer&lt;/strong> with short, volatility-aware timeouts, request de-duplication, and background prefetching. That combination dramatically reduced round-trips, essentially &lt;strong>eliminated rate-limit hits&lt;/strong>, and made the map feel responsive even as you scrub through time. The result is a UI that can handle many simultaneous updates &lt;strong>without hiccups&lt;/strong>.&lt;/p>
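&lt;p>The caching idea fits in a few lines. This sketch is in Python for brevity (the real cache lives in the React/TypeScript client), and the timeout values are illustrative:&lt;/p>

```python
import time

# Volatility-aware TTLs in seconds: "now" data goes stale fast, while
# forecasts and history do not. Values are illustrative, not CarbonCast's.
TTL = {"current": 60, "forecast": 15 * 60, "history": 24 * 3600}

class VolatilityCache:
    """Tiny TTL cache: each entry expires according to how volatile its
    data class is, so 'now' stays fresh while history loads instantly."""
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.store = {}   # key -> (expires_at, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry and self.clock() <= entry[0]:
            return entry[1]
        return None       # missing or expired

    def put(self, key, value, kind):
        self.store[key] = (self.clock() + TTL[kind], value)
```

&lt;p>De-duplicating in-flight requests for the same key and prefetching adjacent timeline hours sit on top of the same structure.&lt;/p>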
&lt;p>Finally, there were plenty of &lt;strong>stubborn UI bugs&lt;/strong>. Some regions wouldn’t color even when data was available, certain charts refused to render, and a few elements flickered or never showed up. Most of this came down to learning &lt;strong>React state management&lt;/strong> in a real project: taming race conditions, canceling in-flight requests when users navigate, and making sure state only updates when fresh data actually arrives. Fixing those issues taught me a lot about how maps re-paint, how charts expect their data, and how to keep components simple enough that they behave the way users expect.&lt;/p>
&lt;h2 id="what-didnt-make-the-cut-yet">What didn’t make the cut (yet)&lt;/h2>
&lt;p>I designed—but did not finish—&lt;strong>per-region plug-in models&lt;/strong> so each grid can use the approach that fits it best. We decided to ship a stable, deployable service first and reserve that flexibility work for the next phase. The design is written down and ready to build.&lt;/p>
&lt;h2 id="links-and-resources">Links and resources:&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Project page:&lt;/strong> &lt;a href="project/osre25/ucsc/carboncast/">CarbonCast&lt;/a>&lt;/li>
&lt;li>&lt;strong>Proposal:&lt;/strong> &lt;a href="https://ucsc-ospo.github.io/report/osre25/ucsc/carboncast/20250710-tanushsavadi/" target="_blank" rel="noopener">https://ucsc-ospo.github.io/report/osre25/ucsc/carboncast/20250710-tanushsavadi/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Midterm blog:&lt;/strong> &lt;a href="https://ucsc-ospo.github.io/report/osre25/ucsc/carboncast/20250803-tanushsavadi/" target="_blank" rel="noopener">https://ucsc-ospo.github.io/report/osre25/ucsc/carboncast/20250803-tanushsavadi/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Backend/API (branch):&lt;/strong> &lt;a href="https://github.com/carbonfirst/CarbonCast/tree/django_apis_sqlite" target="_blank" rel="noopener">https://github.com/carbonfirst/CarbonCast/tree/django_apis_sqlite&lt;/a>&lt;/li>
&lt;li>&lt;strong>Frontend/UI:&lt;/strong> &lt;a href="https://github.com/carbonfirst/CarbonCastUI/tree/main" target="_blank" rel="noopener">https://github.com/carbonfirst/CarbonCastUI/tree/main&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="whats-next">What’s next&lt;/h2>
&lt;p>My next steps are clear. I want to finish the &lt;strong>per-region model plug-ins&lt;/strong> so grids can bring their own best forecasting logic. I also plan to carry the &lt;strong>consumption-based&lt;/strong> signal end-to-end, including imports and interconnects surfaced directly in the UI. Finally, I’ll harden the system for production by enabling auth and throttling and by moving to a production-grade database where appropriate.&lt;/p>
&lt;h2 id="thank-you">Thank you&lt;/h2>
&lt;p>Huge thanks to &lt;strong>Professor Abel Souza&lt;/strong> for steady mentorship and to the &lt;strong>OSRE&lt;/strong> community for thoughtful feedback. The most rewarding part of this summer was watching a research idea become something people can &lt;strong>click on—and use&lt;/strong> to make cleaner choices.&lt;/p></description></item><item><title>Final Report: A Systematic Investigation into the Reproducibility of RAG Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250905-wbq321/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250905-wbq321/</guid><description>&lt;p>I&amp;rsquo;m Baiqiang, and this is the final report for the &lt;a href="https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/" target="_blank" rel="noopener">Enhancing Reproducibility in RAG Frameworks for Scientific Workflows&lt;/a> project, mentored by Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo and Dongfang Zhao. This project successfully developed a novel framework to quantitatively measure reproducibility in AI systems, yielding several surprising and impactful results.&lt;/p>
&lt;h3 id="the-challenge-the-need-for-systematic-measurement">The Challenge: The Need for Systematic Measurement&lt;/h3>
&lt;p>Retrieval-Augmented Generation (RAG) is a cornerstone of AI for science, but its reliability is often compromised by non-determinism. While this issue was a known concern, a fundamental challenge was the lack of standardized tools and methodologies to systematically measure and quantify the sources of this inconsistency. Without a rigorous way to analyze the problem, it was difficult to move beyond ad-hoc tests and establish the true root causes, hindering the development of truly trustworthy AI systems for science.&lt;/p>
&lt;h3 id="our-contribution-the-reprorag-framework">Our Contribution: The ReproRAG Framework&lt;/h3>
&lt;p>To address this gap, the central contribution of this project is &lt;strong>ReproRAG&lt;/strong>, a comprehensive, open-source benchmarking framework. ReproRAG is designed to systematically investigate sources of uncertainty across the entire RAG pipeline by:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Isolating Variables:&lt;/strong> It allows for controlled experiments on embedding models, numerical precision, retrieval algorithms, hardware configurations (CPU/GPU), and distributed execution environments.&lt;/li>
&lt;li>&lt;strong>Quantifying Uncertainty:&lt;/strong> It employs a suite of metrics—including Exact Match Rate, Jaccard Similarity, and Kendall&amp;rsquo;s Tau—to precisely measure the impact of each variable on the final retrieved results.&lt;/li>
&lt;/ul>
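&lt;p>These metrics are easy to state precisely. As an illustration (my own minimal sketch, not the ReproRAG implementation), here is how the agreement between two ranked retrieval runs can be quantified, including the overlap coefficient used in the cross-model comparison:&lt;/p>

```python
def exact_match_rate(a, b):
    """Fraction of rank positions where both runs retrieved the same document."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def jaccard(a, b):
    """Set overlap of retrieved documents, ignoring rank."""
    sa, sb = set(a), set(b)
    return len(sa.intersection(sb)) / len(sa.union(sb))

def overlap_coefficient(a, b):
    """Intersection size divided by the smaller set size."""
    sa, sb = set(a), set(b)
    return len(sa.intersection(sb)) / min(len(sa), len(sb))

def kendall_tau(a, b):
    """Kendall's tau over documents retrieved in both runs:
    (concordant pairs - discordant pairs) / total pairs."""
    common = [d for d in a if d in b]
    pos = {d: b.index(d) for d in common}
    n = len(common)
    if n in (0, 1):
        return 1.0  # trivially consistent ordering
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # common[i] precedes common[j] in run a; does run b agree?
            if pos[common[i]] > pos[common[j]]:
                discordant += 1
            else:
                concordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

&lt;p>Two identical runs score 1.0 on all metrics; two runs that retrieve the same documents in reversed order score 1.0 on Jaccard but -1.0 on Kendall&amp;rsquo;s tau.&lt;/p>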
&lt;h3 id="key-findings-a-new-hierarchy-of-uncertainty">Key Findings: A New Hierarchy of Uncertainty&lt;/h3>
&lt;p>Our large-scale empirical study using ReproRAG challenged common assumptions and established a clear hierarchy of what actually impacts reproducibility.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Core Algorithms Are Not the Problem:&lt;/strong> Our most surprising finding is that modern retrieval libraries like FAISS are perfectly reproducible out-of-the-box. Across all tested index types (including approximate ones like HNSW and IVF) and execution environments (single-node CPU/GPU and multi-node distributed systems), we achieved perfect run-to-run reproducibility (1.000 scores on all metrics) when environmental factors like random seeds were controlled. This falsifies the common hypothesis that approximate nearest neighbor algorithms are a primary source of randomness.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Embedding Model Choice is a Dominant Source of Variation:&lt;/strong> We found that the choice of the embedding model is a dominant factor driving result variation. When comparing outputs from different state-of-the-art models (BGE, E5, Qwen) for the same query, the agreement was very low (e.g., Overlap Coefficient of ~0.43-0.54). This means a scientific conclusion drawn with one model may not be reproducible with another, as they are fundamentally &amp;ldquo;seeing&amp;rdquo; different evidence.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Environmental Factors Introduce Measurable &amp;ldquo;Drift&amp;rdquo;:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Numerical Precision:&lt;/strong> Changing floating-point precision (e.g., FP32 vs. FP16) was a guaranteed source of variation, but it caused a small and quantifiable &amp;ldquo;embedding drift&amp;rdquo; rather than chaotic changes.&lt;/li>
&lt;li>&lt;strong>Data Insertion:&lt;/strong> Incrementally adding new data to an index caused a predictable &amp;ldquo;displacement&amp;rdquo; of old results, not a re-shuffling. The relative ranking of the remaining original documents was perfectly stable (Kendall&amp;rsquo;s Tau of 1.000).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Common Determinism Flags Can Be Ineffective:&lt;/strong> Our tests showed that popular software-level controls, like &lt;code>cudnn.deterministic&lt;/code> flags in PyTorch, had no observable effect on the output of modern transformer-based embedding models. This underscores the necessity of empirical validation over assuming that framework settings work as advertised.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>This project successfully shifted the focus of the RAG reproducibility problem. The key challenge is not to fix supposedly &amp;ldquo;random&amp;rdquo; algorithms, but to rigorously control the entire experimental environment. We delivered &lt;strong>ReproRAG&lt;/strong>, a framework that empowers researchers to do just that. Our findings provide actionable insights for the community: efforts to improve reproducibility should focus less on the retrieval algorithms themselves and more on disciplined management of embedding models, data versioning, and numerical precision.&lt;/p></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250905-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-2">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 2&lt;/h1>
&lt;p>Hello all! 👋 This is Tharit again. In this blog post I want to share Part 2 of my Google Summer of Code project. In case you missed it, you can take a look at &lt;a href="https://ucsc-ospo.github.io/report/osre25/lbl/pylops-mpi/20250723-tharit/" target="_blank" rel="noopener">Part 1&lt;/a> as well. Without further introduction, the following features were added since last time.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="complex-number-support-pr-148httpsgithubcompylopspylops-mpipull148">Complex Number Support &lt;a href="https://github.com/PyLops/pylops-mpi/pull/148" target="_blank" rel="noopener">PR #148&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>&lt;em>Between this PR and the previous PR, there was a lot of debugging and testing to make sure that all existing &lt;code>MPILinearOperator&lt;/code>s work under NCCL as they do with &lt;code>mpi4py&lt;/code>: PRs &lt;a href="https://github.com/PyLops/pylops-mpi/pull/141" target="_blank" rel="noopener">#141&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/142" target="_blank" rel="noopener">#142&lt;/a>, &lt;a href="https://github.com/PyLops/pylops-mpi/pull/145" target="_blank" rel="noopener">#145&lt;/a>&lt;/em>&lt;/p>
&lt;p>Most PyLops-MPI users are scientists and engineers working on scientific problems, and most scientific problems involve complex numbers (the Fourier transform touches many things). &lt;em>NCCL does not support complex numbers out of the box&lt;/em>.&lt;/p>
&lt;p>It turned out that adding complex-number support was not a big issue. A complex number is simply a contiguous pair of, say, &lt;code>float64&lt;/code> values: unlike a typical &lt;code>float64&lt;/code>, one &lt;code>complex128&lt;/code> element is represented by two &lt;code>float64&lt;/code> values. Things get more complicated once we start talking about complex-number arithmetic. Luckily, &lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclRedOp_t" target="_blank" rel="noopener">NCCL semantics&lt;/a> only support &lt;em>element-wise&lt;/em> &lt;code>ncclSum&lt;/code>, &lt;code>ncclProd&lt;/code>, &lt;code>ncclMin&lt;/code>, &lt;code>ncclMax&lt;/code>, and &lt;code>ncclAvg&lt;/code>, and wrapping element-wise operations for complex numbers is straightforward.&lt;/p>
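&lt;p>The &amp;ldquo;two floats per complex element&amp;rdquo; view can be checked directly with NumPy (a sketch of the idea, not PyLops-MPI code): an element-wise sum over the float view is identical to the complex sum, which is exactly what an element-wise &lt;code>ncclSum&lt;/code> computes over the underlying buffer.&lt;/p>

```python
import numpy as np

a = np.array([1 + 2j, 3 + 4j], dtype=np.complex128)
b = np.array([5 + 6j, 7 + 8j], dtype=np.complex128)

# A complex128 array is bit-compatible with a float64 array of twice the length:
a_f = a.view(np.float64)   # [1., 2., 3., 4.] -- real/imag interleaved
b_f = b.view(np.float64)

# Element-wise sum on the float view equals the complex sum (what ncclSum sees):
s = (a_f + b_f).view(np.complex128)
assert np.array_equal(s, a + b)

# This is exactly the buffer-size doubling that the helper below accounts for:
assert a.view(np.float64).size == 2 * a.size
```

&lt;p>Note that this works for element-wise additive reductions; a complex product is not element-wise over the interleaved float pairs, so reductions beyond sum/min/max need more care.&lt;/p>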
&lt;p>The change to PyLops-MPI&amp;rsquo;s &lt;code>_nccl.py&lt;/code> itself is minimal: we simply added the function below, which hides the complexity of buffer-size management from users.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">_nccl_buf_size&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">count&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34; Get an appropriate buffer size according to the dtype of buf
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2">    &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">if&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">dtype&lt;/span> &lt;span class="ow">in&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;complex64&amp;#39;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="s1">&amp;#39;complex128&amp;#39;&lt;/span>&lt;span class="p">]:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">return&lt;/span> &lt;span class="mi">2&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="mi">2&lt;/span> &lt;span class="o">*&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">size&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">        &lt;span class="k">return&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">count&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">size&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The concept is quite simple. But mechanically, getting it right in the general case required some extensive bug fixing, particularly in the call to &lt;code>_allgather&lt;/code> as noted earlier in the &amp;ldquo;Core Change&amp;rdquo; section. The array needs some preprocessing (to align with NCCL semantics) and post-processing so that the result of PyLops-MPI&amp;rsquo;s NCCL allgather matches the MPI-based allgather. This is because PyLops-MPI must be able to switch between &lt;code>mpi4py&lt;/code> and NCCL seamlessly from the user&amp;rsquo;s perspective. To make it concrete, here is how we do &lt;code>_allgather()&lt;/code> with NCCL:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="nf">_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="kc">None&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;&amp;#34;&amp;#34;Allgather operation
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="s2"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">deps&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">nccl_enabled&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="nb">isinstance&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="p">(&lt;/span>&lt;span class="nb">tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">list&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">)):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">send_shapes&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="p">(&lt;/span>&lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">_prepare_nccl_allgather_inputs&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">send_buf&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">raw_recv&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">nccl_allgather&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">if&lt;/span> &lt;span class="n">recv_buf&lt;/span> &lt;span class="k">else&lt;/span> &lt;span class="n">padded_recv&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">return&lt;/span> &lt;span class="n">_unroll_nccl_allgather_recv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">raw_recv&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">padded_send&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">send_shapes&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt; snip - MPI allgather &amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>After this feature was added, PyLops-MPI with NCCL has caught up with the original MPI implementation, i.e., the test coverage is now the same: 306 tests passed!&lt;/strong>&lt;/p>
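&lt;p>For intuition about what the preprocessing and post-processing around &lt;code>nccl_allgather&lt;/code> must accomplish, here is a simplified NumPy sketch (my own, assuming flat 1-D buffers; the real &lt;code>_prepare_nccl_allgather_inputs&lt;/code> and &lt;code>_unroll_nccl_allgather_recv&lt;/code> handle more general cases): NCCL&amp;rsquo;s allgather requires every rank to contribute an equally sized buffer, so ragged local arrays are padded to the largest size and the padding is stripped after the collective.&lt;/p>

```python
import numpy as np

def prepare_inputs(send_buf, send_shapes):
    """Pad this rank's buffer to the largest size any rank will send."""
    max_size = max(int(np.prod(s)) for s in send_shapes)
    padded_send = np.zeros(max_size, dtype=send_buf.dtype)
    padded_send[:send_buf.size] = send_buf.ravel()
    # Receive buffer holds one max-size chunk per rank.
    padded_recv = np.zeros(max_size * len(send_shapes), dtype=send_buf.dtype)
    return padded_send, padded_recv

def unroll_recv(raw_recv, padded_shape, send_shapes):
    """Strip the padding from each rank's chunk of the gathered buffer."""
    chunk = int(np.prod(padded_shape))
    out = []
    for rank, shape in enumerate(send_shapes):
        size = int(np.prod(shape))
        out.append(raw_recv[rank * chunk : rank * chunk + size].reshape(shape))
    return out

# Simulate two ranks with different local sizes; concatenation stands in
# for the actual NCCL allgather over the padded buffers.
shapes = [(3,), (2,)]
r0 = np.array([1.0, 2.0, 3.0])
r1 = np.array([4.0, 5.0])
p0, _ = prepare_inputs(r0, shapes)
p1, _ = prepare_inputs(r1, shapes)
gathered = np.concatenate([p0, p1])
out = unroll_recv(gathered, p0.shape, shapes)
assert np.array_equal(out[0], r0) and np.array_equal(out[1], r1)
```

&lt;p>The round-trip recovers exactly what each rank sent, which is what lets the NCCL path return the same result as the &lt;code>mpi4py&lt;/code> path.&lt;/p>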
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-instrumentation-pr-157httpsgithubcompylopspylops-mpipull157">Benchmark Instrumentation &lt;a href="https://github.com/PyLops/pylops-mpi/pull/157" target="_blank" rel="noopener">PR #157&lt;/a>&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>Profiling distributed GPU operations is critical to understanding performance bottlenecks. To make this easier, we added a &lt;em>lightweight benchmark instrumentation&lt;/em> framework in PyLops-MPI. The goal was to allow developers to mark execution points in a function and collect timing information for these markers.&lt;/p>
&lt;p>The core of the implementation is a &lt;code>@benchmark&lt;/code> decorator. Inside a decorated function, developers can call &lt;code>mark(label)&lt;/code> to record the time at specific points. After the function completes, the timings are reported in a human-readable format. This design is inspired by C++-style instrumentation, letting developers place markers directly in the code where they are most informative.&lt;/p>
&lt;p>But because we are in Python, to handle nested function calls we collect the timing information on a stack (a bottom-up call graph) and parse the result at the end of the outermost decorated function. Here is an illustration:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="nd">@benchmark&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">def&lt;/span> &lt;span class="nf">outer_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">):&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func start&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">inner_func_with_mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># &amp;lt;- this does `dot` and is also decorated&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">global_shape&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;global_shape&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;partition&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;dtype&amp;#39;&lt;/span>&lt;span class="p">],&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">par&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="s1">&amp;#39;axis&amp;#39;&lt;/span>&lt;span class="p">])&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dist_arr&lt;/span> &lt;span class="o">+&lt;/span> &lt;span class="n">dist_arr&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mark&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;Outer func ends&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The text output is&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">[decorator]outer_func_with_mark: total runtime: 0.001206 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> [decorator]inner_func_with_mark: total runtime: 0.000351 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin array constructor--&amp;gt;Begin dot: 0.000026 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Begin dot--&amp;gt;Finish dot: 0.000322 s
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Outer func start--&amp;gt;Outer func ends: 0.001202 s
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Benchmarking is controlled via the environment variable &lt;code>BENCH_PYLOPS_MPI&lt;/code>. It defaults to &lt;code>1&lt;/code> (enabled) but can be set to &lt;code>0&lt;/code> to skip benchmarking for clean output. &lt;strong>This means users can leave the decorated code unchanged and disable the benchmark through the environment variable&lt;/strong>, much like a C++ debug flag set at compile time. Moreover, careful attention had to be paid to concurrency issues in the benchmarking, because timestamps are recorded by the CPU while NCCL issues operations asynchronously to a CUDA stream; &lt;a href="https://github.com/PyLops/pylops-mpi/pull/163" target="_blank" rel="noopener">PR #163&lt;/a> is an example of this.&lt;/p>
&lt;ul>
&lt;li>
&lt;h3 id="benchmark-result">Benchmark Result&lt;/h3>
&lt;/li>
&lt;/ul>
&lt;p>This was the moment of truth: our 12 weeks of hard work would be judged by a set of cold, hard numbers. Our expectations were:&lt;/p>
&lt;ul>
&lt;li>If the system does not have proprietary NVLink for GPU-GPU communication but is NCCL-compatible, communication using &lt;code>CuPy + NCCL&lt;/code> should still be faster than &lt;code>NumPy + MPI&lt;/code> (and possibly &lt;code>CuPy + MPI&lt;/code>) in PyLops-MPI, i.e., there should be a benefit from the communication-related optimizations enabled by this project.&lt;/li>
&lt;/ul>
&lt;p>The result below was from the NCSA UIUC Delta system (&lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">4-Way NVIDIA A40 GPU&lt;/a>, no NVLink) with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b139e63d-11ed-47f4-95f8-5e86bed26312" />
&lt;/p>
&lt;p>That meets our expectation. One thing to note here: &lt;code>CuPy + MPI&lt;/code> communication is actually slower than &lt;code>NumPy + MPI&lt;/code>. This is because the current implementation of PyLops-MPI uses the non-buffered calls of &lt;code>mpi4py&lt;/code> - see details &lt;a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html" target="_blank" rel="noopener">here&lt;/a>. The choice was made for its simplicity, as it allows sending and receiving generic Python objects wrapped in a &lt;code>list&lt;/code> and thus enabled a fast development process. However, these calls require copying memory from GPU to CPU, doing the communication, and copying memory back from CPU to GPU (the pickle protocol) - see our discussion with the &lt;code>mpi4py&lt;/code> community &lt;a href="https://github.com/mpi4py/mpi4py/discussions/657" target="_blank" rel="noopener">here&lt;/a>. This leads us to the “Things left to do” section below.&lt;/p>
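&lt;p>The cost of the non-buffered path can be seen even without MPI. As a standalone sketch (not PyLops-MPI code): mpi4py&amp;rsquo;s lowercase methods such as &lt;code>comm.allgather&lt;/code> serialize their arguments with pickle, producing full copies, while the uppercase methods work directly on the underlying buffer with no serialization copy on the Python side.&lt;/p>

```python
import pickle

import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Non-buffered path (lowercase mpi4py methods): the array is serialized with
# pickle, sent as bytes, and deserialized on the other side -- every hop copies.
# (In the GPU case, a device-to-host copy would happen before this step.)
wire = pickle.dumps(x)
y = pickle.loads(wire)
assert np.array_equal(x, y)
assert not np.shares_memory(x, y)   # a full copy was made

# Buffered path (uppercase methods like comm.Allreduce): MPI reads the raw
# buffer directly, so no serialization copy is needed on the Python side.
view = memoryview(x)                # zero-copy view over the same memory
assert np.shares_memory(x, np.frombuffer(view, dtype=np.float64))
```

&lt;p>This is exactly the round-trip that makes &lt;code>CuPy + MPI&lt;/code> slower here, and what moving to buffered calls would eliminate.&lt;/p>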
&lt;ul>
&lt;li>If the system has an NVLink for GPU-GPU communication, we will be able to see a significant gain in performance of PyLops-MPI with NCCL.&lt;/li>
&lt;/ul>
&lt;p>The result below is also from the NCSA UIUC Delta system (&lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/user_guide/architecture.html" target="_blank" rel="noopener">8-Way NVIDIA H200 GPU&lt;/a>, with NVLink), but we only use 4 GPUs to compare with the previous result, again with the &lt;code>allreduce&lt;/code> operation.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/b3d83547-b9af-4b1c-87c0-ace2302eb140" />
&lt;/p>
&lt;p>Here we unleash the true power of NCCL and its infrastructure: &lt;strong>the bandwidth of PyLops-MPI with NCCL is 800x that of the MPI implementation!&lt;/strong> It may not make much sense to compare this number with &lt;code>NumPy+MPI&lt;/code>, because a drastic hardware infrastructure upgrade is involved.&lt;/p>
&lt;p>To top things off, we also ran an experiment trying to saturate the communication, with the array size going up to 32 GB in total. We see linear scaling, i.e., time grows linearly with data size.&lt;/p>
&lt;p align="center">
&lt;img width="400" height="300" alt="image" src="https://gist.github.com/user-attachments/assets/e5a95fdc-8db7-4caf-925f-256f504603bc" />
&lt;/p>
&lt;p>Finally, we ran an experiment with the application of &lt;a href="https://wiki.seg.org/wiki/Least-squares_migration" target="_blank" rel="noopener">Least-squares Migration&lt;/a>, which is an iterative inversion scheme:&lt;/p>
&lt;ul>
&lt;li>Each iteration applies a forward &lt;code>A&lt;/code> and an adjoint &lt;code>A.T&lt;/code> operation to form residuals and gradients.&lt;/li>
&lt;li>A gradient accumulation requires a global reduction across processes with &lt;code>allreduce&lt;/code>.
Note that the computation is not trivial, so the total run-times of the CPU and GPU versions are not fairly comparable (notice that on the H200, CuPy+MPI is no longer the slowest). But we want to give an idea of how the pieces fit together in a real application.&lt;/li>
&lt;/ul>
&lt;div align="center">
&lt;img width="400" height="300" alt="kirchA40"
src="https://gist.github.com/user-attachments/assets/46c3a76a-20a3-40c3-981e-6e1c4acecb49" />
&lt;img width="400" height="300" alt="kirchhoff_h200"
src="https://gist.github.com/user-attachments/assets/1439304a-8f78-4640-a78b-ba37238b26e6" />
&lt;/div>
&lt;h3 id="the-impact-of-this-gsoc-project-is-clear">The impact of this GSoC project is clear:&lt;/h3>
&lt;p>With our NCCL-enabled PyLops-MPI,&lt;/p>
&lt;ul>
&lt;li>if you don&amp;rsquo;t have access to state-of-the-art infrastructure, PyLops-MPI with NCCL can still deliver a 10x improvement in communication bandwidth (A40 case)&lt;/li>
&lt;li>if you do, we allow you to get the most out of the system (H200 case).&lt;/li>
&lt;/ul>
&lt;p>And the best part is that using NCCL with PyLops-MPI requires minimal code changes, as shown in this &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/tutorials_nccl/lsm_nccl.py" target="_blank" rel="noopener">LSM Tutorial&lt;/a> and illustrated below. Only two changes are required relative to code that runs on MPI: the arrays must be allocated on the GPU, and the NCCL communicator has to be passed to the &lt;code>DistributedArray&lt;/code>. And that&amp;rsquo;s it!&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_comm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">utils&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">_nccl&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">initialize_nccl_comm&lt;/span>&lt;span class="p">()&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - same set-up as running with MPI&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">LSM&lt;/span>&lt;span class="p">(&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">wav&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">)),&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># &amp;lt;snip&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cuda&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">dtype&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_srcs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">cp&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">asarray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">lsm&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Demop&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">trav_recs&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">astype&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">np&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">float32&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="c1"># Copy to GPU&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="n">x0&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">DistributedArray&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">VStack&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">shape&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="mi">1&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">pylops_mpi&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">BROADCAST&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="n">nccl_comm&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Explicitly pass nccl communicator&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">engine&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="c1"># Must use CuPy&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># &amp;lt;snip - the rest is the same&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="things-left-to-do">Things left to do&lt;/h3>
&lt;ul>
&lt;li>CUDA-aware MPI: As we pointed out in the A40 experiment, the current implementation of PyLops-MPI uses non-buffered &lt;code>mpi4py&lt;/code> calls and thus introduces memory copies from GPU to CPU. We aim to optimize this by moving to the buffered calls. However, this is not a trivial task: some of the MPI-related code was developed against the semantics of non-buffered communication, which returns a &lt;code>list&lt;/code> object, whereas a buffered call returns an array instead.&lt;/li>
&lt;/ul></description></item><item><title>Wrapping Up KALLM</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250830-dentonjc/</link><pubDate>Wed, 03 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250830-dentonjc/</guid><description>&lt;p>Large language models today look complicated, but if you peel back the layers, most of what you see is old technology: stacks of linear transformations. The Transformer architecture, the engine behind GPTs and their cousins, is often described as revolutionary. Yet the majority of its parameters are standard linear layers, the same kind of matrix multiplications you would find in a simple multilayer perceptron from the 1980s. For years these layers have gone unchallenged. They are fast, they scale, and they work. But maybe the time has come to ask: &lt;em>can we do better than linear?&lt;/em>&lt;/p>
&lt;p>This project explored exactly that. Instead of leaving those layers untouched, we tried replacing them with a more mathematically structured alternative: &lt;strong>Kolmogorov–Arnold Networks (KANs)&lt;/strong>. The result is a working language model—SmolLM2, a 135-million-parameter Transformer—where the final feedforward blocks no longer consist of brute-force linear weights, but of compact polynomial-based functions. And the striking fact is that performance remained within the baseline range. Smaller KANs managed to match larger linear layers, showing that smarter mathematics can stand shoulder to shoulder with the workhorse of deep learning.&lt;/p>
&lt;h2 id="transformers">Transformers&lt;/h2>
&lt;p>To understand the significance, let’s revisit what a Transformer actually is.&lt;br>
A Transformer block has two main components: attention and feedforward. The attention mechanism computes how each word in a sentence relates to every other word. That is the clever part, and it is what made Transformers famous. But once attention finishes its work, the output is passed into a feedforward network. And this feedforward network is essentially two large linear layers, stacked with a nonlinearity between them.&lt;/p>
&lt;p>Now stacking thirty such blocks yields a complete model like SmolLM2. Look at the parameter counts and you see a pattern: attention is not the main consumer. It’s the feedforward layers. They dominate memory and computation, making them the primary target for efficiency gains.&lt;/p>
&lt;h2 id="what-are-kolmogorovarnold-networks">What Are Kolmogorov–Arnold Networks?&lt;/h2>
&lt;p>So what happens if, instead of a giant matrix multiplication, we try something more structured? Enter Kolmogorov–Arnold Networks.&lt;/p>
&lt;p>KANs are built on a mathematical theorem from the mid-20th century, which proved that any multivariate function can be decomposed into sums of univariate functions. Instead of mixing all inputs together at once, you treat each input dimension separately, applying a small nonlinear function, and then recombine. The beauty is that these univariate functions can be simple but expressive—like splines or polynomials—and yet, when summed, they approximate very complex mappings.&lt;/p>
&lt;p>Think of a KAN layer as a set of individual univariate modules. Each one takes a single variable, bends it according to a chosen basis (polynomials, splines, etc.), and then all those bent versions are added up to produce the output. The richness of the final function depends on two factors:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Choice of basis&lt;/strong>: You can bend with Chebyshev polynomials, with Legendre polynomials, with B-splines, or with other families.&lt;/li>
&lt;li>&lt;strong>Degree&lt;/strong>: This is how many bends you allow. A degree-1 polynomial is just a line. Degree-2 can capture curves. Higher degrees capture higher-order oscillatory components.&lt;/li>
&lt;/ul>
&lt;p>A Chebyshev polynomial of the second kind, degree 2, is one such basis. Unlike a simple quadratic, it has roots and oscillations that make it particularly good at spanning function space efficiently. This efficiency explains its favorable performance in our experiments: low degree means fewer parameters, but Chebyshev’s properties let it approximate more than you might expect from so few numbers.&lt;/p>
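&lt;p>To make this concrete, here is a minimal NumPy sketch of a KAN-style layer built on degree-2 Chebyshev polynomials of the second kind. This is an illustrative toy under assumed shapes, not the implementation used in the project; in particular, the &lt;code>tanh&lt;/code> squashing into the Chebyshev domain is an assumption.&lt;/p>

```python
import numpy as np

def cheby2_kan_layer(x, coeffs):
    """Toy KAN-style layer: each input dimension is expanded in Chebyshev
    polynomials of the second kind (U_0, U_1, U_2), weighted by learnable
    coefficients, and summed to form each output dimension."""
    x = np.tanh(x)  # squash into [-1, 1], the natural Chebyshev domain
    degree = coeffs.shape[-1] - 1
    basis = [np.ones_like(x), 2.0 * x]  # U_0 = 1, U_1 = 2x
    for _ in range(2, degree + 1):
        basis.append(2.0 * x * basis[-1] - basis[-2])  # U_k = 2x*U_{k-1} - U_{k-2}
    B = np.stack(basis[: degree + 1], axis=-1)  # shape (batch, d_in, degree+1)
    # out[b, o] = sum over inputs i and basis index k of B[b, i, k] * coeffs[i, o, k]
    return np.einsum("bik,iok->bo", B, coeffs)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # batch of 4, 8 input features
coeffs = rng.normal(size=(8, 16, 3))  # degree 2 means 3 coefficients per input-output pair
y = cheby2_kan_layer(x, coeffs)
print(y.shape)  # (4, 16)
```

&lt;p>Note the per-edge cost in this sketch: each input–output pair carries degree + 1 coefficients, so expressivity is bought through the basis rather than through wider matrices.&lt;/p>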
&lt;h2 id="why-small-can-beat-big">Why Small Can Beat Big&lt;/h2>
&lt;p>Linear layers require many parameters because they treat every input–output mapping as arbitrary. KANs assume smoothness: each input passes through a compact polynomial basis before recombination. This structure captures useful patterns with fewer parameters.&lt;/p>
&lt;p>A degree-2 Chebyshev basis, for example, encodes curvature and oscillation efficiently. While a linear layer of the same size must spend parameters to approximate these effects, the polynomial basis includes them inherently. The result is comparable expressivity with fewer parameters. In language tasks where patterns are often smooth or compositional, this structured efficiency translates into competitive accuracy at lower cost.&lt;/p>
&lt;h2 id="baselines-modifications-and-comparisons">Baselines, Modifications, and Comparisons&lt;/h2>
&lt;p>Here’s what we actually tested, in plain language:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>The untouched baseline&lt;/strong>: a pretrained SmolLM2, with all thirty blocks intact.&lt;/li>
&lt;li>&lt;strong>Linear restart&lt;/strong>: the same pretrained model, but the last five feedforward modules were thrown away and replaced with freshly initialized linear ones. These then had to be trained again.&lt;/li>
&lt;li>&lt;strong>KAN replacement&lt;/strong>: again, take the pretrained model, cut off the last five feedforward modules, and put in new KAN modules instead—specifically, Chebyshev of the second kind, degree 2.&lt;/li>
&lt;/ol>
&lt;p>In all three cases, the backbone of the model—the embeddings, the attention layers, and the first twenty-five blocks—was left untouched. Only the tail was modified. This design allowed us to test transfer learning: would the pretrained parts of the model still play nicely with the new pieces? The answer is yes. The attention layers and other linear projections adapted seamlessly, proving that KANs can be swapped in without destabilizing the whole system.&lt;/p>
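&lt;p>Schematically, the surgery in variants 2 and 3 looks like this (toy objects standing in for real modules; this is not the project’s code):&lt;/p>

```python
# Keep the backbone frozen and swap only the feedforward module in the
# last five of thirty blocks; only the replacements are then trained.
N_BLOCKS, N_REPLACED = 30, 5

blocks = [{"attention": "pretrained", "ffn": "pretrained"} for _ in range(N_BLOCKS)]
for blk in blocks[-N_REPLACED:]:
    blk["ffn"] = "kan_cheby2_deg2"   # freshly initialized KAN (or linear) module

trainable = [i for i, blk in enumerate(blocks) if blk["ffn"] != "pretrained"]
print(trainable)  # [25, 26, 27, 28, 29]
```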
&lt;p>Training was done on the &lt;strong>smol-smoltalk&lt;/strong> dataset, a small-scale dialogue corpus used for both pretraining and fine-tuning. After training, all models were evaluated on the same subset of &lt;strong>BIG-Bench Hard&lt;/strong> tasks.&lt;/p>
&lt;!--
## How Evaluation Works
The evaluation was not a casual check; it mirrored the structure of the Open LLM Leaderboard, but focused only on the core tasks relevant to dialogue and reasoning (since our training data did not include STEM problems, for example). Each task consists of prompts with clear expected answers: multiple choice, classification, or reasoning questions.
To evaluate, each model is placed in a simulated conversation. The prompt is fed in, the model generates a response, and that response is parsed. If the task is multiple choice, the evaluation extracts which option the model chose, checking against the gold label. If the task is free-form, the answer is normalized—lowercased, punctuation stripped, articles removed—so that small differences in formatting do not count as mistakes. The comparison is then exact. Each task contributes an accuracy score, and the final performance is the average over all selected tasks.
This method ensures consistency. Models cannot bluff with verbose text. They must produce the right short answer. That makes the results directly comparable, and it stresses the reasoning capabilities rather than the stylistic flourishes of the model.
-->
&lt;h2 id="results">Results&lt;/h2>
&lt;p>The baseline was the pretrained SmolLM2 without modification. It achieved an average accuracy of 22.5%, using 134M parameters. This experiment has a single measurement because no training was applied. The remaining experiments were each run with 3 random seeds.&lt;/p>
&lt;p>When retrained with linear replacements, the model reached an average accuracy of 43.8%, with 46M trainable parameters (only the last 5 blocks are active) and 5.87 GB of total VRAM usage.&lt;/p>
&lt;p>Replacing the last five feedforward blocks with Kolmogorov–Arnold Networks produced an average accuracy of 44.1%, with 39M parameters and 5.86 GB VRAM usage. The memory consumption of KAN layers is a subject that requires further optimization.&lt;/p>
&lt;p>In short, KANs matched or slightly exceeded the reinitialized linear baseline in accuracy, while using fewer parameters and slightly less memory. This demonstrates that structured polynomial layers can substitute for large linear layers without degrading reasoning performance.&lt;/p>
&lt;h2 id="why-transfer-learning-works-so-well">Why Transfer Learning Works So Well&lt;/h2>
&lt;p>One of the surprising outcomes is how cleanly the pretrained Transformer integrates with KANs. Remember: only the feedforward modules in the last five blocks were replaced. All the other linear layers—embedding projections, attention queries, keys, and values, output heads—remained untouched. They continue to function as before. The new KAN blocks slot right in, adapt during training, and the system as a whole behaves coherently.&lt;/p>
&lt;p>That tells us something important. The standard Transformer does not depend on linearity per se in those positions. What it depends on is a nonlinear transformation with enough expressive power. KANs provide that power, just in a different mathematical form. Which means: any pretrained Transformer can, in principle, be retrofitted with KANs in the feedforward slots, with no need to start from scratch.&lt;/p>
&lt;h2 id="looking-ahead-mixing-polynomial-bases">Looking Ahead: Mixing Polynomial Bases&lt;/h2>
&lt;p>So far we only tested one family, Chebyshev-2. But the architecture is more general. Each KAN block can in fact host multiple polynomial families in &lt;strong>parallel&lt;/strong>, or stack them in &lt;strong>sequence&lt;/strong>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Parallel&lt;/strong>: imagine splitting the input across several channels, each processed by a different basis. The outputs are then recombined. This way, one basis covers the smooth global structure, while another captures edge effects or oscillations.&lt;/li>
&lt;li>&lt;strong>Sequential&lt;/strong>: here, the output of one polynomial transformation becomes the input of another. You can think of it as layering function approximations, where the second basis corrects the limitations of the first. For example, a spline might give you piecewise smoothness, then a Chebyshev layer on top could adjust the global shape.&lt;/li>
&lt;/ul>
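&lt;p>A toy sketch of the two strategies, using NumPy feature expansions as stand-ins for trained KAN modules (the family choices here are illustrative, and first-kind Chebyshev features are used for brevity):&lt;/p>

```python
import numpy as np

def cheby_feats(x, degree):
    # Chebyshev (first kind) features via T_k(x) = cos(k * arccos(x)), x in [-1, 1]
    theta = np.arccos(np.clip(x, -1.0, 1.0))
    return np.stack([np.cos(k * theta) for k in range(degree + 1)], axis=-1)

def monomial_feats(x, degree):
    # plain polynomial features as a stand-in for a second family (e.g. splines)
    return np.stack([x**k for k in range(degree + 1)], axis=-1)

x = np.linspace(-1, 1, 5)

# Parallel: each family processes the input, and the outputs are recombined
parallel = np.concatenate([cheby_feats(x, 2), monomial_feats(x, 2)], axis=-1)

# Sequential: the output of one polynomial transformation feeds the next family
h = np.tanh(monomial_feats(x, 2).sum(axis=-1))  # collapse, then re-expand
sequential = cheby_feats(h, 2)

print(parallel.shape, sequential.shape)  # (5, 6) (5, 3)
```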
&lt;p>Both strategies were implemented and promise to extract more expressivity per parameter. Instead of simply making the networks bigger, we can make them &lt;em>smarter&lt;/em>, combining the strengths of different mathematical families. That will be the focus of future work.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>The main lesson is this: language models do not need to be built entirely from massive linear matrices. By replacing just a handful of those matrices with compact Kolmogorov–Arnold modules, we achieved the same reasoning accuracy with fewer parameters and less memory. Transfer learning works cleanly. The architecture adapts. And the door is now open to rethink what belongs inside a Transformer block.&lt;/p>
&lt;p>KANs are not just a theoretical curiosity. They are practical, efficient, and compatible with modern large language models. This project showed that replacing linear with polynomial is not only possible, it is competitive. The next step is to push combinations, explore scaling, and see just how far this mathematical alternative can take us.&lt;/p>
&lt;!--
## Bonus 1: Testing KANs as LoRA Replacement (Negative result)
We wanted to test if a degree-2 Chebyshev KAN could replace LoRA’s A and B or act as a nonlinear adapter and still match or beat LoRA at the same or smaller rank. In LoRA, A and B are the two small matrices that make the low-rank update: A maps input features down to a rank-r space, B maps that rank-r space back to output features, and their product BA is a constant ΔW that you add to the frozen weight and can merge at inference. We compared five variants at rank 2 and tracked best accuracy: standard LoRA on the final layer (lora_fc 32.03%), a mergeable KAN that generates A and B from Chebyshev bases over indices (11.69%), a linear low-rank conv adapter (81.86%), the same conv adapter with a SiLU in between (84.80%), and a KAN conv adapter (10.76%). The mergeable KAN lost to standard LoRA, and the KAN conv adapter lagged well behind both the linear and the SiLU controls. Conclusion: in this setup KAN did not deliver LoRA-level accuracy at smaller rank.
## Bonus 2 KAN Layers with LoRA for Domain Transfer (Negative result)
We tested if KANs (Cheby2, degree-2) adapt with LoRA as well as a plain linear MLP when moving from CIFAR-10 to CIFAR-100. We first trained both models on CIFAR-10, then froze them, added LoRA (rank 8) on the hidden layers and a new classifier, and matched the number of trainable parameters (~96k) for a fair test. Both models improved on CIFAR-100, but the linear MLP learned faster and reached higher accuracy (18.02% vs 23.10%). So, in this setup LoRA is less effective for KAN layers than for linear layers.
--></description></item><item><title>Final Report: MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250901-rohan-babbar/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250901-rohan-babbar/</guid><description>&lt;p>Hi Everyone, This is my final report for the project I completed during my summer as a &lt;a href="https://ucsc-ospo.github.io/sor/" target="_blank" rel="noopener">Summer of Reproducibility (SOR)&lt;/a> student.
The project, titled &amp;ldquo;&lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">MPI Appliance for HPC Research in Chameleon&lt;/a>,&amp;rdquo; was undertaken in collaboration with Argonne National Laboratory
and the Chameleon Cloud community. The project was mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ken-raffenetti/">Ken Raffenetti&lt;/a> and was completed over the summer.
This blog details the work and outcomes of the project.&lt;/p>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>Message Passing Interface (MPI) is the backbone of high-performance computing (HPC), enabling efficient scaling across thousands of
processing cores. However, reproducing MPI-based experiments remains challenging due to dependencies on specific library versions,
network configurations, and multi-node setups.&lt;/p>
&lt;p>To address this, we introduce a reproducibility initiative that provides standardized MPI environments on the Chameleon testbed.
This is set up as a master–worker MPI cluster. The master node manages tasks and communication, while the worker nodes do the computations.
All nodes have the same MPI libraries, software, and network settings, making experiments easier to scale and reproduce.&lt;/p>
&lt;h2 id="objectives">Objectives&lt;/h2>
&lt;p>The aim of this project is to create an MPI cluster that is reproducible, easily deployable, and efficiently configurable.&lt;/p>
&lt;p>The key objectives of this project were:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Pre-built MPI Images: Create ready-to-use images with MPI and all dependencies installed.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Automated Cluster Configuration: Develop Ansible playbooks to configure master–worker communication, including host setup, SSH key distribution, and MPI configuration across nodes.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Cluster Orchestration: Develop an orchestration template to provision resources and invoke Ansible playbooks for automated cluster setup.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="implementation-strategy-and-deliverables">Implementation Strategy and Deliverables&lt;/h2>
&lt;h3 id="openstack-image-creation">Openstack Image Creation&lt;/h3>
&lt;p>The first step was to create a standardized pre-built image, which serves as the base image for all nodes in the cluster.&lt;/p>
&lt;p>Some important features of the image include:&lt;/p>
&lt;ol>
&lt;li>Built on Ubuntu 22.04 for a stable base environment.&lt;/li>
&lt;li>&lt;a href="https://spack.io/" target="_blank" rel="noopener">Spack&lt;/a> + Lmod integration:
&lt;ul>
&lt;li>Spack handles reproducible, version-controlled installations of software packages.&lt;/li>
&lt;li>Lmod (Lua Modules) provides a user-friendly way to load/unload software environments dynamically.&lt;/li>
&lt;li>Together, they allow users to easily switch between MPI versions, libraries, and GPU toolkits.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/pmodels/mpich" target="_blank" rel="noopener">MPICH&lt;/a> and &lt;a href="https://github.com/open-mpi/ompi" target="_blank" rel="noopener">OpenMPI&lt;/a> pre-installed for standard MPI support and can be loaded/unloaded.&lt;/li>
&lt;li>Three image variants for various HPC workloads: CPU-only, NVIDIA GPU (CUDA 12.8), and AMD GPU (ROCm 6.4.2).&lt;/li>
&lt;/ol>
&lt;p>These images have been published and are available in the Chameleon Cloud Appliance Catalog:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://chameleoncloud.org/appliances/127/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04)&lt;/a> - CPU Only&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/appliances/130/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - CUDA)&lt;/a> - NVIDIA GPU (CUDA 12.8)&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/appliances/131/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - ROCm)&lt;/a> - AMD GPU (ROCm 6.4.2)&lt;/li>
&lt;/ul>
&lt;h3 id="cluster-configuration-using-ansible">Cluster Configuration using Ansible&lt;/h3>
&lt;p>The next step was to create scripts/playbooks to configure these nodes and set up an HPC cluster.
We assigned specific roles to different nodes in the cluster and combined them into a single playbook to configure the entire cluster automatically.&lt;/p>
&lt;p>Some key steps the playbook performs:&lt;/p>
&lt;ol>
&lt;li>Configure /etc/hosts entries for all nodes.&lt;/li>
&lt;li>Mount Manila NFS shares on each node.&lt;/li>
&lt;li>Generate an SSH key pair on the master node and add the master’s public key to the workers’ authorized_keys.&lt;/li>
&lt;li>Scan worker node keys and update known_hosts on the master.&lt;/li>
&lt;li>(Optional) Manage software:
&lt;ul>
&lt;li>Install new compilers with Spack&lt;/li>
&lt;li>Add new Spack packages&lt;/li>
&lt;li>Update environment modules to recognize them&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Create a hostfile at /etc/mpi/hostfile.&lt;/li>
&lt;/ol>
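&lt;p>For a flavor of what such playbooks look like, here is an illustrative fragment covering steps 1 and 3; the task names, variables, and module choices are assumptions, not the project’s actual roles:&lt;/p>

```yaml
# Illustrative fragment only; see the repository for the real playbooks.
- hosts: all
  become: true
  tasks:
    - name: Add every cluster node to /etc/hosts
      ansible.builtin.lineinfile:
        path: /etc/hosts
        line: "{{ hostvars[item].ansible_host }} {{ item }}"
      loop: "{{ groups['all'] }}"

- hosts: workers
  tasks:
    - name: Authorize the master's SSH public key on each worker
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"
        key: "{{ hostvars[groups['master'][0]].master_pubkey }}"
```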
&lt;p>The code is publicly available and can be found on the GitHub repository: &lt;a href="https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact" target="_blank" rel="noopener">https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact&lt;/a>&lt;/p>
&lt;h3 id="orchestration">Orchestration&lt;/h3>
&lt;p>With the image now created and deployed, and the Ansible scripts ready for cluster configuration, we put everything
together to orchestrate the cluster deployment.&lt;/p>
&lt;p>This can be done in two primary ways:&lt;/p>
&lt;h4 id="python-chijupyter--ansible">Python CHI(Jupyter) + Ansible&lt;/h4>
&lt;p>&lt;a href="https://github.com/ChameleonCloud/python-chi" target="_blank" rel="noopener">Python-CHI&lt;/a> is a python library designed to facilitate interaction with the Chameleon testbed. Often used within environments like Jupyter notebooks.&lt;/p>
&lt;p>The setup works as follows:&lt;/p>
&lt;ol>
&lt;li>Create leases, launch instances, and set up shared storage using python-chi commands.&lt;/li>
&lt;li>Automatically generate inventory.ini for Ansible based on launched instances.&lt;/li>
&lt;li>Run Ansible playbook programmatically using &lt;code>ansible_runner&lt;/code>.&lt;/li>
&lt;li>Outcome: fully configured, ready-to-use HPC cluster; SSH into master to run examples.&lt;/li>
&lt;/ol>
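&lt;p>A minimal sketch of step 2, generating the inventory from launched instances (the IPs, group names, and helper function are placeholders, not part of python-chi):&lt;/p>

```python
# Turn the instances launched via python-chi into an Ansible inventory.
# Step 3 would then invoke the playbook programmatically, e.g.:
#   ansible_runner.run(private_data_dir=".", playbook="cluster.yml",
#                      inventory="inventory.ini")

def make_inventory(master_ip, worker_ips):
    lines = ["[master]", f"master ansible_host={master_ip}", "", "[workers]"]
    lines += [f"worker{i} ansible_host={ip}" for i, ip in enumerate(worker_ips, 1)]
    return "\n".join(lines) + "\n"

print(make_inventory("10.52.0.10", ["10.52.0.11", "10.52.0.12"]))
```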
&lt;p>If you would like to see a working example, you can view it in the &lt;a href="https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17" target="_blank" rel="noopener">Trovi example&lt;/a>&lt;/p>
&lt;h4 id="heat-orchestration-template">Heat Orchestration Template&lt;/h4>
&lt;p>A Heat Orchestration Template (HOT) is a YAML-based configuration file. Its purpose is to define a stack that automates
the deployment and configuration of OpenStack cloud resources.&lt;/p>
&lt;p>&lt;strong>Challenges&lt;/strong>&lt;/p>
&lt;p>We faced some challenges while working with Heat templates and stacks, particularly on Chameleon Cloud:&lt;/p>
&lt;ol>
&lt;li>&lt;code>OS::Nova::Keypair&lt;/code>(new version): In the latest OpenStack version, the stack fails to launch if the &lt;code>public_key&lt;/code> parameter is not provided for the keypair,
as auto-generation is no longer supported.&lt;/li>
&lt;li>&lt;code>OS::Heat::SoftwareConfig&lt;/code>: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heat Approach" srcset="
/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_05fca9fb65271d31e3fd79f2e7b58a53.webp 400w,
/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_19399eb0dbf598de84852723f8d60783.webp 760w,
/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250901-rohan-babbar/heatapproach_hua2bf48ad20dec386c348c909fcaf7111_39548_05fca9fb65271d31e3fd79f2e7b58a53.webp"
width="760"
height="235"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To tackle these challenges, we designed an approach that is both easy to implement and reproducible. First, we launch instances
by provisioning master and worker nodes using the HOT template in OpenStack. Next, we set up a bootstrap node, install Git and Ansible,
and run an Ansible playbook from the bootstrap node to configure the master and worker nodes, including SSH, host communication, and
MPI setup. The outcome is a fully configured, ready-to-use HPC cluster, where users can simply SSH into the master node to run examples.&lt;/p>
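&lt;p>For orientation, a stripped-down HOT sketch of the master/worker provisioning step might look as follows; resource names and image/flavor values are illustrative, and Chameleon-specific details such as reservation scheduler hints are omitted:&lt;/p>

```yaml
heat_template_version: 2018-08-31
description: Minimal sketch of the master/worker layout (illustrative only)

parameters:
  key_name:        # passed in explicitly, since keypair auto-generation is gone
    type: string
  worker_count:
    type: number
    default: 2

resources:
  master:
    type: OS::Nova::Server
    properties:
      image: MPI and Spack for HPC (Ubuntu 22.04)
      flavor: baremetal
      key_name: { get_param: key_name }

  workers:
    type: OS::Heat::ResourceGroup
    properties:
      count: { get_param: worker_count }
      resource_def:
        type: OS::Nova::Server
        properties:
          image: MPI and Spack for HPC (Ubuntu 22.04)
          flavor: baremetal
          key_name: { get_param: key_name }
```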
&lt;p>Users can view/use the template published in the Appliance Catalog: &lt;a href="https://chameleoncloud.org/appliances/132/" target="_blank" rel="noopener">MPI+Spack Bare Metal Cluster&lt;/a>.
For example, a demonstration of how to pass parameters is available on &lt;a href="https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17" target="_blank" rel="noopener">Trovi&lt;/a>.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>In conclusion, this work demonstrates a reproducible approach to building and configuring MPI clusters on the Chameleon testbed. By using standardized images,
Ansible automation, and Orchestration Templates, we ensure that every node is consistently set up, reducing manual effort and errors. The artifact, published on Trovi,
makes the entire process transparent, reusable, and easy to implement, enabling users/researchers to reliably recreate and extend the cluster environment for their own
experiments.&lt;/p>
&lt;h2 id="future-work">Future Work&lt;/h2>
&lt;p>Future work includes maintaining these images and possibly creating a script to reproduce the MPI and Spack setup on a different base image environment.&lt;/p></description></item><item><title>Final Update(Mid-Term -> Final): MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250831-rohan-babbar/</link><pubDate>Sun, 31 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250831-rohan-babbar/</guid><description>&lt;p>Hi everyone! This is my final update, covering the progress made every two weeks from the midterm to the end of the
project &lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">MPI Appliance for HPC Research on Chameleon&lt;/a>, developed
in collaboration with Argonne National Laboratory and the Chameleon Cloud community.
This blog follows up on my earlier post, which you can find &lt;a href="https://ucsc-ospo.github.io/report/osre25/uchicago/mpi/20250803-rohan-babbar/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h3 id="-july-29--august-11-2025">🔧 July 29 – August 11, 2025&lt;/h3>
&lt;p>With the CUDA- and MPI-Spack–based appliances published, we considered releasing another image variant (ROCm-based) for AMD GPUs,
to be used primarily at CHI@TACC, which provides AMD GPU nodes. We have successfully published a new image on Chameleon titled &lt;a href="https://chameleoncloud.org/appliances/131/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - ROCm)&lt;/a>,
and we also added an example to demonstrate its usage.&lt;/p>
&lt;h3 id="-august-12--august-25-2025">🔧 August 12 – August 25, 2025&lt;/h3>
&lt;p>With the examples now available on Trovi for creating an MPI cluster using Ansible and Python-CHI, my next step was to experiment with stack orchestration using Heat Orchestration Templates (HOT) on OpenStack Chameleon Cloud.
This turned out to be more challenging due to a few restrictions:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>OS::Nova::Keypair (new version)&lt;/strong>: In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.&lt;/li>
&lt;li>&lt;strong>OS::Heat::SoftwareConfig&lt;/strong>: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.&lt;/li>
&lt;/ol>
&lt;p>To address these issues, we adopted a new strategy for configuring and creating the MPI cluster: using a temporary bootstrap node.&lt;/p>
&lt;p>In simple terms, the workflow of the Heat template is:&lt;/p>
&lt;ol>
&lt;li>Provision master and worker nodes via the HOT template on OpenStack.&lt;/li>
&lt;li>Launch a bootstrap node, install Git and Ansible on it, and then run an Ansible playbook from the bootstrap node to configure the master and worker nodes. This includes setting up SSH, host communication, and the MPI environment.&lt;/li>
&lt;/ol>
&lt;p>This provides an alternative method for creating an MPI cluster.&lt;/p>
&lt;p>We presented this work on August 26, 2025, to the Chameleon Team and the Argonne MPICH Team. The project was very well received.&lt;/p>
&lt;p>Stay tuned for my final report on this work, which I’ll be sharing in my next blog post.&lt;/p></description></item><item><title>[Final Blog] Distrobench: Distributed Protocol Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250830-panjisri/</link><pubDate>Sat, 30 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250830-panjisri/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This is the final blog for our contribution to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fadhil-kurnia/">Fadhil Kurnia&lt;/a> for the OSRE program.&lt;/p>
&lt;p>&lt;a href="https://github.com/fadhilkurnia/distro" target="_blank" rel="noopener">Distrobench&lt;/a> is a framework to evaluate the performance of replication/coordination protocols for distributed systems. This framework standardizes benchmarking by allowing different protocols to be tested under an identical workload, and supports both local and remote deployment of the protocols. The frameworks tested are restricted under a key-value store application and are categorized under different &lt;a href="https://jepsen.io/consistency/models" target="_blank" rel="noopener">consistency models&lt;/a>, programming languages, and persistency (whether the framework stores its data in-memory or on-disk).&lt;/p>
&lt;p>All the benchmark results are stored in a &lt;code>data.json&lt;/code> file which can be viewed through a webpage we have provided. A user can clone the git repository, benchmark different protocols on their own machine or in a cluster of remote machines, then view the results locally. We also provided a &lt;a href="https://distrobench.org" target="_blank" rel="noopener">webpage&lt;/a> that shows our own benchmark results which ran on 3 Amazon EC2 t2.micro instances.&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_2eb41220c4287bdc730b38c76a5643f8.webp 400w,
/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_789a9a55850eed73f3a681f8423873cf.webp 760w,
/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_2eb41220c4287bdc730b38c76a5643f8.webp"
width="760"
height="381"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="how-to-run-a-benchmark-on-distrobench">How to run a benchmark on Distrobench&lt;/h2>
&lt;p>Before running a benchmark with Distrobench, the protocol to be benchmarked must first be built. This allows the script to initialize the protocol instance for a local benchmark, or to send the binaries to the remote machines. A remote machine running the protocol does not need to store the code for the protocol implementation, but it does require the dependencies for running that specific protocol, such as Java, Docker, or rsync. The following commands build the &lt;a href="https://github.com/ailidani/paxi" target="_blank" rel="noopener">ailidani/paxi&lt;/a> project, which does not need any additional dependencies to run on a remote machine:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Clone the Distrobench repository &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">git clone git@github.com:fadhilkurnia/distro.git
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Clone the Paxi repository and build the binary &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> distro/sut/ailidani.paxi
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">git clone git@github.com:ailidani/paxi.git
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> paxi/bin/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">./build.sh
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Go back to the Distrobench root directory &amp;amp; run python script &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> ../../../..
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">python main.py
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>By default, the script starts 3 local instances of the Paxi protocol implementation that the user chooses through the CLI. The user can change the number of running instances, and whether they are deployed locally or on remote machines, by editing the &lt;code>.env&lt;/code> file in the root directory. The following is the content of the default &lt;code>.env&lt;/code> file:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">NUM_OF_NODES=3
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">SSH_KEY=ssh-key.pem
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">REMOTE_USERNAME=ubuntu
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PUBLIC_IP1=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PUBLIC_IP2=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PUBLIC_IP3=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PRIVATE_IP1=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PRIVATE_IP2=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PRIVATE_IP3=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">CLIENT_IP=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">OUTPUT=data.json
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When running a remote benchmark, an SSH key should also be placed in the root directory so that the Python script can use ssh and rsync. All machines must also allow TCP connections on ports 2000-2300 and 3000-3300, since these ranges are used for communication between the running instances and for the YCSB benchmark. Running a benchmark requires at least 3 nodes, the minimum needed to support most protocols (5 nodes are recommended).&lt;/p>
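&lt;p>Before starting a remote run, it can save time to verify that the required ports are reachable. The helper below is a hypothetical pre-flight check, not part of Distrobench; the port ranges are the ones listed above.&lt;/p>

```python
import socket

# Hypothetical pre-flight check: probe a set of TCP ports on a host and
# return the ones that are unreachable. Distrobench itself does not ship
# this helper; the 2000-2300 / 3000-3300 ranges come from the text above.
def check_ports(host, ports, timeout=1.0):
    unreachable = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.settimeout(timeout)
            if s.connect_ex((host, port)) != 0:
                unreachable.append(port)
    return unreachable
```

&lt;p>For example, running &lt;code>check_ports(ip, [2000, 3000])&lt;/code> against each node before a benchmark surfaces firewall problems early.&lt;/p>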
&lt;p>To view the benchmark result in the web page locally, move &lt;code>data.json&lt;/code> into the &lt;code>docs/&lt;/code> directory and run &lt;code>python -m http.server 8000&lt;/code>. The page is then accessible through &lt;code>http://localhost:8000&lt;/code>.&lt;/p>
&lt;h2 id="deep-dive-on-how-distrobench-works">Deep dive on how Distrobench works&lt;/h2>
&lt;p>The following is the project structure of the Distrobench repository:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">distro/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── main.py // Main python script for running benchmark
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── data.json // Output file for main.py
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── README.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── .env // Config for running the benchmark
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── docs/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── index.html // Web page to show benchmark results
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── data.json // Output file displayed by web page
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ └── README.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── src/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── utils/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ └── ycsb/ // Submodule for YCSB
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">└── sut/ // Systems under test
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── ailidani.paxi/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> │ └── run.py // Protocol-specific benchmark script called by main.py
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── apache.zookeeper/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── etcd-io.etcd/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── fadhilkurnia.xdn/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── holipaxos-artifect.holipaxos/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── otoolep.hraftd/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> └── tikv.tikv/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>main.py&lt;/code> automatically detects the directories inside &lt;code>sut/&lt;/code> and calls the main function inside each &lt;code>run.py&lt;/code>. The following is the structure of &lt;code>run.py&lt;/code> in pseudocode:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">FUNCTION main(run_ycsb: Function, nodes: List of Nodes, ssh: Dictionary)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> node_data = map_ip_port(nodes)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    SWITCH user_input
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> CASE 0:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> start()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> RETURN
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> CASE 1:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> stop()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> RETURN
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> CASE 2:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> client_data = []
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> FOR EACH item IN node_data
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ADD item.client_addr TO client_data
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> END FOR
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> run_ycsb(client_data)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> RETURN
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> END SWITCH
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">FUNCTION start()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Start the protocol instance (local or remote)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">FUNCTION stop()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Stop the protocol instance (local or remote)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">FUNCTION map_ip_port(nodes: List of Nodes) -&amp;gt; List of Dictionary
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Generate port numbers based on the protocol requirements
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>.env&lt;/code> file provides both public and private IP addresses for versatility when running a remote benchmark. The private IP is used for communication between remote machines when they are in the same network group. In our own benchmark, four t2.micro EC2 instances were deployed in the same network group: three ran the protocol and the fourth acted as the YCSB client. It is also possible to use your local machine as the YCSB client, instead of another remote machine, by setting &lt;code>CLIENT_IP&lt;/code> in the &lt;code>.env&lt;/code> file to &lt;code>127.0.0.1&lt;/code>. Using a remote machine as the YCSB client keeps the impact of network latency between the client and the protocol servers to a minimum.&lt;/p>
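&lt;p>The automatic detection of protocol directories mentioned earlier (&lt;code>main.py&lt;/code> scanning &lt;code>sut/&lt;/code> for &lt;code>run.py&lt;/code> files) can be sketched in Python. This is a simplified illustration, not the actual Distrobench code.&lt;/p>

```python
import importlib.util
from pathlib import Path

# Simplified sketch of how a dispatcher like main.py could discover
# protocol directories under sut/ and load each one's run.py module.
# The real Distrobench script may differ in details.
def discover_protocols(sut_dir="sut"):
    protocols = {}
    for run_py in sorted(Path(sut_dir).glob("*/run.py")):
        name = run_py.parent.name              # e.g. "ailidani.paxi"
        spec = importlib.util.spec_from_file_location(name, run_py)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)        # module must define main(...)
        protocols[name] = module
    return protocols
```

&lt;p>The dispatcher can then present the discovered names in the CLI menu and call the chosen module&amp;rsquo;s &lt;code>main()&lt;/code>.&lt;/p>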
&lt;p>The main tasks of the &lt;code>start()&lt;/code> function can be broken down into the following:&lt;/p>
&lt;ol>
&lt;li>Generate custom configuration files for each remote machine instance (this may differ between implementations: some do not require a config file because they support flag parameters out of the box, while others require multiple configuration files per instance)&lt;/li>
&lt;li>rsync binaries into the remote machine (If running a remote benchmark)&lt;/li>
&lt;li>Start the instances&lt;/li>
&lt;/ol>
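&lt;p>For a remote benchmark, the three steps above could be sketched as a dry-run command builder. All helper names, remote paths, and flags below are illustrative assumptions rather than the actual Distrobench implementation.&lt;/p>

```python
# Illustrative dry-run builder for the three start() steps described above.
# Helper names, remote paths, and flags are assumptions, not Distrobench code.
def build_start_commands(nodes, ssh, binary_dir):
    commands = []
    for i, node in enumerate(nodes):
        # Step 1: a per-instance config file (format varies by protocol).
        config = f"config_{i}.json"
        # Step 2: ship binaries and config to the remote machine with rsync.
        dest = f"{ssh['user']}@{node['public_ip']}:bench/"
        commands.append(["rsync", "-az", "-e", f"ssh -i {ssh['key']}",
                         binary_dir, config, dest])
        # Step 3: start the instance over SSH.
        commands.append(["ssh", "-i", ssh["key"],
                         f"{ssh['user']}@{node['public_ip']}",
                         f"nohup bench/server -config {config}"])
    return commands
```

&lt;p>Returning the command lists instead of executing them mirrors the kind of dry-run behavior that makes such scripts easy to test.&lt;/p>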
&lt;p>The &lt;code>stop()&lt;/code> function is much simpler: it only kills the process running the protocol and optionally removes the copied binary files from the remote machine. The &lt;code>run_ycsb()&lt;/code> function passed to &lt;code>run.py&lt;/code> is defined in &lt;code>main.py&lt;/code> and currently supports two types of workload:&lt;/p>
&lt;ol>
&lt;li>Read-heavy: A single-client workload with 95% read and 5% update (write) operations&lt;/li>
&lt;li>Update-heavy: A single-client workload with 50% read and 50% update (write) operations&lt;/li>
&lt;/ol>
&lt;p>A new workload can be added inside the &lt;code>src/ycsb/workloads&lt;/code> directory. Both workloads above run only 1000 operations, which may not be enough to properly evaluate the performance of the protocols. It should also be noted that while YCSB supports a &lt;code>scan&lt;/code> operation, it is never used in our benchmark because none of the tested protocols implement it.&lt;/p>
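&lt;p>For reference, a read-heavy workload file in &lt;code>src/ycsb/workloads&lt;/code> would follow YCSB&amp;rsquo;s core-workload property format, roughly like the sketch below; the exact values in Distrobench may differ.&lt;/p>

```
# Hypothetical read-heavy workload (95% reads, 5% updates, 1000 operations),
# following YCSB core-workload conventions; Distrobench's file may differ.
workload=site.ycsb.workloads.CoreWorkload
recordcount=1000
operationcount=1000
readproportion=0.95
updateproportion=0.05
requestdistribution=zipfian
```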
&lt;h3 id="how-to-implement-a-new-protocol-in-distrobench">How to implement a new protocol in Distrobench&lt;/h3>
&lt;p>Adding a new protocol to Distrobench requires implementing two main components: a Python integration script (&lt;code>run.py&lt;/code>) and a YCSB database binding for benchmarking.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Create the protocol directory structure&lt;/p>
&lt;ul>
&lt;li>Create a new directory under &lt;code>sut/&lt;/code> using the format &lt;code>yourrepo.yourprotocol/&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Write &lt;code>run.py&lt;/code> integration&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Put the script inside the &lt;code>yourrepo.yourprotocol/&lt;/code> directory&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Must have the &lt;code>main(run_ycsb, nodes, ssh)&lt;/code> function.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add start/stop/benchmark menu options&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Handle local (127.0.0.1) and remote deployment&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Create YCSB client&lt;/p>
&lt;ul>
&lt;li>Make Java class extending YCSB&amp;rsquo;s DB class&lt;/li>
&lt;li>Put inside &lt;code>src/ycsb/yourprotocol/src/main/java/site/ycsb/yourprotocol&lt;/code>&lt;/li>
&lt;li>Implement &lt;code>read()&lt;/code>, &lt;code>insert()&lt;/code>, &lt;code>update()&lt;/code>, &lt;code>delete()&lt;/code> methods&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Register your client&lt;/p>
&lt;ul>
&lt;li>Register your client to &lt;code>src/pom.xml&lt;/code>, &lt;code>src/ycsb/bin/binding.properties&lt;/code>, and &lt;code>src/ycsb/bin/ycsb&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Build and test&lt;/p>
&lt;ul>
&lt;li>Run &lt;code>cd src/ycsb &amp;amp;&amp;amp; mvn clean package&lt;/code>&lt;/li>
&lt;li>Run python &lt;code>main.py&lt;/code>&lt;/li>
&lt;li>Select your protocol and test it&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
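&lt;p>Putting the steps together, a minimal &lt;code>run.py&lt;/code> skeleton might look like the following. The port scheme and helper bodies are hypothetical placeholders; consult an existing &lt;code>sut/&lt;/code> directory for the real conventions.&lt;/p>

```python
# Hypothetical minimal run.py for a new protocol under sut/; the port
# scheme and printed actions are placeholders, not real Distrobench code.

def map_ip_port(nodes):
    # Assign each node the ports the protocol expects (illustrative scheme
    # within the 2000-2300 / 3000-3300 ranges mentioned earlier).
    return [{"addr": n["private_ip"], "port": 2000 + i,
             "client_addr": f"{n['public_ip']}:{3000 + i}"}
            for i, n in enumerate(nodes)]

def start(node_data, ssh):
    print("starting instances:", node_data)   # launch locally or over SSH

def stop(node_data, ssh):
    print("stopping instances")               # kill processes, clean up binaries

def main(run_ycsb, nodes, ssh):
    node_data = map_ip_port(nodes)
    choice = input("0) start  1) stop  2) benchmark: ").strip()
    if choice == "0":
        start(node_data, ssh)
    elif choice == "1":
        stop(node_data, ssh)
    elif choice == "2":
        run_ycsb([d["client_addr"] for d in node_data])
```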
&lt;h2 id="protocols-which-have-been-tested">Protocols which have been tested&lt;/h2>
&lt;p>Distrobench has tested 20 different distributed consensus protocols across 7 different implementation projects.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;a href="https://github.com/ailidani/paxi" target="_blank" rel="noopener">ailidani/paxi&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability, Eventual&lt;/li>
&lt;li>Protocol : Paxos, EPaxos, SDpaxos, WPaxos, ABD, chain, VPaxos, WanKeeper, KPaxos, Paxos_groups, Dynamo, Blockchain, M2Paxos, HPaxos.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/apache/zookeeper" target="_blank" rel="noopener">apache/zookeeper&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Java&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability + Primary Integrity&lt;/li>
&lt;li>Protocol : ZooKeeper implements ZAB (ZooKeeper Atomic Broadcast)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/etcd-io/etcd" target="_blank" rel="noopener">etcd-io/etcd&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Raft&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/fadhilkurnia/xdn" target="_blank" rel="noopener">fadhilkurnia/xdn&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Java, Rust&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability, Linearizability + Primary Integrity&lt;/li>
&lt;li>Protocol : Gigapaxos&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/Zhiying12/holipaxos-artifect" target="_blank" rel="noopener">Zhiying12/holipaxos-artifect&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go, Rust&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Holipaxos, Omnipaxos, Multipaxos&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/otoolep/hraftd" target="_blank" rel="noopener">otoolep/hraftd&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Raft&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/tikv/tikv" target="_blank" rel="noopener">tikv/tikv&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Rust&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Raft&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;ul>
&lt;li>When attempting to benchmark HoliPaxos, the main challenge was handling versions that rely on persistent storage with RocksDB. Since some implementations are written in Go, it was necessary to find compatible versions of RocksDB and gRocksDB (for example, RocksDB 10.5.1 works with gRocksDB 1.10.2). Another difficulty was that RocksDB is resource-intensive to compile, and in our project we did not have sufficient CPU capacity on the remote machine to build RocksDB and run remote benchmarks.&lt;/li>
&lt;li>Some projects did not compile successfully at first and required minor modifications to run.&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion-and-future-improvements">Conclusion and future improvements&lt;/h2>
&lt;p>The current benchmark results show the performance of all the mentioned protocols in terms of throughput and benchmark runtime. The results are subject to revision, because they may not reflect each protocol&amp;rsquo;s best performance due to an unoptimized deployment script. We are also planning to switch to a more powerful EC2 instance type, because t2.micro does not have enough resources to support RocksDB or TiKV.&lt;/p>
&lt;p>In the near future, additional features will be added to Distrobench such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Multi-Client Support:&lt;/strong> The YCSB client will start multiple clients which will send requests in parallel to different servers in the group.&lt;/li>
&lt;li>&lt;strong>Commit Versioning:&lt;/strong> Labels each benchmark result with the commit hash of the protocol repository version that was tested, making it possible to compare different versions of the same project.&lt;/li>
&lt;li>&lt;strong>Adding more Primary-Backup, Sequential, Causal, and Eventual consistency protocols:&lt;/strong> Implementations that support a consistency model other than linearizability and provide an existing key-value store application are notoriously difficult to find.&lt;/li>
&lt;li>&lt;strong>Benchmark on node failure&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Benchmark on the addition of a new node&lt;/strong>&lt;/li>
&lt;/ul></description></item><item><title>Final Blog: Improving Usability and Performance in cc-snapshot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250824-zahratm/</link><pubDate>Sun, 24 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250824-zahratm/</guid><description>&lt;p>My name is Zahra Temori, and I&amp;rsquo;m thrilled to collaborate with mentor Paul Marshall this summer on the cc-snapshot project.&lt;/p>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Reproducibility is an important concept in high performance computing and research. It ensures that experiments can be repeated, validated, and extended with confidence. Achieving a reproducible environment requires identical software stacks, with the exact same dependencies and configuration. The Chameleon Cloud testbed provides the cc-snapshot tool to support reproducibility by capturing the complete state of a running system. This allows researchers to rerun experiments exactly as before, share setups with each other, and avoid potential environmental issues such as missing dependencies or version mismatches. In this work, we explore how to enhance snapshotting as a reproducibility method and make it an effective strategy for HPC research.&lt;/p>
&lt;h2 id="key-achievements">Key Achievements&lt;/h2>
&lt;p>The project was divided into two phases. The first phase focused on usability, reorganizing the tool, and expanding its capabilities. The second phase was benchmarking, to evaluate alternative image formats and compression methods and improve snapshotting performance.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Usability Enhancements:&lt;/strong>
The original snapshotting tool had challenges including a limited command line, tightly coupled logic, and minimal testing support, which made it difficult for users to interact with and developers to maintain. To enhance the command line interface, we added a flag to disable automatic updates, giving users more control over when to pull the latest version. We also added a dry-run flag to simulate actions before running a snapshot, allowing developers to test and run safely. Moreover, we implemented support for a custom source path, enabling snapshots of specific directories. This helps developers test smaller directories rather than full snapshots, which can be more complicated when testing functionalities.
To improve maintainability, we refactored the codebase into five modular functions, allowing developers to make future changes more easily. In addition, we added automated tests with GitHub Actions to validate new and existing features and ensure that changes work as expected.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Performance Optimization:&lt;/strong>
The default image format and compression for snapshotting was QCOW2 with zlib, which often resulted in long snapshot creation times. To address this performance issue, we benchmarked alternatives such as QCOW2 with zstd compression and RAW with no compression. We also chose three images of varying sizes: small (4.47 GiB), medium (7.62 GiB), and large (12.7 GiB). The medium image was user-created, to demonstrate that snapshotting and compression work for both Chameleon-supported images and user-created images.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Results:&lt;/strong>
We ran each image with each compression method and recorded four key metrics: creation time, upload time, boot time, and final image size. We aggregated the overall time of each compression method across the three image sizes to evaluate which performed better. The results revealed that zstd compression reduced creation time by around 80.6% across the three image sizes. Upload time for zstd was nearly equal to zlib, while RAW images, being uncompressed and larger, uploaded much more slowly than images compressed with zlib or zstd. Boot time was nearly the same across all images, confirming that zlib and zstd take about the same time to decompress, while RAW images take longer to boot due to their larger size. Our work suggests that QCOW2 with zstd compression should be used instead of QCOW2 with zlib when creating a snapshot, enabling researchers to generate and share reproducible environments faster.&lt;/p>
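&lt;p>As a reference for reproducing the comparison, the image conversions we benchmarked can be expressed with &lt;code>qemu-img&lt;/code>. The sketch below is a hypothetical Python command builder; the option names follow QEMU&amp;rsquo;s documented &lt;code>qemu-img convert&lt;/code> interface (&lt;code>compression_type=zstd&lt;/code> requires QEMU 5.1 or newer), and cc-snapshot&amp;rsquo;s actual invocation may differ.&lt;/p>

```python
# Hypothetical command builder for the image variants we benchmarked.
# Option names follow QEMU's documented qemu-img interface; the exact
# commands used by cc-snapshot may differ.
def qemu_img_convert_cmd(src, dst, fmt="qcow2", compression=None):
    """compression: None (no compression), "zlib", or "zstd" (qcow2 only)."""
    cmd = ["qemu-img", "convert", "-O", fmt]
    if compression is not None:
        cmd += ["-c", "-o", f"compression_type={compression}"]
    cmd += [src, dst]
    return cmd
```

&lt;p>For example, &lt;code>qemu_img_convert_cmd("disk.img", "snap.qcow2", compression="zstd")&lt;/code> builds the zstd-compressed QCOW2 variant that performed best in our tests.&lt;/p>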
&lt;h2 id="conclusion-and-future-work">Conclusion and Future Work&lt;/h2>
&lt;p>Snapshotting is a practical way to support reproducibility in HPC, but to be effective it should be easy to use and fast enough for real research workflows. Our results show that using zstd compression can cut snapshot creation time by over 80% compared to the common default zlib compression, without affecting upload or boot performance. Looking ahead, we plan to integrate zstd, try it on more workloads and image types, and explore ways to improve snapshotting for even greater speedups and reliable results.&lt;/p>
&lt;h2 id="deliverables">Deliverables&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Repository:&lt;/strong> All comprehensive analysis code and source code can be found in the &lt;a href="https://github.com/ChameleonCloud/cc-snapshot/tree/reproducibility-improvements" target="_blank" rel="noopener">CC-SNAPSHOT GitHub Repository&lt;/a>.&lt;/li>
&lt;/ul></description></item><item><title>End-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/</link><pubDate>Sat, 23 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/</guid><description>&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Heading" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_f9e5e16b2001b9950ad995b2c786abc9.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_27bc4379277ab462935158b3db96d992.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image0_hu69efae69f006c4366342bdc2ded8b248_187729_f9e5e16b2001b9950ad995b2c786abc9.webp"
width="760"
height="392"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h1 id="introduction">&lt;strong>Introduction&lt;/strong>&lt;/h1>
&lt;p>Hello everyone!&lt;br>
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.&lt;/p>
&lt;h1 id="about-the-project">&lt;strong>About the Project&lt;/strong>&lt;/h1>
&lt;p>As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for researchers to discover relevant projects, notes, and assets across both current and archived work, using information that is either user-entered or passively collected by StatWrap.&lt;/p>
&lt;p>Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Evaluating open-source search libraries&lt;/strong> suitable for local indexing and retrieval&lt;/li>
&lt;li>&lt;strong>Building the full-text search functionality&lt;/strong> directly into the StatWrap UI to allow seamless querying across projects&lt;/li>
&lt;li>&lt;strong>Ensuring reliability&lt;/strong> through the development of unit tests and comprehensive system testing&lt;/li>
&lt;li>&lt;strong>Implementing a classification system&lt;/strong> to label projects as “Active,” “Pinned,” or “Past” within the user interface&lt;/li>
&lt;/ul>
&lt;p>This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.&lt;/p>
&lt;h1 id="deliverables">&lt;strong>Deliverables&lt;/strong>&lt;/h1>
&lt;p>The project has reached the end of its scope after 12 weeks of work. Here&amp;rsquo;s a breakdown:&lt;/p>
&lt;h2 id="1-descriptive-comparison-of-open-source-libraries">&lt;strong>1. Descriptive Comparison of Open-Source Libraries&lt;/strong>&lt;/h2>
&lt;p>Compared various open-source search libraries on evaluation criteria such as &lt;strong>indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation&lt;/strong>, and &lt;strong>developer experience&lt;/strong>. We decided on the weight to assign to each feature and identified the best library to use. According to the weights we assigned,
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_4b5e863d88146124b333878508147eff.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_c2220a56c480048842e8b750cc2ca56f.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image1_hu63c79919752d2305350a1cb96819590d_110608_4b5e863d88146124b333878508147eff.webp"
width="760"
height="603"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>These results are after tuning the hyperparameters to give the best set of results
For huge data, FlexSearch has the least memory usage, followed by MiniSearch. The examples we used were limited, so Minisearch had the better memory usage results.
Along with the research and evaluation, I looked upon the Performance Benchmark of Full-Text-Search Libraries (Stress Test), available &lt;a href="https://nextapps-de.github.io/flexsearch/" target="_blank" rel="noopener">here&lt;/a>&lt;/p>
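&lt;p>The weighted evaluation described above can be sketched as follows. The weights and per-library scores here are made-up placeholders for illustration, not the actual values from our comparison.&lt;/p>

```python
# Illustrative weighted-scoring sketch for comparing search libraries.
# Weights and per-criterion scores are made-up placeholders, not the
# actual values used in the evaluation.
WEIGHTS = {"search_speed": 0.3, "memory": 0.2, "fuzzy": 0.2, "docs": 0.3}

def weighted_score(scores):
    return sum(WEIGHTS[k] * scores.get(k, 0) for k in WEIGHTS)

def rank(libraries):
    # libraries: list of (name, scores) pairs; best overall score first.
    return sorted(libraries,
                  key=lambda name_scores: weighted_score(name_scores[1]),
                  reverse=True)

libs = [("MiniSearch", {"search_speed": 8, "memory": 9, "fuzzy": 7, "docs": 8}),
        ("FlexSearch", {"search_speed": 10, "memory": 8, "fuzzy": 6, "docs": 6}),
        ("Fuse.js",    {"search_speed": 5, "memory": 6, "fuzzy": 9, "docs": 8})]
```

&lt;p>Changing the weights shifts the ranking, which is why agreeing on the weights up front was an important part of the evaluation.&lt;/p>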
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Stress Test" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_407cb964e7e05c64834433b6a84182ff.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_167223f62fbaf30991601d7745fad9f5.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image2_hu9b739b80416dccda0a7e0361ba4f7e36_163727_407cb964e7e05c64834433b6a84182ff.webp"
width="760"
height="384"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The benchmark was measured in operations per second; higher values are better (except for the &amp;ldquo;Memory&amp;rdquo; test). The memory value refers to the amount of memory that was additionally allocated during search.&lt;/p>
&lt;p>FlexSearch performs queries up to 1,000,000 times faster than other libraries, while also providing powerful search capabilities like multi-field (document) search, phonetic transformations, partial matching, tag search, result highlighting, and suggestions.
Bigger workloads can be scaled through workers, which apply updates or queries to the index in parallel on dedicated balanced threads.&lt;/p>
&lt;h2 id="2-the-search-user-interface">&lt;strong>2. The Search User Interface&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_5c88d9d2587c54c50da97d6c489519dc.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_82065ca30e98bced61362bca45765215.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image3_hu2c7c529fbdaba5c9b4f85e802acf251e_292973_5c88d9d2587c54c50da97d6c489519dc.webp"
width="760"
height="428"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui2" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_7a3499ad0fc3cd06919fcdd17194742a.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_5840b85d48a6e608855c8e0d96b4fe49.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image5_hu55f5482f96b2f6db562c5a51f9b5f629_220424_7a3499ad0fc3cd06919fcdd17194742a.webp"
width="760"
height="652"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="3-complete-search-execution-pipeline">&lt;strong>3. Complete Search Execution Pipeline&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ui2" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_bd4ac2fa5efb17e2b237cf8d78278398.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_a0e8f31fdbdc656a2886def3dca3410b.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/Flowchart__hu0123533bb7a682ac6b28d9b34fa57bc0_349775_bd4ac2fa5efb17e2b237cf8d78278398.webp"
width="513"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="4-flexsearch-features">&lt;strong>4. FlexSearch Features&lt;/strong>&lt;/h2>
&lt;h4 id="1-persistent-indexing-with-automatic-loading">1. &lt;strong>Persistent Indexing with Automatic Loading&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Index persistence&lt;/strong>: Search index automatically saves to disk and loads on startup&lt;/li>
&lt;li>&lt;strong>Fast restoration&lt;/strong>: Rebuilds FlexSearch indices from saved document store without re-scanning files&lt;/li>
&lt;li>&lt;strong>Incremental updates&lt;/strong>: Detects project changes and updates only modified content&lt;/li>
&lt;li>&lt;strong>Background processing&lt;/strong>: Index updates happen asynchronously without blocking the User Interface.&lt;/li>
&lt;/ul>
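&lt;p>The save/restore cycle can be illustrated with a minimal sketch. This is a language-agnostic illustration in Python (StatWrap itself is a JavaScript/Electron application, and all names here are hypothetical): the raw document store is persisted to disk, and the search index is rebuilt from it on startup instead of re-scanning files.&lt;/p>

```python
import json
import os

def save_store(store, path):
    # persist the raw document store; the search indices are rebuilt on load
    with open(path, "w") as f:
        json.dump(store, f)

def load_store(path):
    # missing file means a fresh start with an empty store
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

def rebuild_index(store):
    # rebuild an inverted index from the saved documents, no file re-scan needed
    index = {}
    for doc_id, text in store.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index
```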
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="indexing" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_23074ee37edbb0f6abbd289ef211f756.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_993d6a1363d2cddf66632c4102acb8f5.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image4_hu4893772edaa569a0d2e6454373f66573_78656_23074ee37edbb0f6abbd289ef211f756.webp"
width="494"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h4 id="2-multi-document-type-support">2. &lt;strong>Multi-Document Type Support&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Unified search&lt;/strong>: Single search interface for projects, files, people, notes, and assets&lt;/li>
&lt;li>&lt;strong>Type-specific indices&lt;/strong>: Separate FlexSearch indices optimized for each document type&lt;/li>
&lt;li>&lt;strong>Cross-reference capabilities&lt;/strong>: Documents can reference and link to each other&lt;/li>
&lt;li>&lt;strong>Flexible schema&lt;/strong>: Each document type has tailored fields for optimal search performance&lt;/li>
&lt;/ul>
&lt;h4 id="3-intelligent-file-content-indexing">3. &lt;strong>Intelligent File Content Indexing&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Configurable file size limits&lt;/strong>: Admin-controlled maximum file size for content indexing&lt;/li>
&lt;li>&lt;strong>Smart file detection&lt;/strong>: Automatically identifies text files by extension and filename patterns&lt;/li>
&lt;li>&lt;strong>Content extraction&lt;/strong>: Full-text indexing with snippet generation for search results&lt;/li>
&lt;li>&lt;strong>Performance optimization&lt;/strong>: Skips binary files and respects size constraints to maintain speed&lt;/li>
&lt;/ul>
&lt;h4 id="4-advanced-query-processing">4. &lt;strong>Advanced Query Processing&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Multi-strategy search&lt;/strong>: Combines exact matches, fuzzy search, partial matches, and contextual search&lt;/li>
&lt;li>&lt;strong>Query preprocessing&lt;/strong>: Removes stop words and applies linguistic filters&lt;/li>
&lt;li>&lt;strong>Relevance scoring&lt;/strong>: Custom scoring algorithm considering multiple factors:
&lt;ul>
&lt;li>Exact phrase matches (highest weight)&lt;/li>
&lt;li>Individual word matches&lt;/li>
&lt;li>Term frequency with logarithmic capping&lt;/li>
&lt;li>Position-based scoring (earlier matches rank higher)&lt;/li>
&lt;li>Proximity bonuses for terms appearing near each other&lt;/li>
&lt;li>Completeness penalties for missing query terms&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
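&lt;p>The scoring factors above can be sketched as a small Python function (a simplified illustration of the idea, not the actual StatWrap implementation; the weights are arbitrary and the proximity bonus is omitted for brevity):&lt;/p>

```python
import math

def relevance(query, text):
    q, t = query.lower(), text.lower()
    words = q.split()
    score = 0.0
    if q in t:
        score += 10.0                       # exact phrase match: highest weight
    for w in words:
        count = t.split().count(w)
        if count:
            score += 2.0                    # individual word match
            score += math.log(1 + count)    # term frequency, logarithmically capped
            score += 1.0 / (1 + t.find(w))  # earlier matches rank higher
    missing = sum(1 for w in words if w not in t)
    score -= 3.0 * missing                  # completeness penalty for missing terms
    return score
```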
&lt;h4 id="5-real-time-search-suggestions">5. &lt;strong>Real-Time Search Suggestions&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Autocomplete support&lt;/strong>: Dynamic suggestions based on indexed document titles&lt;/li>
&lt;li>&lt;strong>Search history&lt;/strong>: Maintains recent searches for quick re-execution&lt;/li>
&lt;li>&lt;strong>Debounced input&lt;/strong>: Prevents excessive API calls during typing&lt;/li>
&lt;li>&lt;strong>Contextual suggestions&lt;/strong>: Suggestions adapt based on current filters and context&lt;/li>
&lt;/ul>
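&lt;p>Debouncing can be sketched in a few lines (shown here in Python with a timer, though the StatWrap frontend would do this in JavaScript): each keystroke cancels the pending request and schedules a new one, so only the final query triggers a search.&lt;/p>

```python
import threading

class Debouncer:
    """Delay a callback until input has been quiet for `wait` seconds."""

    def __init__(self, wait, fn):
        self.wait, self.fn = wait, fn
        self._timer = None

    def call(self, *args):
        # cancel any pending invocation and schedule a fresh one
        if self._timer is not None:
            self._timer.cancel()
        self._timer = threading.Timer(self.wait, self.fn, args)
        self._timer.start()
```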
&lt;h4 id="6-comprehensive-filtering-system">6. &lt;strong>Comprehensive Filtering System&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Type filtering&lt;/strong>: Filter by document type (projects, files, people, etc.)&lt;/li>
&lt;li>&lt;strong>Project scoping&lt;/strong>: Limit searches to specific projects&lt;/li>
&lt;li>&lt;strong>File type filtering&lt;/strong>: Filter files by extension&lt;/li>
&lt;li>&lt;strong>Advanced search panel&lt;/strong>: Collapsible interface for power users&lt;/li>
&lt;li>&lt;strong>Filter persistence&lt;/strong>: Maintains filter state across searches&lt;/li>
&lt;/ul>
&lt;h4 id="7-performance-monitoring--analytics">7. &lt;strong>Performance Monitoring &amp;amp; Analytics&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Real-time metrics&lt;/strong>: Track search times, cache hit rates, and index statistics&lt;/li>
&lt;li>&lt;strong>Performance dashboard&lt;/strong>: Visual indicators for system health&lt;/li>
&lt;li>&lt;strong>Cache management&lt;/strong>: LRU cache with configurable size and TTL&lt;/li>
&lt;li>&lt;strong>Search analytics&lt;/strong>: Historical data on search patterns and performance&lt;/li>
&lt;/ul>
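&lt;p>The cache behaviour described above can be sketched in Python (an illustration of the LRU-with-TTL idea, not the actual implementation): entries expire after a time-to-live, and the least recently used entry is evicted once the size limit is reached.&lt;/p>

```python
import time
from collections import OrderedDict

class LRUCache:
    """LRU cache with a maximum size and a per-entry TTL."""

    def __init__(self, maxsize=128, ttl=60.0):
        self.maxsize, self.ttl = maxsize, ttl
        self._data = OrderedDict()

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, stamp = item
        if time.monotonic() - stamp > self.ttl:  # expired entry
            del self._data[key]
            return None
        self._data.move_to_end(key)              # mark as recently used
        return value

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:       # evict least recently used
            self._data.popitem(last=False)
```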
&lt;h4 id="8-index-management-tools">8. &lt;strong>Index Management Tools&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Export/Import functionality&lt;/strong>: Backup and restore search indices&lt;/li>
&lt;li>&lt;strong>Full reindexing&lt;/strong>: Complete index rebuild with progress tracking&lt;/li>
&lt;li>&lt;strong>Index deletion&lt;/strong>: Clean slate functionality for troubleshooting&lt;/li>
&lt;li>&lt;strong>File size adjustment&lt;/strong>: Modify indexing constraints and rebuild affected content&lt;/li>
&lt;li>&lt;strong>Index statistics&lt;/strong>: Detailed breakdown of indexed content by type and project&lt;/li>
&lt;/ul>
&lt;h4 id="9-robust-error-handling--resilience">9. &lt;strong>Robust Error Handling &amp;amp; Resilience&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Graceful degradation&lt;/strong>: System continues operating even with partial index corruption&lt;/li>
&lt;li>&lt;strong>File system error handling&lt;/strong>: Handles missing files, permission issues, and path changes&lt;/li>
&lt;li>&lt;strong>Memory management&lt;/strong>: Prevents memory leaks during large indexing operations&lt;/li>
&lt;li>&lt;strong>Recovery mechanisms&lt;/strong>: Automatic fallback to basic search if advanced features fail&lt;/li>
&lt;/ul>
&lt;h4 id="10-user-experience-enhancements">10. &lt;strong>User Experience Enhancements&lt;/strong>&lt;/h4>
&lt;ul>
&lt;li>&lt;strong>Keyboard shortcuts&lt;/strong>: Ctrl+K to focus search, Escape to clear&lt;/li>
&lt;li>&lt;strong>Result highlighting&lt;/strong>: Visual emphasis on matching terms in results&lt;/li>
&lt;li>&lt;strong>Expandable results&lt;/strong>: Drill down into detailed information for each result&lt;/li>
&lt;li>&lt;strong>Loading states&lt;/strong>: Clear feedback during indexing and search operations&lt;/li>
&lt;li>&lt;strong>Responsive tabs&lt;/strong>: Organized results by type with badge counts&lt;/li>
&lt;/ul>
&lt;h2 id="5-classification-of-active-and-past-projects">&lt;strong>5. Classification of Active and Past Projects&lt;/strong>&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Active Pinned" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1d3344ebb95180438d54893a9b5683e4.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_a0f8ee7f62445c2f5f806022268d0821.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image6_huacf20425d6903f6cfe6149bc5cb1772d_171494_1d3344ebb95180438d54893a9b5683e4.webp"
width="733"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Past" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_76660a0dce9ac0ba1fa91c959db2773c.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_cc2abd1a6a3019f703ca3e656e55f920.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image7_hu7cccff315a5d098cd440d7277689d606_85529_76660a0dce9ac0ba1fa91c959db2773c.webp"
width="740"
height="542"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>A classification system was added to the User Interface, similar to the &lt;strong>&amp;ldquo;Add to Favorites&amp;rdquo;&lt;/strong> option. A newly added project moves to the &lt;strong>&amp;ldquo;Active&amp;rdquo;&lt;/strong> section by default, unless explicitly marked as &lt;strong>&amp;ldquo;Past&amp;rdquo;&lt;/strong>. Similarly, when a project is unpinned from Favorites, it returns to the &amp;ldquo;Active&amp;rdquo; section.&lt;/p>
&lt;h1 id="conclusion-and-future-scope">&lt;strong>Conclusion and Future Scope&lt;/strong>&lt;/h1>
&lt;p>Building a comprehensive search system requires careful attention to performance, user experience, and maintainability. FlexSearch provided the foundation, but the real value came from thoughtful implementation of persistent indexing, advanced scoring, and robust error handling. The result is a search system that feels instant to users while handling complex queries across diverse document types.&lt;/p>
&lt;p>The key to success was treating search not as a single feature, but as a complete subsystem with its own data management, performance monitoring, and user interface considerations. By investing in these supporting systems, the search functionality became a central, reliable part of the application that users can depend on.&lt;/p>
&lt;p>The future scope would include:&lt;/p>
&lt;ol>
&lt;li>Using a database (for example, SQLite) instead of JSON, which suits this use case better thanks to more efficient query performance and atomic (CRUD) operations.&lt;/li>
&lt;li>Integrating any suggestions from my mentors, as well as improvements we feel are necessary.&lt;/li>
&lt;li>Developing unit tests for further functionalities and improvements.&lt;/li>
&lt;/ol>
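&lt;p>The first item can be sketched with Python's built-in &lt;code>sqlite3&lt;/code> module (the schema here is hypothetical): unlike rewriting a whole JSON file, each write is an atomic transaction, and queries run over the data in place instead of loading the entire file into memory.&lt;/p>

```python
import sqlite3

# in-memory database for illustration; the app would use a file path instead
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, type TEXT, content TEXT)")

# each `with conn:` block commits atomically (or rolls back on error)
with conn:
    conn.execute("INSERT INTO documents VALUES (?, ?, ?)",
                 ("doc-1", "note", "benchmark results for the search index"))

# query in place rather than loading and scanning a whole JSON file
rows = conn.execute("SELECT id FROM documents WHERE content LIKE ?",
                    ("%search%",)).fetchall()
```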
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Thank You!" srcset="
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_f70985a589ad6b79f8c95b36c5279852.webp 400w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_b28b9dbb6c70c33ca845fda461a64fcf.webp 760w,
/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250823-debangi29/image_hu81a7405087771991938f164c6a45c6d2_109315_f70985a589ad6b79f8c95b36c5279852.webp"
width="760"
height="235"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p></description></item><item><title>[Final]Reproducibility of Interactive Notebooks in Distributed Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/08202025-rahmad/</link><pubDate>Wed, 20 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/08202025-rahmad/</guid><description>&lt;p>I am sharing an overview of my project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad">Reproducibility of Interactive Notebooks in Distributed Environments&lt;/a> and the work that I did this summer.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>This project aims to improve the reproducibility of interactive notebooks executed in distributed environments. Notebooks in the &lt;a href="https://jupyter.org/" target="_blank" rel="noopener">Jupyter&lt;/a> environment have become increasingly popular and are widely used in the scientific community due to their ease of use and portability. Reproducing these notebooks is a challenging task, especially in a distributed cluster environment.&lt;/p>
&lt;p>In the distributed environments we consider, the notebook code is divided into manager and worker code. The manager code is the main entry point of the program; it divides the task at hand into one or more worker tasks which run in a parallel, distributed fashion. We utilize several open-source tools to package and containerize the application code so it can be reproduced across different machines and environments. They include &lt;a href="https://github.com/radiant-systems-lab/sciunit" target="_blank" rel="noopener">Sciunit&lt;/a>, &lt;a href="https://github.com/radiant-systems-lab/Flinc" target="_blank" rel="noopener">FLINC&lt;/a>, and &lt;a href="https://cctools.readthedocs.io/en/stable/taskvine/" target="_blank" rel="noopener">TaskVine&lt;/a>. These are the high-level goals of this project:&lt;/p>
&lt;ol>
&lt;li>Generate execution logs for a notebook program.&lt;/li>
&lt;li>Generate code and data dependencies for notebook programs in an automated manner.&lt;/li>
&lt;li>Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.&lt;/li>
&lt;li>Audit and package the notebook code running in a distributed environment.&lt;/li>
&lt;li>Overall, support efficient reproducibility of notebook programs.&lt;/li>
&lt;/ol>
&lt;h1 id="progress-highlights">Progress Highlights&lt;/h1>
&lt;p>Here are the details of the work that I did during this summer.&lt;/p>
&lt;h2 id="generation-of-execution-logs">Generation of Execution Logs&lt;/h2>
&lt;p>We generate execution logs for the notebook programs in a distributed environment using the Linux utility &lt;a href="https://man7.org/linux/man-pages/man1/strace.1.html" target="_blank" rel="noopener">strace&lt;/a>, which records every system call made by the notebook, including all files accessed during its execution. We collect separate logs for the manager and the worker code, since they run on different machines and have different dependencies. By recording the entire notebook execution, we capture all libraries, packages, and data files referenced during the run in the form of execution logs. These logs are then utilized for further analyses.&lt;/p>
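&lt;p>A plausible invocation can be built like this in Python (the flags below are standard &lt;em>strace&lt;/em> options, but our exact command line may differ):&lt;/p>

```python
def strace_command(script, logfile):
    # -f follows child processes (the workers forked by the manager),
    # -e trace=file restricts logging to file-related system calls,
    # -o writes the log to a file instead of stderr
    return ["strace", "-f", "-e", "trace=file", "-o", logfile,
            "python3", script]

# the resulting list can be passed to subprocess.run() to produce the log
```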
&lt;h2 id="extracting-software-dependencies">Extracting Software Dependencies&lt;/h2>
&lt;p>When a library such as a Python package like &lt;em>Numpy&lt;/em> is used by the notebook program, an entry is made in the execution log containing the complete path of the accessed library file(s) along with additional information. We analyze the execution logs for both manager and workers to find and list all dependencies. So far, we are limited to Python packages, though this methodology is general and can be used to find dependencies for any programming language. For Python packages, version numbers are also obtained by querying package managers like &lt;em>pip&lt;/em> or &lt;em>Conda&lt;/em> on the local system.&lt;/p>
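&lt;p>The extraction step can be sketched as follows (a simplified Python illustration; the real pipeline handles more path layouts): top-level package names are pulled out of the &lt;code>site-packages&lt;/code> paths that appear in &lt;code>openat()&lt;/code> entries of the log.&lt;/p>

```python
import re

# matches e.g. /usr/lib/python3.10/site-packages/numpy/core/__init__.py
PKG_RE = re.compile(r"site-packages/([A-Za-z0-9_]+)/")

def packages_from_log(lines):
    # collect distinct top-level package names seen in file-open system calls
    pkgs = set()
    for line in lines:
        if "openat(" not in line:
            continue
        m = PKG_RE.search(line)
        if m:
            pkgs.add(m.group(1))
    return sorted(pkgs)
```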
&lt;h2 id="extracting-data-dependencies">Extracting Data Dependencies&lt;/h2>
&lt;p>We utilize similar execution logs to identify which data files were used by the notebook program. The list of logged files also contains various configuration or settings files used by certain packages and libraries. These files are removed from the list of data dependencies through post-processing that analyzes file paths.&lt;/p>
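&lt;p>The post-processing can be sketched as a simple path filter (the prefixes here are illustrative; the real filter list is more extensive):&lt;/p>

```python
# system/configuration locations that should not count as data dependencies
NON_DATA_PREFIXES = ("/etc/", "/usr/", "/proc/", "/sys/", "/lib/")

def data_files(paths):
    # keep only user data files; drop library and configuration paths
    return [p for p in paths
            if not p.startswith(NON_DATA_PREFIXES)
            and "site-packages" not in p]
```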
&lt;h2 id="testing-the-pipeline">Testing the Pipeline&lt;/h2>
&lt;p>We have conducted our experiments on three use cases from different domains, using between 5 and 10 workers: distributed image convolution, climate trend analysis, and high-energy physics experiment analysis. The results so far are promising, with good accuracy and only a slight runtime overhead.&lt;/p>
&lt;h2 id="processing-at-cell-level">Processing at Cell-level&lt;/h2>
&lt;p>We perform the same steps of log generation and data and software dependency extraction at the level of individual cells in a notebook instead of once for the whole notebook. As a result, we generate software and data dependencies at the level of individual notebook cells. This is achieved by interrupting control flow before and after execution of each cell to write special instructions to the execution log for marking boundaries of cell execution. We then analyze the intervals between these instructions to identify which files and Python packages are accessed by each specific cell. We use this information to generate the list of software dependencies used by that cell only.&lt;/p>
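&lt;p>The boundary-marking trick can be sketched like this (the marker file names are hypothetical): touching a sentinel file before and after each cell leaves recognizable file-open entries in the execution log, so the entries between a begin/end pair belong to that cell.&lt;/p>

```python
import os

def run_cell(cell_index, source, globals_ns, marker_dir):
    """Execute one notebook cell with begin/end sentinel files around it."""
    begin = os.path.join(marker_dir, f"cell_{cell_index}_begin")
    end = os.path.join(marker_dir, f"cell_{cell_index}_end")
    # the open() calls below show up in the strace log and mark cell boundaries
    open(begin, "w").close()
    try:
        exec(source, globals_ns)
    finally:
        open(end, "w").close()
```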
&lt;p>We also capture data dependencies by analyzing the execution logs generated by overriding the &lt;em>open&lt;/em> function call used to access various files.&lt;/p>
&lt;h2 id="distributed-notebook-auditing">Distributed Notebook Auditing&lt;/h2>
&lt;p>In order to execute and audit workloads in parallel, we use &lt;a href="https://github.com/radiant-systems-lab/parallel-sciunit" target="_blank" rel="noopener">Sciunit Parallel&lt;/a>, which uses GNU Parallel for efficient parallel execution of tasks. The user specifies the number of tasks or machines, and the workload is then distributed across them. Once execution completes, the containerized executions are gathered at the host location.&lt;/p>
&lt;h2 id="efficient-reproducibility-with-checkpointing">Efficient Reproducibility with Checkpointing&lt;/h2>
&lt;p>An important challenge with Jupyter notebooks is that re-executing them can be unnecessarily time-consuming and resource-intensive, especially when most cells remain unchanged. We worked on &lt;a href="https://github.com/talha129/NBRewind/tree/master" target="_blank" rel="noopener">NBRewind&lt;/a>, a lightweight tool that accelerates notebook re-execution by avoiding redundant computation. It integrates checkpointing, application virtualization, and content-based deduplication, and supports two kinds of checkpoints: incremental and full-state. With incremental checkpoints, notebook state and dependencies are stored once and only their deltas are stored thereafter; with full-state checkpoints, the complete state is stored after each cell. During the restore process, NBRewind restores outputs for unchanged cells, enabling efficient re-execution. Our empirical evaluation demonstrates that NBRewind can significantly reduce both notebook audit and repeat times with incremental checkpoints.&lt;/p>
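&lt;p>The content-based deduplication at the heart of this approach can be sketched in a few lines (a simplified illustration, not NBRewind's actual storage format): each serialized cell state is addressed by its hash, so identical states across checkpoints are stored only once.&lt;/p>

```python
import hashlib

class CheckpointStore:
    """Content-addressed checkpoint store with deduplication."""

    def __init__(self):
        self.blobs = {}         # digest -> serialized state, stored once
        self.checkpoints = []   # one digest per saved checkpoint

    def save(self, state: bytes):
        digest = hashlib.sha256(state).hexdigest()
        self.blobs.setdefault(digest, state)  # unchanged state adds no new blob
        self.checkpoints.append(digest)
        return digest

    def restore(self, i):
        # restoring is a dictionary lookup, not a re-execution
        return self.blobs[self.checkpoints[i]]
```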
&lt;p>I am very happy about the experience I have had in this project and I would encourage other students to join this program in the future.&lt;/p></description></item><item><title>Midterm Report: Learning and Building ORB</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/orb/08072025-param/</link><pubDate>Thu, 07 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/orb/08072025-param/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>UC ORB is an open-source platform developed to increase visibility and engagement with open source projects across the University of California system.&lt;/p>
&lt;p>By providing a structured and searchable repository browser, ORB makes it easier for researchers, students, and collaborators to discover relevant open source initiatives, track their impact, and connect with contributors. It also helps campuses demonstrate the value of their open source output to potential funders and institutional partners.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>Significant progress has been made in building out core features of the ORB Showcase platform:&lt;/p>
&lt;h3 id="searching-and-filtering-options">Searching and Filtering Options&lt;/h3>
&lt;p>Users can now search and filter repositories using multiple criteria:&lt;/p>
&lt;ul>
&lt;li>Development Team / UC Campus&lt;/li>
&lt;li>Programming Language&lt;/li>
&lt;li>License Type&lt;/li>
&lt;li>Topic / Domain Area&lt;/li>
&lt;/ul>
&lt;p>These filtering tools make it easy to explore the growing set of repositories in a meaningful and personalized way.&lt;/p>
&lt;p>Pagination has been added to ensure scalability and smooth performance, even as the number of projects continues to grow.&lt;/p>
&lt;h3 id="repository-details-view">Repository Details View&lt;/h3>
&lt;p>Each repository page now displays rich metadata and contextual information, including:&lt;/p>
&lt;p>README preview – offering a quick look at the project’s purpose and usage&lt;/p>
&lt;p>License – clearly indicating how the project can be used or adapted&lt;/p>
&lt;p>Contributors and Funders – acknowledging the people and institutions behind the work&lt;/p>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>As we prepare UC ORB for public launch, we’re focused on improving the backend workflow and addressing some key challenges:&lt;/p>
&lt;h3 id="github-workflow-challenges">⚙️ GitHub Workflow Challenges&lt;/h3>
&lt;p>Creating a GitHub-first workflow for adding repositories is powerful, but also tricky:&lt;/p>
&lt;p>GitHub Actions cannot be triggered by API calls from a backend directly, which limits automation via server-side tools.&lt;/p>
&lt;p>The GitHub bot has permission limitations, especially when it comes to interacting with PRs and validating submissions outside of standard GitHub UI flows.&lt;/p>
&lt;p>I’m currently working on designing a more robust and maintainable workflow to handle these edge cases, including:&lt;/p>
&lt;p>A standalone script that can add repositories directly to the database, bypassing the need for a pull request and enabling more flexible internal submissions.&lt;/p>
&lt;p>Better logging and validation to ensure consistency between the file-based data model and the live PostgreSQL database.&lt;/p>
&lt;h1 id="reflection">Reflection&lt;/h1>
&lt;p>This project has been a great learning experience. Despite challenges with the frontend, backend, GitHub Actions/bots, and APIs, it’s been exciting to build a platform that highlights open source work across the UC system.&lt;/p>
&lt;p>I&amp;rsquo;m looking forward to what&amp;rsquo;s coming next as we get closer to launching ORB.&lt;/p></description></item><item><title>Midterm Report: Simulation, Comparison, and Conclusion of Cache Eviction</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-08-06-haochengxia/</link><pubDate>Wed, 06 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-08-06-haochengxia/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>&lt;strong>CacheBench&lt;/strong> is a benchmarking suite designed for comprehensive cache performance evaluation, with a particular focus on analyzing the miss ratios of various cache eviction algorithms.&lt;/p>
&lt;p>At the core of CacheBench lie two key components: the high-performance cache simulator, &lt;a href="https://github.com/1a1a11a/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a>, and the extensive &lt;a href="https://github.com/cacheMon/cache_dataset" target="_blank" rel="noopener">open-source cache datasets&lt;/a>, which collectively contain over 8,000 traces from diverse applications. This ensures broad coverage across a range of realistic workloads.&lt;/p>
&lt;p>Our primary goal is to evaluate all major and widely-used cache eviction algorithms on thousands of traces, in order to gain insights into their behaviors and design trade-offs. Additionally, we aim to identify and distill representative workloads, making benchmarking more efficient and comprehensive for future cache research.&lt;/p>
&lt;h2 id="progress-and-pain-points">Progress and Pain Points&lt;/h2>
&lt;p>We began by benchmarking prevalent eviction algorithms, including FIFO, LRU, CLOCK, LFU, Random, Belady (BeladySize), CAR, ARC, LIRS, LHD, Hyperbolic, GDSF, W-TinyLFU, 2Q, SLRU, S3-FIFO, SIEVE, and LeCaR. As we developed the suite, we made progressive improvements to both the simulator and dataset infrastructure. Our progress can be summarized as follows:&lt;/p>
&lt;ul>
&lt;li>Collected miss ratio results for all listed algorithms across 8,000+ traces.&lt;/li>
&lt;li>Identified best- and worst-performing traces for each algorithm, and conducted feature analysis of these traces.&lt;/li>
&lt;li>Developed Python bindings: To increase accessibility, we provided a Python package that allows users to easily download traces and run simulation analyses using &lt;a href="https://github.com/1a1a11a/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a> and the &lt;a href="https://github.com/cacheMon/cache_dataset" target="_blank" rel="noopener">cache datasets&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>However, analysis remains challenging because there is no universally accepted metric or baseline for objectively comparing cache eviction algorithms&amp;rsquo; performance across all workloads.&lt;/p>
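&lt;p>The kind of comparison we run can be sketched with two of the simplest policies (a toy Python simulator for illustration only; our actual experiments use libCacheSim):&lt;/p>

```python
from collections import OrderedDict, deque

def lru_miss_ratio(trace, size):
    cache, misses = OrderedDict(), 0
    for obj in trace:
        if obj in cache:
            cache.move_to_end(obj)            # hit: refresh recency
        else:
            misses += 1
            cache[obj] = True
            if len(cache) > size:
                cache.popitem(last=False)     # evict least recently used
    return misses / len(trace)

def fifo_miss_ratio(trace, size):
    cache, order, misses = set(), deque(), 0
    for obj in trace:
        if obj not in cache:
            misses += 1
            cache.add(obj)
            order.append(obj)
            if len(cache) > size:
                cache.remove(order.popleft()) # evict oldest insertion
    return misses / len(trace)
```

Even on tiny synthetic traces the two policies diverge, which is exactly why per-trace analysis matters.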
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>For the second half of the project, my focus will shift to:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Evaluating More Complex Eviction Algorithms&lt;/strong>: Having concentrated mainly on static eviction policies so far (which are generally more deterministic and understandable), I will now investigate learning-based eviction algorithms such as LRB and 3L-Cache. These models incorporate learning components and incur additional computational overhead, making simulations slower and more complex.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Detailed Trace Analysis&lt;/strong>: Since eviction algorithms can have highly variable performance on the same trace, I plan to analyze why certain algorithms excel on specific traces while others do not. Understanding these factors is crucial to characterizing both the algorithms and the workload traces.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Constructing Representative Workload Sets&lt;/strong>: Based on ongoing simulations and trace analyses, I aim to identify a minimal but representative subset of traces that can serve as a basic evaluation suite, simplifying testing and improving accessibility.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>This project has truly been the highlight of my summer. By evaluating a wide range of cache eviction algorithms, I&amp;rsquo;ve significantly deepened my understanding of cache design and its underlying principles.&lt;/p>
&lt;p>I&amp;rsquo;m especially grateful to my mentors for their constant support, patience, and guidance throughout this journey. It’s been a privilege to learn from you!&lt;/p>
&lt;p>I&amp;rsquo;m excited to see the final results of CacheBench!&lt;/p></description></item><item><title>Midterm Report: Learning, Building, and Documenting Brahma</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/brahma/08052025-kajaljotwani/</link><pubDate>Tue, 05 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/brahma/08052025-kajaljotwani/</guid><description>&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>&lt;strong>Brahma-XR&lt;/strong> is an open-source WebXR framework designed for building collaborative virtual environments especially those involving spatial data and scientific visualization.&lt;/p>
&lt;p>What makes Brahma powerful is that the same codebase runs seamlessly across both the browser and XR devices like the Apple Vision Pro, Meta Quest 3, and VARJO. This makes it ideal for rapid prototyping and creating cross-platform immersive experiences.&lt;/p>
&lt;p>Some of Brahma’s built-in features include:&lt;/p>
&lt;ul>
&lt;li>Grab-and-pull locomotion&lt;/li>
&lt;li>Raycasting and interaction&lt;/li>
&lt;li>Avatar embodiment&lt;/li>
&lt;li>Spatial rendering&lt;/li>
&lt;li>Support for geospatial and data-driven visualizations&lt;/li>
&lt;/ul>
&lt;p>Brahma is intentionally lightweight, optimized to run even on low-compute devices—making immersive collaboration more accessible to everyone.&lt;/p>
&lt;h2 id="what-worked-and-what-didnt">What Worked (and What Didn’t)&lt;/h2>
&lt;p>As Brahma transitioned from a private research repo to a public open-source project, a lot of important foundational work had to be done around documentation, packaging, and example previews.&lt;/p>
&lt;p>There are two aspects that make Brahma especially unique:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Bipartite npm package structure&lt;/strong> – which requires detailed and thoughtful documentation.&lt;/li>
&lt;li>&lt;strong>Immersive, real-time examples&lt;/strong> – unlike typical libraries, Brahma’s examples aren’t just static demos. They are live, multi-user XR apps designed to be interacted with.&lt;/li>
&lt;/ol>
&lt;p>The first half of the project focused on setting the stage—structuring and preparing the framework for broader use.&lt;/p>
&lt;h3 id="-key-accomplishments">🔧 Key Accomplishments&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Learning Three.js&lt;/strong>&lt;br>
I spent time learning the fundamentals of Three.js—how it handles 3D rendering, scene setup, materials, cameras, and animations. I also explored how large-scale Three.js projects are organized, which helped me understand how Brahma’s example apps are built.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Setting up the project structure&lt;/strong>&lt;br>
I looked at the architecture of various open-source projects and used that knowledge to shape Brahma’s structure. The goal was to align with community best practices while keeping things clean and modular for future contributors.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Understanding npm packaging (especially bipartite)&lt;/strong>&lt;br>
Since Brahma includes both client- and server-side logic, I spent time understanding how multi-part npm packages are published and maintained. I explored best practices around versioning, distribution, and separating internal vs public modules.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Creating a documentation system&lt;/strong>&lt;br>
After exploring different approaches (and with my mentor’s help), I set up a static documentation site using &lt;a href="https://jsdoc.app/" target="_blank" rel="noopener">JSDoc&lt;/a> with the Docdash theme. The current version includes guides, API references, and contribution instructions. This is just the beginning—the docs will evolve as the community grows.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="whats-next">What’s Next&lt;/h2>
&lt;p>In the second half of the project, I’ll be focusing on:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Building a routing system&lt;/strong>&lt;br>
For both documentation and example apps, so that users can easily browse through different components and use cases.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Setting up UI and 3D infrastructure&lt;/strong>&lt;br>
To make it easier for others to start building apps with Brahma by providing clean base layers for interface and spatial development.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Prepping for the first public release&lt;/strong>&lt;br>
Publishing the Brahma NPM package along with a curated set of featured examples and contributor-friendly documentation—making it easier for developers to get started and contribute.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>This project has truly been the highlight of my summer. Learning about WebXR, Three.js, and open-source workflows has been both exciting and rewarding. Every challenge taught me something new.&lt;/p>
&lt;p>I am especially grateful to my mentor &lt;strong>Samir Ghosh&lt;/strong> for his constant support, patience, and guidance. It’s been a privilege learning from you!&lt;/p>
&lt;p>I&amp;rsquo;m looking forward to what’s coming next as we get closer to the first public release of Brahma!&lt;/p></description></item><item><title>Mid-term Blog: Rectilinear Floorplans in OpenROAD - Progress Update</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20250803-gschaitanya/</link><pubDate>Sun, 03 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20250803-gschaitanya/</guid><description>&lt;h1 id="mid-term-progress-enabling-rectilinear-floorplanning-in-openroad">Mid-term Progress: Enabling Rectilinear Floorplanning in OpenROAD&lt;/h1>
&lt;p>Hello! I&amp;rsquo;m excited to share my mid-term progress on implementing &lt;strong>rectilinear (polygonal) die support&lt;/strong> in OpenROAD&amp;rsquo;s floorplanning flow as part of Google Summer of Code 2025. Under the guidance of my mentors Eder Monteiro and Augusto Berndt, we&amp;rsquo;ve made significant strides in extending OpenROAD to handle non-rectangular die shapes.&lt;/p>
&lt;p>Here&amp;rsquo;s a link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/mcv3Hbgk" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>My project focuses on extending OpenROAD&amp;rsquo;s floorplanning capabilities to support rectilinear die shapes. This enhancement is crucial for modern VLSI design flows involving advanced packaging, 2.5D/3D ICs, and irregular chiplet-based designs where non-rectangular dies are increasingly common.&lt;/p>
&lt;p>The core challenge is maintaining robustness and backward compatibility while introducing this major new feature that touches multiple aspects of the design flow.&lt;/p>
&lt;h2 id="progress-made">Progress Made&lt;/h2>
&lt;h3 id="1-tcl-frontend-modification-and-input-parsing">1. &lt;strong>Tcl Frontend modification and Input Parsing&lt;/strong>&lt;/h3>
&lt;p>Successfully extended the Tcl interface to accept rectilinear die specifications:&lt;/p>
&lt;ul>
&lt;li>The &lt;code>initialize_floorplan&lt;/code> command can now accept a list of coordinates specifying a polygon and automatically trigger the rectilinear floorplanning flow.&lt;/li>
&lt;li>Added robust input validation for rectilinear coordinates&lt;/li>
&lt;li>Ensured backward compatibility with existing rectangular floorplan flows&lt;/li>
&lt;/ul>
&lt;h3 id="2-die-creation-and-validation">2. &lt;strong>Die creation and Validation&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Leveraged the internal structure &lt;code>odb::Polygon&lt;/code> to store the vertices of the shape.&lt;/li>
&lt;li>Implemented polygon validation to ensure shapes are valid rectilinear polygons&lt;/li>
&lt;li>Added proper error handling and user feedback for invalid polygon inputs&lt;/li>
&lt;/ul>
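&lt;p>As a rough illustration of what that validation checks, here is a minimal Python sketch (the real implementation lives in OpenROAD&amp;rsquo;s C++ &lt;code>odb&lt;/code> layer; the function name and test shapes below are hypothetical): a polygon is rectilinear when it has at least four corners and every edge changes exactly one coordinate.&lt;/p>

```python
def is_rectilinear(vertices):
    """Check that a closed polygon has only axis-aligned edges.

    vertices: list of (x, y) tuples; the last vertex implicitly
    connects back to the first. A valid rectilinear polygon needs
    at least 4 corners, and every edge must be horizontal or
    vertical (and non-degenerate).
    """
    if len(vertices) < 4:
        return False
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        dx, dy = x2 - x1, y2 - y1
        # Exactly one of dx, dy may be non-zero: this rejects both
        # diagonal edges and zero-length edges.
        if (dx == 0) == (dy == 0):
            return False
    return True

# A U-shaped die outline passes; a triangle does not.
u_shape = [(0, 0), (30, 0), (30, 20), (20, 20),
           (20, 10), (10, 10), (10, 20), (0, 20)]
print(is_rectilinear(u_shape))                      # True
print(is_rectilinear([(0, 0), (10, 0), (5, 10)]))   # False
```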
&lt;h3 id="3-row-generation-for-rectilinear-dies---major-milestone">3. &lt;strong>Row Generation for rectilinear Dies - Major Milestone&lt;/strong>&lt;/h3>
&lt;p>The most significant achievement has been implementing the &lt;code>make_polygon_rows&lt;/code> functionality, which generates standard-cell rows that conform to the rectilinear die boundaries. This was one of the most challenging aspects of the project. To solve it, we developed an efficient scanline-based approach that fills rectilinear areas with rows by clipping the row lengths according to the boundary of the supplied shape.&lt;/p>
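&lt;p>The scanline idea can be sketched in a few lines: sweep a horizontal line through the die at the row pitch, collect its crossings with the polygon&amp;rsquo;s vertical edges, and pair them up into clipped row segments. The simplified Python sketch below is mine, not OpenROAD&amp;rsquo;s code (actual row creation goes through &lt;code>odb&lt;/code>, and a production version clips against the row&amp;rsquo;s full height rather than its midline):&lt;/p>

```python
def row_segments(vertices, y):
    """X-intervals of the horizontal line at height y inside a
    rectilinear polygon: collect crossings with vertical edges,
    sort them, and pair them up (even-odd rule)."""
    xs = []
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        if x1 == x2 and min(y1, y2) < y < max(y1, y2):
            xs.append(x1)  # scanline crosses this vertical edge
    xs.sort()
    return list(zip(xs[0::2], xs[1::2]))

def make_polygon_rows(vertices, row_height):
    """Generate (y, x_start, x_end) for each standard-cell row,
    clipping row lengths to the polygon boundary. Sampling at the
    row midline (a half-integer for integer geometry) avoids
    degenerate hits on horizontal edges."""
    ys = [v[1] for v in vertices]
    rows = []
    y = min(ys)
    while y + row_height <= max(ys):
        for x_start, x_end in row_segments(vertices, y + row_height / 2):
            rows.append((y, x_start, x_end))
        y += row_height
    return rows

u_shape = [(0, 0), (30, 0), (30, 20), (20, 20),
           (20, 10), (10, 10), (10, 20), (0, 20)]
# Rows below y=10 span the full width; rows above split into two legs.
print(make_polygon_rows(u_shape, 5))
```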
&lt;h3 id="4-testing-and-validation">4. &lt;strong>Testing and Validation&lt;/strong>&lt;/h3>
&lt;p>Tests were added to the regression suite to validate the entire &lt;code>initialize_floorplan&lt;/code> flow.
The changes were successfully merged into OpenROAD. &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD/pull/7893" target="_blank" rel="noopener">PR Link&lt;/a>&lt;/p>
&lt;h2 id="makerows-demo-">makeRows demo :&lt;/h2>
&lt;p>One of our test cases involved generating rows for a &lt;strong>U-shaped die&lt;/strong>. Here is a snapshot from the OpenROAD GUI displaying perfectly laid-out rows:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="U-test" srcset="
/report/osre25/openroad/20250803-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_f3394946aa9b2cf0185fac9d5ffdf4fd.webp 400w,
/report/osre25/openroad/20250803-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_b6ce6605882583b9dc669d52f2b53a54.webp 760w,
/report/osre25/openroad/20250803-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20250803-gschaitanya/image_hu80377f8ad0fd7dd8cd93ba071f99d204_15398_f3394946aa9b2cf0185fac9d5ffdf4fd.webp"
width="760"
height="495"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>I am currently working on the pin placer (&lt;code>ppl&lt;/code>) module and extending it to support rectilinear floorplans. This requires a careful re-evaluation of each step, including the cost function used to optimize pin placement.&lt;/p></description></item><item><title>Mid-Term Update: MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250803-rohan-babbar/</link><pubDate>Sun, 03 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250803-rohan-babbar/</guid><description>&lt;p>Hi everyone! This is my mid-term blog update for the project &lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">MPI Appliance for HPC Research on Chameleon&lt;/a>, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community.
This blog follows up on my earlier post, which you can find &lt;a href="https://ucsc-ospo.github.io/report/osre25/uchicago/mpi/20250614-rohan-babbar/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h3 id="-june-15--june-29-2025">🔧 June 15 – June 29, 2025&lt;/h3>
&lt;p>Worked on creating and configuring images on Chameleon Cloud for the following three sites:
CHI@UC, CHI@TACC, and KVM@TACC.&lt;/p>
&lt;p>Key features of the images:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Spack&lt;/strong>: Pre-installed and configured for easy package management of HPC software.&lt;/li>
&lt;li>&lt;strong>Lua Modules (LMod)&lt;/strong>: Installed and configured for environment module management.&lt;/li>
&lt;li>&lt;strong>MPI Support&lt;/strong>: Both MPICH and Open MPI are pre-installed, enabling users to run distributed applications out-of-the-box.&lt;/li>
&lt;/ul>
&lt;p>These images are now publicly available and can be seen directly on the Chameleon Appliance Catalog, titled &lt;a href="https://chameleoncloud.org/appliances/127/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04)&lt;/a>.&lt;/p>
&lt;p>I also worked on some example Jupyter notebooks on how to get started using these images.&lt;/p>
&lt;h3 id="-june-30--july-13-2025">🔧 June 30 – July 13, 2025&lt;/h3>
&lt;p>With the MPI Appliance now published on Chameleon Cloud, the next step was to automate the setup of an MPI-Spack cluster.&lt;/p>
&lt;p>To achieve this, I developed a set of Ansible playbooks that:&lt;/p>
&lt;ol>
&lt;li>Configure both master and worker nodes with site-specific settings&lt;/li>
&lt;li>Set up seamless access to Chameleon NFS shares&lt;/li>
&lt;li>Allow users to easily install Spack packages, compilers, and dependencies across all nodes&lt;/li>
&lt;/ol>
&lt;p>These playbooks aim to simplify the deployment of reproducible HPC environments and reduce the time required to get a working cluster up and running.&lt;/p>
&lt;h3 id="-july-14--july-28-2025">🔧 July 14 – July 28, 2025&lt;/h3>
&lt;p>This period began with me fixing some issues in python-chi, the official Python client for the Chameleon testbed.
We also discussed adding support for CUDA-based packages, which would make it easier to work with NVIDIA GPUs.
We successfully published a new image on Chameleon, titled &lt;a href="https://chameleoncloud.org/appliances/130/" target="_blank" rel="noopener">MPI and Spack for HPC (Ubuntu 22.04 - CUDA)&lt;/a>, and added an example to demonstrate its usage.&lt;/p>
&lt;p>We compiled the artifact containing the Jupyter notebooks and Ansible playbooks and published it on Chameleon Trovi.
Feel free to check it out &lt;a href="https://chameleoncloud.org/experiment/share/7424a8dc-0688-4383-9d67-1e40ff37de17" target="_blank" rel="noopener">here&lt;/a>. The documentation still needs some work.&lt;/p>
&lt;p>📌 That’s it for now! I’m currently working on the documentation, a ROCm-based image for AMD GPUs, and some container-based examples.
Stay tuned for more updates in the next blog.&lt;/p></description></item><item><title>Midterm blog: CarbonCast Midpoint Update: From Vision to Reality</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250803-tanushsavadi/</link><pubDate>Sun, 03 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250803-tanushsavadi/</guid><description>&lt;p>A few months ago, I shared my vision for making carbon intensity forecasts more accessible through the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/carboncast">CarbonCast project&lt;/a>. My &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/7yvAix3k" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Professor Abel Souza aims to build an API that makes carbon intensity forecasts more accessible and actionable. I had two main goals: expand CarbonCast to work with more regional electricity grids, and transform it from a research project into something that could actually run and be interacted with in the real world.&lt;/p>
&lt;p>Today, I&amp;rsquo;m excited to share that we&amp;rsquo;ve not only hit those goals – we&amp;rsquo;ve exceeded them in ways I didn&amp;rsquo;t expect.&lt;/p>
&lt;h2 id="what-weve-built-so-far">What We&amp;rsquo;ve Built So Far&lt;/h2>
&lt;p>Remember how I mentioned that CarbonCast needed to support more regional grids? Well, we&amp;rsquo;ve gone big. The system now covers 85+ regions across two continents. We&amp;rsquo;re talking about major US grid operators like ERCOT (Texas), CISO (California), PJM (Mid-Atlantic), MISO (Midwest), and NYISO (New York), plus we&amp;rsquo;ve expanded into European countries like Germany, France, Spain, and the UK.&lt;/p>
&lt;p>But here&amp;rsquo;s the thing – collecting weather data for carbon intensity forecasting isn&amp;rsquo;t as simple as just downloading a few files. Each region needs four different types of weather data: solar radiation (for solar power predictions), wind patterns (for wind power), temperature and humidity (for energy demand), and precipitation (which affects both supply and demand). That means we&amp;rsquo;re managing data collection for over 340 different combinations of regions and weather variables.&lt;/p>
&lt;h2 id="the-automation-challenge">The Automation Challenge&lt;/h2>
&lt;p>When I started this project, I quickly realized that manually managing data collection for this many regions would be impossible. We&amp;rsquo;re talking about thousands of data requests, each taking time to process, with various things that can go wrong along the way.&lt;/p>
&lt;p>So we built something I&amp;rsquo;m really proud of: an intelligent automation system that handles 95% of the work without human intervention. That means 19 out of every 20 data collection tasks happen automatically, even when things go wrong.&lt;/p>
&lt;p>The system is smart about it too. It knows when to speed up data collection, when to slow down to avoid overwhelming the servers, and how to recover when errors happen. We&amp;rsquo;ve achieved 99% data completeness, which means almost every piece of weather data we need actually makes it into our system successfully.&lt;/p>
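&lt;p>The recovery behavior described above (retry transient failures, back off exponentially, add jitter so thousands of requests don&amp;rsquo;t retry in lockstep after an outage) is a standard pattern. Here is a hedged Python sketch; the real system is more involved, and &lt;code>fetch&lt;/code> and its failure modes are stand-ins:&lt;/p>

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0):
    """Retry a flaky data request with exponential backoff plus
    jitter; raise only after max_attempts failures so the error
    can surface to monitoring."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts - 1:
                raise  # out of retries: let monitoring see it
            # Exponential delay with random jitter to de-synchronize
            # many concurrent retriers.
            delay = base_delay * 2 ** attempt * (0.5 + random.random())
            time.sleep(delay)

# Simulate a request that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("server busy")
    return "GRIB data"

print(fetch_with_backoff(flaky, base_delay=0.01))  # GRIB data
```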
&lt;h2 id="making-it-production-ready">Making It Production-Ready&lt;/h2>
&lt;p>The biggest challenge was taking CarbonCast from a research project that worked on my laptop to something that could run reliably for weeks without me babysitting it. This meant building in all the boring but crucial stuff that makes software actually work in the real world.&lt;/p>
&lt;p>We created a comprehensive error handling system that can automatically recover from 95% of the problems it encounters. Network hiccups, server timeouts, data format changes – the system handles these gracefully and keeps running.&lt;/p>
&lt;p>There&amp;rsquo;s also a real-time monitoring dashboard that shows exactly what&amp;rsquo;s happening across all regions. I can see which areas are collecting data successfully, which ones might be having issues, and get alerts if anything needs attention. It&amp;rsquo;s like having a mission control center for carbon data.&lt;/p>
&lt;h2 id="the-dashboard-mission-control-for-carbon-data">The Dashboard: Mission Control for Carbon Data&lt;/h2>
&lt;p>Let me show you what this monitoring system actually looks like. We built a comprehensive web dashboard that gives us real-time visibility into everything that&amp;rsquo;s happening:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Dashboard Overview" srcset="
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-overview_hucf78c7d7b58d9515d431a2744915c5c5_523170_def2a560c75da61de5422b7a6a6dbc38.webp 400w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-overview_hucf78c7d7b58d9515d431a2744915c5c5_523170_5fcfb689e6283d1720e50da81cfb540f.webp 760w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-overview_hucf78c7d7b58d9515d431a2744915c5c5_523170_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-overview_hucf78c7d7b58d9515d431a2744915c5c5_523170_def2a560c75da61de5422b7a6a6dbc38.webp"
width="760"
height="456"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>The main dashboard showing real-time system metrics and status across all regions&lt;/em>&lt;/p>
&lt;p>The dashboard shows key metrics at a glance – total requests, completion rates, and active regions. But it goes much deeper than that. You can drill down into individual requests to see their complete lifecycle:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Request Details" srcset="
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-requests_hu4620d0cbd193aecbbe0c5858e2ba9128_195009_876f419901e0b51127b81f1f37bf33f6.webp 400w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-requests_hu4620d0cbd193aecbbe0c5858e2ba9128_195009_3ae7b2ac3a29478b49913635f43aac19.webp 760w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-requests_hu4620d0cbd193aecbbe0c5858e2ba9128_195009_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-requests_hu4620d0cbd193aecbbe0c5858e2ba9128_195009_876f419901e0b51127b81f1f37bf33f6.webp"
width="760"
height="458"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Detailed view of individual data requests showing processing timelines and status&lt;/em>&lt;/p>
&lt;p>Each request card shows everything from the initial request time to when the data becomes available for download. This level of visibility is crucial when you&amp;rsquo;re managing hundreds of data requests across different regions and weather variables.&lt;/p>
&lt;p>The regional analytics view shows how well we&amp;rsquo;re doing across different grid operators:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Regional Analytics" srcset="
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-regions_hubaa80dcd4d7309dd18fca00b148c0f0f_628115_913b55e9f6633983aaaaf25607ac13bf.webp 400w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-regions_hubaa80dcd4d7309dd18fca00b148c0f0f_628115_950373fdefaf9bd595da010d29c37849.webp 760w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-regions_hubaa80dcd4d7309dd18fca00b148c0f0f_628115_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-regions_hubaa80dcd4d7309dd18fca00b148c0f0f_628115_913b55e9f6633983aaaaf25607ac13bf.webp"
width="760"
height="445"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Regional breakdown showing completion status across different electricity grid operators&lt;/em>&lt;/p>
&lt;p>What I&amp;rsquo;m particularly proud of is the error handling dashboard. When things do go wrong (which they inevitably do with any large-scale data system), we can see exactly what happened and how the system recovered:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Error Management" srcset="
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-errors_hua5a5c30a5cd8b72a26622f5af77b2406_480389_2864a9c4d56dcc6220d2fe406daddc17.webp 400w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-errors_hua5a5c30a5cd8b72a26622f5af77b2406_480389_ca1e5cbfdb24da4f1e6531c7be2eed54.webp 760w,
/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-errors_hua5a5c30a5cd8b72a26622f5af77b2406_480389_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250803-tanushsavadi/dashboard-errors_hua5a5c30a5cd8b72a26622f5af77b2406_480389_2864a9c4d56dcc6220d2fe406daddc17.webp"
width="760"
height="254"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Error tracking and resolution system showing 100% success rate in region mapping&lt;/em>&lt;/p>
&lt;p>The fact that we&amp;rsquo;re showing &amp;ldquo;No unknown regions found&amp;rdquo; means our coordinate-based region detection system is working perfectly – every weather data request gets properly mapped to the right electricity grid.&lt;/p>
&lt;h2 id="the-technical-foundation">The Technical Foundation&lt;/h2>
&lt;p>Under the hood, we&amp;rsquo;ve built what I&amp;rsquo;d call enterprise-grade infrastructure. The system can run autonomously for weeks, automatically organizing data by region and weather type, managing storage efficiently, and even optimizing its own performance based on what it learns.&lt;/p>
&lt;p>We&amp;rsquo;ve also created comprehensive testing systems to make sure everything works reliably. When you&amp;rsquo;re dealing with data that people might use to make real decisions about when to charge their electric vehicles or run their data centers, reliability isn&amp;rsquo;t optional.&lt;/p>
&lt;p>The architecture follows a modular, service-oriented design with clear separation between data collection, processing, monitoring, and user interfaces. This makes it much easier to maintain and extend as we add new features.&lt;/p>
&lt;h2 id="why-this-matters">Why This Matters&lt;/h2>
&lt;p>All of this infrastructure work might sound technical, but it&amp;rsquo;s directly connected to the original vision: making carbon intensity forecasts accessible to everyone.&lt;/p>
&lt;p>With this foundation in place, we can now provide reliable, up-to-date weather data for carbon intensity forecasting across major electricity grids in North America and Europe. That means developers building carbon-aware applications, companies trying to reduce their emissions, and individuals wanting to time their energy use for lower environmental impact all have access to the data they need.&lt;/p>
&lt;h2 id="whats-next-breaking-down-carboncast">What&amp;rsquo;s Next: Breaking Down CarbonCast&lt;/h2>
&lt;p>The next phase is where things get really exciting. Now that we have this solid data collection foundation, we&amp;rsquo;re going to break down CarbonCast itself into modular components. This will make it easier for developers to integrate carbon intensity forecasting into their own applications, whether that&amp;rsquo;s a smart home system, a cloud computing platform, or a mobile app that helps people make greener energy choices.&lt;/p>
&lt;h2 id="looking-back">Looking Back&lt;/h2>
&lt;p>When I started this project, I knew we needed better infrastructure for carbon data. What I didn&amp;rsquo;t expect was how much we&amp;rsquo;d end up building – or how well it would work. We&amp;rsquo;ve created something that can reliably collect and organize weather data across two continents, handle errors gracefully, and run without constant supervision.&lt;/p>
&lt;p>More importantly, we&amp;rsquo;ve built the foundation that will make it possible for anyone to access accurate carbon intensity forecasts. Whether you&amp;rsquo;re a developer building the next generation of carbon-aware applications or someone who just wants to know the best time to do laundry to minimize your environmental impact, the infrastructure is now there to support those decisions.&lt;/p>
&lt;p>The vision of making carbon data accessible and actionable is becoming reality, one automated data collection at a time.&lt;/p>
&lt;h2 id="impact-beyond-research">Impact Beyond Research&lt;/h2>
&lt;p>This work builds directly on the foundation of &lt;em>Multi-day Forecasting of Electric Grid Carbon Intensity using Machine Learning&lt;/em>, transforming research into practical, real-world infrastructure. We&amp;rsquo;re not just making carbon intensity forecasts more accurate – we&amp;rsquo;re making them accessible to everyone who wants to reduce their environmental impact.&lt;/p>
&lt;p>The open-source nature of CarbonCast means that anyone can run, contribute to, and benefit from this work. Whether you&amp;rsquo;re a developer building carbon-aware applications, a policymaker working on grid decarbonization strategies, or a sustainability-conscious individual looking to reduce your carbon footprint, the tools are now there to make informed, impactful choices.&lt;/p>
&lt;p>Looking ahead, I&amp;rsquo;m excited to see how this infrastructure will enable the next generation of carbon-aware computing and smart energy decisions.&lt;/p></description></item><item><title>Mid-Term Report: Uncovering the True Sources of Non-Reproducibility in AI for Science</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250725-wbq321/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250725-wbq321/</guid><description>&lt;p>Hello, I&amp;rsquo;m Baiqiang. I’m excited to share a mid-term update from the &lt;a href="https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/" target="_blank" rel="noopener">Enhancing Reproducibility in RAG Frameworks for Scientific Workflows&lt;/a> project. This journey, mentored by Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo and Dongfang Zhao, has taken a fascinating and unexpected turn, leading to a much deeper understanding of what it takes to build truly reliable AI for science.&lt;/p>
&lt;h3 id="the-search-for-an-invisible-bug">The Search for an Invisible Bug&lt;/h3>
&lt;p>As a quick recap, our project tackles the critical problem of &lt;strong>non-determinism&lt;/strong> in Retrieval-Augmented Generation (RAG) systems. For science to be trustworthy, it must be repeatable. If an AI system gives different answers to the same question, it fails this fundamental test. Our initial goal, outlined in my &lt;a href="https://www.overleaf.com/read/fcbxtpngdnhw#8cc2c8" target="_blank" rel="noopener">proposal&lt;/a>, was to find and fix the sources of this inconsistency, which we believed lay within the retrieval algorithms themselves.&lt;/p>
&lt;p>To do this, we built a comprehensive testing framework capable of running thousands of controlled experiments. We designed it to meticulously measure the consistency of retrieval results while varying everything from the indexing algorithm to the underlying hardware.&lt;/p>
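&lt;p>At its core, the framework&amp;rsquo;s consistency measurement is simple: run the same query repeatedly and compare the ranked result lists. Below is a minimal pure-Python sketch of that check, where brute-force inner-product search stands in for FAISS; it illustrates the measurement, not the project&amp;rsquo;s actual code:&lt;/p>

```python
def top_k(query, vectors, k=3):
    """Exact inner-product retrieval: score every vector and
    return the ids of the k best, ties broken by id."""
    scores = [(sum(q * v for q, v in zip(query, vec)), i)
              for i, vec in enumerate(vectors)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scores[:k]]

def consistency(query, vectors, runs=100):
    """Fraction of repeated runs returning an identical ranking;
    1.0 means fully reproducible on this stack."""
    first = top_k(query, vectors)
    same = sum(top_k(query, vectors) == first for _ in range(runs))
    return same / runs

vectors = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.1]]
print(consistency([1.0, 0.0], vectors))  # 1.0 for this deterministic stand-in
```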
&lt;h3 id="a-surprising-discovery-the-usual-suspect-is-innocent">A Surprising Discovery: The Usual Suspect is Innocent&lt;/h3>
&lt;p>The common wisdom in the community is that high-performance, approximate search libraries like FAISS are a major source of randomness. We put this to the test, running repeated queries against various index types, including complex ones like &lt;code>HNSW&lt;/code> and &lt;code>IndexIVF&lt;/code>.&lt;/p>
&lt;p>Our results were clear and surprising: &lt;strong>FAISS is remarkably reproducible out of the box.&lt;/strong> When run on a consistent hardware and software stack, it returns the exact same results, every single time. The library appears to have robust internal seed management that ensures deterministic behavior.&lt;/p>
&lt;p>This finding was a pivotal moment. The non-reproducibility that researchers observe in practice is real, but it doesn&amp;rsquo;t come from where we expected. The problem isn&amp;rsquo;t the algorithm itself, but the environment it runs in. Our investigation immediately shifted to find the real culprits.&lt;/p>
&lt;h3 id="pinpointing-the-true-sources-of-non-determinism">Pinpointing the True Sources of Non-Determinism&lt;/h3>
&lt;p>Our framework quickly helped us identify the true sources of inconsistency:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Hardware-Induced Variation (CPU vs. GPU):&lt;/strong> This is the most significant factor. Running the exact same retrieval code can produce different document rankings and even different document sets when executed on a CPU versus a GPU. This is likely due to subtle differences in floating-point arithmetic and library optimizations in the hardware stack.&lt;/li>
&lt;li>&lt;strong>The Impact of Numerical Precision:&lt;/strong> We also confirmed that changing the floating-point precision of the data (e.g., from FP32 to FP16) can introduce small numerical variations that are just large enough to reorder the results, potentially changing the evidence the LLM receives.&lt;/li>
&lt;/ol>
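&lt;p>The precision effect is easy to demonstrate with the standard library alone: Python&amp;rsquo;s &lt;code>struct&lt;/code> module supports IEEE 754 half precision, and two retrieval scores that are distinct in full precision can collapse into a tie in FP16, leaving the final ranking to implementation-defined tie-breaking. A small illustration (not our actual test harness):&lt;/p>

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE 754 half precision
    ('e' format, round-to-nearest-even)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

# Two retrieval scores that differ at full precision...
a, b = 1.0002, 1.0001
assert a > b
# ...become indistinguishable in FP16 (the spacing of representable
# values near 1.0 is 2**-10), so which document ranks first now
# depends on tie-breaking.
print(to_fp16(a) == to_fp16(b))  # True
```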
&lt;h3 id="our-mission-refined-building-tools-for-environmental-control">Our Mission Refined: Building Tools for Environmental Control&lt;/h3>
&lt;p>This discovery has sharpened our project&amp;rsquo;s mission. The challenge is not to &amp;ldquo;fix&amp;rdquo; a supposedly random algorithm, but to develop the tools and best practices to control for the entire experimental environment. Our focus for the second half of the project is to:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Develop a Hardware-Aware Configuration Tracker:&lt;/strong> We are building a tool that goes beyond logging software versions. It will capture the critical details of the hardware environment—CPU/GPU model, CUDA version, etc.—and link them directly to an experiment&amp;rsquo;s results.&lt;/li>
&lt;li>&lt;strong>Create a Cross-Environment Validation Suite:&lt;/strong> Our open-source benchmarking suite will empower researchers to test their own pipelines. Crucially, it will help them identify and diagnose inconsistencies when moving workflows between different machines, such as from a local laptop to a cloud-based GPU.&lt;/li>
&lt;li>&lt;strong>Establish New Best Practices:&lt;/strong> We will distill our findings into clear, actionable guidance. The key recommendation is no longer just about choosing the right algorithm, but ensuring a consistent and well-documented hardware and software environment to guarantee reproducible outcomes.&lt;/li>
&lt;/ol>
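&lt;p>A first cut of such a configuration tracker can be assembled from the standard library alone; GPU details (device model, driver, CUDA version) would come from vendor tooling such as &lt;code>nvidia-smi&lt;/code>, which is stubbed out in this hypothetical sketch:&lt;/p>

```python
import json
import platform
import sys

def capture_environment():
    """Snapshot the software/hardware context of an experiment so
    results can be linked to the stack that produced them."""
    return {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        # A real tracker would query nvidia-smi / rocm-smi here for
        # GPU model, driver, and CUDA/ROCm versions (stubbed out).
        "gpu": None,
    }

# Attach the snapshot to an experiment record (name is illustrative).
record = {"experiment": "hnsw_repro_run_01", "env": capture_environment()}
print(json.dumps(record, indent=2))
```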
&lt;p>By following the evidence, we’ve uncovered the root cause of a critical problem in AI-driven research. We are now developing the solutions needed to manage it, paving the way for a future where scientific discoveries powered by AI are built on a foundation of verifiable trust.&lt;/p></description></item><item><title>Midterm Update: Building Intelligent Observability for NRP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250801-manish-reddy/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250801-manish-reddy/</guid><description>&lt;p>I&amp;rsquo;m pleased to share the progress we&amp;rsquo;ve made on my OSRE 2025 project, &amp;ldquo;&lt;em>Intelligent Observability for Seam: A GenAI Approach&lt;/em>&amp;rdquo; since my initial announcement. We&amp;rsquo;re working toward our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.&lt;/p>
&lt;h2 id="how-our-agents-support-the-observability-mission">How Our Agents Support the Observability Mission&lt;/h2>
&lt;p>We&amp;rsquo;ve been developing specialized agents and tools that work together to support our original project vision:&lt;/p>
&lt;h3 id="1-prometheus-metrics-analysis-agent">1. Prometheus Metrics Analysis Agent&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Continuously ingests and processes NRP&amp;rsquo;s Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve implemented initial data pipelines for key system metrics&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Provides the foundation for anomaly detection by establishing normal behavior baselines&lt;/li>
&lt;/ul>
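&lt;p>The baseline idea can be sketched in a few lines. This is a simplified stand-in, not the agent itself: it assumes metric samples have already been pulled from Prometheus into a plain list, and flags values outside a k-sigma band around the learned mean.&lt;/p>

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize 'normal' behavior for one metric as mean and stddev.

    Illustrative only: the real agent ingests live Prometheus series;
    here samples is assumed to be a list of floats already fetched.
    """
    return {"mean": mean(samples), "stdev": stdev(samples)}

def is_anomalous(value, baseline, k=3.0):
    """Flag values more than k standard deviations from the baseline mean."""
    return abs(value - baseline["mean"]) > k * baseline["stdev"]

# Hypothetical CPU-utilization samples under normal load.
normal_cpu = [41.0, 39.5, 40.2, 42.1, 38.9, 40.7, 41.3, 39.8]
baseline = build_baseline(normal_cpu)
print(is_anomalous(40.5, baseline))  # within the normal band
print(is_anomalous(97.0, baseline))  # a saturation spike
```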
&lt;h3 id="2-query-refinement-agent-croq">2. Query Refinement Agent (CROQ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Clarifies ambiguous metrics or patterns before generating explanations&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve implemented a basic version of Conformal Revision of Questions to resolve metric ambiguities&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Aims to ensure explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: We hope this will improve accuracy of GenAI explanations by eliminating misinterpretations&lt;/li>
&lt;/ul>
&lt;h3 id="3-explanation-generation-agent-ais">3. Explanation Generation Agent (AIS)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Function&lt;/strong>: Creates human-readable explanations and root-cause analysis&lt;/li>
&lt;li>&lt;strong>Progress&lt;/strong>: We&amp;rsquo;ve built a prototype of the Automated Information Seeker with a Plan→Validate→Execute→Assess→Revise cycle&lt;/li>
&lt;li>&lt;strong>Purpose&lt;/strong>: Transforms technical anomalies into actionable insights for operators&lt;/li>
&lt;li>&lt;strong>Deliverable Impact&lt;/strong>: Intended to directly deliver on the GenAI explanation component of our tool&lt;/li>
&lt;/ul>
&lt;h2 id="integration-progress">Integration Progress&lt;/h2>
&lt;p>We&amp;rsquo;re working to connect our agents into a unified observability pipeline:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Data Collection&lt;/strong>: Prometheus metrics → Analysis Agent&lt;/li>
&lt;li>&lt;strong>Anomaly Detection&lt;/strong>: With statistical confidence bounds (in development)&lt;/li>
&lt;li>&lt;strong>Query Refinement&lt;/strong>: Resolving ambiguities before explanation&lt;/li>
&lt;li>&lt;strong>Explanation Generation&lt;/strong>: Human-readable analysis with uncertainty awareness&lt;/li>
&lt;li>&lt;strong>Feedback Loop&lt;/strong>: System learning from operator interactions (planned)&lt;/li>
&lt;/ol>
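&lt;p>The five stages above can be sketched as a chain of small functions. Each function here is a hypothetical stand-in for the corresponding agent (a fixed threshold replaces the statistical confidence bounds, and the planned feedback loop is omitted), just to show how the pieces connect.&lt;/p>

```python
def collect_metrics():
    """Stage 1: stand-in for the Prometheus ingestion agent."""
    return {"cpu_percent": 97.0}

def detect_anomalies(metrics, threshold=90.0):
    """Stage 2: flag metrics above a fixed threshold.

    The real pipeline uses statistical confidence bounds; a constant
    keeps this sketch self-contained.
    """
    return [name for name, value in metrics.items() if value > threshold]

def refine_query(anomaly):
    """Stage 3: stand-in for the CROQ disambiguation step."""
    return f"Is the {anomaly} anomaly CPU saturation or memory pressure?"

def explain(anomaly, clarified_question):
    """Stage 4: stand-in for the AIS explanation agent."""
    return f"{anomaly} exceeded its bound; follow-up: {clarified_question}"

def run_pipeline():
    metrics = collect_metrics()
    explanations = []
    for anomaly in detect_anomalies(metrics):
        question = refine_query(anomaly)
        explanations.append(explain(anomaly, question))
    return explanations  # Stage 5, the feedback loop, is still planned

for line in run_pipeline():
    print(line)
```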
&lt;h2 id="hardware-testing-opportunity">Hardware Testing Opportunity&lt;/h2>
&lt;p>This project has given us a valuable opportunity to test our observability framework on Qualcomm Cloud AI 100 Ultra hardware. We&amp;rsquo;re beginning to port different LLM architectures specifically for:&lt;/p>
&lt;ul>
&lt;li>Exploring anomaly detection performance on specialized AI hardware&lt;/li>
&lt;li>Testing explanation generation quality across different model architectures&lt;/li>
&lt;li>Comparing GLM-4.5 against other models for observability-specific tasks&lt;/li>
&lt;/ul>
&lt;h2 id="next-phase-completing-the-observability-tool">Next Phase: Completing the Observability Tool&lt;/h2>
&lt;p>For the remainder of OSRE 2025, we&amp;rsquo;re focused on:&lt;/p>
&lt;ol>
&lt;li>Finalizing the integration of all agents into a cohesive anomaly detection tool with Matrix&lt;/li>
&lt;li>Validating that our GenAI explanations help operators resolve issues faster, which we plan to test on the Nautilus Matrix platform&lt;/li>
&lt;li>Optimizing performance on specialized hardware for NRP&amp;rsquo;s scale&lt;/li>
&lt;li>Preparing the open-source release of our intelligent observability tool&lt;/li>
&lt;/ol>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I&amp;rsquo;m deeply grateful to my lead mentor &lt;strong>Mohammad Firas Sada&lt;/strong> for his guidance in keeping our work focused on NRP&amp;rsquo;s observability needs. His insights have been invaluable in navigating the challenges of this project.&lt;/p>
&lt;p>While we&amp;rsquo;ve developed several agents and frameworks, everything we&amp;rsquo;re building serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
&lt;p>I look forward to sharing more progress on our observability tool with GenAI explanations in the coming weeks!&lt;/p></description></item><item><title>Midterm Report : Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/</link><pubDate>Wed, 30 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/</guid><description>&lt;h3 id="refresher-about-the-project">Refresher about the Project&lt;/h3>
&lt;p>Hi everyone! For the last month I have been working with my mentors, Professor &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a>, on our project &lt;a href="https://ucsc-ospo.github.io/project/osre25/nyu/mlops/" target="_blank" rel="noopener">Applying MLOps to overcome reproducibility barriers in machine learning research&lt;/a>. As a refresher, our goal is to build a template generator for reproducible machine learning training workflows on the Chameleon testbed. We want to provide our users with the necessary environment configuration in a handy way, so they won&amp;rsquo;t be overwhelmed by all the intricate details of setting up the environment. This will allow for validation and further development of their setup.&lt;/p>
&lt;hr>
&lt;h3 id="what-we-have-done-so-far">What we have done so far&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="userflow" srcset="
/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_d0aee31c44beeded617d15565a3078b7.webp 400w,
/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_23aab3e41951725ceb2ba1683e8a5455.webp 760w,
/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/userflow_hu8aae690c470ebe5647870c6d86c96c68_71910_d0aee31c44beeded617d15565a3078b7.webp"
width="760"
height="307"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The current workflow begins in JupyterHub, where the user provides basic details such as project name, site, and node type. The notebooks handle key setup tasks, like creating storage buckets, provisioning and configuring a server with GPU support, and mounting buckets locally via rclone. Once the host environment is ready, the user SSHes into that machine, generates the necessary variables via a script, and launches a containerized virtual lab that integrates Jupyter and MLflow. Inside the container, users authenticate with GitHub, connect or initialize their repositories, and can immediately begin training models, with all metrics, artifacts, and environment details logged for reproducibility.&lt;/p>
&lt;p>The progress on the project so far is as follows:&lt;/p>
&lt;h4 id="we-finalized-the-selection-of-frameworks-and-storage-options">We finalized the selection of frameworks and storage options.&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="results" srcset="
/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_cc94f8d2983a972d5d551a1fd1b51c86.webp 400w,
/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_bd2f06761e3836b650d87a84b3ed4d00.webp 760w,
/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/setup_huf1d9f9b29ea3e918ebffad4d45a90b19_52037_cc94f8d2983a972d5d551a1fd1b51c86.webp"
width="760"
height="346"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Artifacts are now logged directly from the MLflow server to the Chameleon object store, without relying on a database backend or an intermediate MinIO S3 layer.&lt;/p>
&lt;h4 id="different-jupyter-lab-images-for-each-framework">Different jupyter lab images for each framework.&lt;/h4>
&lt;p>We’ve started with the top ML frameworks — PyTorch Lightning, Keras/TensorFlow, and Scikit-Learn. Each framework now has its own image, which will later be tailored to the user’s selection.&lt;/p>
&lt;h4 id="github-cli-and-hugging-face-integration-inside-the-container">Github CLI and Hugging Face integration inside the container.&lt;/h4>
&lt;p>The Jupyter container now integrates both the GitHub CLI and Hugging Face authentication. Users can manage their code repositories via GitHub CLI commands and authenticate with Hugging Face tokens to download/upload models and datasets. This eliminates the need for manual credential setup and streamlines ML experimentation within the environment.&lt;/p>
&lt;h4 id="custom-logging-utility">Custom Logging Utility&lt;/h4>
&lt;p>To ensure robust tracking of code versioning and environment details, we added a custom logging utility.&lt;br>
These logs are stored alongside metrics and model artifacts in MLflow, ensuring every experiment is fully documented and reproducible. A summary of the functionalities:&lt;/p>
&lt;hr>
&lt;h5 id="log_git--captures-code-versioning">&lt;code>log_git()&lt;/code> — Captures Code Versioning&lt;/h5>
&lt;p>Uses Git commands (via subprocess) to log:&lt;/p>
&lt;ul>
&lt;li>Current branch name&lt;/li>
&lt;li>Commit hash&lt;/li>
&lt;li>Repository status (clean or dirty)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Example Output:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">commit: a7c3e9d
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">branch: main
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">status: dirty (1 file modified)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># and git diff output
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h5 id="log_python-tracks-the-python-environment">&lt;code>log_python()&lt;/code>— Tracks the Python Environment&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>Platform information + Python environment info (version)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Exports a full pip freeze list to a .txt file&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Saved as an MLflow artifact to guarantee exact package version reproducibility&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Example Output (pip freeze extract):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-txt" data-lang="txt">&lt;span class="line">&lt;span class="cl">numpy==1.26.4
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">pandas==2.2.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">scikit-learn==1.4.2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">torch==2.2.0
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h5 id="log_gpu---records-gpu-information">&lt;code>log_gpu()&lt;/code> - Records GPU Information&lt;/h5>
&lt;ul>
&lt;li>
&lt;p>Detects available GPU devices&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Collects details using NVIDIA’s pynvml or AMD’s ROCm tools&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Logs:&lt;/p>
&lt;/li>
&lt;li>
&lt;p>GPU name&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Driver version&lt;/p>
&lt;/li>
&lt;li>
&lt;p>CUDA/ROCm version&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Captures the vendor&amp;rsquo;s SMI tool output (nvidia-smi or rocm-smi, depending on the GPU type) for deeper inspection&lt;/p>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>These utilities ensure that each run can be traced back with:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>The exact code version&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The full Python environment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>The hardware details used&lt;/p>
&lt;/li>
&lt;/ul>
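&lt;p>A condensed sketch of how the first two utilities might look is below. This is illustrative, not the project’s actual code: the real functions log to MLflow, while these simply return dictionaries, and the git calls fall back to None when run outside a repository.&lt;/p>

```python
import platform
import subprocess
import sys

def _run(cmd):
    """Run a command, returning stripped stdout or None on failure."""
    try:
        return subprocess.run(
            cmd, capture_output=True, text=True, check=True
        ).stdout.strip()
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None

def log_git():
    """Capture code versioning: branch, commit hash, clean/dirty status."""
    status = _run(["git", "status", "--porcelain"])
    return {
        "branch": _run(["git", "rev-parse", "--abbrev-ref", "HEAD"]),
        "commit": _run(["git", "rev-parse", "--short", "HEAD"]),
        "status": "clean" if status == "" else "dirty",
    }

def log_python():
    """Capture platform info and the full installed-package list."""
    return {
        "platform": platform.platform(),
        "python_version": platform.python_version(),
        "pip_freeze": _run([sys.executable, "-m", "pip", "freeze"]),
    }

print(log_python()["python_version"])
```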
&lt;hr>
&lt;h3 id="initial-customizable-template">Initial customizable template&lt;/h3>
&lt;p>We’ve prototyped an initial customizable template using Cookiecutter. It provides an interactive CLI in which users supply key project details (e.g., project name, frameworks, GPU type, and any integrations). Cookiecutter then generates a ready-to-use project structure with pre-configured integrations, reducing manual setup and ensuring consistency across environments.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="template generator"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/generator.gif"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The user will have notebooks to communicate with chameleon testbed resources, containerized environment and custom training scripts to plug their code.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="emelents" srcset="
/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_d71a4a6bed166f1ba25e0480abe6d891.webp 400w,
/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_0451200eb97ac154443b7261da58399a.webp 760w,
/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/07292025-alghali/elements_huf8c1a359b014b199be1f96460f6453ca_50752_d71a4a6bed166f1ba25e0480abe6d891.webp"
width="760"
height="262"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="whats-next">What’s Next&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Template Generation via Config + interactive widgets&lt;/strong>&lt;br>
We are exploring different ways to generate experiment templates using configuration files and interactive widgets in Jupyter notebooks. This would let users quickly customize logging setups and should be more user-friendly.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>AMD-Compatible Images&lt;/strong>&lt;br>
Extend support by building and testing Docker images optimized for AMD GPUs. Up to now, our development efforts have focused on NVIDIA GPUs using CUDA-based images.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>End-to-End Lifecycle Example&lt;/strong>&lt;br>
Provide a larger example demonstrating the entire ML workflow:&lt;/p>
&lt;ul>
&lt;li>Data preparation&lt;/li>
&lt;li>Training with GPU logging&lt;/li>
&lt;li>Tracking metrics, artifacts, and environment info in MLflow&lt;/li>
&lt;li>Model evaluation and logging&lt;/li>
&lt;li>Reproducing results on different hardware backends&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>Working on this project so far has been both challenging and eye-opening. I’ve seen how many moving parts need to come together for a smooth workflow. The support from my mentors has been key in helping me turn challenges into real progress.&lt;/p>
&lt;p>Thank you for following along — I’m looking forward to sharing more concrete results soon.&lt;/p></description></item><item><title>Robot Manipulation with Scenic-RoboSuite</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250730-sahil-tgs/</link><pubDate>Wed, 30 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250730-sahil-tgs/</guid><description>&lt;p>We&amp;rsquo;re &lt;a href="https://sahiltgs.super.site/" target="_blank" rel="noopener">Sahil&lt;/a>, continuing work on the Scenic-RoboSuite integration for GSoC 2025. This &lt;a href="https://sahiltgs.super.site/gsoc/uc-ospo-proposal" target="_blank" rel="noopener">project&lt;/a> is mentored by &lt;a href="https://ucsc-ospo.github.io/author/daniel-fremont/" target="_blank" rel="noopener">Daniel Fremont&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/eric-vin/" target="_blank" rel="noopener">Eric Vin&lt;/a>.&lt;/p>
&lt;p>Since the last update, the &lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a>-&lt;a href="https://robosuite.ai/" target="_blank" rel="noopener">RoboSuite&lt;/a> interface has made significant progress. The bidirectional bridge is now functional - robots can read sensor data and execute behaviors based on observations. However, these features are still in early stages and we&amp;rsquo;re working on making them more stable and consistent.&lt;/p>
&lt;p>We&amp;rsquo;ve integrated RoboSuite&amp;rsquo;s Operational Space Control into Scenic. This control method lets you command the robot&amp;rsquo;s hand directly in 3D space (like &amp;ldquo;move 10cm left&amp;rdquo;) instead of calculating complex joint rotations. While the integration works, it&amp;rsquo;s rough around the edges and we&amp;rsquo;re currently focused on stabilizing it across different scenarios.&lt;/p>
&lt;p>The main challenge was architectural - RoboSuite expects all robot commands bundled together each timestep, while Scenic processes them one by one. We solved this with a pending actions system that collects everything first, then executes in one go. Time synchronization was another challenge, matching Scenic&amp;rsquo;s steps with MuJoCo&amp;rsquo;s physics.&lt;/p>
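&lt;p>The pending-actions idea can be sketched as follows. This is a simplified stand-in for the pattern described above, not the actual interface: real RoboSuite actions are per-robot control vectors, while this sketch uses plain placeholder values.&lt;/p>

```python
class PendingActionBuffer:
    """Collect per-robot commands issued one by one, then flush them
    as a single bundled action for one simulator timestep.

    Illustrative sketch of the pending-actions pattern: Scenic-side
    agents submit individually; the simulator side flushes once.
    """

    def __init__(self, robot_names):
        self.robot_names = list(robot_names)
        self.pending = {}

    def submit(self, robot, action):
        """Scenic side: agents register actions one at a time."""
        self.pending[robot] = action

    def flush(self, default=None):
        """Simulator side: emit one bundled command per timestep,
        filling in a default for robots that issued nothing."""
        bundle = [self.pending.get(name, default) for name in self.robot_names]
        self.pending.clear()
        return bundle

buffer = PendingActionBuffer(["panda", "sawyer"])
buffer.submit("panda", "move_left_10cm")
print(buffer.flush(default="hold"))
```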
&lt;p>We&amp;rsquo;ve implemented a basic pick-and-place behavior for basic testing. The robot reads sensor data, calculates where to move, and adjusts continuously. It can successfully grasp and lift objects, though consistency varies between runs. The system supports three robot models and works with RoboSuite&amp;rsquo;s pre-built environments.&lt;/p>
&lt;p>Custom world building is currently on hold. We&amp;rsquo;ve decided to focus on integrating existing RoboSuite features into Scenic first, then build Scenic&amp;rsquo;s capabilities like dynamic scenario randomization on top. For our first prototype, we&amp;rsquo;re aiming to extend the pick-and-place behavior into a full randomization demo - Scenic will randomly position the cube each run, and the robot will adapt to find and grasp it regardless of location.&lt;/p>
&lt;p>The next two weeks focus on stabilizing current features and preparing this randomized scenario prototype. Expanding the behavior library and supporting additional environments will come in future phases after we have a solid foundation.&lt;/p>
&lt;p>The core bridge between Scenic and RoboSuite is operational, but there&amp;rsquo;s significant work ahead to make it reliable and user-friendly.&lt;/p></description></item><item><title>Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uutah/type-narrowing/20250729-sivasathyaseelan/</link><pubDate>Tue, 29 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uutah/type-narrowing/20250729-sivasathyaseelan/</guid><description>&lt;p>Hello! I’m Siva Sathyaseelan D N, a pre-final year B.Tech + M.Tech Engineering student at IIT BHU, Varanasi, India. With a deep-rooted passion for software development and scientific computing. I thrive at the intersection of code and real-world problem-solving. For two years, I’ve engaged in open-source work across scientific simulation, blockchain, and cloud-native technologies, through hobby projects, hackathons, internships, and an LFX mentee. I&amp;rsquo;m contributing to&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uutah/type-narrowing/">Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/bennn">Ben Greenman&lt;/a>. My proposal can be viewed &lt;a href="https://docs.google.com/document/d/1QcfiOWQQBxTW3YnkCmgfz-xHwLGad4OuCMjyphbaz54/edit?usp=sharing" target="_blank" rel="noopener">here&lt;/a>!&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Gradual typing enhances untyped languages like JavaScript and Python with static type checkers in systems like TypeScript, Flow, Mypy, Pyright, and Typed Racket, using type narrowing to refine types via runtime checks (e.g., &lt;code>typeof item["price"] === "number"&lt;/code>). Designs vary: TypeScript permits unverified predicates, Flow ensures soundness, and Typed Racket tracks types compositionally. These differences prompted the If-T benchmark &lt;a href="https://github.com/utahplt/ift-benchmark" target="_blank" rel="noopener">ift-benchmark&lt;/a>, which evaluates narrowing across five languages but omits tools like Sorbet, Hack, Luau, Pyre, Cinder/Static Python, Typed Clojure, and Elixir; the risks of unsound narrowings also remain unclear.&lt;/p>
&lt;p>&lt;strong>Objectives&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Extend the If-T benchmark to Sorbet, Hack, Luau, Pyre, Cinder/Static Python, Typed Clojure, and potentially Elixir.&lt;/li>
&lt;li>Analyze their type narrowing precision, expressiveness, and soundness.&lt;/li>
&lt;li>Conduct a corpus study of TypeScript or Python code using GitHub or Software Heritage APIs.&lt;/li>
&lt;li>Assess the prevalence and exploit potential of unsound narrowings.&lt;/li>
&lt;li>Link corpus findings to benchmark results for broader insights.&lt;/li>
&lt;/ul>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>During the first half of the SoR 2025 period, I focused on extending the If-T benchmark to Sorbet, Pyre, Cinder/Static Python, and Typed Clojure. These are the PRs that extend the If-T benchmark:&lt;/p>
&lt;ul>
&lt;li>Sorbet -&amp;gt; &lt;a href="https://github.com/utahplt/ifT-benchmark/pull/20" target="_blank" rel="noopener">https://github.com/utahplt/ifT-benchmark/pull/20&lt;/a>&lt;/li>
&lt;li>Pyre -&amp;gt; &lt;a href="https://github.com/utahplt/ifT-benchmark/pull/26" target="_blank" rel="noopener">https://github.com/utahplt/ifT-benchmark/pull/26&lt;/a>&lt;/li>
&lt;li>Typed Clojure -&amp;gt; &lt;a href="https://github.com/utahplt/ifT-benchmark/pull/27" target="_blank" rel="noopener">https://github.com/utahplt/ifT-benchmark/pull/27&lt;/a>&lt;/li>
&lt;li>Cinder -&amp;gt; &lt;a href="https://github.com/utahplt/ifT-benchmark/pull/28" target="_blank" rel="noopener">https://github.com/utahplt/ifT-benchmark/pull/28&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>Next, I will conduct a corpus study of TypeScript or Python code using the GitHub or Software Heritage APIs, assess the prevalence and exploit potential of unsound narrowings, and link the corpus findings to the benchmark results for broader insights (&lt;a href="https://github.com/utahplt/TGUsage" target="_blank" rel="noopener">TGUsage&lt;/a>).&lt;/p>
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>Working on &lt;strong>Type Narrowing&lt;/strong> has been incredibly rewarding; it’s more than just code. Studying the type systems of different programming languages is essential for large-scale software systems and software security, and I’m honored to be a part of that.&lt;/p>
&lt;p>Big thanks to my mentor &lt;strong>Ben Greenman&lt;/strong> for their support and thoughtful feedback throughout. I’ve learned a ton already, and I can’t wait to keep building.&lt;/p>
&lt;p>Hey everyone,&lt;/p>
&lt;p>If you’ve ever wondered what it takes to make AI data pipelines not just smarter, but safer and more transparent, you’re in the right place. The last few weeks working on AIDRIN for GSoC have been a deep dive into the engine room of privacy and backend systems that power the AIDRIN project. My focus has been on building out the core privacy infrastructure and backend features that power AIDRIN’s ability to give users real, actionable insights about their data. It’s been challenging, sometimes messy, but incredibly rewarding to see these changes make a tangible difference.&lt;/p>
&lt;p>Having Dr. Jean Luca Bez and Prof. Suren Byna as mentors, along with the support of the entire team, has truly made all the difference. Their guidance, encouragement, and collaborative spirit have been a huge part of this journey, whether I’m brainstorming new ideas or just trying to untangle a tricky bug.&lt;/p>
&lt;h2 id="privacy-metrics-making-data-safer">Privacy Metrics: Making Data Safer&lt;/h2>
&lt;p>A major part of my work has been putting data privacy at the front and center in AIDRIN. I focused on integrating essential privacy metrics like k-anonymity, l-diversity, t-closeness, and more, making sure they’re not just theoretical checkboxes, but real tools that users can interact with and understand. Now, these metrics are fully wired up in the backend and visualized in AIDRIN, so privacy risks are no longer just a vague concern. They are something AI data preparers can actually see and act on. Getting these metrics to work seamlessly with different datasets and ensuring their accuracy took some serious backend engineering, but the payoff has been worth it.&lt;/p>
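&lt;p>As an illustration of the simplest of these metrics (a sketch only, not AIDRIN’s implementation): a dataset is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The record format and column names below are made up for the example.&lt;/p>

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the k in k-anonymity: the size of the smallest group of
    records sharing the same quasi-identifier values.

    Sketch only: rows is a list of dicts and quasi_identifiers a list
    of column names; AIDRIN's real metric operates on uploaded datasets.
    """
    groups = Counter(
        tuple(row[col] for col in quasi_identifiers) for row in rows
    )
    return min(groups.values())

records = [
    {"zip": "95064", "age_band": "20-29", "diagnosis": "flu"},
    {"zip": "95064", "age_band": "20-29", "diagnosis": "cold"},
    {"zip": "95060", "age_band": "30-39", "diagnosis": "flu"},
]
# The (95060, 30-39) group has a single record, so k = 1: that
# individual is re-identifiable from zip and age band alone.
print(k_anonymity(records, ["zip", "age_band"]))
```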
&lt;h2 id="speeding-things-up-so-you-dont-have-to-wait-around">Speeding Things Up (So You Don’t Have To Wait Around)&lt;/h2>
&lt;p>As AIDRIN started handling bigger datasets, some calculations became time-consuming because the data has to be accessed every time a metric is computed. To address this, I added caching for previously computed metrics, like class imbalance and privacy checks, and set up asynchronous execution with Celery and Redis. This should make the app much more responsive. Rather than waiting for heavy computations to finish, users can take notes about other metrics or explore different parts of the app while their results load in the background. It’s a small change, but it helps keep the workflow moving smoothly.&lt;/p>
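&lt;p>The caching pattern itself is simple to sketch. The version below is illustrative only: AIDRIN uses Celery and Redis, whereas here a dictionary stands in for Redis and the function names and toy metric are assumptions for the example.&lt;/p>

```python
import hashlib
import json

_cache = {}  # stand-in for Redis

def cache_key(dataset_bytes, metric_name):
    """Key results by dataset content hash plus metric name, so
    recomputing a metric on the same data reuses previous work."""
    digest = hashlib.sha256(dataset_bytes).hexdigest()
    return f"{metric_name}:{digest}"

def compute_metric_cached(dataset_bytes, metric_name, compute_fn):
    key = cache_key(dataset_bytes, metric_name)
    if key not in _cache:
        _cache[key] = compute_fn(dataset_bytes)  # the expensive path
    return _cache[key]

calls = []
def class_imbalance(data):
    """Toy metric: fraction of rows in the majority class."""
    calls.append(1)  # count how often the expensive path runs
    labels = json.loads(data)
    return max(labels.count(x) for x in set(labels)) / len(labels)

data = json.dumps(["a", "a", "a", "b"]).encode()
print(compute_metric_cached(data, "class_imbalance", class_imbalance))
print(compute_metric_cached(data, "class_imbalance", class_imbalance))
print(len(calls))  # the expensive computation ran only once
```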
&lt;h2 id="small-touch-ups-that-hopefully-make-a-big-difference">Small Touch Ups That (Hopefully) Make a Big Difference&lt;/h2>
&lt;p>I also spent time on the details that make the app easier to use. Tooltips now explain what the privacy metrics actually mean, error messages are clearer, and there’s a new cache info page where you can see and clear your cached data. The sensitive attribute dropdown is less confusing now, especially if you’re working with quasi-identifiers. These tweaks might seem minor, but they add up and make the app friendlier for everyone.&lt;/p>
&lt;h2 id="docs-docs-docs">Docs, Docs, Docs&lt;/h2>
&lt;p>I’m a big believer that good documentation is just as important as good code. I updated the docs to cover all the new features, added citations for the privacy metrics, and made the install process a bit more straightforward. Hopefully, this means new users and contributors can get up to speed without too much hassle.&lt;/p>
&lt;h2 id="huge-thanks-to-my-mentors-and-the-team">Huge Thanks to My Mentors and the Team&lt;/h2>
&lt;p>I really want to shine a light on Dr. Bez, Prof. Byna, and the entire AIDRIN team here. Their encouragement, practical advice, and collaborative spirit have been a huge part of my progress. Whether I’m stuck on a bug, brainstorming a new feature, or just need a second opinion, there’s always someone ready to help me think things through. Their experience and support have shaped not just the technical side of my work, but also how I approach problem-solving and teamwork.&lt;/p>
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Looking ahead, I’m planning to expand AIDRIN’s support for multimodal datasets and keep refining the privacy and fairness modules. There’s always something new to learn or improve, and I’m excited to keep building. If you’re interested in data quality, privacy, or open-source AI tools, I’d love to connect and swap ideas.&lt;/p>
&lt;p>Thanks for reading and for following along with my GSoC journey. I’ll be back soon with more updates!&lt;/p>
&lt;p>&lt;em>This is the second post in my 3-part GSoC series with AIDRIN. Stay tuned for the final update.&lt;/em>&lt;/p></description></item><item><title>Halfway Blog - WildBerryEye: Mechanical Design &amp; Weather-Resistant Enclosure</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250725-teolangan/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250725-teolangan/</guid><description>&lt;p>Hi everyone! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/teolangan">Teodor Langan&lt;/a>, and I am an undergraduate studying Robotics Engineering at the University of California, Santa Cruz. I’m happy to share the progress I have been able to make over the last six weeks on my GSoC 2025 project. Over the last six weeks, I have been working on developing the hardware for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/">WildBerryEye&lt;/a> project, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a>.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The WildBerryEye project enables AI-powered ecological monitoring using Raspberry Pi cameras and computer vision models. However, achieving this requires a reliable enclosure that can support long-term deployment in the wild. The goal for my project is to address this need by designing a modular, 3D-printable camera casing that protects WildBerryEye’s electronics from outside factors such as rain, dust, and bugs, while remaining easy to print and assemble. To achieve this, my main responsibilities for this project include:&lt;/p>
&lt;ul>
&lt;li>Implementing a modular design and development-friendly features for ease of assembly and flexible use across hardware setups&lt;/li>
&lt;li>Prototyping and testing enclosures outdoors to assess durability, water resistance, and ventilation—then iterating based on results&lt;/li>
&lt;li>Developing clear documentation, assembly instructions, and designing with open-source tools&lt;/li>
&lt;li>Exploring material options and print techniques to improve outdoor lifespan and environmental resilience&lt;/li>
&lt;/ul>
&lt;p>Designed largely with FreeCAD and tested in real outdoor conditions, the open-source enclosure will ensure WildBerryEye hardware can be deployed in natural environments for continuous, low-maintenance data collection.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>Over the past 6 weeks, great progress has been made on the design of the WildBerryEye camera enclosure. Some key accomplishments include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Full 3D Assembly Model of Electronics:&lt;/strong> Modeled all core components used in the WildBerryEye system to serve as a reference for enclosure design. For parts without existing CAD models, accurate measurements were taken and custom models were created in FreeCAD.&lt;/li>
&lt;li>&lt;strong>Initial Enclosure Prototype:&lt;/strong> Designed and 3D-printed a first full prototype featuring a hinge-latch mechanism that allows easy, tool-free access to internal electronics for development and maintenance.&lt;/li>
&lt;li>&lt;strong>Design Iteration Based on Testing:&lt;/strong> Based on the results of the first print, created an improved version with better electronics integration, port alignment, and more functionality.&lt;/li>
&lt;/ul>
&lt;h2 id="challenges--next-steps">Challenges &amp;amp; Next Steps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Field-Ready Integration:&lt;/strong> Preparing for field testing with upcoming prototypes by making sure that all internal electronics are securely mounted and fully accessible within the enclosure.&lt;/li>
&lt;li>&lt;strong>Latch Mechanism Refinement:&lt;/strong> Finalizing a reliable hinge-latch design that can keep the enclosure sealed during outdoor use while remaining easy to open for maintenance.&lt;/li>
&lt;li>&lt;strong>Balancing Modularity, Size, and Weatherproofing:&lt;/strong> Maintaining a compact form factor without compromising on modularity or weather resistance—especially when routing cables and mounting components.&lt;/li>
&lt;li>&lt;strong>Material Experimentation:&lt;/strong> Beginning test prints with TPU, a flexible filament that may provide improved seals or gaskets for added protection.&lt;/li>
&lt;li>&lt;strong>Ventilation Without Exposure:&lt;/strong> Exploring airflow solutions such as labyrinth-style vents to enable heat dissipation without letting in moisture or debris.&lt;/li>
&lt;/ul>
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>These past six weeks have helped me grow immensely in mechanical design, CAD modeling, and field-focused prototyping. The WildBerryEye system can help researchers monitor pollinators and other wildlife in their natural habitats without requiring constant in-person observation or high-maintenance setups. By enabling long-term, autonomous data collection in outdoor environments, it opens new possibilities for low-cost, scalable ecological monitoring.&lt;/p>
&lt;p>I’m especially grateful to my mentor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a> and the WildBerryEye team for their ongoing support. Excited for the second half, where the design will face real-world testing and help bring this impactful system one step closer to field deployment!&lt;/p></description></item><item><title>Mid-term Blog: Building a Simulator for Benchmarking Replicated Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-mchan/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-mchan/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello there, I&amp;rsquo;m Michael. In this report, I&amp;rsquo;ll be sharing my progress as part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fadhil-kurnia/">Fadhil Kurnia&lt;/a>.&lt;/p>
&lt;h2 id="about-the-project">About the Project&lt;/h2>
&lt;p>The goal of the project is to build a &lt;em>language-agnostic&lt;/em> interface that enables communication between clients and any consensus protocol, such as MultiPaxos, Raft, Zookeeper Atomic Broadcast (ZAB), and others. Currently, many of these protocols implement their own custom mechanisms for the client to communicate with the group of peers in the network. The implementation of MultiPaxos from the &lt;a href="https://arxiv.org/abs/2405.11183" target="_blank" rel="noopener">MultiPaxos Made Complete&lt;/a> paper, for example, uses a custom Protobuf definition for the packets clients send to the MultiPaxos system. With a generalized interface, different consensus protocols can be tested under the same workload and their performance compared objectively.&lt;/p>
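&lt;p>To make the idea concrete, here is a minimal sketch (in Python, with hypothetical field names and a JSON encoding that are my own illustration, not the project&amp;rsquo;s actual API) of a protocol-agnostic command envelope that each consensus implementation could translate into its own wire format:&lt;/p>

```python
import json

# Hypothetical protocol-agnostic command envelope: each consensus
# implementation would translate this one format into its own messages
# (e.g., the custom Protobuf packets used by MultiPaxos Made Complete).
def make_command(client_id, seq, op, key, value=None):
    return {
        "client_id": client_id,  # identifies the issuing client
        "seq": seq,              # per-client sequence number, for deduplication
        "op": op,                # "GET", "PUT", or "DELETE"
        "key": key,
        "value": value,
    }

def encode(cmd):
    # A neutral JSON encoding; a real interface might use Protobuf instead.
    return json.dumps(cmd, sort_keys=True)

cmd = make_command("client-1", 7, "PUT", "x", "42")
decoded = json.loads(encode(cmd))
```

&lt;p>Because every protocol would receive the same envelope, a single workload generator could drive MultiPaxos, Raft, or ZAB without protocol-specific client code.&lt;/p>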
&lt;h2 id="progress">Progress&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Literature Study:&lt;/strong>
Reviewed papers and implementations of various protocols including GigaPaxos, Raft, Viewstamped Replication (VSR), and ZAB. Analysis focused on their log replication strategies, fault handling, and performance implications.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Development of Custom Protocol:&lt;/strong>
Two custom protocols are currently under development and will serve as initial test subjects for the testbed:&lt;/p>
&lt;ul>
&lt;li>A modified GigaPaxos protocol&lt;/li>
&lt;li>A Primary-Backup Replication protocol with strict log ordering similar to ZAB (logs are ordered based on the sequence proposed by the primary)&lt;/li>
&lt;/ul>
&lt;p>Most of my time has been spent working on the two protocols, particularly on snapshotting and state transfer functionality in the Primary-Backup protocol. Ideally, the testbed should be able to evaluate protocol performance in scenarios involving node failure or a new node being added. In these scenarios, different protocol implementations often vary in their decision of whether to take periodic snapshots or to roll forward whenever possible and generate a snapshot only when necessary.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Early in the project, the goal was to benchmark different consensus protocols using arbitrary full-stack web applications as their workload. Each protocol would replicate a full-stack application running inside Docker containers across multiple nodes, and the testbed would send requests that the protocol must coordinate across those nodes. In fact, the two custom protocols described above were designed specifically to fit these constraints.&lt;/p>
&lt;p>Developing a custom protocol that supports the replication of a Docker container is in itself a difficult task. Abstracting away the functionality for communicating with the Docker containers, as well as handling log entries and snapshotting state, is an order of magnitude more complicated.&lt;/p>
&lt;p>As mentioned in the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250613-mchan/">first blog&lt;/a>, applications can be categorized into two types: deterministic and non-deterministic. The coordination of these two types is handled in very different ways. Most consensus protocols support only deterministic systems, such as key-value stores, and cannot easily handle the coordination of complex services or external side effects. Supporting non-deterministic applications would require abstracting over protocol-specific log structures, which effectively restricts the interface to protocols that conform to the abstraction, defeating the goal of making the interface broadly usable and protocol-agnostic.&lt;/p>
&lt;p>Furthermore, allowing &lt;strong>any&lt;/strong> existing protocol to run something as complex as a stateful Docker container, without the protocol itself even knowing, adds yet another layer of complexity to the system.&lt;/p>
&lt;h2 id="future-goals">Future Goals&lt;/h2>
&lt;p>Given these challenges, I decided to pivot to using only key-value stores as the benchmark application. This aligns with most existing protocol implementations, which typically use key-value stores. The main focus is now to implement an interface that supports HTTP requests from clients to any arbitrary protocol.&lt;/p></description></item><item><title>Midterm Blog: Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/</guid><description>&lt;p>Hello! I&amp;rsquo;m Panji Sri Kuncara Wisma and I want to share my midterm progress on the &amp;ldquo;Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&amp;rdquo; project under the mentorship of Fadhil I. Kurnia.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The goal of our project is to create an open testbed that enables fair, reproducible evaluation of different consensus protocols (Paxos variants, EPaxos, Raft, etc.) when deployed at network edges. Currently, researchers struggle to compare these systems because they lack standardized evaluation environments and often rely on mock implementations of proprietary systems.&lt;/p>
&lt;p>XDN (eXtensible Distributed Network) is one of the important consensus systems we plan to evaluate in our benchmarking testbed. Built on GigaPaxos, it allows deployment of replicated stateful services across edge locations. As part of preparing our benchmarking framework, we need to ensure that the systems we evaluate, including XDN, are robust for fair comparison.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>As part of preparing our benchmarking tool, I have been working on refactoring XDN&amp;rsquo;s FUSE filesystem from C++ to Rust. This work is essential for creating a stable and reliable XDN platform.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="System Architecture" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_5600401ae6570bf38b96fa89a080f4f7.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_6d3b555dbec3bdb305839eda9b227acf.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_5600401ae6570bf38b96fa89a080f4f7.webp"
width="760"
height="439"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The diagram above illustrates how the FUSE filesystem integrates with XDN&amp;rsquo;s distributed architecture. On the left, we see the standard FUSE setup where applications interact with the filesystem through the kernel&amp;rsquo;s VFS layer. On the right, the distributed replication flow is shown: Node 1 runs &lt;code>fuselog_core&lt;/code> which captures filesystem operations and generates statediffs, while Nodes 2 and 3 run &lt;code>fuselog_apply&lt;/code> to receive and apply these statediffs, maintaining replica consistency across the distributed system.&lt;/p>
&lt;p>This FUSE component is critical for XDN&amp;rsquo;s operation as it enables transparent state capture and replication across edge nodes. By refactoring this core component from C++ to Rust, we&amp;rsquo;re hopefully strengthening the foundation for fair benchmarking comparisons in our testbed.&lt;/p>
&lt;h3 id="core-work-c-to-rust-fuse-filesystem-migration">Core Work: C++ to Rust FUSE Filesystem Migration&lt;/h3>
&lt;p>XDN relies on a FUSE (Filesystem in Userspace) component to capture filesystem operations and generate &amp;ldquo;statediffs&amp;rdquo; - records of changes that get replicated across edge nodes. The original C++ implementation worked but had memory safety concerns and limited optimization capabilities.&lt;/p>
&lt;p>I worked on refactoring from C++ to Rust, implementing several improvements:&lt;/p>
&lt;p>&lt;strong>New Features Added:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Zstd Compression&lt;/strong>: Reduces statediff payload sizes&lt;/li>
&lt;li>&lt;strong>Adaptive Compression&lt;/strong>: Intelligently chooses compression strategies&lt;/li>
&lt;li>&lt;strong>Advanced Pruning&lt;/strong>: Removes redundant operations (duplicate chmod/chown, created-then-deleted files)&lt;/li>
&lt;li>&lt;strong>Bincode Serialization&lt;/strong>: Helps avoid manual serialization code and reduces the risk of related bugs&lt;/li>
&lt;li>&lt;strong>Extended Operations&lt;/strong>: Added support for additional filesystem operations (mkdir, symlink, hardlinks, etc.)&lt;/li>
&lt;/ul>
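&lt;p>As an illustration of the pruning idea (the real implementation is in Rust; the operation names and tuple layout below are simplified assumptions of mine), redundant operations can be dropped in two passes over the recorded log:&lt;/p>

```python
def prune(ops):
    # ops: list of (op, path, payload) tuples in execution order.
    # Pass 1: find files created and later deleted, and note the index of the
    # last chmod/chown per path (earlier ones are superseded).
    created, deleted, last_meta = set(), set(), {}
    for i, (op, path, _) in enumerate(ops):
        if op == "create":
            created.add(path)
        elif op == "delete" and path in created:
            deleted.add(path)
        if op in ("chmod", "chown"):
            last_meta[(op, path)] = i
    # Pass 2: emit only the operations that affect the final state.
    pruned = []
    for i, (op, path, payload) in enumerate(ops):
        if path in deleted:
            continue  # created-then-deleted file: no effect on final state
        if op in ("chmod", "chown") and last_meta[(op, path)] != i:
            continue  # superseded metadata change
        pruned.append((op, path, payload))
    return pruned

ops = [
    ("create", "/tmp/scratch", None),
    ("write", "/tmp/scratch", b"tmp"),
    ("delete", "/tmp/scratch", None),
    ("chmod", "/var/db/f", 0o644),
    ("chmod", "/var/db/f", 0o600),
    ("write", "/var/db/f", b"row"),
]
pruned = prune(ops)  # keeps only the last chmod and the surviving write
```

&lt;p>The sketch ignores corner cases such as a file being re-created after deletion, which a production implementation must handle.&lt;/p>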
&lt;p>&lt;strong>Architectural Improvements:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Memory Safety&lt;/strong>: Rust&amp;rsquo;s ownership system helps prevent common memory management issues&lt;/li>
&lt;li>&lt;strong>Type Safety&lt;/strong>: Using Rust enums instead of integer constants for better type checking&lt;/li>
&lt;/ul>
&lt;h2 id="findings">Findings&lt;/h2>
&lt;p>The optimizations performed as expected:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Database Performance Comparison" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_cb1ea5caaa82d543dfeabd0c97f7c4fe.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_d65f44ef3f769dddda7f0211b94ad6b6.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_cb1ea5caaa82d543dfeabd0c97f7c4fe.webp"
width="760"
height="433"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Statediff Size Reductions:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>MySQL workload&lt;/strong>: 572MB → 29.6MB (95% reduction)&lt;/li>
&lt;li>&lt;strong>PostgreSQL workload&lt;/strong>: 76MB → 11.9MB (84% reduction)&lt;/li>
&lt;li>&lt;strong>SQLite workload&lt;/strong>: 4MB → 29KB (99% reduction)&lt;/li>
&lt;/ul>
&lt;p>The combination of write coalescing, pruning, and compression proves especially effective for database workloads, where many operations involve small changes to large files.&lt;/p>
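&lt;p>The write-coalescing step can be sketched as follows (a simplified Python illustration of the idea, not the actual Rust code): contiguous writes to the same file are merged so that each statediff entry covers one larger range instead of many tiny ones.&lt;/p>

```python
def coalesce(writes):
    # writes: (path, offset, data) tuples in order. Merge each write that
    # starts exactly where the previous write to the same file ended.
    merged = []
    for path, off, data in writes:
        if merged:
            prev_path, prev_off, prev_data = merged[-1]
            if prev_path == path and prev_off + len(prev_data) == off:
                merged[-1] = (prev_path, prev_off, prev_data + data)
                continue
        merged.append((path, off, data))
    return merged

# Three small writes collapse into two statediff entries.
merged = coalesce([("f", 0, b"ab"), ("f", 2, b"cd"), ("f", 10, b"z")])
```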
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Rust vs C&amp;#43;&amp;#43; Performance Comparison" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_2adee964972897a04e60327dcfe9675e.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_dd86a6fc0dabbac3beb17266f1f49002.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_2adee964972897a04e60327dcfe9675e.webp"
width="760"
height="470"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Performance Comparison:&lt;/strong>
Remarkably, the Rust implementation matches or exceeds C++ performance:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>POST operations&lt;/strong>: 30% faster (10.5ms vs 15ms)&lt;/li>
&lt;li>&lt;strong>DELETE operations&lt;/strong>: 33% faster (10ms vs 15ms)&lt;/li>
&lt;li>&lt;strong>Overall latency&lt;/strong>: Consistently better (9ms vs 11ms)&lt;/li>
&lt;/ul>
&lt;h2 id="current-challenges">Current Challenges&lt;/h2>
&lt;p>While the core implementation is complete and functional, I&amp;rsquo;m currently debugging occasional latency spikes that occur under specific workload patterns. These edge cases need to be resolved before moving on to the benchmarking phase, as inconsistent performance could compromise the reliability of the evaluation.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>With the FUSE filesystem foundation nearly complete, next steps include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Resolve latency spike issues&lt;/strong> and complete XDN stabilization&lt;/li>
&lt;li>&lt;strong>Build benchmarking framework&lt;/strong> - a comparison tool that can systematically evaluate different consensus protocols with standardized metrics.&lt;/li>
&lt;li>&lt;strong>Run systematic evaluation&lt;/strong> across protocols&lt;/li>
&lt;/ol>
&lt;p>The optimized filesystem will hopefully provide a stable base for reproducible performance comparisons between distributed consensus protocols.&lt;/p></description></item><item><title>Reproducibility of Interactive Notebooks in Distributed Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/07252025-rahmad/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/07252025-rahmad/</guid><description>&lt;p>I am sharing an overview of my project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad">Reproducibility of Interactive Notebooks in Distributed Environments&lt;/a> and an update at the midway mark.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>This project aims to improve the reproducibility of interactive notebooks executed in distributed environments. Notebooks such as those in the &lt;a href="https://jupyter.org/" target="_blank" rel="noopener">Jupyter&lt;/a> environment have become increasingly popular and are widely used in the scientific community due to their ease of use and portability. Reproducing these notebooks is a challenging task, especially in a distributed cluster environment.&lt;/p>
&lt;p>In the distributed environments we consider, the notebook code is divided into manager and worker code. The manager code is the main entry point of the program, which divides the task at hand into one or more worker tasks that run in a parallel, distributed fashion. We utilize several open-source tools to package and containerize the application code so that it can be reproduced across different machines and environments. They include &lt;a href="https://github.com/radiant-systems-lab/sciunit" target="_blank" rel="noopener">Sciunit&lt;/a>, &lt;a href="https://github.com/radiant-systems-lab/Flinc" target="_blank" rel="noopener">FLINC&lt;/a>, and &lt;a href="https://cctools.readthedocs.io/en/stable/taskvine/" target="_blank" rel="noopener">TaskVine&lt;/a>. These are the high-level goals of this project:&lt;/p>
&lt;ol>
&lt;li>Generate execution logs for a notebook program.&lt;/li>
&lt;li>Generate code and data dependencies for notebook programs in an automated manner.&lt;/li>
&lt;li>Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.&lt;/li>
&lt;li>Audit and package the notebook code running in a distributed environment.&lt;/li>
&lt;/ol>
&lt;h1 id="progress-so-far">Progress So Far&lt;/h1>
&lt;p>Here are the details of the progress made so far.&lt;/p>
&lt;h2 id="generation-of-execution-logs">Generation of Execution Logs&lt;/h2>
&lt;p>We generate execution logs for the notebook programs in a distributed environment using the Linux utility &lt;a href="https://man7.org/linux/man-pages/man1/strace.1.html" target="_blank" rel="noopener">strace&lt;/a>, which records every system call made by the notebook, including every file accessed during execution. We collect separate logs for the manager and the worker code, since they run on different machines and have different dependencies. By recording the entire notebook execution, we capture all libraries, packages, and data files it references in the form of execution logs. These logs are then used for further analysis.&lt;/p>
&lt;h2 id="extracting-software-dependencies">Extracting Software Dependencies&lt;/h2>
&lt;p>When the notebook program uses a library, such as the Python package &lt;em>Numpy&lt;/em>, an entry is made in the execution log containing the complete path of the accessed library file(s) along with additional information. We analyze the execution logs of both the manager and the workers to find and list all dependencies. So far, we are limited to Python packages, though the methodology is general and can be used to find dependencies for any programming language. For Python packages, version numbers are also obtained by querying package managers like &lt;em>pip&lt;/em> or &lt;em>Conda&lt;/em> on the local system.&lt;/p>
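&lt;p>The extraction step can be sketched as follows (the log excerpt and regular expression below are illustrative stand-ins, not the actual pipeline code):&lt;/p>

```python
import re

# Illustrative excerpt of an execution log produced by strace; each
# successful openat of a site-packages file reveals a Python dependency.
LOG = """\
openat(AT_FDCWD, "/usr/lib/python3.10/site-packages/numpy/__init__.py", O_RDONLY) = 3
openat(AT_FDCWD, "/usr/lib/python3.10/site-packages/numpy/core/numeric.py", O_RDONLY) = 4
openat(AT_FDCWD, "/home/user/data/input.csv", O_RDONLY) = 5
openat(AT_FDCWD, "/home/user/missing.txt", O_RDONLY) = -1 ENOENT (No such file or directory)
"""

PKG_RE = re.compile(r'site-packages/([A-Za-z0-9_]+)/')

def extract_packages(log_text):
    pkgs = set()
    for line in log_text.splitlines():
        if "= -1" in line:
            continue  # failed opens are not real dependencies
        m = PKG_RE.search(line)
        if m:
            pkgs.add(m.group(1))
    return sorted(pkgs)

packages = extract_packages(LOG)  # each name can then be versioned via pip/conda
```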
&lt;h2 id="extracting-data-dependencies">Extracting Data Dependencies&lt;/h2>
&lt;p>We utilize similar execution logs to identify which data files the notebook program used. The list of logged files also contains various configuration or settings files used by certain packages and libraries. These files are removed from the list of data dependencies through post-processing that analyzes file paths.&lt;/p>
&lt;h2 id="testing-the-pipeline">Testing the Pipeline&lt;/h2>
&lt;p>We have conducted our experiments on three use cases from different domains, using between 5 and 10 workers: distributed image convolution, climate trend analysis, and high-energy physics experiment analysis. The results so far are promising, with good accuracy and only a slight running-time overhead.&lt;/p>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>The next steps in this project are as follows:&lt;/p>
&lt;ol>
&lt;li>Generate the execution logs and dependencies in a notebook at the level of each cell of code.&lt;/li>
&lt;li>Utilize the dependencies at multiple levels of granularities with the goal of automating the deployment and execution of notebooks in a parallel and distributed environment.&lt;/li>
&lt;li>Audit notebook program execution in a distributed environment and package it into a container on a single node.&lt;/li>
&lt;/ol>
&lt;p>I am very happy about the experience I have had so far in this project and I am excited about the milestones to come.
Stay tuned!&lt;/p></description></item><item><title>[MidTerm] Building PeerSky’s Settings System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/peersky/2025-07-24-6cobi/</link><pubDate>Thu, 24 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/peersky/2025-07-24-6cobi/</guid><description>&lt;p>Hi everyone, I’m Hanzhong Liu. My project focuses on building a secure and extensible &lt;code>peersky://settings&lt;/code> system for the &lt;a href="https://github.com/p2plabsxyz/peersky-browser" target="_blank" rel="noopener">PeerSky browser&lt;/a>, a decentralized and privacy-first browser built on Electron.&lt;/p>
&lt;p>This post is a midterm check-in covering what’s been implemented so far — from IPC architecture to real-time theme and wallpaper updates — and a preview of what’s coming next.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>Peersky’s new settings system is designed to unify browser preferences (themes, search engine, appearance, extensions, etc.) into a single modular interface. It’s accessible via a protocol route (&lt;code>peersky://settings&lt;/code>) and built using web-standard HTML/CSS, scoped APIs, and Electron’s context isolation model.&lt;/p>
&lt;h3 id="key-design-goals">Key Design Goals:&lt;/h3>
&lt;ul>
&lt;li>Secure preload-based API exposure via &lt;code>contextBridge&lt;/code>&lt;/li>
&lt;li>Fast access to user preferences with zero-flicker wallpaper updates&lt;/li>
&lt;li>Extensibility for bookmarks, future plugins, and privacy tools&lt;/li>
&lt;/ul>
&lt;h2 id="midterm-progress-highlights">Midterm Progress Highlights&lt;/h2>
&lt;h3 id="electron-integration">Electron Integration&lt;/h3>
&lt;p>Rather than using &lt;code>webFrame.executeJavaScript()&lt;/code>, I implemented preload-scoped APIs using &lt;code>contextBridge&lt;/code> and &lt;code>ipcRenderer&lt;/code> to prevent injection vulnerabilities and ensure synchronous availability during early page load. Each internal protocol (settings, home, bookmarks) is granted its own API access level.&lt;/p>
&lt;blockquote>
&lt;p>Code: &lt;a href="https://github.com/p2plabsxyz/peersky-browser/blob/main/src/pages/unified-preload.js" target="_blank" rel="noopener">&lt;code>src/pages/unified-preload.js&lt;/code>&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h3 id="modular-settings-page">Modular Settings Page&lt;/h3>
&lt;p>The UI lives in a single HTML file with sidebar-based navigation (Appearance, Search, Bookmarks, Extensions). Each section updates independently using event-driven IPC and live sync.&lt;/p>
&lt;h3 id="wallpaper--theme-switching">Wallpaper &amp;amp; Theme Switching&lt;/h3>
&lt;ul>
&lt;li>Supports both built-in wallpapers and custom uploads&lt;/li>
&lt;li>Background applies instantly using &lt;code>sendSync()&lt;/code> during preload&lt;/li>
&lt;li>Themes (light, dark, system) are controlled using root-level CSS variables and real-time IPC events&lt;/li>
&lt;/ul>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://hackmd.io/_uploads/S11q7M1Dee.png" alt="Wallpaper" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="cache--search-engine">Cache &amp;amp; Search Engine&lt;/h3>
&lt;ul>
&lt;li>Added IPC handler to clear both Electron session and P2P cache directories (&lt;code>ipfs/&lt;/code>, &lt;code>hyper/&lt;/code>)&lt;/li>
&lt;li>Settings API allows switching between DuckDuckGo, Ecosia, and Startpage via dropdown&lt;/li>
&lt;/ul>
&lt;h3 id="example-adding-a-new-setting-autosave">Example: Adding a New Setting (&lt;code>autoSave&lt;/code>)&lt;/h3>
&lt;p>I also documented how developers can add new settings like &lt;code>autoSave&lt;/code> using:&lt;/p>
&lt;ul>
&lt;li>&lt;code>settings-manager.js&lt;/code> for default values and validation&lt;/li>
&lt;li>Preload event listeners (&lt;code>onAutoSaveChanged&lt;/code>)&lt;/li>
&lt;li>UI toggles and save logic in &lt;code>settings.js&lt;/code>&lt;/li>
&lt;/ul>
&lt;p>Documentation link: &lt;a href="https://github.com/p2plabsxyz/peersky-browser/blob/main/docs/settings.md" target="_blank" rel="noopener">Settings Guide&lt;/a>&lt;/p>
&lt;h2 id="reflection">Reflection&lt;/h2>
&lt;p>I’m really thankful for the mentorship I’ve received from Akhilesh Thite. His guidance has been the perfect balance of autonomy and support. He challenged me to reason clearly about technical choices, especially when I thought some of them were minor and not worth paying attention to. His feedback helped me write cleaner, better-scoped code. This project has helped me grow as a software engineer in ways I didn’t fully anticipate, and I&amp;rsquo;ve enjoyed it so much.&lt;/p>
&lt;p>You can explore the project here:&lt;br>
&lt;a href="https://github.com/p2plabsxyz/peersky-browser" target="_blank" rel="noopener">https://github.com/p2plabsxyz/peersky-browser&lt;/a>&lt;/p></description></item><item><title>Midterm for Smart Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/</link><pubDate>Thu, 24 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/</guid><description>&lt;h2 id="what-is-envgym">What is EnvGym?&lt;/h2>
&lt;p>EnvGym is a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.&lt;/p>
&lt;p>EnvGym addresses this gap by leveraging LLM-powered agents to analyze project instructions, resolve dependencies, configure execution environments, and validate results—thereby reducing human overhead and improving reproducibility at scale.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="EnvGym Cover" srcset="
/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp 400w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_d39b2369b5df80ffa715197c993f0681.webp 760w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/cover_hue02fdf353b4e99cf1af213026c4f6804_1815797_30e3b2194be140fa608780847e6c7fa1.webp"
width="760"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;h3 id="new-tools">New Tools&lt;/h3>
&lt;p>Initially, our agent had access to only one tool: the command line. This constrained the agent’s ability to decompose complex tasks and respond flexibly to failures. Over the last few weeks, we introduced a modular tool system, enabling the agent to handle specific subtasks more effectively.&lt;/p>
&lt;p>The new toolset includes:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>dockerrun: Executes Dockerfiles.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>hardware_checking, hardware_adjustment: Tailor builds to available resources.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>history_manager, stats: Track historical data for improvement and reproducibility.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>planning: Generates high-level execution plans.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>summarize: Interprets build results to adjust subsequent iterations.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>writing_docker_initial, writing_docker_revision: Generate and refine Dockerfiles.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>While some of these tools, such as dockerrun, are straightforward programmatic scripts, others, such as planning, are more complex and use LLMs themselves.&lt;/p>
&lt;h3 id="agent-re-architecture-moving-beyond-codex">Agent Re-Architecture: Moving Beyond Codex&lt;/h3>
&lt;p>We transitioned away from OpenAI&amp;rsquo;s Codex agent implementation. While powerful, Codex&amp;rsquo;s framework was overly reliant on its CLI frontend, which added unnecessary complexity and limited customizability for our research context.&lt;/p>
&lt;p>We implemented our own lightweight, customizable agent pipeline that integrates LLM-based planning with iterative execution. Conceptually, the agent executes the following loop:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Repo Scanning&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Hardware Check&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Planning &amp;amp; Initial Dockerfile Generation&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Docker Execution&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Progress Summarization &amp;amp; Adjustment&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Iterative Dockerfile Refinement (up to 20 rounds)&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Success Check &amp;amp; Logging&lt;/p>
&lt;/li>
&lt;/ul>
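&lt;p>Conceptually, the loop above can be sketched like this (the function names are illustrative stand-ins for the tools, not EnvGym&amp;rsquo;s actual API):&lt;/p>

```python
MAX_ROUNDS = 20  # cap on iterative Dockerfile refinement

def run_agent(repo, scan, check_hardware, plan, build, summarize, revise):
    # Each argument is a tool callback; in EnvGym some wrap plain scripts
    # (like dockerrun) and others wrap LLM calls (like planning).
    context = scan(repo)
    context["hardware"] = check_hardware()
    dockerfile = plan(context)            # planning + initial Dockerfile
    for round_no in range(1, MAX_ROUNDS + 1):
        result = build(dockerfile)        # dockerrun: execute the Dockerfile
        if result["success"]:
            return {"success": True, "rounds": round_no}
        feedback = summarize(result)      # interpret logs to guide the next try
        dockerfile = revise(dockerfile, feedback)
    return {"success": False, "rounds": MAX_ROUNDS}

# Toy run where the build succeeds on the third attempt.
attempts = {"n": 0}
def fake_build(df):
    attempts["n"] += 1
    return {"success": attempts["n"] == 3}

out = run_agent(
    "repo/",
    scan=lambda r: {"repo": r},
    check_hardware=lambda: {"gpu": False},
    plan=lambda ctx: "FROM python:3.11",
    build=fake_build,
    summarize=lambda res: "missing dependency",
    revise=lambda df, fb: df + "\nRUN pip install missing-dep",
)
```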
&lt;p>This new agent design is easier to control, extend, and debug—aligning better with the needs of reproducibility research.&lt;/p>
&lt;h3 id="prompt-engineering">Prompt Engineering&lt;/h3>
&lt;p>For each tool that requires an LLM to function, we created a set of custom prompts that outline the task and break down the goals. For instance, the prompt used in summarize differs from the one in planning, allowing us to optimize the behavior of the LLM agents per context.&lt;/p>
&lt;h3 id="performance-gains">Performance Gains&lt;/h3>
&lt;p>With these improvements, EnvGym now successfully replicates 9 repositories, surpassing our baseline Codex agent, which struggled with the same set. We’ve observed more reliable planning, better handling of edge-case dependencies, and faster convergence in iterative Dockerfile revisions.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;h3 id="granular-evaluation-metric">Granular Evaluation Metric&lt;/h3>
&lt;p>We plan to adopt a tree-structured rubric-based evaluation, inspired by PaperBench. Instead of binary success/failure, each repo will be assigned a reproducibility score from 0–100.&lt;/p>
&lt;p>Key tasks include:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Rubric Design: Define a hierarchical rubric with criteria like dependency resolution, test success rate, runtime match, etc.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Manual Annotation: Build a dataset of ground-truth rubrics for a subset of repos to calibrate our automatic judge.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Judge Implementation: Develop an LLM-based judge function that takes (i) rubric and (ii) environment state, and returns a reproducibility score.&lt;/p>
&lt;/li>
&lt;/ul>
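&lt;p>Under one plausible reading of such a rubric tree (leaf criteria scored in [0, 1], internal nodes taking the weighted average of their children, and the root scaled to 0-100), the scoring could look like this sketch; the criteria and weights below are examples, not our final rubric:&lt;/p>

```python
# Sketch of scoring a tree-structured rubric. Assumed semantics: leaves hold a
# score in [0, 1]; internal nodes take the weighted average of their children.
def rubric_score(node):
    if "children" not in node:
        return node["score"]
    total = sum(c["weight"] for c in node["children"])
    return sum(c["weight"] * rubric_score(c) for c in node["children"]) / total

rubric = {
    "children": [
        {"weight": 0.4, "score": 1.0},   # dependency resolution succeeded
        {"weight": 0.4, "score": 0.5},   # half of the test suite passes
        {"weight": 0.2, "score": 0.0},   # runtime behavior did not match
    ],
}
reproducibility_score = 100 * rubric_score(rubric)  # weighted average, 0-100 scale
```

For this example rubric the score comes out to 60.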
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Example of a rubric tree" srcset="
/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_ae181d659b85544bd98fa2bbdbe0c09d.webp 400w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_700416bce638eba7acc49573f12b11b0.webp 760w,
/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250724-sam_huang/rubric-tree_hu9020427fa0020bc8ab99a7f01a351cd0_70521_ae181d659b85544bd98fa2bbdbe0c09d.webp"
width="557"
height="497"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Source: Starace, Giulio, et al. &amp;ldquo;PaperBench: Evaluating AI&amp;rsquo;s Ability to Replicate AI Research.&amp;rdquo; arXiv preprint arXiv:2504.01848 (2025).&lt;/p>
&lt;p>This will make EnvGym suitable for benchmarking. We will run our new method and obtain a score to compare with baseline methods!&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>EnvGym has made strong progress toward automating reproducibility in computational research. Through modularization, agentic design, and prompt optimizations, we’ve surpassed existing baselines and laid the groundwork for even more improvement.&lt;/p>
&lt;p>The upcoming focus on metrics and benchmarking will elevate EnvGym from a functional prototype to a standardized reproducibility benchmark tool and also quantitatively prove that our new agentic method is better than existing tools such as Codex. Excited for what&amp;rsquo;s to come!&lt;/p>
</description></item><item><title>Midway Through GSoC: ENTS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/ents/24-07-2025-devansh/</link><pubDate>Thu, 24 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/ents/24-07-2025-devansh/</guid><description>&lt;h1 id="midway-through-gsoc">Midway Through GSoC&lt;/h1>
&lt;p>Hi everyone! I’m &lt;strong>Devansh Kukreja&lt;/strong>, and I’m excited to share a midterm update on my &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/OPlG0KHV" target="_blank" rel="noopener">Google Summer of Code 2025 project&lt;/a> with the &lt;strong>University of California, Santa Cruz Open Source Program Office (UC OSPO)&lt;/strong> under the &lt;strong>Open Source Research Experience (OSRE)&lt;/strong>. I&amp;rsquo;m contributing to &lt;a href="https://github.com/jlab-sensing/ENTS-backend" target="_blank" rel="noopener">&lt;strong>ENTS&lt;/strong>&lt;/a>, a platform that supports real-time monitoring and visualization of environmental sensor networks.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The &lt;strong>Environmental NeTworked Sensor (ENTS)&lt;/strong> platform is an open-source web portal designed to collect, visualize, and analyze data from distributed sensor networks. It’s used by researchers and citizen scientists to monitor field-deployed sensors measuring soil moisture, temperature, voltage, and current—supporting critical research on sustainability and environmental change.&lt;/p>
&lt;p>My project focuses on improving the platform’s &lt;strong>stability, usability, and extensibility&lt;/strong> through:&lt;/p>
&lt;ul>
&lt;li>Fixing bugs in the data visualization components.&lt;/li>
&lt;li>Enhancing real-time chart synchronization and data point selection.&lt;/li>
&lt;li>Improving overall system error handling and reliability.&lt;/li>
&lt;li>Building a &lt;strong>Logger Registration System&lt;/strong> that enables users to register and configure their logging devices.&lt;/li>
&lt;li>Exploring integration with &lt;strong>The Things Network (TTN)&lt;/strong> to support LoRaWAN-based wireless sensor connectivity.&lt;/li>
&lt;/ul>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>During the first half of the GSoC period, I focused on laying the groundwork for a more robust and user-friendly system. Highlights include:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Enhanced date range logic:&lt;/strong> Improved the way the dashboard selects time periods by automatically choosing a recent two-week window with valid sensor data. This ensures charts always display meaningful insights and avoids showing blank states.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Improved chart rendering:&lt;/strong> Refined how charts behave when there&amp;rsquo;s no data or when unusual values (like negatives) are present. This includes smoother axis alignment and fallback messaging when data is unavailable.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Refactored cell management UI:&lt;/strong> Cleaned up and improved the modals used to manage cells and sensors, fixing several UI/UX issues and bugs to make interactions more intuitive and consistent.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Enabled smart URL syncing:&lt;/strong> The dashboard state now stays in sync with the URL, making it easier to share specific views or navigate back to previous states without losing context.&lt;/p>
&lt;/li>
&lt;/ul>
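&lt;p>As a rough sketch, the enhanced date-range logic above might look like the following (the function name, parameters, and lookback limit are hypothetical, not the actual ENTS code):&lt;/p>

```python
# Hypothetical sketch of the default date-range selection: step back two weeks
# at a time until a window containing sensor readings is found.
from datetime import date, timedelta

def pick_default_window(reading_dates, today, max_lookback_weeks=52):
    end = today
    for _ in range(max_lookback_weeks // 2):
        start = end - timedelta(days=14)
        if any(start <= d <= end for d in reading_dates):
            return start, end                 # first two-week window with data
        end = start                           # otherwise slide further back
    return today - timedelta(days=14), today  # fallback: most recent window

# Example: the last reading was about three weeks ago, so the window slides back once.
start, end = pick_default_window([date(2025, 7, 1)], today=date(2025, 7, 24))
```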
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>In the second half of the program, I’ll be focusing on:&lt;/p>
&lt;ul>
&lt;li>Building out and polishing the &lt;strong>Logger Registration UI&lt;/strong> based on the backend schema and wireframes.&lt;/li>
&lt;li>Finalizing the onboarding flow for field loggers, linking registration data to ingestion and dashboard views.&lt;/li>
&lt;li>Continuing work on LoRaWAN support with &lt;strong>TTN&lt;/strong>, aiming to enable basic OTA provisioning for future deployments.&lt;/li>
&lt;li>Exploring an admin dashboard that helps visualize device health, sync status, and alert on any anomalies.&lt;/li>
&lt;/ul>
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>Working on ENTS has been incredibly rewarding—it’s more than just code. It’s about making tools that help scientists and conservationists understand our changing environment, and I’m honored to be a part of that.&lt;/p>
&lt;p>Big thanks to my mentors &lt;strong>Colleen Josephson&lt;/strong>, &lt;strong>John Madden&lt;/strong>, and &lt;strong>Alec Levy&lt;/strong> for their support and thoughtful feedback throughout. I’ve learned a ton already, and I can’t wait to keep building.&lt;/p></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</link><pubDate>Wed, 23 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250723-tharit/</guid><description>&lt;h1 id="enabling-nccl-gpu-gpu-communication-in-pylops-mpi---google-summer-of-code-project-2025---part-1">Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 1&lt;/h1>
&lt;p>Hello all! 👋 My name is Tharit, and I&amp;rsquo;m a computer science student at the University of Texas at Austin. This summer, I am fortunate to participate in the Google Summer of Code (GSoC) 2025 program, hosted by &lt;a href="https://ucsc-ospo.github.io/" target="_blank" rel="noopener">UC OSPO&lt;/a> and the &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> team. My project focuses on enabling NCCL GPU-to-GPU communication in &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, under the guidance of mentors Matteo Ravasi and Yuxi Hong.&lt;/p>
&lt;p>You might have come across this post if you&amp;rsquo;re a PyLops user interested in scaling &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> with GPU/NCCL support, or if you&amp;rsquo;re exploring GSoC projects and wondering what we are up to. Either way, I hope this post gives you useful insights.&lt;/p>
&lt;h2 id="what-is-pylops-mpi">What is PyLops-MPI?&lt;/h2>
&lt;p>If you&amp;rsquo;ve worked with inverse problems, you&amp;rsquo;ve likely come across &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a>. It&amp;rsquo;s a Python library that provides an extensive suite of linear operators and solvers. Operators are designed with a clear focus on the forward and adjoint pair (A and A.T), whilst solvers take operators and data to solve the associated inverse problem. In fields such as geophysics, astrophysics, or medical imaging, inverse problems are solved routinely to &lt;a href="https://www.ae.utexas.edu/news/inverse-problem-solving-bui-than" target="_blank" rel="noopener">image the Earth, space, or the human body from remote measurements&lt;/a>. In all cases, real-life problems tend to consume a lot of compute and require a lot of memory. PyLops allows users to express these problems in an abstract manner that is reminiscent of the underlying equations whilst not compromising on efficiency.&lt;/p>
&lt;p>&lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> is the distributed extension of PyLops, introduced during &lt;a href="https://summerofcode.withgoogle.com/archive/2023/projects/eNJTJO25" target="_blank" rel="noopener">GSoC 2023&lt;/a>. It enables users to scale their computations over CPU and GPU clusters via MPI. However, up until now, even GPU-based communications were routed through MPI, introducing potential performance bottlenecks.&lt;/p>
&lt;h2 id="the-goal-of-the-project">The Goal of the Project&lt;/h2>
&lt;p>Our goal is to take PyLops-MPI to the next level by enabling GPU-to-GPU collective communications directly using NVIDIA NCCL. This allows full utilization of high-bandwidth interconnects like &lt;a href="https://www.nvidia.com/en-us/data-center/nvlink/" target="_blank" rel="noopener">NVLink&lt;/a>, and avoids unnecessary memory transfers through the host CPU.
This blog marks the midpoint of the program (week 6 of 12), and I’d like to reflect on the progress so far, challenges faced, and what&amp;rsquo;s coming next.&lt;/p>
&lt;h2 id="what-is-a-collective-communication-anyway">What is a Collective Communication anyway?&lt;/h2>
&lt;p>In PyLops-MPI, distributed computations require nodes to exchange information, for example, during gradient computations or reductions in iterative solvers. A naive implementation (useful for a thought experiment) would involve each node taking turns broadcasting data, which can be quite slow.
NVIDIA’s NCCL abstracts away the complexity of topology-aware communication. For example, in the image below, if the GPUs communicate most effectively in a ring fashion for an all-reduce operation, NCCL will automatically pick that layout and avoid using the GPU 01-GPU 04 and GPU 02-GPU 03 communication links.&lt;/p>
&lt;p align="center">
&lt;img src="network.png" alt="network" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a compute node with 4 GPUs attached, directly connected to each other with NVLink&lt;/em>&lt;/p>
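&lt;p>To make the ring pattern concrete, here is a small pure-Python simulation of a ring all-reduce (a reduce-scatter phase followed by an all-gather phase). Real NCCL performs this internally on device buffers; this sketch only illustrates the communication schedule:&lt;/p>

```python
# Pure-Python simulation of a ring all-reduce (sum). Each "rank" is one simulated
# GPU holding a vector split into n chunks; NCCL implements this pattern natively.
def ring_allreduce(ranks):
    n = len(ranks)
    c = len(ranks[0]) // n                    # chunk size (assumes n divides length)
    data = [list(r) for r in ranks]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            ci = (r - s) % n                  # chunk index rank r forwards this step
            sends.append((ci, data[r][ci * c:(ci + 1) * c]))
        for r in range(n):
            ci, chunk = sends[r]
            dst = (r + 1) % n                 # next neighbour on the ring
            for i, v in enumerate(chunk):
                data[dst][ci * c + i] += v    # accumulate into the receiver's chunk
    # All-gather: circulate the fully reduced chunks around the ring.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            ci = (r + 1 - s) % n              # the reduced chunk rank r passes on
            sends.append((ci, data[r][ci * c:(ci + 1) * c]))
        for r in range(n):
            ci, chunk = sends[r]
            dst = (r + 1) % n
            data[dst][ci * c:(ci + 1) * c] = chunk
    return data
```

Each rank sends and receives only chunk-sized messages to one neighbour per step, which is why the ring layout uses the links so evenly.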
&lt;h2 id="what-we-achieved-so-far">What we achieved, so far&lt;/h2>
&lt;p>It is probably best to tell stories through the sequence of pull requests.&lt;/p>
&lt;h3 id="core-changes-in-distributedarray-pr-130httpsgithubcompylopspylops-mpipull130">Core Changes in DistributedArray (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/130" target="_blank" rel="noopener">PR #130&lt;/a>)&lt;/h3>
&lt;p>This PR introduces NCCL support into the &lt;code>DistributedArray&lt;/code> class. The design allows users to optionally pass both a &lt;code>NcclCommunicator&lt;/code> and an &lt;code>MPI.Comm&lt;/code>. By doing so, small control data (e.g., shape, dtype) is still exchanged via MPI, leveraging Python&amp;rsquo;s flexibility and minimizing performance impact. As you will see, this decision to keep two communicators turns out to be a good call.
This is what the &lt;code>__init__&lt;/code> method of &lt;code>DistributedArray&lt;/code> looks like with the new addition (marked with a comment):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl"> &lt;span class="k">def&lt;/span> &lt;span class="fm">__init__&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="bp">self&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">global_shape&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">Comm&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">MPI&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">COMM_WORLD&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">base_comm_nccl&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">NcclCommunicatorType&lt;/span>&lt;span class="p">]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="c1"># Added to this line&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">partition&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Partition&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">Partition&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">SCATTER&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">axis&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="nb">int&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">local_shapes&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Union&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Tuple&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="n">Integral&lt;/span>&lt;span class="p">]]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="n">mask&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="n">Optional&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">List&lt;/span>&lt;span class="p">[&lt;/span>&lt;span class="n">Integral&lt;/span>&lt;span class="p">]]&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="kc">None&lt;/span>&lt;span class="p">)&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="the-cupys-nccl-api">The CuPy&amp;rsquo;s NCCL API&lt;/h3>
&lt;p>NCCL&amp;rsquo;s API (&lt;a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html" target="_blank" rel="noopener">mirroring its C++ origins&lt;/a>) is minimalistic and requires manual memory management. One prominent example is the implementation of &lt;code>allGather()&lt;/code>. Previously, using &lt;code>mpi4py&lt;/code>, we could leverage Python&amp;rsquo;s dynamic typing (everything is an object, so one just sends the object), which allows different ranks to send arrays of different sizes. NCCL, by contrast, requires every rank in the communicator to send the same size. To work around this, we implemented padding and reshaping logic for multi-dimensional arrays. NCCL treats arrays as contiguous byte streams, so padding must be handled carefully &lt;sup id="fnref:1">&lt;a href="#fn:1" class="footnote-ref" role="doc-noteref">1&lt;/a>&lt;/sup>.&lt;/p>
&lt;p>Moreover, we had to accommodate NCCL’s lower-level API, lacking conveniences like communicator’s split variants. Internally, we introduced unified abstractions such as &lt;code>_allgather()&lt;/code>, &lt;code>_allreduce()&lt;/code>, &lt;code>send()&lt;/code>, &lt;code>recv()&lt;/code>. etc to &lt;code>DistributedArray&lt;/code> and modified the communication model to work seamlessly whether MPI or NCCL is used. By doing this, other developers can focus on developing new operators (that suit their needs), and abstract away the existence of different communicators.&lt;/p>
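&lt;p>A sketch of this padding workaround, simulated with NumPy on the host (the function name is illustrative; the real logic lives inside &lt;code>DistributedArray&lt;/code> and operates on CuPy arrays):&lt;/p>

```python
# Simulated NCCL-style all-gather with padding for uneven per-rank sizes.
# NumPy stands in here for CuPy/NCCL, purely for illustration.
import numpy as np

def padded_allgather(local_arrays):
    # True per-rank sizes: small metadata, exchanged via MPI in the real code.
    sizes = [a.size for a in local_arrays]
    pad_to = max(sizes)
    # Every rank must contribute the same number of elements to the collective.
    padded = [np.pad(a.ravel(), (0, pad_to - a.size)) for a in local_arrays]
    gathered = np.concatenate(padded)      # what an NCCL allGather would return
    # Trim the padding back out using the known per-rank sizes.
    pieces = [gathered[r * pad_to : r * pad_to + s] for r, s in enumerate(sizes)]
    return np.concatenate(pieces)
```

For example, with ranks holding 3 and 2 elements, the second rank's buffer is padded to 3 elements before the gather and the padding is stripped afterwards.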
&lt;p align="center">
&lt;img src="partition.png" alt="partition" width="800/>
&lt;/p>
&lt;p>&lt;em>Example of a challenge coming from having an unevenly distributed array&lt;/em>&lt;/p>
&lt;h3 id="keep-things-small-dependency-management-pr-132httpsgithubcompylopspylops-mpipull132-and-pr-135httpsgithubcompylopspylops-mpipull135">Keep things small: Dependency management (&lt;a href="https://github.com/PyLops/pylops-mpi/pull/132" target="_blank" rel="noopener">PR #132&lt;/a> and &lt;a href="https://github.com/PyLops/pylops-mpi/pull/135" target="_blank" rel="noopener">PR #135&lt;/a>)&lt;/h3>
&lt;p>Despite adding this new capability, we are fully aware that not every user has access to a cluster of GPUs, and therefore we don’t make NCCL and CuPy mandatory dependencies. Someone installing and experimenting with PyLops-MPI for the first time is likely to run it on a single-node desktop, and we don’t want to introduce such complexity early on. This means that our code has to treat them as an “optional dependency” and use a “protected import”. If we had &lt;code>import cupy as cp&lt;/code> at the beginning of &lt;code>DistributedArray&lt;/code>, users without a GPU would encounter an error before doing anything useful at all.
In other words, our library should import CuPy and NCCL only when the system supports them and the user asks for them. The pattern looks like &lt;a href="https://github.com/PyLops/pylops-mpi/blob/main/pylops_mpi/utils/deps.py" target="_blank" rel="noopener">this&lt;/a>:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="n">nccl_test&lt;/span> &lt;span class="o">=&lt;/span> &lt;span class="n">util&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">find_spec&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;cupy&amp;#34;&lt;/span>&lt;span class="p">)&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="ow">not&lt;/span> &lt;span class="kc">None&lt;/span> &lt;span class="ow">and&lt;/span> &lt;span class="nb">int&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="n">os&lt;/span>&lt;span class="o">.&lt;/span>&lt;span class="n">getenv&lt;/span>&lt;span class="p">(&lt;/span>&lt;span class="s2">&amp;#34;NCCL_PYLOPS_MPI&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">1&lt;/span>&lt;span class="p">))&lt;/span> &lt;span class="o">==&lt;/span> &lt;span class="mi">1&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">if&lt;/span> &lt;span class="n">nccl_test&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># try import CuPy and then check for NCCL&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">if&lt;/span> &lt;span class="n">nccl&lt;/span> &lt;span class="ow">is&lt;/span> &lt;span class="n">available&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># success&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># unable to import but the package is installed&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="k">else&lt;/span>&lt;span class="p">:&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="c1"># package is not installed or the environment variable disables it&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Finally, set nccl_enabled flag for other module to use for protected import&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This helps preserve PyLops-MPI’s minimal base installation. This required carefully isolating imports and adapting the module resolution logic using a backend dispatching mechanism.
This is something I had never considered before.&lt;/p>
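&lt;p>A runnable version of this protected-import pattern (simplified relative to the actual &lt;code>deps.py&lt;/code>) might look like:&lt;/p>

```python
# Simplified, runnable sketch of the protected-import pattern; the real logic
# in pylops_mpi/utils/deps.py differs in detail.
import os
from importlib import util

nccl_enabled = False
nccl_message = "CuPy not installed, or NCCL disabled via NCCL_PYLOPS_MPI"
if util.find_spec("cupy") is not None and int(os.getenv("NCCL_PYLOPS_MPI", 1)) == 1:
    try:
        from cupy.cuda import nccl  # noqa: F401  # fails if CuPy lacks NCCL support
        nccl_enabled = True
        nccl_message = "NCCL available"
    except ImportError:
        nccl_message = "CuPy installed, but its NCCL support is unavailable"

# Other modules branch on nccl_enabled instead of importing CuPy at the top level.
```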
&lt;h3 id="the-basic-operator-with-nccl-pr-137httpsgithubcompylopspylops-mpipull137">The Basic Operator with NCCL &lt;a href="https://github.com/PyLops/pylops-mpi/pull/137" target="_blank" rel="noopener">PR 137&lt;/a>&lt;/h3>
&lt;p>We chose &lt;code>MPIVStack&lt;/code> as the first operator to implement NCCL support due to its simplicity. Several design choices emerged:&lt;/p>
&lt;h4 id="implicit-communicator-propagation">Implicit Communicator Propagation&lt;/h4>
&lt;p>We updated forward and adjoint calls to propagate the &lt;code>base_comm_nccl&lt;/code> from input to output automatically. This way, if &lt;code>x&lt;/code> is NCCL-enabled, then &lt;code>y = A @ x&lt;/code> or &lt;code>A.H @ x&lt;/code> will also be NCCL-enabled. This avoids mismatches and keeps operator pipelines consistent.&lt;/p>
&lt;p>Interestingly, and contrary to our initial expectation, the operator itself did not need to explicitly take &lt;code>base_comm_nccl&lt;/code> as an argument the way &lt;code>DistributedArray&lt;/code> does. This is good news: it reduces the chance that other developers will have to handle different communication cases when adding new operators, lowering the complexity of extending PyLops-MPI.&lt;/p>
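&lt;p>The propagation rule can be illustrated with toy classes (these are not the real &lt;code>DistributedArray&lt;/code> / &lt;code>MPIVStack&lt;/code> APIs, just a sketch of the idea):&lt;/p>

```python
# Toy illustration of implicit communicator propagation: the operator stores no
# communicator; its output simply inherits whatever the input array carries.
class ToyDistributedArray:
    def __init__(self, data, base_comm="MPI.COMM_WORLD", base_comm_nccl=None):
        self.data = data
        self.base_comm = base_comm
        self.base_comm_nccl = base_comm_nccl

class ToyScaleOperator:
    def __init__(self, scale):
        self.scale = scale

    def __matmul__(self, x):
        # The output inherits both communicators from the input.
        return ToyDistributedArray([self.scale * v for v in x.data],
                                   base_comm=x.base_comm,
                                   base_comm_nccl=x.base_comm_nccl)

x = ToyDistributedArray([1.0, 2.0], base_comm_nccl="nccl-comm")
y = ToyScaleOperator(2.0) @ x   # y is NCCL-enabled because x was
```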
&lt;h4 id="optional-dual-communicator-design">Optional Dual-Communicator Design&lt;/h4>
&lt;p>As with DistributedArray, the ability to pass both an MPI communicator and an NCCL communicator proved to be a sound decision. By maintaining NCCL as an optional backend, we gain fine-grained control over which communication paths use NCCL versus MPI. This flexibility allowed us to optimize performance-critical paths while retaining MPI for control messages and small metadata transfers.&lt;/p>
&lt;p>In particular, in the communication of ghost cells, which are used for computation around the boundary (e.g., in derivative calculations), small metadata such as cell_fronts (typically a list with one integer per rank) continues to be efficiently transmitted via MPI. This metadata is needed to allocate the send/receive buffers, and sending it via MPI leverages Python&amp;rsquo;s object serialization model (&lt;code>list[int]&lt;/code>) without incurring GPU synchronization costs. The actual ghost-cell arrays themselves, which can be large, are communicated with NCCL.&lt;/p>
&lt;h2 id="whats-next">What’s Next?&lt;/h2>
&lt;p>Aside from enabling NCCL support for the remaining operators and their full test coverage, some more exciting upcoming updates are&lt;/p>
&lt;ul>
&lt;li>Complex-number type support for NCCL&lt;/li>
&lt;li>Benchmarking results on a real HPC system&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for Part 2, and thanks for reading!&lt;/p>
&lt;div class="footnotes" role="doc-endnotes">
&lt;hr>
&lt;ol>
&lt;li id="fn:1">
&lt;p>For the best performance, mpi4py would also require pre-allocated buffers. The mpi4py package provides two interfaces: buffered and non-buffered. Currently PyLops-MPI takes the non-buffered approach, which suggests room for optimization.&amp;#160;&lt;a href="#fnref:1" class="footnote-backref" role="doc-backlink">&amp;#x21a9;&amp;#xfe0e;&lt;/a>&lt;/p>
&lt;/li>
&lt;/ol>
&lt;/div></description></item><item><title>LLMSeqRec: LLM Enhanced Contextual Sequential Recommender</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/</link><pubDate>Tue, 22 Jul 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/</guid><description>&lt;h1 id="midway-through-osre">Midway Through OSRE&lt;/h1>
&lt;h2 id="my-journey-with-llmseqrec">My Journey with LLMSeqRec&lt;/h2>
&lt;h3 id="hello-from-the-midpoint">Hello from the Midpoint!&lt;/h3>
&lt;p>Hi everyone! I’m Connor Lee, a student at NYU studying Computer Science and Mathematics, and I’m excited to share the progress I’ve made halfway through the Open Source Research Experience (OSRE) with my project: &lt;strong>LLMSeqRec&lt;/strong> – a large language model-enhanced sequential recommender system.&lt;/p>
&lt;p>Over the past several weeks, I’ve had the opportunity to explore the intersection of recommender systems and large language models (LLMs), and it’s been a deep, challenging, and rewarding dive into building smarter, more contextual recommendation engines.&lt;/p>
&lt;hr>
&lt;h3 id="what-is-llmseqrec">What is LLMSeqRec?&lt;/h3>
&lt;p>&lt;strong>LLMSeqRec&lt;/strong> stands for &lt;strong>LLM-Enhanced Contextual Sequential Recommender&lt;/strong>. Traditional sequential recommendation systems like SASRec are great at capturing patterns from user-item interactions, but they often fall short in two areas: understanding &lt;strong>semantic context&lt;/strong> (e.g., item descriptions, reviews) and dealing with &lt;strong>cold-start&lt;/strong> problems.&lt;/p>
&lt;p>LLMSeqRec aims to address this by incorporating &lt;strong>pretrained LLM embeddings&lt;/strong> into the recommendation pipeline. The goal is to enhance models like SASRec with semantic signals from text (like product reviews or titles), allowing them to better model user intent, long-range dependencies, and generalize to new items or users.&lt;/p>
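&lt;p>One simple fusion strategy (an illustration of the idea, not necessarily LLMSeqRec&amp;rsquo;s final design) is to project frozen LLM text embeddings into the model&amp;rsquo;s hidden size and add them to the learned item-ID embeddings, sketched here with NumPy standing in for a deep-learning framework:&lt;/p>

```python
# Sketch: learned ID embedding plus a projection of frozen LLM text embeddings.
# The additive fusion and the dimensions are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

class SemanticItemEmbedding:
    def __init__(self, num_items, hidden_dim, llm_emb):
        self.id_emb = rng.normal(size=(num_items, hidden_dim)) * 0.01  # learned
        self.llm_emb = llm_emb            # frozen, e.g. from item titles/reviews
        self.proj = rng.normal(size=(llm_emb.shape[1], hidden_dim)) * 0.01

    def __call__(self, item_ids):
        # Semantic signal is added on top of the collaborative ID embedding.
        return self.id_emb[item_ids] + self.llm_emb[item_ids] @ self.proj

llm_vectors = rng.normal(size=(100, 384))          # e.g. sentence-encoder outputs
emb = SemanticItemEmbedding(num_items=100, hidden_dim=32, llm_emb=llm_vectors)
batch = emb(np.array([[1, 2, 3]]))                 # (batch, seq_len, hidden_dim)
```

Because the LLM embeddings exist for any item with text metadata, new (cold-start) items get a meaningful representation even before interaction data accumulates.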
&lt;hr>
&lt;h3 id="progress-so-far">Progress So Far&lt;/h3>
&lt;h4 id="-baseline-sasrec-runs">✅ Baseline SASRec Runs&lt;/h4>
&lt;p>To establish a benchmark, I successfully ran the original SASRec implementation (in PyTorch) using both the &lt;strong>MovieLens 1M&lt;/strong> and &lt;strong>Amazon Beauty&lt;/strong> datasets. After debugging initial data formatting issues and adjusting batch sizes for local CPU/GPU compatibility, I automated training with scripts that let me scale to &lt;strong>200+ epochs&lt;/strong> to achieve the best performance, both in Colab and on my MacBook via CPU.&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> At this stage, we have not yet integrated LLMs into the model. These baseline runs (SASRec) serve as the control group for evaluating the future impact of LLM-based enhancements.&lt;/p>
&lt;hr>
&lt;h3 id="whats-next">What’s Next&lt;/h3>
&lt;p>As I enter the second half of the OSRE, I’ll be shifting gears toward &lt;strong>LLM integration, model evaluation, and running LLM-powered sequential recommendations using product metadata and contextual information&lt;/strong>. Here&amp;rsquo;s what’s ahead:&lt;/p>
&lt;ul>
&lt;li>Designing pipelines to extract and align textual metadata with item sequences&lt;/li>
&lt;li>Integrating LLM-generated embeddings into the recommender model&lt;/li>
&lt;li>Evaluating performance changes across different dataset characteristics&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="-experimental-results">📊 Experimental Results&lt;/h3>
&lt;p>We have &lt;strong>not yet utilized LLMs&lt;/strong> in our current experiments. The results below reflect our &lt;strong>reproduced baseline performance of SASRec&lt;/strong> across datasets.&lt;/p>
&lt;p>Below are the &lt;strong>performance curves on different test sets&lt;/strong>, where we evaluate model performance every 20 epochs during training:&lt;/p>
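&lt;p>For reference, Hit@10 and NDCG@10 can be computed from the rank of each user&amp;rsquo;s held-out ground-truth item among the scored candidates (a standard protocol for SASRec-style evaluation; the exact negative-sampling setup of the runs above is not shown here):&lt;/p>

```python
# Hit@10 and NDCG@10 from the 0-based rank of each user's ground-truth item.
import math

def hit_and_ndcg_at_k(ranks, k=10):
    hits = sum(1 for r in ranks if r < k)                   # ground truth in top-k?
    ndcg = sum(1.0 / math.log2(r + 2) for r in ranks if r < k)
    n = len(ranks)
    return hits / n, ndcg / n

hr, ndcg = hit_and_ndcg_at_k([0, 3, 25, 9])   # four users' ranks; one miss at 25
```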
&lt;h4 id="beauty-dataset-performance">Beauty Dataset Performance&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Beauty Hit@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_6d3cf991cc5172e392edbb398afef774.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_91a98a3d515a172aed7283ab8b04a8b6.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_6d3cf991cc5172e392edbb398afef774.webp"
width="760"
height="497"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Hit@10 performance on the test set for the Beauty dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Beauty Loss Training" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_f4e0cc23660b4c974056c8b5d603c0ca.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_7c62f735e3e920d3561bd9113c662533.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_f4e0cc23660b4c974056c8b5d603c0ca.webp"
width="760"
height="489"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Training loss for the Beauty dataset&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Beauty NDCG@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_a1a39dc055b888f5de47c25c87ccf913.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_3e4c7d0050bef8ec9f8f7928c2c6c7af.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_a1a39dc055b888f5de47c25c87ccf913.webp"
width="760"
height="483"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>NDCG@10 performance on the test set for the Beauty dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;h4 id="ml-1m-dataset-performance">ML-1M Dataset Performance&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ML-1M Loss Training" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_828aa4c04e00024c863cb89e245d358a.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_d913a345a32ce7ac5bcff72438283a01.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_828aa4c04e00024c863cb89e245d358a.webp"
width="760"
height="490"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Training loss for the ML-1M dataset&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ML-1M Hit@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_8e8f20a29b2657093b23e780efd1d072.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_257879da50059e5cc3e64fd8ed1d9d72.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_8e8f20a29b2657093b23e780efd1d072.webp"
width="760"
height="484"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Hit@10 performance on the test set for the ML-1M dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ML-1M NDCG@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_dfd4334fbae2a7067cf9f91b1595e36b.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_271754cbcc6eac53f84162a93b670d17.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_dfd4334fbae2a7067cf9f91b1595e36b.webp"
width="760"
height="488"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>NDCG@10 performance on the test set for the ML-1M dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;p>These results demonstrate that our &lt;strong>baseline SASRec reproductions&lt;/strong> are converging as expected and will serve as a solid foundation for comparison once LLM integration is complete.&lt;/p>
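&lt;p>For context, under the leave-one-out evaluation these reports use, Hit@10 and NDCG@10 reduce to simple per-user formulas. A minimal Python sketch (the helper names are illustrative, not taken from the project code):&lt;/p>

```python
import math

def hit_at_k(rank, k=10):
    # A hit is counted when the held-out item appears in the top-k list.
    return 1.0 if rank <= k else 0.0

def ndcg_at_k(rank, k=10):
    # With a single relevant item per user (leave-one-out protocol),
    # NDCG@k reduces to 1 / log2(rank + 1) when the item ranks in the top k.
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

# ranks[i] is the 1-based rank of user i's held-out item among the candidates
ranks = [1, 3, 12, 7, 2]
hr = sum(hit_at_k(r) for r in ranks) / len(ranks)
ndcg = sum(ndcg_at_k(r) for r in ranks) / len(ranks)
```

&lt;p>Because there is only one relevant item per user, NDCG@10 collapses to a function of the item&amp;rsquo;s rank alone, which is why the Hit@10 and NDCG@10 curves tend to track each other.&lt;/p>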
&lt;hr>
&lt;h3 id="closing-thoughts">Closing Thoughts&lt;/h3>
&lt;p>This project has been an exciting journey into both research and engineering, and I look forward to exploring &lt;strong>LLM-powered embedding integration&lt;/strong> in the upcoming phase.&lt;/p>
&lt;p>I’m incredibly grateful to my mentors &lt;strong>Dr. Linsey Pang and Dr. Bin Dong&lt;/strong> for their support and guidance throughout the project so far. I’m looking forward to sharing more technical results as we work toward building smarter, more adaptable recommender systems.&lt;/p></description></item><item><title>Midterm Report: KAN Integration into LLMs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250718-dentonjc/</link><pubDate>Fri, 18 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250718-dentonjc/</guid><description>&lt;p>Imagine if we could make neural networks that are not just more efficient, but smarter in how they learn. That’s the promise behind &lt;strong>Kolmogorov–Arnold Networks&lt;/strong> (KANs)—a fascinating new architecture that replaces the usual &amp;ldquo;weighted sums and activation functions&amp;rdquo; with more mathematical finesse. Instead of processing all inputs in one big lump, KANs treat each input dimension individually, transforming them with elegant functions like B-splines or simpler polynomials. The idea is simple but powerful: do more with less.&lt;/p>
&lt;p>For my project, I set out to explore what happens when we integrate these KAN layers into a lightweight language model called &lt;strong>SmolLM2&lt;/strong>, training and testing it on the &lt;strong>smol-smoltalk&lt;/strong> dataset.&lt;/p>
&lt;h2 id="setting-the-stage-smollm2-meets-kan">Setting the Stage: SmolLM2 Meets KAN&lt;/h2>
&lt;p>The original SmolLM2 has 135 million parameters and 30 transformer blocks—plenty of moving parts. To keep things manageable during the initial phase, I created a mini version of the model with just 3 blocks and a trimmed-down vocabulary. This setup let me test dozens of KAN variations quickly, using a simple text classification task (AGNews) as a playground before moving on to full-scale language modeling.&lt;/p>
&lt;p>Beyond the simplified testbed, I also successfully trained a full 30-block KAN-based SmolLM2. That model held up on challenging language benchmarks, matching the performance of the original, linear-layer version. That&amp;rsquo;s a big win.&lt;/p>
&lt;h2 id="what-worked-and-what-didnt">What Worked (and What Didn&amp;rsquo;t)&lt;/h2>
&lt;p>Along the way, I tried out a variety of KAN flavors: spline-based, radial basis functions (RBF), rational functions, and no fewer than eight types of orthogonal polynomials—like Chebyshev, Legendre, and Hermite. Each one brings its own quirks, strengths, and training times.&lt;/p>
&lt;p>Some key takeaways:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Chebyshev (second kind)&lt;/strong> with a low polynomial degree (just 2!) delivered the best speed/accuracy trade-off.&lt;/li>
&lt;li>&lt;strong>Jacobi&lt;/strong> and &lt;strong>Gegenbauer&lt;/strong> polynomials edged slightly ahead in raw accuracy but required much longer training times.&lt;/li>
&lt;li>Replacing each linear layer with a KAN version (keeping parameter count similar) worked fine—but layering them in parallel or sequence didn’t add much.&lt;/li>
&lt;li>A baseline with regular linear layers still performed slightly better (60.8% vs. 60.3%), but KANs showed they can come close with room for optimization.&lt;/li>
&lt;/ul>
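&lt;p>To make the Chebyshev result concrete, here is a minimal sketch of a KAN-style unit using degree-2 Chebyshev polynomials of the second kind. It is a simplified illustration of the idea, not the project&amp;rsquo;s actual layer: each input dimension gets its own learnable univariate transform, and the per-dimension outputs are summed.&lt;/p>

```python
import math

def chebyshev_u(x, degree):
    # Chebyshev polynomials of the second kind via the recurrence
    # U_0 = 1, U_1 = 2x, U_{n+1} = 2x * U_n - U_{n-1}
    vals = [1.0, 2.0 * x]
    for _ in range(2, degree + 1):
        vals.append(2.0 * x * vals[-1] - vals[-2])
    return vals[: degree + 1]

def kan_unit(inputs, coeffs, degree=2):
    # A KAN-style unit: each input dimension is transformed by its own
    # univariate function (here a degree-2 Chebyshev expansion with
    # learnable coefficients), then the outputs are summed --
    # no weighted sum followed by ReLU/SiLU.
    out = 0.0
    for x, c in zip(inputs, coeffs):
        x = math.tanh(x)  # squash into [-1, 1], the polynomials' natural domain
        out += sum(ci * ui for ci, ui in zip(c, chebyshev_u(x, degree)))
    return out
```

&lt;p>With degree 2 there are only three coefficients per input dimension, which is why this variant offered the best speed/accuracy trade-off in the experiments above.&lt;/p>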
&lt;h2 id="why-this-matters">Why This Matters&lt;/h2>
&lt;p>What’s compelling is not just that KANs &lt;em>can&lt;/em> work, but that they bring some appealing properties:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Parameter efficiency&lt;/strong>: Good performance with fewer or similarly-sized layers.&lt;/li>
&lt;li>&lt;strong>Flexibility&lt;/strong>: They adapt well to existing hyperparameters—less fine-tuning needed.&lt;/li>
&lt;li>&lt;strong>Stability&lt;/strong>: They run smoothly in fp16 (a lower-precision format), which is critical for efficient training.&lt;/li>
&lt;li>&lt;strong>Potential for richer activations&lt;/strong>: Some existing projects still rely on activations like ReLU or SiLU alongside KANs. But I found KANs alone could learn well without them, opening up more dynamic architectures in the future.&lt;/li>
&lt;/ul>
&lt;h2 id="whats-next">What&amp;rsquo;s Next&lt;/h2>
&lt;p>With the heavy lifting done (code written, models trained, ideas tested), the remainder of the project focuses on refinement. That means more training on generative tasks, better tuning of polynomial degrees, smarter initialization strategies, and potentially making KAN-based layers more plug-and-play.&lt;/p>
&lt;p>The fact that a fully KAN-powered SmolLM2 can hold its own on tough language benchmarks is more than just a proof of concept. It’s a hint that we might not have to keep scaling models indefinitely to get better performance. Instead, we can get more from each parameter, by changing how the model &lt;em>thinks&lt;/em>.&lt;/p></description></item><item><title>Halfway Through GSoC: My Experience and Progress</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/07172025-zeyu/</link><pubDate>Thu, 17 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/07172025-zeyu/</guid><description>&lt;p>As part of the &lt;a href="https://ucsc-ospo.github.io/project/osre25/uci/rag-st/" target="_blank" rel="noopener">RAG-ST&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1_yUf1NlVRpBXERCqnOby7pgP4WrWrZsr/view" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;strong>Ziheng Duan&lt;/strong> aims to build a &lt;strong>retrieval-augmented generation&lt;/strong> framework to predict spatial gene expression from histology images.&lt;/p>
&lt;hr>
&lt;h2 id="-achievements">🚀 Achievements&lt;/h2>
&lt;h3 id="-ran-the-hest-1k-pipeline">✅ Ran the HEST-1K Pipeline&lt;/h3>
&lt;p>I successfully ran gene expression prediction models on the &lt;strong>HEST-1K&lt;/strong> dataset, reproducing baseline image-to-expression workflows and setting up data loaders, evaluation metrics, and visual inspection of outputs.&lt;/p>
&lt;h3 id="-explored-tangrams-alignment-code">✅ Explored Tangram’s Alignment Code&lt;/h3>
&lt;p>I studied and ran &lt;strong>Tangram&lt;/strong>, a well-known scRNA-seq to ST alignment method, gaining key insights into cross-modality mapping. These ideas will inform our strategy to align histology images to scRNA-seq data.&lt;/p>
&lt;h3 id="-designed-the-rag-st-architecture">✅ Designed the RAG-ST Architecture&lt;/h3>
&lt;p>I drafted the architecture for the RAG-ST pipeline, including:&lt;/p>
&lt;ul>
&lt;li>Vision encoder to process image patches.&lt;/li>
&lt;li>Retrieval module to find relevant examples from a curated database.&lt;/li>
&lt;li>Generation head that conditions predictions on the retrieved examples — allowing transparency and context-aware outputs.&lt;/li>
&lt;/ul>
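&lt;p>The retrieval module above can be sketched as a cosine-similarity top-k lookup over a reference database. This is a minimal illustration under simplifying assumptions; the real module will likely use a proper vector index rather than a linear scan:&lt;/p>

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_emb, database, k=2):
    # database: list of (spot_id, embedding, expression_profile) tuples.
    # Return the k reference spots most similar to the query patch embedding;
    # the generation head would then condition on their expression profiles,
    # which also makes each prediction traceable to its retrieved examples.
    scored = sorted(database, key=lambda e: cosine(query_emb, e[1]), reverse=True)
    return scored[:k]
```

&lt;p>Returning the retrieved entries themselves, rather than just a score, is what enables the transparency and interpretability visualizations planned below.&lt;/p>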
&lt;hr>
&lt;h2 id="-challenges">🧠 Challenges&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Data Alignment&lt;/strong>: Spatial transcriptomics datasets often lack perfect alignment between histology, gene expression, and scRNA-seq, requiring custom preprocessing and normalization.&lt;/li>
&lt;li>&lt;strong>Trade-off Between Interpretability and Accuracy&lt;/strong>: Retrieval-augmented designs allow us to trace the origin of predictions but require care to avoid overfitting or performance drops.&lt;/li>
&lt;li>&lt;strong>Computation&lt;/strong>: High-resolution images and large-scale retrieval can be computationally expensive. I’ve begun exploring downsampling and vector database indexing strategies.&lt;/li>
&lt;/ul>
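&lt;p>One such indexing strategy is random-hyperplane LSH, which narrows each query to the vectors sharing a short binary signature. This is a toy sketch of the idea, not the project&amp;rsquo;s implementation (which may well use an off-the-shelf vector database):&lt;/p>

```python
import random

random.seed(0)  # fixed seed so the hyperplanes are reproducible

def lsh_signature(vec, planes):
    # Random-hyperplane LSH: one bit per hyperplane, the sign of the dot product.
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes)

def build_index(vectors, dim, n_planes=4):
    planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for i, v in enumerate(vectors):
        buckets.setdefault(lsh_signature(v, planes), []).append(i)
    return planes, buckets

def query(vec, planes, buckets):
    # Only vectors sharing the query's signature are candidates, trading
    # a little recall for a large reduction in distance computations.
    return buckets.get(lsh_signature(vec, planes), [])
```

&lt;p>The candidate list returned by &lt;code>query&lt;/code> would then be re-ranked exactly, so the expensive comparison only runs on a small bucket instead of the full database.&lt;/p>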
&lt;hr>
&lt;h2 id="-whats-next">🔜 What&amp;rsquo;s Next&lt;/h2>
&lt;ul>
&lt;li>🔧 Build the &lt;strong>end-to-end retrieval-generation pipeline&lt;/strong>&lt;/li>
&lt;li>🧬 Prototype &lt;strong>histology-to-scRNA-seq&lt;/strong> alignment using adapted Tangram ideas&lt;/li>
&lt;li>📊 Benchmark &lt;strong>RAG-ST vs. MLP baselines&lt;/strong>&lt;/li>
&lt;li>👁️ Develop &lt;strong>interpretability visualizations&lt;/strong> to show which samples were retrieved for each prediction&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="-deliverables-progress">🧾 Deliverables Progress&lt;/h2>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Deliverable&lt;/th>
&lt;th>Status&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>HEST-1K Baseline Pipeline&lt;/td>
&lt;td>✅ Completed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Tangram Exploration&lt;/td>
&lt;td>✅ Completed&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Data Curation&lt;/td>
&lt;td>🟡 In Progress&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>RAG-ST Architecture&lt;/td>
&lt;td>✅ Drafted&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Full Pipeline&lt;/td>
&lt;td>⏳ Planned&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Evaluation &amp;amp; Comparison&lt;/td>
&lt;td>⏳ Planned&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="-closing-thoughts">🙌 Closing Thoughts&lt;/h2>
&lt;p>It&amp;rsquo;s been a rewarding first half of GSoC. I’ve gained hands-on experience with spatial transcriptomics datasets, explored state-of-the-art tools like Tangram, and laid the groundwork for a new interpretable gene prediction model.&lt;/p>
&lt;p>I’m excited to continue building RAG-ST and look forward to sharing more results soon. Huge thanks to my mentor &lt;strong>Ziheng Duan&lt;/strong> for the guidance and support throughout!&lt;/p>
&lt;p>If you have questions or want to discuss spatial modeling, feel free to reach out.&lt;/p></description></item><item><title>Midterm Blog - WildBerryEye User Interface</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250731-sophietao127/</link><pubDate>Wed, 16 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250731-sophietao127/</guid><description>&lt;p>Hi, my name is Sophie Tao, I am an alumn at the University of Washington, with majoring in Electrical and Computer Engineering,
I’m happy to share the progress I have been able to make over the last six weeks on my GSoC 2025 project, WildBerryEye, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a>.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>WildBerryEye is an open-source initiative to support ecological monitoring of pollinators such as bees and hummingbirds using edge computing and computer vision. The project leverages a Raspberry Pi and YOLO for object detection and aims to provide an accessible, responsive, and real-time web interface for researchers, ecologists, and citizen scientists.&lt;/p>
&lt;p>This project specifically focuses on building the frontend and backend infrastructure for WildBerryEye’s user interface, enabling:&lt;/p>
&lt;ul>
&lt;li>Real-time pollinator detection preview
&lt;ul>
&lt;li>Real-time image capture&lt;/li>
&lt;li>Real-time video capture&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Responsive, user-friendly UI&lt;/li>
&lt;li>Object detection&lt;/li>
&lt;li>Researcher-friendly configuration and usability&lt;/li>
&lt;/ul>
&lt;h1 id="progress-so-far">Progress So Far&lt;/h1>
&lt;p>✅ Phase 1: Setup&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Frontend: Completed React + TypeScript project initialization with routing and base components. Pages include:&lt;/p>
&lt;ul>
&lt;li>Home page (with image preview)&lt;/li>
&lt;li>Dashboard page (pollinator image &amp;amp; video)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Backend: Flask server initialized with modular structure. Basic API endpoints stubbed as per the proposal.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>✅ Phase 2: Core Features&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Real-Time Communication:
The frontend successfully receives the image stream over a WebSocket connection.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>UI Components:&lt;/p>
&lt;ul>
&lt;li>Implemented image carousel preview on homepage.&lt;/li>
&lt;li>Image Capture (Image download)&lt;/li>
&lt;li>Video Capture (Video Preview, Video Recording)&lt;/li>
&lt;li>Sidebar-based navigation and page structure fully integrated.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>API Development:&lt;/p>
&lt;ul>
&lt;li>Implemented core endpoints such as the &lt;code>/home&lt;/code> and &lt;code>/dashboard&lt;/code> routes.&lt;/li>
&lt;li>Backend handlers structured for image and video capture.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h1 id="challenges-encountered">Challenges Encountered&lt;/h1>
&lt;p>⚠️ Real-time image testing: the lack of a consistent live camera feed made local testing unreliable. &lt;br>
⚠️ Sharing the camera module between image capture and video capture. &lt;br>
⚠️ Producing video output in the proper format.&lt;/p>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;ul>
&lt;li>Enable more features for video capture&lt;/li>
&lt;li>Integrate with the machine learning model&lt;/li>
&lt;li>Conduct at least one usability test (self + external user) and incorporate feedback.&lt;/li>
&lt;li>Final Testing &amp;amp; Docs&lt;/li>
&lt;/ul>
&lt;h1 id="summary">Summary&lt;/h1>
&lt;p>At this midterm stage, the WildBerryEye UI project is on track with core milestones completed, including real-time communication, component setup, and backend API structure. The remaining work focuses on refinement, visualizations, testing, and documentation to ensure a polished final product by the end of GSoC 2025.&lt;/p></description></item><item><title>Mid-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/</link><pubDate>Tue, 15 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello everyone!&lt;br>
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.&lt;/p>
&lt;h2 id="about-the-project">&lt;strong>About the Project&lt;/strong>&lt;/h2>
&lt;p>As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for investigators to discover relevant projects, notes, and assets—across both current and archived work—using information that is either user-entered or passively collected by StatWrap.&lt;/p>
&lt;p>Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Evaluating open-source search libraries&lt;/strong> suitable for local indexing and retrieval&lt;/li>
&lt;li>&lt;strong>Building the full-text search functionality&lt;/strong> directly into the StatWrap UI to allow seamless querying across projects&lt;/li>
&lt;li>&lt;strong>Ensuring reliability&lt;/strong> through the development of unit tests and comprehensive system testing&lt;/li>
&lt;li>&lt;strong>Implementing a classification system&lt;/strong> to label projects as “Active,” “Pinned,” or “Past” within the user interface&lt;/li>
&lt;/ul>
&lt;p>This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>It has been more than six weeks since the project began, and significant progress has been made. Here&amp;rsquo;s a breakdown:&lt;/p>
&lt;h3 id="1-descriptive-comparison-of-open-source-libraries">1. &lt;strong>Descriptive Comparison of Open-Source Libraries&lt;/strong>&lt;/h3>
&lt;p>Compared various open-source search libraries based on evaluation criteria such as &lt;strong>indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation&lt;/strong>, and &lt;strong>developer experience&lt;/strong>.&lt;/p>
&lt;h3 id="2-the-libraries">2. &lt;strong>The Libraries&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Lunr.js&lt;/strong>&lt;br>
A small, client-side full-text search engine that mimics Solr capabilities.&lt;/p>
&lt;ul>
&lt;li>Field-based search, boosting&lt;/li>
&lt;li>Supports TF-IDF, inverted index&lt;/li>
&lt;li>No built-in fuzzy search (only basic wildcards)&lt;/li>
&lt;li>Can serialize/deserialize index&lt;/li>
&lt;li>Not designed for large datasets&lt;/li>
&lt;li>Moderate memory usage and indexing speed&lt;/li>
&lt;li>Good documentation&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Static websites or SPAs needing simple in-browser search&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>ElasticLunr.js&lt;/strong>&lt;br>
A lightweight, more flexible alternative to Lunr.js.&lt;/p>
&lt;ul>
&lt;li>Dynamic index (add/remove docs)&lt;/li>
&lt;li>Field-based and weighted search&lt;/li>
&lt;li>No advanced fuzzy matching&lt;/li>
&lt;li>Faster and more customizable than Lunr&lt;/li>
&lt;li>Smaller footprint&lt;/li>
&lt;li>Easy to use and maintain&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Developers wanting Lunr-like features with simpler customization&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Fuse.js&lt;/strong>&lt;br>
A fuzzy search library ideal for small to medium datasets.&lt;/p>
&lt;ul>
&lt;li>Fuzzy search with typo tolerance&lt;/li>
&lt;li>Deep key/path searching&lt;/li>
&lt;li>No need to build index&lt;/li>
&lt;li>Highly configurable (threshold, distance, etc.)&lt;/li>
&lt;li>Linear scan = slower on large datasets&lt;/li>
&lt;li>Not full-text search (scoring-based match)&lt;/li>
&lt;li>Extremely easy to set up and use&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Fuzzy search in small in-memory arrays (e.g., auto-suggest, dropdown filters)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>FlexSearch&lt;/strong>&lt;br>
A blazing-fast, modular search engine with advanced indexing options.&lt;/p>
&lt;ul>
&lt;li>Extremely fast search and indexing&lt;/li>
&lt;li>Supports phonetic, typo-tolerant, and partial matching&lt;/li>
&lt;li>Asynchronous support&lt;/li>
&lt;li>Multi-language + Unicode-friendly&lt;/li>
&lt;li>Low memory footprint&lt;/li>
&lt;li>Configuration can be complex for beginners&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: High-performance search in large/multilingual datasets&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>MiniSearch&lt;/strong>&lt;br>
A small, full-text search engine with balanced performance and simplicity.&lt;/p>
&lt;ul>
&lt;li>Fast indexing and searching&lt;/li>
&lt;li>Fuzzy search, stemming, stop words&lt;/li>
&lt;li>Field boosting and prefix search&lt;/li>
&lt;li>Compact, can serialize index&lt;/li>
&lt;li>Clean and modern API&lt;/li>
&lt;li>Lightweight and easy to maintain&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Balanced, in-browser full-text search for moderate datasets&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Search-Index&lt;/strong>&lt;br>
A persistent, full-featured search engine for Node.js and browsers.&lt;/p>
&lt;ul>
&lt;li>Persistent storage with LevelDB&lt;/li>
&lt;li>Real-time indexing&lt;/li>
&lt;li>Fielded queries, faceting, filtering&lt;/li>
&lt;li>Advanced queries (Boolean, range, etc.)&lt;/li>
&lt;li>Slightly heavier setup&lt;/li>
&lt;li>Good for offline/local-first apps&lt;/li>
&lt;li>Browser usage more complex than others&lt;/li>
&lt;li>&lt;strong>Best for&lt;/strong>: Node.js apps, &lt;strong>not directly compatible with the Electron + React environment of StatWrap&lt;/strong>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="3-developer-experience-and-maintenance">3. Developer Experience and Maintenance&lt;/h3>
&lt;p>We analyzed the download trends of the search libraries using npm trends, and also reviewed their maintenance statistics to assess how frequently they are updated.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="DOWNLOADS" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_2981b0e25cc7e6da71dd1af69f1ab499.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_52b5a1c87803e2c8a2f59ad52703cd75.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/downloads_hu3acc13cb2503d87ec01b259eecff7d9f_205568_2981b0e25cc7e6da71dd1af69f1ab499.webp"
width="760"
height="362"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Maintenance" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_50f35746c2224661759e3d1f68308f5c.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_1f83a8585ae086eae8ad16a0d18c8fff.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/Maintenance_hub392779bb7551900858e36e62009d315_166372_50f35746c2224661759e3d1f68308f5c.webp"
width="760"
height="261"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="4-comparative-analysis-after-testing">4. Comparative Analysis After Testing&lt;/h3>
&lt;p>Each search library was benchmarked against a predefined set of queries based on the same evaluation criteria.&lt;br>
We have yet to finalize the weights for each criterion; this will be done during the end-term evaluation.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="COMPARATIVE ANALYSIS" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_cf08ab4466e54fc0970dac451ab583d2.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_4d08ea843125818ade4b1288b2ed91fd.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/image_huff63b524c7af2307fdfe0ebf7a2c55bc_128809_cf08ab4466e54fc0970dac451ab583d2.webp"
width="760"
height="578"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="5-the-user-interface">5. The User Interface&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="User Interface" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_ad72fdc47d934ea42f989055b49d88aa.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_51decc3c2ce6793ca567153dd67113d0.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/UI_hu614745e803a206ba95d1613340cef4da_263973_ad72fdc47d934ea42f989055b49d88aa.webp"
width="760"
height="475"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Debug Tools" srcset="
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_e86edc8fa7aba824f1fd8a90948c619c.webp 400w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_ba6358e5089040847a0e39704677cc12.webp 760w,
/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250715-debangi29/image-1_huff1ce04307fd90cec714c35adb969f67_82199_e86edc8fa7aba824f1fd8a90948c619c.webp"
width="760"
height="482"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The user interface includes options to search using three search modes (Basic, Advanced, Boolean operators) with configurable parameters. Results are sorted based on relevance score (highest first), and also grouped by category.&lt;/p>
&lt;h3 id="6-overall-functioning">6. Overall Functioning&lt;/h3>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Indexing Workflow&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Projects are processed sequentially&lt;/li>
&lt;li>Metadata, files, people, and notes are indexed (larger files are queued for later)&lt;/li>
&lt;li>Uses a &amp;ldquo;brute-force&amp;rdquo; recursive approach to walk through project directories
&lt;ul>
&lt;li>Skips directories like &lt;code>node_modules&lt;/code>, &lt;code>.git&lt;/code>, &lt;code>.statwrap&lt;/code>&lt;/li>
&lt;li>Identifies eligible text files for indexing&lt;/li>
&lt;li>Logs progress every 10 files&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Document Creation Logic&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Reads file content as UTF-8 text&lt;/li>
&lt;li>Builds searchable documents with filename, content, and metadata&lt;/li>
&lt;li>Auto-generates tags based on content and file type&lt;/li>
&lt;li>Adds documents to the search index and document store&lt;/li>
&lt;li>Handles errors gracefully with debug logging&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Search Functionality&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Uses field-weighted search&lt;/li>
&lt;li>Enriches results with document metadata&lt;/li>
&lt;li>Supports filtering by type or project&lt;/li>
&lt;li>Groups results by category (files, projects, people, etc.)&lt;/li>
&lt;li>Implements caching for improved performance&lt;/li>
&lt;li>Search statistics are generated to monitor performance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
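&lt;p>StatWrap itself is an Electron + React (JavaScript) application, but the directory-walking strategy above is language-agnostic. Here is a minimal Python sketch of the pruned recursive walk, with the skip list and the every-10-files progress logging described above (the names and extension list are illustrative, not from the StatWrap codebase):&lt;/p>

```python
import os

SKIP_DIRS = {"node_modules", ".git", ".statwrap"}
TEXT_EXTS = {".txt", ".md", ".py", ".r", ".csv", ".json"}

def find_indexable_files(root):
    # Walk the project tree, pruning skipped directories in place so the
    # walk never descends into them, and collect eligible text files.
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if os.path.splitext(name)[1].lower() in TEXT_EXTS:
                found.append(os.path.join(dirpath, name))
                if len(found) % 10 == 0:
                    print(f"indexed {len(found)} files...")
    return found
```

&lt;p>Pruning &lt;code>dirnames&lt;/code> in place (rather than filtering results afterwards) is what keeps the &amp;ldquo;brute-force&amp;rdquo; walk cheap: directories like &lt;code>node_modules&lt;/code> are never entered at all.&lt;/p>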
&lt;h2 id="challenges-and-end-term-goals">Challenges and End-Term Goals&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>In-Memory Index and Metadata Storage&lt;/strong>&lt;br>
Most JavaScript search libraries (like Fuse.js, Lunr, MiniSearch) store indexes entirely in memory, which can become problematic for large-scale datasets. A key challenge is designing a scalable solution that allows for disk persistence or lazy loading to prevent memory overflows.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Deciding the Weights Accordingly&lt;/strong>&lt;br>
An important challenge is tuning the relevance scoring by assigning appropriate weights to different aspects of the search, such as exact word matches, prefix matches, and typo tolerance. For instance, we prefer exact matches to be ranked higher than fuzzy or partial matches.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementing the Selected Library&lt;/strong>&lt;br>
Once a library is selected (based on speed, features, and compatibility with Electron + React), the next challenge is integrating it into StatWrap efficiently—ensuring local indexing, accurate search results, and smooth performance even with large projects.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Classifying Active and Past Projects in the User Interface&lt;/strong>&lt;br>
To improve navigation and search scoping, we plan to introduce three project sections in the interface: &lt;strong>Pinned&lt;/strong>, &lt;strong>Active&lt;/strong>, and &lt;strong>Past&lt;/strong> projects. This classification will help users prioritize relevant content while enabling smarter indexing strategies.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>Stay tuned for the next blog!&lt;/p></description></item><item><title>Midway Through GSoC</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</link><pubDate>Mon, 14 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</guid><description>&lt;h1 id="midway-through-gsoc">Midway Through GSoC&lt;/h1>
&lt;p>Hello everyone! I’m Pratham Devadiga, and I’m thrilled to share a midterm progress update on my &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/GcstSGAO" target="_blank" rel="noopener">GSoC 2025 project&lt;/a> with the Open Source Research Experience (OSRE). My project is focused on building the &lt;strong>first open-source billion-scale vector embeddings dataset&lt;/strong> from &lt;strong>real-world open source code&lt;/strong> to support benchmarking of Approximate Nearest Neighbor (ANN) algorithms and facilitate research in Retrieval-Augmented Generation (RAG).&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The goal of this project is to address a critical gap in the ecosystem: existing ANN benchmarks are either synthetic or limited in scale. With the explosion of code-focused LLMs and embedding models, there&amp;rsquo;s a pressing need for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High-volume, high-dimensional vector datasets&lt;/strong> built from real-world data (open-source codebases).&lt;/li>
&lt;li>&lt;strong>Open, reproducible benchmarks&lt;/strong> that reflect realistic RAG workloads.&lt;/li>
&lt;li>A dataset that can be used to evaluate &lt;strong>ANN libraries&lt;/strong> like FAISS, HNSW, and Annoy on massive and practical retrieval tasks.&lt;/li>
&lt;/ul>
&lt;p>Our approach is to use high-quality open-source code repositories to extract meaningful code chunks, encode them into vector embeddings using open models, and make these datasets publicly available with metadata for downstream benchmarking and analysis.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>We’ve made substantial foundational progress in the first half of the coding period. Key highlights:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tested multiple embedding models&lt;/strong> such as &lt;code>codeBERT&lt;/code>, &lt;code>MiniLM-L6-v2&lt;/code>, and &lt;code>all-mpnet-base-v2&lt;/code>, evaluating trade-offs in speed, dimensionality, and GPU memory.&lt;/li>
&lt;li>&lt;strong>Selected &lt;code>codebert-base&lt;/code>&lt;/strong> (768d) as the current model for phase one due to its stable performance and manageable resource footprint.&lt;/li>
&lt;li>Implemented and validated a complete &lt;strong>script pipeline&lt;/strong> to:
&lt;ul>
&lt;li>Traverse large open-source repositories.&lt;/li>
&lt;li>Extract and chunk code intelligently (functions, classes, modules).&lt;/li>
&lt;li>Encode code into embeddings and attach metadata (repo, file path, license).&lt;/li>
&lt;li>Store results efficiently in parquet and NumPy formats.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Tested all components&lt;/strong> of the pipeline on sample datasets using multi-GPU setups, ensuring compatibility and robustness.&lt;/li>
&lt;/ul>
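&lt;p>As a rough illustration of the chunk-and-embed steps above (this is not our pipeline code: the hash-based &lt;code>embed&lt;/code> is a deterministic stand-in for the real &lt;code>codebert-base&lt;/code> encoder, and only Python sources are handled here), the idea looks like this:&lt;/p>

```python
import ast
import hashlib

def chunk_functions(source, path):
    """Split a Python source file into function-level chunks,
    attaching metadata for downstream benchmarking."""
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "text": ast.get_source_segment(source, node),
                "file": path,
                "name": node.name,
            })
    return chunks

def embed(text, dim=8):
    """Placeholder embedding: hash-derived floats in [0, 1].
    The real pipeline produces 768-d codebert-base vectors."""
    digest = hashlib.sha256(text.encode()).digest()
    return [byte / 255.0 for byte in digest[:dim]]
```

&lt;p>The real pipeline additionally records repo and license metadata and writes the vectors out in Parquet and NumPy formats for sharded access.&lt;/p>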
&lt;h2 id="challenges-and-learnings">Challenges and Learnings&lt;/h2>
&lt;p>Building a billion-scale dataset from real-world codebases is no small task. Here&amp;rsquo;s what we’ve encountered and learned along the way:&lt;/p>
&lt;h3 id="1-multi-gpu-pipeline-design">1. Multi-GPU Pipeline Design&lt;/h3>
&lt;p>Naively parallelizing the embedding process caused memory overflow and deadlocks due to model reloading across processes. We refactored the code using &lt;code>torch.multiprocessing&lt;/code> and pinned GPU contexts to avoid such issues, improving throughput on multi-GPU machines.&lt;/p>
&lt;h3 id="2-embedding-trade-offs">2. Embedding Trade-offs&lt;/h3>
&lt;p>We experimented with larger models but found that their generation time and memory use were too high to be practical in early phases. This helped us narrow down to scalable configurations for initial dataset generation.&lt;/p>
&lt;h3 id="3-preparing-for-scale">3. Preparing for Scale&lt;/h3>
&lt;p>Although the embeddings are not generated yet, all scripts are now &lt;strong>modular, parallelized, and reproducible&lt;/strong>, ensuring a smooth transition to billion-scale data generation in the second half.&lt;/p>
&lt;h2 id="whats-next">What’s Next&lt;/h2>
&lt;p>The second half of the project will focus on:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Scaling up embedding generation&lt;/strong> to &amp;gt;1B code chunks across hundreds of open-source repositories.&lt;/li>
&lt;li>&lt;strong>Running benchmarks&lt;/strong> using FAISS, HNSW, and Annoy on these embeddings.&lt;/li>
&lt;li>&lt;strong>Releasing the dataset&lt;/strong> on Hugging Face and AWS S3 with sharded access and metadata.&lt;/li>
&lt;li>&lt;strong>Writing a detailed benchmarking report&lt;/strong> comparing speed, accuracy, and memory trade-offs across ANN algorithms.&lt;/li>
&lt;/ul>
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>This journey so far has taught me a lot about building large-scale ML pipelines, managing real-world compute constraints, and ensuring reproducibility for research-grade datasets. I&amp;rsquo;m grateful to my mentor &lt;strong>Jayjeet Chakraborty&lt;/strong> and the OSRE team for their continuous support and guidance.&lt;/p>
&lt;p>Excited for the next half, where the real scale begins!&lt;/p>
&lt;p>Stay tuned for updates. You can find more about the project on my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">OSRE project page&lt;/a>.&lt;/p></description></item><item><title>CarbonCast</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250710-tanushsavadi/</link><pubDate>Thu, 10 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/carboncast/20250710-tanushsavadi/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/carboncast">CarbonCast project&lt;/a>, my &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/7yvAix3k" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Professor Abel Souza aims to build an API that makes carbon intensity forecasts more accessible and actionable.&lt;/p>
&lt;p>Building on CarbonCast, the API will give users a standard way to access and act on energy data when optimizing their electricity consumption. Before diving into the details of the project, I’d like to share a bit about my background.&lt;/p>
&lt;h2 id="about-me">About Me&lt;/h2>
&lt;p>Hi, I’m Tanush—a rising senior at the University of Massachusetts Amherst, majoring in Computer Science and Mathematics and graduating in Spring 2026. Currently, I’m an AI Intern for the Commonwealth of Massachusetts Department of Unemployment Assistance, where I’m developing an end-to-end retrieval-augmented generation (RAG) chatbot on AWS.&lt;/p>
&lt;p>In the past, I’ve contributed to CarbonCast in a different capacity, designing a user interface to help visualize carbon intensity forecasts. I also worked at MathWorks as a Machine Learning Intern, where I collaborated in an Agile environment to design and deploy predictive models that improved precision torque control and dynamic responsiveness in motor-driven robotic and industrial systems.&lt;/p>
&lt;p>I’m excited to bring these experiences to this year’s GSoC project, where I’ll be building tools to make carbon data more accessible and actionable for everyone.&lt;/p>
&lt;h2 id="what-is-carboncast">What is CarbonCast?&lt;/h2>
&lt;p>CarbonCast is a Python-based machine-learning library designed to forecast the carbon intensity of electrical grids. Carbon intensity refers to the amount of carbon emitted per kilowatt-hour (kWh) of electricity consumed. The current version of CarbonCast delivers accurate forecasts in numerous regions by using a region’s historical energy production data, the time of day and year, and weather forecasts as features.&lt;/p>
&lt;p>However, there is no easy way to access, visualize, and utilize the data through a standard interface. In addition, important information is left out of the current forecasts. For instance, electricity grids often import electricity from neighboring regions, so the carbon intensity of consumption depends on both local generation and imports, and each energy source benefits from a tailored predictive mechanism. Consequently, a carbon optimization solution trying to reduce the emissions caused by its electricity consumption will benefit more from following a consumption-based carbon intensity signal.&lt;/p>
&lt;p>Unlike other third-party carbon services, CarbonCast’s model is open-sourced, allowing users to study, understand, and improve its behavior. This transparency invites public collaboration and innovation. It also contrasts sharply with proprietary services that often withhold both the logic behind their models and the data they are trained on.&lt;/p>
&lt;h2 id="why-this-matters">Why This Matters&lt;/h2>
&lt;p>Electricity usage is one of the largest contributors to carbon emissions globally. Carbon intensity—the amount of carbon emitted per kilowatt-hour of electricity consumed—varies with how electricity is generated (for example, coal versus solar) and with demand. With better visibility into when the grid is cleaner, individuals and organizations can shift their energy consumption to lower-carbon periods, which often coincide with lower prices. This enables everyday energy optimizations without compromising comfort or productivity.&lt;/p>
&lt;p>By improving CarbonCast’s accessibility and functionality, we are helping people and institutions answer questions like:&lt;/p>
&lt;ul>
&lt;li>When is the best time to charge my EV to reduce environmental impact?&lt;/li>
&lt;li>Can I run my energy-hungry server jobs when the electricity is cheaper?&lt;/li>
&lt;li>How do I actually reduce my emissions without guessing?&lt;/li>
&lt;/ul>
&lt;p>By providing clear, accurate forecasts of carbon intensity, CarbonCast helps users make informed decisions that reduce their emissions and optimize their energy footprint.&lt;/p>
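&lt;p>For instance, answering the EV question boils down to scanning a forecast for the cleanest window. A minimal sketch (the function name and units are hypothetical, assuming the API serves hourly carbon intensity values in gCO2/kWh):&lt;/p>

```python
def greenest_window(forecast, duration):
    """Start index and average intensity of the contiguous
    `duration`-hour window with the lowest mean carbon intensity."""
    windows = []
    for start in range(len(forecast) - duration + 1):
        windows.append((sum(forecast[start:start + duration]), start))
    total, start = min(windows)  # lowest total wins; ties break on the earliest start
    return start, total / duration
```

&lt;p>Charging for two hours against a forecast like &lt;code>[400, 380, 200, 150, 160, 300]&lt;/code> would start at hour 3, when the grid is cleanest.&lt;/p>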
&lt;h2 id="what-im-building">What I’m Building&lt;/h2>
&lt;p>The plan for this summer is to develop the backend API services for CarbonCast, with a focus on two major goals:&lt;/p>
&lt;h3 id="geographical-expansion">Geographical Expansion&lt;/h3>
&lt;p>I am extending CarbonCast’s compatibility to support more regional electricity grids. Each model will be customized for local grid behavior and renewable energy characteristics. This involves tuning the model pipeline to adapt to each region’s energy mix, weather patterns, and reporting granularity.&lt;/p>
&lt;h3 id="system-refactoring-and-modularity">System Refactoring and Modularity&lt;/h3>
&lt;p>The original CarbonCast system was built as a research artifact. To refine it into production-grade infrastructure, I am refactoring the codebase to improve modularity. This makes it easier to plug in new regions, update forecasting algorithms, and integrate new data sources.&lt;/p>
&lt;h2 id="impact-beyond-research">Impact Beyond Research&lt;/h2>
&lt;p>The paper that inspired this project, &lt;em>Multi-day Forecasting of Electric Grid Carbon Intensity using Machine Learning&lt;/em>, pioneered the idea of forecasting carbon intensity over multiple days using a hierarchical machine learning model. This goes beyond the typical 24-hour day-ahead models that are common in the industry and allows for better planning and longer-term decision-making.&lt;/p>
&lt;p>CarbonCast builds directly on that foundation by transforming research into practical, real-world infrastructure. It is an open-source library that anyone can run, contribute to, and benefit from. Whether you&amp;rsquo;re a developer building carbon-aware applications, a policymaker working on grid decarbonization strategies, or a sustainability-conscious individual looking to reduce your carbon footprint, CarbonCast provides the tools to make informed, impactful choices.&lt;/p>
&lt;h2 id="looking-ahead">Looking Ahead&lt;/h2>
&lt;p>I am excited to contribute to a project that blends machine learning, systems engineering, sustainability, and public impact. My goal is to help make it easier for everyone to see, understand, and act on their carbon footprint while also providing the &amp;ldquo;visibility&amp;rdquo; people need to take meaningful, informed actions.&lt;/p></description></item><item><title>Rectilinear Floorplans in OpenROAD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20250710-gschaitanya/</link><pubDate>Thu, 10 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20250710-gschaitanya/</guid><description>&lt;h1 id="google-summer-of-code-25-enabling-rectilinear-floorplanning-in-openroad">Google Summer of Code ‘25: Enabling Rectilinear Floorplanning in OpenROAD&lt;/h1>
&lt;p>This summer, under the guidance of my mentors Eder Monteiro and Augusto Berndt at the OpenROAD project, I am implementing support for &lt;strong>polygonal (specifically rectilinear) die shapes&lt;/strong> in OpenROAD’s floorplanning flow.&lt;/p>
&lt;p>Here’s a link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/mcv3Hbgk" target="_blank" rel="noopener">proposal&lt;/a>&lt;/p>
&lt;h2 id="what-is-openroad-and-why-polygonal-floorplans">What is OpenROAD and why polygonal floorplans?&lt;/h2>
&lt;p>&lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a> is a fully autonomous RTL-to-GDS digital layout toolchain. It delivers an autonomous, no-human-in-the-loop (NHIL) flow with a 24-hour turnaround from RTL to GDSII for rapid design exploration and physical design implementation.&lt;/p>
&lt;p>Until now, OpenROAD primarily supported rectangular die shapes in its floorplanning. This limits its use for &lt;strong>advanced packaging, 2.5D/3D ICs, or irregular chiplet-based designs&lt;/strong>, where non-rectangular dies are increasingly common.&lt;/p>
&lt;p>By extending the floorplanner to handle &lt;strong>rectilinear (non-rectangular, but still axis-aligned) dies&lt;/strong>, we open the door for a broader class of cutting-edge VLSI layouts.&lt;/p>
&lt;hr>
&lt;h2 id="motivation-and-what-gap-does-this-fill">Motivation and what gap does this fill?&lt;/h2>
&lt;p>From my background in electronics engineering, I’ve seen that advanced packaging increasingly calls for unusual die shapes, whether for stacking, interposers, or simply to optimize area and thermal profiles.&lt;/p>
&lt;p>Right now, OpenROAD’s inability to handle these non-rectangular dies is a blocker for certain modern flows. My project directly addresses this by:&lt;/p>
&lt;ul>
&lt;li>Extending the Tcl and internal APIs to accept and validate polygonal die/core shapes.&lt;/li>
&lt;li>Modifying the row generation and site placement algorithms to conform to polygonal boundaries.&lt;/li>
&lt;li>Ensuring that all downstream modules (global placement, detailed placement, routing) can consume the new floorplan data structures without any issues.&lt;/li>
&lt;/ul>
&lt;h2 id="my-approach--engineering-plan">My approach &amp;amp; engineering plan&lt;/h2>
&lt;p>My work focuses on maintaining &lt;strong>robustness and backward compatibility&lt;/strong>, while introducing this major new feature.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Floorplan Input&lt;/strong>: Users can now specify die/core shapes as sequences of x/y coordinates (&lt;code>-die_polygon&lt;/code> and &lt;code>-core_polygon&lt;/code>) directly from the Tcl interface.&lt;/li>
&lt;li>&lt;strong>Data structures&lt;/strong>: Extend OpenROAD&amp;rsquo;s internal representations to store arbitrary rectilinear polygons and propagate these safely through the design pipeline.&lt;/li>
&lt;li>&lt;strong>Row Generation&lt;/strong>: Develop a new function (&lt;code>make_polygon_rows&lt;/code>) to fill polygonal die areas with standard cell rows, properly clipped to the die shape.&lt;/li>
&lt;li>&lt;strong>Verification&lt;/strong>: Build rigorous regression and sanity checks. This includes both internal checks and external DEF writeouts that can be visualized in other tools.&lt;/li>
&lt;li>&lt;strong>Testing &amp;amp; Benchmarks&lt;/strong>: Prepare a suite of testcases (including complex T-shaped and L-shaped dies) to validate correctness.&lt;/li>
&lt;/ul>
&lt;p>I’m especially careful to keep the existing rectangular flow untouched. The new features only engage when the user explicitly specifies polygonal options.&lt;/p>
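&lt;p>The geometric core of polygonal row generation can be sketched in a few lines (a toy even-odd scanline, not OpenROAD’s actual &lt;code>make_polygon_rows&lt;/code> implementation): for each row’s height, the x-intervals inside the die are bounded by the vertical edges that the scanline crosses.&lt;/p>

```python
def row_spans(polygon, y):
    """x-intervals where the horizontal scanline at height y lies inside
    a rectilinear polygon, via even-odd counting of vertical edges."""
    xs = []
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        # a vertical edge is crossed when y falls in its half-open span
        if x1 == x2 and y >= min(y1, y2) and max(y1, y2) > y:
            xs.append(x1)
    xs.sort()
    # consecutive crossing pairs bound the inside intervals
    return list(zip(xs[0::2], xs[1::2]))
```

&lt;p>For an L-shaped die, rows low in the shape span the full width while rows in the upper notch are clipped, which is the kind of behavior our T-shaped and L-shaped testcases exercise.&lt;/p>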
&lt;hr>
&lt;h2 id="acknowledgments">Acknowledgments&lt;/h2>
&lt;p>I’m deeply grateful to my mentors from the OpenROAD community, Eder Monteiro and Augusto Berndt, for their invaluable guidance. I’m also excited to contribute this to an open-source EDA project that’s shaping the future of accessible hardware design.&lt;/p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="alt text" srcset="
/report/osre25/openroad/20250710-gschaitanya/image_hu3983f54b17a85e6cd101e7150f8b73ae_74031_d7183809c372e014da49924eaae1da70.webp 400w,
/report/osre25/openroad/20250710-gschaitanya/image_hu3983f54b17a85e6cd101e7150f8b73ae_74031_3cff291d5bd14f465d9f073e03f2cf08.webp 760w,
/report/osre25/openroad/20250710-gschaitanya/image_hu3983f54b17a85e6cd101e7150f8b73ae_74031_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/openroad/20250710-gschaitanya/image_hu3983f54b17a85e6cd101e7150f8b73ae_74031_d7183809c372e014da49924eaae1da70.webp"
width="460"
height="460"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
</description></item><item><title>Auditing Skin Tone Bias in Text-to-Image Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/fairface/07102025-marzia/</link><pubDate>Wed, 09 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/fairface/07102025-marzia/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/sd-bias">Stable Diffusion Bias Project&lt;/a>, my &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io/blob/main/content/project/osre25/ucsc/fair-face/index.md" target="_blank" rel="noopener">proposal&lt;/a> focuses on evaluating &lt;strong>bias in visual outputs of generative AI models&lt;/strong>, particularly &lt;strong>skin tone bias&lt;/strong> in Stable Diffusion.&lt;/p>
&lt;p>The goal is to analyze how models render people based on prompts like “a doctor” or “a homeless person,” and whether certain prompts systematically result in lighter or darker skin tones—even when race isn’t explicitly mentioned.&lt;/p>
&lt;hr>
&lt;h3 id="-what-ive-done-so-far">🧪 What I’ve Done So Far&lt;/h3>
&lt;ul>
&lt;li>Designed a prompt template covering six social categories (e.g., criminal justice, profession, socioeconomic)&lt;/li>
&lt;li>Generated image datasets using Stable Diffusion with varied seeds&lt;/li>
&lt;li>Built a preprocessing pipeline to estimate &lt;strong>melanin values&lt;/strong> from generated faces&lt;/li>
&lt;li>Created early visualizations showing &lt;strong>distributional trends in skin tone&lt;/strong>&lt;/li>
&lt;li>Identified early evidence of bias in prompts linked to status or wealth&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="-tools-and-methods">⚒️ Tools and Methods&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Stable Diffusion&lt;/strong> for controlled image generation&lt;/li>
&lt;li>&lt;strong>BioSkin pipeline&lt;/strong> to extract melanin metrics&lt;/li>
&lt;li>&lt;strong>Fitzpatrick skin type approximation&lt;/strong> (in development as a validation method)&lt;/li>
&lt;li>Python-based data analysis and prompt auditing&lt;/li>
&lt;li>&lt;a href="https://github.com/openai/CLIP" target="_blank" rel="noopener">openai/CLIP&lt;/a> and BLIP for optional image-text alignment scoring&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="-what-im-seeing">🔍 What I’m Seeing&lt;/h3>
&lt;p>Preliminary results show that even neutral prompts like “a portrait of a professor” tend to favor lighter skin tones, while prompts such as “a manual laborer” or “a homeless person” skew toward darker tones. These trends are &lt;strong>not always obvious to the human eye&lt;/strong>, which is why quantitative skin tone analysis is essential.&lt;/p>
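&lt;p>One common quantitative proxy for skin tone in the literature is the Individual Typology Angle (ITA), computed from CIELAB values. Here is a minimal sketch of how such a proxy could complement the melanin pipeline (using ITA here is my own illustrative choice, and the coarse bands at commonly cited cutoffs are an approximation, not a clinical Fitzpatrick assessment):&lt;/p>

```python
import math

def ita_degrees(l_star, b_star):
    """Individual Typology Angle from CIELAB lightness L* and b*."""
    return math.degrees(math.atan2(l_star - 50.0, b_star))

def ita_category(ita):
    """Coarse skin-tone bands at commonly cited ITA cutoffs."""
    bands = [(55, "very light"), (41, "light"), (28, "intermediate"),
             (10, "tan"), (-30, "brown")]
    for threshold, label in bands:
        if ita > threshold:
            return label
    return "dark"
```

&lt;p>Aggregating such categories per prompt makes distributional shifts visible even when individual images look unremarkable to the eye.&lt;/p>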
&lt;p>I&amp;rsquo;m now exploring whether prompt engineering (e.g., adding “fair,” “dark-skinned,” or “diverse” descriptors) can help mitigate these imbalances.&lt;/p>
&lt;hr>
&lt;h3 id="-whats-next">🚧 What’s Next&lt;/h3>
&lt;ul>
&lt;li>Expand dataset to 60 prompts across 6 categories&lt;/li>
&lt;li>Incorporate alternate T2I models (Midjourney, DALL·E 3)&lt;/li>
&lt;li>Write a technical report and reproducible evaluation framework&lt;/li>
&lt;li>Submit a short paper or workshop proposal to a fairness or ethics venue&lt;/li>
&lt;/ul>
&lt;hr></description></item><item><title>Benchmarking the Future: Exploring High-Speed Scientific Data Streaming</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/</link><pubDate>Sun, 06 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/</guid><description>&lt;p>Hello! I&amp;rsquo;m &lt;a href="https://ucsc-ospo.github.io/author/ankitkat042/" target="_blank" rel="noopener">Ankit Kumar&lt;/a>, and although I&amp;rsquo;m a bit late with this introduction post due to a busy period filled with interviews and college formalities, I&amp;rsquo;m excited to share my journey with the OSRE 2025 program and the fascinating world of scientific data streaming.&lt;/p>
&lt;h2 id="about-me">About Me&lt;/h2>
&lt;p>I&amp;rsquo;m currently pursuing my BTech degree at the Indraprastha Institute of Information Technology Delhi (IIIT Delhi) and am based in New Delhi, India. As I approach graduation, I&amp;rsquo;m thrilled to be working on a project that perfectly aligns with my interests in systems and networking.&lt;/p>
&lt;p>My passion for technology has led me through various experiences:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Software Developer at CloudLabs&lt;/strong>: I worked at a platform founded by &lt;a href="https://faculty.iiitd.ac.in/~sumit/" target="_blank" rel="noopener">Dr. Sumit J Darak&lt;/a> that facilitates remote access to actual FPGA boards on a slot basis, making hardware experimentation accessible to students worldwide.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data Mining Intern at &lt;a href="https://tasktracker.in/" target="_blank" rel="noopener">TaskTracker.in&lt;/a>&lt;/strong>: This experience gave me insights into large-scale data processing and analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Undergraduate Researcher&lt;/strong>: Currently working under &lt;a href="https://faculty.iiitd.ac.in/~mukulika/" target="_blank" rel="noopener">Dr. Mukulika Maity&lt;/a> on benchmarking QUIC and TCP protocols across different environments including bare metal, virtual machines, and containers.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>I chose this OSRE project because it represents an incredible opportunity to work with some of the best minds in the industry at Argonne National Laboratory (ANL) while diving deep into cutting-edge networking technologies.&lt;/p>
&lt;h2 id="my-project-scistream-performance-analysis">My Project: SciStream Performance Analysis&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream">SciStream project&lt;/a>, I&amp;rsquo;m focusing on two critical aspects of high-performance scientific data streaming:&lt;/p>
&lt;h3 id="1-tcpudp-performace-benchmarking">1. TCP/UDP Performance Benchmarking&lt;/h3>
&lt;p>I&amp;rsquo;m conducting comprehensive benchmarking of SSH and TLS tunnels using various open-source tools and parameters. This work is crucial for understanding how different protocols and their overhead impact the performance of real-time scientific data streaming. The goal is to provide researchers with evidence-based recommendations for moving and processing their high-speed data without compromising performance.&lt;/p>
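&lt;p>As a toy illustration of what a throughput probe measures (the real benchmarks run tools like iperf3 across FABRIC and ESnet links, not a loopback socket), here is a minimal sender/receiver pair:&lt;/p>

```python
import socket
import threading
import time

def measure_throughput(payload_mb=8, chunk=65536):
    """Rough loopback TCP throughput in MB/s: a toy stand-in for
    benchmarking a tunnel with a tool like iperf3."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]
    total = payload_mb * 1024 * 1024

    def sender():
        out = socket.create_connection(("127.0.0.1", port))
        sent = 0
        buf = b"x" * chunk
        while total > sent:
            sent += out.send(buf)
        out.close()

    thread = threading.Thread(target=sender)
    thread.start()
    conn, _ = server.accept()
    start = time.perf_counter()
    received = 0
    while total > received:
        data = conn.recv(chunk)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    conn.close()
    server.close()
    thread.join()
    return received / (1024 * 1024) / elapsed
```

&lt;p>Tunnel overhead then shows up as the gap between a probe like this run directly and the same probe run through the SSH or TLS tunnel under test.&lt;/p>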
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="benchmarking_meme.png" srcset="
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_30852e26909ae3e8a70a243539d202b3.webp 400w,
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_cc337ea13c6aaff8c44a6cc4b452a3e3.webp 760w,
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_30852e26909ae3e8a70a243539d202b3.webp"
width="760"
height="754"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="2-quic-proxy-exploration">2. QUIC Proxy Exploration&lt;/h3>
&lt;p>I&amp;rsquo;m exploring different QUIC proxy implementations to understand their potential advantages over traditional TCP+TLS proxies in scientific workflows. QUIC, the protocol that powers modern web applications like YouTube, offers promising features for scientific data streaming, but comprehensive benchmarking is needed to validate its benefits.&lt;/p>
&lt;h2 id="working-with-cutting-edge-testbeds">Working with Cutting-Edge Testbeds&lt;/h2>
&lt;p>Currently, I&amp;rsquo;m conducting experiments using both the &lt;strong>&lt;a href="https://portal.fabric-testbed.net/" target="_blank" rel="noopener">FABRIC testbed&lt;/a>&lt;/strong> and &lt;strong>&lt;a href="https://www.es.net/" target="_blank" rel="noopener">ESnet testbed&lt;/a>&lt;/strong>. These platforms provide access to real high-speed network infrastructure, allowing me to test protocols and configurations under realistic conditions that mirror actual scientific computing environments.&lt;/p>
&lt;h2 id="the-team-experience">The Team Experience&lt;/h2>
&lt;p>These past two weeks have been incredibly rewarding, working alongside:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/alain-zhang-672086205/" target="_blank" rel="noopener">Alain Zhang&lt;/a>&lt;/strong> - my project mate from UC San Diego, and a cool guy.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/castroflavio/" target="_blank" rel="noopener">Flavio Castro&lt;/a>&lt;/strong> - my project mentor and manager, and my go-to person for issues; currently a research software development engineer at ANL.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.anl.gov/profile/joaquin-chung" target="_blank" rel="noopener">Joaquin Chung&lt;/a>&lt;/strong> - super mentor and the brains behind the project; his guidance has been extremely valuable.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.anl.gov/profile/rajkumar-kettimuthu" target="_blank" rel="noopener">Rajkumar Kettimuthu&lt;/a>&lt;/strong> - lead scientist on our project, whose comments on our paper critique are invaluable.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/seena-vazifedunn/" target="_blank" rel="noopener">Seena Vazifedunn&lt;/a>&lt;/strong> - Graduate Research Assistant at the University of Chicago, who asks relevant, probing questions during our report presentations and gives very insightful feedback.&lt;/li>
&lt;/ul>
&lt;p>The collaborative nature of this project has been fantastic, combining perspectives from different institutions and backgrounds to tackle complex networking challenges.&lt;/p>
&lt;p>Stay tuned for updates!&lt;/p>
&lt;hr>
&lt;p>&lt;em>This work is part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream">SciStream project&lt;/a> at Argonne National Laboratory, reimagining how scientific data moves across modern research infrastructure.&lt;/em>&lt;/p></description></item><item><title>Enhancing Reproducibility in RAG Frameworks for Scientific Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250625-wbq321/</link><pubDate>Wed, 25 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250625-wbq321/</guid><description>&lt;p>Hello, I&amp;rsquo;m Baiqiang. As part of the &lt;a href="https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/" target="_blank" rel="noopener">Enhancing Reproducibility in RAG Frameworks for Scientific Workflows&lt;/a> project, I am excited to introduce my work on a crucial challenge in modern computational science. My &lt;a href="https://www.overleaf.com/read/fcbxtpngdnhw#8cc2c8" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo at Pacific Northwest National Laboratory and Dongfang Zhao at the University of Washington aims to enhance the reproducibility of AI-driven scientific workflows.&lt;/p>
&lt;h3 id="the-problem-a-crisis-of-confidence-in-ai-for-science">The Problem: A Crisis of Confidence in AI for Science&lt;/h3>
&lt;p>Large Language Models (LLMs) are transforming scientific research, from accelerating literature reviews to generating novel hypotheses. However, their power is matched by their pitfalls: a tendency to &amp;ldquo;hallucinate&amp;rdquo; facts and a lack of transparency. Retrieval-Augmented Generation (RAG) was developed as a powerful solution, grounding LLM outputs in factual evidence retrieved from a specific knowledge base (like a database of scientific papers).&lt;/p>
&lt;p>But a hidden problem lurks within RAG: &lt;strong>non-determinism&lt;/strong>. The very first step of a RAG system—the similarity search that finds relevant documents—can produce different results even when asked the same question. Variations in indexing algorithms, data updates, or even the underlying software can change which documents are retrieved. For science, this is a critical flaw. If an experiment cannot be repeated with the same results, its conclusions cannot be trusted. This project tackles that challenge head-on.&lt;/p>
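&lt;p>Two simple metrics make this instability measurable: the overlap between two top-k retrieval runs, and the rank correlation of the documents they share. A sketch in plain Python (not tied to any particular ANN library):&lt;/p>

```python
def overlap_at_k(run_a, run_b, k):
    """Fraction of documents shared by two top-k retrieval lists."""
    return len(set(run_a[:k]).intersection(run_b[:k])) / k

def kendall_tau(run_a, run_b):
    """Kendall rank correlation over documents present in both runs:
    1.0 means identical ordering, -1.0 means fully reversed."""
    common = [doc for doc in run_a if doc in run_b]
    pos_b = {doc: run_b.index(doc) for doc in common}
    concordant = discordant = 0
    n = len(common)
    for i in range(n):
        for j in range(i + 1, n):
            # (i, j) keeps run_a order; check run_b's relative order
            if pos_b[common[i]] > pos_b[common[j]]:
                discordant += 1
            else:
                concordant += 1
    pairs = n * (n - 1) / 2
    return (concordant - discordant) / pairs if pairs else 1.0
```

&lt;p>A perfectly reproducible retrieval step scores 1.0 on both; anything lower quantifies exactly how much the results drift between runs.&lt;/p>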
&lt;h3 id="our-mission-forging-a-path-to-reproducible-rag">Our Mission: Forging a Path to Reproducible RAG&lt;/h3>
&lt;p>This project proposes a comprehensive solution to systematically identify, measure, and mitigate non-determinism in RAG frameworks. Our goal is to empower researchers to build and use AI tools with confidence.&lt;/p>
&lt;p>Our approach is built on four key pillars:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Systematic Analysis:&lt;/strong> We will conduct a deep dive into popular RAG components (like FAISS, ScaNN, and HNSW) to pinpoint the exact sources of randomness and variability.&lt;/li>
&lt;li>&lt;strong>Rigorous Benchmarking:&lt;/strong> We will develop a public, open-source benchmarking suite using standardized scientific datasets (from PubMed, arXiv, etc.). This will allow anyone to quantitatively measure the reproducibility of their own RAG pipeline using clear metrics like retrieval overlap and rank correlation.&lt;/li>
&lt;li>&lt;strong>Targeted Enhancements:&lt;/strong> Based on our findings, we will implement practical solutions, including:
&lt;ul>
&lt;li>Promoting deterministic algorithms and configurations.&lt;/li>
&lt;li>Building robust data versioning and provenance tracking tools (inspired by DVC and Git LFS).&lt;/li>
&lt;li>Creating tools for precise configuration management to capture the entire experimental setup.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Practical Guidance and Open Source Tools:&lt;/strong> We will distill our insights into comprehensive documentation, reusable code examples, and best practices. All tools and findings will be contributed back to the open-source community.&lt;/li>
&lt;/ol></description></item><item><title>From Friction to Flow: Why I'm Building Widgets for Reproducible Research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/jupyter-widgets/20250624-nbrewer/</link><pubDate>Tue, 24 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/jupyter-widgets/20250624-nbrewer/</guid><description>&lt;blockquote>
&lt;p>This summer, I’m building Jupyter Widgets to reduce friction in reproducible workflows on Chameleon. Along the way, I’m reflecting on what usability teaches us about the real meaning of reproducibility.&lt;/p>
&lt;/blockquote>
&lt;h2 id="supercomputing-competition-reproducibility-reality-check">Supercomputing Competition: Reproducibility Reality Check&lt;/h2>
&lt;p>My first reproducibility experience threw me into the deep end—trying to recreate a tsunami simulation with a GitHub repository, a scientific paper, and a lot of assumptions. I was part of a student cluster competition at the Supercomputing Conference, where one of our challenges was to reproduce the results of a prior-year paper. I assumed “reproduce” meant something like “re-run the code and get the same numbers.” But what we actually had to do was rebuild the entire computing environment from scratch—on different hardware, with different software versions, and vague documentation. I remember thinking: &lt;em>If all these conditions are so different, what are we really trying to learn by conducting reproducibility experiments?&lt;/em> That experience left me with more questions than answers, and those questions have stayed with me. In fact, they’ve become central to my PhD research.&lt;/p>
&lt;h2 id="summer-of-reproducibility-lessons-from-100-experiments-on-chameleon">Summer of Reproducibility: Lessons from 100+ Experiments on Chameleon&lt;/h2>
&lt;p>I’m currently a PhD student and research software engineer exploring questions around what computational reproducibility really means, and when and why it matters. I also participated in the &lt;strong>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/">Summer of Reproducibility 2024&lt;/a>&lt;/strong>, where I helped assess over 100 public experiments on the Chameleon platform. &lt;a href="https://doi.org/10.1109/e-Science62913.2024.10678673" target="_blank" rel="noopener">Our analysis&lt;/a> revealed key friction points—especially around usability—that don’t necessarily prevent reproducibility in the strictest sense, but introduce barriers in terms of time, effort, and clarity. These issues may not stop an expert from reproducing an experiment, but they can easily deter others from even trying. This summer’s project is about reducing that friction—some of which I experienced firsthand—by improving the interface between researchers and the infrastructure they rely on.&lt;/p>
&lt;h2 id="from-psychology-labs-to-jupyter-notebooks-usability-is-central-to-reproducibility">From Psychology Labs to Jupyter Notebooks: Usability is Central to Reproducibility&lt;/h2>
&lt;p>My thinking shifted further when I was working as a research software engineer at Purdue, supporting a psychology lab that relied on a complex statistical package. For most researchers in the lab, using the tool meant wrestling with cryptic scripts and opaque parameters. So I built a simple Jupyter-based interface to help them visualize input matrices, validate settings, and run analyses without writing code. The difference was immediate: suddenly, people could actually use the tool. It wasn’t just more convenient—it made the research process more transparent and repeatable. That experience was a turning point for me. I realized that usability isn’t a nice-to-have; it’s critical for reproducibility.&lt;/p>
&lt;h2 id="teaching-jupyter-widget-tutorials-at-scipy">Teaching Jupyter Widget Tutorials at SciPy&lt;/h2>
&lt;p>Since that first experience, I’ve leaned into building better interfaces for research workflows—especially using Jupyter Widgets. Over the past few years, I’ve developed and taught tutorials on how to turn scientific notebooks into interactive web apps, including at the &lt;strong>SciPy conference&lt;/strong> in &lt;a href="https://github.com/Jupyter4Science/scipy23-jupyter-web-app-tutorial" target="_blank" rel="noopener">2023&lt;/a> and &lt;a href="https://github.com/Jupyter4Science/scipy2024-jupyter-widgets-tutorial" target="_blank" rel="noopener">2024&lt;/a>. These tutorials go beyond the basics: I focus on building real, multi-tab applications that reflect the complexity of actual research tools. Teaching others how to do this has deepened my own knowledge of the widget ecosystem and reinforced my belief that good interfaces can dramatically reduce the effort it takes to reproduce and reuse scientific code. That’s exactly the kind of usability work I’m continuing this summer—this time by improving the interface between researchers and the Chameleon platform itself.&lt;/p>
&lt;h2 id="making-chameleon-even-more-reproducible-with-widgets">Making Chameleon Even More Reproducible with Widgets&lt;/h2>
&lt;p>This summer, I’m returning to Chameleon with a more focused goal: reducing some of the friction I encountered during last year’s reproducibility project. One of Chameleon’s standout features is its Jupyter-based interface, which already goes a long way toward making reproducibility more achievable. My work builds on that strong foundation by improving and extending interactive widgets in the &lt;strong>Python-chi&lt;/strong> library — making tasks like provisioning resources, managing leases, and tracking experiment progress on Chameleon even more intuitive. For example, instead of manually digging through IDs to find an existing lease, a widget could present your current leases in a dropdown or table, making it easier to pick up where you left off and avoid unintentionally reserving unnecessary resources. It’s a small feature, but smoothing out this kind of interaction can make the difference between someone giving up or trying again. That’s what this project is about.&lt;/p>
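&lt;p>To make the lease-picker idea concrete, here is a rough sketch of the small piece of logic behind such a widget. The lease dictionaries and the &lt;code>lease_options&lt;/code> helper are hypothetical; the actual python-chi calls for listing leases are omitted.&lt;/p>

```python
def lease_options(leases):
    """Build (label, value) pairs for a lease-picker dropdown.

    `leases` is a list of dicts with hypothetical keys modelled on what a
    reservation API might return; the real python-chi lookup is not shown.
    """
    options = []
    # Soonest-expiring leases first, so stale reservations surface quickly.
    for lease in sorted(leases, key=lambda l: l["end_date"]):
        label = f'{lease["name"]} (ends {lease["end_date"]})'
        options.append((label, lease["id"]))
    return options

leases = [
    {"id": "b1f2", "name": "gpu-experiment", "end_date": "2025-07-01"},
    {"id": "a9c4", "name": "baseline-run", "end_date": "2025-06-28"},
]
print(lease_options(leases))
```

&lt;p>In a notebook, pairs like these could feed an ipywidgets Dropdown, so a researcher picks a lease by name instead of copying IDs around.&lt;/p>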
&lt;h2 id="looking-ahead-building-for-people-not-just-platforms">Looking Ahead: Building for People, Not Just Platforms&lt;/h2>
&lt;p>I’m excited to spend the next few weeks digging into these questions—not just about what we can build, but how small improvements in usability can ripple outward to support more reproducible, maintainable, and accessible research. Reproducibility isn’t just about rerunning code; it’s about supporting the people who do the work. I’ll be sharing updates as the project progresses, and I’m looking forward to learning (and building) along the way. I’m incredibly grateful to once again take part in this paid experience, made possible by the 2025 Open Source Research Experience team and my mentors.&lt;/p></description></item><item><title>Applying MLOps to overcome reproducibility barriers in machine learning research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/</link><pubDate>Sun, 22 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/</guid><description>&lt;h3 id="about-the-project">About the Project&lt;/h3>
&lt;p>Hello! I&amp;rsquo;m Ahmed, an undergraduate Computer Science student at the University of Khartoum. I&amp;rsquo;m working on making machine learning research more reproducible for open-access research facilities like the &lt;a href="chameleoncloud.org">Chameleon testbed&lt;/a>, under the project &lt;a href="https://ucsc-ospo.github.io/project/osre25/nyu/mlops/" target="_blank" rel="noopener">Applying MLOps to overcome reproducibility barriers in machine learning research&lt;/a>, mentored by Prof. &lt;a href="https://ucsc-ospo.github.io/author/fraida-fund/" target="_blank" rel="noopener">Fraida Fund&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/mohamed-saeed/" target="_blank" rel="noopener">Mohamed Saeed&lt;/a>. As part of this project, my &lt;a href="https://docs.google.com/document/d/146PutdVy7cWSf_Gn8qcn0Ba2llMHjNtHIQzZ5a-xRvQ/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> aims to build a template generator that produces repositories for reproducible model training on the Chameleon testbed.&lt;/p>
&lt;h3 id="reproducibility">Reproducibility&lt;/h3>
&lt;blockquote>
&lt;p>&lt;em>We argue that unless reproducing research becomes as vital and mainstream part of scientific exploration as reading papers is today, reproducibility will be hard to sustain in the long term because the incentives to make research results reproducible won’t outweigh the still considerable costs&lt;/em>&lt;/p>
&lt;p>— &lt;a href="https://www.chameleoncloud.org/media/filer_public/25/18/25189b96-c3a2-4a55-b99b-c25322fe6682/reproducibility_on_chameleon-3.pdf" target="_blank" rel="noopener">Three Pillars of Practical Reproducibility Paper&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Acadamic code quality" srcset="
/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_1883a01619446991471adb625dc1a04c.webp 400w,
/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_a0629a8267968adb7dca83065a454987.webp 760w,
/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/codquality_huc392b48b950e52e3828e898b495a387e_63844_1883a01619446991471adb625dc1a04c.webp"
width="733"
height="646"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>Reproducibility in science refers to the ability to obtain consistent results using the same methods and conditions as a previous study: in simple words, if I use the same data and methodology that were used before, I should obtain the same results. This principle applies to almost every scientific field, including both machine learning research in science and core machine learning.&lt;/p>
&lt;h3 id="challenges-in-reproducibility">Challenges in Reproducibility&lt;/h3>
&lt;p>Just as the famous paper on the &lt;a href="https://www.nature.com/articles/d41586-019-00067-3" target="_blank" rel="noopener">reproducibility crisis in science&lt;/a> was published in 2016, similar discussions have appeared in the machine learning research setting. The paper &lt;a href="https://ojs.aaai.org/index.php/AAAI/article/view/11503" target="_blank" rel="noopener">State of the Art: Reproducibility in Artificial Intelligence&lt;/a>, after analyzing 400 papers from top AI conferences, found that only around 6% shared code and approximately 33% shared test data; in contrast, 54% shared only pseudocode (a summary of the algorithm).&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Percentage of papers documenting each variable for the three factors" srcset="
/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_98f72f91d5f4040ac93d46a70ece1f4c.webp 400w,
/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_af62a4672817798441065a29b632ce1d.webp 760w,
/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/variables_hu1dd3560d8f29bff068e6ba2a71eed30f_236032_98f72f91d5f4040ac93d46a70ece1f4c.webp"
width="760"
height="312"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The lack of software dependency management, proper version control, log tracking, and effective artifact sharing has made it very difficult to reproduce machine learning research.&lt;/p>
&lt;p>Reproducibility in machine learning is largely supported by MLOps practices. That is the case in industry, where most researchers are backed by software engineers responsible for setting up experimental environments or building tools that streamline the workflow. In academic settings, however, reproducibility remains a great challenge: researchers prefer to focus on coding and worry little about the complexities involved in configuring their experimental environment. As a result, the adoption and standardization of MLOps practices in academia progress slowly. The best way to ensure a seamless experience with MLOps is to make these capabilities easily accessible within researchers&amp;rsquo; workflows, by developing a tool that streamlines provisioning resources, environment setup, model training, and artifact tracking, and thereby ensures reproducible results.&lt;/p>
&lt;h3 id="proposed-solution">Proposed Solution&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Solution Architecture" srcset="
/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_eca83abc0b11e0d295efffaa464eaf53.webp 400w,
/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_4abd128ad260ffc60e4a7ebd623e4e32.webp 760w,
/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/nyu/mlops/06212025-alghali/Design_hue0a2172c7dadd98f1563084aefb8ce3c_266216_eca83abc0b11e0d295efffaa464eaf53.webp"
width="760"
height="547"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>We want researchers to spin up ML research instances or bare metal on the Chameleon testbed while the technical complexity of configuring and stitching everything together stays abstracted away. Users simply answer a few questions about their project info, frameworks, tools, features, and integrations (if any), and receive a fully generated, reproducible project. It contains a provisioning/infrastructure configuration layer for provisioning cloud resources; a Dockerfile to spin up services and persistent storage for data; and an ML tracking server that logs artifacts, metadata, environment configuration, system specification (GPU type), and Git status using MLflow, backed by PostgreSQL for storing metadata and an S3 MinIO bucket for storing artifacts. The ML code at its core is a containerized training environment backed by persistent storage for the datasets and the artifacts generated from the experiment; containerizing all of these ensures reproducibility. We aim to make the cloud experience easier by handling the configuration needed to set up the environment with a third-party framework, so that benchmarking datasets or other necessary components from services like Hugging Face and GitHub are seamlessly accessible from the container. For more technical details about the solution, you can read my proposal &lt;a href="https://docs.google.com/document/d/1ilm-yMEq-UTiJPGMl8tQc3Anl5cKM5RD2sUGInLjLbU" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
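&lt;p>As a small illustration of the tracking idea (not the actual template code), the sketch below collects the kind of run metadata described above using only the Python standard library; in the real design this snapshot would be logged to MLflow alongside the artifacts. The field names are illustrative.&lt;/p>

```python
import json
import platform
import subprocess
import sys

def capture_run_metadata():
    """Collect a minimal environment snapshot for a training run."""
    meta = {
        "python": sys.version.split()[0],   # interpreter version
        "platform": platform.platform(),    # OS and kernel details
        "machine": platform.machine(),      # CPU architecture
    }
    # Record the current Git commit if the code lives in a repository.
    try:
        meta["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            text=True, stderr=subprocess.DEVNULL).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        meta["git_commit"] = None  # not a repo, or git not installed
    return meta

print(json.dumps(capture_run_metadata(), indent=2))
```

&lt;p>Saving a snapshot like this next to every run is cheap, and it answers the &amp;ldquo;which code, on which machine&amp;rdquo; questions that make later reproduction possible.&lt;/p>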
&lt;p>By addressing these challenges, we can accelerate scientific discovery. This benefits not only those conducting the research but also the ones building on top of it in the future. I look forward to sharing more updates as the project progresses, and I welcome feedback from others interested in advancing reproducibility in ML research.&lt;/p></description></item><item><title>Building a Benchmarking Suite for Cache Performance Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-06-21-haochengxia/</link><pubDate>Sat, 21 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/harvard/cachebench/2025-06-21-haochengxia/</guid><description>&lt;p>Hi! I&amp;rsquo;m Haocheng Xia, a Computer Science student at the &lt;strong>University of Illinois Urbana-Champaign&lt;/strong>, passionate about the intersection of &lt;strong>machine learning and storage systems&lt;/strong>. Specifically, I&amp;rsquo;m keen on &lt;strong>workload analysis&lt;/strong> and &lt;strong>KV cache management for large language models&lt;/strong>.&lt;/p>
&lt;p>This summer, I&amp;rsquo;m happy to be a part of &lt;strong>SoR 2025&lt;/strong> and &lt;strong>OSRE 2025&lt;/strong>. I&amp;rsquo;m contributing to the &lt;strong>CacheBench&lt;/strong> project. My initiative, &lt;strong>&amp;lsquo;Building a Benchmarking Suite for Cache Performance Evaluation,&amp;rsquo;&lt;/strong> will create a robust platform. This involves extensive simulation of existing eviction algorithms using &lt;a href="https://github.com/cacheMon/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a>, developing microbenchmarks, and building a user-friendly platform for researchers to effortlessly evaluate novel cache designs. The ultimate goal is to establish a competitive leaderboard.&lt;/p>
&lt;p>My contributions will include a comprehensive dataset detailing simulated &lt;strong>miss ratios&lt;/strong> and &lt;strong>throughput&lt;/strong> of current cache eviction algorithms, an extension to &lt;a href="https://github.com/cacheMon/libCacheSim" target="_blank" rel="noopener">libCacheSim&lt;/a> for executing microbenchmarks both locally and on our online platform, and the creation and ongoing maintenance of a public web leaderboard. I&amp;rsquo;m grateful to be mentored by &lt;strong>Juncheng Yang&lt;/strong> and &lt;strong>Yazhuo Zhang&lt;/strong>.&lt;/p>
&lt;p>I&amp;rsquo;m thrilled to be part of building tools that empower users and advance the vision of a more decentralized web. Looking forward to a productive summer!&lt;/p></description></item><item><title>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/06192025-zeyu/</link><pubDate>Thu, 19 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uci/rag-st/06192025-zeyu/</guid><description>&lt;p>Hi everyone! My name is Zeyu, and I will be working on a project for a retrieval-enhanced generative framework for spatial transcriptomics during Google Summer of Code 2025. My project is called &lt;a href="https://ucsc-ospo.github.io/project/osre25/uci/rag-st/" target="_blank" rel="noopener">&lt;strong>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics&lt;/strong>&lt;/a> and is supervised by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a>. The goal is to develop a retrieval-enhanced generative framework for predicting spatial gene expression from histological images, making spatial transcriptomics more affordable and easier to implement. &lt;a href="https://drive.google.com/file/d/1_yUf1NlVRpBXERCqnOby7pgP4WrWrZsr/view?usp=sharing" target="_blank" rel="noopener">You can view my full proposal here!&lt;/a>&lt;/p>
&lt;p>Spatial transcriptomics enables the capture of gene expression profiles with spatial resolution, providing unprecedented insights into cellular organization and the tissue microenvironment. However, its widespread application is limited by high costs and technical complexity. In contrast, histological imaging is inexpensive and widely accessible. If we can accurately predict gene expression from histology images, then high-resolution spatial information can be inferred without costly experiments.&lt;/p>
&lt;p>My project will:&lt;/p>
&lt;ul>
&lt;li>Create a large-scale paired dataset combining HEST histology images with reference gene expression profiles from CellxGene.&lt;/li>
&lt;li>Design a novel RAG-ST architecture that enables both &lt;strong>interpretable&lt;/strong> and &lt;strong>controllable&lt;/strong> generation of spatial gene expression.&lt;/li>
&lt;li>Benchmark RAG-ST against current state-of-the-art models for image-based gene expression inference.&lt;/li>
&lt;li>Open-source the full codebase and provide comprehensive tutorials to support future research and development.&lt;/li>
&lt;/ul>
&lt;p>I am excited to contribute to this project and help broaden access to spatial transcriptomics insights through machine learning–powered predictions!&lt;/p>
&lt;p>Zeyu Zou&lt;/p>
&lt;p>Graduate Student, Northeastern University&lt;/p>
&lt;p>Zeyu Zou is a graduate student at Northeastern University, majoring in Analytics.&lt;/p></description></item><item><title>EnvGym – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/envgym/</link><pubDate>Mon, 16 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/envgym/</guid><description>&lt;p>Hello, my name is Yiming Cheng. I am a pre-doctoral researcher in Computer Science at the University of Chicago. I&amp;rsquo;m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community as a project leader. My project, &lt;a href="https://github.com/eaminc/envgym" target="_blank" rel="noopener">EnvGym&lt;/a>, focuses on developing an AI-driven system that automatically generates and configures reproducible computing environments based on natural language descriptions from artifact descriptions, Trovi artifacts, and research papers.&lt;/p>
&lt;p>The complexity of environment setup often hinders reproducibility in scientific computing. My project aims to bridge the knowledge gap between experiment authors and reviewers by translating natural language requirements into actionable, reproducible configurations using AI and NLP techniques.&lt;/p>
&lt;h3 id="project-overview">Project Overview&lt;/h3>
&lt;p>EnvGym addresses fundamental reproducibility barriers by:&lt;/p>
&lt;ul>
&lt;li>Using AI to translate natural language environment requirements into actionable configurations&lt;/li>
&lt;li>Automatically generating machine images deployable on bare metal and VM instances&lt;/li>
&lt;li>Bridging the knowledge gap between experiment authors and reviewers&lt;/li>
&lt;li>Standardizing environment creation across different hardware platforms&lt;/li>
&lt;/ul>
&lt;h3 id="june-10--june-16-2025">June 10 – June 16, 2025&lt;/h3>
&lt;p>Getting started with the project setup and initial development:&lt;/p>
&lt;ul>
&lt;li>I began designing the NLP pipeline architecture to parse plain-English descriptions (e.g., &amp;ldquo;I need Python 3.9, CUDA 11, and scikit-learn&amp;rdquo;) into structured environment &amp;ldquo;recipes&amp;rdquo;&lt;/li>
&lt;li>I set up the initial project repository and development environment&lt;/li>
&lt;li>I met with my mentor Prof. Kexin Pei to discuss the project roadmap and technical approach&lt;/li>
&lt;li>I started researching existing artifact descriptions from conferences and Trovi to understand common patterns in environment requirements&lt;/li>
&lt;li>I began prototyping the backend environment builder logic that will convert parsed requirements into machine-image definitions&lt;/li>
&lt;li>I explored Chameleon&amp;rsquo;s APIs for provisioning servers and automated configuration&lt;/li>
&lt;/ul>
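&lt;p>As a toy first pass at the parsing step (not EnvGym&amp;rsquo;s actual implementation), a simple regex sweep can already turn the example sentence into a structured recipe; the recipe schema here is an assumption for illustration.&lt;/p>

```python
import re

def parse_requirements(description):
    """Extract (package, version) pairs from a plain-English description.

    Matches tokens like "Python 3.9" or "CUDA 11", plus bare package names
    after common joining words; the resulting "recipe" dict is illustrative.
    """
    recipe = {}
    # Versioned requirements: a name followed by a dotted version number.
    for name, version in re.findall(r"([A-Za-z][\w-]+)\s+(\d+(?:\.\d+)*)", description):
        recipe[name.lower()] = version
    # Unversioned names after "and"/"need"/"install" (no digit following).
    for name in re.findall(r"(?:and|need|install)\s+([a-z][\w-]+)(?!\s*\d)", description):
        recipe.setdefault(name.lower(), "latest")
    return recipe

print(parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn"))
# {'python': '3.9', 'cuda': '11', 'scikit-learn': 'latest'}
```

&lt;p>A real pipeline would of course need an LLM or grammar to handle free-form phrasing, but even this toy shows the target shape: a structured recipe the backend builder can act on.&lt;/p>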
&lt;h3 id="next-steps">Next Steps&lt;/h3>
&lt;ul>
&lt;li>Continue developing the NLP component for requirement parsing&lt;/li>
&lt;li>Implement the core backend logic for environment generation&lt;/li>
&lt;li>Begin integration with Chameleon Cloud APIs&lt;/li>
&lt;li>Start building the user interface for environment specification&lt;/li>
&lt;/ul>
&lt;p>This is an exciting and challenging project that combines my interests in AI systems and reproducible research. I&amp;rsquo;m looking forward to building a system that will help researchers focus on their science rather than struggling with environment setup issues.&lt;/p>
&lt;p>Thanks for reading, I will keep you updated as I make progress on EnvGym!&lt;/p></description></item><item><title>Smart Environments – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250616-sam_huang/</link><pubDate>Mon, 16 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/smart_environments/20250616-sam_huang/</guid><description>&lt;p>Hi everyone, I&amp;rsquo;m Sam! I&amp;rsquo;m excited to be working with Argonne National Laboratory and SoR this summer on Smart Environments. Have you ever encountered a great open-source project and wanted to run it or use it locally, only to find that it&amp;rsquo;s such a headache to set up all the dependencies? Maybe your system version wasn&amp;rsquo;t correct, or a piece of software was outdated, or the dependencies were incompatible with something already on your machine?&lt;/p>
&lt;p>In comes EnvGym to save the day! We want EnvGym to be an agent that helps reproduce open-source projects by automatically setting up the environmental dependencies required to get them running. That&amp;rsquo;s what I will be working on for the rest of the summer! To make EnvGym work, we will be leveraging LLM agents to tackle the problem: EnvGym will read documentation, understand code structure, run commands to set up environments, and reflectively react to any errors and warnings.&lt;/p>
&lt;p>To build EnvGym, I have the following to-do&amp;rsquo;s in mind:&lt;/p>
&lt;ul>
&lt;li>Building a dataset that includes repos to be reproduced&lt;/li>
&lt;li>Establishing a baseline using current methods&lt;/li>
&lt;li>Implementing the actual EnvGym algorithm&lt;/li>
&lt;li>Testing EnvGym against baseline performance and iteratively improving it&lt;/li>
&lt;li>Deploying EnvGym to real-world use cases and gathering feedback&lt;/li>
&lt;/ul>
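&lt;p>The &amp;ldquo;reflectively react to errors&amp;rdquo; idea can be pictured as a run, observe, retry loop. The sketch below is illustrative only: the &lt;code>repair&lt;/code> callable stands in for the LLM step that reads stderr and rewrites the failing command.&lt;/p>

```python
import subprocess

def run_with_retry(commands, repair, max_attempts=3):
    """Run setup commands; on failure, let `repair` propose a fixed command.

    `repair(cmd, stderr)` is a placeholder for the agent's reflection step,
    which would inspect the error output and return a revised command.
    """
    for cmd in commands:
        for _attempt in range(max_attempts):
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            if result.returncode == 0:
                break  # command succeeded, move to the next one
            cmd = repair(cmd, result.stderr)  # reflect on the error and retry
        else:
            raise RuntimeError(f"could not run: {cmd}")

# Toy demo: "repair" a misspelled command after its first failure.
run_with_retry(["ech hello"], repair=lambda cmd, err: cmd.replace("ech ", "echo "))
print("environment ready")
```

&lt;p>Swapping the lambda for an LLM call that reads the captured stderr gives the reflective behaviour described above.&lt;/p>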
&lt;p>Here is the repo that we are working on:
&lt;a href="https://github.com/EaminC/EnvGym/tree/main" target="_blank" rel="noopener">https://github.com/EaminC/EnvGym/tree/main&lt;/a>&lt;/p>
&lt;p>More updates to come, thanks for reading!&lt;/p></description></item><item><title>Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250616-zahratm/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250616-zahratm/</guid><description>&lt;p>Hello, my name is Zahra Temori. I am a rising senior in Computer Science at the University of Delaware. I’m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community. My project, &lt;a href="https://github.com/ChameleonCloud/cc-snapshot" target="_blank" rel="noopener">cc-snapshot&lt;/a>, focuses on enhancing features that help researchers capture and share reproducible experimental environments within the Chameleon Cloud testbed.&lt;/p>
&lt;p>Detailed information about my project and my plans for the summer can be found in my &lt;a href="https://docs.google.com/document/d/1kFOFL-H4WrXF7EUuXzcHLZ2p5w_DxbbWOGi-IGx39LM/edit?tab=t.0" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;h3 id="june-10--june-14-2025">June 10 – June 14, 2025&lt;/h3>
&lt;p>Getting started with the first milestone and beginning to explore the Chameleon Cloud and the project:&lt;/p>
&lt;ul>
&lt;li>I began familiarizing myself with the Chameleon Cloud platform. I created an account and successfully accessed a project.&lt;/li>
&lt;li>I learned how to launch an instance and create a lease for using computing resources.&lt;/li>
&lt;li>I met with my mentor to discuss the project goals and outline the next steps.&lt;/li>
&lt;li>I experimented with the environment and captured a snapshot to understand the process.&lt;/li>
&lt;/ul>
&lt;p>It has been less than a week and I have already learned a lot, especially about the Chameleon Cloud and how it differs from other clouds like AWS. I am excited to learn more and make progress.&lt;/p>
&lt;p>Thanks for reading, I will keep you updated as I work :)&lt;/p></description></item><item><title>Building a Billion-Scale Vector Embeddings Dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</guid><description>&lt;h1 id="billion-vector-embeddings-dataset">Billion Vector Embeddings Dataset&lt;/h1>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">Billion-Scale Embeddings Dataset project&lt;/a>, my &lt;a href="GSoC-proposal.pdf">proposal&lt;/a> under the mentorship of &lt;strong>Jayjeet Chakraborty&lt;/strong> aims to create the first large-scale, real-world vector embeddings dataset—bridging the critical gap in Approximate Nearest Neighbor (ANN) benchmarks and Retrieval-Augmented Generation (RAG) systems.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Existing ANN benchmarks often fall short—they’re either synthetic (like SIFT) or too small-scale (≤1M vectors). With the rapid evolution of LLM-based vector search systems (e.g., OpenAI’s 3072d &lt;code>text-embedding-3-large&lt;/code>), there&amp;rsquo;s a growing need for:&lt;/p>
&lt;ul>
&lt;li>High-dimensional (&amp;gt;1000d), large-scale (&amp;gt;100M) embeddings&lt;/li>
&lt;li>Real-world distributions (Wikipedia-scale text)&lt;/li>
&lt;li>Open, reproducible benchmarks for the community&lt;/li>
&lt;/ul>
&lt;h2 id="project-goals">Project Goals&lt;/h2>
&lt;ul>
&lt;li>Generate &lt;strong>1 billion&lt;/strong> embeddings from English Wikipedia using open-source models.&lt;/li>
&lt;li>Create multiple dimensional variants: &lt;strong>1024d&lt;/strong>, &lt;strong>4096d&lt;/strong>, and &lt;strong>8192d&lt;/strong>.&lt;/li>
&lt;li>Deduplicate, compress, and store embeddings with rich metadata (URL, timestamps, models).&lt;/li>
&lt;li>Benchmark ANN performance on FAISS, HNSW, and Annoy.&lt;/li>
&lt;li>Distribute the dataset via HuggingFace &amp;amp; AWS S3 with shard-level access.&lt;/li>
&lt;/ul>
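&lt;p>ANN benchmarks of this kind are typically scored with recall@k against exact nearest neighbours. The brute-force sketch below shows the metric on toy vectors; the index-specific code for FAISS, HNSW, or Annoy is omitted, and the example data is invented.&lt;/p>

```python
def exact_knn(query, vectors, k):
    """Brute-force k-nearest neighbours by squared Euclidean distance."""
    def dist(v):
        return sum((q - x) ** 2 for q, x in zip(query, v))
    return sorted(range(len(vectors)), key=lambda i: dist(vectors[i]))[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate index returned."""
    return len(set(approx_ids).intersection(exact_ids)) / len(exact_ids)

vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
ground_truth = exact_knn((0.1, 0.1), vectors, k=2)   # exact top-2 neighbours
approx_result = [0, 3]  # pretend this came from an ANN index
print(recall_at_k(approx_result, ground_truth))      # 0.5
```

&lt;p>On a billion-vector dataset the ground truth has to be precomputed once per query set, which is exactly what shipping the dataset with reference neighbour lists enables.&lt;/p>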
&lt;h2 id="open-source-impact">Open Source Impact&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>ANN Libraries&lt;/strong>: Enable reproducible benchmarking for real-world workloads.&lt;/li>
&lt;li>&lt;strong>RAG Systems&lt;/strong>: Evaluate and optimize retrieval at scale using real Wikipedia text.&lt;/li>
&lt;li>&lt;strong>Researchers&lt;/strong>: Conduct large-scale studies on dimensionality, ANN accuracy, and compression trade-offs.&lt;/li>
&lt;/ul>
&lt;hr></description></item><item><title>Develop a clean and intuitive web-based interface for WildberryEye</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250615-sophietao127/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250615-sophietao127/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye">WildberryEye&lt;/a> project, my &lt;a href="./GSoC-proposal.pdf">proposal&lt;/a> under the mentorship of Isaac Espinosa aims to develop a clean, intuitive, and responsive web-based interface to support real-time pollinator detection, data visualization, and system configuration.&lt;/p>
&lt;p>WildberryEye leverages edge computing (Raspberry Pi 5) and object detection (YOLO) to monitor pollinators like bees and hummingbirds. This project focuses on developing a full-stack web interface to support real-time pollinator detection, data visualization, and system configuration. The development also includes real-time data extraction from the Raspberry Pi 5. The final result empowers researchers and contributors to engage with environmental data in an accessible and meaningful way.&lt;/p></description></item><item><title>Developing an Open Testbed for Edge Replication System Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250615-panjisri/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250615-panjisri/</guid><description>&lt;p>Hi, I&amp;rsquo;m Panji. I&amp;rsquo;m currently contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> under the mentorship of Fadhil I. Kurnia. You can find more details on the project proposal &lt;a href="https://drive.google.com/file/d/1CFT5CJJXbQlVPz8_A9Dxkjl7oRjESdli/view?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>The primary challenge we&amp;rsquo;re addressing is the current difficulty in fairly comparing different edge replication systems. To fix this, we&amp;rsquo;re trying to build a testing platform with four key parts. We&amp;rsquo;re collecting real data about how people actually use edge services, creating a tool that can simulate realistic user traffic across many locations, building a system that mimics network delays between hundreds of edge servers, and packaging everything into an open-source toolkit.&lt;/p>
&lt;p>This will let researchers test different coordination methods like EPaxos, Raft, and others using the same data and conditions. We hope this will help provide researchers with a more standardized way to evaluate their systems. We&amp;rsquo;re working with multiple programming languages and focusing on making complex edge computing scenarios accessible to everyone in the research community.&lt;/p>
&lt;p>One of the most interesting aspects of this project is tackling the challenge of creating realistic simulations that accurately reflect the performance characteristics different coordination protocols would exhibit in actual edge deployments. The end goal is to provide the research community with a standardized, reproducible environment for edge replication.&lt;/p></description></item><item><title>Implement Web Extensions &amp; System Settings Integration</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/peersky/2025-06-15-6cobi/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/peersky/2025-06-15-6cobi/</guid><description>&lt;p>Hi! I&amp;rsquo;m Hanzhong Liu, a Computer Science student at Fordham University with a minor in Business Administration. My interests lie in distributed systems, backend engineering, and decentralized tools—especially systems that prioritize user autonomy and privacy.&lt;/p>
&lt;p>This summer, I&amp;rsquo;m contributing to the &lt;a href="https://github.com/p2plabsxyz/peersky-browser" target="_blank" rel="noopener">Peersky&lt;/a> project as part of OSRE 2025 through Google Summer of Code. My project, &lt;strong>&amp;ldquo;Implement Web Extensions &amp;amp; System Settings Integration,&amp;rdquo;&lt;/strong> will add full support for local browser extensions in Peersky, allowing users to customize their experience without relying on centralized extension stores.&lt;/p>
&lt;p>Deliverables include an extension loader, drag-and-drop installation for &lt;code>.zip&lt;/code> and Git-based extensions, manifest validation, sandboxing, and a unified &lt;code>peersky://settings&lt;/code> page for managing everything from themes to privacy tools. Pre-installed extensions like uBlock Origin and DScan will be bundled by default.&lt;/p>
&lt;p>You can read my full &lt;a href="https://docs.google.com/document/d/1FQU2typchX08ly8lwk7nj8NcARNrTp0GomVICO4b53k/edit?tab=t.0#heading=h.nv2vmujhdrj" target="_blank" rel="noopener">proposal here&lt;/a>. My mentor for this project is &lt;strong>Akhilesh Thite&lt;/strong>.&lt;/p>
&lt;p>I&amp;rsquo;m excited to help build tools that empower users to take control of their browsing experience—and to contribute to the vision of a more decentralized web. Looking forward to the summer ahead!&lt;/p></description></item><item><title>Into the VR-Verse: My GSoC Adventure Begins!</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/brahma/06152025-kajaljotwani/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/brahma/06152025-kajaljotwani/</guid><description>&lt;p>Hello! I’m Kajal Jotwani, an undergraduate Computer Science student from India who is passionate about building creative, interactive technologies and contributing to open source. This summer, as part of Google Summer of Code 2025, I will be working on the Brahma / Allocentric WebXR Interfaces project under the mentorship of &lt;strong>Samir Ghosh&lt;/strong>. You can read my complete &lt;a href="https://docs.google.com/document/d/1Ne7ADVM72jRuxU7wzRYK8Hvp1zqCUviU0Fh1sTtRWe4/edit?usp=sharing" target="_blank" rel="noopener">proposal here.&lt;/a>&lt;/p>
&lt;p>This project focuses on creating a formalized framework for building collaborative and cross-platform WebXR-based experiences. For the first public release of Brahma, a lightweight open-source toolkit, our goal is to formalize the framework, create documentation, and implement example applications such as multi-user games and scientific visualizations. This will help make Brahma extensible and accessible to a wider developer community.&lt;/p>
&lt;p>I&amp;rsquo;m excited to be working on this project and will be documenting my journey, learnings, and progress here throughout the summer.&lt;/p></description></item><item><title>Introducing Scenic-RoboSuite Interface</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250616-sahil-tgs/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250616-sahil-tgs/</guid><description>&lt;p>Hey! I&amp;rsquo;m &lt;a href="https://sahiltgs.super.site/" target="_blank" rel="noopener">Sahil&lt;/a>, working on integrating Scenic with RoboSuite for GSoC 2025. My &lt;a href="https://sahiltgs.super.site/gsoc/uc-ospo-proposal" target="_blank" rel="noopener">project&lt;/a> is mentored by &lt;a href="https://ucsc-ospo.github.io/author/daniel-fremont/" target="_blank" rel="noopener">Daniel Fremont&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/eric-vin/" target="_blank" rel="noopener">Eric Vin&lt;/a> .&lt;/p>
&lt;p>I&amp;rsquo;m connecting &lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a> (a probabilistic programming language for scenarios) with &lt;a href="https://robosuite.ai/" target="_blank" rel="noopener">RoboSuite&lt;/a> (a robotics simulation framework). Basically, you write simple scenario descriptions and get complex 3D robot simulations automatically.&lt;/p>
&lt;p>Currently, as I&amp;rsquo;m building things and learning how Scenic works, I have been able to get the basic skeleton for the simulator interface working. I&amp;rsquo;ve implemented the simulator class and built a world model that can translate Scenic objects into RoboSuite&amp;rsquo;s simulator (which is MuJoCo-based). The interface now handles precise object placement in the world pretty well.&lt;/p>
&lt;p>One of the trickier parts was figuring out the translation logic between Scenic and RoboSuite. I managed to overcome this by building a system that automatically detects the shape of objects when moving between the two frameworks, which lays a foundation for more complex object mapping later on.&lt;/p>
&lt;p>I&amp;rsquo;ve also built some basic example scenarios to run and test with. Currently working on more complex examples and testing Scenic&amp;rsquo;s features like probabilistic object placement, constraint satisfaction, and spatial relationships between objects.&lt;/p>
&lt;p>In summary, the &amp;ldquo;Scenic to RoboSuite&amp;rdquo; part of the interface is pretty much done. For next week, I need to work on the &amp;ldquo;RoboSuite to Scenic&amp;rdquo; part - basically getting feedback and state information flowing back from the simulation. Achieving this will make a complete bridge and give us a working simulator interface, which is the first major milestone for the project.&lt;/p></description></item><item><title>Kolmogorov-Arnold-based Transformer for LLMs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250615-dentonjc/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/kallm/20250615-dentonjc/</guid><description>&lt;p>Project: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/UNL/KALLM">KALLM&lt;/a>&lt;/p>
&lt;p>Proposal: &lt;a href="https://krutsylo.neocities.org/share/pdf/KALLM_Public.pdf" target="_blank" rel="noopener">proposal&lt;/a>&lt;/p>
&lt;p>Mentors:&lt;/p>
&lt;ul>
&lt;li>Sai Suman Lamba Karanam&lt;/li>
&lt;li>Prof. Zahmeeth Sakkaff&lt;/li>
&lt;/ul>
&lt;p>I am modifying existing large language models to make them more efficient by replacing some of their layers with Kolmogorov-Arnold Network (KAN) modules. These KAN layers use compact univariate polynomial approximations, which can reduce parameter count and improve interpretability. The project explores how to integrate these layers into Transformers, and how far we can push this idea by combining or stacking KAN modules with different polynomial bases. The goal is to keep performance competitive while lowering computational costs.&lt;/p>
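&lt;p>As a minimal, self-contained sketch of the idea (this is not the project&amp;rsquo;s actual code; the layer name, shapes, and the choice of a Chebyshev basis are purely illustrative), a KAN-style replacement for a linear layer could look like this:&lt;/p>

```python
import torch
import torch.nn as nn

class ChebyKANLayer(nn.Module):
    """Toy KAN-style layer: each input feature passes through a learned
    univariate Chebyshev polynomial, and the results are mixed per output."""
    def __init__(self, in_features, out_features, degree=4):
        super().__init__()
        self.degree = degree
        # One coefficient per (input feature, output feature, basis term).
        self.coeffs = nn.Parameter(
            torch.randn(in_features, out_features, degree + 1) * 0.1)

    def forward(self, x):
        x = torch.tanh(x)  # squash inputs into [-1, 1], the Chebyshev domain
        # Chebyshev basis T_0..T_degree via T_k(x) = 2x*T_{k-1}(x) - T_{k-2}(x)
        T = [torch.ones_like(x), x]
        for _ in range(2, self.degree + 1):
            T.append(2 * x * T[-1] - T[-2])
        basis = torch.stack(T, dim=-1)  # (batch, in_features, degree + 1)
        # Sum over input features and basis terms for each output feature.
        return torch.einsum("bid,iod->bo", basis, self.coeffs)

# Drop-in replacement for a small feed-forward block:
ffn = nn.Sequential(ChebyKANLayer(16, 32), ChebyKANLayer(32, 16))
out = ffn(torch.randn(8, 16))
```

&lt;p>Swapping a module like this in for an existing model&amp;rsquo;s &lt;code>nn.Linear&lt;/code> layers is exactly the kind of manual, case-by-case integration the rest of this post discusses.&lt;/p>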
&lt;p>Beyond just speeding up training, I am exploring several other promising directions. One is testing whether transfer learning remains effective when replacing the linear layers of a pretrained LLM with KAN modules, or when swapping between different KAN configurations. I am also considering curriculum learning strategies that gradually increase KAN complexity during training. I have studied all major KAN implementations and early experiments with a custom Transformer architecture show encouraging results. However, I have found that most LLMs rely on functional-style activation definitions in PyTorch, which makes it difficult to build a universal wrapper. Because of this, KAN-based models will likely need to be integrated manually on a case-by-case basis.&lt;/p></description></item><item><title>Open Source Repository Browser</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/orb/20250615-param/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/orb/20250615-param/</guid><description>&lt;p>Hi! I&amp;rsquo;m Param Arora, a Computer Science student at Manipal Institute of Technology. My interests lie in backend engineering and AI.&lt;/p>
&lt;p>This summer, I&amp;rsquo;m contributing to the &lt;a href="https://github.com/UC-OSPO-Network/orb-showcase" target="_blank" rel="noopener">ORB&lt;/a> project as part of OSRE 2025 through Google Summer of Code.&lt;/p>
&lt;p>My project, &lt;strong>&amp;ldquo;UC Open Source Repository Browser [UC ORB]&amp;rdquo;&lt;/strong>, is a discovery platform that maps and categorizes open source projects across the UC system. It offers a comprehensive web interface with intuitive search, advanced filtering capabilities, responsive design, and integrated visualizations of project metrics.&lt;/p>
&lt;p>You can read my full &lt;a href="https://drive.google.com/file/d/1zELT9lxPhLCUs9Xyfb2nU9EPc_tjzVki/view?usp=sharing" target="_blank" rel="noopener">proposal here&lt;/a>.
My amazing mentor for this project is &lt;strong>Juanita Gomez&lt;/strong>.&lt;/p>
&lt;p>Looking forward to the summer ahead!&lt;/p></description></item><item><title>Scaling Sensor Networks for Environmental Research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/ents/15-06-2025-devansh/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/ents/15-06-2025-devansh/</guid><description>&lt;p>Hi! I’m &lt;strong>Devansh Kukreja&lt;/strong>, a researcher, indie developer, and Computer Science undergrad. I&amp;rsquo;m interested in distributed systems, orchestration services, and real-time data platforms. I enjoy working on systems that help different components connect and run smoothly at scale.&lt;/p>
&lt;p>This summer, I’m contributing to the &lt;a href="https://github.com/jlab-sensing/ENTS-backend" target="_blank" rel="noopener">&lt;strong>ENTS&lt;/strong>&lt;/a> (Environmental NeTworked Sensor) platform with the &lt;strong>University of California, Santa Cruz Open Source Program Office&lt;/strong> as part of &lt;strong>Google Summer of Code 2025&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>ENTS&lt;/strong> is an open-source web portal designed to collect, visualize, and analyze data from large-scale environmental sensor networks. It helps researchers and citizen scientists monitor readings such as soil moisture, temperature, current, and voltage, supporting real-time environmental research in outdoor settings.&lt;/p>
&lt;p>My work this summer focuses on improving the platform’s reliability and usability. I’ll be fixing visualization bugs, enhancing chart synchronization, making data point selection more intuitive, and improving error handling. Alongside that, I’m building a &lt;strong>Logger Registration System&lt;/strong> that lets users easily add and configure their data loggers, with potential support for over-the-air provisioning via &lt;strong>The Things Network (TTN)&lt;/strong> for LoRaWAN-based devices.&lt;/p>
&lt;p>You can check out my full &lt;a href="https://drive.google.com/file/d/1CA1ZCTmh0NY0Yu3-ohsmJ3xgSm3ON7by/view?usp=sharing" target="_blank" rel="noopener">proposal here&lt;/a>. I’m grateful to be mentored by &lt;strong>Colleen Josephson&lt;/strong>, &lt;strong>John Madden&lt;/strong>, and &lt;strong>Alec Levy&lt;/strong>, who are guiding the project with incredible insight and support.&lt;/p>
&lt;p>By the end of the summer, ENTS will be a more stable, user-friendly, and extensible platform—better equipped to support environmental research at scale. I&amp;rsquo;m super excited to learn, build, and contribute to something meaningful!&lt;/p></description></item><item><title>Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uutah/type-narrowing/20250615-sivasathyaseelan/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uutah/type-narrowing/20250615-sivasathyaseelan/</guid><description>&lt;p>Hello! I’m Siva Sathyaseelan D N, a pre-final year B.Tech + M.Tech Engineering student at IIT BHU, Varanasi, India. With a deep-rooted passion for software development and scientific computing, I thrive at the intersection of code and real-world problem-solving. For two years, I’ve engaged in open-source work across scientific simulation, blockchain, and cloud-native technologies, through hobby projects, hackathons, internships, and an LFX mentorship. I will be working on &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uutah/type-narrowing/">Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/bennn">Ben Greenman&lt;/a>. 
&lt;a href="https://docs.google.com/document/d/1QcfiOWQQBxTW3YnkCmgfz-xHwLGad4OuCMjyphbaz54/edit?usp=sharing" target="_blank" rel="noopener">My proposal can be viewed here!&lt;/a>&lt;/p></description></item><item><title>Building a Simulator for Benchmarking Replicated Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250613-mchan/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250613-mchan/</guid><description>&lt;p>Hi, I&amp;rsquo;m Michael. I&amp;rsquo;m currently contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fadhil-kurnia/">Fadhil Kurnia&lt;/a>. You can find more details on the project proposal &lt;a href="https://drive.google.com/file/d/1LQCPu1h9vXAbdL6AX_E9S43dsIOndTyW/view?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>We aim to create a system that tests and evaluates the performance of different consensus protocols and consistency models under the same application and workload. Both are exercised against various replicated black-box applications. Essentially, the testbed can deploy any arbitrary stateful application across multiple machines (nodes) as long as it is packaged as a Docker image. The consensus protocol synchronizes the stateful part of the application (in most cases, the database). The goal is that, by the end of this project, the testbed will provide the functionality and abstractions needed to implement and test new consensus protocols.&lt;/p>
&lt;p>One major challenge is handling replication across the running Docker containers. Generally, the services deployable in this system fall into two types:&lt;/p>
&lt;ol>
&lt;li>A deterministic application, which always returns the same output for the same input (e.g., a simple CRUD app)&lt;/li>
&lt;li>A non-deterministic application, which may return different outputs for the same input (e.g., an LLM that may return a different response to the same prompt)&lt;/li>
&lt;/ol>
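&lt;p>How each type might be replicated can be sketched in a few lines of Python (illustrative only; &lt;code>Replica&lt;/code> and the function names below are not the testbed&amp;rsquo;s actual API):&lt;/p>

```python
# Toy model of the two replication strategies for the two application types.

class Replica:
    """A single node holding the stateful part of an application."""
    def __init__(self):
        self.state = {}

    def apply(self, request):
        key, value = request
        self.state[key] = value
        return value

def replicate_deterministic(replicas, request):
    # Deterministic app: ship the request to every node; each node
    # re-executes it and deterministically converges to the same state.
    return [r.apply(request) for r in replicas]

def replicate_nondeterministic(replicas, request):
    # Non-deterministic app: execute once on the leader, then ship the
    # resulting state, since re-execution could diverge across nodes.
    leader, followers = replicas[0], replicas[1:]
    result = leader.apply(request)
    for f in followers:
        f.state = dict(leader.state)
    return result

replicas = [Replica() for _ in range(3)]
replicate_deterministic(replicas, ("x", 1))
```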
&lt;p>These two application types require different consensus-protocol implementations. For a deterministic application, since every request always yields the same response (and the same changes inside the application database), the replication protocol can replicate the request itself to all nodes. For a non-deterministic application, the replication protocol instead synchronizes the state of the database directly, since the same request may produce a different response.&lt;/p></description></item><item><title>Kicking Off Intelligent Observability for Seam: My OSRE 2025 Journey</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250614-manish-reddy/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsd/seam/intelligent-observability/20250614-manish-reddy/</guid><description>&lt;p>Hi! I’m &lt;strong>Manish K Reddy&lt;/strong> (&lt;a href="https://github.com/kredd2506" target="_blank" rel="noopener">@kredd2506&lt;/a>), a graduate student based in the United States, and I’m excited to join the OSRE 2025 cohort. This summer, I’ll be working with the &lt;a href="https://www.sdsc.edu/" target="_blank" rel="noopener">San Diego Supercomputer Center (SDSC)&lt;/a> and the &lt;a href="https://nrp.ai/" target="_blank" rel="noopener">National Research Platform (NRP)&lt;/a> on a project that blends my interests in machine learning, cloud systems, and real-world impact.&lt;/p>
&lt;p>The &lt;a href="https://nrp.ai/" target="_blank" rel="noopener">National Research Platform (NRP)&lt;/a> has moved beyond its original vision as a “ScienceDMZ data freeway” and evolved into a distributed cloud supercomputer, empowering research and education across more than 50 institutions. SDSC, located at UC San Diego, is recognized internationally for driving innovation in data, supercomputing, and advanced cyberinfrastructure.&lt;/p>
&lt;p>&lt;strong>My project&lt;/strong>, &amp;ldquo;&lt;em>Intelligent Observability for Seam: A GenAI Approach&lt;/em>&amp;rdquo; focuses on building an ML-powered service for NRP. The goal is to analyze monitoring data (starting with Prometheus metrics), automatically detect anomalies, and use generative AI (GenAI) for human-readable explanations and root-cause analysis. This will help researchers and operators solve problems faster and keep complex research systems running smoothly.&lt;/p>
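&lt;p>As a first approximation of that pipeline, here is a deliberately simple sketch (the metric name, window size, and threshold are invented for illustration, and the real service will use proper models rather than a rolling z-score):&lt;/p>

```python
import statistics

def zscore_anomalies(values, window=10, threshold=3.0):
    """Flag indices whose deviation from a trailing window exceeds threshold."""
    anomalies = []
    for i in range(window, len(values)):
        hist = values[i - window:i]
        mu, sigma = statistics.mean(hist), statistics.pstdev(hist)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# A steady (fake) CPU-usage series with one spike at the end.
series = [0.30, 0.31, 0.29, 0.30, 0.32, 0.31, 0.30, 0.29,
          0.31, 0.30, 0.30, 0.31, 0.29, 0.30, 0.31, 0.95]
spikes = zscore_anomalies(series)

# Each flagged point can then be handed to an LLM for a human-readable
# explanation, e.g. via a prompt like (metric name is illustrative):
prompt = (f"Metric node_cpu_usage jumped to {series[spikes[0]]} at sample "
          f"{spikes[0]}; suggest likely root causes.")
```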
&lt;p>I am especially grateful to my lead mentor &lt;a href="https://ucsc-ospo.github.io/author/mohammad-firas-sada/" target="_blank" rel="noopener">Mohammad Firas Sada&lt;/a>, who is personally guiding me throughout this project. I also want to thank Jeffrey Weekley and Derek Weitzel for their support and guidance.&lt;br>
You can read my &lt;a href="https://summerofcode.withgoogle.com/media/user/e7a9ade92bcf/proposal/gAAAAABoTeP59B2JlNoLcurxCTBvCS0T9by5Tv8ce1Hs6PB629g9rgzeb_8UrJTZfgpdagnHs5NjUtyYlanFb99wPxpTWjWSgwwToS5qh5u_YUfp9p6IzyE=.pdf" target="_blank" rel="noopener">initial proposal here (PDF)&lt;/a>.&lt;/p>
&lt;hr>
&lt;h3 id="genai-driven-observability-for-nrp">GenAI-Driven Observability for NRP&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Machine Learning, Observability, DevOps, High Performance Computing, LLMs, GenAI, Distributed Systems&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Prometheus, Docker, Kubernetes, FastAPI, PyTorch, Pandas, LLM APIs, scikit-learn, PostgreSQL&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> Mohammad Firas Sada, Jeffrey Weekley, Derek Weitzel&lt;/li>
&lt;/ul>
&lt;hr>
&lt;p>This summer, I’m looking forward to:&lt;/p>
&lt;ul>
&lt;li>Delivering an open-source anomaly detection tool for NRP&lt;/li>
&lt;li>Building GenAI features for better explanations and root-cause analysis&lt;/li>
&lt;li>Learning from my mentors and contributing to a vibrant open science community&lt;/li>
&lt;/ul>
&lt;p>Thanks for reading, and I’m looking forward to sharing my journey and progress in the coming weeks!&lt;/p>
&lt;hr></description></item><item><title>LINQS: Autograder (LLM Detection)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/autograder/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/autograder/</guid><description>&lt;h1 id="linqs-autograder-gsoc-25">LINQS: Autograder (GSoC &amp;lsquo;25)&lt;/h1>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/project/osre25/ucsc/autograder/index.md">LINQS: Autograder (LLM Detection)&lt;/a> my &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/jxBUpvoM" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Eriq Augustine, Lucas Ellenberger, and Lise Getoor aims to build a tool for AI plagiarism detection in code.&lt;/p>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>Academic institutions are facing new sets of challenges in &lt;strong>maintaining academic integrity&lt;/strong> with the rise of Large Language Models and tools like ChatGPT and GitHub Copilot, and their easier accessibility to students. Students are increasingly using these tools for assistance with their coursework, especially in &lt;strong>programming assignments&lt;/strong>.&lt;/p>
&lt;p>While these tools are useful for purposes such as brainstorming, research, and drafting, their use to complete assignments often crosses ethical boundaries and makes it difficult to &lt;strong>uphold fairness in grading and ensure students are truly learning&lt;/strong>.&lt;/p>
&lt;p>AI-generated code often lacks unique identifiers, rendering &lt;strong>traditional plagiarism detectors&lt;/strong> like MOSS &lt;strong>ineffective&lt;/strong> in detecting AI-generated code. That’s why there is a &lt;strong>need for better systems&lt;/strong> that can assess whether code was AI generated by spotting underlying patterns.&lt;/p>
&lt;h2 id="project-overview">Project Overview:&lt;/h2>
&lt;p>This is the problem that I am working to address with my project &lt;strong>‘LLM Detection’&lt;/strong>.&lt;/p>
&lt;p>I aim to build a system that helps academic institutions ensure fairness and integrity in students&amp;rsquo; work.
To accomplish this goal, I will be working on two tasks:&lt;/p>
&lt;ul>
&lt;li>Building a tool which determines whether a given piece of code was written by AI or not.&lt;/li>
&lt;li>Designing and implementing a mechanism to compute a confidence score that indicates the likelihood of AI involvement in the code.&lt;/li>
&lt;/ul>
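&lt;p>To make the second task concrete, here is a deliberately naive sketch of turning stylistic features into a confidence score (the features, weights, and bias are invented for illustration; the actual detector will be far more principled):&lt;/p>

```python
import math

def stylistic_features(code):
    """Two toy features: comment density and normalized average line length."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    n = max(len(lines), 1)
    comment_ratio = sum(ln.strip().startswith("#") for ln in lines) / n
    avg_len = sum(len(ln) for ln in lines) / n
    return [comment_ratio, avg_len / 80.0]

def ai_confidence(code, weights=(2.5, 1.0), bias=-1.5):
    """Logistic combination of features into a score strictly between 0 and 1."""
    score = bias + sum(w * f for w, f in zip(weights, stylistic_features(code)))
    return 1.0 / (1.0 + math.exp(-score))

snippet = "# Compute the total\n# of the list\ntotal = sum([1, 2, 3])\n"
confidence = ai_confidence(snippet)
```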
&lt;p>This tool can discourage students from copying or completing entire assignments using AI tools, &lt;strong>encouraging honest and independent work&lt;/strong>.&lt;/p>
&lt;p>(Read my full GSoC proposal here: &lt;a href="https://drive.google.com/file/d/1skTVhcrEMAAwc6XzYQ0w3_uVRLxz0IB9/view?usp=sharing" target="_blank" rel="noopener">Proposal&lt;/a>)&lt;/p>
&lt;h2 id="about-me">About me:&lt;/h2>
&lt;p>Hey there!&lt;/p>
&lt;p>My name is Anvi Kohli, I am a senior majoring in Computer Science and AI from India. This summer I will be contributing to the Autograder project by the LINQS Lab, under the guidance of Eriq Augustine, Lucas Ellenberger, and Lise Getoor.&lt;/p>
&lt;p>A problem-solver at heart, I love to brainstorm, solve, and optimize complex issues. One example was reaching the grand finals of the Smart India Hackathon, where we became the third-best team nationwide with our app, “PM Poshan”, built to digitize the monitoring and functioning of the mid-day meal scheme in India. It gave me the opportunity to improve my versatility and exposed me to all stages of the product development cycle.&lt;/p>
&lt;p>I have hands-on experience in a multitude of domains such as AI/Data Science, cloud, full-stack development, and DevOps. Within AI, I have worked in GenAI, Computer Vision, Deep Learning and Classical Machine Learning. Apart from this, I have a strong interest in entrepreneurship, travelling, and cooking.&lt;/p></description></item><item><title>MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250614-rohan-babbar/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/mpi/20250614-rohan-babbar/</guid><description>&lt;p>Hi Everyone,&lt;/p>
&lt;p>I’m Rohan Babbar from Delhi, India. This summer, I’m excited to be working with the Argonne National Laboratory and the Chameleon Cloud community. My &lt;a href="https://ucsc-ospo.github.io/project/osre25/uchicago/mpi/" target="_blank" rel="noopener">project&lt;/a> focuses on developing an MPI Appliance to support reproducible High-Performance Computing (HPC) research on the Chameleon testbed.&lt;/p>
&lt;p>For more details about the project and the planned work for the summer, you can read my proposal &lt;a href="https://docs.google.com/document/d/1iOx95-IcEOSVxpOkL20-jT5SSDOwBiP78ysSUNpRwXs/edit?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;h3 id="-community-bonding-period">👥 Community Bonding Period&lt;/h3>
&lt;p>Although the project officially started on June 2, 2025, I made good use of the community bonding period beforehand.&lt;/p>
&lt;ul>
&lt;li>I began by getting access to the Chameleon testbed, familiarizing myself with its features and tools.&lt;/li>
&lt;li>I experimented with different configurations to understand the ecosystem.&lt;/li>
&lt;li>My mentor, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ken-raffenetti/">Ken Raffenetti&lt;/a>, and I had regular check-ins to align our vision and finalize our milestones, many of which were laid out in my proposal.&lt;/li>
&lt;/ul>
&lt;h3 id="-june-2--june-14-2025">🔧 June 2 – June 14, 2025&lt;/h3>
&lt;p>Our first milestone was to build a base image with MPI pre-installed. For this:&lt;/p>
&lt;ul>
&lt;li>We decided to use &lt;a href="https://spack.io/" target="_blank" rel="noopener">Spack&lt;/a>, a flexible package manager tailored for HPC environments.&lt;/li>
&lt;li>The image includes multiple MPI implementations, allowing users to choose the one that best suits their needs and switch between them using simple &lt;a href="https://lmod.readthedocs.io/en/latest/" target="_blank" rel="noopener">Lua Module&lt;/a> commands.&lt;/li>
&lt;/ul>
&lt;p>📌 That’s all for now! Stay tuned for more updates in the next blog.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>StatWrap: Cross-Project Searching and Classification using Local Indexing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250614-debangi29/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/northwestern/statwrap/20250614-debangi29/</guid><description>&lt;p>Hello👋! I am &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/debangi-ghosh/">Debangi Ghosh&lt;/a>, currently pursuing a degree in Mathematics and Computing at IIT (BHU) Varanasi, India. This summer, I will be working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/">StatWrap: Cross-Project Searching and Classification using Local Indexing&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>. You can view my &lt;a href="https://drive.google.com/file/d/1dxyBP2oMJwYDCKyIWzr465zNmm6UWtnI/view?usp=sharing" target="_blank" rel="noopener">project proposal&lt;/a> for more details.&lt;/p>
&lt;p>My project aims to address the challenges in project navigation and discoverability by integrating a robust full-text search capability within the user interface. Instead of relying on basic keyword-based search—where remembering exact terms can be difficult—we plan to implement a natural language-based full-text search. This approach involves two main stages: indexing, which functions like creating a searchable map of the content, and searching, which retrieves relevant information from that map. We will evaluate and compare available open-source libraries to choose and implement the most effective one.
In addition, my project aims to enhance project organization by introducing a new classification system that clearly distinguishes between “Active” and “Past” projects in the user interface. This will improve clarity, reduce clutter, and provide a more streamlined experience as the number of projects grows.&lt;/p>
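&lt;p>A deliberately tiny illustration of the two stages, indexing and searching (not the library we will eventually choose, which would add stemming, ranking, and natural-language handling):&lt;/p>

```python
from collections import defaultdict

def build_index(docs):
    """Indexing: map each token to the set of project IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Searching: return the projects containing every query token."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results = results.intersection(index.get(token, set()))
    return results

# Hypothetical project descriptions, just for the demo.
docs = {
    "p1": "linear regression on clinical trial data",
    "p2": "survey analysis of clinical outcomes",
}
idx = build_index(docs)
matches = search(idx, "clinical data")
```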
&lt;p>Stay tuned for updates on my progress in the coming weeks! 🚀&lt;/p></description></item><item><title>WildBerryEye: Mechanical Design &amp; Weather-Resistant Enclosure</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250614-teolangan/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/wildberryeye/20250614-teolangan/</guid><description>&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/teolangan">Teodor Langan&lt;/a>, an undergraduate student currently pursuing a Robotics Engineering degree at the University of California, Santa Cruz. This Summer, I&amp;rsquo;ll be working on developing the hardware for the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/">WildBerryEye&lt;/a> project, mentored by &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/content/authors/caiespin">Carlos Isaac Espinosa&lt;/a>. Here is my &lt;a href="https://drive.google.com/file/d/1DfZLWl3ccZk3ss9yMP6oL9dpsyypRBDA/view?usp=sharing" target="_blank" rel="noopener">project proposal&lt;/a>!&lt;/p>
&lt;p>My project focuses on tackling the hardware challenge for WildBerryEye, an open-source ecological monitoring platform built on Raspberry Pi. To reliably support the system&amp;rsquo;s real-time object detection, a robust, weather-resistant camera enclosure is needed to protect its electronics in the field. To address this, I will be designing and prototyping a modular, 3D-printable camera case using FreeCAD this Summer. The case will protect electrical components from rain and dust while incorporating proper ventilation and heat dissipation. The entire model will be fully open-source, allowing easy adoption and modification by the community. The work will include multiple rounds of field testing to refine the design under realistic conditions. Ultimately, my project aims to deliver a detailed open-source FreeCAD model, full assembly documentation, and a user guide.&lt;/p>
&lt;p>I&amp;rsquo;m excited to see what we can learn throughout the development of my project!&lt;/p></description></item><item><title>Improving AI Data Pipelines in AIDRIN: A Privacy-Centric and Multimodal Expansion</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</link><pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/aidrin/20250612-harish_balaji/</guid><description>&lt;p>⏱️ Reading time: 4–5 minutes&lt;/p>
&lt;p>Hi 👋&lt;/p>
&lt;p>I’m Harish Balaji, a Master’s student at NYU with a focus on Artificial Intelligence, Machine Learning, and Cybersecurity. I’m especially interested in building scalable systems that reflect responsible AI principles. For me, data quality isn’t just a technical detail. It’s a foundational aspect of building models that are reliable, fair, and reproducible in the real world.&lt;/p>
&lt;p>This summer, I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of Google Summer of Code 2025. I’m grateful to be working under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the &lt;a href="https://crd.lbl.gov/divisions/scidata/sdm/" target="_blank" rel="noopener">Scientific Data Management Group&lt;/a> at Lawrence Berkeley National Laboratory (LBNL).&lt;/p>
&lt;p>AIDRIN is an open-source framework that helps researchers and practitioners evaluate whether a dataset is truly ready to be used in production-level AI workflows. From fairness to privacy, it provides a structured lens through which we can understand the strengths and gaps in our data.&lt;/p>
&lt;h2 id="why-this-work-matters">Why this work matters&lt;/h2>
&lt;p>In machine learning, one principle always holds true:&lt;/p>
&lt;blockquote>
&lt;p>&amp;ldquo;Garbage in, garbage out.&amp;rdquo;&lt;/p>
&lt;/blockquote>
&lt;p>Even the most advanced models can underperform or amplify harmful biases if trained on incomplete, imbalanced, or poorly understood data. This is where AIDRIN steps in. It provides practical tools to assess datasets across key dimensions like privacy, fairness, class balance, interpretability, and support for multiple modalities.&lt;/p>
&lt;p>By making these characteristics measurable and transparent, AIDRIN empowers teams to make informed decisions early in the pipeline. It helps ensure that datasets are not only large or complex, but also trustworthy, representative, and purpose-fit.&lt;/p>
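&lt;p>To make one of those dimensions concrete, here is a minimal sketch of a class-balance readiness check. This is a hypothetical illustration (normalized Shannon entropy of the label distribution), not AIDRIN&amp;rsquo;s actual implementation or API:&lt;/p>

```python
import math
from collections import Counter

def class_balance(labels):
    """Normalized Shannon entropy of the label distribution.

    Returns 1.0 for a perfectly balanced dataset and values near 0.0
    for a highly imbalanced one. (Hypothetical readiness metric;
    AIDRIN's own checks may differ.)
    """
    counts = Counter(labels)
    if len(counts) == 1:
        return 0.0  # a single class carries no balance at all
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))

# A 50/50 dataset scores 1.0; a 99/1 split scores far lower.
balanced = class_balance(["cat"] * 50 + ["dog"] * 50)
skewed = class_balance(["cat"] * 99 + ["dog"] * 1)
```

&lt;p>Reducing a dataset property to a single bounded score like this is what makes readiness comparable across datasets and easy to surface in a report.&lt;/p>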
&lt;h2 id="my-focus-this-summer">My focus this summer&lt;/h2>
&lt;p>As part of my GSoC 2025 project, I’ll be focusing on extending AIDRIN’s evaluation capabilities. A big part of this involves strengthening its support for privacy metrics and designing tools that can handle non-tabular datasets, such as image-based data.&lt;/p>
&lt;p>The goal is to expand AIDRIN’s reach without compromising on interpretability or ease of use. More technical insights and updates will follow in the next posts as the summer progresses.&lt;/p>
&lt;h2 id="what-comes-next">What comes next&lt;/h2>
&lt;p>As the AI community continues to evolve, there’s a growing shift toward data-centric practices. I believe frameworks like AIDRIN are essential for helping us move beyond the question of &lt;em>&amp;ldquo;Does the model work?&amp;rdquo;&lt;/em> toward a deeper and more meaningful one: &lt;em>&amp;ldquo;Was the data ready in the first place?&amp;rdquo;&lt;/em>&lt;/p>
&lt;p>Over the next few weeks, I’ll be working on development, testing, and integration. I’m excited to contribute to a tool that emphasizes transparency and reproducibility across the AI lifecycle, and to share lessons and ideas with others who care about responsible AI.&lt;/p>
&lt;p>If you’re exploring similar challenges or working in the space of dataset evaluation and readiness, I’d love to connect and exchange thoughts. You can also read my full GSoC 2025 proposal below for more context around the project scope and vision:&lt;/p>
&lt;p>👉 &lt;a href="https://drive.google.com/file/d/1RUyU2fHkc8GZ9vTj5SUr6jj84ZaRUvNt/view" target="_blank" rel="noopener">Read my GSoC 2025 proposal here&lt;/a>&lt;/p>
&lt;p>&lt;em>This is the first in a 3-part blog series documenting my GSoC journey with AIDRIN. Stay tuned for technical updates and behind-the-scenes insights as the summer unfolds!&lt;/em>&lt;/p></description></item><item><title>Reproducibility of Interactive Notebooks in Distributed Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad/</link><pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad/</guid><description>&lt;p>Hello! I am Raza, currently a Ph.D. student in Computer Science at DePaul University. This summer, I will be working on reproducibility of notebooks in distributed environments, mentored by Prof. &lt;a href="https://ucsc-ospo.github.io/author/tanu-malik/" target="_blank" rel="noopener">Tanu Malik&lt;/a>. Here is a summary of my project proposal.&lt;/p>
&lt;p>Interactive notebooks are web-based systems which enable encapsulating code, data, and their outputs for sharing and reproducibility. They have gained wide popularity in scientific computing due to their ease of use and portability. However, reproducing notebooks in different target environments remains challenging because notebooks do not carry the computational environment in which they are executed. This becomes even more challenging in distributed cluster environments where a notebook must be prepared to run on multiple nodes. In this project, we plan to (i) extend &lt;a href="https://github.com/radiant-systems-lab/Flinc" target="_blank" rel="noopener">FLINC&lt;/a>, an open-source user-space tool for distributed environments such that it can package notebook executions into notebook containers for execution and sharing across distributed environments, and (ii) integrate the extended Flinc with &lt;a href="https://cctools.readthedocs.io/en/stable/taskvine/" target="_blank" rel="noopener">TaskVine&lt;/a>, which provides the framework and orchestration to enable distributed notebook execution in high performance computing environments.&lt;/p>
&lt;p>&lt;a href="https://docs.google.com/document/d/1ilm-yMEq-UTiJPGMl8tQc3Anl5cKM5RD2sUGInLjLbU" target="_blank" rel="noopener">You can read my complete proposal here.&lt;/a>&lt;/p>
&lt;p>I am excited to work on this project and learn from the experience here!&lt;/p></description></item><item><title>Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</link><pubDate>Sun, 08 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/lbl/pylops-mpi/20250608-tharit/</guid><description>&lt;h1 id="google-summer-of-code-25-optimizing-and-benchmarking-gpu-collective-communication-of-pylops-mpi-with-nccl">Google Summer of Code ‘25: Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL&lt;/h1>
&lt;p>My project aims to introduce GPU-to-GPU collective communication calls using Nvidia&amp;rsquo;s NCCL to &lt;a href="https://github.com/PyLops/pylops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a>, an extension of the powerful &lt;a href="https://github.com/PyLops/pylops" target="_blank" rel="noopener">PyLops&lt;/a> library.&lt;/p>
&lt;p>I&amp;rsquo;m incredibly grateful for this opportunity and excited to be mentored by two HPC experts, Yuxi Hong from Lawrence Berkeley National Laboratory and Matteo Ravasi from ShearWater GeoServices.&lt;/p>
&lt;p>Here&amp;rsquo;s also the link to my original &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/C2XSZp2E" target="_blank" rel="noopener">proposal&lt;/a>&lt;/p>
&lt;h2 id="what-is-pylops-mpi-and-nccl-">What is PyLops-MPI and NCCL ?&lt;/h2>
&lt;p>PyLops is a Python library that provides a rich collection of linear operators to solve inverse problems. Its MPI extension, PyLops-MPI, takes this a step further by enabling these operations to run on large-scale, distributed computing systems like HPC using the Message-Passing Interface (MPI).&lt;/p>
&lt;p>Where does NCCL fit in? The NVIDIA Collective Communication Library (NCCL) is a library of highly optimized routines for collective communication between GPUs. It offers the opportunity to close the performance gap in PyLops-MPI. As we offload more and more computationally intensive tasks to GPUs, the communication between them can become a bottleneck. NCCL offers a powerful solution to this problem, enabling high-bandwidth, low-latency communication that can significantly boost performance.&lt;/p>
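&lt;p>To make the central collective concrete, here is a small NumPy simulation of what a sum-AllReduce leaves on every rank. The real NCCL call (reachable from Python, e.g., through CuPy&amp;rsquo;s NCCL bindings) performs this same reduction directly between GPU memories over NVLink or the network fabric; this host-memory sketch is illustrative only:&lt;/p>

```python
import numpy as np

def allreduce_sum(buffers):
    """Simulate a sum-AllReduce: after the call, every rank holds the
    element-wise sum of all ranks' buffers. NCCL does the same
    reduction GPU-to-GPU without staging through host memory."""
    total = np.sum(buffers, axis=0)
    return [total.copy() for _ in buffers]

# Four "ranks", each holding a partial gradient.
ranks = [np.full(3, r, dtype=np.float64) for r in range(4)]
result = allreduce_sum(ranks)
# Every rank now holds [6., 6., 6.]  (0 + 1 + 2 + 3).
```

&lt;p>This is exactly the pattern iterative solvers rely on: each GPU computes a partial gradient, and one AllReduce gives every GPU the global gradient for the next iteration.&lt;/p>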
&lt;h2 id="motivation-and-what-was-missing">Motivation and What was Missing&lt;/h2>
&lt;p>As a student with a background in geophysics (B.Sc.) and now pursuing computer science (M.Sc.), I&amp;rsquo;ve experienced firsthand the challenges of scaling scientific computing research from a personal desktop to a high-performance computing (HPC) cluster. It can be a significant hurdle. My project aims to ease this transition for PyLops-MPI users. PyLops-MPI is something I wish had existed while I was doing my undergraduate research!&lt;/p>
&lt;p>Currently, PyLops-MPI is &amp;ldquo;CUDA-aware,&amp;rdquo; meaning it can offload computations to GPUs. However, the communication between those GPUs is still handled by the underlying MPI implementation, which isn&amp;rsquo;t always optimal. This project will address this gap by integrating NCCL to handle GPU-to-GPU communication directly. If the computation happens on the GPU, data shouldn&amp;rsquo;t have to be copied to the CPU, transferred with MPI, and then moved back to the GPU.&lt;/p>
&lt;p>This will be especially impactful for memory-bound problems where high-bandwidth communication is critical. By the end of this project, we&amp;rsquo;ll have a clear, quantifiable understanding of the performance gains achieved.&lt;/p>
&lt;h2 id="my-best-laid-plan">My Best-Laid Plan&lt;/h2>
&lt;p>My approach is grounded in good software engineering practices to ensure that this new feature is both robust and genuinely useful. I was impressed by the code quality of the repository (an enjoyable read), so I am committed to not breaking it.&lt;/p>
&lt;p>First and foremost, the goal is to seamlessly integrate NCCL without breaking what already works. A significant part of my effort will be dedicated to rigorous testing. This means not only ensuring that all existing tests pass but also developing a new, comprehensive test suite to validate the correctness of the GPU-to-GPU communication across different hardware setups.&lt;/p>
&lt;p>Once we&amp;rsquo;re confident that the integration is solid, the exciting part begins: benchmarking (or you may call it &amp;ldquo;Moment of Truth&amp;rdquo;)! The plan is to measure the performance of end-to-end iterative solvers. These solvers are a perfect test case because they involve a mix of intensive gradient computations on the GPU and frequent AllReduce calls to sync up processes. This will give us a clear picture of the speedup and efficiency gains from using NCCL.&lt;/p>
&lt;p>Finally, to make sure this work benefits the entire community, I will create clear documentation and tutorials. The goal is to make it easy for any user to leverage this new GPU-accelerated communication in their own research and applications.&lt;/p></description></item><item><title>LLMSeqRec: LLM Enhanced Contextual Sequential Recommender</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250614-connor/</link><pubDate>Fri, 06 Jun 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250614-connor/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains. Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning.
By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on vanilla sequential ID orders.
To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Data Preprocessing &amp;amp; Feature Creation&lt;/strong>:
Develop a data processing pipeline to parse users’ sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling; extract user behavior sequences, item metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing. The data source can be the Amazon open public data or the MovieLens dataset, and data-point creation can follow SASRec (reference 1).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Model Development&lt;/strong>:
Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context; Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions, reviews into the sequential recommendation process; The baseline model can be SASRec pytorch implementation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation&lt;/strong>:
Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating on accuracy, NDCG and cold-start performance; Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality; Optimize model inference speed and efficiency for real-time recommendation scenarios.&lt;/p>
&lt;/li>
&lt;/ul>
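&lt;p>Step 1 above can be sketched in a few lines. This is a hypothetical, minimal version of SASRec-style data-point creation, where each user&amp;rsquo;s chronological item list becomes fixed-length (input, next-item) pairs, left-padded with 0:&lt;/p>

```python
def make_training_points(user_items, maxlen=5, pad=0):
    """Turn one user's chronological item IDs into (input, target)
    pairs for next-item prediction, SASRec-style. Inputs are
    left-padded with `pad` to a fixed length. (Illustrative sketch,
    not the reference SASRec pipeline.)"""
    points = []
    for t in range(1, len(user_items)):
        history = user_items[max(0, t - maxlen):t]   # at most `maxlen` recent items
        padded = [pad] * (maxlen - len(history)) + history
        points.append((padded, user_items[t]))       # predict the item at time t
    return points

# A user who interacted with items 11, 12, 13, 14 in that order:
pairs = make_training_points([11, 12, 13, 14], maxlen=3)
# [([0, 0, 11], 12), ([0, 11, 12], 13), ([11, 12, 13], 14)]
```

&lt;p>In the LLM-enhanced variant, each item ID in these sequences would additionally be paired with an embedding of its metadata (title, description, reviews) rather than a bare ID.&lt;/p>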
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: the software, model training with validation and performance evaluation, and a demo. The software implementing the LLMSeqRec model described above will be hosted on GitHub as an open-access repository, and the evaluation results and demo will be published alongside it.&lt;/p>
&lt;h3 id="llmseqrec">LLMSeqRec&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: LLM Enhanced Contextual Sequential Recommender&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, PyTorch, GitHub, self-attention, Transformers&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>Self-Attentive Sequential Recommendation (SASRec)&lt;/li>
&lt;li>BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer&lt;/li>
&lt;li>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/li>
&lt;li>Amazon Dataset: &lt;a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews" target="_blank" rel="noopener">https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews&lt;/a>&lt;/li>
&lt;li>MovieLens Data: &lt;a href="https://grouplens.org/datasets/movielens/" target="_blank" rel="noopener">https://grouplens.org/datasets/movielens/&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>I&amp;rsquo;m Connor, a student at NYU studying CS and Math. This summer I&amp;rsquo;ve gotten the opportunity to work on LLMSeqRec under Dr. Bin Dong and Dr. Linsey Pang.&lt;/p>
&lt;p>In today’s digital age, sequential recommender systems power everything from e-commerce suggestions to personalized content everywhere. However, traditional models fall short in capturing user intent, adapting to dynamic behavior, or tackling cold-start problems. That’s where LLMSeqRec comes in.&lt;/p>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>Most sequential recommender systems rely heavily on historical user-item interactions and predefined embeddings. This approach limits their ability to understand nuanced user preferences, struggles to scale across domains, and performs poorly in scenarios like new users or sparse data. The absence of semantic and contextual modeling is a major gap in current solutions.&lt;/p>
&lt;h2 id="overview-of-project">Overview of project&lt;/h2>
&lt;p>LLMSeqRec is a novel, LLM-enhanced sequential recommender framework that bridges this gap. By leveraging large language models (LLMs), it incorporates semantic embeddings and prompt-based contextual modeling to understand both user behavior and item metadata at a deeper level. The system explores two core approaches:&lt;/p>
&lt;ul>
&lt;li>Embedding-based: LLMs generate embeddings from item attributes.&lt;/li>
&lt;li>Prompt-based: LLMs receive full transaction history in natural language format and infer recommendations.&lt;/li>
&lt;/ul>
&lt;p>These techniques are tested using well-known datasets (e.g., Amazon, MovieLens), and evaluated with ranking metrics like NDCG@10 and Hit@10. The goal: deliver more accurate, context-rich, and explainable recommendations.&lt;/p>
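&lt;p>For reference, the two ranking metrics mentioned above can be computed as follows. This is an illustrative sketch for the common evaluation setup with a single held-out ground-truth item per user:&lt;/p>

```python
import math

def hit_at_k(ranked_items, target, k=10):
    """1.0 if the held-out target appears in the top-k ranked list."""
    return 1.0 if target in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, target, k=10):
    """NDCG@k with a single relevant item: 1/log2(rank + 1) when the
    target sits at 1-based position `rank` within the top k, else 0."""
    top = ranked_items[:k]
    if target in top:
        rank = top.index(target) + 1
        return 1.0 / math.log2(rank + 1)
    return 0.0

# Target item 3 ranked second out of the model's top predictions:
score = ndcg_at_k([5, 3, 7, 9], target=3, k=10)   # 1/log2(3) ≈ 0.631
```

&lt;p>Averaging these per-user scores over the test set gives the NDCG@10 and Hit@10 numbers used to compare LLMSeqRec against baselines such as SASRec.&lt;/p>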
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>The project is currently progressing through stages including model training, embedding integration, and evaluation. Upcoming tasks include:&lt;/p>
&lt;ul>
&lt;li>Fine-tuning enhanced models&lt;/li>
&lt;li>Designing zero-/few-shot prompts&lt;/li>
&lt;li>Running comparative experiments&lt;/li>
&lt;li>Publishing findings and writing technical blogs&lt;/li>
&lt;/ul>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/LLMSeqRec">LLMSeqRec&lt;/a> project, here is my &lt;a href="https://drive.google.com/file/d/1cs9lsjacSJUbXWzTfcHIukfKFwKJjUZF/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, written under the mentorship of Dr. Bin Dong and Dr. Linsey Pang.&lt;/p></description></item><item><title>Understanding Skin-Tone based Bias in Text-to-Image Models Using Stable Diffusion</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fairface/</link><pubDate>Tue, 27 May 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fairface/</guid><description>&lt;p>This project investigates &lt;strong>skin tone bias in text-to-image generation&lt;/strong> by analyzing the output of &lt;strong>Stable Diffusion&lt;/strong> models when prompted with socially and occupationally descriptive text. Despite the growing popularity of generative models like Stable Diffusion, little has been done to evaluate how these models reproduce or amplify visual bias—especially related to &lt;strong>skin tone, perceived race, and social class&lt;/strong>—based solely on textual prompts.&lt;/p>
&lt;p>This work builds on prior studies of bias in large language models (LLMs) and vision-language models (VLMs), and aims to explore how biases manifest visually, without explicitly specifying race or ethnicity in the input prompt. Our approach combines &lt;strong>systematic prompt generation&lt;/strong>, &lt;strong>model-based image creation&lt;/strong>, and &lt;strong>skin tone quantification&lt;/strong> to assess disparities across generated samples.&lt;/p>
&lt;p>The ultimate goal is to develop a &lt;strong>reproducible evaluation pipeline&lt;/strong>, visualize disparities across demographic and occupational prompts, and explore strategies to mitigate representational harms in generative models.&lt;/p>
&lt;p>Our goal is to create a reproducible pipeline for:&lt;/p>
&lt;ul>
&lt;li>Generating images from prompts&lt;/li>
&lt;li>Annotating or analyzing them using computer vision tools&lt;/li>
&lt;li>Measuring bias across categories like skin tone, gender presentation, or status markers&lt;/li>
&lt;/ul>
&lt;p>Project webpage: &lt;a href="https://github.com/marzianizam/ucsc-ospo.github.io/tree/main/content/project/osre25/UCSC/FairFace" target="_blank" rel="noopener">https://github.com/marzianizam/ucsc-ospo.github.io/tree/main/content/project/osre25/UCSC/FairFace&lt;/a>&lt;/p>
&lt;h3 id="project-idea-measuring-bias-in-ai-generated-portraits">Project Idea: Measuring Bias in AI-Generated Portraits&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Responsible AI, Generative Models, Ethics in AI&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, PyTorch, Stable Diffusion, Prompt Engineering, Data Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>:
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/marzia-binta-nizam/">Marzia Binta Nizam&lt;/a> (&lt;a href="mailto:manizam@ucsc.edu">manizam@ucsc.edu&lt;/a>)&lt;/li>
&lt;li>Professor James Davis (&lt;a href="mailto:davisje@ucsc.edu">davisje@ucsc.edu&lt;/a>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="background">Background&lt;/h3>
&lt;p>Recent research has shown that text-to-image models can perpetuate racial and gender stereotypes through visual output. For instance, prompts like “CEO” or “nurse” often produce racially skewed results even when no explicit race or demographic cues are provided. This project examines whether similar disparities exist &lt;strong>along skin tone dimensions&lt;/strong>, focusing on &lt;strong>subtle biases&lt;/strong> rather than overt stereotypes.&lt;/p>
&lt;p>The key challenge is that visual bias is not always easy to measure. This project addresses this issue by utilizing &lt;strong>melanin-level quantification&lt;/strong>, a continuous and interpretable proxy for skin tone, in conjunction with consistent prompt templating and multi-sample averaging to ensure statistical rigor.&lt;/p>
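&lt;p>One widely used continuous skin-tone measure that could serve as such a proxy is the Individual Typology Angle (ITA), computed from the mean CIELAB values of detected skin pixels; larger angles correspond to lighter skin. This is a sketch under the assumption that an ITA-style metric is used, and the project may adopt a different melanin quantification:&lt;/p>

```python
import math

def individual_typology_angle(L_star, b_star):
    """ITA in degrees from mean CIELAB L* (lightness) and b*
    (yellow-blue) over a patch of skin pixels. Larger angles
    correspond to lighter skin tones. (Illustrative; the project
    may use a different melanin proxy.)"""
    return math.degrees(math.atan2(L_star - 50.0, b_star))

# Typical values: a lighter skin patch vs. a darker one.
light = individual_typology_angle(70.0, 15.0)   # roughly +53 degrees
dark = individual_typology_angle(35.0, 18.0)    # negative angle
```

&lt;p>Because ITA is continuous, averaging it over many generated samples per prompt supports exactly the kind of multi-sample statistical comparison the project describes.&lt;/p>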
&lt;hr>
&lt;h3 id="objectives">Objectives&lt;/h3>
&lt;ul>
&lt;li>Generate datasets using consistent prompts (e.g., &amp;ldquo;A portrait of a doctor&amp;rdquo;, &amp;ldquo;A homeless person&amp;rdquo;, etc.)&lt;/li>
&lt;li>Use Stable Diffusion (and optionally, other models like DALL·E or Midjourney) to generate diverse image sets&lt;/li>
&lt;li>Measure bias across demographic and occupational categories using image processing tools&lt;/li>
&lt;li>Visualize the distribution of melanin values and facial features across samples&lt;/li>
&lt;li>Explore prompt-level mitigation strategies to improve fairness in output&lt;/li>
&lt;/ul>
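&lt;p>The first objective, consistent prompt generation, can be sketched as a simple template expansion. The template and descriptor strings here are hypothetical examples, not the project&amp;rsquo;s actual prompt set:&lt;/p>

```python
TEMPLATE = "A portrait of {article}{descriptor}"
DESCRIPTORS = ["doctor", "nurse", "CEO", "homeless person"]

def build_prompts(descriptors, samples_per_prompt=4):
    """Expand each descriptor into one fixed template, repeating each
    prompt so that multi-sample averaging is possible downstream.
    (Hypothetical templates; the real prompt set may differ.)"""
    prompts = []
    for d in descriptors:
        article = "an " if d[0] in "aeiou" else "a "
        text = TEMPLATE.format(article=article, descriptor=d)
        prompts.extend([text] * samples_per_prompt)
    return prompts
```

&lt;p>Holding the template fixed while varying only the descriptor is what isolates the descriptor as the variable under study when the images are later scored for skin tone.&lt;/p>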
&lt;hr>
&lt;h3 id="deliverables">Deliverables&lt;/h3>
&lt;ul>
&lt;li>Open-source codebase for prompt generation and image evaluation&lt;/li>
&lt;li>Statistical analysis of visual bias trends&lt;/li>
&lt;li>Blog post or visual explainer on findings&lt;/li>
&lt;li>Final report and recommendations on prompt engineering or model constraints&lt;/li>
&lt;/ul>
&lt;hr></description></item><item><title>UC Open Source Repository Browser</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/orb/</link><pubDate>Mon, 03 Mar 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/orb/</guid><description>&lt;p>The University of California Open Source Repository Browser (UC ORB) is a discovery tool designed to map and classify open source projects across the UC system. This project is a collaboration with the &lt;a href="https://ucospo.net" target="_blank" rel="noopener">UC Network of Open Source Program Offices (OSPOs)&lt;/a>, which brings together six UC campuses (Santa Cruz, Berkeley, Davis, Los Angeles, Santa Barbara, and San Diego) to support open source research, promote sustainability, and establish best practices within academic environments.&lt;/p>
&lt;p>By providing a centralized platform, UC ORB enhances the visibility of UC’s open source contributions, fosters collaboration among researchers and developers, and serves as a model for other institutions aiming to improve open source discovery and sustainability.&lt;/p>
&lt;p>This project focuses on building the web application for UC ORB, which will serve as the primary interface for users to explore and interact with UC’s open source projects. The student will work on developing a clean, user-friendly, and scalable web application.&lt;/p>
&lt;h3 id="develop-the-uc-orb-application">Develop the UC ORB Application&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Experience in Python and at least one Python-based web framework (e.g., Flask, Django, FastAPI), experience with front-end technologies (React, HTML, CSS, JavaScript), familiarity with Git and collaborative development workflows, familiarity with database interaction (SQL).&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:jgomez91@ucsc.edu">Juanita Gomez&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a web application that serves as the front-end interface for the UC ORB. The application will allow users to browse, search, and explore open source projects across the UC system. The project will involve integrating with the repository database to fetch and display repository data, designing an intuitive user interface, and ensuring the application is scalable and maintainable.&lt;/p>
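&lt;p>As a minimal sketch of the browse/search/filter logic such an application needs (standard library only; the field names are assumptions for illustration, not the actual UC ORB schema):&lt;/p>

```python
def search_repos(repos, query="", campus=None, language=None):
    """Case-insensitive keyword search over name and description,
    with optional campus and language filters. (Field names are
    illustrative, not the real UC ORB data model.)"""
    q = query.lower()
    results = []
    for repo in repos:
        haystack = (repo.get("name", "") + " " + repo.get("description", "")).lower()
        if q and q not in haystack:
            continue  # keyword miss
        if campus and repo.get("campus") != campus:
            continue  # campus filter miss
        if language and repo.get("language") != language:
            continue  # language filter miss
        results.append(repo)
    return results
```

&lt;p>In the real application this logic would live behind a backend route (Flask, Django, or FastAPI, per the skills list) querying the repository database rather than an in-memory list.&lt;/p>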
&lt;p>Specific Tasks:&lt;/p>
&lt;ul>
&lt;li>Choose an appropriate Python-based web framework (e.g., Flask, Django, or FastAPI) for the backend and set up the basic structure of the application.&lt;/li>
&lt;li>Develop a responsive and user-friendly front-end interface ensuring that it is accessible and works well on both desktop and mobile devices.&lt;/li>
&lt;li>Add search functionality to allow users to find projects by keywords, tags, or other metadata.&lt;/li>
&lt;li>Implement filtering options to narrow down search results (e.g., by campus, topic, or programming language).&lt;/li>
&lt;li>Deploy the application to a cloud platform (e.g., AWS, or Google Cloud) or GitHub Pages (GitHub.io) for public access.&lt;/li>
&lt;li>Create developer documentation that explains the application’s architecture, setup instructions, and contribution guidelines.&lt;/li>
&lt;li>Write a short user manual to help end-users browse and use the web application effectively.&lt;/li>
&lt;/ul></description></item><item><title>Applying MLOps to overcome reproducibility barriers in machine learning research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</link><pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> machine learning, MLOps, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, machine learning, GitOps, systems, Linux, data, Docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Reproducibility remains a significant problem in machine learning research, both in core ML and in the application of ML to other areas of science. In many cases, due to inadequate experiment tracking, dependency capturing, source code versioning, data versioning, and artifact sharing, even the authors of a paper may find it challenging to reproduce their own study several years later. This makes it difficult to validate and build on previous work, and raises concerns about its trustworthiness.&lt;/p>
&lt;p>In contrast, outside of academic research, MLOps tools and frameworks have been identified as a key enabler of reliable, reproducible, and trustworthy machine learning systems in production. A good reference on this topic is:&lt;/p>
&lt;blockquote>
&lt;p>Firas Bayram and Bestoun S. Ahmed. 2025. Towards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach. ACM Comput. Surv. 57, 5, Article 121 (May 2025), 35 pages. &lt;a href="https://doi.org/10.1145/3708497" target="_blank" rel="noopener">https://doi.org/10.1145/3708497&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;p>This project seeks to bridge the gap between widely adopted practices in industry and academic research:&lt;/p>
&lt;ul>
&lt;li>by making it easier for researchers and scientists to use MLOps tools to support reproducibility. To achieve this, we will develop starter templates and recipes for research in computer vision, NLP, and ML for science, that have reproducibility &amp;ldquo;baked in&amp;rdquo; thanks to the integration of MLOps tools and frameworks. Researchers will launch these templates on open access research facilities like &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>.&lt;/li>
&lt;li>and, by developing complementary education and training materials to emphasize the importance of reproducibility in ML, and how the tools and frameworks used in the starter templates can support this goal.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should -&lt;/p>
&lt;ul>
&lt;li>demonstrate a good understanding of the current barriers to reproducibility in machine learning research (specific examples are welcome),&lt;/li>
&lt;li>describe a &amp;ldquo;base&amp;rdquo; starter template, including the platforms and tools that will be integrated, as well as specific adaptations of this template for computer vision, NLP, and ML for science,&lt;/li>
&lt;li>explain the &amp;ldquo;user flow&amp;rdquo; - how a researcher would use the template to conduct an experiment or series of experiments, what the lifecycle of that experiment would look like, and how it would be made reproducible,&lt;/li>
&lt;li>include the contributor&amp;rsquo;s own ideas about how to make the starter templates more usable, and how to make the education and training materials relatable and useful,&lt;/li>
&lt;li>and show that the contributor has the necessary technical background and soft skills to contribute to this project. In particular, the contributor will need to create education and training materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p></description></item><item><title>CacheBench: Building a Benchmarking Suite for Cache Performance Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/cachebench/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/cachebench/</guid><description>&lt;h3 id="overview">Overview&lt;/h3>
&lt;p>In this project, we aim to develop a comprehensive benchmarking suite, CacheBench, for evaluating the performance of cache systems in modern computing environments. Caches play a crucial role in enhancing system performance by reducing latency and improving data access speeds. However, evaluating cache performance is a complex task that requires a diverse set of workloads and metrics to capture the cache&amp;rsquo;s behavior accurately. The current focus is on eviction algorithms; if time permits, we will extend to other components of cache design.&lt;/p>
&lt;p>This project will have three main components:&lt;/p>
&lt;ol>
&lt;li>Implementing and benchmarking existing cache eviction algorithms in &lt;a href="https://libcachesim.com/" target="_blank" rel="noopener">libCacheSim&lt;/a> using large-scale simulation. This part will mainly focus on reproducing existing works.&lt;/li>
&lt;li>Developing a set of microbenchmarks and a platform for researchers to evaluate new designs with little effort in the future. This part will focus on building the open-source infrastructure for future research.&lt;/li>
&lt;li>Developing a leaderboard for the community to submit new algorithms and workloads. This part will focus on building the community and fostering adoption and collaboration.&lt;/li>
&lt;/ol>
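&lt;p>To make component (1) concrete, here is a minimal, self-contained sketch of what an eviction benchmark does: replay a request trace through a policy and report the miss ratio. libCacheSim performs this at much larger scale and with many policies; the toy LRU below is only illustrative.&lt;/p>

```python
# Toy eviction benchmark: replay a request trace through an LRU policy
# and measure the miss ratio. A stand-in for what libCacheSim does at
# scale; the trace and cache size below are illustrative.
from collections import OrderedDict

def lru_miss_ratio(trace, cache_size):
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)       # hit: refresh recency
        else:
            misses += 1
            cache[key] = True
            if len(cache) > cache_size:  # evict least recently used
                cache.popitem(last=False)
    return misses / len(trace)

# A cyclic trace slightly larger than the cache is adversarial for LRU:
# every request misses.
print(lru_miss_ratio(["a", "b", "c", "a", "b", "c"], cache_size=2))
```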
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage systems, benchmarking, performance evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C programming, web programming (e.g., node.js, React), database management&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours).&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juncheng-yang/">Juncheng Yang&lt;/a>, Yazhuo Zhang (&lt;a href="mailto:yazhuo@inf.ethz.ch">yazhuo@inf.ethz.ch&lt;/a>)&lt;/li>
&lt;/ul></description></item><item><title>FairFace</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fair-face/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fair-face/</guid><description>&lt;h3 id="fairface-reproducible-bias-evaluation-in-facial-ai-models-via-controlled-skin-tone-manipulation">FairFace: Reproducible Bias Evaluation in Facial AI Models via Controlled Skin Tone Manipulation&lt;/h3>
&lt;p>Bias in facial AI models remains a persistent issue, particularly concerning skin tone disparities. Many studies report that AI models perform differently on lighter vs. darker skin tones, but these findings are often difficult to reproduce due to variations in datasets, model architectures, and evaluation settings.
The goal of this project is to investigate bias in facial AI models by manipulating skin tone and related properties in a controlled, reproducible manner. By leveraging BioSkin, we will adjust melanin levels and other skin properties on existing human datasets to assess whether face-based AI models (e.g., classification and vision-language models) exhibit biased behavior toward specific skin tones.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Fairness &amp;amp; Bias in AI&lt;/code>, &lt;code>Face Recognition &amp;amp; Vision-Language Models&lt;/code>, &lt;code>Dataset Augmentation for Reproducibility&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Machine Learning &amp;amp; Computer Vision, Deep Learning (PyTorch/TensorFlow), Data Augmentation &amp;amp; Image Processing, Reproducibility &amp;amp; Documentation (GitHub, Jupyter Notebooks).&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (can be completed in either 175 or 350 hours, depending on the depth of analysis and the number of models tested)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:davisje@ucsc.edu">James Davis&lt;/a>, &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="key-research-questions">Key Research Questions&lt;/h3>
&lt;ol>
&lt;li>Do AI models perform differently based on skin tone?
&lt;ul>
&lt;li>How do classification accuracy, confidence scores, and error rates change when skin tone is altered systematically?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>What are the underlying causes of bias?
&lt;ul>
&lt;li>Is bias solely dependent on skin tone, or do other skin-related properties (e.g., texture, reflectance) contribute to model predictions?&lt;/li>
&lt;li>Is bias driven by dataset imbalances (e.g., underrepresentation of certain skin tones)?&lt;/li>
&lt;li>Do facial features beyond skin tone (e.g., structure, expression, pose) contribute to biased predictions?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Are bias trends reproducible?
&lt;ul>
&lt;li>Can we replicate bias patterns across different datasets, model architectures, and experimental setups?&lt;/li>
&lt;li>How consistent are the findings when varying image sources and preprocessing methods?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
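&lt;p>As a hedged sketch of how question 1 could be quantified, the snippet below computes a demographic parity gap: the difference in positive-prediction rates across skin-tone groups. The group labels and predictions are synthetic placeholders, not results from any model.&lt;/p>

```python
# Minimal demographic-parity check: compare a model's positive-prediction
# rate across skin-tone groups. All labels and predictions below are
# synthetic placeholders for illustration only.

def positive_rate(predictions):
    return sum(predictions) / len(predictions)

def demographic_parity_gap(preds_by_group):
    rates = [positive_rate(p) for p in preds_by_group.values()]
    return max(rates) - min(rates)

# 1 = "recognized", 0 = "not recognized", per synthetic skin-tone group
preds = {
    "lighter": [1, 1, 1, 0],   # 75% positive rate
    "darker":  [1, 0, 0, 0],   # 25% positive rate
}
print(demographic_parity_gap(preds))  # 0.5: a large gap suggests bias
```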
&lt;h3 id="specific-tasks">Specific Tasks:&lt;/h3>
&lt;ol>
&lt;li>Dataset Selection &amp;amp; Preprocessing
&lt;ul>
&lt;li>Choose appropriate face/human datasets (e.g., FairFace, CelebA, COCO-Human).&lt;/li>
&lt;li>Preprocess images to ensure consistent lighting, pose, and resolution before applying transformations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Skin Tone Manipulation with BioSkin
&lt;ul>
&lt;li>Systematically modify melanin levels while keeping facial features unchanged.&lt;/li>
&lt;li>Generate multiple variations per image (lighter to darker skin tones).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Model Evaluation &amp;amp; Bias Analysis
&lt;ul>
&lt;li>Test face classification models (e.g., ResNet, FaceNet) and vision-language models (e.g., BLIP, LLaVA) on the modified images.&lt;/li>
&lt;li>Compute fairness metrics (e.g., demographic parity, equalized odds).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Investigate Underlying Causes of Bias
&lt;ul>
&lt;li>Compare model behavior across different feature sets.&lt;/li>
&lt;li>Test whether bias persists across multiple datasets and model architectures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Ensure Reproducibility
&lt;ul>
&lt;li>Develop an open-source pipeline for others to replicate bias evaluations.&lt;/li>
&lt;li>Provide codebase and detailed documentation for reproducibility.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol></description></item><item><title>IO logger: IO tracing in the modern computing era</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/iologger/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/iologger/</guid><description>&lt;h3 id="overview">Overview&lt;/h3>
&lt;p>Storage systems are critical components of modern computing infrastructures, and understanding their performance characteristics is essential for optimizing system efficiency. Many IO tracing studies were conducted twenty to thirty years ago, but the landscape has changed significantly with the advent of&lt;/p>
&lt;ul>
&lt;li>cloud computing, virtualization, and storage disaggregation on the &lt;strong>server side&lt;/strong>&lt;/li>
&lt;li>ubiquitous fast wireless networking for &lt;strong>end users&lt;/strong> that make remote storage feasible&lt;/li>
&lt;li>AI and ML workloads that generate and move massive data both in the cloud and on the edge.&lt;/li>
&lt;/ul>
&lt;p>In this project, we aim to develop an IO logger, a tool for tracing, logging and analyzing IO operations in various computing environments. The IO logger will capture detailed information about read and write operations, latency, throughput, and other metrics to help researchers and practitioners understand the behavior of storage systems under different workloads and configurations. By providing a comprehensive view of IO performance, the IO logger will enable users to identify bottlenecks, optimize resource utilization, and improve system efficiency.&lt;/p>
&lt;p>This project will have two phases:&lt;/p>
&lt;ol>
&lt;li>IO logger for *NIX systems: Develop a tool leveraging eBPF and other tools for tracing IO operations on Linux and other Unix-like systems. The tool will capture detailed information about disk reads and writes, network transfers, and other IO activities, providing insights into system performance. The tool will be open-sourced, and we will work with industry partners and testbeds to integrate it into existing monitoring and analysis tools. Moreover, we will collect and open source the IO traces to benefit the community.&lt;/li>
&lt;li>IO logger for personal computing environment: Develop a tool for end-users to trace IO operations on their personal devices, such as laptops, desktops, and mobile phones. We will design and implement tools for three different platforms: Windows, macOS, and Android. We will use the tools to collect IO traces from volunteers and real-world applications, providing insights into storage usage, network activity, and application performance. The tool will be user-friendly, lightweight, and privacy-preserving, ensuring that users can monitor their IO activities without compromising their data security.&lt;/li>
&lt;/ol>
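&lt;p>On the analysis side, the kind of summary the IO logger would produce can be sketched as follows. The record format (operation, bytes, latency) is an illustrative assumption rather than the tool&amp;rsquo;s actual schema.&lt;/p>

```python
# Sketch of the logger's analysis stage: given captured IO records
# (operation, bytes, latency in ms), summarize volume, mean latency,
# and tail latency per operation type. Record fields are illustrative.
from statistics import mean

def summarize(records):
    by_op = {}
    for op, nbytes, latency_ms in records:
        by_op.setdefault(op, []).append((nbytes, latency_ms))
    summary = {}
    for op, entries in by_op.items():
        lats = sorted(lat for _, lat in entries)
        p99_index = min(len(lats) - 1, int(0.99 * len(lats)))
        summary[op] = {
            "total_bytes": sum(b for b, _ in entries),
            "mean_latency_ms": mean(lats),
            "p99_latency_ms": lats[p99_index],
        }
    return summary

records = [("read", 4096, 0.2), ("read", 4096, 1.5), ("write", 8192, 0.7)]
print(summarize(records))
```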
&lt;p>Notable difference and challenges compared to the existing works are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>more IO requests with rich features&lt;/strong>: open-source traces from previous works were all collected after the page cache, so they are often write-heavy, miss most IO requests, and do not provide enough features, e.g., process name. To address this, we will build a tool that also records requests served by the page cache, which requires the tool to be efficient and impose no significant overhead on the running systems.&lt;/li>
&lt;li>&lt;strong>focus on new applications and workloads&lt;/strong>: most existing works date from the 1990s, when the Internet was not widely used and applications mostly processed local data without communicating with the outside world. A few works looked into mobile storage a decade ago, but the landscape has changed significantly since then, especially with the advent of AI and ML workloads that generate and move massive data both in the cloud and on the edge. This project will look into the differences and challenges brought by these new applications and workloads.&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> tracing tool, operating system, eBPF, performance evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C programming, system programming, eBPF, Linux kernel, mobile application development&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours).&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juncheng-yang/">Juncheng Yang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h4 id="related-works">Related works&lt;/h4>
&lt;ul>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/3460095" target="_blank" rel="noopener">COSMOS&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/2987443.2987465" target="_blank" rel="noopener">mobile storage usage&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/2043106.2043112" target="_blank" rel="noopener">mobile storage performance measurement&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://ieeexplore.ieee.org/abstract/document/7897092" target="_blank" rel="noopener">mobile application IO performance&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/1140103.1140280" target="_blank" rel="noopener">stardust&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/3149376" target="_blank" rel="noopener">GPFS&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>ReasonWorld</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</guid><description>&lt;h3 id="reasonworld-real-world-reasoning-with-a-long-term-world-model">ReasonWorld: Real-World Reasoning with a Long-Term World Model&lt;/h3>
&lt;p>A world model is essentially an internal representation of an environment that an AI system would construct based on external information to plan, reason, and interpret its surroundings. It stores the system’s understanding of relevant objects, spatial relationships, and/or states in the environment. Recent augmented reality (AR) and wearable technologies like Meta Aria glasses provide an opportunity to gather rich information from the real world in the form of vision, audio, and spatial data. Along with this, large language models (LLMs), vision language models (VLMs), and general machine learning algorithms have enabled nuanced understanding and processing of multimodal inputs that can label, summarize, and analyze experiences.&lt;/p>
&lt;p>With &lt;strong>ReasonWorld&lt;/strong>, we aim to utilize these technologies to enable advanced reasoning about important objects/events/spaces in real-world environments in a structured manner. With the help of wearable AR technology, the system would be able to capture real-world multimodal data. We aim to utilize this information to create a long-memory modeling toolkit that would support features like:&lt;/p>
&lt;ul>
&lt;li>Longitudinal and structured data logging: Capturing and storing multimodal data (image, video, audio, location coordinates, etc.)&lt;/li>
&lt;li>Semantic summarization: Automatic scene labeling via LLMs/VLMs to identify key elements in the surroundings&lt;/li>
&lt;li>Efficient retrieval: For querying and revisiting past experiences and answering questions like “Where have I seen this painting before?”&lt;/li>
&lt;li>Adaptability: Continuously refining and understanding the environment and/or relationships between objects/locations.&lt;/li>
&lt;li>Adaptive memory prioritization: Where the pipeline can assess the contextual significance of the captured data and retrieve those that are the most significant. The model retains meaningful, structured representations rather than raw, unfiltered data.&lt;/li>
&lt;/ul>
&lt;p>This real-world reasoning framework with a long-term world model can function as a structured search engine for important objects and spaces, enabling:&lt;/p>
&lt;ul>
&lt;li>Recognizing and tracking significant objects, locations, and events&lt;/li>
&lt;li>Supporting spatial understanding and contextual analysis&lt;/li>
&lt;li>Facilitating structured documentation of environments and changes over time&lt;/li>
&lt;/ul>
&lt;h3 id="alignment-with-summer-of-reproducibility">Alignment with Summer of Reproducibility:&lt;/h3>
&lt;ul>
&lt;li>Core pipeline for AR data ingestion, event segmentation, summarization, and indexing (knowledge graph or vector database) would be made open-source.&lt;/li>
&lt;li>Clear documentation of each module and how they collaborate with one another&lt;/li>
&lt;li>The project could be tested with standardized datasets, simulated environments as well as controlled real-world scenarios, promoting reproducibility&lt;/li>
&lt;li>Opportunities for Innovation - A transparent, modular approach invites a broad community to propose novel expansions&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks:&lt;/h3>
&lt;ul>
&lt;li>A pipeline for real-time/batch ingestion and cleaning of data from the wearable AR device&lt;/li>
&lt;li>Have an event segmentation module to classify whether the current object/event is contextually significant, filtering out the less relevant observations.&lt;/li>
&lt;li>Have VLMs/LLMs summarize the events with the vision/audio/location data to be stored and retrieved later by structured data structures like knowledge graph, vector databases etc.&lt;/li>
&lt;li>Storage optimization that prioritizes important objects and spaces based on contextual significance and frequency of access.&lt;/li>
&lt;li>Implement key information retrieval mechanisms&lt;/li>
&lt;li>Ensure reproducibility by providing datasets and scripts&lt;/li>
&lt;/ul>
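&lt;p>The retrieval task above can be sketched with plain cosine similarity over stored event embeddings. A real deployment would use a vector database and VLM embeddings; the two-dimensional vectors and event labels here are illustrative stand-ins.&lt;/p>

```python
# Sketch of "efficient retrieval": store summarized events as embedding
# vectors and answer "where have I seen this before?" by cosine
# similarity. The vectors and labels are stand-ins for VLM embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, memory):
    """memory: list of (label, embedding); returns best-matching label."""
    return max(memory, key=lambda item: cosine(query_vec, item[1]))[0]

memory = [
    ("painting in museum lobby", (0.9, 0.1)),
    ("coffee shop counter",      (0.1, 0.9)),
]
print(retrieve((0.8, 0.2), memory))  # closest stored event
```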
&lt;h3 id="reasonworld">ReasonWorld&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Augmented reality&lt;/code> &lt;code>Multimodal learning&lt;/code> &lt;code>Computer vision for AR&lt;/code> &lt;code>LLM/VLM&lt;/code> &lt;code>Efficient data indexing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Machine Learning and AI, Augmented Reality and Hardware integration, Data Engineering &amp;amp; Storage Optimization&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:davisje@ucsc.edu">James Davis&lt;/a>, &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>AI for Science: Automating Domain Specific Tasks with Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</link><pubDate>Sun, 23 Feb 2025 21:30:56 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</guid><description>&lt;p>Recent advancements in Large Language Models (LLMs) have transformed various fields by demonstrating remarkable capabilities in processing and generating human-like text. This project aims to explore the development of an open-source framework that leverages LLMs to enhance discovery across specialized domains.&lt;/p>
&lt;p>The proposed framework will enable LLMs to analyze and interpret complex datasets, automate routine tasks, and uncover novel insights. A key focus will be on equipping LLMs with domain-specific expertise, particularly in areas where specialized tools &amp;ndash; such as ANDES &amp;ndash; are not widely integrated with LLM-based solutions. By bridging this gap, the framework will empower researchers and professionals to harness LLMs as intelligent assistants capable of navigating and utilizing niche computational tools effectively.&lt;/p>
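&lt;p>One plausible shape for such a framework is a tool registry that the LLM targets with structured calls. The sketch below registers a fake &lt;code>run_simulation&lt;/code> tool standing in for a domain solver such as ANDES; every name in it is an illustrative assumption, not an existing API.&lt;/p>

```python
# Sketch of a tool registry an LLM could target with structured calls.
# "run_simulation" is a fake stand-in for a domain solver (e.g. ANDES);
# all names here are illustrative assumptions.

TOOLS = {}

def tool(name):
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("run_simulation")
def run_simulation(case):
    # placeholder for invoking a real domain solver on a study case
    return f"simulated {case}: converged"

def dispatch(llm_call):
    """Execute a structured tool call emitted by the model."""
    return TOOLS[llm_call["tool"]](**llm_call["arguments"])

print(dispatch({"tool": "run_simulation", "arguments": {"case": "ieee14"}}))
```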
&lt;h3 id="ai-for-science-automating-domain-specific-tasks-with-large-language-models">AI for Science: Automating Domain Specific Tasks with Large Language Models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>AI for Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Experience with LLMs, Prompt Engineering, Fine-Tuning, LLM Frameworks&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium-Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Designing an extensible framework that facilitates the integration of LLMs with specialized software and datasets.&lt;/li>
&lt;li>Developing methodologies for fine-tuning LLMs to act as domain experts.&lt;/li>
&lt;li>Implementing strategies for improving tool interoperability, allowing LLMs to interact seamlessly with less commonly used but critical analytical platforms.&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in Distributed AI Training: Leveraging Checkpointing and Metadata Analytics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</link><pubDate>Fri, 21 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</guid><description>&lt;p>Reproducibility in distributed AI training is a crucial challenge due to several sources of uncertainty, including stragglers, data variability, and inherent randomness. Stragglers—slower processing nodes in a distributed system—can introduce timing discrepancies that affect the synchronization of model updates, leading to inconsistent states across training runs. Data variability, stemming from non-deterministic data shuffling and differing data partitions across nodes, can also lead to variations in model performance. Additionally, inherent randomness in algorithm initialization, such as random weight initialization and stochastic processes like dropout, further compounds these challenges. Reproducibility in AI is pivotal for ensuring the credibility of AI-driven scientific findings, akin to how reproducibility underpins traditional scientific research.&lt;/p>
&lt;p>To enhance AI reproducibility, leveraging metadata analytics and visualization along with saved checkpoints offers a promising solution. Checkpointing in AI training is a pivotal technique that involves saving snapshots of a model and its parameters at regular intervals throughout the training process. This practice is essential for maintaining progress in the face of potential interruptions, such as hardware failures, and enables the resumption of training without having to restart from scratch. In the context of distributed AI training, checkpointing also provides a framework for analyzing and ensuring reproducibility, offering a means to systematically capture and review the training trajectory of models. Analyzing checkpoints can specifically help identify issues like stragglers, which are slower computing nodes in a distributed system that can impede synchronized progress. For example, by examining the time stamps and resource utilization data associated with each checkpoint, anomalies in processing time can be detected, revealing nodes that consistently lag behind others. This analysis enables teams to diagnose performance bottlenecks and optimize resource allocation across the distributed system, ensuring smoother and more consistent training runs. By combining checkpointing with metadata analytics, it becomes possible to pinpoint the exact training iterations where delays occur, thereby facilitating targeted investigations and solutions to improve overall system reproducibility and efficiency.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: 1) Setting up a checkpointing system within the distributed AI training framework to periodically save model states and metadata; 2) Designing a metadata analysis schema for populating model and system statistics from the saved checkpoints; 3) Conducting exploratory data analysis to identify patterns, anomalies, and sources of variability in the training process; 4) Creating visualization tools to represent metadata insights with collected statistics and patterns; 5) Using insights from metadata analytics and visualization to optimize resource distribution across the distributed system and mitigate straggler effects; and 6) Disseminating results and methodologies through academic papers, workshops, and open-source contributions.&lt;/p>
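&lt;p>Steps 1) through 3) of this workplan can be sketched in miniature: per-node checkpoint metadata is scanned for stragglers, i.e., nodes whose step time sits far above the median. The record layout and threshold below are illustrative assumptions.&lt;/p>

```python
# Miniature straggler detection over checkpoint metadata: flag nodes
# whose per-iteration time sits far above the median. The record layout
# and the 1.5x threshold are illustrative assumptions.
from statistics import median

def find_stragglers(checkpoint_meta, factor=1.5):
    """Return node ids whose step time exceeds factor * median."""
    med = median(m["step_seconds"] for m in checkpoint_meta)
    return [m["node"] for m in checkpoint_meta
            if m["step_seconds"] > factor * med]

meta = [
    {"node": 0, "iteration": 100, "step_seconds": 1.0},
    {"node": 1, "iteration": 100, "step_seconds": 1.1},
    {"node": 2, "iteration": 100, "step_seconds": 3.2},  # lagging node
]
print(find_stragglers(meta))  # [2]
```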
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>AI&lt;/code> &lt;code>distributed AI&lt;/code> &lt;code>checkpoint&lt;/code> &lt;code>metadata analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in RAG Frameworks for Scientific Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</link><pubDate>Thu, 20 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</guid><description>&lt;p>Retrieval-Augmented Generation (RAG) frameworks, which merge the capabilities of retrieval systems and generative models, significantly enhance the relevance and accuracy of responses produced by large language models (LLMs). These frameworks retrieve relevant documents from a large corpus and use these documents to inform the generative process, thereby improving the contextuality and precision of the generated content. Ensuring reproducibility in data queries using similarity search within these RAG frameworks is critical for maintaining the reliability and consistency of scientific workflows. Reproducibility ensures that the same input query consistently yields the same output, which is vital for scientific tasks that rely on precise and repeatable results. Inconsistencies can arise from various sources, affecting the trustworthiness of scientific outcomes. Differences in retrieval algorithms can lead to variable sets of documents being retrieved for the same query. Variations in data indexing methods can cause inconsistencies in how documents are ranked and accessed. The stochastic nature of LLM operations introduces an element of randomness in the generative process. Updates in datasets can also alter the baseline against which queries are processed and interpreted, leading to different results over time.&lt;/p>
&lt;p>This proposal aims to address these reproducibility challenges in similarity searches within RAG frameworks. This work involves analyzing the root causes of non-determinism, benchmarking and validating the consistency of query results, implementing enhancements to minimize variability, and developing tools and best practices to ensure reproducibility. Reproducibility in data queries can be influenced by several factors, including updates in datasets, differences in retrieval algorithms, varying data indexing methods, and the stochastic nature of LLM operations. Each of these factors can cause variability in the documents retrieved and in the generated responses. Ensuring consistency in query results across different runs is crucial for maintaining the integrity of LLM-driven scientific research, allowing researchers to confidently build upon prior work and achieve reliable, trustworthy outcomes.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: (1) Identifying sources of non-determinism and variability, such as algorithmic differences and indexing methods, in RAG; (2) Utilizing standardized scientific datasets to benchmark the reproducibility of similarity search results across different RAG frameworks; (3) Establishing protocols for handling dataset updates to ensure that such changes do not impact the reproducibility of similarity search results; and (4) Implementing mechanisms to track and document updates to datasets, ensuring that changes are reflected consistently across all instances of the RAG framework. By addressing these areas, the proposed work aims to mitigate challenges related to reproducibility in similarity search queries within RAG frameworks, ultimately enhancing the reliability and trustworthiness of scientific research outcomes.&lt;/p>
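&lt;p>Benchmarking step (2) can be sketched as follows: run the same similarity query twice and fingerprint the ranked document ids, so any nondeterminism in retrieval surfaces as a digest mismatch. The word-overlap scorer below is a deterministic toy stand-in for a real retriever.&lt;/p>

```python
# Reproducibility fingerprint for similarity search: run the same query
# twice and hash the ranked document ids. The word-overlap scorer is a
# deterministic toy stand-in for a real retriever.
import hashlib

def score(query, text):
    # toy relevance: number of shared words
    return len(set(query.split()).intersection(text.split()))

def retrieve_ids(query, corpus, k=2):
    # tie-break on id so the ranking itself is deterministic
    scored = sorted(corpus, key=lambda d: (-score(query, d["text"]), d["id"]))
    return [d["id"] for d in scored[:k]]

def fingerprint(ids):
    return hashlib.sha256(",".join(ids).encode()).hexdigest()

corpus = [
    {"id": "doc1", "text": "parallel io tracing"},
    {"id": "doc2", "text": "retrieval augmented generation"},
    {"id": "doc3", "text": "generation of io traces"},
]
run1 = fingerprint(retrieve_ids("retrieval augmented generation", corpus))
run2 = fingerprint(retrieve_ids("retrieval augmented generation", corpus))
print(run1 == run2)  # identical digests mean a reproducible query
```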
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>LLM&lt;/code> &lt;code>RAG&lt;/code> &lt;code>Scientific Workflows&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Exploration of I/O Reproducibility with HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</link><pubDate>Wed, 19 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</guid><description>&lt;p>Parallel I/O is a critical component in high-performance computing (HPC), allowing multiple processes to read and write data concurrently from a shared storage system. &lt;a href="https://github.com/HDFGroup/hdf5" target="_blank" rel="noopener">HDF5&lt;/a>—a widely adopted data model and library for managing complex scientific data—supports parallel I/O but introduces challenges in I/O reproducibility, where repeated executions do not always produce identical results. This lack of reproducibility can stem from non-deterministic execution orders, variations in collective buffering strategies, and race conditions in metadata and dataset chunking operations within HDF5’s parallel I/O hierarchy. Moreover, many HDF5 operations that leverage &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">MPI I/O&lt;/a> require collective communication; that is, all processes within a communicator must participate in operations such as metadata creation, chunk allocation, and data aggregation. These collective calls ensure that the file structure and data layout remain consistent across processes, but they also introduce additional synchronization complexity that can impact reproducibility if not properly managed. In HPC scientific workflows, consistent I/O reproducibility is essential for accurate debugging, validation, and benchmarking, ensuring that scientific results are both verifiable and trustworthy.
Tools such as &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a>—a suite of I/O kernels designed to exercise HDF5 I/O on parallel file systems—play an important role in identifying these reproducibility challenges, tuning performance, and ultimately supporting the overall robustness of large-scale scientific applications.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing parallel I/O operations in &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">HDF5&lt;/a> with &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> miniapps, (2) exploring and validating potential reproducibility challenges within the parallel I/O hierarchy (e.g., MPI I/O), and (3) implementing solutions to address parallel I/O reproducibility.&lt;/p>
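&lt;p>As a toy illustration of why write ordering matters (separate from HDF5 itself), the following Python sketch contrasts uncoordinated append-style writes, whose resulting file image depends on completion order, with offset-addressed writes of the kind collective I/O coordinates, which are order-invariant:&lt;/p>

```python
import hashlib
from itertools import permutations

CHUNK, NRANKS = 8, 4

def file_digest(order, use_offsets):
    """Simulate NRANKS processes each writing one chunk to a shared file."""
    if use_offsets:
        # Collective-style I/O: every rank writes to a fixed offset,
        # so the file image does not depend on completion order.
        buf = bytearray(NRANKS * CHUNK)
        for rank in order:
            buf[rank * CHUNK:(rank + 1) * CHUNK] = bytes([rank]) * CHUNK
    else:
        # Uncoordinated I/O: data lands in whatever order ranks finish.
        buf = bytearray()
        for rank in order:
            buf += bytes([rank]) * CHUNK
    return hashlib.sha256(bytes(buf)).hexdigest()

orders = list(permutations(range(NRANKS)))
append_digests = {file_digest(o, use_offsets=False) for o in orders}
offset_digests = {file_digest(o, use_offsets=True) for o in orders}
print(len(append_digests), len(offset_digests))  # 24 distinct images vs. 1
```

&lt;p>This is only a model of the ordering problem; the project itself would probe the real HDF5/MPI-IO stack with h5bench.&lt;/p>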
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Parallel I/O&lt;/code> &lt;code>MPI-I/O&lt;/code> &lt;code>Reproducibility&lt;/code> &lt;code>HPC&lt;/code> &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/wei-zhang/">Wei Zhang&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Peersky Browser</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/peersky/</link><pubDate>Tue, 18 Feb 2025 12:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/peersky/</guid><description>&lt;p>&lt;a href="https://peersky.p2plabs.xyz/" target="_blank" rel="noopener">Peersky Browser&lt;/a> is an experimental personal gatekeeper to a new way of accessing web content. In a world where a handful of big companies control most of the internet, Peersky leverages distributed web technologies—&lt;a href="https://ipfs.tech/" target="_blank" rel="noopener">IPFS&lt;/a>, &lt;a href="https://holepunch.to/" target="_blank" rel="noopener">Hypercore&lt;/a>, and Web3—to return control to the users. With integrated local P2P applications, Peersky offers a fresh, community-driven approach to browsing.&lt;/p>
&lt;h3 id="implement-web-extensions-integration">Implement Web Extensions Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Browser Extensions&lt;/code>, &lt;code>UI/UX&lt;/code>, &lt;code>Electron&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> JavaScript, Electron.js, HTML/CSS&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/akhilesh-thite/">Akhilesh Thite&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement web extension support in Electron by leveraging its web extension node modules, pre-installing extensions, and providing a user interface for adding, updating, and securely managing them.&lt;/p>
&lt;p>&lt;strong>Tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Loading Extensions via Electron Modules:&lt;/strong>
&lt;ul>
&lt;li>Utilize Electron’s web extension node modules to load extensions, as Electron.js doesn&amp;rsquo;t support marketplace integration.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Default Pre-installed Extensions:&lt;/strong>
&lt;ul>
&lt;li>Configure a set of pre-installed extensions like uBlock to offer immediate value for privacy and security.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>User-Installed Extensions UI:&lt;/strong>
&lt;ul>
&lt;li>Create an interface where users can add extension &lt;code>.zip&lt;/code> files in &lt;code>peersky://settings&lt;/code>.&lt;/li>
&lt;li>Add an option for users to manually update all installed extensions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Validate and Sandbox Extensions:&lt;/strong>
&lt;ul>
&lt;li>Check the integrity and manifest structure of the uploaded extensions to ensure they meet Chrome Manifest V3 requirements.&lt;/li>
&lt;li>Apply sandboxing techniques and enforce strict content security policies to mitigate potential risks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Extension Management UI:&lt;/strong>
&lt;ul>
&lt;li>Design a dedicated UI at the top right of the navigation bar to manage extensions, including stack order and pinning functionality for quick access and organization.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
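&lt;p>The manifest check described under &amp;ldquo;Validate and Sandbox Extensions&amp;rdquo; might start out like this rough sketch (written in Python for brevity; Peersky&amp;rsquo;s actual implementation would be JavaScript inside Electron, and real Manifest V3 validation covers far more, e.g. permissions and CSP):&lt;/p>

```python
import json

# Minimal subset of required Manifest V3 keys (illustrative, not exhaustive).
REQUIRED_KEYS = {"manifest_version", "name", "version"}

def validate_manifest(raw):
    """Basic integrity checks before an uploaded extension is loaded."""
    try:
        manifest = json.loads(raw)
    except json.JSONDecodeError:
        return False, "manifest.json is not valid JSON"
    missing = REQUIRED_KEYS - manifest.keys()
    if missing:
        return False, "missing keys: " + ", ".join(sorted(missing))
    if manifest["manifest_version"] != 3:
        return False, "only Manifest V3 extensions are accepted"
    return True, "ok"

ok, msg = validate_manifest(
    '{"manifest_version": 3, "name": "uBlock", "version": "1.0"}'
)
print(ok, msg)  # True ok
```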
&lt;h3 id="implement-chat-history-synchronization-for-hyper-chat-rooms">Implement Chat History Synchronization for Hyper Chat Rooms&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>P2P Communication&lt;/code>, &lt;code>Hypercore Protocol&lt;/code>, &lt;code>Real-time Synchronization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> JavaScript, Distributed Systems, P2P&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/akhilesh-thite/">Akhilesh Thite&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement chat history synchronization for Hyper chat rooms, ensuring that new devices retrieve all past messages—including those sent while offline—for a seamless user experience. Additionally, research and experiment with mDNS to enable true offline, peer-to-peer messaging on local networks.&lt;/p>
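&lt;p>The synchronization model can be sketched with a minimal append-only log standing in for a Hypercore feed (all names here are illustrative, not Hypercore&amp;rsquo;s API):&lt;/p>

```python
class Feed:
    """Minimal append-only log, a stand-in for a Hypercore feed."""
    def __init__(self):
        self.entries = []

    def append(self, msg):
        self.entries.append(msg)

    def read_from(self, start):
        # A joining or reconnecting device asks for everything past the
        # last index it holds, so offline gaps are filled automatically.
        return self.entries[start:]

class Device:
    def __init__(self):
        self.history = []

    def sync(self, feed):
        self.history.extend(feed.read_from(len(self.history)))

room = Feed()
alice, bob = Device(), Device()
room.append("hi")
alice.sync(room)
room.append("anyone here?")   # bob is offline for both of these
room.append("see you later")
bob.sync(room)                # on joining, bob pulls the full history
alice.sync(room)
print(bob.history == alice.history == room.entries)  # True
```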
&lt;p>&lt;strong>Tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>History Retrieval Mechanism:&lt;/strong>
&lt;ul>
&lt;li>Implement chat history synchronization so that when a new device joins a Hyper chat room, it retrieves the entire chat history from the Hypercore feed.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Offline Message Inclusion:&lt;/strong>
&lt;ul>
&lt;li>Ensure that devices that were offline when messages were sent can still access the full chat history upon rejoining the room.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>UI Integration:&lt;/strong>
&lt;ul>
&lt;li>Create a seamless experience for users across devices by ensuring that no messages are lost and that users can access the full chat history regardless of their online status.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Research mDNS (Multicast DNS):&lt;/strong>
&lt;ul>
&lt;li>mDNS is a protocol that allows devices on the same local network to communicate with each other without the need for a central DNS server. This enables peer-to-peer communication, especially in offline environments, making it ideal for offline messaging.&lt;/li>
&lt;li>Experiment with the &lt;code>mDNS()&lt;/code> function to enable peer-to-peer communication for offline chat rooms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Create Hyper Chat Web App Version:&lt;/strong>
&lt;ul>
&lt;li>Currently, Hyper chat is accessed via &lt;code>peersky://p2p/chat&lt;/code>. Develop a web app version of Hyper chat that can be hosted on the &lt;code>hyper://&lt;/code> protocol (&lt;code>hyper://chat.p2plabs.xyz&lt;/code>). This way, other P2P browsers (like &lt;a href="https://agregore.mauve.moe/" target="_blank" rel="noopener">Agregore&lt;/a>) can use it to communicate.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>AR4VIP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ar4vip/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ar4vip/</guid><description>&lt;p>We are interested in developing navigation aids for visually impaired people (VIP) using AR/VR technologies.
Our intended use is primarily indoors, or outdoors within private confines, e.g. a person&amp;rsquo;s backyard.
Using AR/VR headsets or smart glasses allows navigation without a cane and frees
the users&amp;rsquo; hands for other tasks.&lt;/p>
&lt;h3 id="continue-development-on-meta-quest-3-headset">Continue Development on Meta Quest 3 Headset&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Dynamic scenes&lt;/code> &lt;code>Spatial audio&lt;/code> &lt;code>Proximity detection&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> AR/VR familiarity, WebXR, Unity, SLAM, good communicator, good documentation skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>, &lt;a href="mailto:davis@cs.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Continue development and field testing with the Meta Quest 3 headset.
See this &lt;a href="https://github.com/sail360/UCSC-VIP-Research" target="_blank" rel="noopener">repository page&lt;/a> for current status.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Improve spatial audio mapping&lt;/li>
&lt;li>Improve obstacle detection, at different heights, with pre-scanned geometry as well as dynamic objects
e.g. other people, pets, doors&lt;/li>
&lt;li>Special handling of hazards e.g. stairs, uneven floors, etc.&lt;/li>
&lt;li>Explore/incorporate AI to help identify objects in the scene when requested by user&lt;/li>
&lt;/ul>
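&lt;p>As one hypothetical starting point for the spatial-audio mapping task, the sketch below converts an obstacle&amp;rsquo;s head-relative position into a stereo pan and a gain that rises as the obstacle gets closer (the mapping and parameters are illustrative, not the project&amp;rsquo;s current implementation):&lt;/p>

```python
import math

def obstacle_cue(dx, dz, max_range=5.0):
    """Map an obstacle position (metres, head-relative) to an audio cue.

    dx: lateral offset (positive = right), dz: forward distance.
    Pan follows the horizontal bearing; gain rises as the obstacle
    approaches, so nearer hazards sound louder.
    """
    dist = math.hypot(dx, dz)
    if dist > max_range:
        return None  # too far away to sonify
    pan = math.atan2(dx, dz) / (math.pi / 2)   # -1 hard left .. +1 hard right
    pan = max(-1.0, min(1.0, pan))
    gain = 1.0 - dist / max_range              # 0.0 (far) .. 1.0 (touching)
    return {"pan": round(pan, 2), "gain": round(gain, 2)}

print(obstacle_cue(0.0, 2.5))   # directly ahead at half range
```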
&lt;h3 id="new-development-on-smart-glasses">New Development on Smart Glasses&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Dynamic scenes&lt;/code> &lt;code>Spatial audio&lt;/code> &lt;code>Proximity detection&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> AR/VR familiarity, WebXR, Unity, SLAM, good communicator, good documentation skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>, &lt;a href="mailto:davis@cs.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>VR headsets are bulky and awkward, but they are currently more advanced than AR glasses in terms of programmability.
Ultimately, the form factor of smart glasses is more practical for extended use by our target users.
Many vendors are pushing out their own versions of smart glasses targeting various applications,
e.g. as an alternative for watching TV. We are interested in those that provide capabilities to support
spatial computing. Most of these will likely have their own brand-specific APIs. This project has two goals:
(a) develop a generic, brand-independent API, perhaps as extensions to WebXR, to support the overarching goal of a navigation
aid for VIP, and
(b) port the functionality of the VR version to smart glasses while taking advantage of smart glass capabilities and sensors.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Explore current and soon-to-be-available smart glass options, e.g. Snap Spectacles, Xreal Air 2 Ultra, etc., and select a platform to work on (subject to cost and availability of an SDK). At a minimum, the glasses should have microphones, speakers, and cameras. Infrared cameras or other low-light capability is a plus, as is sufficient battery life or an option for quick battery exchange.&lt;/li>
&lt;li>Identify support provided by SDK e.g. does it do realtime scene reconstruction? does it support spatial audio? etc. If it supports features outside of WebXR, provide generic hooks to improve portability of code to other smart glasses.&lt;/li>
&lt;li>Port and extend functionalities from the Meta Quest 3 VR headsets to smart glass platform.&lt;/li>
&lt;li>Add AI support if glasses support them.&lt;/li>
&lt;li>Provide documentation of work.&lt;/li>
&lt;/ul></description></item><item><title>Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/cc-snapshot/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/cc-snapshot/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>A critical challenge in computer systems research reproducibility is establishing and sharing experimental environments. While open testbeds like Chameleon provide access to hardware resources, researchers still face significant barriers when attempting to recreate the precise software configurations, dependencies, and system states needed for reproducible experiments. Environment snapshotting tools offer a solution, but face technical challenges in consistently capturing running systems without introducing distortions or requiring disruptive system modifications. This project addresses these fundamental reproducibility barriers by enhancing CC-Snapshot, a tool that captures the experimental environment configured by the user on bare-metal images, to create more reliable and consistent system captures that can be shared and redeployed without loss of fidelity.&lt;/p>
&lt;p>&lt;a href="https://chameleoncloud.readthedocs.io/en/latest/technical/images.html#the-cc-snapshot-utility" target="_blank" rel="noopener">CC-Snapshot&lt;/a> is a tool on the &lt;a href="chameleoncloud.org">Chameleon&lt;/a> testbed that enables users to package their customized environments as complex images or appliances. By allowing researchers to share these environments easily, CC-Snapshot offers a powerful mechanism for reproducibility, ensuring that experiments can be replicated and extended by others.&lt;/p>
&lt;p>In this project, you will review existing CC-Snapshot workflows, research the latest snapshotting technologies, and develop enhancements that improve the tool’s usability and reliability. This includes ensuring snapshots are created consistently (even when the OS is actively running), preserving the integrity of user systems, and exploring advanced features such as out-of-band snapshotting and API-based triggers.&lt;/p>
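&lt;p>A minimal illustration of the consistency goal: a snapshot captured from a quiesced (offline or frozen) disk should be bit-identical to its source, which can be verified by checksums. The sketch below simulates that check with ordinary files (the file names and copy step are stand-ins, not CC-Snapshot&amp;rsquo;s actual workflow):&lt;/p>

```python
import hashlib
import os
import tempfile

def sha256_file(path, block=1 << 20):
    """Stream a file through SHA-256 in 1 MiB blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(block), b""):
            h.update(chunk)
    return h.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    disk = os.path.join(tmp, "disk.img")
    snap = os.path.join(tmp, "snapshot.img")
    with open(disk, "wb") as f:
        f.write(os.urandom(1 << 16))       # stand-in for a real block device
    with open(disk, "rb") as src, open(snap, "wb") as dst:
        dst.write(src.read())              # stand-in for the capture step
    consistent = sha256_file(disk) == sha256_file(snap)
print(consistent)  # True for a quiesced source; live mutation would break it
```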
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Improved Snapshot Consistency: New methods to capture the full state of a disk without risking corruption or data inconsistency.&lt;/li>
&lt;li>Enhanced Reproducibility: A refined workflow that allows researchers to reliably share custom environments, facilitating collaborative and repeatable experiments.&lt;/li>
&lt;li>User-Friendly Tooling: Streamlined processes that reduce disruption to running systems—so installing dependencies or rebooting into special environments is less burdensome.&lt;/li>
&lt;li>Exploratory Features (Stretch Goals): Advanced mechanisms to stream disk data in real time during snapshotting and to initiate snapshots via an API call (for parity with VM snapshots).&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Cloud Computing, Systems &amp;amp; Infrastructure, Reproducibility, Operating System Internals&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>: Linux / OS Concepts, Cloud Tools, Systems Programming / Scripting, DevOps / CI&lt;/p>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Ensure Snapshot Consistency
&lt;ul>
&lt;li>Reboot into a ramdisk and copy the offline disk.&lt;/li>
&lt;li>Use kexec to switch to/from a ramdisk environment without a full reboot.&lt;/li>
&lt;li>Change images to use a snapshot-capable filesystem (e.g., LVM) for safer live snapshots.&lt;/li>
&lt;li>Investigate additional methods (e.g., blog.benjojo.co.uk) for safely imaging live disks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Prevent System Modifications During Snapshot
&lt;ul>
&lt;li>Currently, CC-Snapshot installs dependencies (e.g., qemu-img) on the running system, affecting its state.&lt;/li>
&lt;li>In-Band Fix: Download and run tools in a temp directory with static linking, avoiding system-level changes.&lt;/li>
&lt;li>Out-of-Band Approach: Snapshots done via ramdisk or kexec do not require altering the running system.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>API-Triggered Snapshots
&lt;ul>
&lt;li>Extend or integrate with the Nova “snapshot instance” API to support the same workflow for bare metal.&lt;/li>
&lt;li>Leverage Ironic’s new “service steps” feature for an automated snapshot pipeline.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>(Stretch Goal) Streaming Snapshots
&lt;ul>
&lt;li>Modify the workflow to stream data directly to storage, rather than making a full local copy first.&lt;/li>
&lt;li>Explore incremental or differential snapshot techniques to reduce bandwidth usage and storage overhead.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>CarbonCast: Building an end-to-end consumption-based Carbon Intensity Forecasting service</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/carboncast/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/carboncast/</guid><description>&lt;p>&lt;a href="https://github.com/carbonfirst/carboncast" target="_blank" rel="noopener">CarbonCast&lt;/a> is a machine-learning-based approach to provide multi-day forecasts of the electrical grid&amp;rsquo;s carbon intensity. Developed in Python, the current version of CarbonCast delivers accurate forecasts in numerous regions by using historical source production data of a particular geographical region, time of day/year, and weather forecasts as features. However, there is no easy way to access and visualize the data through a standard interface. In addition, much important information is currently unavailable to users. For instance, electricity grids often import electricity from neighboring regions, so the electricity consumed depends on both local generation and imports. Moreover, each energy source requires a tailored predictive mechanism. Consequently, any carbon-optimization solution that seeks to reduce emissions from its electricity consumption will benefit more from following a consumption-based CI signal.&lt;/p>
&lt;p>The plan for this project is to develop both the frontend and the backend API services for CarbonCast. We also intend to enhance CarbonCast by implementing an architecture wherein each region can employ a distinct interface for their predictive modeling. In scenarios where these new models do not yield superior outcomes within a region, the current architecture will serve as a fallback solution.&lt;/p>
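&lt;p>The per-region fallback behaviour could be sketched roughly as follows (class, region, and model names are hypothetical, not CarbonCast&amp;rsquo;s actual interface):&lt;/p>

```python
class CarbonForecaster:
    """Per-region model registry with a shared default as fallback (sketch)."""

    def __init__(self, default_model):
        self.default_model = default_model
        self.regional = {}

    def register(self, region, model):
        self.regional[region] = model

    def forecast(self, region, features):
        # Prefer the region's tailored model when one is registered and it
        # succeeds; otherwise fall back to the default architecture.
        model = self.regional.get(region)
        if model is not None:
            try:
                return model(features)
            except Exception:
                pass  # tailored model failed; use the fallback
        return self.default_model(features)

svc = CarbonForecaster(default_model=lambda f: sum(f) / len(f))
svc.register("CISO", lambda f: max(f))          # hypothetical tailored model
print(svc.forecast("CISO", [100, 200, 300]))    # tailored result: 300
print(svc.forecast("ERCOT", [100, 200, 300]))   # fallback result: 200.0
```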
&lt;h3 id="building-an-end-to-end-consumption-based-carbon-intensity-forecasting-service">Building an end-to-end consumption-based Carbon Intensity Forecasting service&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Databases&lt;/code> &lt;code>Machine Learning&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, command line (bash), MySQL, Django, machine learning, cronjob&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/abel-souza/">Abel Souza&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a containerized end-to-end backend, API, and frontend for collecting, estimating, and visualizing real-time and forecast electrical grid&amp;rsquo;s carbon intensity data in a scalable manner.&lt;/p>
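&lt;p>The CSV-to-database ingestion step can be sketched with stdlib SQLite (the column names below are assumed for illustration; CarbonCast&amp;rsquo;s real CSV schema may differ):&lt;/p>

```python
import csv
import io
import sqlite3

# Hypothetical sample of CarbonCast CSV output.
CSV_DATA = """region,timestamp,carbon_intensity
CISO,2025-01-01T00:00,212.5
CISO,2025-01-01T01:00,198.1
ERCOT,2025-01-01T00:00,365.0
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ci (region TEXT, timestamp TEXT, carbon_intensity REAL)"
)
rows = csv.DictReader(io.StringIO(CSV_DATA))
conn.executemany(
    "INSERT INTO ci VALUES (:region, :timestamp, :carbon_intensity)", rows
)

# An API endpoint would serve queries like this one as JSON.
latest = conn.execute(
    "SELECT carbon_intensity FROM ci WHERE region = ? "
    "ORDER BY timestamp DESC LIMIT 1",
    ("CISO",),
).fetchone()
print(latest[0])  # 198.1
```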
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Research web technologies and frameworks relevant to CarbonCast development.&lt;/li>
&lt;li>Run and collect CarbonCast&amp;rsquo;s data (CSV)&lt;/li>
&lt;li>Ingest CSV into a MySQL or SQLite database&lt;/li>
&lt;li>Develop an Application Programming Interface (API) and a Web User Interface (UI) to provide real-time data access and visualization.&lt;/li>
&lt;li>Deploy the CarbonCast API as a service and dockerize it so that other users and applications can locally deploy and use it easily.&lt;/li>
&lt;li>Implement a choropleth web map to visualize the carbon intensity data across the different geographical regions supported by CarbonCast.&lt;/li>
&lt;li>Enhance CarbonCast by implementing an extensible architecture wherein every region can employ distinct models for their predictive modeling.&lt;/li>
&lt;/ul></description></item><item><title>Chameleon Trovi Support for Complex Experiment Appliances</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/trovi/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/trovi/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The discoverability and accessibility of research artifacts remains a significant barrier to reproducibility in computer science research. While digital libraries index research papers, they rarely provide direct access to the artifacts needed to reproduce experiments, especially complex multi-node systems. Additionally, when artifacts are available, they often lack standardized metadata, versioning, and deployment mechanisms that would enable researchers to easily find and reuse them. This project addresses these challenges by extending Trovi, a repository of experimental artifacts executable on open platforms, to support complex, multi-node appliances, making sophisticated experimental environments discoverable, shareable, and deployable through a standardized interface - ultimately lowering the barriers to reproducing complex systems experiments.&lt;/p>
&lt;p>&lt;a href="chameleoncloud.org/">Chameleon&lt;/a> has historically enabled researchers to orchestrate complex appliances—large, multi-node clusters configured via OpenStack Heat—to conduct advanced experiments. Meanwhile, Chameleon team introduced &lt;a href="chameleoncloud.org/experiment/share">Trovi&lt;/a> as repository for open platforms (beyond Chameleon) that pioneers mechanisms for artifact and platform integration leading to immediate execution for pratical reproducibility. This project aims to bridge the two by adding support in Trovi for importing, discovering, and launching complex appliances. By integrating these capabilities, researchers will be able to one-click deploy complex appliances directly from the Trovi dashboard, archive them for future reference, and reproduce experiments on demand.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Extended Trovi API: Enable the import and management of complex appliances as artifacts.&lt;/li>
&lt;li>Streamlined One-Click Launch: Integrate with Chameleon’s existing provisioning workflows so users can launch multi-node clusters directly from Trovi.&lt;/li>
&lt;li>Enhanced Dashboard Experience: Provide UI assistance for discovering, reviewing, and customizing complex appliance artifacts.&lt;/li>
&lt;li>Improved Artifact Reproducibility: Automate the process of exporting CC-snapshot images and other resources to ensure everything is preserved across sites (UC, TACC), highlighting any parameters that need user attention for cross-site portability.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: &lt;code>Reproducible Research&lt;/code>, &lt;code>Cloud Computing &amp;amp; Orchestration&lt;/code>, &lt;code>OpenStack Heat&lt;/code>, &lt;code>UI/UX &amp;amp; Web Development&lt;/code>&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>: Python, APIs, Cloud (OpenStack), DevOps &amp;amp; Automation, Frontend&lt;/p>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large&lt;/p>
&lt;p>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Extensions to the Trovi API
&lt;ul>
&lt;li>Add support for importing complex appliances as artifacts (including Heat templates, metadata, and associated disk images).&lt;/li>
&lt;li>Develop methods for tagging, versioning, and categorizing these appliances, making them easier to discover.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>One-Click Launch of Complex Appliances
&lt;ul>
&lt;li>Integrate with Chameleon’s orchestration engine, enabling single-click cluster deployments from the Trovi UI.&lt;/li>
&lt;li>Validate correct configuration and resource availability through automated checks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Trovi Dashboard Enhancements
&lt;ul>
&lt;li>Update the front-end to provide intuitive controls for customizing or parameterizing complex appliances before launching.&lt;/li>
&lt;li>Offer a clear workflow for reviewing dependencies, resource requirements, and usage instructions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Automated Export &amp;amp; Multi-Site Testing
&lt;ul>
&lt;li>Streamline the export of snapshots or images into Trovi as part of the appliance import process.&lt;/li>
&lt;li>Optionally re-run the imported appliances at multiple sites (UC, TACC), detecting any unparameterized settings or missing dependencies.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Contextualization – Extending Chameleon’s Orchestration for One-Click Experiment Deployment</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/contextualization/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/contextualization/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Reproducibility in computer systems research is often hindered by the quality and completeness of artifact descriptions and the complexity of establishing experimental environments. When experiments involve multiple interconnected components, researchers struggle with hardcoded configurations, inadequate documentation of setup processes, and missing validation steps that would verify correct environment establishment. This project addresses these challenges by extending orchestration capabilities beyond basic hardware provisioning to include comprehensive contextualization—making complex, multi-component experimental environments deployable via parameterized templates with clear validation points, standardized metadata, and minimal user intervention—thus significantly reducing the barriers to reproducing complex distributed systems experiments.&lt;/p>
&lt;p>&lt;a href="chameleoncloud.org">Chameleon&lt;/a> already provides powerful capabilities to orchestrate and configure resources through Heat templates (similar to Terraform) and the &lt;a href="https://python-chi.readthedocs.io/" target="_blank" rel="noopener">python-chi&lt;/a> library. However, these focus primarily on provisioning (i.e., allocating and configuring hardware resources). This project goes a step further by addressing contextualization—the process of creating complete, ready-to-use experimental environments that incorporate everything from network layout to instance-level configuration and discovery—with additional features such as parameterized templates, experiment-level metadata, and output reporting.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Template-Based One-Click Launch: Users can deploy multi-resource experiments (VMs, networks, storage, etc.) via a single click or a minimal set of input parameters.&lt;/li>
&lt;li>Enhanced Experiment Contextualization: Each launched resource can gain access to global “experiment-level” metadata (e.g., IP-to-hostname mappings for cluster authentication) and outputs that summarize important details.&lt;/li>
&lt;li>Streamlined User Experience: An asynchronous deployment workflow that provides notifications and uses “outputs” to highlight critical connection information (e.g., bastion host IP, final results).&lt;/li>
&lt;li>Optional Advanced Features: Partial reconfiguration to avoid full rebuilds when changes are minor, an “export” function to capture existing deployments into a new template, and potential publishing to Trovi for reproducibility and archiving.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Cloud Computing &amp;amp; Orchestration, Infrastructure as Code, DevOps &amp;amp; Automation, Reproducible Research Environments&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>OpenStack &amp;amp; Heat Templates: Familiarity with provisioning resources on Chameleon using Heat or Terraform-like workflows.&lt;/li>
&lt;li>Python &amp;amp; Scripting: For enhancing or extending the python-chi library.&lt;/li>
&lt;li>Systems / Network Knowledge: Understanding multi-VM topologies, cluster configurations, and network-level interactions.&lt;/li>
&lt;li>CI/CD &amp;amp; DevOps: Experience building or integrating asynchronous deployment and notifications.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large (suitable for a semester-long project or a summer internship)&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/paul-marshall/">Paul Marshall&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>One-Click Template Launch
&lt;ul>
&lt;li>Design a template (in Heat or similar) specifying multiple cloud resources (images, networks, disk images, SSH keys, etc.).&lt;/li>
&lt;li>Ensure the template author can define input parameters with defaults.&lt;/li>
&lt;li>Allow the user to launch the template quickly with default values or adjust parameters before deployment.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Asynchronous Provisioning &amp;amp; Notifications
&lt;ul>
&lt;li>Implement a long-running process that deploys resources step-by-step.&lt;/li>
&lt;li>Provide status updates to the user (e.g., via UI notifications, email, or logs) when deployments complete or fail.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Experiment-Level Metadata
&lt;ul>
&lt;li>Inject metadata such as IP-to-hostname mappings to each instance for easy cluster authentication.&lt;/li>
&lt;li>Allow the template to define “outputs” (like a public IP of a bastion or location of final results).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Partial Reconfiguration (Optional)
&lt;ul>
&lt;li>Enable partial updates if only one of several servers changes, saving time and resources.&lt;/li>
&lt;li>Improve fault tolerance by avoiding full redeploys in the event of partial failures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Export Running Configurations into a New Template (Optional)
&lt;ul>
&lt;li>Build a web-interface or script to detect existing user-owned resources (servers, networks, etc.).&lt;/li>
&lt;li>Generate a proposed template from those resources, suggesting parameters (e.g., flavor, disk image, or SSH key).&lt;/li>
&lt;li>Extend or modify existing templates by adding discovered resources.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Integration with Trovi / Multi-Site Testing (Optional)
&lt;ul>
&lt;li>Provide a method to archive or publish the final template (and associated disk images, data sets) in Trovi.&lt;/li>
&lt;li>Attempt to re-run the template at multiple Chameleon sites (e.g., UC, TACC) to identify parameters or modifications needed for cross-site reproducibility.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/mpi/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/mpi/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Message Passing Interface (MPI) is the dominant programming model for high-performance computing (HPC), enabling applications to scale efficiently across thousands of processing cores. In reproducibility initiatives for HPC research, MPI implementations are critical as they manage the complex communications that underpin parallel scientific applications. However, reproducing MPI-based experiments remains challenging due to the need for specific library versions, network configurations, and multi-node setups that must be precisely orchestrated.&lt;/p>
&lt;p>Because an “MPI cluster” serves as a base layer for many HPC results, the SC24 reproducibility chair specifically requested an MPI template and appliance to support the conference&amp;rsquo;s reproducibility initiative, providing researchers with standardized environments for validating results. By extending the work begun for SC24, this project aims to create higher-quality, ready-to-use, and maintainable MPI environments for the Chameleon testbed that abstract away complex configuration details while ensuring consistent performance across experiments, making HPC experiments more accessible and reproducible for the broader research community.&lt;/p>
&lt;p>You will lead efforts to configure disk images with the necessary MPI dependencies and provide orchestration templates that set up networking and instances automatically. The resulting appliance will allow researchers to quickly and consistently deploy distributed computing environments with MPI. The goal is to facilitate reproducible and scalable computational experiments for a wide range of scientific and engineering applications.&lt;/p>
&lt;h1 id="key-outcomes">Key Outcomes&lt;/h1>
&lt;ul>
&lt;li>Ready-to-Use MPI Disk Images: Create one or more images pre-configured with the correct versions of MPI and dependencies, ensuring a consistent environment.&lt;/li>
&lt;li>Simple Cluster Configuration Scripts: Provide scripts or playbooks that efficiently bring up a fully functional MPI cluster on Chameleon, abstracting away manual setup steps.&lt;/li>
&lt;li>Orchestration Template: An automated workflow that sets up networks, instances, and additional resources needed to run large-scale MPI workloads.&lt;/li>
&lt;/ul>
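&lt;p>As a rough sketch of what the cluster configuration scripts might emit (plain Python; the file name &lt;code>hosts.txt&lt;/code>, the node IPs, and the slot counts are illustrative assumptions, not part of this project), the helpers below render an OpenMPI-style hostfile and the matching &lt;code>mpirun&lt;/code> invocation for a hello-world smoke test:&lt;/p>

```python
def build_hostfile(nodes, slots_per_node=1):
    """Render an MPI hostfile, one "ip slots=N" line per instance."""
    return "\n".join(f"{ip} slots={slots_per_node}" for ip in nodes)

def mpirun_command(nodes, program, slots_per_node=1):
    """Assemble the mpirun invocation for a cluster-wide smoke test."""
    total_ranks = len(nodes) * slots_per_node
    return ["mpirun", "-np", str(total_ranks), "--hostfile", "hosts.txt", program]

# Example: a two-node cluster with two ranks per node.
nodes = ["10.0.0.1", "10.0.0.2"]
print(build_hostfile(nodes, slots_per_node=2))
print(" ".join(mpirun_command(nodes, "./hello_mpi", slots_per_node=2)))
```

&lt;p>An Ansible playbook or Heat template would supply the node list; the same helpers could then be reused to validate clusters of any size.&lt;/p>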
&lt;p>&lt;strong>Topics&lt;/strong>: High-Performance Computing (HPC), Cloud Computing, MPI &amp;amp; Distributed Systems, DevOps &amp;amp; Automation&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>MPI &amp;amp; Parallel Programming: Understanding of MPI libraries, cluster configuration, and typical HPC workflows.&lt;/li>
&lt;li>Cloud Orchestration: Familiarity with OpenStack Heat or other Infrastructure-as-Code (IaC) tools for provisioning resources.&lt;/li>
&lt;li>Linux System Administration: Experience configuring and troubleshooting packages, network settings, and performance optimizations.&lt;/li>
&lt;li>Scripting &amp;amp; Automation: Ability to write scripts (e.g., Bash, Python) to automate setup and deployment steps.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate to Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ken-raffenetti/">Ken Raffenetti&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Disk Images with MPI Dependencies
&lt;ul>
&lt;li>Build base images with the correct versions of MPI (e.g., MPICH, OpenMPI) and any required libraries (e.g., GCC, network libraries).&lt;/li>
&lt;li>Ensure all packages are up to date and tested for compatibility with Chameleon’s bare metal and/or VM environments.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Cluster Setup Scripts
&lt;ul>
&lt;li>Develop lightweight scripts or Ansible playbooks that join new instances into an MPI cluster, configuring hostnames, SSH keys, and MPI runtime settings.&lt;/li>
&lt;li>Validate cluster functionality by running simple distributed “Hello World” tests and more advanced benchmarks (e.g., Intel MPI Benchmarks).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Orchestration Template
&lt;ul>
&lt;li>Provide a Heat template (or similar) specifying the network configuration, instance counts, and environment variables for MPI.&lt;/li>
&lt;li>Enable easy parameterization of cluster size, disk images, and other variables so users can customize their setups on the fly.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Integration &amp;amp; Testing
&lt;ul>
&lt;li>Document best practices for launching and using the MPI images in Chameleon.&lt;/li>
&lt;li>Demonstrate reproducibility with multiple cluster sizes and workloads to ensure reliability.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Smart Environments – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/envgym/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/envgym/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The complexity of environment setup and the expertise required to configure specialized software stacks can often hinder efforts to reproduce important scientific achievements in HPC and systems studies. Researchers often struggle with incomplete or ambiguous artifact descriptions that make assumptions about &amp;ldquo;common knowledge&amp;rdquo; that is actually specific domain expertise. When trying to reproduce experiments, reviewers may spend excessive time debugging environment inconsistencies rather than evaluating the actual research. These challenges are compounded when experiments need to run on different hardware configurations.&lt;/p>
&lt;p>This project seeks to address these fundamental reproducibility barriers by using AI to translate natural language environment requirements often used in papers or artifact descriptions into actionable, reproducible configurations—bridging the knowledge gap between experiment authors and reviewers while standardizing environment creation across different hardware platforms. We will develop an AI-driven system that automatically generates and configures reproducible computing environments based on artifact descriptions from conferences, Trovi artifacts on the &lt;a href="https://chameleoncloud.org">Chameleon&lt;/a> testbed, and other reliable sources for scientific experiment code and associated documentation. Leveraging Natural Language Processing (NLP), the system will allow researchers to describe desired environments in plain English, then map those descriptions onto predefined configuration templates. By simplifying environment creation and ensuring reproducibility, the system promises to eliminate duplicate setup efforts, accelerate research workflows, and promote consistent experimentation practices across diverse hardware.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Working Prototype: A system that automatically generates machine images deployable on bare metal and VM instances, based on user-provided requirements.&lt;/li>
&lt;li>Comprehensive Documentation: Detailed user manuals, guides, and best practices tailored to researchers, ensuring a smooth adoption process.&lt;/li>
&lt;li>Live Demo: A demonstration environment (e.g., a web app or Jupyter notebook) that shows how to request, configure, and launch reproducible cloud environments on both hardware profiles.&lt;/li>
&lt;li>Long-Term Impact: Building blocks for future AI-driven automation of cloud infrastructure, reducing human error and enabling fast, repeatable research pipelines.&lt;/li>
&lt;/ul>
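&lt;p>To make the NLP task concrete, here is a minimal keyword-based sketch (plain Python with a hypothetical package table; a real system would use an actual NLP model rather than a regex) of mapping a plain-English request onto an environment &amp;ldquo;recipe&amp;rdquo;:&lt;/p>

```python
import re

# Hypothetical keyword table; a real system would recognize far more software.
KNOWN_PACKAGES = {"python", "cuda", "scikit-learn", "tensorflow", "pytorch"}

def parse_requirements(text):
    """Map a plain-English request onto an environment 'recipe' dict."""
    recipe = {}
    # Match tokens like "Python 3.9" or bare package names like "scikit-learn".
    for name, version in re.findall(r"([A-Za-z][\w-]*)\s*([\d.]+)?", text):
        key = name.lower()
        if key in KNOWN_PACKAGES:
            recipe[key] = version or "latest"
    return recipe

print(parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn"))
```

&lt;p>The resulting recipe dict would then feed the backend environment builder, which turns it into a machine-image definition.&lt;/p>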
&lt;p>&lt;strong>Topics&lt;/strong>: Reproducibility, AI &amp;amp; NLP, Cloud Computing, DevOps and Automation&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Machine Learning / AI: Familiarity with NLP methods to interpret user requirements.&lt;/li>
&lt;li>Python: Primary language for backend services and cloud interactions.&lt;/li>
&lt;li>Cloud API Integration: Experience with OpenStack or similar APIs to provision and configure images on both bare metal and virtual machines.&lt;/li>
&lt;li>DevOps: Automated environment configuration, CI/CD workflows, and containerization.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/paul-marshall/">Paul Marshall&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Requirement Gathering &amp;amp; NLP Design
&lt;ul>
&lt;li>Research the specific needs of researchers building experimental setups.&lt;/li>
&lt;li>Design an NLP pipeline to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into environment “recipes.”&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Backend Environment Builder
&lt;ul>
&lt;li>Implement logic that converts parsed user requirements into machine-image definitions for bare metal and VM instances.&lt;/li>
&lt;li>Integrate with Chameleon’s APIs to provision servers, install software, and run configuration validation automatically.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Front-End &amp;amp; User Experience
&lt;ul>
&lt;li>Develop an intuitive web or CLI interface that researchers can use to capture experiment environment requirements.&lt;/li>
&lt;li>Provide real-time status updates during environment setup, along with meaningful error messages and quick-start templates.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Testing &amp;amp; Validation
&lt;ul>
&lt;li>Conduct end-to-end tests using diverse software stacks (e.g., HPC libraries, machine learning frameworks) on bare metal and VM instances.&lt;/li>
&lt;li>Ensure reproducibility by re-creating the same environment multiple times and comparing configurations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Documentation &amp;amp; Demonstration
&lt;ul>
&lt;li>Produce user-facing documentation, including tutorials and best practices for researchers who frequently run experiments on Chameleon Cloud.&lt;/li>
&lt;li>Create a short live demo or screencast showcasing how to configure an environment for a specific research workflow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Smart Environments – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/smart-environments/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/smart-environments/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The complexity of environment setup and the expertise required to configure specialized software stacks can often hinder efforts to reproduce important scientific achievements in HPC and systems studies. Researchers often struggle with incomplete or ambiguous artifact descriptions that make assumptions about &amp;ldquo;common knowledge&amp;rdquo; that is actually specific domain expertise. When trying to reproduce experiments, reviewers may spend excessive time debugging environment inconsistencies rather than evaluating the actual research. These challenges are compounded when experiments need to run on different hardware configurations.&lt;/p>
&lt;p>This project seeks to address these fundamental reproducibility barriers by using AI to translate natural language environment requirements often used in papers or artifact descriptions into actionable, reproducible configurations—bridging the knowledge gap between experiment authors and reviewers while standardizing environment creation across different hardware platforms. We will develop an AI-driven system that automatically generates and configures reproducible computing environments based on artifact descriptions from conferences, Trovi artifacts on the &lt;a href="https://chameleoncloud.org">Chameleon&lt;/a> testbed, and other reliable sources for scientific experiment code and associated documentation. Leveraging Natural Language Processing (NLP), the system will allow researchers to describe desired environments in plain English, then map those descriptions onto predefined configuration templates. By simplifying environment creation and ensuring reproducibility, the system promises to eliminate duplicate setup efforts, accelerate research workflows, and promote consistent experimentation practices across diverse hardware.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Working Prototype: A system that automatically generates machine images deployable on bare metal and VM instances, based on user-provided requirements.&lt;/li>
&lt;li>Comprehensive Documentation: Detailed user manuals, guides, and best practices tailored to researchers, ensuring a smooth adoption process.&lt;/li>
&lt;li>Live Demo: A demonstration environment (e.g., a web app or Jupyter notebook) that shows how to request, configure, and launch reproducible cloud environments on both hardware profiles.&lt;/li>
&lt;li>Long-Term Impact: Building blocks for future AI-driven automation of cloud infrastructure, reducing human error and enabling fast, repeatable research pipelines.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Reproducibility, AI &amp;amp; NLP, Cloud Computing, DevOps and Automation&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Machine Learning / AI: Familiarity with NLP methods to interpret user requirements.&lt;/li>
&lt;li>Python: Primary language for backend services and cloud interactions.&lt;/li>
&lt;li>Cloud API Integration: Experience with OpenStack or similar APIs to provision and configure images on both bare metal and virtual machines.&lt;/li>
&lt;li>DevOps: Automated environment configuration, CI/CD workflows, and containerization.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/paul-marshall/">Paul Marshall&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Requirement Gathering &amp;amp; NLP Design
&lt;ul>
&lt;li>Research the specific needs of researchers building experimental setups.&lt;/li>
&lt;li>Design an NLP pipeline to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into environment “recipes.”&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Backend Environment Builder
&lt;ul>
&lt;li>Implement logic that converts parsed user requirements into machine-image definitions for bare metal and VM instances.&lt;/li>
&lt;li>Integrate with Chameleon’s APIs to provision servers, install software, and run configuration validation automatically.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Front-End &amp;amp; User Experience
&lt;ul>
&lt;li>Develop an intuitive web or CLI interface that researchers can use to capture experiment environment requirements.&lt;/li>
&lt;li>Provide real-time status updates during environment setup, along with meaningful error messages and quick-start templates.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Testing &amp;amp; Validation
&lt;ul>
&lt;li>Conduct end-to-end tests using diverse software stacks (e.g., HPC libraries, machine learning frameworks) on bare metal and VM instances.&lt;/li>
&lt;li>Ensure reproducibility by re-creating the same environment multiple times and comparing configurations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Documentation &amp;amp; Demonstration
&lt;ul>
&lt;li>Produce user-facing documentation, including tutorials and best practices for researchers who frequently run experiments on Chameleon Cloud.&lt;/li>
&lt;li>Create a short live demo or screencast showcasing how to configure an environment for a specific research workflow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Widgets for Python-chi in Jupyter</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/jupyter-widgets/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/jupyter-widgets/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Reproducibility challenges in research extend beyond code and environments to the experimental workflow itself. When experiments involve dynamic resource allocation, monitoring, and reconfiguration, researchers often struggle to document these interactive steps in a way that others can precisely follow. The lack of structured workflow documentation and real-time feedback creates barriers for reviewers attempting to reproduce experiments, as they cannot easily verify whether their resource configurations match the original experiment&amp;rsquo;s state. This project addresses these challenges by developing interactive Jupyter widgets that make experiment resource management more visual, intuitive, and self-documenting—transforming ad-hoc command sequences into reproducible workflows that automatically log interactions and configuration changes while providing immediate visual feedback on experiment topology and resource states.&lt;/p>
&lt;p>As cloud researchers often work with Jupyter Notebooks for interactive data analysis and experimentation, the &lt;a href="https://python-chi.readthedocs.io/" target="_blank" rel="noopener">python-chi&lt;/a> library offers a powerful way to automate and control resources on &lt;a href="https://chameleoncloud.org">Chameleon Cloud&lt;/a>. This project will extend python-chi by adding interactive widgets specifically designed for use in Jupyter, empowering users to launch, monitor, and manage their experiments without leaving the notebook environment. By bringing visual and intuitive controls directly into the user’s workflow, we aim to improve both reproducibility and usability for complex resource management tasks.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>User-Friendly Jupyter Widgets: Develop a suite of widgets to visualize reserved resources, hardware availability, and experiment topologies in real time.&lt;/li>
&lt;li>Integrated Experiment Management: Enable researchers to orchestrate experiments (launch, configure, monitor) within a single, notebook-centric workflow.&lt;/li>
&lt;li>Enhanced Feedback &amp;amp; Usability: Provide clear, asynchronous status updates and resource reconfiguration progress, reducing confusion and user error.&lt;/li>
&lt;li>Improved Reproducibility: By automating and logging widget interactions, experiments become more traceable and easier to replicate.&lt;/li>
&lt;/ul>
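&lt;p>One way to realize the &amp;ldquo;automating and logging widget interactions&amp;rdquo; outcome is an interaction log that every widget callback writes to, so a session can later be replayed step by step. The sketch below is plain Python; the action names are hypothetical placeholders, not part of python-chi:&lt;/p>

```python
import json
import time

class InteractionLog:
    """Record widget actions so an interactive session can be replayed."""

    def __init__(self):
        self.events = []

    def record(self, action, **params):
        """Called from a widget callback whenever the user acts."""
        self.events.append({"t": time.time(), "action": action, "params": params})

    def to_json(self):
        """Serialize the session for archiving alongside the notebook."""
        return json.dumps(self.events, indent=2)

    def replay(self, handlers):
        """Re-run each logged action through a matching handler function."""
        for event in self.events:
            handlers[event["action"]](**event["params"])
```

&lt;p>In a real widget suite, each ipywidgets callback would call &lt;code>record&lt;/code> before invoking python-chi, and the resulting JSON log would be archived with the experiment so reviewers can replay the exact sequence of resource changes.&lt;/p>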
&lt;p>&lt;strong>Topics&lt;/strong>: Interactive Data Tools, Cloud Resource Management, DevOps &amp;amp; Automation, User Experience (UX)&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Python &amp;amp; Jupyter: Experience creating custom Jupyter widgets, using ipywidgets or similar frameworks.&lt;/li>
&lt;li>Cloud Automation: Familiarity with how resources are provisioned, monitored, and deprovisioned on Chameleon.&lt;/li>
&lt;li>Frontend / GUI Development: Basic understanding of web technologies (HTML/CSS/JavaScript) can be helpful for widget design.&lt;/li>
&lt;li>Software Engineering &amp;amp; CI: Ability to version-control, test, and deploy Python packages.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Resource Visualization Widgets
&lt;ul>
&lt;li>Build custom widgets that show reserved resources (nodes, networks, storage) in Jupyter.&lt;/li>
&lt;li>Provide an interactive topology view for experiments, indicating node statuses and connections.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Experiment Setup &amp;amp; Execution
&lt;ul>
&lt;li>Add controls for launching and managing experiments directly from notebooks.&lt;/li>
&lt;li>Show feedback (e.g., progress bars, status messages) as resources are being allocated or reconfigured.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Hardware Availability &amp;amp; Status Tracking
&lt;ul>
&lt;li>Implement a widget that provides real-time data on Chameleon’s hardware availability (bare metal, VMs, GPU nodes, etc.).&lt;/li>
&lt;li>Allow users to filter or select specific resources based on current hardware states.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Usability &amp;amp; Feedback Loop
&lt;ul>
&lt;li>Gather user feedback on the widget designs and workflows.&lt;/li>
&lt;li>Refine the interface to minimize clicks, improve clarity, and reduce friction for common tasks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/</link><pubDate>Sat, 15 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/</guid><description>&lt;h2 id="project-description">Project Description&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Distributed systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Java, Go, Python, Bash scripting, Linux, Docker.&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="mailto:fikurnia@cs.umass.edu">Fadhil I. Kurnia&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Replication is commonly employed to improve system availability and reduce latency. By maintaining multiple copies, the system can continue operating even if some replicas fail, thereby ensuring consistent availability. Placing replicas closer to users further decreases latency by minimizing the distance data must travel. A typical illustration of these advantages is a Content Delivery Network (CDN), where distributing content to edge servers can yield latencies of under 10 milliseconds when users and content are in the same city.&lt;/p>
&lt;p>In recent times, numerous edge datastores have emerged, allowing dynamic data to be served directly from network-edge replicas. Each of these replicated systems may employ different coordination protocols to synchronize replicas, leading to varied performance and consistency characteristics. For instance, Workers KV relies on a push-based coordination mechanism that provides eventual consistency, whereas Cloudflare Durable Objects and Turso deliver stronger consistency guarantees. Additionally, researchers have introduced various coordination protocols—such as SwiftPaxos, EPaxos, OPaxos, WPaxos, Raft, PANDO, and QuePaxa—each exhibiting its own performance profile, especially when used in geo-distributed deployments.&lt;/p>
&lt;p>This project aims to develop an open testbed for evaluating replicated systems and their coordination protocols under edge deployment. Currently, researchers face challenges in fairly comparing different replicated systems, as they often lack control over replica placement. Many previous studies on coordination protocols and replicated systems relied on mock implementations, particularly for well-known systems like Dynamo and Spanner, which are not open source. An open testbed would provide a standardized environment where researchers can compare various replicated systems, classes of coordination protocols, and specific protocol implementations using common benchmarks. Since the performance of replicated systems and coordination protocols varies depending on the application, workload, and replica placement, this testbed would offer a more systematic and fair evaluation framework. Furthermore, by enabling easier testing and validation, the testbed could accelerate the adoption of research prototypes in the industry.&lt;/p>
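&lt;p>A small example of why replica placement matters when comparing protocols: in a leader-based protocol, a commit completes once the fastest quorum of replicas has acknowledged it, so commit latency is an order statistic of the leader&amp;rsquo;s round-trip times. The sketch below (plain Python; the RTT values are made up for illustration) computes that latency:&lt;/p>

```python
def quorum_commit_latency(rtts_ms, quorum):
    """Commit latency for a leader-based protocol: the commit completes
    once the fastest `quorum` replicas have acked. rtts_ms holds the
    leader's round-trip time to each replica (0 for the leader itself)."""
    assert quorum >= 1 and len(rtts_ms) >= quorum
    return sorted(rtts_ms)[quorum - 1]

# Five replicas: the leader plus four remotes; a majority quorum is 3.
rtts = [0, 12, 45, 80, 200]
print(quorum_commit_latency(rtts, quorum=3))
```

&lt;p>Changing which replicas are nearby changes that order statistic, which is exactly the placement effect a controlled testbed lets researchers measure fairly across protocols.&lt;/p>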
&lt;h2 id="project-deliverables">Project Deliverables&lt;/h2>
&lt;ul>
&lt;li>Compilation of traces and applications from various open traces and open benchmarks.&lt;/li>
&lt;li>Distributed workload generator to run the traces and applications.&lt;/li>
&lt;li>Test framework to simulate the latency of hundreds of edge servers for measurement.&lt;/li>
&lt;li>Open artifact of the traces, applications, workload generator, and test framework, published on GitHub.&lt;/li>
&lt;/ul></description></item><item><title>Vector Embeddings Dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings/</link><pubDate>Tue, 11 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings/</guid><description>&lt;h3 id="vector-embeddings-dataset">Vector Embeddings Dataset&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Vector Embeddings&lt;/code> &lt;code>LLMs&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, apis, scripting, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:jayjeetc@ucsc.edu">Jayjeet Chakraborty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>To benchmark vector search algorithms (also known as ANN algorithms), several datasets are available, but none of them
represent actual real-world workloads, because they usually contain small vectors of only a few hundred
dimensions. For vector search experiments to reflect real-world workloads, we want datasets with
several thousand dimensions, like those generated by OpenAI&amp;rsquo;s text-embedding models. This project aims to create a
dataset with 1B embeddings from a wikipedia dataset using open source models. Ideally, we will have 3 versions of this dataset, with 1024, 4096, and 8192 sized embeddings to start with.&lt;/p></description></item><item><title>Brahma</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/brahma/</link><pubDate>Tue, 11 Feb 2025 12:34:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/brahma/</guid><description>&lt;p>Brahma is a lightweight framework for building collaborative and cross platform WebXR based experiences using Three.js for the front-end and a simple Node.js/WebSocket script on the backend. It was created at the Social Emotional Technology Lab to facilitate the development of novel collaborative interfaces and virtual environments capable of loading scientific datasets. For example, in the featured image, multiple avatars are exploring a &lt;a href="https://www.science.org/doi/10.1126/science.adf0566" target="_blank" rel="noopener">marine science dataset related to seal migration paths&lt;/a> overlaid on NOAA bathymetry and telemetry data.&lt;/p>
&lt;p>It addresses a gap where prior open-source collaborative VR is no longer available such as the defunct &lt;a href="https://support.mozilla.org/en-US/kb/end-support-mozilla-hubs" target="_blank" rel="noopener">Mozilla Hubs&lt;/a> or proprietary engine based frameworks such as &lt;a href="https://ubiq.online/" target="_blank" rel="noopener">Ubiq&lt;/a>. Furthermore, it uses very little computational resources to run and develop, enabling creators who may not have a powerful computer to run a game engine in order to develop a networked VR application.&lt;/p>
&lt;p>This project involves the first public release of Brahma: creating a lightweight open-source framework that facilitates multi-user games, scientific visualizations, and other applications. To do so, we need to formalize the framework, provide documentation, and implement key examples so that the open-source tool can be extensible and serve a wider community.&lt;/p>
&lt;p>Mentees can expect to learn best practices for VR development and testing and gain familiarity with full stack development practices. Mentees should have access and experience using a VR headset.&lt;/p>
&lt;h1 id="brahma--protoocol-release-and-validation">Brahma / Protoocol Release and Validation&lt;/h1>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>Software Architecture&lt;/code> &lt;code>VR Development&lt;/code> &lt;code>Computer Graphics&lt;/code> &lt;code>Cloud Platforms&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Node.js, Three.js&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:sghosh17@ucsc.edu">Samir Ghosh&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The proposed work includes three phases, primarily involving backend code and API design. In the first phase, to gain familiarity, the mentee will run and test the Brahma backend on a variety of cloud platforms such as AWS, Google Cloud, and Azure, learning best methods for documentation in the process. Then, in the second phase, the mentee will work on formalizing the protocol for avatar embodiment and other multi-user interfaces, testing the application with a simple pong game. In the third phase, the mentee will address telemetry, logging, and analysis considerations.&lt;/p>
&lt;p>This project is well suited for someone who is interested in virtual reality, especially social VR, multi-user, or collaborative applications.&lt;/p>
&lt;h1 id="brahma--allocentric-webxr-interfaces">Brahma / Allocentric WebXR Interfaces&lt;/h1>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>VR Development&lt;/code> &lt;code>Computer Graphics&lt;/code> &lt;code>UX/UI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Three.js, GLSL, WebSocket&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:sghosh17@ucsc.edu">Samir Ghosh&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The proposed work primarily involves front-end code and VR interface design. In the first phase, the mentee will gain familiarity with best practices for WebXR development through the implementation and documentation of simple interaction patterns. Then, the mentee will implement a simple multi-user pong game to learn about allocentric interfaces. In the final phase of the project, the mentee will design and implement one or more allocentric interfaces of their choosing.&lt;/p>
&lt;p>This project is well suited for someone who has interest in virtual reality, especially aspects of graphics and interaction design.&lt;/p></description></item><item><title>WildBerryEye</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/</link><pubDate>Tue, 11 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/</guid><description>&lt;p>WildBerryEye leverages Raspberry Pi and YOLO object detection models to monitor pollinizers like bees and hummingbirds visiting flowers. This initiative aims to enhance environmental research by automating data collection and analysis of pollinator activities, which are crucial for ecological assessments and conservation efforts. The project utilizes video data provided by &lt;a href="https://www.researchgate.net/profile/Rossana-Maguina-Conde" target="_blank" rel="noopener">Dr. Rossana Maguiña&lt;/a>, processed through advanced machine learning techniques to accurately identify and track pollinator interactions in natural habitats.&lt;/p>
&lt;h3 id="develop-web-based-user-interface">Develop web-based user interface&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Full Stack Development&lt;/code> &lt;code>React&lt;/code> &lt;code>Flask&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Experience with full-stack development and real-time processing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate to Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hrs)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:caiespin@ucsc.edu">Carlos Isaac Espinosa Ramirez&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a clean and intuitive web-based interface for WildBerryEye, ensuring ease of use for researchers and contributors. The platform should present real-time pollinator detection results, facilitate data visualization, and allow users to interact with system settings efficiently. The website must be accessible, visually appealing, and optimized for both desktop and mobile users, avoiding unnecessary complexity or intrusive elements.&lt;/p>
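&lt;p>As a sketch of the real-time path, one lightweight option is Server-Sent Events (SSE): the backend streams framed JSON events that the frontend consumes with &lt;code>EventSource&lt;/code>. The event name and payload fields below are hypothetical, not part of the existing WildBerryEye code:&lt;/p>

```python
import json

def sse_frame(event, payload):
    """Format one Server-Sent Events frame carrying a detection result.

    SSE is one simple way to push real-time updates from a Flask backend
    to a React frontend; the event name and payload shape are illustrative.
    """
    data = json.dumps(payload)
    return f"event: {event}\ndata: {data}\n\n"

# Example: announce a new pollinator detection to connected browsers.
frame = sse_frame("detection", {"species": "bee", "confidence": 0.92})
print(frame)
```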
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Frontend Development: Continue development to enhance the user interface using React and CSS, ensuring a responsive and user-friendly design.&lt;/li>
&lt;li>Backend Development: Expand functionality using Flask, focusing on efficient API endpoints and seamless interaction with the frontend (excluding database implementation).&lt;/li>
&lt;li>Real-Time Communication: Implement and refine real-time updates between the frontend and backend to enhance system responsiveness.&lt;/li>
&lt;li>Usability &amp;amp; Design Optimization: Research and propose improvements to the system’s usability, design, and overall user experience.&lt;/li>
&lt;/ul></description></item><item><title>AI Data Readiness Inspector (AIDRIN)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</guid><description>&lt;p>Garbage In, Garbage Out (GIGO) is a maxim universally agreed upon by computer scientists across domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/pdf/2406.19256" target="_blank" rel="noopener">AIDRIN&lt;/a> (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of the readiness of data for AI processes, covering a broad range of readiness dimensions available in the literature. AIDRIN uses metrics in traditional data quality assessment, such as completeness, outliers, and duplicates, for data evaluation. Furthermore, AIDRIN uses metrics specific to assess data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and FAIR (Findability, Accessibility, Interoperability, and Reusability) principle compliance. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data.&lt;/p>
&lt;h3 id="aidrin-visualizations-and-science-gateway">AIDRIN Visualizations and Science Gateway&lt;/h3>
&lt;p>The proposed work will include improvements in the AIDRIN framework to (1) enhance, extend, and optimize the visualizations of metrics related to all six pillars of AI data readiness and (2) set up a science gateway on NERSC or AWS cloud service.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>data readiness&lt;/code> &lt;code>AI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench with AI workloads</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including I/O overhead and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite cover I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and to evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
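&lt;p>To illustrate the kind of measurement an h5bench-style kernel makes, the sketch below times a sequential-write pattern against a plain file and reports observed throughput. It is only an analogy: real h5bench kernels exercise HDF5 (often in parallel via MPI), not raw POSIX files:&lt;/p>

```python
import os
import tempfile
import time

def write_kernel(path, num_ops, block_bytes):
    """Sequential-write kernel: issue num_ops writes of block_bytes each
    and report observed throughput, the basic quantity an h5bench-style
    benchmark measures."""
    buf = b"x" * block_bytes
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(num_ops):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())  # include the cost of reaching the device
    elapsed = time.perf_counter() - start
    return num_ops * block_bytes / elapsed  # bytes per second

with tempfile.TemporaryDirectory() as d:
    rate = write_kernel(os.path.join(d, "out.bin"), 256, 4096)
    print(f"observed rate: {rate / 1e6:.1f} MB/s")
```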
&lt;h3 id="h5bench-with-ai-workloads">h5bench with AI workloads&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>HAgent</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/hagent/</link><pubDate>Tue, 11 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/hagent/</guid><description>&lt;p>&lt;a href="https://github.com/masc-ucsc/hagent" target="_blank" rel="noopener">HAgent&lt;/a> is a platform for building an AI hardware agent engine that supports multiple components in chip design, such as code generation, verification, debugging, and tapeout.&lt;/p>
&lt;p>HAgent is built as a compiler for hardware agents; it interfaces with
typical EDA tools like compilers, synthesis, and verification. There are
several projects around enhancing HAgent.&lt;/p>
&lt;h3 id="bugfarm-hagent-step">BugFarm hagent step&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: Develop a HAgent step (pass) to create bugs in a given design.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: Using LLMs (HAgent APIs), the goal is to add &amp;ldquo;bugs&amp;rdquo; to an input Verilog design.
The goal is for other tool passes that need to fix bugs to use this
infrastructure as a bug generator. MCY
(&lt;a href="https://github.com/YosysHQ/mcy" target="_blank" rel="noopener">https://github.com/YosysHQ/mcy&lt;/a>) does something similar, but it does not
edit the Verilog directly and creates a very different Verilog output. BugFarm is supposed
to have somewhat similar functionality but edit the Verilog directly, which
results in code with just a few edits. Like MCY, there has to be a step to confirm that
the change affects results. The project should benchmark against and compare with MCY.&lt;/p>
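&lt;p>A minimal, rule-based stand-in for the BugFarm idea is sketched below: it makes a single small edit to Verilog source, leaving the rest untouched. In the real step an LLM would pick the edit, and a confirmation step would re-run verification to check that the change affects results; the function and mutation rule here are hypothetical:&lt;/p>

```python
import random

def inject_bug(verilog_src, seed=0):
    """Make one small edit to Verilog source (flip a '+' to a '-'),
    returning the mutated source and the edited line index.

    A rule-based stand-in for an LLM-driven BugFarm pass."""
    rng = random.Random(seed)
    lines = verilog_src.splitlines()
    candidates = [i for i, ln in enumerate(lines) if "+" in ln]
    if not candidates:
        return verilog_src, None  # nothing to mutate
    i = rng.choice(candidates)
    lines[i] = lines[i].replace("+", "-", 1)
    return "\n".join(lines), i

src = "module add(input [7:0] a, b, output [7:0] y);\n  assign y = a + b;\nendmodule"
mutated, line_no = inject_bug(src)
print(mutated)
```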
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Python, Verilog, and an understanding of agents&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/farzaneh-rabiei-kashanaki/">Farzaneh Rabiei Kashanaki&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="hdeval-competition-repository">HDEval Competition Repository&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: Create a platform for HDL programming challenges and community engagement.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: Develop a repository where users can solve HDL problems in Verilog, Chisel, PyRTL, etc. Implement a points system for successful solutions. Allow users to submit new problems (code, specifications, verification, and tests) that are not easily solvable by LLMs. Automate solution testing and provide feedback on submissions.&lt;/p>
&lt;p>The submissions consist of four components: code, specification, verification, and tests. It should also be possible to submit examples of bugs in the code/specification/verification/tests encountered during the design.&lt;/p>
&lt;p>If the code is written in an HDL other than Verilog, the submission should include both the HDL source (Chisel, PyRTL, &amp;hellip;) and the generated Verilog.&lt;/p>
&lt;p>The specification is free form. For any given specification, an expert in the area should be able to generate the code, verification, and tests. Similarly, from any pair of components, an expert should be able to generate the rest. For example, from the verification and tests, it should be possible to generate the code and specification.&lt;/p>
&lt;p>Typical specifications consist of a plan, an API, and sample usage.&lt;/p>
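&lt;p>One possible shape for a submission record, with the four required components, is sketched below; the class and field names are hypothetical, not an existing HDEval schema:&lt;/p>

```python
from dataclasses import dataclass, field

@dataclass
class Submission:
    """Hypothetical HDEval submission: four required components plus
    optional extra HDL sources and recorded bug examples."""
    code: str            # Verilog (or Verilog generated from another HDL)
    specification: str   # free-form plan, API, and sample usage
    verification: str
    tests: str
    extra_hdl: dict = field(default_factory=dict)  # e.g. {"chisel": "..."}
    bugs: list = field(default_factory=list)       # bug examples from the design

    def is_complete(self):
        # All four required components must be non-empty.
        return all((self.code, self.specification, self.verification, self.tests))

s = Submission(code="module t; endmodule", specification="plan/API/usage",
               verification="cover properties", tests="testbench")
print(s.is_complete())  # True
```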
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Web design, some hardware understanding&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/farzaneh-rabiei-kashanaki/">Farzaneh Rabiei Kashanaki&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="integrate-silicon-compiler">Integrate Silicon Compiler&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: &lt;a href="https://github.com/siliconcompiler/siliconcompiler" target="_blank" rel="noopener">Silicon Compiler&lt;/a> is an open-source Python library that allows to interface with many EDA tools. The idea is to integrate it with HAgent to allow prompts/queries to
interface with it.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: The agentic component must check with Silicon Compiler
that the generated Python compiles, and also that it has reasonable parameters.
This will require a ReAct loop for compiler errors, and likely a judge loop
that tests for reasonable options/flow with feedback from execution. Since there
are not many training examples, it will require few-shot prompting with a database to
populate the context accordingly.&lt;/p>
&lt;p>The end result should allow selecting different tools and options through Silicon Compiler.&lt;/p>
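&lt;p>The react and judge loops described above can be sketched as follows. All three hooks are stubs standing in for calls to the LLM and to Silicon Compiler, and the generated string and checks are illustrative only:&lt;/p>

```python
def generate(prompt, feedback):
    """Stub for the LLM: the first attempt has a syntax error, and after
    feedback it produces valid code. Names are illustrative."""
    if not feedback:
        return "chip.load_target('asap7_demo'"  # missing close paren
    return "chip.load_target('asap7_demo')"

def compiles(src):
    # Stand-in for actually running Silicon Compiler on the generated code.
    return src.count("(") == src.count(")")

def judge(src):
    # Stand-in for checking that the chosen options/flow are reasonable.
    return "asap7" in src

def react_loop(prompt, max_iters=3):
    feedback = ""
    for _ in range(max_iters):
        src = generate(prompt, feedback)
        if not compiles(src):
            feedback = "fix compile error"   # feed the error back and retry
            continue
        if judge(src):
            return src                       # accepted by the judge loop
        feedback = "choose a more reasonable flow"
    return None

print(react_loop("configure an ASAP7 synthesis flow"))
```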
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Backend chip design&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> High&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="comodore-64-or-msx-or-gameboy">Comodore 64 or MSX or Gameboy&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: Create a prompt-only specification to build a hardware
accelerated for the target platform (Comodore 64, MSX or Gameboy). The
generated code should focus on Verilog, but it is fine to also target some
other HDL. In all the cases, the project should include a generated Verilog
integrated with some emulator for verification.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: Using &lt;a href="https://github.com/masc-ucsc/hagent" target="_blank" rel="noopener">Hagent&lt;/a>, create an
&lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDLEval&lt;/a> benchmark (a set of prompts) that
provides the necessary information to create the Verilog implementation. HDLEval
prompts usually consist of a high-level PLAN or specification, an API to
implement, and a few examples of usage for the given API.&lt;/p>
&lt;p>As the result of running the benchmark, the generated Verilog runs a program in both the
emulator and the Verilog simulation to compare correctness. The platform should have an
already existing emulator, such as &lt;a href="https://vice-emu.sourceforge.io/" target="_blank" rel="noopener">vice-emu&lt;/a> or
&lt;a href="https://mgba.io/" target="_blank" rel="noopener">mGBA&lt;/a>, to perform cosimulation against the generated
specification.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Verilog for front-end design&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> High&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Scenic: A Language for Design and Verification of Autonomous Cyber-Physical Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/scenic/</link><pubDate>Tue, 11 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/scenic/</guid><description>&lt;p>&lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a> is a probabilistic programming language for the design and verification of autonomous cyber-physical systems like self-driving cars.
Scenic allows users to define &lt;em>scenarios&lt;/em> for testing or training their system by putting a probability distribution on the system&amp;rsquo;s environment: the positions, orientations, and other properties of objects and agents, as well as their behaviors over time.
Sampling these scenarios and running them in a simulator yields synthetic data which can be used to train or test a system.
Since Scenic was released open-source in 2019, our group and many others in academia have used Scenic to find, diagnose, and fix bugs in autonomous cars, aircraft, robots, and other kinds of systems.
In industry, it is being used by companies including Boeing, Meta, Deutsche Bahn, and Toyota in domains spanning autonomous driving, aviation, household robotics, railways, maritime, and virtual reality.&lt;/p>
&lt;p>Our long-term goal is for Scenic to become a widely-used common representation and toolkit supporting the entire design lifecycle of AI-based cyber-physical systems.
Towards this end, we have many summer projects available, ranging from adding new application domains to working on the Scenic compiler and sampler:&lt;/p>
&lt;ol>
&lt;li>3D Driving Scenarios&lt;/li>
&lt;li>A Library for Aviation Scenarios&lt;/li>
&lt;li>Interfacing Scenic to new simulators&lt;/li>
&lt;li>Optimizing and parallelizing Scenic&lt;/li>
&lt;li>Improvements and infrastructure for the VerifAI toolkit&lt;/li>
&lt;/ol>
&lt;p>See the sections below for details.&lt;/p>
&lt;h3 id="3d-driving-scenarios">3D Driving Scenarios&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Autonomous Driving&lt;/code> &lt;code>3D modeling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python; basic vector geometry&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scenic scenarios written to test autonomous vehicles use the &lt;a href="https://docs.scenic-lang.org/en/latest/modules/scenic.domains.driving.html" target="_blank" rel="noopener">driving domain&lt;/a>, a Scenic library defining driving-specific concepts including cars, pedestrians, roads, lanes, and intersections.
The library extracts information about road networks, such as the shapes of lanes, from files in the standard &lt;a href="https://www.asam.net/standards/detail/opendrive/" target="_blank" rel="noopener">OpenDRIVE&lt;/a> format.
Currently, we only generate 2D polygons for lanes, throwing away 3D information.
While this suffices for many driving scenarios, it means we cannot properly model overpasses (the roads appear to overlap) or test driving scenarios where 3D geometry is important, such as hilly terrain.&lt;/p>
&lt;p>The goals of this project are to extend our road network library to generate 3D meshes (instead of 2D polygons) for roads, write new Scenic scenarios which use this new capability, and (if time allows) test autonomous driving software using them.&lt;/p>
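&lt;p>The core geometric step, lifting a 2D lane polygon into a 3D mesh, can be sketched as follows. This toy version handles a single quadrilateral road patch with supplied elevations; a real implementation would triangulate arbitrary lane polygons and read elevation profiles from the OpenDRIVE data:&lt;/p>

```python
def extrude_quad(quad_2d, elevations):
    """quad_2d: four (x, y) corners in order; elevations: four z values.
    Returns the patch as two triangles of (x, y, z) vertices, the simplest
    triangulation of a quadrilateral road patch."""
    v = [(x, y, z) for (x, y), z in zip(quad_2d, elevations)]
    return [(v[0], v[1], v[2]), (v[0], v[2], v[3])]

# A 10 m x 3.5 m lane patch on a gentle uphill grade (illustrative numbers).
lane_patch = [(0, 0), (10, 0), (10, 3.5), (0, 3.5)]
tris = extrude_quad(lane_patch, [0.0, 0.5, 0.5, 0.0])
print(len(tris), "triangles")  # 2 triangles
```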
&lt;h3 id="a-library-for-aviation-scenarios">A Library for Aviation Scenarios&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Autonomous Aircraft&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python; ideally some aviation experience&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>We have used Scenic to find, diagnose, and fix bugs in software for autonomous aircraft: in particular, &lt;a href="https://arxiv.org/abs/2005.07173" target="_blank" rel="noopener">this paper&lt;/a> studied a neural network-based automated taxiing system using the &lt;a href="https://www.x-plane.com/" target="_blank" rel="noopener">X-Plane&lt;/a> flight simulator.
We also have prototype interfaces to &lt;a href="https://microsoft.github.io/AirSim/" target="_blank" rel="noopener">AirSim&lt;/a> and &lt;a href="https://www.flightsimulator.com/" target="_blank" rel="noopener">Microsoft Flight Simulator&lt;/a>.
However, our experiments so far have mainly focused on simple scenarios involving a single aircraft.&lt;/p>
&lt;p>The goal of this project is to develop an &lt;em>aviation library&lt;/em> for Scenic (like the driving domain mentioned in the previous project) which will allow users to create complex aviation scenarios in a simulator-agnostic way.
The library would define concepts for aircraft, flight paths, weather, etc. and allow importing real-world data about these.
The student would demonstrate the library&amp;rsquo;s functionality by writing some example scenarios and testing either simple aircraft controllers or (if time allows) ML-based flight software.&lt;/p>
&lt;h3 id="interfacing-scenic-to-new-simulators">Interfacing Scenic to New Simulators&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Simulation&lt;/code> &lt;code>Autonomous Driving&lt;/code> &lt;code>Robotics&lt;/code> &lt;code>LLMs&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scenic is designed to be &lt;a href="https://docs.scenic-lang.org/en/latest/new_simulator.html" target="_blank" rel="noopener">easily-interfaced to new simulators&lt;/a>.
Depending on student interest, we could pick a simulator which would open up new kinds of applications for Scenic and write an interface for it.
Some possibilities include:&lt;/p>
&lt;ul>
&lt;li>The &lt;a href="https://github.com/tier4/AWSIM" target="_blank" rel="noopener">AWSIM&lt;/a> driving simulator (to allow testing the &lt;a href="https://autoware.org/" target="_blank" rel="noopener">Autoware&lt;/a> open-source autonomous driving software stack)&lt;/li>
&lt;li>The &lt;a href="https://www.coppeliarobotics.com/" target="_blank" rel="noopener">CoppeliaSim&lt;/a> robotics simulator&lt;/li>
&lt;li>NVIDIA&amp;rsquo;s &lt;a href="https://github.com/NVIDIA/Cosmos" target="_blank" rel="noopener">Cosmos&lt;/a>, an LLM which generates videos from text prompts&lt;/li>
&lt;li>NVIDIA&amp;rsquo;s &lt;a href="https://www.nvidia.com/en-us/omniverse/" target="_blank" rel="noopener">Omniverse&lt;/a> (various applications, e.g. simulating virtual factories)&lt;/li>
&lt;li>Various simulators for which we have prototype interfaces that could be generalized and made more usable, including &lt;a href="https://mujoco.org/" target="_blank" rel="noopener">MuJoCo&lt;/a> and &lt;a href="https://developer.nvidia.com/isaac/sim" target="_blank" rel="noopener">Isaac Sim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the project would be to create an interface between Scenic and the new simulator and write scenarios demonstrating it.
If time allows, we could do a case study on a realistic system for publication at an academic conference.&lt;/p>
&lt;h3 id="optimizing-and-parallelizing-scenic">Optimizing and Parallelizing Scenic&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Optimization&lt;/code> &lt;code>Parallelization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Large-scale testing with Scenic, when one wants to generate thousands of simulations, can be very computationally expensive.
In some cases, the bottleneck is the simulator, and being able to easily run multiple simulations in parallel would greatly increase scalability.
In others, Scenic itself spends substantial time trying to sample scenarios satisfying all the given constraints.&lt;/p>
&lt;p>This project would explore a variety of approaches to speeding up scene and simulation generation in Scenic.
Some possibilities include:&lt;/p>
&lt;ul>
&lt;li>Parallelizing scene generation and simulation (e.g. using &lt;a href="https://github.com/ray-project/ray" target="_blank" rel="noopener">Ray&lt;/a>)&lt;/li>
&lt;li>Systematically profiling real-world Scenic programs to characterize the main bottlenecks and propose optimizations&lt;/li>
&lt;li>JIT compiling Scenic&amp;rsquo;s internal sampling code (e.g. using &lt;a href="https://numba.pydata.org/" target="_blank" rel="noopener">Numba&lt;/a>)&lt;/li>
&lt;/ul>
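&lt;p>The parallel-sampling idea can be sketched with Python&amp;rsquo;s standard library as a stand-in for Ray: candidate scenes are rejection-sampled concurrently until enough satisfy the constraints. The toy constraint and sampler below are hypothetical, not Scenic&amp;rsquo;s actual sampler:&lt;/p>

```python
import concurrent.futures
import random

def try_sample(seed):
    """Draw one candidate scene; return it only if it satisfies a toy
    constraint (stand-in for Scenic's constraint checking)."""
    rng = random.Random(seed)
    x, y = rng.uniform(0, 100), rng.uniform(0, 100)
    return (x, y) if x + y > 100 else None

def sample_scenes(n, max_tries=200):
    """Rejection-sample n accepted scenes, checking candidates in parallel."""
    scenes = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        for result in pool.map(try_sample, range(max_tries)):
            if result is not None:
                scenes.append(result)
                if len(scenes) == n:
                    break
    return scenes

print(len(sample_scenes(5)), "scenes accepted")
```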
&lt;h3 id="improvements-and-infrastructure-for-the-verifai-toolkit">Improvements and Infrastructure for the VerifAI Toolkit&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>DevOps&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>APIs&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://github.com/BerkeleyLearnVerify/VerifAI" target="_blank" rel="noopener">VerifAI&lt;/a> is a toolkit for design and analysis of AI-based systems that builds on top of Scenic.
It adds among other features the ability to perform &lt;em>falsification&lt;/em>, intelligently searching for scenarios that will cause a system to behave in an undesirable way.&lt;/p>
&lt;p>The goal of this project is to improve VerifAI&amp;rsquo;s development infrastructure, documentation, and ease of use, which are currently relatively poor compared to Scenic.
Specific tasks could include:&lt;/p>
&lt;ul>
&lt;li>Setting up continuous integration (CI) on GitHub&lt;/li>
&lt;li>Creating processes to help users/developers submit issues and PRs and deal with them in a timely manner&lt;/li>
&lt;li>Writing more documentation, including tutorials and examples (not only for end users of VerifAI but those wanting to develop custom falsification components, for example)&lt;/li>
&lt;li>Refactoring VerifAI&amp;rsquo;s API to make it easier to use and extend&lt;/li>
&lt;/ul></description></item><item><title>Architecting the Future of Scientific Data: Multi-Site Streaming Without Compromise</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</link><pubDate>Mon, 10 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</guid><description>&lt;p>Data is generated at ever-increasing rates, yet it’s often processed more slowly than it’s collected. Scientific instruments frequently operate below their full capacity or discard
valuable data due to network bottlenecks, security domain mismatches, and insufficient real-time processing capabilities.&lt;/p>
&lt;p>&lt;a href="https://github.com/scistream/scistream-proto" target="_blank" rel="noopener">SciStream&lt;/a> reimagines how scientific data moves across modern research infrastructure by providing a framework for high-speed (+100Gbps)
memory-to-memory streaming that doesn’t compromise on security. Whether connecting scientific instruments to analysis clusters or bridging across institutional boundaries, SciStream provides the foundation for next-generation scientific
workflows.&lt;/p>
&lt;p>Building on our &lt;a href="https://dl.acm.org/doi/abs/10.1145/3502181.3531475" target="_blank" rel="noopener">published research&lt;/a>, we’re now expanding the framework’s capabilities through open-source development and community
collaboration. These projects offer an opportunity for
students to gain hands-on experience with cutting-edge networking and security technologies used in high-performance computing (HPC), cloud infrastructure, and large-scale scientific
experiments.&lt;/p>
&lt;h3 id="scistream-securebench-a-framework-for-benchmarking-security-protocols-in-scientific-data-streaming">SciStream-SecureBench: A Framework for Benchmarking Security Protocols in Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Security Protocols, Network Performance, Data Streaming, Reproducibility, High-throughput Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Scripting, Linux, Network Protocol Analysis, Containers, Benchmarking tools&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered why large scientific experiments need to move massive amounts of data securely and quickly? While TLS and SSH are standard for secure data transfer,
there’s a surprising lack of benchmarks that evaluate their performance in high-speed scientific workflows. This project aims to fill this gap by developing a benchmarking suite that
measures how different security configurations impact real-time scientific data streaming.&lt;/p>
&lt;h3 id="specific-tasks-of-the-project-include">&lt;strong>Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing benchmarking tools that measure key security performance metrics like handshake latency, throughput stability, and computational overhead.&lt;/li>
&lt;li>Running &lt;strong>real-world experiments&lt;/strong> on research testbeds (Chameleon, FABRIC) to simulate scientific data patterns.&lt;/li>
&lt;li>Automating comparative analysis between TLS and SSH, with focus on streaming-specific metrics like &lt;strong>time-to-first-byte and sustained throughput&lt;/strong>.&lt;/li>
&lt;li>Documenting best practices for security protocol selection in high-performance streaming.&lt;/li>
&lt;/ul>
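&lt;p>The measurement harness at the heart of such a suite can be sketched as follows: time repeated handshakes and report mean and tail latency. The handshake here is a stub; a real benchmark would perform TLS or SSH handshakes against a Chameleon or FABRIC endpoint:&lt;/p>

```python
import statistics
import time

def simulated_handshake():
    # Stand-in for the network and crypto round trips of a real handshake.
    time.sleep(0.001)

def benchmark(handshake, trials=20):
    """Time `trials` handshakes and report mean and p95 latency in ms."""
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        handshake()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95 = samples[int(0.95 * (len(samples) - 1))]  # nearest-rank percentile
    return {"mean_ms": statistics.mean(samples), "p95_ms": p95}

print(benchmark(simulated_handshake))
```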
&lt;h3 id="why-this-matters-for-your-career">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain expertise in &lt;strong>network security and performance analysis&lt;/strong>, highly valued in cybersecurity, cloud computing, and HPC.&lt;/li>
&lt;li>Work on a &lt;strong>real research challenge&lt;/strong> with potential for publication.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-streambench-comparative-analysis-of-scientific-streaming-frameworks">SciStream-StreamBench: Comparative Analysis of Scientific Streaming Frameworks&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Data Streaming Protocols, Network Performance, Benchmarking, Distributed Systems, Real-time Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, ZeroMQ, EPICS/PVAccess, Linux, Performance Analysis, Visualization&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scientific experiments generate enormous amounts of streaming data, but how do we choose the best framework for handling it efficiently? Despite the widespread use of ZeroMQ and
&lt;a href="https://dl.acm.org/doi/10.1145/3624062.3624610" target="_blank" rel="noopener">PVApy&lt;/a>,
there’s little systematic benchmarking comparing their performance. This project will develop &lt;strong>real-world benchmarks&lt;/strong> to evaluate how different frameworks handle scientific data in
&lt;strong>high-speed environments&lt;/strong>.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Designing benchmarking methodologies to assess key performance metrics like &lt;strong>synchronization overhead, time-to-first-data, and throughput stability&lt;/strong>.&lt;/li>
&lt;li>Developing a test harness that simulates real-world streaming conditions (network variability, concurrent streams, dynamic data rates).&lt;/li>
&lt;li>Running experiments on &lt;strong>Chameleon and FABRIC testbeds&lt;/strong>.&lt;/li>
&lt;li>Automating data collection and visualization to highlight performance trends.&lt;/li>
&lt;li>Documenting best practices and framework-specific optimizations.&lt;/li>
&lt;/ul>
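&lt;p>To make the metrics concrete, here is a minimal, transport-agnostic sketch of such a benchmark harness. The &lt;code>benchmark_stream&lt;/code> helper and the in-memory stand-in transport are illustrative inventions; in a real run the &lt;code>send&lt;/code>/&lt;code>recv&lt;/code> callables would wrap a ZeroMQ socket pair or a PVAccess channel.&lt;/p>

```python
import time
from collections import deque

def benchmark_stream(send, recv, n_msgs=1000, msg_size=4096):
    """Measure time-to-first-data and sustained throughput for any
    transport exposing blocking send/recv callables. The callables are
    a hypothetical interface: in practice they would wrap a ZeroMQ
    socket pair or a PVAccess channel."""
    payload = b"x" * msg_size
    t0 = time.perf_counter()
    send(payload)
    recv()
    ttfd = time.perf_counter() - t0            # time-to-first-data
    t1 = time.perf_counter()
    for _ in range(n_msgs):
        send(payload)
        recv()
    elapsed = time.perf_counter() - t1
    throughput = n_msgs * msg_size / elapsed   # bytes per second
    return ttfd, throughput

# In-memory stand-in transport, just to show the harness running.
queue = deque()
ttfd, tput = benchmark_stream(queue.append, queue.popleft)
```

&lt;p>Running the same harness against each framework, under injected network variability and concurrent streams, yields directly comparable numbers for the metrics listed above.&lt;/p>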
&lt;h3 id="why-this-matters-for-your-career-1">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Get hands-on experience with &lt;strong>real-time data processing&lt;/strong> and &lt;strong>network performance analysis&lt;/strong>.&lt;/li>
&lt;li>Learn benchmarking techniques useful for &lt;strong>distributed systems, cloud computing, and high-performance networking&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-quic-next-generation-proxy-architecture-for-scientific-data-streaming">SciStream-QUIC: Next-Generation Proxy Architecture for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: QUIC Protocol, Network Proxies, Performance Analysis, Protocol Design, Hardware Acceleration&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python/C++, Network Programming, QUIC (quiche/aioquic), Linux, Performance Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered how YouTube loads videos faster than traditional web pages? That’s because of &lt;strong>QUIC&lt;/strong>, a next-generation protocol designed for speed and security. Initial evaluations
of federated streaming architectures (&lt;a href="https://par.nsf.gov/servlets/purl/10380551" target="_blank" rel="noopener">INDIS'22
paper&lt;/a>) suggest potential benefits of QUIC, but comprehensive benchmarking is
needed. This project explores whether &lt;strong>QUIC-based proxies&lt;/strong> can outperform traditional &lt;strong>TCP+TLS&lt;/strong> proxies for scientific data streaming, potentially revolutionizing how researchers move
large datasets.&lt;/p>
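&lt;p>As a point of reference for what a QUIC-based proxy would replace, the TCP baseline can be pictured as a plain byte relay. The sketch below is an illustration, not SciStream code: it forwards bytes between a client and an upstream endpoint with asyncio streams, and a QUIC variant would swap these streams for aioquic or quiche connections (addresses and ports are placeholders).&lt;/p>

```python
import asyncio

async def relay(src, dst):
    """Copy bytes from one stream to the other until EOF, then half-close."""
    while True:
        data = await src.read(65536)
        if not data:
            break
        dst.write(data)
        await dst.drain()
    if dst.can_write_eof():
        dst.write_eof()

async def handle_client(reader, writer, upstream_host="127.0.0.1", upstream_port=9000):
    # Open a second connection to the upstream data source and pump
    # bytes in both directions until both sides finish.
    up_reader, up_writer = await asyncio.open_connection(upstream_host, upstream_port)
    try:
        await asyncio.gather(relay(reader, up_writer), relay(up_reader, writer))
    finally:
        up_writer.close()
        writer.close()

async def main():
    # Proxy listens on 8000 and forwards to the producer on 9000.
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8000)
    async with server:
        await server.serve_forever()
```

&lt;p>Benchmarking this baseline against a QUIC proxy isolates the protocol cost: handshake round trips, encryption overhead, and head-of-line blocking behavior.&lt;/p>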
&lt;h3 id="the-specific-tasks-of-the-project-include-1">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing a &lt;strong>QUIC-based proxy&lt;/strong> optimized for scientific workflows.&lt;/li>
&lt;li>Running benchmarks to compare &lt;strong>QUIC vs. traditional TLS proxies&lt;/strong>.&lt;/li>
&lt;li>Investigating &lt;strong>hardware encryption offloading&lt;/strong> for QUIC and TLS.&lt;/li>
&lt;li>Designing &lt;strong>reproducible experiments&lt;/strong> using Chameleon and FABRIC testbeds.&lt;/li>
&lt;li>Documenting best practices for deploying &lt;strong>QUIC proxies in HPC environments&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="why-this-matters-for-your-career-2">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain experience in &lt;strong>cutting-edge networking protocols&lt;/strong> used in cloud computing (Google, Cloudflare, etc.).&lt;/li>
&lt;li>Learn about &lt;strong>hardware acceleration&lt;/strong> and its role in high-speed networking.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-auth-modern-authentication-and-user-interface-for-scientific-data-streaming">SciStream-Auth: Modern Authentication and User Interface for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Authentication Systems, UI/UX Design, Security Integration, Scientific Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Web Development (React/Vue), OAuth 2.0/SAML, Security Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Not a security expert? You can still contribute by designing an interactive front-end!&lt;/p>
&lt;p>In today&amp;rsquo;s scientific computing landscape, authentication and user experience often act as barriers to adoption rather than enabling seamless collaboration. While SciStream excels at
high-speed data transfer, its reliance on a single authentication provider and command-line interface limits its accessibility. This project aims to transform SciStream into a more
versatile platform by implementing a modular authentication system and developing an intuitive graphical interface.&lt;/p>
&lt;p>By expanding beyond Globus Auth to support multiple authentication frameworks, we can enable broader adoption across different scientific communities while maintaining robust security.
Coupled with a modern GUI that visualizes real-time streaming activity, this enhancement will make SciStream more accessible to researchers—allowing them to focus on their science rather
than wrestling with complex configurations.&lt;/p>
&lt;p>This project will design a user-friendly interface that makes secure scientific data streaming as intuitive as using a cloud storage service. You&amp;rsquo;ll also gain hands-on experience with
authentication methods used by industry leaders like Google and Facebook, while directly improving access to scientific data.&lt;/p>
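&lt;p>One possible shape for the pluggable authentication system is a small provider registry. The interface and provider below are hypothetical sketches, not the SciStream API; a real deployment would register OAuth 2.0, SAML, OpenID Connect, or certificate-based providers behind the same interface.&lt;/p>

```python
from abc import ABC, abstractmethod

class AuthProvider(ABC):
    """Plug-in interface sketch. Concrete providers would wrap OAuth 2.0,
    SAML, OpenID Connect, or certificate-based flows behind this API."""
    name = "base"

    @abstractmethod
    def authenticate(self, credentials):
        """Return session info on success, raise PermissionError on failure."""

PROVIDERS = {}

def register(provider_cls):
    """Class decorator that makes a provider selectable by name."""
    PROVIDERS[provider_cls.name] = provider_cls()
    return provider_cls

@register
class StaticTokenProvider(AuthProvider):
    # Purely illustrative stand-in, not a real SciStream provider.
    name = "static-token"

    def authenticate(self, credentials):
        if credentials.get("token") == "secret":
            return {"session": "ok", "provider": self.name}
        raise PermissionError("invalid token")

def login(provider_name, credentials):
    """Facility code picks a provider by name; adding a provider needs
    only a register() call, no changes here."""
    return PROVIDERS[provider_name].authenticate(credentials)
```

&lt;p>The design choice is that the GUI and session layer depend only on the provider interface, so each facility can bring its preferred mechanism without touching core code.&lt;/p>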
&lt;h3 id="the-specific-tasks-of-the-project-include-2">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Design and implementation of a pluggable authentication system supporting multiple providers (OAuth 2.0, SAML, OpenID Connect, certificate-based auth)&lt;/li>
&lt;li>Development of a modern, responsive GUI using web technologies that provides real-time visualization of system status&lt;/li>
&lt;li>Creation of comprehensive security testing protocols to validate the authentication implementations&lt;/li>
&lt;li>Implementation of session management and secure credential handling within the GUI&lt;/li>
&lt;li>Design of an intuitive interface for managing streaming configurations and monitoring data flows&lt;/li>
&lt;li>Creation of documentation and examples to help facilities integrate their preferred authentication mechanisms&lt;/li>
&lt;/ul></description></item><item><title>Kolmogorov-Arnold-based Transformer for LLMs: Implementation, Evaluation and Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/unl/kallm/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/unl/kallm/</guid><description>&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;!--- KALLM introduction --->
&lt;!--- importance --->
&lt;p>&lt;a href="https://github.com/saisumanv/KALLM" target="_blank" rel="noopener">KALLM&lt;/a> project proposes a new Kolmogorov-Arnold Network (KAN)-based Transformer implementation of an open-source LLM called SmolLM2. Transformers have found increasing success in many open-source LLMs across language reasoning tasks like text generation, summarization and some tasks imitating advanced critical thinking. However, Kolmogorov-Arnold Networks (KANs) are an attractive alternative to the Multi-Layer Perceptrons (MLPs) which are used in these Transformer architectures by default. KAN-based Transformers (KATs) offer several advantages over MLPs. (i) They follow the universal approximation property and hence can theoretically approximate any function i.e. they can learn from any complex input patterns. (ii) They are more interpretable as they decompose the entire input into multiple manageable components with each layer processing one component. This is unlike MLPs, where each layer processes the input sequences holistically. (iii) KANs can lead to faster convergence on certain reasoning tasks due to their ability to break down the input sequences into simple univariate functions.&lt;/p>
&lt;p>However, there currently exist few, if any, open-source implementations of KAN-based Transformers in open-source LLMs. Until recently, an efficient implementation of KAN was not available, and the same can be said for KAN-based Transformer implementations. With the recent efficient implementations of open-source KAN-based Transformers (KATs), integrating them into open-source LLM engines becomes a possibility. This project will implement KAT as the core architecture of the Transformer in SmolLM2, an open-source LLM, and perform evaluation and benchmarking against language reasoning tasks.&lt;/p>
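&lt;p>The core architectural difference can be sketched in a few lines. The toy layer below replaces an MLP's shared nonlinearity with a learnable univariate function per edge, using a small fixed basis in place of the B-splines real KAN implementations use; shapes and names are illustrative only.&lt;/p>

```python
import numpy as np

def kan_layer(x, coeffs):
    """Toy KAN-style layer: every edge (i -> j) applies its own learnable
    univariate function, here a 3-term fixed basis (identity, tanh, square)
    with learnable coefficients instead of the B-splines used in real KANs.

    x:      (batch, d_in) activations
    coeffs: (d_in, d_out, 3) per-edge basis coefficients
    """
    basis = np.stack([x, np.tanh(x), x ** 2], axis=-1)   # (batch, d_in, 3)
    # output_j = sum_i phi_ij(x_i), with phi_ij given by coeffs[i, j]
    return np.einsum("bik,iok->bo", basis, coeffs)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # batch of 4, width 8
w = rng.normal(size=(8, 16, 3)) * 0.1
y = kan_layer(x, w)                          # shape (4, 16)
```

&lt;p>In the project, a layer of this form (with proper spline bases and training) would stand in for the MLP block inside each Transformer layer of smollm.&lt;/p>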
&lt;p align="center">
&lt;img width="760" height="500" src="https://github.com/user-attachments/assets/aa4e2fa1-1a33-4511-a76c-2e95e24feee1">
&lt;/p>
&lt;h3 id="project-methodology-and-milestones">Project Methodology and Milestones&lt;/h3>
&lt;p>The project methodology is a mix of implementation and evaluation. The mentors are well-experienced in working with large codebases and will be available to guide through the technical and non-technical portions of the project. The step-by-step project methodology is outlined as follows.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Installation of SmolLM2 from the official Git Repo:&lt;/strong> The open-source implementation of the SmolLM2 engine (hereafter referred to simply as smollm) is available on &lt;a href="https://github.com/huggingface/smollm/tree/main/tools/smol_tools" target="_blank" rel="noopener">GitHub&lt;/a>. The project primarily focuses on language reasoning, and hence we limit ourselves to the SmolLM2 implementation and forgo other forks of SmolLM such as the SmolVLM family.&lt;/p>
&lt;ul>
&lt;li>The project needs to be sanity checked by installing the engine on local computers.&lt;/li>
&lt;li>Following that, the students are to familiarize themselves with the basic workflow, such as running a sample code using the pretrained model. The instructions for installing SmolLM are located in the smollm/tools/smol-tools subfolder.&lt;/li>
&lt;li>The next step is to train SmolLM using the prepackaged Transformer model called &amp;ldquo;HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC&amp;rdquo;. The instructions are provided &lt;a href="https://github.com/huggingface/smollm/tree/main/tools/smol_tools" target="_blank" rel="noopener">here&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementation—KAT in SmolLM:&lt;/strong> The smollm pretrained model is at smollm/tools/smollm_local_inference/mlc.py and is called &amp;ldquo;HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC&amp;rdquo;; it is an MLP-based Transformer. We will train smollm ourselves, both with this default model and with KAT.&lt;/p>
&lt;ul>
&lt;li>A KAT implementation is available on GitHub at &lt;a href="https://github.com/Adamdad/kat" target="_blank" rel="noopener">ICLR2025&lt;/a>. To implement KAT in smollm, we will replace the default Transformer (HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC) with the open-source KAT mentioned above.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Training SmolLM with default Transformer and with KAT:&lt;/strong> This step will require compute resources and requires deployment of the implementation on Chameleon Cloud and/or National Research Platform (NRP). The mentors have access to these two testbeds and will provide the students access to those resources.&lt;/p>
&lt;ul>
&lt;li>The first task of this step is to port the implementation to Chameleon Cloud before the model can be trained. This task may require around a week’s turnaround time and can be performed in parallel with steps 1 &amp;amp; 2 if needed.&lt;/li>
&lt;li>&lt;strong>Training:&lt;/strong> The full dataset for training smollm is called smoltalk, located at &lt;a href="https://huggingface.co/datasets/HuggingFaceTB/smoltalk" target="_blank" rel="noopener">HuggingFaceTB/smoltalk&lt;/a>. The training code and instructions are at &lt;a href="https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2" target="_blank" rel="noopener">huggingface/alignment-handbook&lt;/a>. Although the baseline uses SmolLM2-1.7B-Instruct (pretrained model), we will instead train smollm for SmolLM2-135M-Instruct and SmolLM2-360M-Instruct, as noted at the bottom of the &lt;a href="https://huggingface.co/datasets/HuggingFaceTB/smoltalk" target="_blank" rel="noopener">HuggingFaceTB/smoltalk&lt;/a> dataset page. Accordingly, for SmolLM2-135M-Instruct and SmolLM2-360M-Instruct we will use ONLY the &lt;a href="https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk" target="_blank" rel="noopener">smol-smoltalk dataset&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Benchmarking:&lt;/strong> Finally, the benchmarks used throughout this project to evaluate our implementations will be the same as those used for the release (pretrained) versions of SmolLM2-135M-Instruct and SmolLM2-360M-Instruct. The benchmarks for language reasoning will be chosen from the &lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/" target="_blank" rel="noopener">Open LLM Leaderboard&lt;/a>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="project-timeline">Project Timeline&lt;/h3>
&lt;p>The following project timeline is anticipated. Individual tasks may take longer or shorter than planned, so the timeline is not set in stone; rather, it serves as a baseline drawn from the mentors’ prior experience with similar research projects. Each cell in the timeline chart is one week.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://github.com/user-attachments/assets/5a91f8fe-fb4c-4844-bff2-75c933e6f73a" alt="image" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="project-testbeds">Project Testbeds&lt;/h3>
&lt;p>Sai Lamba Karanam has administrator-level access to the &lt;a href="https://nationalresearchplatform.org/" target="_blank" rel="noopener">National Research Platform (NRP)&lt;/a> and will provide the students working on the project with access to cloud compute resources. Both mentors also have access to the &lt;a href="https://www.chameleoncloud.org/" target="_blank" rel="noopener">Chameleon Cloud platform&lt;/a> and will grant access to compute resources for training, evaluation, and benchmarking purposes.&lt;/p>
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>KALLM will be hosted on GitHub at this &lt;a href="https://github.com/saisumanv/KALLM" target="_blank" rel="noopener">repo&lt;/a>. The mentors have extensive experience working with Machine Learning (ML) and Artificial Intelligence (AI) workflows in academic and industry settings. We seek mentees who are willing to learn how to implement AI models and to work with semi-large code bases. Mentees will need to become comfortable working with remote cloud testbeds (Chameleon and/or NRP) during the latter half of the project. Some milestones described in the Project Methodology can be done in parallel.&lt;/p>
&lt;p>KALLM is part of a larger collaborative effort between the mentors and involves milestones and outcomes that fall outside the scope of this project but are related. The mentors plan to publish the larger project outcomes at ML-focused venue(s) toward the end of Fall 2025. The mentees will be added as coauthors.&lt;/p>
&lt;h3 id="kallm">KALLM&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Kolmogorov-Arnold-based Transformer for LLMs: Implementation, Evaluation and Benchmarking&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python proficiency, scikit-learn, Experience with Linux, Introductory Experience with Cloud Computing Platforms&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Easy-Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sai-suman-lamba-karanam/">Sai Suman Lamba Karanam&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahmeeth-sakkaff/">Zahmeeth Sakkaff&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Smart Batching for Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</guid><description>&lt;p>Sequence tokenization is a crucial step during Large Language Model training, fine-tuning, and inference. User prompts and training data are tokenized and zero-padded before being fed to the model in batches. This process allows models to interpret human language by breaking down complex sentences into simple token units that are numerically represented in a token set. However, the process of sequence padding for maintaining batch dimensions can introduce unnecessary overhead if batching is not properly done.&lt;/p>
&lt;p>In this project, we introduce Smart Batching, where we dynamically batch sequences in a fine-tuning dataset by their respective lengths. With this method, we aim to minimize the amount of zero padding required during sequence batching, which can improve fine-tuning and inference speeds. We also compare this method against other commonly used batching practices (Longest Sequence, Random Shuffling) on key metrics such as runtime and model accuracy.&lt;/p>
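&lt;p>The core idea can be sketched in a few lines: sort sequences by token length, slice consecutive groups into batches, and count how much padding each scheme requires. Function names here are illustrative; the project itself would integrate this logic into a HuggingFace data collator.&lt;/p>

```python
def smart_batches(seqs, batch_size):
    """Group sequences of similar token length into batches by sorting
    on length first. Returns batches of indices into seqs."""
    order = sorted(range(len(seqs)), key=lambda i: len(seqs[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def padding_cost(seqs, batches):
    """Total pad tokens needed when each batch is padded to its longest member."""
    total = 0
    for batch in batches:
        longest = max(len(seqs[i]) for i in batch)
        total += sum(longest - len(seqs[i]) for i in batch)
    return total
```

&lt;p>Because each batch is padded only to its own longest member rather than to an outlier from elsewhere in the dataset, length-sorted batches waste far fewer pad tokens than in-order or shuffled grouping.&lt;/p>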
&lt;h3 id="project-title">Project Title&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>Fine-Tuning&lt;/code> &lt;code>AI&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Pytorch, Large Language Models&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Implement an open source smart batching framework based on HuggingFace to allow for dynamically grouping sequences of similar token lengths into batches&lt;/li>
&lt;li>Analyze runtime, padding, and model accuracy with smart batching and other commonly used batching practices&lt;/li>
&lt;li>Apply smart batching with distributed fine-tuning and observe large language model outputs&lt;/li>
&lt;/ul></description></item><item><title>Towards High Performance NCCL-enabled 2D partitioned PyLops-MPI library</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/pylops-mpi/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/pylops-mpi/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;!--- PyLops and PyLops-MPI introduction --->
&lt;!--- importance --->
&lt;p>&lt;a href="https://github.com/PyLops/pyLops" target="_blank" rel="noopener">PyLops&lt;/a> ecosystem designed to enable large-scale, distributed-memory computations for matrix-free inverse problems. PyLops has achieved more than 400 stars till now. &lt;a href="https://github.com/PyLops/pyLops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> is an extension of the PyLops. It can be widely used in scientific computation problems. Developed as part of the &lt;a href="https://summerofcode.withgoogle.com/archive/2023/projects/eNJTJO25" target="_blank" rel="noopener">2023 Google Summer of Code&lt;/a>, PyLops-MPI builds on the core PyLops framework by integrating MPI-based parallelism through the mpi4py library. This allows users to efficiently scale PyLops-based computations beyond a single node, leveraging high-performance computing (HPC) clusters for tackling increasingly large problem sizes.&lt;/p>
&lt;p>By extending PyLops&amp;rsquo; modular and user-friendly interface to distributed environments, PyLops-MPI provides researchers and engineers with a powerful tool for developing scalable scientific applications across disciplines, from geophysics to machine learning. As part of the broader PyLops ecosystem, it represents a significant step toward high-performance, parallel inverse problem solving, catering to both academia and industry.&lt;/p>
&lt;p>PyLops-MPI aims to provide an efficient and user-friendly distributed inverse problem solution. The software is designed to handle three distinct use cases in distributed inverse problems: (1) both model and data are fully distributed across nodes; (2) data are distributed across nodes but the model is available at all nodes; (3) both model and data are available at all nodes (or just at the master). There are multiple use cases for PyLops-MPI, e.g. &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/lsm.html#sphx-glr-tutorials-lsm-py" target="_blank" rel="noopener">Least-squares Migration&lt;/a>, &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/mdd.html#sphx-glr-tutorials-mdd-py" target="_blank" rel="noopener">Multi-Dimensional Deconvolution&lt;/a> and &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/poststack.html#sphx-glr-tutorials-poststack-py" target="_blank" rel="noopener">Post Stack Inversion - 3D&lt;/a>. We have already provided a solution based on mpi4py. As PyLops-MPI develops, we plan to upgrade its communication infrastructure to better support GPU-based clusters and reduce the overall data movement. This will further boost the performance of PyLops-MPI.&lt;/p>
&lt;p>This project is designed around the roadmap of the PyLops-MPI library, and we plan to provide three major functionalities. First, PyLops-MPI supports NVIDIA GPUs, where data is operated on directly on the GPU. We currently rely on the &lt;a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html#gpu-aware-mpi-python-gpu-arrays" target="_blank" rel="noopener">CUDA-aware MPI implementation&lt;/a> in mpi4py, which requires a CUDA-aware MPI software stack and thus imposes strict requirements on the system software. Alternatively, we can use NCCL instead of mpi4py when calling GPU routines; NCCL also has better support for NVLink and similar interconnects. Second, the parallelism of PyLops-MPI is achieved by splitting the data and models across different MPI processes, which might not scale for inverse problems with multiple right-hand sides. We have a &lt;a href="https://github.com/PyLops/pylops-mpi/issues/113" target="_blank" rel="noopener">2D partitioned implementation&lt;/a>, but we do not yet know its performance precisely; thus, we propose to benchmark the scalability of the new 2D partitioned implementation. We are aware that other distributed matrix-matrix multiplication algorithms, such as the 2D SUMMA algorithm, can be incorporated into the PyLops-MPI library, and we would also like to implement 2D SUMMA if time permits. Finally, we would like to benchmark the MPI one-sided API in the PyLops-MPI library, since the asynchronous execution features of the one-sided API will improve the scalability of the algorithms.&lt;/p>
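&lt;p>For intuition about the 2D SUMMA algorithm mentioned above, here is a single-process simulation of one outer-product sweep over a p x p block grid. This is illustrative only; the PyLops-MPI implementation would hold one block per rank and broadcast the panels via mpi4py or NCCL.&lt;/p>

```python
import numpy as np

def summa(A, B, p):
    """Single-process simulation of 2D SUMMA on a p x p process grid:
    C_ij accumulates A_ik @ B_kj as each grid step k "broadcasts" the
    corresponding row and column panels. Assumes square n x n inputs
    with n divisible by p."""
    n = A.shape[0]
    blk = n // p
    C = np.zeros_like(A)
    for k in range(p):                      # broadcast step k
        for i in range(p):
            for j in range(p):
                Aik = A[i*blk:(i+1)*blk, k*blk:(k+1)*blk]
                Bkj = B[k*blk:(k+1)*blk, j*blk:(j+1)*blk]
                C[i*blk:(i+1)*blk, j*blk:(j+1)*blk] += Aik @ Bkj
    return C
```

&lt;p>In the distributed setting, each of the p*p ranks performs only the (i, j) update at every step, so communication is limited to broadcasting one A-panel per grid row and one B-panel per grid column.&lt;/p>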
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;!--- Our goal 2D parallisem, NCCL, one-sided MPI --->
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), the project aims to benchmark and extend the capability of the PyLops-MPI library in the distributed environment. Below is an outline of the algorithms to be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Goal 1: Enabling NCCL API in PyLops-MPI Library&lt;/strong>: Understanding the design of PyLops-MPI. Using NCCL API for collective communication in PyLops-MPI when data is on GPU. Benchmarking the performance of NCCL compared with mpi4py in different scenarios.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Goal 2: Benchmarking 2D Parallelism in PyLops-MPI Library&lt;/strong>: Understanding the design of PyLops-MPI. Understanding the design of 2D partition design in current PyLops-MPI library. Benchmarking the performance of 2D partition design in PyLops-MPI compared with original 1D partition design. If possible, implement the 2D SUMMA algorithm.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Goal 3: Enabling MPI One-sided API in PyLops-MPI Library&lt;/strong>: Understanding the design of PyLops-MPI. Understanding the message roofline model and the MPI one-sided API. Implementing one-sided communication strategies in the PyLops-MPI library.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="project-benchmark-suites">Project Benchmark Suites&lt;/h3>
&lt;p>We plan to use at least three use cases in the &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/" target="_blank" rel="noopener">tutorial sections&lt;/a> in PyLops-MPI. We will measure the communication volume and time to solution of these use-cases.&lt;/p>
&lt;h3 id="project-benchmark-testbeds">Project Benchmark Testbeds&lt;/h3>
&lt;p>Yuxi Hong has been granted access to world-leading supercomputers such as Delta and DeltaAI through an &lt;a href="https://www.xras.org/public/requests/193551-ACCESS-CIS250038" target="_blank" rel="noopener">ACCESS-funded grant&lt;/a>. &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/" target="_blank" rel="noopener">Delta&lt;/a> is a dedicated, ACCESS-allocated resource designed by HPE and NCSA, delivering a highly capable GPU-focused compute environment for GPU and CPU workloads. &lt;a href="https://docs.ncsa.illinois.edu/systems/deltaai/en/latest/" target="_blank" rel="noopener">DeltaAI&lt;/a> is a companion system to Delta. Powered by the NVIDIA GH200 Grace Hopper Superchip, DeltaAI provides powerful compute capabilities for simulation and data science. The team also has access to the &lt;a href="https://docs.nersc.gov/systems/perlmutter/architecture/" target="_blank" rel="noopener">Perlmutter&lt;/a> supercomputer at LBNL.&lt;/p>
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>PyLops-MPI is hosted on GitHub at this &lt;a href="https://github.com/PyLops/pyLops-mpi" target="_blank" rel="noopener">repo&lt;/a>. &lt;a href="https://joss.theoj.org/papers/10.21105/joss.07512" target="_blank" rel="noopener">The paper&lt;/a> mainly describes the design of PyLops-MPI. Our mentors are the main developers and designers of PyLops-MPI and are also experts in HPC and MPI libraries. We will select suitable mentees for the project and deliver the benchmark results and new functionalities of PyLops-MPI by the end of the project.
We plan to have two or three mentees, since each goal is a milestone for PyLops-MPI. The three goals can be achieved separately; they are orthogonal to each other and can be executed in parallel.&lt;/p>
&lt;h3 id="pylops-mpi">PyLops-MPI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Towards High Performance NCCL-enabled 2D partitioned PyLops-MPI library&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficient in Python, Experience with MPI, Experience with GPU&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuxi-hong/">Yuxi Hong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matteo-ravasi/">Matteo Ravasi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/nan-ding/">Nan Ding&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>GeFARe: Discovering Reproducible Failure Scenarios and Developing Failure-Aware Scheduling for Genomic Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uga/gefare/</link><pubDate>Sun, 09 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uga/gefare/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: genomic processing (e.g., DNA and RNA alignment), workflow scheduling, resource/cluster management, container orchestration&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, cloud computing (e.g., OpenStack), cluster manager (e.g., Kubernetes), systems automation (e.g., Bash/Python/Puppet), genomic workflows and applications (e.g., BWA, FastQC, Picard, GATK, STAR)&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea description&lt;/strong>&lt;/h3>
&lt;p>Large-scale genomic workflow executions require large-scale computing infrastructure, as well as high utilization of that infrastructure, to maximize throughput. Systems researchers have developed various techniques to achieve this goal, including scheduling, resource harvesting, tail mitigation, and failure recovery. However, many of these large-scale efforts have been carried out by separate groups/institutions that operate such large-scale infrastructure (e.g., major tech companies and national research labs). Reproducing and building upon these works at a similar scale in an academic environment is challenging – even labs with strong ties to these institutions often have to rely on trace-based research, which does not fully capture the complexities of real-world deployments.&lt;/p>
&lt;p>We observe two fundamental reasons for this difficulty: 1) a lack of computational infrastructure at a comparable scale and 2) a lack of representative workloads and software stacks. Although the academic community has sought to broaden access to large-scale infrastructure through testbeds like ChameleonCloud and CloudLab, the representative workloads and software stacks to reproduce aforementioned works remain limited.&lt;/p>
&lt;p>We aim to address this challenge by providing a robust, easy-to-use, and open-source environment for large-scale genomics workflow scheduling. Specifically, this environment will include:
a) a suite of tools to set up infrastructure on academic cloud testbeds,
b) a scheduling research platform for genomic workflows, and
c) software stacks to reproduce large-scale failure scenarios.&lt;/p>
&lt;p>We limit the scope of this project to only one or two major failure scenarios. For example, out-of-memory (OOM) failures occur when genomics applications run with insufficient available memory. However, we aim to make the software stack extendable for other scenarios whenever possible.&lt;/p>
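&lt;p>As a sketch of what failure-aware scheduling for the OOM scenario could look like, the toy policy below places a task on a node only if that node's free memory covers the task's predicted peak usage; all names and numbers are illustrative, not from any existing GeFARe code.&lt;/p>

```python
def failure_aware_schedule(tasks, nodes):
    """Greedy OOM-aware placement sketch.

    tasks: dict of task name to predicted peak memory (GB)
    nodes: dict of node name to free memory (GB)
    Returns (placements, failed): tasks that fit, and tasks that would
    OOM on every node and should be deferred or rescheduled."""
    placements, failed = {}, []
    free = dict(nodes)
    # Place the most memory-hungry tasks first to reduce fragmentation.
    for name, peak_mem in sorted(tasks.items(), key=lambda t: -t[1]):
        candidates = [n for n, f in free.items() if f >= peak_mem]
        if candidates:
            best = max(candidates, key=free.get)   # most free memory that fits
            free[best] -= peak_mem
            placements[name] = best
        else:
            failed.append(name)                    # would OOM everywhere
    return placements, failed
```

&lt;p>A Kubernetes-based version of this policy would express the same check through resource requests and limits, with the platform recording which placements reproduce the OOM failure scenario.&lt;/p>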
&lt;p>Throughout this project, students will learn to use cloud testbeds (e.g., ChameleonCloud) for workflow scheduling research. They will gain hands-on experience with open-source cluster management and container orchestration tools (e.g., Kubernetes) and will also learn about various aspects of high-performance computing when running genomic workflows.&lt;/p>
&lt;p>Finally, we will open-source all the code, software stacks, and datasets created during this project. Using these artifacts, we will also ensure the reproducibility of failure scenarios.&lt;/p>
&lt;h3 id="project-deliverable">&lt;strong>Project Deliverable&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Acquire a basic understanding of genomic data processing (with mentor guidance)&lt;/li>
&lt;li>Build tools to set up a multi-node cluster on ChameleonCloud&lt;/li>
&lt;li>Create automation code/tools to set up genomics workflows’ input and containerized applications&lt;/li>
&lt;li>Discover failure scenarios for genomics workflow execution (with mentor guidance)&lt;/li>
&lt;li>Develop a Kubernetes-based platform to implement scheduling policies (Students may use or build upon existing open-source works)&lt;/li>
&lt;li>Document the steps needed to reproduce the proposed failure scenarios&lt;/li>
&lt;/ul></description></item><item><title>StatWrap</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/</link><pubDate>Sun, 09 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/</guid><description>&lt;p>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap&lt;/a> is a free and open-source assistive, non-invasive discovery and inventory tool to document research projects. It inventories project assets (e.g., code files, data files, manuscripts, documentation) and organizes information without additional input from the user. It also provides structure for users to add searchable and filterable notes connected to files to help communicate metadata about intent and analysis steps.&lt;/p>
&lt;p>At its core, StatWrap helps investigators identify and track changes in a research project as it evolves - which may affect reproducibility. For example: (1) people on the project can change over time, so processes may not be consistently executed due to transitions in employment; (2) data changes over time, due to accruing additional cases, adding new variables, or correcting mistakes in existing data; (3) software (e.g., used for data preparation and statistical analysis) evolves as it is edited, improved, and optimized; and (4) software can break or produce different results due to changes &amp;lsquo;under the hood&amp;rsquo; such as updates to statistical packages, compilers, or interpreters. StatWrap passively and actively documents these changes to support reproducibility.&lt;/p>
&lt;p>Additional information:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap home&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/stattag/statwrap" target="_blank" rel="noopener">StatWrap code (GitHub)&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-search">Project Search&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>search&lt;/code>, &lt;code>user interface&lt;/code>, &lt;code>indexing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: JavaScript, React&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, &lt;a href="mailto:ewhitley@northwestern.edu">Eric Whitley&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to leverage the information entered by users and passively discovered by StatWrap to facilitate cross-project searching. This functionality will allow investigators to search across projects (current and past) to find relevant projects, assets, and notes. Given the potentially sensitive nature of data included in projects, the indexing of content for searching must be done locally.&lt;/p>
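&lt;p>The local-indexing requirement can be sketched abstractly. The toy example below is illustrative only (the actual project is JavaScript/React, and this is not StatWrap&amp;rsquo;s design): it builds an in-memory inverted index over asset notes and answers multi-token queries with AND semantics, with no data leaving the machine:&lt;/p>

```python
from collections import defaultdict

def build_index(assets):
    """Map each lowercased token to the ids of assets whose text contains it."""
    index = defaultdict(set)
    for asset_id, text in assets.items():
        for token in text.lower().split():
            index[token].add(asset_id)
    return index

def search(index, query):
    """Return ids of assets matching every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results = results.intersection(index.get(token, set()))
    return results

# Hypothetical project assets with user-entered notes.
notes = {"a1": "regression analysis of cohort data", "a2": "cohort cleaning script"}
idx = build_index(notes)
```

&lt;p>In practice the project would evaluate existing open-source local indexing libraries rather than hand-rolling one; the sketch only shows the shape of the problem.&lt;/p>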
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Identify and evaluate open-source projects to index content for searching&lt;/li>
&lt;li>Add a new classification for projects of “Active” and “Past” in the user interface&lt;/li>
&lt;li>Implement the search capability within the user interface&lt;/li>
&lt;li>Develop unit tests and conduct system testing&lt;/li>
&lt;/ul></description></item><item><title>Disentangled Generation and Editing of Pathology Images</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/pathology_image_disentanglement/</link><pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/pathology_image_disentanglement/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computational pathology, image generation, disentangled representations, latent space manipulation, deep learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python, with experience in machine learning libraries such as PyTorch or TensorFlow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Generative Models:&lt;/strong>
&lt;ul>
&lt;li>Familiarity with Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and contrastive learning methods.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Image processing techniques, statistical analysis, and working with histopathology datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Biomedical Knowledge (preferred):&lt;/strong>
&lt;ul>
&lt;li>Basic understanding of histology, cancer pathology, and biological image annotation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). The project involves substantial computational work, model development, and evaluation of generated pathology images.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xi-li/">Xi Li&lt;/a> (contact person), Mentor Name&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>The project aims to advance the &lt;strong>generation and disentanglement of pathology images&lt;/strong>, focusing on precise control over key histological features. By leveraging generative models, we seek to create synthetic histological images where specific pathological characteristics can be independently controlled.&lt;/p>
&lt;h3 id="challenges-in-current-approaches">&lt;strong>Challenges in Current Approaches&lt;/strong>&lt;/h3>
&lt;p>Current methods in histopathology image generation often struggle with:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Feature Entanglement:&lt;/strong> Difficulty in isolating individual factors such as cancer presence, severity, or staining variations.&lt;/li>
&lt;li>&lt;strong>Lack of Control:&lt;/strong> Limited capability to manipulate specific pathological attributes without affecting unrelated features.&lt;/li>
&lt;li>&lt;strong>Consistency Issues:&lt;/strong> Generated images often fail to maintain realistic cellular distributions, affecting biological validity.&lt;/li>
&lt;/ol>
&lt;h3 id="project-motivation">&lt;strong>Project Motivation&lt;/strong>&lt;/h3>
&lt;p>This project proposes a &lt;strong>disentangled representation framework&lt;/strong> to address these limitations. By separating key features within the latent space, we aim to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Control Histological Features:&lt;/strong> Adjust factors such as cancer presence, tumor grade, number of malignant cells, and staining methods.&lt;/li>
&lt;li>&lt;strong>Ensure Spatial Consistency:&lt;/strong> Maintain the natural distribution of cells during image reconstruction and editing.&lt;/li>
&lt;li>&lt;strong>Enable Latent Space Manipulation:&lt;/strong> Provide interpretable controls for editing and generating realistic histopathology images.&lt;/li>
&lt;/ul>
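&lt;p>The latent-manipulation goal can be stated concretely: in a well-disentangled space, editing one attribute is a shift along one latent direction. The toy sketch below is purely illustrative (the latent code and the &amp;ldquo;tumor grade&amp;rdquo; axis are hypothetical; a real model would learn both):&lt;/p>

```python
def edit_latent(z, direction, alpha):
    """Shift a latent code z along an attribute direction by strength alpha."""
    return [zi + alpha * di for zi, di in zip(z, direction)]

# With axis-aligned disentanglement, one coordinate encodes one attribute.
z = [0.2, -1.0, 0.5]          # hypothetical latent code for one image
grade_axis = [0.0, 1.0, 0.0]  # hypothetical "tumor grade" direction
z_higher_grade = edit_latent(z, grade_axis, 2.0)
```

&lt;p>Decoding the edited code should change only the targeted attribute; the evaluation metrics in the deliverables below are what would verify that the other coordinates (and the cell distribution) stay fixed.&lt;/p>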
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Disentangled Representation Learning:&lt;/strong>
&lt;ul>
&lt;li>Develop generative models (e.g., VAEs, GANs) to separate and control histological features.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Latent Space Manipulation:&lt;/strong>
&lt;ul>
&lt;li>Design mechanisms for intuitive editing of pathology images through latent space adjustments.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Spatial Consistency Validation:&lt;/strong>
&lt;ul>
&lt;li>Implement evaluation metrics to ensure that cell distribution remains biologically consistent during image generation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Generative Model Framework:&lt;/strong>
&lt;ul>
&lt;li>An open-source Python implementation for pathology image generation and editing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disentangled Latent Space Tools:&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing and manipulating latent spaces to control specific pathological features.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation Metrics:&lt;/strong>
&lt;ul>
&lt;li>Comprehensive benchmarks assessing image quality, feature disentanglement, and biological realism.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials:&lt;/strong>
&lt;ul>
&lt;li>Clear guidelines and code examples for the research community to adopt and build upon this work.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>By enabling precise control over generated histology images, this project will contribute to &lt;strong>data augmentation&lt;/strong>, &lt;strong>model interpretability&lt;/strong>, and &lt;strong>biological insight&lt;/strong> in computational pathology. The disentangled approach offers new opportunities for researchers to explore disease mechanisms, develop robust diagnostic models, and improve our understanding of cancer progression and tissue morphology.&lt;/p>
&lt;hr></description></item><item><title>Autograder</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">EduLinq Autograder&lt;/a> is an open source tool used by several courses at UCSC
to safely and quickly grade programming assignments.
Grading student code is something that may seem simple at first (you just need to run their code!),
but quickly becomes exceedingly complex as you get deeper into the details.
Specifically, grading a student&amp;rsquo;s code securely while providing the &amp;ldquo;last mile&amp;rdquo; service of getting code from students
and sending results to instructors/TAs and the course&amp;rsquo;s LMS (e.g., Canvas) can be very difficult.
The Autograder provides all of this in a free and open source project.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Autograder.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="llm-detection">LLM Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>AI/ML&lt;/code> &lt;code>LLM&lt;/code> &lt;code>Research&lt;/code> &lt;code>Backend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, systems, data munging, go, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener">Large Language Model (LLM)&lt;/a> tools like ChatGPT become more common and powerful,
instructors need tools to help determine if students are the actual authors of the code they submit.
More classical instances of plagiarism are often discovered by code similarity tools like &lt;a href="https://theory.stanford.edu/~aiken/moss/" target="_blank" rel="noopener">MOSS&lt;/a>.
However, these tools are not sufficient for detecting code written not by a student,
but by an AI model like &lt;a href="https://en.wikipedia.org/wiki/ChatGPT" target="_blank" rel="noopener">ChatGPT&lt;/a> or &lt;a href="https://en.wikipedia.org/wiki/GitHub_Copilot" target="_blank" rel="noopener">GitHub Copilot&lt;/a>.&lt;/p>
&lt;p>The task for this project is to create a system that provides a score indicating the system&amp;rsquo;s confidence that a given piece of code was written by an AI tool and not a student.
This will supplement the existing code analysis tools in the Autograder.
There are many approaches to completing this task that will be considered.
A more software-development-oriented approach can consist of leveraging existing systems to create a production-ready system,
whereas a more research approach can consist of creating a novel approach complete with a paper and experiments.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">Repository for Autograder Server&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/140" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="code-analysis-gui">Code Analysis GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, data munging, js, css, go&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder has existing functionality to analyze the code in a student&amp;rsquo;s submission for malicious content.
Relevant to this project is that the Autograder can run a pairwise similarity analysis against all submitted code.
This is how most existing software plagiarism systems detect offending code.
The existing infrastructure provides detailed statistics on code similarity,
but does not currently have a visual way to display this data.&lt;/p>
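&lt;p>For intuition, pairwise similarity analysis can be sketched with a toy token-set Jaccard score. This is an illustration of the kind of data a GUI would visualize, not the Autograder&amp;rsquo;s actual metric:&lt;/p>

```python
from itertools import combinations

def jaccard(a, b):
    """Token-set Jaccard similarity between two pieces of code text."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a.intersection(tokens_b)) / len(tokens_a.union(tokens_b))

def pairwise_scores(submissions):
    """Score every unordered pair of submissions, most similar first."""
    pairs = combinations(sorted(submissions), 2)
    scored = [(a, b, jaccard(submissions[a], submissions[b])) for a, b in pairs]
    return sorted(scored, key=lambda row: row[2], reverse=True)

# Hypothetical (pre-tokenized) submissions; s1 and s2 are identical.
submissions = {"s1": "def add a b return a plus b",
               "s2": "def add a b return a plus b",
               "s3": "print hello world"}
```

&lt;p>A web GUI for this project would render such ranked pairs (with the real, richer statistics from the API) as a sortable table or heat map.&lt;/p>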
&lt;p>The task for this project is to create a web GUI using the Autograder REST API
to display the results of a code analysis.
The size of this project depends on how many of the existing features are going to be supported by the web GUI.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/142" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/internal/model/analysis.go#L78" target="_blank" rel="noopener">Pairwise Code Analysis Type&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py/blob/main/tests/api/testdata/courses/assignments/submit/analysis/course_assignments_submissions_analysis_pairwise_wait.json" target="_blank" rel="noopener">Sample API Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="web-gui">Web GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, js, css&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder contains dozens of &lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">API endpoints&lt;/a>,
most directly representing a piece of functionality exposed to the user.
All of these features are exposed in the &lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>.
However, the Python interface is a purely command-line interface.
And although command-line interfaces are objectively (read: subjectively) the best,
a web GUI would be more accessible to a wider audience.
The autograder already has a web GUI,
but it does not cover all the features available in the Autograder.&lt;/p>
&lt;p>The task for this project is to augment the Autograder&amp;rsquo;s web GUI with more features.
Specifically, add support for more tools used to create and administer courses.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/61" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">Autograder API Endpoints&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LMS Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/lms-toolkit/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/lms-toolkit/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">EduLinq LMS Toolkit&lt;/a> (also called the &amp;ldquo;Canvas Tool&amp;rdquo; or &amp;ldquo;py-canvas&amp;rdquo;) is a suite of tools used by several courses at UCSC
to interact with Canvas from the command line or Python.
A &lt;a href="https://en.wikipedia.org/wiki/Learning_management_system" target="_blank" rel="noopener">Learning Management System&lt;/a> (LMS) is a system that institutions use to manage courses, assignments, students, and grades.
The most popular LMSs are
&lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>,
&lt;a href="https://en.wikipedia.org/wiki/Blackboard_Learn" target="_blank" rel="noopener">Blackboard&lt;/a>,
&lt;a href="https://en.wikipedia.org/wiki/Moodle" target="_blank" rel="noopener">Moodle&lt;/a>,
and &lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">Brightspace&lt;/a>.
These tools can be very helpful, especially from an administrative standpoint, but can be hard to interact with.
They can be especially difficult when instructors and TAs want to do something that is not explicitly supported by their built-in GUIs
(e.g., when an instructor wants to use a special grading policy).
The LMS Toolkit project is an effort to create a single suite of command-line tools (along with a Python interface)
to connect to all the above mentioned LMSs in a simple and uniform way.
So, not only can instructors and TAs easily access and modify the data held in an LMS (like a student&amp;rsquo;s grades),
but they can also do it the same way on any LMS.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Quiz Composer.&lt;/p>
&lt;p>Currently, the LMS Toolkit only supports Canvas, but this suite of projects hopes to not only expand existing support,
but add support for more LMSs.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="advanced-canvas-support">Advanced Canvas Support&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The LMS Toolkit already has basic read-write support for core Canvas functionality (working with grades and assignments).
However, there are still many more features that can be supported such as
&lt;a href="https://github.com/edulinq/py-canvas/issues/17" target="_blank" rel="noopener">group management&lt;/a>,
&lt;a href="https://github.com/edulinq/py-canvas/issues/7" target="_blank" rel="noopener">quiz management&lt;/a>,
&lt;a href="https://github.com/edulinq/py-canvas/issues/10" target="_blank" rel="noopener">quiz statistics&lt;/a>,
and &lt;a href="https://github.com/edulinq/py-canvas/issues/19" target="_blank" rel="noopener">assignment statuses&lt;/a>.&lt;/p>
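&lt;p>For orientation, Canvas exposes its functionality over a token-authenticated REST API. The sketch below builds (but does not send) such a request with the Python standard library; the host and token are placeholders, and exact endpoints should be confirmed against the Canvas API documentation:&lt;/p>

```python
import urllib.request

def canvas_request(base_url, token, path):
    """Build (without sending) a token-authenticated Canvas REST request."""
    request = urllib.request.Request(base_url.rstrip("/") + path)
    request.add_header("Authorization", "Bearer " + token)
    return request

# Placeholder host and token; listing courses is a typical first call.
request = canvas_request("https://canvas.example.edu", "FAKE_TOKEN", "/api/v1/courses")
```

&lt;p>The LMS-agnostic design work in this project is deciding what abstraction sits above calls like this one, so the same toolkit operation can target any LMS backend.&lt;/p>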
&lt;p>The task for this project is to choose a set of advanced Canvas features to support
(not limited to those features mentioned above),
design an LMS-agnostic way to support those features,
and implement those features.
The flexibility in the features chosen to implement accounts for the variable size of this project.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>GitHub Issues
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/17" target="_blank" rel="noopener">Group Management&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/7" target="_blank" rel="noopener">Quiz Management&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/10" target="_blank" rel="noopener">Quiz Statistics&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/19" target="_blank" rel="noopener">Assignment Statuses&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="new-lms-support-moodle">New LMS Support: Moodle&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
It is a lofty goal; however, there is currently only support for &lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>.
&lt;a href="https://en.wikipedia.org/wiki/Moodle" target="_blank" rel="noopener">Moodle&lt;/a> is one of the more popular LMSs.
Naturally, the LMS Toolkit wants to support Moodle as well.
Moodle is open source, so adding support in the LMS Toolkit should not be too challenging.&lt;/p>
&lt;p>The task for this project is to add basic support for the Moodle LMS.
It is not necessary to support all the same features that are supported for Canvas,
but at least the core features of score and assignment management should be implemented.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://en.wikipedia.org/wiki/Moodle" target="_blank" rel="noopener">Moodle Wiki Page&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/22" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="new-lms-support-blackboard">New LMS Support: Blackboard&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
It is a lofty goal; however, there is currently only support for &lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>.
&lt;a href="https://en.wikipedia.org/wiki/Blackboard_Learn" target="_blank" rel="noopener">Blackboard&lt;/a> (also called &amp;ldquo;Blackboard Learn&amp;rdquo;) is one of the more popular LMSs.
Naturally, the LMS Toolkit wants to support Blackboard as well.
However, a challenge in supporting Blackboard is that it is not open source (unlike Canvas).
Therefore, support and testing on Blackboard may be very challenging.&lt;/p>
&lt;p>The task for this project is to add basic support for the Blackboard LMS.
It is not necessary to support all the same features that are supported for Canvas,
but at least the core features of score and assignment management should be implemented.
The closed nature of Blackboard makes this a challenging and uncertain project.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://en.wikipedia.org/wiki/Blackboard_Learn" target="_blank" rel="noopener">Blackboard Wiki Page&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/21" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="new-lms-support-brightspace">New LMS Support: Brightspace&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
It is a lofty goal; however, there is currently only support for &lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>.
&lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">D2L Brightspace&lt;/a> is one of the more popular LMSs.
Naturally, the LMS Toolkit wants to support Brightspace as well.
However, a challenge in supporting Brightspace is that it is not open source (unlike Canvas).
Therefore, support and testing on Brightspace may be very challenging.&lt;/p>
&lt;p>The task for this project is to add basic support for the Brightspace LMS.
It is not necessary to support all the same features that are supported for Canvas,
but at least the core features of score and assignment management should be implemented.
The closed nature of Brightspace makes this a challenging and uncertain project.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">Brightspace Wiki Page&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/23" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="testing--ci-infrastructure">Testing / CI Infrastructure&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>Testing&lt;/code> &lt;code>CI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, testing, ci, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
This means that our system must communicate with several different systems (the LMSs),
each with their own systems, data patterns, versions, and quirks.
Testing will be essential to ensure that our tools keep working as the different LMSs evolve and update.
The LMS Toolkit currently tests with Canvas by
&lt;a href="https://github.com/edulinq/py-canvas/tree/main/tests/api/test_cases" target="_blank" rel="noopener">mocking API responses&lt;/a>.
However, this tactic does not scale well with multiple LMSs (and multiple versions of each system).
A more scalable approach would be to have test instances of the different LMSs that our testing infrastructure can interact with
both interactively and in &lt;a href="https://en.wikipedia.org/wiki/Continuous_integration" target="_blank" rel="noopener">continuous integration&lt;/a> (CI).&lt;/p>
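&lt;p>The current mocking approach can be sketched as follows: a client function is exercised against a mock session that returns canned data, so no live LMS is needed. The client function and response shape here are hypothetical, not the toolkit&amp;rsquo;s real interface:&lt;/p>

```python
from unittest import mock

def fetch_grades(session, course_id):
    """Hypothetical client call: fetch the grade list for one course."""
    response = session.get("/api/v1/courses/" + str(course_id) + "/grades")
    return response.json()

# A mock session stands in for a live LMS and returns canned data.
session = mock.Mock()
session.get.return_value.json.return_value = [{"student": "amal", "score": 93}]
grades = fetch_grades(session, 101)
```

&lt;p>The limitation this project targets is visible here: the canned data must be hand-maintained for every LMS and version, whereas tests against real (containerized) LMS instances would exercise the true APIs.&lt;/p>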
&lt;p>The task for this project is to create testing infrastructure that
connects to test instances of different LMS systems (e.g., Canvas).
This task does not require that all the LMSs in this document are used,
but the testing infrastructure should be robust enough to support them all.
The open source LMSs (Canvas and Moodle) will likely be much easier to set up than the others,
and should be targeted first.
We should be able to run tests locally as well as in CI,
and will likely heavily use &lt;a href="https://en.wikipedia.org/wiki/Docker_%28software%29" target="_blank" rel="noopener">Docker&lt;/a> containers.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/24" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/tree/main/tests/api/test_cases" target="_blank" rel="noopener">Mocked API Responses&lt;/a>.&lt;/li>
&lt;/ul></description></item><item><title>Quiz Composer</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/quiz-composer/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/quiz-composer/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">EduLinq Quiz Composer&lt;/a> (also called the &amp;ldquo;Quiz Generator&amp;rdquo;) is a tool used by several courses at UCSC
to create and maintain platform-agnostic quizzes (including exams and worksheets).
Knowledge assessments like quizzes, exams, and tests are a core part of the learning process for many courses.
However, maintaining banks of questions, collaborating on new questions, and converting quizzes to new formats can consume a lot of time,
taking time away from actually improving course materials.
The Quiz Composer helps by providing a single text-based format that can be stored in a repository and &amp;ldquo;compiled&amp;rdquo; into many different formats including:
HTML, LaTeX, PDF, Canvas, GradeScope, and QTI.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Quiz Composer.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="canvas-import">Canvas Import&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Quiz Composer houses quizzes and quiz questions in a simple and unambiguous format based
on &lt;a href="https://en.wikipedia.org/wiki/JSON" target="_blank" rel="noopener">JSON&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Markdown" target="_blank" rel="noopener">Markdown&lt;/a> (specifically, the &lt;a href="https://commonmark.org" target="_blank" rel="noopener">CommonMark specification&lt;/a>).
This allows the Quiz Composer to unambiguously create versions of the same quiz in many different formats.
However, creating a quiz in the Quiz Composer format can be a daunting task for those not familiar with JSON or Markdown.
Instead, it would be easier for people to import quizzes from another format into the Quiz Composer format,
and then edit them as they see fit.
Unfortunately, not all other quiz formats (namely, Canvas in this case) are unambiguous.&lt;/p>
&lt;p>The task for this project is to implement the functionality of importing quizzes from Canvas to the standard Quiz Composer format.
The ambiguous nature of Canvas quizzes makes this task non-trivial,
and adds an additional element of design decisions to this task.
It will be impossible to import quizzes 100% correctly,
but we want to be able to get close enough that most people can import their quizzes without issue.&lt;/p>
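As a rough sketch of the import direction, the snippet below maps one Canvas-style multiple-choice question (in the shape returned by the Canvas REST API's quiz questions endpoint) into a simple JSON question. The output schema here is a guess for illustration; the real Quiz Composer format is defined in the quizgen repository, and converting Canvas's HTML prompts to Markdown is one of the ambiguities the project must resolve.

```python
# Sketch: convert a Canvas-style multiple-choice question dict into a simple
# JSON question. The target schema is illustrative, not the real quizgen one.

def import_canvas_question(canvas_q):
    """Map a Canvas multiple-choice question to a simple JSON form."""
    if canvas_q["question_type"] != "multiple_choice_question":
        raise NotImplementedError(canvas_q["question_type"])
    # Canvas marks correct answers with a positive weight.
    answers = {a["text"]: a["weight"] > 0 for a in canvas_q["answers"]}
    return {
        "question_type": "multiple_choice",
        "prompt": canvas_q["question_text"],  # HTML; may need Markdown conversion
        "answers": answers,
    }

canvas_q = {
    "question_type": "multiple_choice_question",
    "question_text": "What does JSON stand for?",
    "answers": [
        {"text": "JavaScript Object Notation", "weight": 100},
        {"text": "Java Standard Output Network", "weight": 0},
    ],
}
print(import_canvas_question(canvas_q)["answers"])
```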
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen/issues/27" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="google-forms-export">Google Forms Export&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Quiz Composer can export quizzes to many different formats,
each with a varying level of interactivity and feature support.
For example, quizzes can be exported to PDFs that are printed out so that students can write down their answers to be graded later.
Quizzes can also be exported to interactive platforms like Canvas where students can enter answers that may be automatically checked with feedback immediately provided to the student.
One potential platform with functionality somewhere between the above two examples is &lt;a href="https://workspace.google.com/products/forms/" target="_blank" rel="noopener">Google Forms&lt;/a>.
&amp;ldquo;Forms&amp;rdquo; (an entity on Google Forms) can be something like a survey or (as of more recently) a quiz.&lt;/p>
&lt;p>The task for this project is to add support for exporting quizzes from the Quiz Composer to Google Forms.
There is a large overlap in the quiz features supported in Canvas (which the Quiz Composer already supports) and Google Forms,
so most settings should be fairly straightforward.
There may be some design work around deciding what features are specific to one quiz platform
and what features can be abstracted to work across several platforms.&lt;/p>
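As a sketch of what the exporter would produce, the snippet below builds a Google Forms `batchUpdate` request body for one multiple-choice question. The field names follow the Forms API (`forms.googleapis.com`) as best understood and should be verified against the official API reference; the request would ultimately be sent via the Google API client with appropriate OAuth credentials.

```python
# Sketch: build a Google Forms batchUpdate request for one multiple-choice
# question. Field names follow the Forms API as understood; verify against
# the official docs before relying on them.

def forms_create_item(index, title, choices):
    """Return a createItem request adding a radio-button question at `index`."""
    return {
        "createItem": {
            "item": {
                "title": title,
                "questionItem": {
                    "question": {
                        "required": True,
                        "choiceQuestion": {
                            "type": "RADIO",
                            "options": [{"value": c} for c in choices],
                        },
                    }
                },
            },
            "location": {"index": index},
        }
    }

# A full export would emit one request per quiz question, then POST the body
# to forms.googleapis.com via the authenticated client.
body = {"requests": [forms_create_item(0, "2 + 2 = ?", ["3", "4", "5"])]}
print(len(body["requests"]))
```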
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen/issues/19" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="template-questions">Template Questions&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, data munging, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Questions in the Quiz Composer are described using &lt;a href="https://en.wikipedia.org/wiki/JSON" target="_blank" rel="noopener">JSON&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Markdown" target="_blank" rel="noopener">Markdown&lt;/a>
files which contain the question prompt, possible answers, and the correct answer.
(Of course, there are many different &lt;a href="https://github.com/edulinq/quizgen/blob/main/docs/question-types.md" target="_blank" rel="noopener">question types&lt;/a>,
each with different semantics and requirements.)
However, a limitation of this is that each question is always the same.
You can have multiple copies of a question with slightly different prompts, numbers, and answers;
but you are still limited to each question being static and unchanging.
It would be useful to have &amp;ldquo;template questions&amp;rdquo; that can dynamically create static questions from a template
and collection of replacement data.&lt;/p>
&lt;p>The task for this project is to add support for the &amp;ldquo;template questions&amp;rdquo; discussed above.
Much of the high-level design work for this issue has &lt;a href="https://github.com/edulinq/quizgen/issues/26" target="_blank" rel="noopener">already been completed&lt;/a>.
But the implementation and the low-level design decisions are still left to do.&lt;/p>
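The core idea can be sketched in a few lines: one template question plus rows of replacement data yields several static questions. The field names and substitution syntax below are illustrative assumptions, not the design settled in the linked issue.

```python
# Sketch: expand a "template question" into static questions from replacement
# data. Field names and the $-substitution syntax are assumptions.
from string import Template

def instantiate(template_q, variants):
    """Produce one static question per row of replacement values."""
    return [
        {
            "prompt": Template(template_q["prompt"]).substitute(values),
            "answer": Template(template_q["answer"]).substitute(values),
        }
        for values in variants
    ]

template_q = {"prompt": "What is $a + $b?", "answer": "$total"}
variants = [{"a": 1, "b": 2, "total": 3}, {"a": 10, "b": 5, "total": 15}]
questions = instantiate(template_q, variants)
print(questions[0]["prompt"])  # What is 1 + 2?
```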
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen/issues/26" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LLMSeqRec: LLM Enhanced Contextual Sequential Recommender</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/llmseqrec/</link><pubDate>Thu, 06 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/llmseqrec/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains. Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning.
By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on vanilla sequential ID orders.
To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Data Preprocessing &amp;amp; Feature Creation&lt;/strong>:
Develop a data processing pipeline to parse users&amp;rsquo; sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling; extract user behavior sequences, item metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing. The data source can be the Amazon open public data or the MovieLens dataset. Data point creation can follow SASRec (reference 1).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Model Development&lt;/strong>:
Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context; Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions, reviews into the sequential recommendation process; The baseline model can be SASRec pytorch implementation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation&lt;/strong>:
Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating on accuracy, NDCG and cold-start performance; Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality; Optimize model inference speed and efficiency for real-time recommendation scenarios.&lt;/p>
&lt;/li>
&lt;/ul>
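The core of Step 2 can be sketched as an embedding fusion: learned item-ID embeddings are concatenated with frozen LLM text embeddings of the item descriptions, then projected into the sequential model's dimension. All dimensions and the single linear projection below are illustrative assumptions, not the project's final design (which would use PyTorch and trainable parameters).

```python
# Sketch: fuse item-ID embeddings with LLM text embeddings before the
# sequential transformer. Dimensions and the projection are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_items, d_id, d_llm, d_model = 100, 32, 768, 64

id_emb = rng.normal(size=(n_items, d_id))    # trainable ID embedding table
llm_emb = rng.normal(size=(n_items, d_llm))  # frozen LLM description embeddings
W = rng.normal(size=(d_id + d_llm, d_model)) # fusion projection

def fused_sequence(item_ids):
    """Fused embedding sequence for one user's interaction history."""
    x = np.concatenate([id_emb[item_ids], llm_emb[item_ids]], axis=1)
    return x @ W  # shape (seq_len, d_model): input to the SASRec-style model

seq = fused_sequence([3, 17, 42])
print(seq.shape)  # (3, 64)
```

Because the LLM embeddings carry semantic content from item text, a brand-new (cold-start) item gets a meaningful representation even before any interactions are observed.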
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: the software, model training with validation and performance evaluation, and a demo. The software implementing the LLMSeqRec model will be hosted on GitHub as an open-access repository. The evaluation results and demo will be published alongside the GitHub repo.&lt;/p>
&lt;h3 id="llmseqrec">LLMSeqRec&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: LLM Enhanced Contextual Sequential Recommender&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, PyTorch, GitHub, self-attention, Transformers&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>Self-Attentive Sequential Recommendation (SASRec)&lt;/li>
&lt;li>BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer&lt;/li>
&lt;li>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/li>
&lt;li>Amazon Dataset: &lt;a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews" target="_blank" rel="noopener">https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews&lt;/a>&lt;/li>
&lt;li>MovieLens Data: &lt;a href="https://grouplens.org/datasets/movielens/" target="_blank" rel="noopener">https://grouplens.org/datasets/movielens/&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>ReIDMM: Re-identifying Multiple Objects across Multiple Streams</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/reidmm/</link><pubDate>Thu, 06 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/reidmm/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Re-identifying multiple objects across multiple streams (ReIDMM) is essential in scientific research and various industries. It involves tracking and analyzing entities across different viewpoints or time frames. In astronomy, ReIDMM helps track celestial objects like asteroids and space debris using multiple observatories. In biology and ecology, it enables the identification of animals across different camera traps and aids in tracking microscopic organisms in laboratory studies. In physics and engineering, it is used for tracking particles in high-energy physics experiments, monitoring structural changes in materials, and identifying robots or drones in lab automation. Beyond scientific applications, ReIDMM plays a critical role in industries such as retail, where it tracks customer behavior across multiple stores and improves sales and prevents theft. In smart cities, it supports traffic monitoring by identifying vehicles across intersections for improved traffic flow management. In manufacturing, it enables supply chain tracking by locating packages across conveyor belts and warehouse cameras. In autonomous systems, ReIDMM enhances multi-camera sensor fusion and warehouse robotics by identifying pedestrians, obstacles, and objects across different camera views.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an open-source algorithm for multiple-object re-identification across diverse open-source data streams. As highlighted earlier, this method is expected to have wide-ranging applications in both scientific research and industry. Utilizing an open-source dataset, our focus will be on re-identifying common objects such as vehicles and pedestrians. The primary challenge lies in designing a unified algorithm, ReIDMM, capable of performing robust multi-object re-identification across multiple streams. Users will be able to tag any object as a target in a video or image for tracking across streams. Below is an outline of the algorithms to be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Target Object Identification&lt;/strong>: Randomly select a target object from an image or video using object detection models such as YOLOv7. These models detect objects by generating bounding boxes around them. Target objects could include vehicles, pedestrians, animals, or other recognizable entities. This step ensures an initial object of interest is chosen for re-identification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Feature Extraction and Embedding&lt;/strong>: Once the target object is identified, extract relevant features such as bounding box coordinates, timestamp, location metadata (if available), and visual characteristics. A multimodal embedding approach is used, where these features are transformed into a numerical representation (embedding vector) that captures the object&amp;rsquo;s unique identity. This allows for efficient comparison across different images or videos.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Searching and Matching&lt;/strong>: To find the target object in other images or videos: (1) Extract embeddings of all objects detected in the other images/videos; (2) Compute similarity between the target object’s embedding and those of all detected objects using metrics like cosine similarity or Euclidean distance. (3) Rank objects by similarity, returning the most probable matches. The highest-ranked results are likely to be the same object observed from different angles, lighting conditions, or time frames.&lt;/p>
&lt;/li>
&lt;/ul>
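Step 3's matching logic can be sketched directly: normalize the embeddings and rank candidates by cosine similarity to the target. The detector and embedding model from Steps 1 and 2 are out of scope here; the vectors below are toy stand-ins.

```python
# Sketch: rank detected objects by cosine similarity to the target embedding
# (ReIDMM Step 3). Embeddings are toy stand-ins for detector/encoder output.
import numpy as np

def rank_matches(target, candidates):
    """Return candidate indices from most to least similar, plus similarities."""
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ t                      # cosine similarity per candidate
    return np.argsort(-sims), sims    # descending order

target = np.array([1.0, 0.0, 0.0])
candidates = np.array([
    [0.9, 0.1, 0.0],  # nearly the same appearance: likely the same object
    [0.0, 1.0, 0.0],  # orthogonal: a different object
    [0.7, 0.7, 0.0],
])
order, sims = rank_matches(target, candidates)
print(order[0])  # 0: the closest candidate
```

In practice the top-ranked matches from each stream would be the same object seen from different angles, lighting conditions, or time frames.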
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three things: the software, evaluation results, and a demo. The software implementing the ReIDMM algorithm will be hosted on GitHub as an open-access repository. The evaluation results and demo will be published alongside the GitHub repo.&lt;/p>
&lt;h3 id="reidmm">ReIDMM&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: ReIDMM: Re-identifying Multiple Objects across Multiple Streams&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficient in Python; experience with image processing and machine learning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="reference">Reference:&lt;/h3>
&lt;ul>
&lt;li>&lt;a href="https://medium.datadriveninvestor.com/multiple-object-tracking-using-person-re-identification-f9b7360cda1a" target="_blank" rel="noopener">multiple-object-tracking-using-person&lt;/a>&lt;/li>
&lt;li>Dataset: &lt;a href="https://paperswithcode.com/task/vehicle-re-identification" target="_blank" rel="noopener">Vehicle re-identification dataset and paper&lt;/a> and &lt;a href="https://paperswithcode.com/task/person-re-identification" target="_blank" rel="noopener">Person re-identification data and paper&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Seam: Kubernetes-Aware Programmable Networking &amp; Cloud Provisioning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsd/seam/</link><pubDate>Wed, 05 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsd/seam/</guid><description>&lt;p>Seam is a project focused on building a Kubernetes-aware programmable networking and cloud provisioning system. It combines Python, Kubernetes, P4 programming, and SmartNICs to create a robust framework for managing cloud resources, optimizing networking, and provisioning virtual machines. Students will learn about cutting-edge technologies such as Kubernetes, Docker, P4 programming, SmartNICs, KubeVirt, Prometheus, Grafana, and Flask, while working on real-world applications in high-performance computing environments. This project will help students understand the intricacies of cloud resource management and programmable networking, providing them with valuable skills for future careers in software engineering, networking, and DevOps.&lt;/p>
&lt;p>The project involves creating a &lt;strong>Python library&lt;/strong> for provisioning Kubernetes resources, including virtual machines and networking, using tools such as &lt;strong>KubeVirt&lt;/strong> for VM provisioning and &lt;strong>ESnet SENSE&lt;/strong> for network configuration. The library will also integrate monitoring solutions with &lt;strong>Prometheus&lt;/strong> and &lt;strong>Grafana&lt;/strong> for real-time metrics collection and visualization. Students will develop &lt;strong>Flask-based dashboards&lt;/strong> for managing these resources, implement automated pipelines using &lt;strong>GitLab CI/CD&lt;/strong>, and explore full-stack web development, database management with &lt;strong>PostgreSQL&lt;/strong>, and API design.&lt;/p>
&lt;p>In addition, students will gain hands-on experience with &lt;strong>programmable networking&lt;/strong> using &lt;strong>P4&lt;/strong> and &lt;strong>SmartNICs&lt;/strong>, learning how to write P4 programs for dynamic routing, security, and network policy enforcement at the hardware level. The integration of &lt;strong>Kubernetes&lt;/strong>, &lt;strong>SmartNICs&lt;/strong>, and &lt;strong>P4 programming&lt;/strong> will allow for advanced optimizations and efficient management of high-performance cloud environments.&lt;/p>
&lt;p>Thus far, the framework has been developed to allow provisioning of resources within Kubernetes, integrating Prometheus and Grafana for monitoring, and providing an interface for users to manage cloud resources. We aim to extend this by incorporating advanced network policies and improving the web interface.&lt;/p>
&lt;h3 id="seam--kubernetes-resource-provisioning-and-management">Seam / Kubernetes Resource Provisioning and Management&lt;/h3>
&lt;p>The proposed work includes expanding the Python library to support comprehensive &lt;strong>Kubernetes resource provisioning&lt;/strong>, &lt;strong>network management&lt;/strong>, and &lt;strong>virtual machine provisioning&lt;/strong> using &lt;strong>KubeVirt&lt;/strong>. Students will enhance the current implementation to allow users to define &lt;strong>resource limits, CPU/GPU quotas, and network policies&lt;/strong>. They will also integrate with &lt;strong>ESnet SENSE&lt;/strong> to facilitate &lt;strong>L2 networking&lt;/strong>, and explore the use of &lt;strong>Prometheus&lt;/strong> and &lt;strong>Grafana&lt;/strong> for real-time performance monitoring and metrics collection.&lt;/p>
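One way the library could express per-namespace resource limits is by emitting a standard Kubernetes `ResourceQuota` manifest, shown below as a plain dict that could be passed to the official Python client or serialized to YAML. The quota values and naming convention are examples, not Seam's actual API.

```python
# Sketch: express per-namespace CPU/memory/GPU limits as a Kubernetes
# ResourceQuota manifest. Values and naming are illustrative examples.

def resource_quota(namespace, cpu, memory, gpus):
    """Build a ResourceQuota manifest dict capping a namespace's resources."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "limits.cpu": str(cpu),
                "limits.memory": memory,
                # GPUs are an extended resource; quota is set on requests.
                "requests.nvidia.com/gpu": str(gpus),
            }
        },
    }

quota = resource_quota("seam-dev", cpu=8, memory="32Gi", gpus=1)
print(quota["spec"]["hard"]["limits.cpu"])  # 8
```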
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Kubernetes, Python, Cloud Computing, Networking, Programmable Networking, Monitoring, CI/CD&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Kubernetes, P4 programming, KubeVirt, ESnet SENSE, Docker, GitLab CI/CD, Prometheus, Grafana, PostgreSQL, Flask&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="seam--full-stack-web-development-and-dashboard">Seam / Full-Stack Web Development and Dashboard&lt;/h3>
&lt;p>The proposed work includes building a &lt;strong>Flask-based web dashboard&lt;/strong> using &lt;strong>Bootstrap&lt;/strong> for UI, integrating it with the &lt;strong>Python library&lt;/strong> to enable users to easily provision resources, monitor network performance, and track resource usage in real-time. The dashboard will support &lt;strong>role-based access control (RBAC)&lt;/strong>, allowing for secure multi-user management. Students will also integrate &lt;strong>PostgreSQL&lt;/strong> for managing and storing configurations, logs, and performance metrics.&lt;/p>
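The RBAC piece of the dashboard can be sketched framework-agnostically as a decorator that gates a handler on the caller's role. In Flask the role would come from the session and the denial would be rendered as an HTTP 403; the handler names and role strings below are illustrative.

```python
# Sketch: role-based access control as a handler decorator. In Flask, `user`
# would come from the session and the tuple would map to an HTTP response.
from functools import wraps

def requires_role(role):
    """Allow the wrapped handler only for users holding `role`."""
    def decorator(handler):
        @wraps(handler)
        def wrapped(user, *args, **kwargs):
            if role not in user.get("roles", ()):
                return ("forbidden", 403)
            return handler(user, *args, **kwargs)
        return wrapped
    return decorator

@requires_role("admin")
def delete_vm(user, vm_name):
    # Hypothetical handler: tearing down a KubeVirt VM is admin-only.
    return (f"deleted {vm_name}", 200)

print(delete_vm({"roles": ["admin"]}, "test-vm"))   # ('deleted test-vm', 200)
print(delete_vm({"roles": ["viewer"]}, "test-vm"))  # ('forbidden', 403)
```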
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Full-Stack Web Development, Flask, Bootstrap, PostgreSQL, Kubernetes, Monitoring, DevOps&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web Development, Flask, Bootstrap, PostgreSQL, API Development, Kubernetes&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="seam--cicd-and-gitlab-integration">Seam / CI/CD and GitLab Integration&lt;/h3>
&lt;p>The proposed work includes setting up &lt;strong>GitLab CI/CD pipelines&lt;/strong> for automated &lt;strong>testing, deployment&lt;/strong>, and &lt;strong>maintenance&lt;/strong> of the Python library, Kubernetes resources, and web dashboard. Students will automate the deployment of &lt;strong>P4 programs&lt;/strong>, &lt;strong>Kubernetes deployments&lt;/strong>, and &lt;strong>networking configurations&lt;/strong>. They will also focus on &lt;strong>unit testing, integration testing&lt;/strong>, and the &lt;strong>automation of benchmarking experiments&lt;/strong> to ensure reproducibility of results.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> CI/CD, GitLab, Python, Kubernetes, DevOps, Testing, Automation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> GitLab CI/CD, Python, Kubernetes, Docker, Automation, Testing, Benchmarking&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="seam--networking--smartnic-programming">Seam / Networking &amp;amp; SmartNIC Programming&lt;/h3>
&lt;p>The proposed work includes writing &lt;strong>P4 programs&lt;/strong> to control network traffic flow, enforce network security policies, and optimize data transfer across the Kubernetes cluster. Students will gain experience with &lt;strong>SmartNICs&lt;/strong> (Xilinx Alveo U55C, SN1000, NVIDIA Bluefield 2) and &lt;strong>Tofino switches&lt;/strong>, using P4 to write &lt;strong>network policies&lt;/strong> and integrate with the &lt;strong>Kubernetes network layer&lt;/strong> (Multus, Calico). Students will also explore &lt;strong>gRPC APIs&lt;/strong> for dynamically adjusting network policies and provisioning virtual network interfaces in real time.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Networking, P4 Programming, SmartNICs, Kubernetes Networking, Cloud Computing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> P4, Networking, SmartNICs, Kubernetes Networking, Multus, Calico, gRPC&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>WaDAR</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wadar/</link><pubDate>Wed, 05 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wadar/</guid><description>&lt;p>&lt;a href="https://github.com/jlab-sensing/wadar" target="_blank" rel="noopener">WaDAR&lt;/a> (Water Radar) is an innovative, low-cost, hybrid approach to soil moisture sensing that combines the benefits of in-ground (in situ) and remote sensing technologies. Traditional soil moisture measurement methods suffer from drawbacks: in situ sensors are expensive and difficult to maintain, while remote sensing offers lower accuracy and resolution. WaDAR bridges this gap by using inexpensive underground backscatter tags paired with above-ground radars, enabling completely wireless, high-resolution soil moisture monitoring.&lt;/p>
&lt;h2 id="key-features-of-wadar">Key Features of WaDAR&lt;/h2>
&lt;ul>
&lt;li>Uses &lt;strong>RF backscatter tags&lt;/strong> buried underground to provide high-accuracy soil moisture readings.&lt;/li>
&lt;li>Uses &lt;strong>ultra-wideband radar&lt;/strong> for above-ground sensing.&lt;/li>
&lt;li>Offers an average error of just 1.4%, comparable to state-of-the-art commercial sensors.&lt;/li>
&lt;li>Reduces deployment costs significantly, making it accessible for widespread agricultural use.&lt;/li>
&lt;li>Supports real-time, scalable, and maintenance-free soil moisture monitoring for farmers.&lt;/li>
&lt;/ul>
&lt;h3 id="improving-and-optimizing-data-processing-pipeline-for-more-accurate-soil-moisture-measurements">Improving and Optimizing Data Processing Pipeline for More Accurate Soil Moisture Measurements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Digital Signal Processing&lt;/code> &lt;code>Machine Learning&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/embedded, signal processing, machine learning, MATLAB (optional)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vetha/">Eric Vetha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Enhance the accuracy of soil moisture measurements by refining the data processing pipeline.&lt;/p>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Develop and test algorithms for noise reduction and signal improvement.&lt;/li>
&lt;li>Implement advanced filtering and statistical techniques to improve measurement precision.&lt;/li>
&lt;li>Validate improvements using real-world field data.&lt;/li>
&lt;li>Translate algorithms into embedded code to be implemented on real-time embedded hardware.&lt;/li>
&lt;/ul>
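&lt;p>As a hedged sketch of the kind of noise-reduction step this pipeline work might start from (pure Python with illustrative names; the actual WaDAR pipeline may differ), a centered moving average is a simple baseline filter:&lt;/p>

```python
def moving_average(signal, window=5):
    """Smooth a sampled radar signal with a centered moving average.

    A simple FIR low-pass filter: each sample is replaced by the mean
    of its neighbors, attenuating high-frequency noise while keeping
    the slowly varying backscatter peak used for moisture estimation.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed
```

&lt;p>More advanced filtering (matched filters, Kalman smoothing, learned denoisers) would slot into the same place in the pipeline and can be validated against field data in the same way.&lt;/p>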
&lt;h3 id="improving-backscatter-tag-pcb">Improving Backscatter Tag PCB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Hardware Design&lt;/code> &lt;code>Signal Processing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> PCB design, RF knowledge&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vetha/">Eric Vetha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Enhance the performance of WaDAR&amp;rsquo;s backscatter tags by optimizing PCB design for improved signal-to-noise ratio (SNR) and implementing a communication protocol for tag identification.&lt;/p>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Redesign PCB for improved readings.&lt;/li>
&lt;li>Implement and test a communication protocol to distinguish between multiple tags.&lt;/li>
&lt;li>Evaluate hardware changes in real-world field conditions.&lt;/li>
&lt;li>Optimize power consumption and scalability for practical deployment.&lt;/li>
&lt;/ul></description></item><item><title>Mediglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/polyphy/</link><pubDate>Tue, 04 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>. Recent projects, such as &lt;a href="https://github.com/PolyPhyHub/PolyGlot" target="_blank" rel="noopener">Polyglot&lt;/a> and &lt;a href="https://github.com/Ayush-Sharma410/MediGlot" target="_blank" rel="noopener">Mediglot&lt;/a> have focused on using PolyPhy to better visualize language embeddings.&lt;/p>
&lt;h3 id="medicinal-language-embeddings">Medicinal Language Embeddings&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Large Language Models&lt;/code> &lt;code>NLP&lt;/code> &lt;code>Embeddings&lt;/code> &lt;code>Medicine&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, JavaScript, Data Science, Technical Communication&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:kdeol@ualberta.ca">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to refine and enhance Mediglot, a web application for visualizing 3D medicinal embeddings, which extends the Polyglot app and leverages the PolyPhy toolkit for network-inspired data science. Mediglot currently enables users to explore high-dimensional vector representations of medicines (derived from their salt compositions) in a 3D space using UMAP, as well as analyze similarity through the innovative Monte-Carlo Physarum Machine (MCPM) metric. Unlike traditional language data, medicinal embeddings do not have an inherent sequential structure. Instead, we must work with the salt compositions of each medicine to create embeddings that are faithful to the intended purpose of each medicine.&lt;/p>
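&lt;p>To make the embedding idea concrete, here is a minimal, hypothetical sketch (illustrative salt names and dosages, not real data) of turning salt compositions into vectors and comparing them with cosine similarity, a simple baseline metric against which alternatives can be evaluated:&lt;/p>

```python
from math import sqrt

# Hypothetical medicines and salt compositions (illustrative only).
medicines = {
    "med_a": {"paracetamol": 500, "caffeine": 65},
    "med_b": {"paracetamol": 650},
    "med_c": {"ibuprofen": 400},
}

def to_vector(composition, vocabulary):
    # Embed a medicine as its dosage per salt over a shared vocabulary.
    return [float(composition.get(salt, 0)) for salt in vocabulary]

def cosine(u, v):
    # Baseline similarity metric; alternative metrics can be swapped in here.
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = sorted({salt for comp in medicines.values() for salt in comp})
vectors = {name: to_vector(comp, vocabulary) for name, comp in medicines.items()}
```

&lt;p>Medicines sharing salts score high; disjoint compositions score zero. Experimenting with the metric independently of the embedding is exactly the kind of comparison described above.&lt;/p>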
&lt;p>This year, we would like to focus on exploring and integrating state-of-the-art AI techniques and algorithms to improve Mediglot&amp;rsquo;s clustering capabilities and its representation of medicinal data in 3D. The contributor will experiment with advanced large language models (LLMs) and cutting-edge AI methods to develop innovative approaches for refining clustering and extracting deeper insights from medicinal embeddings. Beyond LLMs, we would like to experiment with more traditional language processing methods to design novel embedding procedures. Additionally, we would like to experiment with other similarity metrics. While the similarity of two medicines depends on the initial embedding, we would like to examine the effects of different metrics on the kinds of insights a user can extract. Finally, the contributor is expected to evaluate and compare different algorithms for dimensionality reduction to enhance the faithfulness of the visualization and its interpretability.&lt;/p>
&lt;p>The ideal contributor for this project has experience with Python (and common scientific toolkits such as NumPy, Pandas, SciPy). They will also need some experience with JavaScript and web development (MediGlot is distributed as a vanilla JS web app). Knowledge of embedding techniques for language processing is highly recommended.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Closely work with the mentors to understand the context of the project and its detailed requirements in preparation for the proposal.&lt;/li>
&lt;li>Become acquainted with the tooling (PolyPhy, PolyGlot, Mediglot) prior to the start of the project period.&lt;/li>
&lt;li>Explore different embedding techniques for medicinal data (including implementing novel embedding procedures).&lt;/li>
&lt;li>Explore different dimensionality reduction techniques, with a focus on faithful visualizations.&lt;/li>
&lt;li>Document the process and resulting findings in a publicly available report.&lt;/li>
&lt;/ul>
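&lt;p>The dimensionality-reduction comparison above could start from a linear baseline such as PCA. A minimal NumPy sketch (assuming embeddings arrive as a 2-D array; names are illustrative) against which UMAP and other nonlinear methods can be judged:&lt;/p>

```python
import numpy as np

def pca_3d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional embeddings to 3-D with PCA.

    A linear baseline for comparing against UMAP and other nonlinear
    dimensionality-reduction methods: the projection directions are the
    top three right singular vectors of the centered data matrix.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T
```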
&lt;h3 id="enhancing-polyphy-web-application">Enhancing PolyPhy Web Application&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>UI/UX Design&lt;/code> &lt;code>Full Stack Development&lt;/code> &lt;code>JavaScript&lt;/code> &lt;code>Next.js&lt;/code> &lt;code>Node.js&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Full Stack Web Development, UI/UX Design, JavaScript, Next.js, Node.js, Technical Communication&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:kdeol@ualberta.ca">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to revamp and enhance the PolyPhy web platform to better support contributors, users, and researchers. The goal is to optimize the website’s UI/UX, improve its performance, and integrate Mediglot to provide users with a seamless experience in visualizing both general network structures and 3D medicinal embeddings.&lt;/p>
&lt;p>The contributor will be responsible for improving the website’s overall look, feel, and functionality, ensuring a smooth and engaging experience for both contributors and end-users. This includes addressing front-end and back-end challenges, optimizing the platform for better accessibility, and ensuring seamless integration with Mediglot.&lt;/p>
&lt;p>The ideal candidate should have experience in full-stack web development, particularly with &lt;strong>Next.js&lt;/strong>, &lt;strong>JavaScript&lt;/strong>, and &lt;strong>Node.js&lt;/strong>, and should be familiar with UI/UX design principles. A strong ability to communicate effectively, both in writing and through code, is essential for this role.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Collaborate with mentors&lt;/strong> to understand the project&amp;rsquo;s goals and the specific requirements for the website improvements.&lt;/li>
&lt;li>&lt;strong>UI/UX Redesign&lt;/strong>:
&lt;ul>
&lt;li>Redesign and enhance the website’s navigation, layout, and visual elements to create an intuitive and visually engaging experience.&lt;/li>
&lt;li>Improve mobile responsiveness for broader accessibility across devices.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Website Performance &amp;amp; Stability&lt;/strong>:
&lt;ul>
&lt;li>Identify and resolve performance bottlenecks, bugs, or issues affecting speed, stability, and usability.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Mediglot Integration&lt;/strong>:
&lt;ul>
&lt;li>Integrate the Mediglot web application with PolyPhy, ensuring seamless functionality and a unified user experience for visualizing medicinal data alongside general network reconstructions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation&lt;/strong>:
&lt;ul>
&lt;li>Document the development process, challenges, and solutions in a clear and organized manner, ensuring transparent collaboration with mentors and the community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol></description></item><item><title>Type Narrowing: A Language Design Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uutah/type-narrowing/</link><pubDate>Sat, 01 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uutah/type-narrowing/</guid><description>&lt;p>Untyped languages such as JavaScript and Python provide a flexible starting
point for software projects, but eventually, the lack of reliable types
makes code hard to debug and maintain.
Gradually typed languages such
as
&lt;a href="https://www.typescriptlang.org/" target="_blank" rel="noopener">TypeScript&lt;/a>,
&lt;a href="https://flow.org/" target="_blank" rel="noopener">Flow&lt;/a>,
&lt;a href="https://www.mypy-lang.org/" target="_blank" rel="noopener">Mypy&lt;/a>,
and
&lt;a href="https://microsoft.github.io/pyright/#/" target="_blank" rel="noopener">Pyright&lt;/a>
address the problem with type checkers that can reason about an
ever-growing subset of untyped code.
Widening the subset with precise types is an ongoing challenge.&lt;/p>
&lt;p>Furthermore, designs for precise gradual types need to be reproducible
across languages.
Ideas that work well in one language need to be validated
in other contexts in a principled, scientific way to separate
deep insights from language-specific hacks.&lt;/p>
&lt;p>Type narrowing is a key feature of gradual languages.
Narrowing uses type tests in code to refine types and push
information forward along the paths that the program may follow.
For example, when a type test checks an object field, later
code can trust the type of the field:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">// item :: JSON Object
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">if typeof(item[&amp;#34;price&amp;#34;]) == &amp;#34;number&amp;#34;:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // item :: JSON Object,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // where field &amp;#34;price&amp;#34; :: Number
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> return item[&amp;#34;price&amp;#34;] + (item[&amp;#34;price&amp;#34;] * 0.30) // add tax
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Nearly every gradual language agrees that &lt;em>some form&lt;/em> of type narrowing is needed,
but there is widespread disagreement about how much support is enough.
TypeScript lets users define custom type tests, but it does not analyze
those tests to see whether they are reliable.
Flow does analyze tests.
TypeScript does not allow asymmetric type tests (example: &lt;code>is_even_number&lt;/code>),
but Flow, Mypy and Pyright all do!
None of the above track information compositionally through program
execution, but another gradual language called Typed Racket does.
Is the extra machinery in Typed Racket really worth the effort?&lt;/p>
&lt;p>Over the past several months, we have curated a language design
benchmark for type narrowing, &lt;strong>If-T&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/utahplt/ift-benchmark" target="_blank" rel="noopener">https://github.com/utahplt/ift-benchmark&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The benchmark presents type system challenges in a language-agnostic way
to facilitate reproducibility across languages.
It also includes a &lt;a href="https://github.com/utahplt/ifT-benchmark/blob/main/DATASHEET.md" target="_blank" rel="noopener">&lt;em>datasheet&lt;/em>&lt;/a>
to encourage cross-language comparisons
that focus on fundamental typing features rather than incidental differences
between languages.
So far, we have implemented the benchmark for five gradual languages.
There are many others to explore, and much more to learn.&lt;/p>
&lt;p>The goal of this project is to replicate and extend the If-T type narrowing
benchmark.
Outcomes include a deep understanding of principled type narrowing,
and of how to construct a benchmark that enables reproducible
cross-language comparisons.&lt;/p>
&lt;p>Related Work:&lt;/p>
&lt;ul>
&lt;li>Type Narrowing in TypeScript
&lt;a href="https://www.typescriptlang.org/docs/handbook/2/narrowing.html" target="_blank" rel="noopener">https://www.typescriptlang.org/docs/handbook/2/narrowing.html&lt;/a>&lt;/li>
&lt;li>Type Narrowing in Python
&lt;a href="https://typing.readthedocs.io/en/latest/spec/narrowing.html#typeguard" target="_blank" rel="noopener">https://typing.readthedocs.io/en/latest/spec/narrowing.html#typeguard&lt;/a>&lt;/li>
&lt;li>Logical Types for Untyped Languages
&lt;a href="https://doi.org/10.1145/1863543.1863561" target="_blank" rel="noopener">https://doi.org/10.1145/1863543.1863561&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="evaluate-new-gradual-languages">Evaluate New Gradual Languages&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>benchmark implementation&lt;/code>, &lt;code>programming languages&lt;/code>, &lt;code>types&lt;/code>&lt;/li>
&lt;li>Skills: Ruby, Lua, Python, Clojure, or PHP&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Small&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Bring the If-T Benchmark to new typecheckers.
Examples include
&lt;a href="https://sorbet.org/" target="_blank" rel="noopener">Sorbet&lt;/a>,
&lt;a href="https://hacklang.org/" target="_blank" rel="noopener">Hack&lt;/a>,
&lt;a href="https://luau.org/" target="_blank" rel="noopener">Luau&lt;/a>,
&lt;a href="https://pyre-check.org/" target="_blank" rel="noopener">Pyre&lt;/a>,
&lt;a href="https://github.com/facebookincubator/cinder" target="_blank" rel="noopener">Cinder / Static Python&lt;/a>,
&lt;a href="https://typedclojure.org/" target="_blank" rel="noopener">Typed Clojure&lt;/a>,
and
(potentially) &lt;a href="https://elixir-lang.org/blog/2024/06/12/elixir-v1-17-0-released/" target="_blank" rel="noopener">Elixir&lt;/a>.
Conduct a scientific, cross-language analysis to discuss the implications
of benchmark results.&lt;/p>
&lt;h3 id="do-unsound-narrowings-lead-to-exploits">Do Unsound Narrowings Lead to Exploits?&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>corpus study&lt;/code>, &lt;code>types&lt;/code>, &lt;code>counterexamples&lt;/code>&lt;/li>
&lt;li>Skills: TypeScript or Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Small&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Investigate type narrowing in practice through a corpus study of software projects.
Use the GitHub or Software Heritage APIs to search code for user-defined predicates
and other instances of narrowing. Search for vulnerabilities due to the unsound
typing of user-defined predicates.&lt;/p></description></item><item><title>Environmental NeTworked Sensor (ENTS)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/</link><pubDate>Fri, 31 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/</guid><description>&lt;h3 id="ents-i-web-portal-for-large-scale-sensor-networks">ENTS I: Web portal for large-scale sensor networks&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Data Visualization Dashboard" srcset="
/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp 400w,
/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_baae6484e015277af7b09e866b6869f5.webp 760w,
/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp"
width="760"
height="759"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Backend, Frontend, UI/UX, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> React, Javascript, Python, SQL, Git&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> Flask, Docker, CI/CD, AWS, Authentication&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:alevy1@ucsc.edu">Alec Levy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Environmental NeTworked Sensor (ENTS) platform, formerly the Open Sensing Platform (OSP), implements a data visualization website for monitoring microbial fuel cell sensors (see &lt;a href="https://github.com/jlab-sensing/DirtViz" target="_blank" rel="noopener">GitHub&lt;/a>). The mission is to scale up the current platform to support other researchers or citizen scientists in integrating their novel sensing hardware or microbial fuel cell sensors for monitoring and data analysis. Examples of sensors currently deployed include those measuring soil moisture, temperature, current, and voltage in outdoor settings. The software half of the project involves building upon our existing visualization web platform and adding features to support the mission. A live version of the website is available &lt;a href="https://dirtviz.jlab.ucsc.edu/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Below is a list of project ideas that would be beneficial to the ENTS project. You are not limited to the following projects, and we encourage new ideas that enhance the platform:&lt;/p>
&lt;ul>
&lt;li>Improve streaming functionality&lt;/li>
&lt;li>Generic interface for sensor measurements&lt;/li>
&lt;li>Logger registration&lt;/li>
&lt;li>Over the air (OTA) configuration updates&lt;/li>
&lt;li>Implement unit tests and API documentation&lt;/li>
&lt;/ul>
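&lt;p>One of the ideas above, a generic interface for sensor measurements, might be sketched as follows (a hypothetical shape for illustration, not the project&amp;rsquo;s actual schema):&lt;/p>

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class Measurement:
    """Generic sensor reading, decoupled from any specific hardware."""
    logger_id: str
    sensor_type: str  # e.g. "soil_moisture", "temperature", "voltage"
    value: float
    unit: str
    timestamp: datetime

def to_row(m: Measurement) -> dict:
    # Flatten into a dict suitable for a generic measurements table
    # or a JSON API payload.
    row = asdict(m)
    row["timestamp"] = m.timestamp.isoformat()
    return row

m = Measurement(
    logger_id="ents-001",
    sensor_type="soil_moisture",
    value=0.31,
    unit="m3/m3",
    timestamp=datetime(2025, 1, 31, tzinfo=timezone.utc),
)
```

&lt;p>A common record shape like this is what would let new sensing hardware plug into the visualization platform without schema changes per sensor type.&lt;/p>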
&lt;h3 id="ents-ii-hardware-to-for-large-scale-field-sensor-networks">ENTS II: Hardware to for large-scale field sensor networks&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Hardware" srcset="
/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_b1317e5e84a756a1081cbeec0e17af86.webp 400w,
/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_2fc59e21c5096f7f08aea36f5769242e.webp 760w,
/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_b1317e5e84a756a1081cbeec0e17af86.webp"
width="760"
height="460"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Embedded system, wireless communication, low-power remote sensing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> C/C++, Git, Github, PlatformIO&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> STM32 HAL, ESP32 Arduino, protobuf, python, knowledge of standard communication protocols (I2C, SPI, and UART)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:jlin143@ucsc.edu">Jack Lin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Environmental NeTworked Sensor (ENTS) node aims to be a general purpose hardware platform for outdoor sensing (e.g. agriculture, ecological monitoring, etc.). The typical use case involves a sensor deployment in an agricultural field, remotely uploading measurements without interfering with farming operations. The current hardware revision (the &lt;a href="https://github.com/jlab-sensing/soil_power_sensor" target="_blank" rel="noopener">Soil Power Sensor&lt;/a>) was originally designed for monitoring power output of microbial fuel cells using high fidelity voltage and current measurement channels, as well as auxiliary sensors such as the SDI-12 &lt;a href="https://metergroup.com/products/teros-21/" target="_blank" rel="noopener">TEROS-21 soil moisture sensor&lt;/a>. The primary activities of this project will involve low-level firmware design and implementation, but may also incorporate hardware design revisions if necessary. We are looking to expand functionality to other external sensors, as well as optimize for power consumption, via significant firmware design activities.&lt;/p>
&lt;p>Long-range, low-power wireless communication is achieved through a LoRa capable STM32 microcontroller, with in-lab experiments using an ESP32 microcontroller to enable the simpler WiFi interface. Both wireless interfaces upload measurements to our data visualization dashboard, &lt;strong>ENTS I&lt;/strong>. The combined goal across both of these projects is to create a system that enables researchers to test and evaluate novel sensing solutions. We are looking to make the device usable by a wide range of researchers who may not have a background in electronics, so we are interested in design activities that enhance user friendliness.&lt;/p>
&lt;p>In total there will be 2-4 people working on the hardware with progress being tracked on GitHub. Broader project planning is tracked through a Jira board. We intend to have weekly meetings to provide updates on current issue progress along with assigning tasks. Please reach out to &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a> if there are any questions or specific ideas for the project.&lt;/p>
&lt;p>Below is a list of project ideas that would be beneficial to the ENTS project. You are not limited to the following projects, and we encourage new ideas that enhance the platform:&lt;/p>
&lt;ul>
&lt;li>Backup logging via SD card&lt;/li>
&lt;li>I2C multiplexing for multiple instances of the same sensor&lt;/li>
&lt;li>Batch sensor measurement uploading&lt;/li>
&lt;/ul></description></item><item><title>Causeway: Scaling Experiential Learning Through Micro-Roles</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/causeway/</link><pubDate>Thu, 30 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/causeway/</guid><description>&lt;p>&lt;a href="https://causeway.web.app" target="_blank" rel="noopener">Causeway&lt;/a> is a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack. Most online coding tutorials focus on covering the technical syntax or features of a language or framework, which means that new developers don’t have great resources for building a holistic picture of how everything they learn connects to actually developing a complex web application. Causeway breaks down the process of developing a web application into a hierarchy of micro-roles, which provides learners with a clear pathway for learning that also translates to a clear process for developing an application. In the longer term, this would also enable learners to easily contribute to projects as they learn by taking on micro-roles for yet-to-be-developed projects. The platform uses the &lt;a href="https://developer.stackblitz.com/platform/api/webcontainer-api" target="_blank" rel="noopener">Stackblitz WebContainer API&lt;/a> to run full applications in the browser for interactive learning.&lt;/p>
&lt;p>Thus far, we have developed a version of the platform that walks learners through the process of developing UI components of a web application as well as containers that contain multiple UI components and are responsible for fetching data from the backend and handling events and updates to the database. We&amp;rsquo;d like to extend the content to cover defining the database schema and entire applications, and to other topics beyond web development like AI/ML. We&amp;rsquo;d like to add quizzes to the experience and explore ways to use Generative AI to augment the learning experience, e.g. to support planning, reflection, and assessment. Finally, we&amp;rsquo;d like to instrument the application with logs and analytics so we can better measure impact and learning outcomes, and develop a stronger CI/CD pipeline.&lt;/p>
&lt;h3 id="causeway--improving-the-core-infrastructure">Causeway / Improving the Core Infrastructure&lt;/h3>
&lt;p>The proposed work includes adding logging, analytics, and a production-level CI/CD pipeline, adding a robust testing framework, and refactoring some of our code into separate modules. Both roles will also contribute to running usability studies and documenting the platform.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="causeway--quizzes-and-generative-ai">Causeway / Quizzes and Generative AI&lt;/h3>
&lt;p>The proposed work includes extending the application to support quizzes, adding quizzes for the existing tasks, and exploring the use of generative AI to support the quizzes feature. Both roles will also contribute to running usability studies and documenting the platform.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase, Generative AI&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for Chip Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/openroad/openroad/</link><pubDate>Sun, 19 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/openroad/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit effort, originally funded by DARPA, with the aim of creating open-source EDA tools: an autonomous RTL-to-GDSII flow that completes in under 24 hours, to lower cost and boost innovation in IC design. The project is now supported by &lt;a href="https://precisioninno.com">Precision Innovations&lt;/a>.&lt;/p>
&lt;p>OpenROAD scales massively, supports education and workforce development (EWD), and sustains a broad ecosystem, making it a vital tool for a rapidly growing Semiconductor Industry.&lt;/p>
&lt;p>OpenROAD is the fastest onramp to gain knowledge, skills and create pathways for great career opportunities in chip design. You will develop important software and hardware design skills by contributing to these interesting projects. You will also have the opportunity to work with mentors from the OpenROAD project and other industry experts.&lt;/p>
&lt;p>We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact in the rapidly growing, global Semiconductor Industry.&lt;/p>
&lt;h3 id="improving-code-quality-in-openroad">Improving Code Quality in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Coding Best Practices in C++&lt;/code>, &lt;code>Code Quality Tooling&lt;/code>, &lt;code>Continuous Integration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arthur-koucher/">Arthur Koucher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>OpenROAD is a large and complex program. This project improves its code quality by resolving issues flagged by tools such as Coverity and clang-tidy. The clang sanitizers (ASAN/TSAN/UBSAN) should also be set up and integrated with the Jenkins CI.&lt;/p>
&lt;h3 id="gui-testing-in-openroad">GUI Testing in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Testing&lt;/code>, &lt;code>Continuous Integration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Qt&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peter-gadfort/">Peter Gadfort&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The OpenROAD GUI is crucial functionality that lets users view and investigate their designs. GUI testing is specialized and rather different from standard unit testing, so the GUI needs improved test coverage for both interaction and rendering. The GUI uses the Qt framework; an open-source testing tool such as &lt;a href="https://github.com/faaxm/spix" target="_blank" rel="noopener">spix&lt;/a> will be set up and key tests developed. This will provide the framework for all future GUI testing.&lt;/p>
&lt;h3 id="rectilinear-floorplans-in-openroad">Rectilinear Floorplans in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Electronic Design Automation&lt;/code>, &lt;code>Algorithms&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, data structures and algorithms&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eder-monteiro/">Eder Monteiro&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/augusto-berndt/">Augusto Berndt&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>OpenROAD supports block floorplans that are rectangular in shape. Some designs may require more complex shapes to fit. This project extends the tool to support rectilinear polygon shapes as floorplans. This will require upgrading data structures and algorithms in various parts of OpenROAD including floor plan generation, pin placement, and global placement.&lt;/p>
&lt;h3 id="lef-reader-and-database-enhancements-in-openroad">LEF Reader and Database Enhancements in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Electronic Design Automation&lt;/code>, &lt;code>Database&lt;/code>, &lt;code>Parsing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Boost Spirit parsers, Database, C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/osama-hammad/">Osama Hammad&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ethan-mahintorabi/">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>LEF (Library Exchange Format) is a standard format for describing physical design rules for integrated circuits. OpenROAD supports many LEF constructs, but some newer ones for advanced process nodes are not yet supported. This project adds support for parsing that information and storing it in OpenDB for use by the rest of the tool.&lt;/p>
&lt;h3 id="orassistant---llm-data-engineering-and-testing">ORAssistant - LLM Data Engineering and Testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Data Engineering&lt;/code>, &lt;code>Model Deployment&lt;/code>, &lt;code>Testing&lt;/code>, &lt;code>Full-Stack Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, database, evaluation, CI/CD, open-source or related software development, full-stack&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project is aimed at enhancing robustness and accuracy for &lt;a href="https://woset-workshop.github.io/PDFs/2024/11_ORAssistant_A_Custom_RAG_ba.pdf" target="_blank" rel="noopener">OR Assistant&lt;/a>, the &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">conversational assistant for OpenROAD&lt;/a> through comprehensive testing and evaluation. You will work with members of the OpenROAD team and other researchers to enhance the existing dataset to cover a wide range of use cases to deliver accurate responses more efficiently. This project will focus on data engineering and benchmarking and you will collaborate on a project on the LLM model engineering. Tasks include: creating evaluation pipelines, building databases to gather feedback, improving CI/CD, writing documentation, and improving the backend and frontend services as needed (non-exhaustive). You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.&lt;/p>
&lt;h3 id="orassistant---llm-model-engineering">ORAssistant - LLM Model Engineering&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Model Architecture&lt;/code>, &lt;code>Model Deployment&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, prompt engineering, fine-tuning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project is aimed at enhancing robustness and accuracy for &lt;a href="https://woset-workshop.github.io/PDFs/2024/11_ORAssistant_A_Custom_RAG_ba.pdf" target="_blank" rel="noopener">OR Assistant&lt;/a>, the &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">conversational assistant for OpenROAD&lt;/a> through enhanced model architectures. You will work with members of the OpenROAD team and other researchers to explore alternate architectures beyond the existing RAG-based implementation. This project will focus on improving reliability and accuracy of the existing model architecture. You will collaborate on a tandem project on data engineering for OR assistant. Tasks include: reviewing and understanding the state-of-the-art in retrieval augmented generation, implementing best practices, caching prompts, improving relevance and accuracy metrics, writing documentation and improving the backend and frontend services as needed (non-exhaustive). You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.&lt;/p></description></item><item><title>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</link><pubDate>Wed, 15 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene expression generation, retrieval-augmented generation, large models&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python, and familiarity with machine learning libraries such as PyTorch.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with spatial transcriptomics datasets and statistical modeling.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong>
&lt;ul>
&lt;li>Understanding of vision models, retrieval-based systems, and MLP architectures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong>
&lt;ul>
&lt;li>Familiarity with scRNA-seq data integration and computational biology tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating RAG models, building a robust database, and ensuring interpretable predictions, this project involves substantial computational and data preparation work.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial transcriptomics (ST) is a revolutionary technology that provides spatially resolved gene expression measurements, enabling researchers to study cellular behaviour within tissues with unprecedented detail. This technology has transformed our understanding of complex biological systems, such as disease progression, tissue development, and cellular heterogeneity. However, the widespread adoption of ST is limited by its high cost and technical requirements.&lt;/p>
&lt;p>Histology imaging, on the other hand, is far more accessible and cost-effective. If gene expression could be accurately predicted from histology images, it would enable researchers to leverage these abundant images for high-resolution biological insights without the need for expensive spatial transcriptomics experiments. This task has immense potential to democratize spatial transcriptomics research and significantly reduce costs.&lt;/p>
&lt;h3 id="challenges-in-current-approaches">&lt;strong>Challenges in Current Approaches&lt;/strong>&lt;/h3>
&lt;p>Current methods for predicting gene expression from histology images typically involve:&lt;/p>
&lt;ol>
&lt;li>Using large vision models to encode histology image patches into embeddings.&lt;/li>
&lt;li>Employing Multi-Layer Perceptrons (MLPs) to map these embeddings to gene expression profiles.&lt;/li>
&lt;/ol>
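&lt;p>The two-step pipeline above can be sketched minimally in Python. This is illustrative only: the "vision encoder" is stood in for by a fixed random projection, and the layer sizes and names are hypothetical, not those of any published model.&lt;/p>

```python
import numpy as np

# Sketch of the conventional pipeline: (1) encode a histology patch into an
# embedding, (2) map the embedding to a gene expression vector with an MLP.
rng = np.random.default_rng(0)
EMB_DIM, HIDDEN, N_GENES = 64, 32, 100  # illustrative sizes

def encode_patch(patch, W=rng.normal(size=(EMB_DIM, 3 * 16 * 16))):
    """Stand-in for a large vision encoder: flatten the patch and project it."""
    return W @ patch.reshape(-1)

# One-hidden-layer MLP (ReLU) as the embedding-to-expression mapping.
W1 = rng.normal(size=(HIDDEN, EMB_DIM)) * 0.1
W2 = rng.normal(size=(N_GENES, HIDDEN)) * 0.1

def predict_expression(patch):
    h = np.maximum(0.0, W1 @ encode_patch(patch))  # hidden activations
    return W2 @ h                                  # predicted expression profile

patch = rng.random((3, 16, 16))  # toy 16x16 RGB histology patch
expr = predict_expression(patch)
print(expr.shape)  # (100,)
```

&lt;p>Nothing in this mapping records &lt;em>why&lt;/em> a profile was predicted, which is exactly the interpretability gap discussed below.&lt;/p>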
&lt;p>While these approaches have shown promise, they suffer from two critical limitations:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Accuracy&lt;/strong>: The MLP-based mappings often fail to fully capture the biological complexity encoded in the histology images, leading to suboptimal predictions.&lt;/li>
&lt;li>&lt;strong>Interpretability&lt;/strong>: These models act as black boxes, providing no insight into the underlying biological rationale for the predictions. Researchers cannot determine why a specific gene expression profile was generated, limiting trust and utility in biological contexts.&lt;/li>
&lt;/ul>
&lt;h3 id="project-motivation">&lt;strong>Project Motivation&lt;/strong>&lt;/h3>
&lt;p>To overcome these limitations, this project proposes a novel &lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> framework for spatial transcriptomics. Instead of relying solely on black-box MLPs, RAG-ST will:&lt;/p>
&lt;ul>
&lt;li>Retrieve relevant examples from a curated database of paired histology images, scRNA-seq data, and gene expression profiles.&lt;/li>
&lt;li>Use these retrieved examples to inform and enhance the generation process, resulting in predictions that are both more accurate and biologically interpretable.&lt;/li>
&lt;/ul>
&lt;p>This approach not only grounds predictions in biologically meaningful data but also provides transparency by revealing which database entries influenced the results.&lt;/p>
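&lt;p>A hedged sketch of the retrieval step described above: find the database entries most similar to a query patch embedding, combine their expression profiles, and return the retrieved indices as provenance. All sizes, names, and the softmax-weighted combination are illustrative assumptions, not the RAG-ST design itself.&lt;/p>

```python
import numpy as np

# Retrieval-augmented prediction sketch over a toy reference database of
# paired (patch embedding, gene expression profile) entries.
rng = np.random.default_rng(1)
DB_EMB = rng.normal(size=(500, 64))   # database of patch embeddings
DB_EXPR = rng.random((500, 100))      # paired gene expression profiles

def retrieve_and_generate(query_emb, k=5):
    # Cosine similarity between the query and every database embedding.
    sims = DB_EMB @ query_emb / (
        np.linalg.norm(DB_EMB, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top = np.argsort(sims)[-k:][::-1]  # indices of the k nearest entries
    weights = np.exp(sims[top]) / np.exp(sims[top]).sum()  # softmax weights
    prediction = weights @ DB_EXPR[top]  # similarity-weighted average profile
    return prediction, top               # `top` reveals which entries influenced it

pred, sources = retrieve_and_generate(rng.normal(size=64))
print(pred.shape, len(sources))  # (100,) 5
```

&lt;p>Returning &lt;code>sources&lt;/code> alongside the prediction is what distinguishes this from the black-box MLP: a researcher can inspect exactly which reference tissues drove the output.&lt;/p>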
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Database Construction&lt;/strong>:
&lt;ul>
&lt;li>Curate a large and diverse database of histology images paired with scRNA-seq and gene expression data.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Model Development&lt;/strong>:
&lt;ul>
&lt;li>Develop a RAG framework combining vision-based encoders and retrieval-enhanced generation techniques.&lt;/li>
&lt;li>Incorporate interpretability mechanisms to link predicted gene expressions to retrieved examples.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Benchmarking&lt;/strong>:
&lt;ul>
&lt;li>Assess RAG-ST against state-of-the-art methods, focusing on accuracy, interpretability, and biological validity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Curated Database&lt;/strong>:
&lt;ul>
&lt;li>A publicly available, well-documented database of histology images and gene expression profiles.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>RAG-ST Framework&lt;/strong>:
&lt;ul>
&lt;li>An open-source Python implementation of the RAG-ST model, with retrieval, generation, and visualization tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark Results&lt;/strong>:
&lt;ul>
&lt;li>Comprehensive evaluations demonstrating the benefits of RAG-ST over conventional pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>:
&lt;ul>
&lt;li>User-friendly guides to facilitate adoption by the spatial transcriptomics research community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>By integrating retrieval-augmented generation with large models, RAG-ST represents a paradigm shift in spatial transcriptomics. It offers a cost-effective, accurate, and interpretable solution for gene expression prediction, democratizing access to high-quality spatial transcriptomic insights and fostering advancements in biological research.&lt;/p>
&lt;hr></description></item><item><title>Writing a blog about your OSRE 2025 project</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/admin/20241021-admin/</link><pubDate>Mon, 21 Oct 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/admin/20241021-admin/</guid><description>&lt;p>OSRE participants are required to blog three times during their summer program. The first blog is a chance to introduce yourself and your project, the second occurs around the mid-point of the project, and a final blog post is expected as part of your final project deliverable. The organization administrator will send emails with specific dates. Instructions for the blog are given below. All blogs should include links to your proposal, presentations, and any deliverables/products, as well as an overview of your experience. Check out the student pages from previous years to get an idea of content and size.&lt;/p>
&lt;p>We will also ask students and contributors to provide regular status updates which will help track your activities. The organization administrator will provide more details once the program work begins.&lt;/p>
&lt;h2 id="making-a-pull-request-for-your-blog">Making a pull request for your blog&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>Fork the &lt;a href="https://github.com/ucsc-ospo/ucsc-ospo.github.io" target="_blank" rel="noopener">git repository&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>If you haven&amp;rsquo;t already done so, add your profile using &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/osredocs/formentors/#instructions-for-adding-a-mentor">these instructions&lt;/a>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>IMPORTANT&lt;/strong>: Under &lt;code>user_groups:&lt;/code> add &lt;code>- 2025 Contributors&lt;/code> (as opposed to any of the two mentor groups)&lt;/li>
&lt;li>The short bio and any other information goes below the frontmatter&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Post your blog&lt;/p>
&lt;ul>
&lt;li>Add &lt;code>/content/report/osre25/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md&lt;/code>&lt;/li>
&lt;li>Add a frontmatter to &lt;code>index.md&lt;/code>, using the labels below&lt;/li>
&lt;li>Blog text goes below the frontmatter&lt;/li>
&lt;li>In that same directory include a picture and call it &lt;code>featured.png&lt;/code> (also supports &lt;code>.jpg&lt;/code>, &lt;code>.jpeg&lt;/code>)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Commit to your fork and make a pull request. &lt;a href="mailto:ospo-info-group@ucsc.edu">Email OSRE Admins&lt;/a> with questions.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="example-frontmatter-and-text-body">Example frontmatter and text body&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">title: &amp;#34;YOUR TITLE&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">subtitle: &amp;#34;YOUR SUBTITLE (OPTIONAL)&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">summary:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">authors:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> - USERNAME2
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">tags: [&amp;#34;osre25&amp;#34;]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">categories: []
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">date: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">lastmod: YYYY-MM-DD
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">featured: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">draft: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Featured image
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># To use, add an image named `featured.jpg/png` to your page&amp;#39;s folder.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">image:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> caption: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> focal_point: &amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> preview_only: false
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">---
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">As part of the [PROJECTNAME](/project/osre25/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Improving Usability and Performance in cc-snapshot: My Midterm Update</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250724-zahratm/</link><pubDate>Wed, 24 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/uchicago/cc-snapshot/20250724-zahratm/</guid><description>&lt;p>Hi! I&amp;rsquo;m Zahra Temori, a rising junior studying Computer Science at the University of Delaware. This summer, I’ve had the exciting opportunity to participate in the Chameleon Summer Reproducibility Program, where I’ve been working under the mentorship of Paul Marshall.
In this blog post, I’d love to share a midterm update on my project &lt;a href="https://github.com/ChameleonCloud/cc-snapshot" target="_blank" rel="noopener">cc-snapshot&lt;/a> and highlight what I’ve accomplished so far, what I’ve learned, and what’s coming next. It&amp;rsquo;s been a challenging but rewarding experience diving into real-world research and contributing to tools that help make science more reproducible!&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>CC-Snapshot is a powerful tool on the Chameleon testbed that enables users to package their customized environments for reproducibility and experiment replication. In research, reproducibility is essential. It allows scientists to run experiments consistently, share complete setups with others, and avoid environment-related errors. However, the current snapshotting mechanism has limitations that make it unreliable and inefficient, particularly in terms of usability and performance. These issues can slow down workflows and create barriers for users trying to reproduce results. Our goal is to improve both the usability and performance of the cc-snapshot tool. A more user-friendly and optimized system means that users can create and restore snapshots more quickly and easily, without needing to manually rebuild environments, ultimately saving time and improving reliability in scientific computing.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>To structure the work, we divided the project into two main phases:&lt;/p>
&lt;ol>
&lt;li>Improving usability, and&lt;/li>
&lt;li>Optimizing performance.&lt;/li>
&lt;/ol>
&lt;p>I’ve nearly completed the first phase and have just started working on the second.&lt;/p>
&lt;h2 id="phase-one--usability-improvements">Phase One – Usability Improvements&lt;/h2>
&lt;p>The original version of the cc-snapshot tool had several usability challenges that made it difficult for users to work with and for developers to maintain. These issues included a rigid interface, lack of flexibility, and limited testing support, all of which made the tool harder to use and extend.
To address these, I worked on the following improvements:&lt;/p>
&lt;p>&lt;strong>Problem&lt;/strong>: The command-line interface was limited and inflexible. Users couldn’t easily control features or customize behavior, which limited their ability to create snapshots in different scenarios.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: I enhanced the CLI by adding:&lt;/p>
&lt;ul>
&lt;li>A flag to disable automatic updates, giving users more control.&lt;/li>
&lt;li>A &lt;code>--dry-run&lt;/code> flag to simulate actions before actually running them, which is useful for testing and safety.&lt;/li>
&lt;li>Support for a custom source path, allowing snapshots of specific directories. This makes the tool much more useful for testing smaller environments.&lt;/li>
&lt;/ul>
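&lt;p>cc-snapshot itself is a shell script, so the following Python sketch is purely illustrative of the flag semantics described above; the flag names and defaults here are hypothetical, not the tool's actual interface.&lt;/p>

```python
import argparse

# Hypothetical sketch of the CLI additions: an update-disabling flag,
# a dry-run flag, and a custom source path for partial snapshots.
def build_parser():
    p = argparse.ArgumentParser(prog="cc-snapshot-sketch")
    p.add_argument("--no-update", action="store_true",
                   help="disable automatic updates before snapshotting")
    p.add_argument("--dry-run", action="store_true",
                   help="print the actions that would run without executing them")
    p.add_argument("--source-path", default="/",
                   help="snapshot only this directory instead of the whole system")
    return p

args = build_parser().parse_args(["--dry-run", "--source-path", "/home/cc/project"])
print(args.dry_run, args.source_path)  # True /home/cc/project
```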
&lt;p>&lt;strong>Problem&lt;/strong>: The code lacked automated tests. Without tests, developers have to manually verify everything, which is time-consuming and error-prone.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: I implemented a basic test suite and integrated it with GitHub Actions, so the tool is automatically tested on every pull request.&lt;/p>
&lt;p>&lt;strong>Problem&lt;/strong>: The tool didn’t follow a modular design. The logic was tightly coupled, making it hard to isolate or extend parts of the code.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: I refactored the code by extracting key functions. This makes the code cleaner, easier to understand, and more maintainable in the long term.&lt;/p>
&lt;h2 id="next-steps--phase-two-performance-optimization">Next Steps – Phase Two: Performance Optimization&lt;/h2>
&lt;p>After improving the usability of the cc-snapshot tool, the next phase of the project focuses on addressing key performance bottlenecks. Currently, the snapshotting process can be slow and resource-intensive, which makes it less practical for frequent use, especially with large environments.&lt;/p>
&lt;p>&lt;strong>Problem 1: Slow Image Compression&lt;/strong>
The current implementation uses the qcow2 image format with zlib compression, which is single-threaded and often inefficient for large disk images. This leads to long snapshot creation times and high CPU usage.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: I will benchmark and compare different compression strategies, specifically:&lt;/p>
&lt;ul>
&lt;li>qcow2 with no compression&lt;/li>
&lt;li>qcow2 with zstd compression, which is faster and multi-threaded&lt;/li>
&lt;li>raw image format, which has no compression but may benefit from simpler processing&lt;/li>
&lt;/ul>
&lt;p>These tests will help determine which method provides the best tradeoff between speed, size, and resource usage.&lt;/p>
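&lt;p>The three strategies map naturally onto &lt;code>qemu-img convert&lt;/code> invocations (assuming a qemu version recent enough to support &lt;code>compression_type=zstd&lt;/code>). A small sketch of the command lines the benchmark would compare; in a real run each command would be wrapped in timing and size measurements:&lt;/p>

```python
# Build the qemu-img command line for each compression strategy under test.
# Actually benchmarking would run each with subprocess.run() inside
# time.perf_counter() calls and record the resulting image size.
def convert_cmd(src, dst, strategy):
    if strategy == "raw":
        return ["qemu-img", "convert", "-O", "raw", src, dst]
    if strategy == "qcow2":        # qcow2, no compression
        return ["qemu-img", "convert", "-O", "qcow2", src, dst]
    if strategy == "qcow2-zstd":   # qcow2 with multi-threaded zstd compression
        return ["qemu-img", "convert", "-O", "qcow2", "-c",
                "-o", "compression_type=zstd", src, dst]
    raise ValueError(strategy)

for s in ("raw", "qcow2", "qcow2-zstd"):
    print(" ".join(convert_cmd("disk.img", f"out-{s}.img", s)))
```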
&lt;p>&lt;strong>Problem 2: Suboptimal Storage Backend&lt;/strong>
Snapshots are currently uploaded to Glance, which can be slow and unreliable. Uploading large images can take several minutes, and this slows down the user workflow.&lt;/p>
&lt;p>&lt;strong>Solution&lt;/strong>: I will compare Glance with a faster alternative, the Object Store. Smaller, compressed images may upload significantly faster to the Object Store (e.g., 30 seconds vs. 2 minutes). By measuring upload speeds and reliability, I can recommend a better default or optional backend for users.&lt;/p>
&lt;h2 id="how-i-will-measure-performance">How I will Measure Performance&lt;/h2>
&lt;p>To understand the impact of different strategies, I will try to collect detailed metrics across three stages:&lt;/p>
&lt;ol>
&lt;li>Image creation: How long it takes to build the image, depending on compression and format&lt;/li>
&lt;li>Image upload: How quickly the snapshot can be transferred to Glance or Object Store&lt;/li>
&lt;li>Instance boot time: How fast a new instance can start from that image (compressed formats must be decompressed)&lt;/li>
&lt;/ol>
&lt;p>I will run multiple tests for each scenario and record performance metrics like CPU usage, memory usage, disk throughput, and total time for each step. This will help identify the most efficient and practical configuration for real-world use.&lt;/p>
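&lt;p>The measurement plan above can be sketched as a small timing harness: run each scenario several times, time each of the three stages, and average the results. The stage functions here are toy stand-ins; the real stages would invoke the snapshot, upload, and boot steps.&lt;/p>

```python
import time

# Hypothetical harness for the three-stage plan: time image creation,
# upload, and instance boot, repeated per scenario, then average.
def time_stage(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def run_scenario(create, upload, boot, repeats=3):
    metrics = {"create": [], "upload": [], "boot": []}
    for _ in range(repeats):
        image, t = time_stage(create)
        metrics["create"].append(t)
        _, t = time_stage(upload, image)
        metrics["upload"].append(t)
        _, t = time_stage(boot, image)
        metrics["boot"].append(t)
    # Average duration per stage, in seconds.
    return {stage: sum(ts) / len(ts) for stage, ts in metrics.items()}

# Toy stand-ins for the real create/upload/boot steps:
avg = run_scenario(lambda: "img", lambda i: None, lambda i: None)
print(sorted(avg))  # ['boot', 'create', 'upload']
```

&lt;p>CPU, memory, and disk-throughput counters would be sampled alongside these wall-clock timings in the real benchmark.&lt;/p>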
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Addressing the current usability and performance issues in cc-snapshot is essential to improving the overall user experience. By making the tool easier to use, faster, and more flexible, we can support researchers and developers who depend on reproducible computing for their work. So far, I’ve worked on enhancing the tool’s interface, adding testing support, and refactoring the codebase for better maintainability. In the next phase, I’ll be focusing on benchmarking different compression methods, image formats, and storage backends to improve speed and efficiency.
These improvements will help make cc-snapshot a more powerful and user-friendly tool for the scientific community.&lt;/p>
&lt;p>Stay tuned for the next update and thank you for following my journey!&lt;/p></description></item></channel></rss>