Reports | UCSC OSPO

Final Report for Smart Environments

Wed, 05 Nov 2025 00:00:00 +0000

Introduction

Creating the software environment that code needs in order to run is a significant challenge in software development. For a piece of open-source research software, setting up the environment's dependencies can take significant manual effort, and although reproducibility has become a growing concern in the research community, environment setup remains time-consuming, error-prone, and often poorly documented. Existing automation methods struggle with the complexity of managing diverse languages, dependencies, and hardware. In Smart Environments, I have created ENVAGENT, a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals.

To assess this capability, a new benchmark, ENVBENCH, was created, containing 54 popular projects across seven languages. Results show ENVAGENT dramatically improves environment construction compared to current agents (+16.2%). Furthermore, the system shows initial promise in dynamically adjusting cloud-based hardware resources based on the code’s needs.

Method

EnvAgent

The EnvAgent I created during my time at OSRE utilizes a multi-agent workflow to automatically build software execution environments. The process is structured into three phases: preparation, construction, and refinement.

Phase 1 (Preparation): Specialized agents collect information about the software repository – its structure, relevant files, and the host system’s hardware specifications (CPU, memory, etc.). This data is then used by a planning agent to generate a detailed, step-by-step instruction set for creating a functional Dockerfile.

Phase 2 (Construction): Two agents work in tandem: one generates or modifies the Dockerfile based on the plan, while the other executes the Dockerfile within an isolated container, capturing any errors.

Phase 3 (Refinement): A final agent analyzes the container execution data, identifying areas for improvement in the Dockerfile. This process repeats until a stable, executable environment is achieved.

To improve efficiency, EnvAgent incorporates rule-based tools for predictable tasks like directory setup and log management, reducing the need for complex agent reasoning. This combination of intelligent agents and automated routines (“scaffolding”) ensures a robust and adaptive system.
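The construct-and-refine loop described above can be sketched roughly as follows. This is a minimal illustration, not EnvAgent's actual code: the function names, the `BuildResult` type, and the toy "agents" are all hypothetical stand-ins, and a real system would call language models and a container runtime where the stubs are.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class BuildResult:
    ok: bool
    log: str


def build_environment(
    plan: str,
    write_dockerfile: Callable[[str, List[str]], str],
    run_build: Callable[[str], BuildResult],
    analyze: Callable[[BuildResult], str],
    max_rounds: int = 5,
) -> Tuple[str, bool, int]:
    """Generate a Dockerfile, try it, and feed failures back until it builds."""
    feedback: List[str] = []
    dockerfile = ""
    for round_no in range(1, max_rounds + 1):
        dockerfile = write_dockerfile(plan, feedback)  # Phase 2: generation agent
        result = run_build(dockerfile)                 # Phase 2: execution agent
        if result.ok:
            return dockerfile, True, round_no
        feedback.append(analyze(result))               # Phase 3: refinement agent
    return dockerfile, False, max_rounds


# Hypothetical toy agents, used only to exercise the loop.
def toy_writer(plan: str, feedback: List[str]) -> str:
    lines = ["FROM python:3.11-slim"]
    if any("git" in note for note in feedback):
        lines.append("RUN apt-get update && apt-get install -y git")
    lines.append("RUN pip install -r requirements.txt")
    return "\n".join(lines)


def toy_runner(dockerfile: str) -> BuildResult:
    if "install -y git" not in dockerfile:
        return BuildResult(False, "error: git: command not found")
    return BuildResult(True, "build succeeded")


def toy_analyzer(result: BuildResult) -> str:
    return "install git" if "git" in result.log else "unknown failure"


dockerfile, ok, rounds = build_environment("plan", toy_writer, toy_runner, toy_analyzer)
```

Passing the agents in as plain callables keeps the loop itself rule-based and easy to test, which is in the spirit of the scaffolding idea: the control flow is fixed, and only the individual steps require agent reasoning.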

EnvEval Benchmark

In addition to the agent, one significant contribution is the manual curation of a benchmark that measures the quality of generated environments. EnvEval is a benchmark specifically designed to assess environment setup quality across 54 carefully curated open-source repositories. They are chosen from both the Chameleon reproducible artifacts and the Multi-SWE-bench dataset. EnvEval contains JSON rubrics that can be used to automatically determine the quality of constructed environments.

Each rubric is divided into three parts, corresponding to three major objectives that a successfully constructed environment should have:

Structure: Checks for basic directory structure, file presence, and environment variables.
Configuration: Asks the question “Is this configured?” and checks whether dependencies have been correctly configured.
Functionality: Asks the question “Is this usable?” and runs actual tests to see whether the expected functionality is present.

There are many tests in each category, and their weights are adjusted based on their importance.
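As a rough illustration of how a weighted rubric could be turned into a score, consider the sketch below. The JSON layout, check names, and weights are invented for the example and are not EnvEval's actual schema; only the three category names come from the description above.

```python
import json

# Hypothetical rubric layout: three categories, each a list of weighted checks.
RUBRIC_JSON = """
{
  "structure":     [{"id": "src_dir_exists", "weight": 1.0},
                    {"id": "env_var_set", "weight": 0.5}],
  "configuration": [{"id": "deps_installed", "weight": 2.0}],
  "functionality": [{"id": "unit_tests_pass", "weight": 3.0}]
}
"""


def score_environment(rubric, passed):
    """Return (per-category scores, overall score), each on a 0-100 scale."""
    per_category = {}
    total_weight = 0.0
    total_earned = 0.0
    for category, checks in rubric.items():
        weight = sum(c["weight"] for c in checks)
        earned = sum(c["weight"] for c in checks if c["id"] in passed)
        per_category[category] = 100.0 * earned / weight if weight else 0.0
        total_weight += weight
        total_earned += earned
    overall = 100.0 * total_earned / total_weight if total_weight else 0.0
    return per_category, overall


rubric = json.loads(RUBRIC_JSON)
per_category, overall = score_environment(rubric, {"src_dir_exists", "deps_installed"})
```

With these made-up weights, an environment that has the right files and dependencies but fails its functional test still loses almost half of the overall score, which is the effect that importance-based weighting is meant to produce.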

Evaluation

Baseline Systems:

The study compared EnvAgent to two established automated code generation systems: one utilizing Anthropic’s advanced reasoning models and the other employing OpenAI’s code-focused models. These systems were chosen for their strong performance in creating software code and their prevalence in automated engineering processes. Both baselines were given full access to the target software repositories and complete details about the host system’s hardware.

Evaluation Metrics:

The performance of EnvAgent was assessed using three key metrics: the ability to create working environments, the quality of those environments, and a single combined score. Results showed EnvAgent significantly outperformed the baselines, achieving a 33.91% improvement in the final overall score – reaching 74.01, which was higher than the best baseline score of 30.10. This suggests that EnvAgent both produced more functional environments and ensured greater accuracy through extensive testing.

Conclusion

The process of creating the necessary software environments for code agents is a major hurdle in scaling up research and development. Currently, this task relies heavily on manual labor. To address this, a new system, ENVAGENT, was created to automatically build these environments using intelligent agents and by understanding dependencies. A new benchmark, ENVBENCH, was also developed to assess this system’s effectiveness. Preliminary results demonstrate a significant improvement – ENVAGENT achieved a 33.91% increase in success rates compared to existing automated agents, representing a substantial step towards more efficient and reproducible research.

Thank you!

; 20251105-Sam_Huang

Final Blog: Rectilinear Floorplans in OpenROAD

Wed, 22 Oct 2025 00:00:00 +0000

Final Progress: Enabling Rectilinear Floorplanning in OpenROAD

Hello! I’m excited to share my final progress on implementing rectilinear (polygonal) die support in OpenROAD’s floorplanning flow as part of Google Summer of Code 2025. Under the guidance of my mentors Eder Monteiro and Augusto Berndt, we’ve made significant strides in extending OpenROAD to handle non-rectangular die shapes.

Here’s a link to my original proposal

Project Overview

This project aims to add support for rectilinear floorplans in OpenROAD, an open-source EDA tool used for digital chip design. Currently, OpenROAD only supports rectangular floorplans, which limits its use in modern designs that often require more complex shapes, especially in advanced packaging, chiplet architectures, and 3D ICs.

The project enables users to define floorplans using arbitrary rectilinear shapes made of $90^{\circ}$ corners. It involves three main components:

Accepting polygonal input during floorplan setup
Generating standard cell rows and routing tracks that follow the shape boundaries
Updating pin placement logic to work with irregular outlines

By enabling these capabilities, OpenROAD becomes more flexible and suitable for real-world designs where blocks may need to fit together like puzzle pieces. This can lead to better area utilization and potentially shorter interconnects.

The core challenge is maintaining robustness and backward compatibility while introducing this major new feature that touches multiple aspects of the design flow.

Pull Requests made

Support for Rectilinear dies in PPL (Pin Placement) - https://github.com/The-OpenROAD-Project/OpenROAD/pull/8182
Support for Rectilinear dies in IFP (Init Floorplan) - https://github.com/The-OpenROAD-Project/OpenROAD/pull/7893

Key Contributions

Phase 1: Init Floorplan (IFP) Module Support

The first half of the project focused on adding support for rectilinear floorplans in the IFP (Init Floorplan) module. This foundational work established the infrastructure for handling non-rectangular die shapes.

1. Polygonal Die Definition

Implemented support for accepting polygon vertices as input to define rectilinear die shapes.
Modified the TCL interfaces to accept a list of vertices of the rectilinear die in the -die_area and -core_area parameters and automatically switch to rectilinear flow.
Developed validation logic to ensure polygons are rectilinear and valid.
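A validity check of the kind mentioned in the last item might look like the sketch below. It is written in Python for brevity (OpenROAD itself is C++), the function name is hypothetical, and it only tests that every edge is axis-aligned and that horizontal and vertical edges alternate; it does not check for self-intersection, which a real validator would also need to do.

```python
def is_rectilinear(vertices):
    """True if edges are axis-aligned, non-degenerate, and alternate H/V."""
    n = len(vertices)
    if n < 4 or n % 2 != 0:
        return False  # a closed rectilinear outline needs an even vertex count
    prev_horizontal = None
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        horizontal = y1 == y2 and x1 != x2
        vertical = x1 == x2 and y1 != y2
        if not (horizontal or vertical):
            return False  # diagonal or zero-length edge
        if prev_horizontal is not None and horizontal == prev_horizontal:
            return False  # two collinear edges in a row (redundant vertex)
        prev_horizontal = horizontal
    return True


# An L-shaped outline given as a vertex list, in the spirit of -die_area input.
l_shape = [(0, 0), (10, 0), (10, 5), (5, 5), (5, 10), (0, 10)]
l_shape_ok = is_rectilinear(l_shape)
```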

2. Standard Cell Row Generation

Developed a scanline-based algorithm to generate standard cell rows that conform to complex polygonal boundaries.
The algorithm sweeps horizontally across the die area and identifies valid row regions within the polygon.
The existing routing track generation logic could be reused directly for rectilinear die shapes.
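The scanline idea can be illustrated with a small Python sketch. This is not the OpenROAD implementation: the function names are hypothetical, coordinates are plain integers, and details such as core margins, site alignment, and row orientation are ignored. For each candidate row, it finds the x-intervals where the full row height lies inside the polygon and emits one row per interval.

```python
def spans_at(vertices, y):
    """Interior x-intervals of a rectilinear polygon along the line at height y."""
    xs = []
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        if x1 == x2 and min(y1, y2) < y < max(y1, y2):  # vertical edge crosses y
            xs.append(x1)
    xs.sort()
    return list(zip(xs[0::2], xs[1::2]))  # even-odd rule: inside between pairs


def intersect(a, b):
    """Intersect two lists of x-intervals."""
    out = []
    for a0, a1 in a:
        for b0, b1 in b:
            lo, hi = max(a0, b0), min(a1, b1)
            if lo < hi:
                out.append((lo, hi))
    return sorted(out)


def make_rows(vertices, row_height, site_width):
    """Return rows as (x_origin, y_origin, site_count) tuples."""
    ys = sorted({y for _, y in vertices})
    rows = []
    y = ys[0]
    while y + row_height <= ys[-1]:
        top = y + row_height
        # The cross-section only changes at vertex heights, so sample once
        # per band between them and keep what is inside in every band.
        cuts = [y] + [v for v in ys if y < v < top] + [top]
        spans = None
        for lo, hi in zip(cuts, cuts[1:]):
            band = spans_at(vertices, (lo + hi) / 2)
            spans = band if spans is None else intersect(spans, band)
        for x0, x1 in spans:
            sites = (x1 - x0) // site_width
            if sites > 0:
                rows.append((x0, y, sites))
        y = top
    return rows


u_shape = [(0, 0), (30, 0), (30, 20), (20, 20), (20, 10), (10, 10), (10, 20), (0, 20)]
rows = make_rows(u_shape, row_height=5, site_width=2)
```

For the U-shaped outline, the rows below the notch span the full width, while the rows beside the notch split into a left and a right segment.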

3. Testing and Validation

Created comprehensive test cases for L-shaped, T-shaped, and other rectilinear configurations for floorplan creation and row generation.
Ensured backward compatibility with rectangular floorplans.
Added error handling for edge cases like invalid polygon specifications.

Demo: U-Shaped Die Row Generation

One of our test cases involved generating rows for a U-shaped die. Here is a snapshot from the OpenROAD GUI displaying perfectly laid out rows:

Phase 2: Pin Placement (PPL) Module Support

1. Core Data Structure Migration

Leveraged the odb::Line class instead of the simple Edge enum (which only handled 4 rectangular edges) to store edge data of rectilinear dies. This allows the system to handle an arbitrary number of polygon edges.
This required refactoring nearly every function in the pin placement pipeline, as the Edge enum was deeply embedded throughout the codebase.
The new representation is more flexible and can handle N-sided polygons while maintaining clean abstractions.

2. Pin Slot Calculation

Rewrote the defineSlots() function family to work with polygon edges while maintaining compatibility with existing die shapes.
Ensured slots are generated only within valid polygon boundaries.

3. Pin Orientation Algorithm

One of the most challenging aspects was determining the correct orientation for pins on polygon edges. For rectangular dies, this is trivial, but for complex, concave polygons, it’s non-trivial.
Leveraged a ray-casting (point-in-polygon) approach to determine the correct pin orientations. The algorithm casts rays from edge midpoints to determine which side faces the interior of the polygon.
This ensures pins are correctly oriented and handles complex cases like concave polygons.
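One way to picture the orientation test is the sketch below, which uses the standard even-odd ray-casting rule. It is an illustration under assumptions, not the PPL code: the function names and the `probe` offset are hypothetical, and the probe distance must be smaller than the narrowest feature of the die.

```python
def point_in_polygon(px, py, vertices):
    """Even-odd ray casting: cast a ray toward +x and count edge crossings."""
    inside = False
    n = len(vertices)
    for i in range(n):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % n]
        if (y1 > py) != (y2 > py):  # edge straddles the ray's height
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > px:
                inside = not inside
    return inside


def inward_direction(edge, vertices, probe=0.5):
    """Unit normal of an axis-aligned edge that points into the polygon."""
    (x1, y1), (x2, y2) = edge
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    # Horizontal edges can only face up or down; vertical ones left or right.
    candidates = [(0, 1), (0, -1)] if y1 == y2 else [(1, 0), (-1, 0)]
    for dx, dy in candidates:
        if point_in_polygon(mx + dx * probe, my + dy * probe, vertices):
            return (dx, dy)
    raise ValueError("edge does not border the polygon interior")


# T-shaped die: a 30-wide top bar sitting on a 10-wide stem.
t_shape = [(10, 0), (20, 0), (20, 20), (30, 20), (30, 30), (0, 30), (0, 20), (10, 20)]
underside = inward_direction(((20, 20), (30, 20)), t_shape)  # underside of the bar
stem_side = inward_direction(((20, 0), (20, 20)), t_shape)   # right side of the stem
```

On a rectangle the answer follows directly from which of the four edges a pin sits on; on a concave outline such as the T shape, the underside of the bar faces up into the die even though it is a "bottom" edge locally, which is why a geometric test is needed.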

4. Hungarian Matching and Simulated Annealing for Polygons

Successfully extended both Hungarian Matching and Simulated Annealing to work with rectilinear dies as well as regular dies.
The flow now checks if the provided die is rectilinear and intelligently switches the flow accordingly.

Demo: T-Shaped Die Pin Placement

The image below shows pins placed on a rectilinear, T-shaped die:

Code Quality

Followed OpenROAD coding standards and conventions
Comprehensive error handling and validation
Extensive code reviews with multiple rounds of refinements
Well-documented functions and algorithms

Testing and Validation

Created multiple test cases covering various rectilinear shapes
Regression testing to ensure backward compatibility
Edge case handling (concave polygons, tight geometries, etc.)
Integration testing with downstream OpenROAD flows

Future Work

Supporting constraints for pin placement in PPL - currently in progress
Improved GUI support for viewing and editing polygonal floorplans
Further optimizing algorithms for very large polygons

Acknowledgements

I would like to thank my mentors, Eder Monteiro and Augusto Berndt, for their patience, support, and guidance throughout the project. Thanks to Stephanie and the entire UC OSPO team, as well as Google Summer of Code, for providing me with this incredible opportunity.

Scenic-RoboSuite Integration: Building the First Working Prototype

Mon, 29 Sep 2025 00:00:00 +0000

I’m Sahil, presenting the first working prototype of the Scenic-RoboSuite integration. This project is being mentored by Daniel Fremont and Eric Vin.

After months of development, we have achieved a functional prototype of the Scenic-RoboSuite interface. Researchers can now write basic declarative robotic manipulation scenarios in Scenic that execute with physics simulation in RoboSuite. While still in development, the prototype demonstrates the feasibility and potential of bridging probabilistic scenario generation with detailed robot control.

Major Achievements

MJCF XML Injection

The interface introduces direct MJCF XML support, allowing Scenic to build RoboSuite-native manipulable objects from raw XML definitions. Users can define custom objects with complex mesh geometries, textures, and physics properties directly in their Scenic scenarios:

dragon_xml = '''
<mujoco>
  <asset>
    <mesh name="dragon_mesh" file="dragon.stl" scale="0.01 0.01 0.01"/>
    <texture file="dragon_texture.png"/>
  </asset>
  <worldbody>
    <body name="object">
      <geom mesh="dragon_mesh" type="mesh"/>
    </body>
  </worldbody>
</mujoco>
'''

dragon = new CustomObject with mjcfXml dragon_xml

The system automatically handles collision geometry generation, joint creation for physics, and asset file resolution.

Complex Mesh Object Support

Import and manipulate arbitrary 3D models (STL, OBJ) with automatic mesh repair and texture mapping. The interface resolves file paths relative to Scenic files, copies assets to temporary directories for MuJoCo, and converts textures (JPG to PNG) when needed. This enables using custom robotic tools, industrial parts, or any 3D model in manipulation scenarios.

Custom Arena Definition

Define complete custom environments using MJCF XML, extending beyond RoboSuite’s built-in arenas:

custom_arena = new CustomArena with arenaXml localPath("warehouse.xml")

This allows creating specialized workspaces, factory floors, or research-specific environments while maintaining full physics simulation.

Multi-Robot Support

The interface handles multiple robots operating in the same workspace:

robot1 = new Panda at (-0.5, 0, 0)
robot2 = new UR5e at (0.5, 0, 0)
table = new Table at (0, 0, 0.425)

Each robot maintains independent control and can execute coordinated or individual behaviors.

Built-in Manipulation Behaviors

Ready-to-use behaviors for immediate testing and development:

MoveToPosition - Precise end-effector positioning
PickObject - Automated grasping with approach and closure
LiftToHeight - Controlled lifting to target heights
PickAndLift - Complete pick-and-place sequence

These behaviors use Operational Space Control (OSC) for intuitive 3D movement commands.

Extended Environment Configuration

The interface extends RoboSuite’s configurability through Scenic’s parameter system:

param controller_config = {'type': 'OSC_POSITION', 'impedance': 'low'}
param camera_view = 'robot0_eye_in_hand'
param lite_physics = True # Faster simulation for testing

Example: Probabilistic Pick-and-Place

model scenic.simulators.robosuite.model

# Randomly position cube on table
table = new Table at (0.6, 0, 0.425)
cube = new Box on table,
 with color (1, 0, 0, 1),
 with position (Uniform(-0.2, 0.2), Uniform(-0.2, 0.2), _)

# Robot adapts to random cube position
behavior AdaptivePickup():
 do PickAndLift(cube, height=1.1)

ego = new Panda at (0, 0, 0),
 with behavior AdaptivePickup()

Each scenario run generates a different cube position, testing the robot’s adaptive capabilities.

Challenges Overcome

Understanding Dual Architecture Paradigms

RoboSuite and Scenic operate on fundamentally different principles. RoboSuite builds environments imperatively through MuJoCo XML composition, expecting complete scene specification upfront. Scenic generates scenes probabilistically through constraint solving, requiring geometric knowledge before simulation. Bridging these required developing a two-pass system where we first extract geometry from a temporary RoboSuite environment, update Scenic’s understanding, then create the final simulation. This architectural mismatch touched every aspect of the integration, from object creation to property updates.

Discovering and Extending ManipulationEnv

RoboSuite’s documentation focuses on using pre-built tasks, not creating custom environments. Through extensive source code analysis, we discovered that ManipulationEnv was the key - it accepts robots as configuration while allowing customizable arenas and objects as components. This class became our foundation, but required significant extension. We implemented ScenicManipulationEnv to intercept Scenic’s object configurations, handle dynamic arena selection (EmptyArena vs MultiTableArena based on scene content), and manage the complex initialization sequence where robots, arenas, and objects must be assembled in specific order for MuJoCo compilation.

XML to 3D Mesh Pipeline

Converting MJCF XML to usable 3D meshes proved complex. MuJoCo uses XML to describe geometry, but Scenic needs actual mesh data for collision checking. We built a multi-stage pipeline: First, ElementTree parses the XML to extract mesh references and primitive definitions. Then, we handle two paths - for mesh files, we load STL/OBJ files with trimesh and apply XML-specified transformations; for primitives (boxes, cylinders), we generate meshes programmatically. The challenge intensified with composite objects - a table might have a box tabletop and four cylinder legs. We developed ComponentExtractor to analyze the MuJoCo scene graph, identify related geometries through naming patterns and hierarchy, and export each component as a separate GLB file with proper world transforms preserved.
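The first stage of such a pipeline, separating mesh-backed geoms from primitives, can be sketched with the standard library alone. The sample XML and the `extract_geometry` helper below are illustrative and are not the project's code; the later stages (loading files with trimesh, applying transforms, exporting GLB) are left out.

```python
import xml.etree.ElementTree as ET

# Illustrative MJCF: a table with a primitive box top and a mesh-backed leg.
MJCF = """
<mujoco>
  <asset>
    <mesh name="leg_mesh" file="meshes/leg.stl" scale="0.01 0.01 0.01"/>
  </asset>
  <worldbody>
    <body name="table">
      <geom name="top" type="box" size="0.4 0.4 0.02"/>
      <geom name="leg0" type="mesh" mesh="leg_mesh"/>
    </body>
  </worldbody>
</mujoco>
"""


def floats(text, default):
    """Parse a space-separated attribute into floats, or fall back to default."""
    return [float(v) for v in text.split()] if text else list(default)


def extract_geometry(mjcf_text):
    """Split geoms into mesh references and primitive definitions."""
    root = ET.fromstring(mjcf_text)
    # Asset table: mesh name -> file and scale.
    meshes = {
        m.get("name"): {"file": m.get("file"),
                        "scale": floats(m.get("scale"), (1.0, 1.0, 1.0))}
        for m in root.iter("mesh")
    }
    mesh_geoms, primitives = [], []
    for geom in root.iter("geom"):
        if geom.get("type") == "mesh" or geom.get("mesh"):
            mesh_geoms.append({"geom": geom.get("name"), **meshes[geom.get("mesh")]})
        else:
            primitives.append({
                "geom": geom.get("name"),
                "type": geom.get("type", "sphere"),  # MuJoCo's default geom type
                "size": floats(geom.get("size"), ()),
            })
    return mesh_geoms, primitives


mesh_geoms, primitives = extract_geometry(MJCF)
```

Each entry in `mesh_geoms` would then go down the file-loading path, and each entry in `primitives` down the programmatic mesh-generation path.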

File Path Resolution Discrepancies

Scenic and RoboSuite handle file paths completely differently. Scenic uses localPath() for paths relative to the scenario file, while RoboSuite expects paths relative to its package structure or absolute paths. MJCF XML compounds this - mesh references can be relative to the XML file location, not the calling code. We implemented a sophisticated path resolution system: detect whether paths come from embedded XML (relative to Scenic file) or external XML files (relative to XML location), copy all referenced assets (meshes, textures) to temporary directories accessible to MuJoCo, and handle texture format conversion (JPG to PNG) when needed. This system transparently manages assets whether they’re in the Scenic project, RoboSuite package, or absolute paths, making the interface truly portable.
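The resolution rule described above can be sketched as follows. The helper names are hypothetical and the sketch covers only path resolution and copying to a temporary directory; texture conversion is omitted, and files with the same name from different folders would collide in this simplified version.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_asset(ref, scenic_dir, xml_dir=None):
    """Resolve an asset reference to an absolute path.

    Embedded XML (xml_dir is None) resolves relative to the Scenic file's
    directory; XML loaded from a file resolves relative to that file's directory.
    """
    path = Path(ref)
    if path.is_absolute():
        return path
    base = Path(xml_dir) if xml_dir is not None else Path(scenic_dir)
    return (base / path).resolve()


def stage_assets(refs, scenic_dir, xml_dir=None):
    """Copy referenced assets into a fresh temp dir; map each ref to its copy."""
    staging = Path(tempfile.mkdtemp(prefix="scenic_assets_"))
    staged = {}
    for ref in refs:
        src = resolve_asset(ref, scenic_dir, xml_dir)
        if not src.is_file():
            raise FileNotFoundError(f"asset not found: {ref} (looked in {src})")
        dst = staging / src.name
        shutil.copy2(src, dst)
        staged[ref] = dst
    return staged
```

The XML handed to MuJoCo would then be rewritten to point at the staged copies, so the simulator never has to know where the original files lived.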

Impact and Applications

This bridge enables:

Research: Generate diverse manipulation scenarios for robot learning algorithms
Testing: Validate robotic systems against probabilistic task variations
Development: Rapid prototyping of manipulation tasks without manual scene setup
Education: Teach robotics concepts through declarative scenario specification

The integration makes complex robotic simulations accessible through Scenic’s intuitive language while preserving RoboSuite’s detailed physics and control capabilities.

Documentation and Resources

The project includes:

Example scenarios demonstrating all features
Comprehensive STATUS.md tracking working features and known issues
Technical documentation in docs/ covering architecture and troubleshooting
Mesh extraction utilities for pre-processing and caching

Current Status and Future Work

This prototype demonstrates that the Scenic-RoboSuite bridge is viable and functional. Basic features are working reliably:

Single-robot manipulation scenarios execute successfully
MJCF XML injection creates custom objects
Pick-and-place behaviors operate consistently
Multi-robot support functions in controlled scenarios

However, significant work remains:

Stability improvements: Some features work intermittently and need refinement
Velocity tracking: Full implementation awaits framework updates
Multi-robot coordination: Advanced synchronization primitives needed
Performance optimization: Mesh extraction and caching can be streamlined
Extended testing: More diverse scenarios and edge cases need validation

The prototype serves as a proof of concept, showing that probabilistic scenario specification can successfully drive physics-based robot simulation. The architecture is sound, the core features function, and the path forward is clear.

Conclusion

This working prototype of the Scenic-RoboSuite integration represents significant progress toward bridging probabilistic programming with robotic simulation. We’ve successfully demonstrated that declarative scenario specification can control detailed physics simulation, opening new possibilities for robotic system development and testing.

While not yet production-ready, the prototype provides a solid foundation for future development. Researchers can begin experimenting with basic manipulation scenarios, developers can test the interface with their use cases, and the community can contribute to making this bridge more robust and feature-complete.

The challenges overcome - from understanding dual architectures to implementing XML-to-mesh pipelines - have resulted in a functional system that validates our approach. This prototype proves that Scenic’s elegant scenario language and RoboSuite’s detailed physics can work together, setting the stage for a powerful new tool in robotics research and development.

Final Report — RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics

Sun, 28 Sep 2025 00:00:00 +0000

Hello! I’m Zeyu Zou! I have been contributing to the RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics project under the mentorship of Ziheng Duan. My project focuses on developing a framework that predicts spatial gene expression from histology images by combining vision encoders with single-cell RNA-seq references. The goal is to make spatial transcriptomics more affordable, interpretable, and scalable for the research community.

Introduction

RAG-ST is designed to reduce the cost and complexity of spatial transcriptomics by leveraging existing histology images and scRNA-seq priors. This work integrates computer vision with retrieval-augmented generation to improve prediction accuracy and interpretability.

Methods

The project used a two-stage pipeline:

Vision encoder (ResNet50/ViT) to map histology patches to cell type distributions.
Retrieval-augmented generation guided by scRNA-seq profiles to predict gene expression.

Datasets included HEST-1K (paired histology and expression) and CellxGene Census as the reference database. Training and evaluation pipelines were implemented in PyTorch.
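To make the two-stage idea concrete, here is a deliberately simplified sketch of the second stage. It assumes the vision encoder has already produced a cell-type distribution for a patch, and it replaces the retrieval-augmented generator with a plain weighted mixture of reference profiles. The function name, gene values, and cell-type probabilities are all invented for illustration and are not results or code from the project.

```python
def predict_expression(cell_type_probs, reference_profiles, top_k=2):
    """Mix the reference profiles of the top_k most likely cell types."""
    top = sorted(cell_type_probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in top)
    genes = list(next(iter(reference_profiles.values())).keys())
    prediction = {g: 0.0 for g in genes}
    for cell_type, p in top:
        for g in genes:
            prediction[g] += (p / total) * reference_profiles[cell_type][g]
    # The trace records which references were retrieved and how much each counted.
    trace = [(cell_type, p / total) for cell_type, p in top]
    return prediction, trace


# Stage 1 output for one histology patch (made-up numbers).
patch_probs = {"T cell": 0.6, "B cell": 0.3, "fibroblast": 0.1}

# Mean expression per cell type from a scRNA-seq reference (made-up numbers).
reference = {
    "T cell":     {"CD3E": 8.0, "MS4A1": 0.0},
    "B cell":     {"CD3E": 0.5, "MS4A1": 6.0},
    "fibroblast": {"CD3E": 0.0, "MS4A1": 0.0},
}

prediction, trace = predict_expression(patch_probs, reference)
```

Returning the trace alongside the prediction is what makes this style of model inspectable: for any spot, one can see which reference cell types drove the predicted expression.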

Results

Implemented a complete pipeline from histology preprocessing to expression prediction.
Achieved higher correlation scores (Pearson/Spearman) and lower errors (MSE/MAE) compared to baseline models.
Produced spatial gene expression maps with interpretable retrieval traces and attention weights.
Released open-source code, preprocessing scripts, and analysis notebooks for reproducibility.

Future Work

Extend experiments to additional tissues (lung, liver, tumor samples).
Test cross-dataset generalization and robustness.
Explore integration into clinical pathology workflows for affordable spatial inference.

Acknowledgments

Thanks to my mentor Ziheng Duan, the UC OSPO team, the HEST-1K dataset contributors, and the CellxGene Census project. This work was conducted under OSRE 2025.

Final Update: Building Intelligent Observability for NRP

Thu, 25 Sep 2025 00:00:00 +0000

I’m excited to share the completion of my OSRE 2025 project, “Intelligent Observability for NRP: A GenAI Approach” and the significant learning journey it has been. We’ve successfully developed a novel InfoAgent architecture that delivers on our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.

How Our Novel InfoAgent Architecture Advances the Observability Mission

Through extensive development and testing, I’ve learned tremendously about building production-ready AI systems and have implemented a novel InfoAgent architecture that orchestrates our specialized agents:

1. Prometheus Metrics Analysis Agent

Function: Continuously ingests and processes NRP’s Prometheus metrics
Progress: Fully implemented data pipelines handling multiple metric types with optimized latency
Purpose: Provides the foundation for anomaly detection by establishing normal behavior baselines

2. Query Refinement Agent

Function: Clarifies ambiguous metrics or patterns before generating explanations
Progress: Completed implementation of Conformal Revision of Questions for disambiguation
Purpose: Ensures explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)
Deliverable Impact: Successfully improved accuracy of GenAI explanations by eliminating misinterpretations

3. Explanation Generation Agent (AIS)

Function: Creates human-readable explanations and root-cause analysis
Progress: Finalized the Automated Information Seeker with a complete Plan→Validate→Execute→Assess→Revise cycle
Purpose: Transforms technical anomalies into actionable insights for operators
Deliverable Impact: Delivers GenAI explanations with uncertainty quantification

Completed Integration: The Novel InfoAgent Pipeline

We’ve successfully integrated all agents into a unified observability pipeline that represents our novel contribution:

Data Collection: Prometheus metrics → Analysis Agent (comprehensive metrics support)
Anomaly Detection: With statistical confidence bounds using conformal prediction
Query Refinement: Resolving ambiguities before explanation
Explanation Generation: Human-readable analysis with uncertainty awareness
Feedback Loop: System learning from operator interactions (implemented and tested)
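The anomaly detection step with confidence bounds can be illustrated with split conformal prediction. The sketch below is generic and is not the InfoAgent code: it assumes a nonconformity score equal to the absolute deviation from a baseline, and the function names and sample numbers are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Split-conformal cutoff: the ceil((n+1)(1-alpha))-th smallest score."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(calibration_scores)[rank - 1]


def detect(values, baseline, threshold):
    """Flag values whose deviation from the baseline exceeds the threshold."""
    return [abs(v - baseline) > threshold for v in values]


# Deviations observed during a known-normal calibration window (made up).
calibration = [1, 2, 3, 4, 5, 6, 7, 8, 9]
threshold = conformal_threshold(calibration, alpha=0.25)
flags = detect([50, 61], baseline=50, threshold=threshold)
```

If new normal data is exchangeable with the calibration window, a normal point is flagged with probability at most `alpha`, which is what gives the alert a statistical interpretation rather than a hand-tuned cutoff.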

Hardware Testing Results

This project taught me valuable lessons about optimizing AI workloads on specialized hardware. We successfully tested our observability framework on Qualcomm Cloud AI 100 Ultra hardware:

Achieved significant performance improvements over baseline CPU implementation
Successfully ported and optimized GLM-4.5 for observability-specific tasks
Validated that specialized AI hardware significantly enhances real-time anomaly detection

Learning Journey and Novel Contributions

Throughout OSRE 2025, I’ve learned extensively about:

Building hierarchical agent coordination systems for complex reasoning
Implementing conformal prediction for trustworthy AI outputs
Creating self-correcting explanation pipelines
Developing adaptive learning systems from operator feedback

The novel InfoAgent architecture demonstrates promising results in our testing environment, with evaluation metrics and benchmarks still being refined as work in progress.

Ongoing Work: Continuing Beyond OSRE

While OSRE 2025 is concluding, I’m actively continuing to contribute to this project:

Preparing the InfoAgent framework for open-source release with comprehensive documentation
Running extended evaluation tests on the Nautilus platform (work in progress)
Writing a research paper detailing our novel architecture
Creating tutorials to help others implement intelligent observability

Project Updates and Code: You can follow my ongoing contributions and access the latest code at https://mreddy10.pages.nrp-nautilus.io/gsocnrp/

Acknowledgments

I’m deeply grateful to my lead mentor Mohammad Firas Sada for his exceptional guidance throughout this transformative learning experience. His insights have been invaluable in helping me develop the novel InfoAgent architecture and navigate the complexities of building production-ready AI systems.

The OSRE 2025 program has been an incredible journey of growth and discovery. I’ve learned not just how to build AI systems, but how to make them trustworthy, explainable, and genuinely useful for real-world operations. The novel InfoAgent architecture we’ve developed serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.

I’m excited to continue contributing to this project and look forward to seeing how the community adopts and extends these ideas. Check out my contributions and ongoing updates at https://mreddy10.pages.nrp-nautilus.io/gsocnrp/!

[Final] Building PeerSky’s Extensions System

Tue, 23 Sep 2025 00:00:00 +0000

Hi everyone, I’m Hanzhong Liu. Over the summer I worked on building the peersky://extensions system for PeerSky browser, a decentralized and privacy-first browser built on Electron.

This post is my final GSoC 2025 update — covering how the extensions manager was designed, the security model behind IPC, the UI for managing extensions, and what’s next for PeerSky.

Project Overview

The new extensions system makes PeerSky behave like a modern browser: you can install extensions from the Chrome Web Store or from local files, enable/disable them, update or uninstall, and interact with their toolbar actions through a puzzle-menu UI.

Key Design Goals

Secure preload-based API exposure via contextBridge
Support for preinstalled, Web Store, and local packages
Toolbar integration with pin/unpin support (up to six)
Robust validation: MV3-only, size caps, zip-slip prevention

Highlights

Preinstalled MV3s

PeerSky now ships with three trusted extensions out of the box:

Dark Reader
Linguist (web page translator)
uBlock Origin Lite

They remain installed by default but can be disabled at any time. This ensures users always have a working baseline without needing to browse an extension store.

Electron Integration

Instead of injecting scripts, the system uses preload + IPC. Each operation is routed through validated IPC channels:

listExtensions, installFromWebStore, toggleExtension, etc.
All methods are scoped to peersky://extensions only.
Rate limiting and size caps are enforced per renderer.

This design makes the surface auditable and prevents privilege leaks.
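The shape of this design, an allowlist of channels, an origin check, and a per-renderer rate limit, can be sketched independently of Electron. PeerSky itself is JavaScript; the Python below is a language-neutral illustration, and the `handle` function, the `RateLimiter` class, and the limit values are assumptions rather than the browser's actual code.

```python
import time
from collections import defaultdict, deque

ALLOWED_CHANNELS = {"listExtensions", "installFromWebStore", "toggleExtension"}
ALLOWED_ORIGIN = "peersky://extensions"


class RateLimiter:
    """Sliding-window limiter keyed by renderer id."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window_seconds
        self.clock = clock
        self.calls = defaultdict(deque)

    def allow(self, renderer_id):
        now = self.clock()
        recent = self.calls[renderer_id]
        while recent and now - recent[0] >= self.window:
            recent.popleft()  # forget calls that fell out of the window
        if len(recent) >= self.max_calls:
            return False
        recent.append(now)
        return True


def handle(channel, origin, renderer_id, limiter):
    """Validate a request before any privileged work is done."""
    if origin != ALLOWED_ORIGIN:
        return "rejected: origin"
    if channel not in ALLOWED_CHANNELS:
        return "rejected: channel"
    if channel == "installFromWebStore" and not limiter.allow(renderer_id):
        return "rejected: rate limit"
    return "ok"
```

Because every request passes through one function with a fixed set of channel names, the exposed surface can be read and reviewed in one place.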

Browser actions appear in a puzzle menu and can be pinned for quick access:

Up to six pins are allowed.
Pinned state persists across sessions.
Popups (e.g., for translators or wallets) open in isolated windows, with OAuth flows preserved via popup guards.

Security Highlights

Installs capped at 60 MB, with early rejection on oversized payloads
5 installs/minute per renderer to prevent abuse
ZIP/CRX extraction hardened against path traversal
MV3 required; permissions validated at install with warnings for risky hosts
Web Store installs use Google-signed CRX verification via electron-chrome-web-store
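Two of these checks, the size cap and path-traversal ("zip-slip") prevention, are easy to show in a few lines. The sketch below uses Python's `zipfile` for illustration only; PeerSky's implementation is in JavaScript, and the `safe_extract` name is hypothetical. The size check here trusts the sizes declared in the archive headers, so a hardened version would also count bytes while extracting.

```python
import zipfile
from pathlib import Path

MAX_TOTAL_BYTES = 60 * 1024 * 1024  # 60 MB cap, matching the list above


def safe_extract(archive_path, dest_dir, max_total=MAX_TOTAL_BYTES):
    """Extract an archive only if it is small enough and stays inside dest_dir."""
    dest = Path(dest_dir).resolve()
    with zipfile.ZipFile(archive_path) as zf:
        infos = zf.infolist()
        # Reject early, before writing anything to disk.
        if sum(info.file_size for info in infos) > max_total:
            raise ValueError("archive exceeds size cap")
        for info in infos:
            target = (dest / info.filename).resolve()
            # Entries such as "../x" or "/etc/x" resolve outside dest.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {info.filename}")
        zf.extractall(dest)
    return dest
```

Validating every entry before extracting any of them means a malicious archive is rejected as a whole instead of being partially unpacked.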

Example: Installing from the Web Store

Adding a new extension is simple:

Paste a Chrome Web Store URL or ID into the install bar.
PeerSky downloads and validates the CRX.
On success, the extension appears in the grid with toggle, update, and remove options.

Reflection

This project was both challenging and rewarding. Designing an extension system meant grappling with security, IPC design, and user experience at the same time. I learned to think carefully about security management and UI/UX positioning, and to design APIs that are auditable.

I’m grateful to my mentor Akhilesh Thite and the UC OSPO team for their guidance and feedback. Their support pushed me to make deliberate technical decisions and communicate them clearly.

You can explore the project here: https://github.com/p2plabsxyz/peersky-browser

Final Report: Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows

Thu, 18 Sep 2025 00:00:00 +0000

Final Report: Applying MLOps to Overcome Reproducibility Barriers in ML

Background

Hello! I’m Ahmed Alghali, and this is my final report on the project Applying MLOps to Overcome Reproducibility Barriers in ML, under the mentorship of Professor Fraida Fund and Mohamed Saeed.

This project aims to address the reproducibility problem in machine learning—both in core ML research and in applications to other areas of science.

The focus is on making large-scale ML experiments reproducible on Chameleon Cloud. To do this, we developed ReproGen, a template generator that produces ready-to-use, reproducible ML training workflows. The goal is to make the cloud easy to use for researchers setting up experiments, without having to worry about the complexity of stitching everything together.

Progress Since Mid-Report

Migration from Cookiecutter to Copier

We initially used Cookiecutter as the templating engine, but it lacked features we needed (e.g., conditional questions). We switched to Copier, which provides more flexibility and better matches our use case.

Support for Multiple Setup Modes

We now offer two setup modes, designed to serve both beginners and users who want advanced options/customization:

Basic Mode – minimal prompts (project name, repository link, framework).
Advanced Mode – detailed control (compute site, GPU type, CUDA version, storage site, etc.).

This keeps the tool accessible for new users while still enabling fine-grained control for advanced users.
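One way to picture the two modes is as a filter over a single shared question list. The sketch below is plain Python and does not use Copier's API; the question names and defaults are hypothetical and chosen only to mirror the prompts listed above.

```python
# Hypothetical question list; names and defaults are illustrative only.
QUESTIONS = [
    {"name": "project_name", "default": "my-project", "advanced": False},
    {"name": "repo_url", "default": "", "advanced": False},
    {"name": "framework", "default": "pytorch", "advanced": False},
    {"name": "compute_site", "default": "site-a", "advanced": True},
    {"name": "gpu_type", "default": "nvidia", "advanced": True},
    {"name": "cuda_version", "default": "12", "advanced": True},
]


def questions_to_ask(mode):
    """Return the names of the prompts shown to the user in the given mode."""
    if mode not in ("basic", "advanced"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [q["name"] for q in QUESTIONS if mode == "advanced" or not q["advanced"]]


def resolve_answers(mode, user_answers):
    """Merge user answers with defaults; questions that were not asked keep their default."""
    asked = set(questions_to_ask(mode))
    answers = {}
    for q in QUESTIONS:
        if q["name"] in asked and q["name"] in user_answers:
            answers[q["name"]] = user_answers[q["name"]]
        else:
            answers[q["name"]] = q["default"]
    return answers
```

In this sketch, Basic Mode still produces a complete configuration because every advanced question falls back to its default.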

Automated Credential Generation

Previously, users had to generate application credentials manually (via the Horizon OpenStack UI). Now, we provide scripts that generate two types of credentials programmatically, Swift and EC2, using Chameleon JupyterHub credentials with python-chi and the openstack-sdk client.

Automatic README.md Generation

Each generated project includes a customized README.md, containing setup guidance and commands tailored to the user’s configuration.

Bug Fixes and UX Enhancements

Alongside major features, we implemented numerous smaller changes and fixes to improve the reliability and user experience of the tool.

Deliverables

ReproGen GitHub Repository: source code for the template generator.
mlflow-replay branch: explore a past experiment, artifacts, and logged insights.
LLM-Demo branch: hands-on demo to track fine-tuning of an LLM using infrastructure generated by ReproGen.

Next Steps

Compatibility Matrix
- The tool and the generated setup both depend on software whose compatibility needs attention at every level: hardware, OS, drivers, computing platforms, and core and third-party libraries. Writing documentation is a first step to help future debugging and to make it possible to add pieces without breaking what is already there.
Maintain Docker Images

So far we have CPU and GPU Docker images for the most frequently used frameworks:
- CPU-based image: for data science workloads (Scikit-Learn)
- GPU-Nvidia variant: for deep learning workloads on Nvidia machines (PyTorch, Lightning, TensorFlow)
- GPU-AMD variant: for deep learning workloads on AMD machines (PyTorch, Lightning, TensorFlow)

Adding more variants for more frameworks and enhancing the experience of the existing images is recommended.

Reflection

When I first joined SoR 2025, I had trouble crystallizing how I could practically achieve reproducibility and package a tool that maximizes the chance of reproducing experiments built with it. Throughout the journey, my mentors took me under their wings and helped me understand the reproducibility challenges in ML. My mentor Professor Fraida Fund wrote materials that saved me a lot of time in familiarizing myself with the testbed and with important Linux tools and commands, and even gave me hands-on practice with how large model training with an MLflow tracking server is done in the cloud. Mohamed Saeed took the time to review my presentation, pushing me to do my best. I’m forever thankful for the way they shaped the project and my personal growth. This hands-on experience helped me view MLOps, cloud APIs, and workflow design through different lenses, and I’m proud to have contributed a tool that can help simplify reproducible research for others.

Final Report: CarbonCast — An end-to-end consumption-based Carbon Intensity Forecasting service

Mon, 15 Sep 2025 00:00:00 +0000

Hi everyone—this is my final report for CarbonCast, mentored by Professor Abel Souza. Back in June, my goal was simple to say and harder to pull off: help people see when the grid is cleaner and make it easy to act on that information. Over the summer I turned CarbonCast from a research prototype into something you can open, click, and rely on: a containerized backend, a clean API, and a fast, friendly map UI.

Background

CarbonCast forecasts the carbon intensity of electricity (gCO₂e/kWh) using grid data and weather. Earlier versions were accurate but difficult to run and even harder to use outside a research context. My OSRE focus was to make CarbonCast usable for real people: provide a standard API, build a web UI that feels responsive, and package everything so it starts quickly and keeps itself healthy.

Goals

I centered the work around four goals. First, I wanted to ship an end-to-end containerized stack—data collection, validation, storage, API, and UI—that someone else could run without digging through my notes. Second, I aimed to expand coverage beyond a handful of regions so the map would be genuinely useful. Third, I needed to make it reliable, with retries, monitoring, and graceful fallbacks so the system could run for weeks without babysitting. Finally, I wanted to lay the groundwork for a consumption-based signal, because imports from neighboring regions also shape a region’s true emissions picture.

What I built

By the end of the program, CarbonCast runs as a containerized backend + API + web app that you can bring up with Docker. The pipelines now reach 85+ regions, and the UI currently exposes 58+ while we finish integrating the rest. The API offers straightforward endpoints for current conditions and multi-day views, plus region metadata so clients can discover what’s available. The UI presents an interactive choropleth map with a side panel for the energy mix and a simple timeline to move between past, now, and the next few days. To keep things feeling snappy, I tuned caching so “now” data updates quickly while historical and forecast views load instantly from cache. I also added a small “mission control” dashboard that shows what updated, what failed, and how the system recovered, which makes maintenance far less mysterious.

How it works

Fresh weather and grid data arrive on a regular schedule. The system checks each file for sanity, stores it, and serves it through a clean API. The React app calls that API and paints the map. Hovering reveals regional details; clicking opens a richer panel with the energy mix and trends; the timeline lets you scrub through hours naturally. In short, the path is fresh data → API → map, and each step is designed to be obvious and quick.

Behind the scenes, I extended the existing Django backend with a SQLite path so the UI works out of the box on a laptop. For production, you can point the same code at Postgres or MySQL without changing the UI. This choice made local testing easy while leaving room for scale later.

Highlights

A few moments stand out. The first time the dashboard flipped from red to green on its own—after the system retried through a wave of timeouts—was a turning point. Clicking across the map and getting instant responses because the right data was cached felt great too. And packaging everything so another person can run it without asking me for help might be the biggest quality-of-life win for future contributors.

Challenges

The first big hurdle was refactoring the old vanilla-JS interface. The original UI worked, but it was dated and hard to extend. I rebuilt it as a modern React + TypeScript app with a cleaner component structure and a fresh look—think glassmorphic panels, readable color scales, and a layout that feels consistent on both laptops and smaller screens. Moving to this design system made the codebase far easier to maintain, theme, and iterate on.

The next challenge was performance under real-time load. With dozens of regions updating, it was easy to hit API limits and make the UI feel jittery. I solved this by adding a smart caching layer with short, volatility-aware timeouts, request de-duplication, and background prefetching. That combination dramatically reduced round-trips, essentially eliminated rate-limit hits, and made the map feel responsive even as you scrub through time. The result is a UI that can handle many simultaneous updates without hiccups.
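The idea of volatility-aware timeouts can be sketched as a cache whose time-to-live depends on the kind of data. This is a minimal Python illustration rather than the CarbonCast UI code (which is a React + TypeScript app); the class name, the TTL values, and the fetch signature are all assumptions, and request de-duplication and background prefetching are left out.

```python
import time


class TTLCache:
    """Tiny cache where each kind of data gets its own time-to-live."""

    # Hypothetical TTLs in seconds: "now" data changes fast, the rest rarely.
    TTL_BY_KIND = {"now": 60, "forecast": 3600, "history": 86400}

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # function (kind, region) -> data
        self._clock = clock    # injectable so the cache can be tested
        self._store = {}       # (kind, region) -> (expires_at, data)

    def get(self, kind, region):
        key = (kind, region)
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and entry[0] > now:
            return entry[1]    # fresh hit: no round-trip to the API
        data = self._fetch(kind, region)
        self._store[key] = (now + self.TTL_BY_KIND[kind], data)
        return data
```

Short TTLs for volatile data keep the "now" view current, while long TTLs let historical views be served from memory.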

Finally, there were plenty of stubborn UI bugs. Some regions wouldn’t color even when data was available, certain charts refused to render, and a few elements flickered or never showed up. Most of this came down to learning React state management in a real project: taming race conditions, canceling in-flight requests when users navigate, and making sure state only updates when fresh data actually arrives. Fixing those issues taught me a lot about how maps re-paint, how charts expect their data, and how to keep components simple enough that they behave the way users expect.

What didn’t make the cut (yet)

I designed—but did not finish—per-region plug-in models so each grid can use the approach that fits it best. We decided to ship a stable, deployable service first and reserve that flexibility work for the next phase. The design is written down and ready to build.

Links and resources:

Project page: CarbonCast
Proposal: https://ucsc-ospo.github.io/report/osre25/ucsc/carboncast/20250710-tanushsavadi/
Midterm blog: https://ucsc-ospo.github.io/report/osre25/ucsc/carboncast/20250803-tanushsavadi/
Backend/API (branch): https://github.com/carbonfirst/CarbonCast/tree/django_apis_sqlite
Frontend/UI: https://github.com/carbonfirst/CarbonCastUI/tree/main

What’s next

My next steps are clear. I want to finish the per-region model plug-ins so grids can bring their own best forecasting logic. I also plan to carry the consumption-based signal end-to-end, including imports and interconnects surfaced directly in the UI. Finally, I’ll harden the system for production by enabling auth and throttling and by moving to a production-grade database where appropriate.

Thank you

Huge thanks to Professor Abel Souza for steady mentorship and to the OSRE community for thoughtful feedback. The most rewarding part of this summer was watching a research idea become something people can click on—and use to make cleaner choices.

Final Report: A Systematic Investigation into the Reproducibility of RAG Systems

Fri, 05 Sep 2025 00:00:00 +0000

I’m Baiqiang, and this is the final report for the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project, mentored by Luanzheng “Lenny” Guo and Dongfang Zhao. This project successfully developed a novel framework to quantitatively measure reproducibility in AI systems, yielding several surprising and impactful results.

The Challenge: The Need for Systematic Measurement

Retrieval-Augmented Generation (RAG) is a cornerstone of AI for science, but its reliability is often compromised by non-determinism. While this issue was a known concern, a fundamental challenge was the lack of standardized tools and methodologies to systematically measure and quantify the sources of this inconsistency. Without a rigorous way to analyze the problem, it was difficult to move beyond ad-hoc tests and establish the true root causes, hindering the development of truly trustworthy AI systems for science.

Our Contribution: The ReproRAG Framework

To address this gap, the central contribution of this project is ReproRAG, a comprehensive, open-source benchmarking framework. ReproRAG is designed to systematically investigate sources of uncertainty across the entire RAG pipeline by:

Isolating Variables: It allows for controlled experiments on embedding models, numerical precision, retrieval algorithms, hardware configurations (CPU/GPU), and distributed execution environments.
Quantifying Uncertainty: It employs a suite of metrics—including Exact Match Rate, Jaccard Similarity, and Kendall’s Tau—to precisely measure the impact of each variable on the final retrieved results.
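To make the three metrics concrete, here is a small Python sketch that compares two ranked lists of retrieved document IDs. It is not ReproRAG's implementation: the function names are hypothetical, Kendall's Tau is computed only over documents present in both lists (an assumption), and each list is assumed to contain no duplicates.

```python
from itertools import combinations


def exact_match(run_a, run_b):
    """1.0 if the two ranked lists are identical, else 0.0."""
    return 1.0 if list(run_a) == list(run_b) else 0.0


def jaccard(run_a, run_b):
    """Size of the intersection over size of the union of retrieved IDs."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def kendall_tau(run_a, run_b):
    """Kendall's tau over the documents that appear in both rankings."""
    shared = set(run_a) & set(run_b)
    common = [d for d in run_a if d in shared]
    if len(common) < 2:
        return 1.0
    pos_b = {d: i for i, d in enumerate(run_b)}
    concordant = discordant = 0
    for x, y in combinations(common, 2):  # x is ranked above y in run_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

Averaging `exact_match` over many queries gives a rate; Jaccard ignores order, while Kendall's Tau looks only at order, so the three together separate "different documents" from "same documents, different ranking".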

Key Findings: A New Hierarchy of Uncertainty

Our large-scale empirical study using ReproRAG challenged common assumptions and established a clear hierarchy of what actually impacts reproducibility.

Core Algorithms Are Not the Problem: Our most surprising finding is that modern retrieval libraries like FAISS are perfectly reproducible out-of-the-box. Across all tested index types (including approximate ones like HNSW and IVF) and execution environments (single-node CPU/GPU and multi-node distributed systems), we achieved perfect run-to-run reproducibility (1.000 scores on all metrics) when environmental factors like random seeds were controlled. This falsifies the common hypothesis that approximate nearest neighbor algorithms are a primary source of randomness.
Embedding Model Choice is a Dominant Source of Variation: We found that the choice of the embedding model is a dominant factor driving result variation. When comparing outputs from different state-of-the-art models (BGE, E5, Qwen) for the same query, the agreement was very low (e.g., Overlap Coefficient of ~0.43-0.54). This means a scientific conclusion drawn with one model may not be reproducible with another, as they are fundamentally “seeing” different evidence.
Environmental Factors Introduce Measurable “Drift”:
- Numerical Precision: Changing floating-point precision (e.g., FP32 vs. FP16) was a guaranteed source of variation, but it caused a small and quantifiable “embedding drift” rather than chaotic changes.
- Data Insertion: Incrementally adding new data to an index caused a predictable “displacement” of old results, not a re-shuffling. The relative ranking of the remaining original documents was perfectly stable (Kendall’s Tau of 1.000).
Common Determinism Flags Can Be Ineffective: Our tests showed that popular software-level controls, like cudnn.deterministic flags in PyTorch, had no observable effect on the output of modern transformer-based embedding models. This underscores the necessity of empirical validation over assuming that framework settings work as advertised.
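The "embedding drift" caused by reduced precision can be illustrated without any ML library. The sketch below rounds a made-up vector to half precision using Python's `struct` module and measures how far it moves; the vector values are invented, and since Python floats are 64-bit, this stands in for the FP32 vs. FP16 comparison rather than reproducing it.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# A made-up 8-dimensional "embedding"; real embeddings have hundreds of dimensions.
embedding = [0.1234567, -0.7654321, 0.3333333, 0.0101010,
             -0.2468135, 0.9876543, -0.5555555, 0.0442211]
embedding_fp16 = [to_fp16(x) for x in embedding]

max_abs_error = max(abs(a - b) for a, b in zip(embedding, embedding_fp16))
similarity = cosine(embedding, embedding_fp16)
print(f"max abs error: {max_abs_error:.2e}, cosine similarity: {similarity:.8f}")
```

The rounded vector is never identical to the original, but it stays very close to it. That small shift is enough to reorder documents whose similarity scores are nearly tied, without producing chaotic changes.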

Conclusion

This project successfully shifted the focus of the RAG reproducibility problem. The key challenge is not to fix supposedly “random” algorithms, but to rigorously control the entire experimental environment. We delivered ReproRAG, a framework that empowers researchers to do just that. Our findings provide actionable insights for the community: efforts to improve reproducibility should focus less on the retrieval algorithms themselves and more on disciplined management of embedding models, data versioning, and numerical precision.

Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL

Fri, 05 Sep 2025 00:00:00 +0000

Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 2

Hello all! 👋 This is Tharit again. I want to share this blog post about Part 2 of my Google Summer of Code project. In case you missed it, you can take a look at Part 1 as well. Without further ado, the following features have been added since last time.

Complex Number Support PR #148

Between this PR and the previous one, there was a lot of debugging and testing to make sure that every existing MPILinearOperator works under NCCL as it does with mpi4py (PR #141, #142, #145).

Most PyLops-MPI users are scientists and engineers working on scientific problems, and most scientific problems involve complex numbers (the Fourier transform touches many things). NCCL does not support complex numbers out of the box.

It turned out that adding complex-number support was not a big issue. A complex array is simply a contiguous array of, say, float64 values: unlike a plain float64 element, one complex128 element is represented by two float64 values. Things get more complicated once we start talking about complex arithmetic. Luckily, NCCL only supports the element-wise reductions ncclSum, ncclProd, ncclMin, ncclMax, and ncclAvg. Wrapping element-wise operations for complex numbers is straightforward.

The change to PyLops-MPI _nccl.py itself is minimal. We simply added the function below and this hides the complexity of buffer-size management from users.

def _nccl_buf_size(buf, count=None):
    """Get an appropriate buffer size according to the dtype of buf."""
    if buf.dtype in ['complex64', 'complex128']:
        # each complex element occupies two floating-point slots
        return 2 * count if count else 2 * buf.size
    else:
        return count if count else buf.size
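The factor of two comes from viewing each complex element as a (real, imaginary) pair. The pure-Python sketch below illustrates why a sum reduction over the flattened buffer gives the same result as a sum over the complex values; the helper names are hypothetical, and the "reduction" is a plain loop standing in for a collective call.

```python
def complex_to_floats(values):
    """Flatten [a+bj, ...] into [a, b, ...], the layout of a complex buffer."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    return flat


def floats_to_complex(flat):
    """Inverse of complex_to_floats."""
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def elementwise_sum(buffers):
    """Stand-in for a sum reduction across ranks (one buffer per rank)."""
    return [sum(column) for column in zip(*buffers)]


rank0 = [1 + 2j, 3 - 1j]
rank1 = [0.5 - 2j, -3 + 4j]

flat_buffers = [complex_to_floats(rank0), complex_to_floats(rank1)]
assert len(flat_buffers[0]) == 2 * len(rank0)  # twice the element count

reduced = floats_to_complex(elementwise_sum(flat_buffers))
expected = [a + b for a, b in zip(rank0, rank1)]
assert reduced == expected
```

The sketch covers only the sum, where real and imaginary parts reduce independently.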

The concept is quite simple. Mechanically, however, getting it right in the general case required extensive bug fixing, particularly in the call to _allgather, as noted earlier in the “Core Change” section. The array needs some preprocessing (to align with NCCL semantics) and post-processing so that the result from PyLops-MPI’s NCCL allgather matches the result of its MPI allgather. This is because PyLops-MPI must be able to switch between mpi4py and NCCL seamlessly from the user’s perspective. To make it concrete, here is how we do _allgather() with NCCL:

def _allgather(self, send_buf, recv_buf=None):
    """Allgather operation
    """
    if deps.nccl_enabled and self.base_comm_nccl:
        if isinstance(send_buf, (tuple, list, int)):
            return nccl_allgather(self.base_comm_nccl, send_buf, recv_buf)
        else:
            send_shapes = self.base_comm.allgather(send_buf.shape)
            (padded_send, padded_recv) = _prepare_nccl_allgather_inputs(send_buf, send_shapes)
            raw_recv = nccl_allgather(self.base_comm_nccl, padded_send, recv_buf if recv_buf else padded_recv)
            return _unroll_nccl_allgather_recv(raw_recv, padded_send.shape, send_shapes)
    # < snip - MPI allgather >

After this feature was added, PyLops-MPI with NCCL caught up with the original MPI implementation: the test coverage is now the same, with 306 tests passing!
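The pad-then-unroll pattern in the _allgather listing can be simulated on a single process. The sketch below assumes the padding exists because NCCL's allgather expects every rank to send the same number of elements; the helper names are hypothetical stand-ins, not the PyLops-MPI functions, and the buffers are 1-D Python lists rather than GPU arrays.

```python
def pad_to(buf, length, fill=0.0):
    """Pad a 1-D buffer with `fill` so every rank sends the same length."""
    return list(buf) + [fill] * (length - len(buf))


def simulated_allgather(padded_buffers):
    """Stand-in for an equal-size allgather: concatenate in rank order."""
    gathered = []
    for buf in padded_buffers:
        gathered.extend(buf)
    return gathered


def unroll(gathered, padded_len, send_lens):
    """Cut the gathered buffer back into per-rank chunks, dropping padding."""
    chunks = []
    for rank, n in enumerate(send_lens):
        start = rank * padded_len
        chunks.append(gathered[start:start + n])
    return chunks


# Three "ranks" with different local sizes.
local = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
send_lens = [len(b) for b in local]   # the sizes each rank would share first
padded_len = max(send_lens)
gathered = simulated_allgather([pad_to(b, padded_len) for b in local])
result = unroll(gathered, padded_len, send_lens)
```

Sharing the original sizes before the collective is what makes it possible to strip the padding afterwards, so the caller sees the same result as an allgather over unequal buffers.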

Benchmark Instrumentation PR #157

Profiling distributed GPU operations is critical to understanding performance bottlenecks. To make this easier, we added a lightweight benchmark instrumentation framework in PyLops-MPI. The goal was to allow developers to mark execution points in a function and collect timing information for these markers.

The core of the implementation is a @benchmark decorator. Inside a decorated function, developers can call mark(label) to record the time at specific points. After the function completes, the timings are reported in a human-readable format. This design is inspired by C++-style instrumentation, letting developers place markers directly in the code where they are most informative.

But because we are in Python, to handle nested function calls we collect the timing information as a stack (bottom-up call graph) and parse the result at the end of the decorated function. Here is an illustration:

@benchmark
def outer_func_with_mark(par):
    mark("Outer func start")
    inner_func_with_mark(par)  # <- this does `dot` and is also decorated
    dist_arr = DistributedArray(global_shape=par['global_shape'],
                                partition=par['partition'],
                                dtype=par['dtype'], axis=par['axis'])
    dist_arr + dist_arr
    mark("Outer func ends")

The text output is

[decorator]outer_func_with_mark: total runtime: 0.001206 s
 [decorator]inner_func_with_mark: total runtime: 0.000351 s
 Begin array constructor-->Begin dot: 0.000026 s
 Begin dot-->Finish dot: 0.000322 s
 Outer func start-->Outer func ends: 0.001202 s

Benchmarking is controlled via the environment variable BENCH_PYLOPS_MPI. It defaults to 1 (enabled) but can be set to 0 to skip benchmarking for clean output. This means users can leave the decorated code unchanged and disable the benchmark through the environment variable, much like a C++ debug flag set at compile time. Moreover, careful attention had to be paid to the concurrency issues of benchmarking, because the time is recorded by the CPU while NCCL issues operations asynchronously to a CUDA stream; PR #163 is an example of this.

Benchmark Result

This was the moment of truth. Our 12 weeks of hard work would be judged by a set of cold, hard numbers. Our expectation was that:

If the system does not have proprietary NVLink for GPU-GPU communication but is NCCL-compatible, communication using CuPy + NCCL should still be faster than NumPy + MPI (and possibly CuPy + MPI) in PyLops-MPI, i.e., there should be a benefit from the communication-related optimizations enabled by this project.

The result below is from the NCSA UIUC Delta system with 4-way NVIDIA A40 GPUs (no NVLink), using the allreduce operation.

That meets our expectation. One thing to note here is that CuPy + MPI communication is actually slower than NumPy + MPI. This is because the current implementation of PyLops-MPI uses the non-buffered calls of mpi4py (see detail here). That choice was made for its simplicity: it allows sending and receiving generic Python objects wrapped in a list, which made for a fast development process. However, these calls require a memory copy from GPU to CPU, the communication itself, and a memory copy from CPU back to GPU (pickle protocol); see our discussion with the mpi4py community here. This leads us to the “Things left to do” section (later).

If the system has an NVLink for GPU-GPU communication, we will be able to see a significant gain in performance of PyLops-MPI with NCCL.

The result below is also from the NCSA UIUC Delta system, this time with 8-way NVIDIA H200 GPUs (with NVLink), but we only use 4 GPUs to compare with the previous result. This is also with the allreduce operation.

Here we unleash the true power of NCCL and its infrastructure: the bandwidth of PyLops-MPI with NCCL is 800x that of the MPI implementation! It may not make much sense to compare this number with NumPy + MPI because there is a drastic hardware infrastructure upgrade involved.

To top things off, we also ran an experiment trying to saturate the communication, with the array size going up to 32 GB in total. We see linear scaling, i.e., time grows linearly with data size.

Finally, we ran an experiment with the application of Least-squares Migration, which is an iterative inversion scheme:

Each iteration applies a forward A and an adjoint A.T operation to form residuals and gradients.
Gradient accumulation requires a global reduction across processes with allreduce.

Note that the computation is not trivial, so the total run-times of CPU and GPU are not fairly comparable (notice that on H200, CuPy+MPI is not the slowest anymore). But we want to give an idea of how things piece together in a real application.
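The structure of such an iteration (forward, adjoint, then a global reduction of the gradient) can be shown on a toy least-squares problem. In this sketch plain gradient descent with a fixed step is swapped in for the actual solver, the "ranks" are simulated in one process, and the matrix, step size, and helper names are all made up for illustration.

```python
def matvec(rows, x):
    """Forward: A @ x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in rows]


def rmatvec(rows, r):
    """Adjoint: A.T @ r for a matrix stored as a list of rows."""
    n = len(rows[0])
    return [sum(rows[i][j] * r[i] for i in range(len(rows))) for j in range(n)]


def allreduce_sum(per_rank_vectors):
    """Stand-in for allreduce: every rank ends up with the elementwise sum."""
    return [sum(col) for col in zip(*per_rank_vectors)]


# Each "rank" owns some rows of A and the matching entries of the data d.
A_parts = [[[2.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
x_true = [1.0, -2.0]
d_parts = [matvec(A, x_true) for A in A_parts]

x = [0.0, 0.0]
step = 0.1  # hypothetical fixed step size, small enough for this A
for _ in range(500):
    local_grads = []
    for A, d in zip(A_parts, d_parts):
        residual = [p - q for p, q in zip(matvec(A, x), d)]  # forward: A x - d
        local_grads.append(rmatvec(A, residual))             # adjoint: A.T r
    grad = allreduce_sum(local_grads)                        # global reduction
    x = [xi - step * gi for xi, gi in zip(x, grad)]
```

Every iteration ends with one reduction of a model-sized vector, which is why the allreduce bandwidth measured above matters for the end-to-end run-time.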

The impact of this GSoC project is clear:

With our NCCL-enabled PyLops-MPI,

if you don’t have access to state-of-the-art infrastructure, PyLops-MPI with NCCL can still deliver 10x the communication bandwidth (A40 case)
if you do, we allow you to get the most out of the system (H200 case).

And the best part is that using NCCL with PyLops-MPI requires minimal code changes, as shown in this LSM Tutorial and illustrated below. Only two changes are required compared with code that runs on MPI: the arrays must be allocated on the GPU, and the NCCL communicator has to be passed to the DistributedArray. And that’s it!

nccl_comm = pylops_mpi.utils._nccl.initialize_nccl_comm()

# <snip - same set-up as running with MPI>

lsm = LSM(
    # <snip>
    cp.asarray(wav.astype(np.float32)),  # Copy to GPU
    # <snip>
    engine="cuda",
    dtype=np.float32
)
lsm.Demop.trav_srcs = cp.asarray(lsm.Demop.trav_srcs.astype(np.float32))  # Copy to GPU
lsm.Demop.trav_recs = cp.asarray(lsm.Demop.trav_recs.astype(np.float32))  # Copy to GPU

x0 = pylops_mpi.DistributedArray(VStack.shape[1],
                                 partition=pylops_mpi.Partition.BROADCAST,
                                 base_comm_nccl=nccl_comm,  # Explicitly pass nccl communicator
                                 engine="cupy")  # Must use CuPy
# <snip - the rest is the same>

Things left to do

CUDA-Aware MPI: As we pointed out in the A40 experiment, the current implementation of PyLops-MPI uses non-buffered calls of mpi4py and thus introduces memory copies from GPU to CPU. We aim to optimize this by introducing buffered calls. However, this is not a trivial task, because some of the MPI-related code was developed on the assumption that the communication returns a list object, whereas a buffered call returns an array instead.

Wrapping Up KALLM

Wed, 03 Sep 2025 00:00:00 +0000

Large language models today look complicated, but if you peel back the layers, most of what you see is old technology: stacks of linear transformations. The Transformer architecture, the engine behind GPTs and their cousins, is often described as revolutionary. Yet the majority of its parameters are standard linear layers, the same kind of matrix multiplications you would find in a simple multilayer perceptron from the 1980s. For years these layers have gone unchallenged. They are fast, they scale, and they work. But maybe the time has come to ask: can we do better than linear?

This project explored exactly that. Instead of leaving those layers untouched, we tried replacing them with a more mathematically structured alternative: Kolmogorov–Arnold Networks (KANs). The result is a working language model—SmolLM2, a 135-million-parameter Transformer—where the final feedforward blocks no longer consist of brute-force linear weights, but of compact polynomial-based functions. And the striking fact is that performance remained within the baseline range. Smaller KANs managed to match larger linear layers, showing that smarter mathematics can stand shoulder to shoulder with the workhorse of deep learning.

Transformers

To understand the significance, let’s revisit what a Transformer actually is.
A Transformer block has two main components: attention and feedforward. The attention mechanism computes how each word in a sentence relates to every other word. That is the clever part, and it is what made Transformers famous. But once attention finishes its work, the output is passed into a feedforward network. And this feedforward network is essentially two large linear layers, stacked with a nonlinearity between them.

Now stacking thirty such blocks yields a complete model like SmolLM2. Look at the parameter counts and you see a pattern: attention is not the main consumer. It’s the feedforward layers. They dominate memory and computation, making them the primary target for efficiency gains.

What Are Kolmogorov–Arnold Networks?

So what happens if, instead of a giant matrix multiplication, we try something more structured? Enter Kolmogorov–Arnold Networks.

KANs are built on a mathematical theorem from the mid-20th century, the Kolmogorov–Arnold representation theorem, which proved that any continuous multivariate function can be written using only sums and compositions of univariate functions. Instead of mixing all inputs together at once, you treat each input dimension separately, applying a small nonlinear function, and then recombine. The beauty is that these univariate functions can be simple but expressive (like splines or polynomials), and yet, when summed, they approximate very complex mappings.

Think of a KAN layer as a set of individual univariate modules. Each one takes a single variable, bends it according to a chosen basis (polynomials, splines, etc.), and then all those bent versions are added up to produce the output. The richness of the final function depends on two factors:

Choice of basis: You can bend with Chebyshev polynomials, with Legendre polynomials, with B-splines, or with other families.
Degree: This is how many bends you allow. A degree-1 polynomial is just a line. Degree-2 can capture curves. Higher degrees capture higher-order oscillatory components.

A Chebyshev polynomial of the second kind, degree 2, is one such basis. Unlike a simple quadratic, it has roots and oscillations that make it particularly good at spanning function space efficiently. This efficiency explains its favorable performance in our experiments: low degree means fewer parameters, but Chebyshev’s properties let it approximate more than you might expect from so few numbers.
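A minimal sketch of such a layer helps make the description concrete. The code below evaluates Chebyshev polynomials of the second kind with the standard recurrence (so U_2(x) = 4x² − 1) and sums the weighted basis values over inputs. It is an illustration rather than the project's implementation: the function names and the coefficient layout are hypothetical, and squashing inputs with tanh is an assumption made to keep them inside [-1, 1].

```python
import math


def chebyshev_u(x, degree):
    """Return [U_0(x), ..., U_degree(x)], Chebyshev polynomials of the second kind."""
    values = [1.0, 2.0 * x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])  # U_{n+1} = 2x U_n - U_{n-1}
    return values[: degree + 1]


def kan_layer(inputs, coeffs, degree=2):
    """Apply one KAN layer.

    coeffs[j][i][k] is the weight of U_k on the edge from input i to output j.
    Inputs are squashed with tanh so they fall in [-1, 1], the Chebyshev domain.
    """
    basis = [chebyshev_u(math.tanh(x), degree) for x in inputs]
    return [
        sum(c * b for edge, feats in zip(row, basis) for c, b in zip(edge, feats))
        for row in coeffs
    ]
```

For example, `kan_layer([0.0, 0.0], [[[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]])` maps two inputs to one output using one coefficient triple per edge. In this layout a layer holds `outputs × inputs × (degree + 1)` coefficients, so the degree directly scales the parameter count.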

Why Small Can Beat Big

Linear layers require many parameters because they treat every input–output mapping as arbitrary. KANs assume smoothness: each input passes through a compact polynomial basis before recombination. This structure captures useful patterns with fewer parameters.

A degree-2 Chebyshev basis, for example, encodes curvature and oscillation efficiently. While a linear layer of the same size must spend parameters to approximate these effects, the polynomial basis includes them inherently. The result is comparable expressivity with fewer parameters. In language tasks where patterns are often smooth or compositional, this structured efficiency translates into competitive accuracy at lower cost.

Baselines, Modifications, and Comparisons

Here’s what we actually tested, in plain language:

The untouched baseline: a pretrained SmolLM2, with all thirty blocks intact.
Linear restart: the same pretrained model, but the last five feedforward modules were thrown away and replaced with freshly initialized linear ones. These then had to be trained again.
KAN replacement: again, take the pretrained model, cut off the last five feedforward modules, and put in new KAN modules instead—specifically, Chebyshev of the second kind, degree 2.

In all three cases, the backbone of the model—the embeddings, the attention layers, and the first twenty-five blocks—was left untouched. Only the tail was modified. This design allowed us to test transfer learning: would the pretrained parts of the model still play nicely with the new pieces? The answer is yes. The attention layers and other linear projections adapted seamlessly, proving that KANs can be swapped in without destabilizing the whole system.

Training was done on the smol-smoltalk dataset, a small-scale dialogue corpus used for both pretraining and fine-tuning. After training, all models were evaluated on the same subset of BIG-Bench Hard tasks.

Results

The baseline was the pretrained SmolLM2 without modification. It achieved an average accuracy of 22.5%, using 134M parameters. This experiment has a single measurement because no training was applied. The rest of the experiments were run with 3 random seeds.

When retrained with linear replacements, the model reached an average accuracy of 43.8%, with 46M trainable parameters (only last 5 blocks are active) and 5.87 GB VRAM total usage.

Replacing the last five feedforward blocks with Kolmogorov–Arnold Networks produced an average accuracy of 44.1%, with 39M parameters and 5.86 GB VRAM usage. The memory consumption of KAN layers is a subject that requires further optimization.

In short, KANs matched or slightly exceeded the reinitialized linear baseline in accuracy, while using fewer parameters and slightly less memory. This demonstrates that structured polynomial layers can substitute for large linear layers without degrading reasoning performance.

Why Transfer Learning Works So Well

One of the surprising outcomes is how cleanly the pretrained Transformer integrates with KANs. Remember: only the feedforward modules in the last five blocks were replaced. All the other linear layers—embedding projections, attention queries, keys, and values, output heads—remained untouched. They continue to function as before. The new KAN blocks slot right in, adapt during training, and the system as a whole behaves coherently.

That tells us something important. The standard Transformer does not depend on linearity per se in those positions. What it depends on is a nonlinear transformation with enough expressive power. KANs provide that power, just in a different mathematical form. Which means: any pretrained Transformer can, in principle, be retrofitted with KANs in the feedforward slots, with no need to start from scratch.

Looking Ahead: Mixing Polynomial Bases

So far we only tested one family, Chebyshev-2. But the architecture is more general. Each KAN block can in fact host multiple polynomial families in parallel, or stack them in sequence.

Parallel: imagine splitting the input across several channels, each processed by a different basis. The outputs are then recombined. This way, one basis covers the smooth global structure, while another captures edge effects or oscillations.
Sequential: here, the output of one polynomial transformation becomes the input of another. You can think of it as layering function approximations, where the second basis corrects the limitations of the first. For example, a spline might give you piecewise smoothness, then a Chebyshev layer on top could adjust the global shape.

Both strategies were implemented and promise to extract more expressivity per parameter. Instead of simply making the networks bigger, we can make them smarter, combining the strengths of different mathematical families. That will be the focus of future work.

Conclusion

The main lesson is this: language models do not need to be built entirely from massive linear matrices. By replacing just a handful of those matrices with compact Kolmogorov–Arnold modules, we achieved the same reasoning accuracy with fewer parameters and less memory. Transfer learning works cleanly. The architecture adapts. And the door is now open to rethink what belongs inside a Transformer block.

KANs are not just a theoretical curiosity. They are practical, efficient, and compatible with modern large language models. This project showed that replacing linear with polynomial is not only possible, it is competitive. The next step is to push combinations, explore scaling, and see just how far this mathematical alternative can take us.

Final Report: MPI Appliance for HPC Research on Chameleon

Mon, 01 Sep 2025 00:00:00 +0000

Hi everyone! This is my final report for the project I completed this summer as a Summer of Reproducibility (SOR) student. The project, titled “MPI Appliance for HPC Research on Chameleon,” was undertaken in collaboration with Argonne National Laboratory and the Chameleon Cloud community, and was mentored by Ken Raffenetti. This blog details the work and outcomes of the project.

Background

Message Passing Interface (MPI) is the backbone of high-performance computing (HPC), enabling efficient scaling across thousands of processing cores. However, reproducing MPI-based experiments remains challenging due to dependencies on specific library versions, network configurations, and multi-node setups.

To address this, we introduce a reproducibility initiative that provides standardized MPI environments on the Chameleon testbed. This is set up as a master–worker MPI cluster. The master node manages tasks and communication, while the worker nodes do the computations. All nodes have the same MPI libraries, software, and network settings, making experiments easier to scale and reproduce.

Objectives

The aim of this project is to create an MPI cluster that is reproducible, easily deployable, and efficiently configurable.

The key objectives of this project were:

Pre-built MPI Images: Create ready-to-use images with MPI and all dependencies installed.
Automated Cluster Configuration: Develop Ansible playbooks to configure master–worker communication, including host setup, SSH key distribution, and MPI configuration across nodes.
Cluster Orchestration: Develop an orchestration template to provision resources and invoke Ansible playbooks for automated cluster setup.

Implementation Strategy and Deliverables

Openstack Image Creation

The first step was to create a standardized pre-built image, which serves as the base image for all nodes in the cluster.

Some important features of the image include:

Built on Ubuntu 22.04 for a stable base environment.
Spack + Lmod integration:
- Spack handles reproducible, version-controlled installations of software packages.
- Lmod (Lua Modules) provides a user-friendly way to load/unload software environments dynamically.
- Together, they allow users to easily switch between MPI versions, libraries, and GPU toolkits.
MPICH and OpenMPI are pre-installed for standard MPI support and can be loaded or unloaded as modules.
Three image variants for various HPC workloads: CPU-only, NVIDIA GPU (CUDA 12.8), and AMD GPU (ROCm 6.4.2).

These images have been published and are available in the Chameleon Cloud Appliance Catalog:

MPI and Spack for HPC (Ubuntu 22.04) - CPU Only
MPI and Spack for HPC (Ubuntu 22.04 - CUDA) - NVIDIA GPU (CUDA 12.8)
MPI and Spack for HPC (Ubuntu 22.04 - ROCm) - AMD GPU (ROCm 6.4.2)

Cluster Configuration using Ansible

The next step is to create scripts/playbooks to configure these nodes and set up an HPC cluster. We assigned specific roles to different nodes in the cluster and combined them into a single playbook to configure the entire cluster automatically.

Some key steps the playbook performs:

Configure /etc/hosts entries for all nodes.
Mount Manila NFS shares on each node.
Generate an SSH key pair on the master node and add the master’s public key to the workers’ authorized_keys.
Scan worker node keys and update known_hosts on the master.
(Optional) Manage software:
- Install new compilers with Spack
- Add new Spack packages
- Update environment modules to recognize them
Create a hostfile at /etc/mpi/hostfile.
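
As a rough illustration of two of the files the playbook produces, the Python sketch below renders /etc/hosts entries and an MPI hostfile from a node map. The node names, IP addresses, and slot count are hypothetical, and the playbook in the repository may generate these files differently. The two hostfile syntaxes shown are the ones accepted by MPICH's Hydra launcher (`host:slots`) and by Open MPI (`host slots=N`).

```python
from typing import Dict, List


def hosts_entries(nodes: Dict[str, str]) -> List[str]:
    """Render one /etc/hosts line per node in the form '<ip> <hostname>'."""
    return [f"{ip} {name}" for name, ip in nodes.items()]


def mpi_hostfile(nodes: Dict[str, str], slots: int, flavor: str = "mpich") -> str:
    """Render a hostfile listing every node with its number of process slots."""
    lines = []
    for name in nodes:
        if flavor == "mpich":
            lines.append(f"{name}:{slots}")        # Hydra syntax
        elif flavor == "openmpi":
            lines.append(f"{name} slots={slots}")  # Open MPI syntax
        else:
            raise ValueError(f"unknown MPI flavor: {flavor}")
    return "\n".join(lines) + "\n"


# Hypothetical three-node cluster used only for illustration.
nodes = {"master": "10.0.0.10", "worker-1": "10.0.0.11", "worker-2": "10.0.0.12"}
```

Calling `mpi_hostfile(nodes, slots=4)` yields the text that would be written to a path such as /etc/mpi/hostfile.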

The code is publicly available and can be found on the GitHub repository: https://github.com/rohanbabbar04/MPI-Spack-Experiment-Artifact

Orchestration

With the image now created and deployed, and the Ansible scripts ready for cluster configuration, we put everything together to orchestrate the cluster deployment.

This can be done in two primary ways:

Python-CHI (Jupyter) + Ansible

Python-CHI is a Python library designed to facilitate interaction with the Chameleon testbed. It is often used within environments like Jupyter notebooks.

The workflow is as follows:

Create leases, launch instances, and set up shared storage using python-chi commands.
Automatically generate inventory.ini for Ansible based on launched instances.
Run Ansible playbook programmatically using ansible_runner.
Outcome: fully configured, ready-to-use HPC cluster; SSH into master to run examples.
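
The inventory-generation step can be sketched as follows. This is a hedged illustration rather than the code from the Trovi artifact: the group names (`master`, `workers`), the host aliases, and the default `cc` login user are assumptions, and in a real notebook the addresses would come from the instances launched with python-chi.

```python
from typing import List


def render_inventory(master_ip: str, worker_ips: List[str], user: str = "cc") -> str:
    """Render an INI-style Ansible inventory with one master and N workers."""
    lines = [
        "[master]",
        f"master ansible_host={master_ip} ansible_user={user}",
        "",
        "[workers]",
    ]
    for index, ip in enumerate(worker_ips, start=1):
        lines.append(f"worker-{index} ansible_host={ip} ansible_user={user}")
    return "\n".join(lines) + "\n"


def write_inventory(path: str, text: str) -> None:
    """Write the rendered inventory to disk, for example as inventory.ini."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
```

The resulting file could then be passed to `ansible_runner.run(...)` through its `private_data_dir`, `playbook`, and `inventory` arguments; the playbook name depends on the repository layout and is not shown here.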

If you would like to see a working example, one is available on Trovi.

Heat Orchestration Template

A Heat Orchestration Template (HOT) is a YAML-based configuration file. Its purpose is to define a stack that automates the deployment and configuration of OpenStack cloud resources.

Challenges

We faced some challenges while working with Heat templates and stacks, particularly on Chameleon Cloud:

OS::Nova::KeyPair (new version): In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.
OS::Heat::SoftwareConfig: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.

To tackle these challenges, we designed an approach that is both easy to implement and reproducible. First, we launch instances by provisioning master and worker nodes using the HOT template in OpenStack. Next, we set up a bootstrap node, install Git and Ansible, and run an Ansible playbook from the bootstrap node to configure the master and worker nodes, including SSH, host communication, and MPI setup. The outcome is a fully configured, ready-to-use HPC cluster, where users can simply SSH into the master node to run examples.

Users can view and use the template published in the Appliance Catalog: MPI+Spack Bare Metal Cluster. A demonstration of how to pass parameters is available on Trovi.

Conclusion

In conclusion, this work demonstrates a reproducible approach to building and configuring MPI clusters on the Chameleon testbed. By using standardized images, Ansible automation, and Orchestration Templates, we ensure that every node is consistently set up, reducing manual effort and errors. The artifact, published on Trovi, makes the entire process transparent, reusable, and easy to implement, enabling users/researchers to reliably recreate and extend the cluster environment for their own experiments.

Future Work

Future work includes maintaining these images and possibly creating a script to reproduce the MPI and Spack setup on a different base image.

Final Update(Mid-Term -> Final): MPI Appliance for HPC Research on Chameleon

Sun, 31 Aug 2025 00:00:00 +0000

Hi everyone! This is my final update, covering the progress made every two weeks from the midterm to the end of the project MPI Appliance for HPC Research on Chameleon, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community. This blog follows up on my earlier post, which you can find here.

🔧 July 29 – August 11, 2025

With the CUDA- and MPI-Spack–based appliances published, we considered releasing another image variant (ROCm-based) for AMD GPUs. This will be primarily used in CHI@TACC, which provides AMD GPUs. We have successfully published a new image on Chameleon titled MPI and Spack for HPC (Ubuntu 22.04 - ROCm), and we also added an example to demonstrate its usage.

🔧 August 12 – August 25, 2025

With the examples now available on Trovi for creating an MPI cluster using Ansible and Python-CHI, my next step was to experiment with stack orchestration using Heat Orchestration Templates (HOT) on OpenStack Chameleon Cloud. This turned out to be more challenging due to a few restrictions:

OS::Nova::KeyPair (new version): In the latest OpenStack version, the stack fails to launch if the public_key parameter is not provided for the keypair, as auto-generation is no longer supported.
OS::Heat::SoftwareConfig: Deployment scripts often fail, hang, or time out, preventing proper configuration of nodes and causing unreliable deployments.

To address these issues, we adopted a new strategy for configuring and creating the MPI cluster: using a temporary bootstrap node.

In simple terms, the workflow of the Heat template is:

Provision master and worker nodes via the HOT template on OpenStack.
Launch a bootstrap node, install Git and Ansible on it, and then run an Ansible playbook from the bootstrap node to configure the master and worker nodes. This includes setting up SSH, host communication, and the MPI environment.

This provides an alternative method for creating an MPI cluster.

We presented this work on August 26, 2025, to the Chameleon Team and the Argonne MPICH Team. The project was very well received.

Stay tuned for my final report on this work, which I’ll be sharing in my next blog post.

[Final Blog] Distrobench: Distributed Protocol Benchmark

Sat, 30 Aug 2025 00:00:00 +0000

Introduction

This is the final blog for our contribution to the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges project under the mentorship of Fadhil Kurnia for the OSRE program.

Distrobench is a framework for evaluating the performance of replication and coordination protocols for distributed systems. It standardizes benchmarking by allowing different protocols to be tested under an identical workload, and it supports both local and remote deployment of the protocols. The systems tested are restricted to key-value store applications and are categorized by consistency model, programming language, and persistency (whether the system stores its data in memory or on disk).

All benchmark results are stored in a data.json file, which can be viewed through a webpage we provide. A user can clone the git repository, benchmark different protocols on their own machine or on a cluster of remote machines, and then view the results locally. We also provide a webpage that shows our own benchmark results, which were run on 3 Amazon EC2 t2.micro instances.

How to run a benchmark on Distrobench

Before running a benchmark with Distrobench, the protocol to be benchmarked must first be built. This allows the script to initialize the protocol instance for a local benchmark or to send the binaries to the remote machine. The remote machine running the protocol does not need to store the code of the protocol implementation, but it does need the dependencies for running that specific protocol, such as Java, Docker, or rsync. The following commands build the ailidani/paxi project, which needs no additional dependencies to run on a remote machine:

# Clone the Distrobench repository 
git clone git@github.com:fadhilkurnia/distro.git

# Clone the Paxi repository and build the binary 
cd distro/sut/ailidani.paxi
git clone git@github.com:ailidani/paxi.git
cd paxi/bin/
./build.sh

# Go back to the Distrobench root directory & run python script 
cd ../../../..
python main.py

By default, the script starts 3 local instances of the Paxi protocol implementation that the user selects through the CLI. The user can change the number of running instances, and whether they are deployed locally or on remote machines, by editing the .env file in the root directory. The following is the content of the default .env file:

NUM_OF_NODES=3

SSH_KEY=ssh-key.pem
REMOTE_USERNAME=ubuntu

PUBLIC_IP1=127.0.0.1
PUBLIC_IP2=127.0.0.1
PUBLIC_IP3=127.0.0.1

PRIVATE_IP1=127.0.0.1
PRIVATE_IP2=127.0.0.1
PRIVATE_IP3=127.0.0.1

CLIENT_IP=127.0.0.1

OUTPUT=data.json

When running a remote benchmark, an SSH key should also be added to the root directory to allow the use of ssh and rsync from within the Python script. All machines must also allow TCP connections on ports 2000-2300 and 3000-3300, because those are the port ranges used for communication between the running instances as well as for the YCSB benchmark. Running the benchmark requires at least 3 nodes, because that is the minimum number of nodes needed to support most protocols (5 nodes are recommended).
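
The configuration rules above can be sketched in a few lines of Python. This is an illustrative reading of the .env file, not the parsing code from main.py: the helper names are hypothetical, and treating a run as local when every public IP is 127.0.0.1 is an assumption.

```python
from typing import Dict, List

DEFAULT_ENV = """\
NUM_OF_NODES=3
SSH_KEY=ssh-key.pem
REMOTE_USERNAME=ubuntu
PUBLIC_IP1=127.0.0.1
PUBLIC_IP2=127.0.0.1
PUBLIC_IP3=127.0.0.1
PRIVATE_IP1=127.0.0.1
PRIVATE_IP2=127.0.0.1
PRIVATE_IP3=127.0.0.1
CLIENT_IP=127.0.0.1
OUTPUT=data.json
"""


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def node_ips(env: Dict[str, str], prefix: str) -> List[str]:
    """Collect PREFIX1..PREFIXn, enforcing the 3-node minimum from the text."""
    count = int(env["NUM_OF_NODES"])
    if count < 3:
        raise ValueError("at least 3 nodes are required")
    return [env[f"{prefix}{i}"] for i in range(1, count + 1)]


def is_local(env: Dict[str, str]) -> bool:
    """Assumption: a benchmark is local when every public IP is the loopback address."""
    return all(ip == "127.0.0.1" for ip in node_ips(env, "PUBLIC_IP"))
```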

To view the benchmark result in the web page locally, move data.json into the docs/ directory and run python -m http.server 8000. The page is then accessible through http://localhost:8000.

Deep dive on how Distrobench works

The following is the project structure of the Distrobench repository:

distro/
├── main.py // Main python script for running benchmark
├── data.json // Output file for main.py
├── README.md
├── .env // Config for running the benchmark
├── docs/
│ ├── index.html // Web page to show benchmark results
│ ├── data.json // Output file displayed by web page
│ └── README.md
├── src/
│ ├── utils/
│ └── ycsb/ // Submodule for YCSB
└── sut/ // Systems under test
 ├── ailidani.paxi/
 │ └── run.py // Protocol-specific benchmark script called by main.py
 ├── apache.zookeeper/
 ├── etcd-io.etcd/
 ├── fadhilkurnia.xdn/
 ├── holipaxos-artifect.holipaxos/
 ├── otoolep.hraftd/
 └── tikv.tikv/

main.py will automatically detect directories inside sut/ and will call the main function inside run.py. The following is the structure of run.py written in pseudocode style:

FUNCTION main(run_ycsb: Function, nodes: List of Nodes, ssh: Dictionary)
    node_data = map_ip_port(nodes)

    SWITCH user_input
        CASE 0:
            start()
            RETURN
        CASE 1:
            stop()
            RETURN
        CASE 2:
            client_data = []
            FOR EACH item IN node_data
                ADD item.client_addr TO client_data
            END FOR
            run_ycsb(client_data)
            RETURN
    END SWITCH
END FUNCTION

FUNCTION start()
    // Start the protocol instance (local or remote)
END FUNCTION

FUNCTION stop()
    // Stop the protocol instance (local or remote)
END FUNCTION

FUNCTION map_ip_port(nodes: List of Nodes) -> List of Dictionary
    // Generate port numbers based on the protocol requirements
END FUNCTION
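
The same skeleton could look roughly like this in Python. It is a sketch under assumptions, not a copy of any run.py in the repository: nodes are taken to be plain IP strings, the port numbering (2000 and 3000 plus the node index) is hypothetical, and the optional `choice` parameter is added here only so the menu can be driven without keyboard input.

```python
from typing import Callable, Dict, List, Optional

BASE_CLIENT_PORT = 2000  # hypothetical: start of the client-facing port range
BASE_PEER_PORT = 3000    # hypothetical: start of the inter-node port range


def map_ip_port(nodes: List[str]) -> List[Dict[str, str]]:
    """Assign one client address and one peer address to every node."""
    return [
        {
            "ip": ip,
            "client_addr": f"{ip}:{BASE_CLIENT_PORT + index}",
            "peer_addr": f"{ip}:{BASE_PEER_PORT + index}",
        }
        for index, ip in enumerate(nodes)
    ]


def start(node_data: List[Dict[str, str]], ssh: Dict[str, str]) -> None:
    """Placeholder: write configs, copy binaries if remote, start the instances."""


def stop(node_data: List[Dict[str, str]], ssh: Dict[str, str]) -> None:
    """Placeholder: kill the protocol processes and optionally clean up."""


def main(run_ycsb: Callable[[List[str]], None], nodes: List[str],
         ssh: Dict[str, str], choice: Optional[int] = None) -> None:
    node_data = map_ip_port(nodes)
    if choice is None:
        choice = int(input("0) start  1) stop  2) benchmark: "))
    if choice == 0:
        start(node_data, ssh)
    elif choice == 1:
        stop(node_data, ssh)
    elif choice == 2:
        # Hand only the client-facing addresses to the YCSB runner.
        run_ycsb([item["client_addr"] for item in node_data])
```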

The .env file provides both public and private IP addresses to add versatility when running a remote benchmark. Private IP is used for communication between remote machines if they are under the same network group. In the case of our own benchmark, four t2.micro EC2 instances are deployed under the same network group. Three of them are used to run the protocol and the fourth machine acts as the YCSB client. It is possible to use your local machine as the YCSB client instead of through another remote machine by specifying CLIENT_IP in the .env file as 127.0.0.1. The decision to use the remote machine as the YCSB client is made to reduce the impact of network latency between the client and the protocol servers to a minimum.

The main tasks of the start() function can be broken down into the following:

Generate custom configuration files for each remote machine instance (this may differ between implementations: some do not require a config file because they support flag parameters out of the box, while others require multiple configuration files for each instance)
rsync binaries into the remote machine (If running a remote benchmark)
Start the instances

The stop() function is much simpler: it only kills the process running the protocol and optionally removes the copied binary files from the remote machine. The run_ycsb() function passed to run.py is defined in main.py and currently supports two types of workload:

Read-heavy: A single-client workload with 95% read and 5% update (write) operations
Update-heavy: A single-client workload with 50% read and 50% update (write) operations

A new workload can be added inside the src/ycsb/workloads directory. Both workloads above only run 1000 operations for the benchmark which may not be enough operations to properly evaluate the performance of the protocols. It should also be noted that while YCSB does support a scan operation, it is never used for our benchmark because none of our tested protocols implement this operation.
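
As a hedged illustration of what such a workload file contains, the sketch below renders the two workloads as YCSB CoreWorkload properties. The property names are standard YCSB ones; the helper name and the default record count are assumptions, and the files shipped in src/ycsb/workloads may set additional properties.

```python
def ycsb_workload(read: float, update: float,
                  operations: int = 1000, records: int = 1000) -> str:
    """Render a YCSB CoreWorkload properties file as text."""
    if abs(read + update - 1.0) > 1e-9:
        raise ValueError("read and update proportions must sum to 1")
    props = {
        "workload": "site.ycsb.workloads.CoreWorkload",
        "recordcount": records,        # assumption: not stated in the text
        "operationcount": operations,  # 1000 operations, as described above
        "readproportion": read,
        "updateproportion": update,
        "scanproportion": 0,           # scans are never used in this benchmark
        "insertproportion": 0,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


read_heavy = ycsb_workload(0.95, 0.05)
update_heavy = ycsb_workload(0.5, 0.5)
```

Raising `operations` is the simplest way to address the concern that 1000 operations may be too few for a meaningful comparison.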

How to implement a new protocol in Distrobench

Adding a new protocol to Distrobench requires implementing two main components: a Python integration script (run.py) and a YCSB database binding for benchmarking.

Create the protocol directory structure
- Create a new directory under sut/ using format yourrepo.yourprotocol/.
Write run.py integration
- Put script inside yourrepo.yourprotocol/ directory
- Must have the main(run_ycsb, nodes, ssh) function.
- Add start/stop/benchmark menu options
- Handle local (127.0.0.1) and remote deployment
Create YCSB client
- Make Java class extending YCSB’s DB class
- Put inside src/ycsb/yourprotocol/src/main/java/site/ycsb/yourprotocol
- Implement read(), insert(), update(), delete() methods
Register your client
- Register your client to src/pom.xml, src/ycsb/bin/binding.properties, and src/ycsb/bin/ycsb.
Build and test
- Run cd src/ycsb && mvn clean package
- Run python main.py
- Select your protocol and test it

Protocols which have been tested

Distrobench has tested 20 different distributed consensus protocols across 7 different implementation projects.

ailidani/paxi
- Programming Language : Go
- Persistency : On-Disk
- Consistency Model : Linearizability, Eventual
- Protocol : Paxos, EPaxos, SDpaxos, WPaxos, ABD, chain, VPaxos, WanKeeper, KPaxos, Paxos_groups, Dynamo, Blockchain, M2Paxos, HPaxos.
apache/zookeeper
- Programming Language : Java
- Persistency : On-Disk
- Consistency Model : Linearizability + Primary Integrity
- Protocol : ZooKeeper implements ZAB (ZooKeeper Atomic Broadcast)
etcd-io/etcd
- Programming Language : Go
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Raft
fadhilkurnia/xdn
- Programming Language : Java, Rust
- Persistency : On-Disk
- Consistency Model : Linearizability, Linearizability + Primary Integrity
- Protocol : Gigapaxos
Zhiying12/holipaxos-artifect
- Programming Language : Go, Rust
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Holipaxos, Omnipaxos, Multipaxos
otoolep/hraftd
- Programming Language : Go
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Raft
tikv/tikv
- Programming Language : Rust
- Persistency : On-Disk
- Consistency Model : Linearizability
- Protocol : Raft

Challenges

When attempting to benchmark HoliPaxos, the main challenge was handling versions that rely on persistent storage with RocksDB. Since some implementations are written in Go, it was necessary to find compatible versions of RocksDB and gRocksDB (for example, RocksDB 10.5.1 works with gRocksDB 1.10.2). Another difficulty was that RocksDB is resource-intensive to compile, and in our project we did not have sufficient CPU capacity on the remote machine to build RocksDB and run remote benchmarks.
Some projects did not compile successfully at first and required minor modifications to run.

Conclusion and future improvements

The current benchmark results show the performance of all the mentioned protocols in terms of throughput and benchmark runtime. The results are subject to revision because they may not reflect the best performance of the protocols, due to unoptimized deployment scripts. We are also planning to switch to a more powerful EC2 instance type because t2.micro does not have enough resources to support the use of RocksDB or TiKV.

In the near future, additional features will be added to Distrobench such as:

Multi-Client Support: The YCSB client will start multiple clients which will send requests in parallel to different servers in the group.
Commit Versioning: Allows the labelling of all benchmark results with the commit hash of the protocol’s repository version. This allows comparing different versions of the same project.
Adding more Primary-Backup, Sequential, Causal, and Eventual consistency protocols: Implementations that support a consistency model other than linearizability and also provide an existing key-value store application are notoriously difficult to find.
Benchmark on node failure
Benchmark on the addition of a new node

Final Blog: Improving Usability and Performance in cc-snapshot

Sun, 24 Aug 2025 00:00:00 +0000

My name is Zahra Temori, and I’m thrilled to collaborate with mentor Paul Marshall during this summer on the cc-snapshot project.

Introduction

Reproducibility is an important concept in high performance computing and research. It ensures that experiments can be repeated, validated, and extended with confidence. Achieving a reproducible environment requires identical software stacks, with the exact same dependencies, and configuration. The Chameleon Cloud testbed provides the cc-snapshot tool to support reproducibility by capturing the complete state of a running system. This allows researchers to rerun experiments exactly as before, share setups among each other, and avoid potential environmental issues such as missing dependencies or version mismatches. In this work, we explore how to enhance snapshotting as a reproducible method and make it an effective strategy for HPC research.

Key Achievements

The project was divided into two phases. The first phase focused on usability, reorganizing the tool, and expanding its capabilities. The second phase focused on benchmarking alternative image formats and compression methods to improve snapshotting performance.

Usability Enhancements: The original snapshotting tool had challenges including a limited command line, tightly coupled logic, and minimal testing support, which made it difficult for users to interact with and developers to maintain. To enhance the command line interface, we added a flag to disable automatic updates, giving users more control over when to pull the latest version. We also added a dry-run flag to simulate actions before running a snapshot, allowing developers to test and run safely. Moreover, we implemented support for a custom source path, enabling snapshots of specific directories. This helps developers test smaller directories rather than full snapshots, which can be more complicated when testing functionalities. To improve maintainability, we refactored the codebase into five modular functions, allowing developers to make future changes more easily. In addition, we added automated tests with GitHub Actions to validate new and existing features and ensure that changes work as expected.
Performance Optimization: The default snapshot format was QCOW2 with zlib compression, which often resulted in long snapshot creation times. To address this performance issue, we benchmarked alternatives such as QCOW2 with zstd compression and RAW with no compression. We chose three images of varying sizes: small (4.47 GiB), medium (7.62 GiB), and large (12.7 GiB). The medium-sized image was user-created, to demonstrate that snapshotting and compression work for both Chameleon-supported images and user-created images.
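
For readers who want to try the three variants themselves, the sketch below builds the corresponding `qemu-img convert` command lines. It assumes the conversion is done with qemu-img, which may not match how cc-snapshot invokes its tooling; the helper name and file names are hypothetical. The `compression_type=zstd` option needs a reasonably recent QEMU built with zstd support.

```python
from typing import List, Optional


def convert_command(src: str, dst: str, fmt: str = "qcow2",
                    compression: Optional[str] = "zstd") -> List[str]:
    """Build a qemu-img command that converts src into dst in the given format."""
    cmd = ["qemu-img", "convert", "-O", fmt]
    if fmt == "qcow2" and compression is not None:
        # -c enables compression; compression_type selects zlib or zstd.
        cmd += ["-c", "-o", f"compression_type={compression}"]
    cmd += [src, dst]
    return cmd


# The three variants compared in the benchmark.
variants = {
    "qcow2+zlib": convert_command("disk.raw", "snap-zlib.qcow2", compression="zlib"),
    "qcow2+zstd": convert_command("disk.raw", "snap-zstd.qcow2", compression="zstd"),
    "raw": convert_command("disk.raw", "snap.raw", fmt="raw", compression=None),
}
```

Each list can be passed to `subprocess.run(cmd, check=True)` and timed to reproduce a creation-time measurement on a given image.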

Results: We ran each image with each compression method and recorded four key metrics: creation time, upload time, boot time, and final image size. We calculated the overall time for each compression method across the three image sizes to evaluate which performed better. The results revealed that zstd compression reduced the creation time by about 80.6% across the three image sizes. The upload time for zstd was nearly equal to that of zlib, while RAW images, being uncompressed and therefore larger, uploaded much more slowly than images compressed with zlib or zstd. The boot time was nearly the same for the zlib and zstd images, confirming that the two take about the same time to decompress, while RAW images took longer to boot due to their larger size. Our work suggests that QCOW2 with zstd compression should be used instead of QCOW2 with zlib compression when creating a snapshot. This enables researchers to generate and share reproducible environments faster.

Conclusion and Future Work

Snapshotting is a practical way to support reproducibility in HPC, but to be effective, it should be easy to use and fast enough for real research workflows. Our results show that using zstd compression can cut snapshot creation time by over 80% compared to the default zlib compression, without affecting upload or boot performance. Looking ahead, we plan to integrate zstd, try it on more workloads and image types, and explore ways to improve snapshotting for even greater speedups and more reliable results.

Deliverables

Repository: All analysis code and source code can be found in the CC-SNAPSHOT GitHub Repository.

End-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing

Sat, 23 Aug 2025 00:00:00 +0000

Introduction

Hello everyone!
I am Debangi Ghosh from India, an undergraduate student at the Indian Institute of Technology (IIT) BHU, Varanasi. As part of the StatWrap: Cross-Project Searching and Classification using Local Indexing project, my proposal, under the mentorship of Luke Rasmussen, focuses on developing a full-text search service within the StatWrap user interface. This involves evaluating different search libraries and implementing a classification system to distinguish between active and past projects.

About the Project

As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for researchers to discover relevant projects, notes, and assets across both current and archived work, using information that is either user-entered or passively collected by StatWrap.

Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:

Evaluating open-source search libraries suitable for local indexing and retrieval
Building the full-text search functionality directly into the StatWrap UI to allow seamless querying across projects
Ensuring reliability through the development of unit tests and comprehensive system testing
Implementing a classification system to label projects as “Active,” “Pinned,” or “Past” within the user interface

This project offers a great opportunity to work at the intersection of software development, information retrieval, and user-centric design—while contributing to research reproducibility and collaboration within scientific workflows.

Deliverables

The project has reached the end of its scope after 12 weeks of work. Here’s a breakdown:

1. Descriptive Comparison of Open-Source Libraries

We compared various open-source search libraries against evaluation criteria such as indexing speed, search speed, memory usage, typo tolerance, fuzzy searching, partial matching, full-text queries, contextual search, Boolean support, exact word match, installation ease, maintenance, documentation, and developer experience. We then decided on a weight for each of these features and used the weighted results to identify the best library to use.
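
A weighted comparison of this kind can be sketched as below. The weights, the per-criterion scores, and the placeholder library names are invented for illustration only and do not reproduce the actual evaluation.

```python
from typing import Dict, List


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-criterion scores."""
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight


def rank(libraries: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Return library names ordered from best to worst weighted score."""
    return sorted(libraries,
                  key=lambda name: weighted_score(libraries[name], weights),
                  reverse=True)


# Hypothetical weights and 1-5 scores, not the values used in the project.
WEIGHTS = {"search_speed": 3, "memory_usage": 2, "typo_tolerance": 1}
LIBRARIES = {
    "library_a": {"search_speed": 5, "memory_usage": 4, "typo_tolerance": 2},
    "library_b": {"search_speed": 3, "memory_usage": 3, "typo_tolerance": 5},
}
```

Changing the weights is the quickest way to see how sensitive the final choice is to the priorities assigned to each criterion.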

These results were obtained after tuning the hyperparameters to give the best set of results. For large datasets, FlexSearch has the lowest memory usage, followed by MiniSearch. The examples we used were limited, so MiniSearch showed the better memory usage in our tests. Along with the research and evaluation, I consulted the Performance Benchmark of Full-Text-Search Libraries (Stress Test), available here.

The benchmark was measured in terms per second; higher values are better (except for the “Memory” test). The memory value refers to the amount of memory that was additionally allocated during search.

According to this benchmark, FlexSearch performs queries up to 1,000,000 times faster than other libraries, while also providing powerful search capabilities such as multi-field search (document search), phonetic transformations, partial matching, tag search, result highlighting, and suggestions. Bigger workloads can be scaled through workers that perform updates or queries on the index in parallel using dedicated, balanced threads.

2. The Search User Interface

3. Complete Search Execution Pipeline

4. FlexSearch Features

1. Persistent Indexing with Automatic Loading

Index persistence: Search index automatically saves to disk and loads on startup
Fast restoration: Rebuilds FlexSearch indices from saved document store without re-scanning files
Incremental updates: Detects project changes and updates only modified content
Background processing: Index updates happen asynchronously without blocking the User Interface.

2. Multi-Document Type Support

Unified search: Single search interface for projects, files, people, notes, and assets
Type-specific indices: Separate FlexSearch indices optimized for each document type
Cross-reference capabilities: Documents can reference and link to each other
Flexible schema: Each document type has tailored fields for optimal search performance

3. Intelligent File Content Indexing

Configurable file size limits: Admin-controlled maximum file size for content indexing
Smart file detection: Automatically identifies text files by extension and filename patterns
Content extraction: Full-text indexing with snippet generation for search results
Performance optimization: Skips binary files and respects size constraints to maintain speed

4. Advanced Query Processing

Multi-strategy search: Combines exact matches, fuzzy search, partial matches, and contextual search
Query preprocessing: Removes stop words and applies linguistic filters
Relevance scoring: Custom scoring algorithm considering multiple factors:
- Exact phrase matches (highest weight)
- Individual word matches
- Term frequency with logarithmic capping
- Position-based scoring (earlier matches rank higher)
- Proximity bonuses for terms appearing near each other
- Completeness penalties for missing query terms
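
The scoring factors listed above can be combined roughly as in the following sketch. It is not the StatWrap implementation: every weight, the whitespace tokenizer, and the five-word proximity window are assumptions chosen only to show how the factors interact.

```python
import math


def relevance(query: str, text: str) -> float:
    """Score a document against a query using the factors listed above."""
    terms = query.lower().split()
    words = text.lower().split()
    if not terms or not words:
        return 0.0

    score = 0.0
    if query.lower() in text.lower():
        score += 10.0                                # exact phrase match (highest weight)

    first_positions = []
    missing = 0
    for term in terms:
        hits = [i for i, word in enumerate(words) if word == term]
        if not hits:
            missing += 1
            continue
        score += 2.0                                 # individual word match
        score += min(math.log1p(len(hits)), 2.0)     # log-scaled, capped term frequency
        score += 1.0 / (1 + hits[0])                 # earlier matches rank higher
        first_positions.append(hits[0])

    if len(first_positions) >= 2 and max(first_positions) - min(first_positions) < 5:
        score += 1.5                                 # proximity bonus
    score -= 1.0 * missing                           # completeness penalty
    return score
```

With these illustrative weights, a document containing the full phrase near its start scores well above one that matches only a single query term.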

5. Real-Time Search Suggestions

Autocomplete support: Dynamic suggestions based on indexed document titles
Search history: Maintains recent searches for quick re-execution
Debounced input: Prevents excessive API calls during typing
Contextual suggestions: Suggestions adapt based on current filters and context

6. Comprehensive Filtering System

Type filtering: Filter by document type (projects, files, people, etc.)
Project scoping: Limit searches to specific projects
File type filtering: Filter files by extension
Advanced search panel: Collapsible interface for power users
Filter persistence: Maintains filter state across searches

7. Performance Monitoring & Analytics

Real-time metrics: Track search times, cache hit rates, and index statistics
Performance dashboard: Visual indicators for system health
Cache management: LRU cache with configurable size and TTL
Search analytics: Historical data on search patterns and performance
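
An LRU cache with a TTL and a hit-rate counter can be sketched as follows. This is an illustration in Python rather than the code used in StatWrap; the class name, the default limits, and the injectable clock are assumptions.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LruTtlCache:
    """Least-recently-used cache whose entries also expire after a fixed TTL."""

    def __init__(self, max_size: int = 100, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(key, None)   # drop expired entries lazily
            self.misses += 1
            return None
        self._entries.move_to_end(key)     # mark as most recently used
        self.hits += 1
        return entry[1]

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_size:
            self._entries.popitem(last=False)  # evict the least recently used

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Passing a custom `clock` makes expiry behaviour easy to exercise in tests without sleeping.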

8. Index Management Tools

Export/Import functionality: Backup and restore search indices
Full reindexing: Complete index rebuild with progress tracking
Index deletion: Clean slate functionality for troubleshooting
File size adjustment: Modify indexing constraints and rebuild affected content
Index statistics: Detailed breakdown of indexed content by type and project

9. Robust Error Handling & Resilience

Graceful degradation: System continues operating even with partial index corruption
File system error handling: Handles missing files, permission issues, and path changes
Memory management: Prevents memory leaks during large indexing operations
Recovery mechanisms: Automatic fallback to basic search if advanced features fail

10. User Experience Enhancements

Keyboard shortcuts: Ctrl+K to focus search, Escape to clear
Result highlighting: Visual emphasis on matching terms in results
Expandable results: Drill down into detailed information for each result
Loading states: Clear feedback during indexing and search operations
Responsive tabs: Organized results by type with badge counts

5. Classification of Active and Past Projects

A classification system has been added to the user interface, similar to the “Add to Favorites” option. A newly added project goes to the “Active” section by default, unless it is explicitly marked as “Past”. Similarly, when a project is unpinned from Favorites, it goes to the “Active” section.

Conclusion and Future Scope

Building a comprehensive search system requires careful attention to performance, user experience, and maintainability. FlexSearch provided the foundation, but the real value came from thoughtful implementation of persistent indexing, advanced scoring, and robust error handling. The result is a search system that feels instant to users while handling complex queries across diverse document types.

The key to success was treating search not as a single feature, but as a complete subsystem with its own data management, performance monitoring, and user interface considerations. By investing in these supporting systems, the search functionality became a central, reliable part of the application that users can depend on.

The future scope would include:

Using a database (for example, SQLite) instead of JSON, since a database offers more efficient queries and atomic CRUD operations for this use case.
Integrating any suggestions from my mentors, as well as improvements we feel are necessary.
Developing unit tests for further functionalities and improvements.
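
To illustrate the first item, here is a minimal sketch of what moving index metadata from JSON into SQLite might look like. It uses Python's standard `sqlite3` module; the table layout, column names, and helper functions are assumptions made for this example and do not describe the application's real schema.

```python
import sqlite3

def open_index(path=":memory:"):
    """Open (or create) a hypothetical document table with a lookup index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id TEXT PRIMARY KEY,"
        " type TEXT NOT NULL,"
        " project TEXT,"
        " title TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_type_project ON documents(type, project)"
    )
    return conn

def upsert_documents(conn, docs):
    """Insert or replace a batch atomically: all rows are stored, or none."""
    with conn:  # commits on success, rolls back if any row fails
        conn.executemany(
            "INSERT OR REPLACE INTO documents(id, type, project, title) "
            "VALUES (?, ?, ?, ?)",
            docs,
        )

def find(conn, doc_type, project):
    """Filtered lookup served by the index instead of a full JSON scan."""
    rows = conn.execute(
        "SELECT id, title FROM documents WHERE type = ? AND project = ? ORDER BY id",
        (doc_type, project),
    )
    return rows.fetchall()
```

The transaction in `upsert_documents` is the main difference from rewriting a JSON file: a failure partway through a batch leaves the previously stored rows untouched.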

[Final] Reproducibility of Interactive Notebooks in Distributed Environments

Wed, 20 Aug 2025 00:00:00 +0000

I am sharing an overview of my project, Reproducibility of Interactive Notebooks in Distributed Environments, and the work that I did this summer.

Project Overview

This project aims to improve the reproducibility of interactive notebooks that are executed in a distributed environment. Notebooks, such as those in the Jupyter environment, have become increasingly popular and are widely used in the scientific community due to their ease of use and portability. Reproducing these notebooks is a challenging task, especially in a distributed cluster environment.

In the distributed environments we consider, the notebook code is divided into manager and worker code. The manager code is the main entry point of the program; it divides the task at hand into one or more pieces of worker code, which run in a parallel, distributed fashion. We utilize several open source tools to package and containerize the application code so that it can be reproduced across different machines and environments. They include Sciunit, FLINC, and TaskVine. These are the high-level goals of this project:

Generate execution logs for a notebook program.
Generate code and data dependencies for notebook programs in an automated manner.
Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.
Audit and package the notebook code running in a distributed environment.
Overall, support efficient reproducibility of notebook programs.

Progress Highlights

Here are the details of the work that I did during this summer.

Generation of Execution Logs

We generate execution logs for notebook programs in a distributed environment using the Linux utility strace, which records every system call made by the notebook, including all files accessed during its execution. We collect separate logs for the manager and the worker code, since they are executed on different machines and have different dependencies. By recording the entire notebook execution, we capture all libraries, packages, and data files referenced during notebook execution in the form of execution logs. These logs are then used for further analyses.
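
As a rough illustration of how such a log can be consumed, the sketch below extracts successfully opened file paths from strace-style lines. It assumes a log similar to what `strace -f -e trace=openat -o trace.log <command>` writes; the regular expression and the function name `accessed_files` are illustrative assumptions, not the project's tooling, and calls that use a directory file descriptor other than AT_FDCWD are ignored for simplicity.

```python
import re

# Matches lines such as:
#   1234 openat(AT_FDCWD, "/data/input.csv", O_RDONLY|O_CLOEXEC) = 3
OPEN_RE = re.compile(
    r'\bopen(?:at)?\((?:AT_FDCWD, )?"([^"]+)".*\)\s+=\s+(-?\d+)'
)

def accessed_files(log_lines):
    """Return paths that were opened successfully, in first-seen order."""
    seen = {}
    for line in log_lines:
        match = OPEN_RE.search(line)
        # A negative return value (for example -1 ENOENT) means the open failed.
        if match and int(match.group(2)) >= 0:
            seen.setdefault(match.group(1), None)
    return list(seen)
```

Running this once over the manager log and once over each worker log would give one file list per role, which matches the separation described above.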

Extracting Software Dependencies

When a library such as the Python package NumPy is used by the notebook program, an entry is made in the execution log containing the complete path of the accessed library file(s) along with additional information. We analyze the execution logs for both manager and workers to find and list all dependencies. So far, we are limited to Python packages, though this methodology is general and can be used to find dependencies for any programming language. For Python packages, their version numbers are also obtained by querying package managers such as pip or Conda on the local system.
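
One possible way to turn logged paths into a package list is sketched below. It assumes packages live under a `site-packages` or `dist-packages` directory and uses Python's `importlib.metadata` as a stand-in for querying pip or Conda; the helper names are hypothetical. Note that an import name can differ from the distribution name (for example `yaml` versus `PyYAML`), which this sketch does not resolve.

```python
import re
from importlib import metadata

# Captures the first path component below site-packages or dist-packages.
PKG_RE = re.compile(r"/(?:site|dist)-packages/([A-Za-z0-9_]+)")

def packages_from_paths(paths):
    """Collect top-level package names from accessed file paths."""
    names = set()
    for path in paths:
        match = PKG_RE.search(path)
        # Skip private entries such as __pycache__.
        if match and not match.group(1).startswith("_"):
            names.add(match.group(1))
    return sorted(names)

def installed_version(name):
    """Look up the installed version, or None if it cannot be found."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None
```

Pairing each name from `packages_from_paths` with `installed_version` yields a pinned requirement list that can be replayed on another machine.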

Extracting Data Dependencies

We utilize similar execution logs to identify which data files were used by the notebook program. The list of logged files also contains various configuration or settings files used by certain packages and libraries. These files are removed from the list of data dependencies through post-processing based on their file paths.
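
The path-based post-processing could look roughly like the following sketch. The prefixes and suffixes are assumptions picked for illustration; the actual rules used in the project are not listed in this report.

```python
from pathlib import PurePosixPath

# Hypothetical filter rules; a real deployment would tune these lists.
SYSTEM_PREFIXES = ("/etc/", "/usr/", "/lib/", "/proc/", "/sys/", "/dev/")
NON_DATA_SUFFIXES = {".cfg", ".ini", ".conf", ".py", ".pyc", ".so"}

def data_dependencies(paths):
    """Keep only paths that look like user data files."""
    keep = []
    for p in paths:
        path = PurePosixPath(p)
        if p.startswith(SYSTEM_PREFIXES):
            continue  # system locations
        if "site-packages" in path.parts or "dist-packages" in path.parts:
            continue  # files that belong to installed packages
        if any(part.startswith(".") for part in path.parts):
            continue  # hidden directories such as ~/.config
        if path.suffix in NON_DATA_SUFFIXES:
            continue  # configuration files and code
        keep.append(p)
    return keep
```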

Testing the Pipeline

We have conducted our experiments on three use cases obtained from different domains using between 5 and 10 workers. They include distributed image convolution, climate trend analysis, and high energy physics experiment analysis. The results so far are promising with good accuracy and with a slight running time overhead.

Processing at Cell-level

We perform the same steps of log generation and data and software dependency extraction at the level of individual cells in a notebook instead of once for the whole notebook. As a result, we generate software and data dependencies at the level of individual notebook cells. This is achieved by interrupting control flow before and after execution of each cell to write special instructions to the execution log for marking boundaries of cell execution. We then analyze the intervals between these instructions to identify which files and Python packages are accessed by each specific cell. We use this information to generate the list of software dependencies used by that cell only.

We also capture data dependencies by analyzing the execution logs generated by overriding the open function call used to access various files.
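
The cell-boundary idea can be sketched as follows. The marker format is an assumption: here a cell boundary is signaled by attempting to open a sentinel path such as `/__cell_start_3__`, which shows up in the log even though the open fails. The real markers used in the project may look different, and `files_per_cell` is a hypothetical helper.

```python
import re

# Hypothetical sentinel paths written to the log before and after each cell.
MARK_RE = re.compile(r'"/__cell_(start|end)_(\d+)__"')
PATH_RE = re.compile(r'open(?:at)?\([^"]*"([^"]+)"')

def files_per_cell(log_lines):
    """Group successfully opened files by the cell that was running."""
    per_cell = {}
    current = None  # id of the cell being executed, if any
    for line in log_lines:
        mark = MARK_RE.search(line)
        if mark:
            kind, cell = mark.group(1), int(mark.group(2))
            if kind == "start":
                current = cell
                per_cell.setdefault(cell, [])
            else:
                current = None
            continue
        match = PATH_RE.search(line)
        if match and current is not None and "= -1" not in line:
            if match.group(1) not in per_cell[current]:
                per_cell[current].append(match.group(1))
    return per_cell
```

Files opened between an end marker and the next start marker are attributed to no cell, which keeps kernel housekeeping out of the per-cell dependency lists.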

Distributed Notebook Auditing

In order to execute and audit workloads in parallel, we use Sciunit Parallel, which uses GNU Parallel for efficient parallel execution of tasks. The user specifies the number of tasks or machines to use, and the workload is then distributed across them. Once execution completes, the resulting containerized executions need to be gathered at the host location.

Efficient Reproducibility with Checkpointing

An important challenge with Jupyter notebooks is that re-executing them is sometimes unnecessarily time-consuming and resource-intensive, especially when most cells remain unchanged. We worked on NBRewind, a lightweight tool that accelerates notebook re-execution by avoiding redundant computation. It integrates checkpointing, application virtualization, and content-based deduplication. It enables two kinds of checkpoints: incremental and full-state. In incremental checkpoints, notebook states and dependencies across multiple cells are stored once such that only their deltas are stored again. In full-state checkpoints, the same is stored after each cell. During its restore process, it restores outputs for unchanged cells and thus enables efficient re-execution. Our empirical evaluation demonstrates that NBRewind can significantly reduce both notebook audit and repeat times with incremental checkpoints.

I am very happy about the experience I have had in this project and I would encourage other students to join this program in the future.
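
The combination of incremental checkpoints and content-based deduplication can be illustrated with a small sketch. This is not NBRewind's implementation; `CheckpointStore` and its methods are hypothetical, and it assumes that notebook state can be pickled variable by variable.

```python
import hashlib
import pickle

class CheckpointStore:
    """Toy incremental checkpoint store with content-based deduplication."""

    def __init__(self):
        self.blobs = {}        # digest -> pickled bytes, stored once
        self.checkpoints = []  # one manifest per cell: variable name -> digest

    def save(self, namespace):
        """Checkpoint the namespace; return how many new blobs were stored."""
        manifest = {}
        new_blobs = 0
        for name, value in namespace.items():
            data = pickle.dumps(value)
            digest = hashlib.sha256(data).hexdigest()
            if digest not in self.blobs:  # only the delta is written
                self.blobs[digest] = data
                new_blobs += 1
            manifest[name] = digest
        self.checkpoints.append(manifest)
        return new_blobs

    def restore(self, index):
        """Rebuild the namespace as it was after the given checkpoint."""
        manifest = self.checkpoints[index]
        return {name: pickle.loads(self.blobs[d]) for name, d in manifest.items()}
```

Because unchanged variables hash to the same digest, a cell that modifies one variable adds one blob instead of a full copy of the state, which is the intuition behind the smaller incremental checkpoints.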

Midterm Report: Learning and Building ORB

Thu, 07 Aug 2025 00:00:00 +0000

Project Overview

UC ORB is an open-source platform developed to increase visibility and engagement with open source projects across the University of California system.

By providing a structured and searchable repository browser, ORB makes it easier for researchers, students, and collaborators to discover relevant open source initiatives, track their impact, and connect with contributors. It also helps campuses demonstrate the value of their open source output to potential funders and institutional partners.

Progress So Far

Significant progress has been made in building out core features of the ORB Showcase platform:

Searching and Filtering Options

Users can now search and filter repositories using multiple criteria:

Development Team / UC Campus
Programming Language
License Type
Topic / Domain Area

These filtering tools make it easy to explore the growing set of repositories in a meaningful and personalized way.

Pagination has been added to ensure scalability and smooth performance, even as the number of projects continues to grow.

Repository Details View

Each repository page now displays rich metadata and contextual information, including:

README preview – offering a quick look at the project’s purpose and usage

License – clearly indicating how the project can be used or adapted

Contributors and Funders – acknowledging the people and institutions behind the work

What’s Next

As we prepare UC ORB for public launch, we’re focused on improving the backend workflow and addressing some key challenges:

⚙️ GitHub Workflow Challenges

Creating a GitHub-first workflow for adding repositories is powerful, but also tricky:

GitHub Actions cannot be triggered by API calls from a backend directly, which limits automation via server-side tools.

The GitHub bot has permission limitations, especially when it comes to interacting with PRs and validating submissions outside of standard GitHub UI flows.

I’m currently working on designing a more robust and maintainable workflow to handle these edge cases, including:

A standalone script that can add repositories directly to the database, bypassing the need for a pull request and enabling more flexible internal submissions.

Better logging and validation to ensure consistency between the file-based data model and the live PostgreSQL database.

Reflection

This project has been a great learning experience. Despite challenges with the frontend, the backend, GitHub Actions, bots, and APIs, it’s been exciting to build a platform that highlights open source work across the UC system.

I’m looking forward to what’s coming next as we get closer to launching ORB.

Midterm Report: Simulation, Comparison, and Conclusion of Cache Eviction

Wed, 06 Aug 2025 00:00:00 +0000

Project Overview

CacheBench is a benchmarking suite designed for comprehensive cache performance evaluation, with a particular focus on analyzing the miss ratios of various cache eviction algorithms.

At the core of CacheBench lie two key components: the high-performance cache simulator, libCacheSim, and the extensive open-source cache datasets, which collectively contain over 8,000 traces from diverse applications. This ensures broad coverage across a range of realistic workloads.

Our primary goal is to evaluate all major and widely-used cache eviction algorithms on thousands of traces, in order to gain insights into their behaviors and design trade-offs. Additionally, we aim to identify and distill representative workloads, making benchmarking more efficient and comprehensive for future cache research.

Progress and Pain Points

We began by benchmarking prevalent eviction algorithms, including FIFO, LRU, CLOCK, LFU, Random, Belady (BeladySize), CAR, ARC, LIRS, LHD, Hyperbolic, GDSF, W-TinyLFU, 2Q, SLRU, S3-FIFO, SIEVE, and LeCaR. As we developed the suite, we made progressive improvements to both the simulator and dataset infrastructure. Our progress can be summarized as follows:

Collected miss ratio results for all listed algorithms across 8,000+ traces.
Identified best- and worst-performing traces for each algorithm, and conducted feature analysis of these traces.
Developed Python bindings: To increase accessibility, we provided a Python package that allows users to easily download traces and run simulation analyses using libCacheSim and the cache datasets.

However, analysis remains challenging because there is no universally accepted metric or baseline for objectively comparing cache eviction algorithms’ performance across all workloads.
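
For readers unfamiliar with the metric, the sketch below shows how a miss ratio is computed for two of the simplest policies, LRU and FIFO, on a toy trace. It is a plain Python illustration with unit-sized objects and does not use libCacheSim or its Python bindings; the function names are chosen for this example.

```python
from collections import OrderedDict, deque

def miss_ratio_lru(trace, capacity):
    """Miss ratio of an LRU cache holding `capacity` unit-sized objects."""
    cache = OrderedDict()
    misses = 0
    for key in trace:
        if key in cache:
            cache.move_to_end(key)         # refresh recency on a hit
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return misses / len(trace)

def miss_ratio_fifo(trace, capacity):
    """Miss ratio of a FIFO cache; hits do not change eviction order."""
    cache = set()
    order = deque()
    misses = 0
    for key in trace:
        if key not in cache:
            misses += 1
            cache.add(key)
            order.append(key)
            if len(cache) > capacity:
                cache.discard(order.popleft())  # evict oldest insertion
    return misses / len(trace)

trace = ["a", "b", "a", "c", "a", "d", "a", "e"]
print(miss_ratio_lru(trace, 2))   # 0.625
print(miss_ratio_fifo(trace, 2))  # 0.75
```

On this trace the frequently reused key stays resident under LRU but is evicted once under FIFO. On other traces the ordering can flip, which is part of why a single cross-workload baseline is hard to define.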

Next Steps

For the second half of the project, my focus will shift to:

Evaluating More Complex Eviction Algorithms: Having concentrated mainly on static eviction policies so far (which are generally more deterministic and understandable), I will now investigate learning-based eviction algorithms such as LRB and 3L-Cache. These models incorporate learning components and incur additional computational overhead, making simulations slower and more complex.
Detailed Trace Analysis: Since eviction algorithms can have highly variable performance on the same trace, I plan to analyze why certain algorithms excel on specific traces while others do not. Understanding these factors is crucial to characterizing both the algorithms and the workload traces.
Constructing Representative Workload Sets: Based on ongoing simulations and trace analyses, I aim to identify a minimal but representative subset of traces that can serve as a basic evaluation suite, simplifying testing and improving accessibility.

Reflection

This project has truly been the highlight of my summer. By evaluating a wide range of cache eviction algorithms, I’ve significantly deepened my understanding of cache design and its underlying principles.

I’m especially grateful to my mentors for their constant support, patience, and guidance throughout this journey. It’s been a privilege to learn from you!

I’m excited to see the final results of CacheBench!

Midterm Report: Learning, Building, and Documenting Brahma

Tue, 05 Aug 2025 00:00:00 +0000

Project Overview

Brahma-XR is an open-source WebXR framework designed for building collaborative virtual environments, especially those involving spatial data and scientific visualization.

What makes Brahma powerful is that the same codebase runs seamlessly across both the browser and XR devices like the Apple Vision Pro, Meta Quest 3, and VARJO. This makes it ideal for rapid prototyping and creating cross-platform immersive experiences.

Some of Brahma’s built-in features include:

Grab-and-pull locomotion
Raycasting and interaction
Avatar embodiment
Spatial rendering
Support for geospatial and data-driven visualizations

Brahma is intentionally lightweight, optimized to run even on low-compute devices—making immersive collaboration more accessible to everyone.

What Worked (and What Didn’t)

As Brahma transitioned from a private research repo to a public open-source project, a lot of important foundational work had to be done around documentation, packaging, and example previews.

There are two aspects that make Brahma especially unique:

Bipartite npm package structure – which requires detailed and thoughtful documentation.
Immersive, real-time examples – unlike typical libraries, Brahma’s examples aren’t just static demos. They are live, multi-user XR apps designed to be interacted with.

The first half of the project focused on setting the stage—structuring and preparing the framework for broader use.

🔧 Key Accomplishments

Learning Three.js
I spent time learning the fundamentals of Three.js—how it handles 3D rendering, scene setup, materials, cameras, and animations. I also explored how large-scale Three.js projects are organized, which helped me understand how Brahma’s example apps are built.
Setting up the project structure
I looked at the architecture of various open-source projects and used that knowledge to shape Brahma’s structure. The goal was to align with community best practices while keeping things clean and modular for future contributors.
Understanding npm packaging (especially bipartite)
Since Brahma includes both client- and server-side logic, I spent time understanding how multi-part npm packages are published and maintained. I explored best practices around versioning, distribution, and separating internal vs public modules.
Creating a documentation system
After exploring different approaches (and with my mentor’s help), I set up a static documentation site using JSDoc with the Docdash theme. The current version includes guides, API references, and contribution instructions. This is just the beginning—the docs will evolve as the community grows.

What’s Next

In the second half of the project, I’ll be focusing on:

Building a routing system
For both documentation and example apps, so that users can easily browse through different components and use cases.
Setting up UI and 3D infrastructure
To make it easier for others to start building apps with Brahma by providing clean base layers for interface and spatial development.
Prepping for the first public release
Publishing the Brahma NPM package along with a curated set of featured examples and contributor-friendly documentation—making it easier for developers to get started and contribute.

Reflection

This project has truly been the highlight of my summer. Learning about WebXR, Three.js, and open-source workflows has been both exciting and rewarding. Every challenge taught me something new.

I am especially grateful to my mentor Samir Ghosh for his constant support, patience, and guidance. It’s been a privilege learning from you!

I’m looking forward to what’s coming next as we get closer to the first public release of Brahma!

Mid-term Blog: Rectilinear Floorplans in OpenROAD - Progress Update

Sun, 03 Aug 2025 00:00:00 +0000

Mid-term Progress: Enabling Rectilinear Floorplanning in OpenROAD

Hello! I’m excited to share my mid-term progress on implementing rectilinear (polygonal) die support in OpenROAD’s floorplanning flow as part of Google Summer of Code 2025. Under the guidance of my mentors Eder Monteiro and Augusto Berndt, we’ve made significant strides in extending OpenROAD to handle non-rectangular die shapes.

Here’s a link to my original proposal

Project Overview

My project focuses on extending OpenROAD’s floorplanning capabilities to support rectilinear die shapes. This enhancement is crucial for modern VLSI design flows involving advanced packaging, 2.5D/3D ICs, and irregular chiplet-based designs where non-rectangular dies are increasingly common.

The core challenge is maintaining robustness and backward compatibility while introducing this major new feature that touches multiple aspects of the design flow.

Progress Made

1. Tcl Frontend Modification and Input Parsing

Successfully extended the Tcl interface to accept rectilinear die specifications:

The initialize_floorplan command can now accept a list of coordinates specifying a polygon and automatically trigger the rectilinear floorplanning flow.
Added robust input validation for rectilinear coordinates
Ensured backward compatibility with existing rectangular floorplan flows

2. Die Creation and Validation

Leveraged the internal structure odb::Polygon to store the vertices of the shape.
Implemented polygon validation to ensure shapes are valid rectilinear polygons
Added proper error handling and user feedback for invalid polygon inputs

3. Row Generation for Rectilinear Dies - Major Milestone

The most significant achievement has been implementing the make_polygon_rows functionality, which generates standard cell rows that conform to the rectilinear die boundaries. This was one of the most challenging aspects of the project. To solve this, we developed an efficient scanline-based approach that fills rectilinear areas with rows by clipping the row lengths according to the boundary of the supplied shape.
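
To give a feel for the scanline idea, here is a simplified Python sketch. It is not the OpenROAD implementation of make_polygon_rows: the function names, the vertex-list input, and the `(x, y, site_count)` row tuples are all assumptions made for illustration. For each row strip, it finds the x-intervals that lie inside the polygon over the strip's full height and fills them with whole sites.

```python
def spans_at(poly, y):
    """Inside x-intervals at height y (y must not lie on a horizontal edge)."""
    xs = []
    n = len(poly)
    for i in range(n):
        (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
        if x1 == x2 and min(y1, y2) < y < max(y1, y2):  # vertical edge crossed
            xs.append(x1)
    xs.sort()
    return list(zip(xs[0::2], xs[1::2]))  # even-odd pairing of crossings

def intersect(a, b):
    """Intersect two sorted lists of disjoint intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def make_rows(poly, row_height, site_width):
    """Return (x, y, site_count) rows clipped to a rectilinear polygon."""
    ys = sorted({y for _, y in poly})
    rows = []
    y = ys[0]
    while y + row_height <= ys[-1]:
        top = y + row_height
        # Sample between every horizontal edge that cuts through the strip.
        cuts = [y] + [v for v in ys if y < v < top] + [top]
        spans = None
        for lo, hi in zip(cuts, cuts[1:]):
            s = spans_at(poly, (lo + hi) / 2)
            spans = s if spans is None else intersect(spans, s)
        for x_lo, x_hi in spans:
            sites = (x_hi - x_lo) // site_width
            if sites > 0:
                rows.append((x_lo, y, sites))
        y = top
    return rows
```

For a U-shaped polygon, the rows below the notch span the full width, while the rows beside the notch are split into a left and a right segment.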

4. Testing and Validation

Tests were added to the regression suite and were used to successfully test the entire initialize_floorplan flow. The changes made were merged into OpenROAD successfully. PR Link

makeRows demo :

One of our test cases involved generating rows for a U-shaped die. Here is a snapshot from the OpenROAD GUI displaying the correctly laid out rows:

Next Steps

I am currently working on the pin placer (ppl) module and extending it to support rectilinear floorplans. This requires a careful re-evaluation of each step, including the cost function used to optimize pin placement.

Mid-Term Update: MPI Appliance for HPC Research on Chameleon

Sun, 03 Aug 2025 00:00:00 +0000

Hi everyone! This is my mid-term blog update for the project MPI Appliance for HPC Research on Chameleon, developed in collaboration with Argonne National Laboratory and the Chameleon Cloud community. This blog follows up on my earlier post, which you can find here.

🔧 June 15 – June 29, 2025

Worked on creating and configuring images on Chameleon Cloud for the following three sites: CHI@UC, CHI@TACC, and KVM@TACC.

Key features of the images:

Spack: Pre-installed and configured for easy package management of HPC software.
Lua Modules (LMod): Installed and configured for environment module management.
MPI Support: Both MPICH and Open MPI are pre-installed, enabling users to run distributed applications out-of-the-box.

These images are now publicly available and can be seen directly on the Chameleon Appliance Catalog, titled MPI and Spack for HPC (Ubuntu 22.04).

I also worked on some example Jupyter notebooks on how to get started using these images.

🔧 June 30 – July 13, 2025

With the MPI Appliance now published on Chameleon Cloud, the next step was to automate the setup of an MPI-Spack cluster.

To achieve this, I developed a set of Ansible playbooks that:

Configure both master and worker nodes with site-specific settings
Set up seamless access to Chameleon NFS shares
Allow users to easily install Spack packages, compilers, and dependencies across all nodes

These playbooks aim to simplify the deployment of reproducible HPC environments and reduce the time required to get a working cluster up and running.

🔧 July 14 – July 28, 2025

This period began with me fixing some issues in python-chi, the official Python client for the Chameleon testbed. We also discussed adding support for CUDA-based packages, which would make it easier to work with NVIDIA GPUs. We successfully published a new image on Chameleon, titled MPI and Spack for HPC (Ubuntu 22.04 - CUDA), and added an example to demonstrate its usage.

We compiled the artifact containing the Jupyter notebooks and Ansible playbooks and published it on Chameleon Trovi. Feel free to check it out here. The documentation still needs some work.

📌 That’s it for now! I’m currently working on the documentation, a ROCm-based image for AMD GPUs, and some container-based examples. Stay tuned for more updates in the next blog.

Midterm blog: CarbonCast Midpoint Update: From Vision to Reality

Sun, 03 Aug 2025 00:00:00 +0000

A few months ago, I shared my vision for making carbon intensity forecasts more accessible through the CarbonCast project. My proposal under the mentorship of Professor Abel Souza aims to build an API that makes carbon intensity forecasts more accessible and actionable. I had two main goals: expand CarbonCast to work with more regional electricity grids, and transform it from a research project into something that could actually run and be interacted with in the real world.

Today, I’m excited to share that we’ve not only hit those goals – we’ve exceeded them in ways I didn’t expect.

What We’ve Built So Far

Remember how I mentioned that CarbonCast needed to support more regional grids? Well, we’ve gone big. The system now covers 85+ regions across two continents. We’re talking about major US grid operators like ERCOT (Texas), CISO (California), PJM (Mid-Atlantic), MISO (Midwest), and NYISO (New York), plus we’ve expanded into European countries like Germany, France, Spain, and the UK.

But here’s the thing – collecting weather data for carbon intensity forecasting isn’t as simple as just downloading a few files. Each region needs four different types of weather data: solar radiation (for solar power predictions), wind patterns (for wind power), temperature and humidity (for energy demand), and precipitation (which affects both supply and demand). That means we’re managing data collection for over 340 different combinations of regions and weather variables.

The Automation Challenge

When I started this project, I quickly realized that manually managing data collection for this many regions would be impossible. We’re talking about thousands of data requests, each taking time to process, with various things that can go wrong along the way.

So we built something I’m really proud of: an intelligent automation system that handles 95% of the work without human intervention. That means 19 out of every 20 data collection tasks happen automatically, even when things go wrong.

The system is smart about it too. It knows when to speed up data collection, when to slow down to avoid overwhelming the servers, and how to recover when errors happen. We’ve achieved 99% data completeness, which means almost every piece of weather data we need actually makes it into our system successfully.

Making It Production-Ready

The biggest challenge was taking CarbonCast from a research project that worked on my laptop to something that could run reliably for weeks without me babysitting it. This meant building in all the boring but crucial stuff that makes software actually work in the real world.

We created a comprehensive error handling system that can automatically recover from 95% of the problems it encounters. Network hiccups, server timeouts, data format changes – the system handles these gracefully and keeps running.
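
As a rough idea of what automatic recovery from transient failures can look like, here is a minimal retry sketch with exponential backoff and jitter. It is not CarbonCast's code; `fetch_with_retry`, its parameters, and the choice of exception types are assumptions for illustration.

```python
import random
import time

def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0,
                     max_delay=60.0, sleep=time.sleep):
    """Call fetch(), retrying transient network errors with backoff."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # give up and surface the error for manual attention
            # Double the wait after each failure, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads retries out so requests do not hit the server together.
            sleep(delay * random.uniform(0.5, 1.0))
```

Backing off after failures is also one simple way to slow down when a server is struggling, while successful calls proceed at full speed.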

There’s also a real-time monitoring dashboard that shows exactly what’s happening across all regions. I can see which areas are collecting data successfully, which ones might be having issues, and get alerts if anything needs attention. It’s like having a mission control center for carbon data.

The Dashboard: Mission Control for Carbon Data

Let me show you what this monitoring system actually looks like. We built a comprehensive web dashboard that gives us real-time visibility into everything that’s happening:

The main dashboard showing real-time system metrics and status across all regions

The dashboard shows key metrics at a glance – total requests, completion rates, and active regions. But it goes much deeper than that. You can drill down into individual requests to see their complete lifecycle:

Detailed view of individual data requests showing processing timelines and status

Each request card shows everything from the initial request time to when the data becomes available for download. This level of visibility is crucial when you’re managing hundreds of data requests across different regions and weather variables.

The regional analytics view shows how well we’re doing across different grid operators:

Regional breakdown showing completion status across different electricity grid operators

What I’m particularly proud of is the error handling dashboard. When things do go wrong (which they inevitably do with any large-scale data system), we can see exactly what happened and how the system recovered:

Error tracking and resolution system showing 100% success rate in region mapping

The fact that we’re showing “No unknown regions found” means our coordinate-based region detection system is working perfectly – every weather data request gets properly mapped to the right electricity grid.

The Technical Foundation

Under the hood, we’ve built what I’d call enterprise-grade infrastructure. The system can run autonomously for weeks, automatically organizing data by region and weather type, managing storage efficiently, and even optimizing its own performance based on what it learns.

We’ve also created comprehensive testing systems to make sure everything works reliably. When you’re dealing with data that people might use to make real decisions about when to charge their electric vehicles or run their data centers, reliability isn’t optional.

The architecture follows a modular, service-oriented design with clear separation between data collection, processing, monitoring, and user interfaces. This makes it much easier to maintain and extend as we add new features.

Why This Matters

All of this infrastructure work might sound technical, but it’s directly connected to the original vision: making carbon intensity forecasts accessible to everyone.

With this foundation in place, we can now provide reliable, up-to-date weather data for carbon intensity forecasting across major electricity grids in North America and Europe. That means developers building carbon-aware applications, companies trying to reduce their emissions, and individuals wanting to time their energy use for lower environmental impact all have access to the data they need.

What’s Next: Breaking Down CarbonCast

The next phase is where things get really exciting. Now that we have this solid data collection foundation, we’re going to break down CarbonCast itself into modular components. This will make it easier for developers to integrate carbon intensity forecasting into their own applications, whether that’s a smart home system, a cloud computing platform, or a mobile app that helps people make greener energy choices.

Looking Back

When I started this project, I knew we needed better infrastructure for carbon data. What I didn’t expect was how much we’d end up building – or how well it would work. We’ve created something that can reliably collect and organize weather data across two continents, handle errors gracefully, and run without constant supervision.

More importantly, we’ve built the foundation that will make it possible for anyone to access accurate carbon intensity forecasts. Whether you’re a developer building the next generation of carbon-aware applications or someone who just wants to know the best time to do laundry to minimize your environmental impact, the infrastructure is now there to support those decisions.

The vision of making carbon data accessible and actionable is becoming reality, one automated data collection at a time.

Impact Beyond Research

This work builds directly on the foundation of Multi-day Forecasting of Electric Grid Carbon Intensity using Machine Learning, transforming research into practical, real-world infrastructure. We’re not just making carbon intensity forecasts more accurate – we’re making them accessible to everyone who wants to reduce their environmental impact.

The open-source nature of CarbonCast means that anyone can run, contribute to, and benefit from this work. Whether you’re a developer building carbon-aware applications, a policymaker working on grid decarbonization strategies, or a sustainability-conscious individual looking to reduce your carbon footprint, the tools are now there to make informed, impactful choices.

Looking ahead, I’m excited to see how this infrastructure will enable the next generation of carbon-aware computing and smart energy decisions.

Mid-Term Report: Uncovering the True Sources of Non-Reproducibility in AI for Science

Fri, 01 Aug 2025 00:00:00 +0000

Hello, I’m Baiqiang. I’m excited to share a mid-term update from the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project. This journey, mentored by Luanzheng “Lenny” Guo and Dongfang Zhao, has taken a fascinating and unexpected turn, leading to a much deeper understanding of what it takes to build truly reliable AI for science.

The Search for an Invisible Bug

As a quick recap, our project tackles the critical problem of non-determinism in Retrieval-Augmented Generation (RAG) systems. For science to be trustworthy, it must be repeatable. If an AI system gives different answers to the same question, it fails this fundamental test. Our initial goal, outlined in my proposal, was to find and fix the sources of this inconsistency, which we believed lay within the retrieval algorithms themselves.

To do this, we built a comprehensive testing framework capable of running thousands of controlled experiments. We designed it to meticulously measure the consistency of retrieval results while varying everything from the indexing algorithm to the underlying hardware.

A Surprising Discovery: The Usual Suspect is Innocent

The common wisdom in the community is that high-performance, approximate search libraries like FAISS are a major source of randomness. We put this to the test, running repeated queries against various index types, including complex ones like HNSW and IndexIVF.

Our results were clear and surprising: FAISS is remarkably reproducible out of the box. When run on a consistent hardware and software stack, it returns the exact same results, every single time. The library appears to have robust internal seed management that ensures deterministic behavior.
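
To make the idea of a repeated-query check concrete, here is a minimal sketch of such a test harness. It is not the project's framework: `toy_search`, `check_reproducibility`, and the tiny corpus are hypothetical stand-ins, and a real run would replace `toy_search` with a call into the index under test.

```python
from typing import Callable, Sequence


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    """Set overlap between two result lists (1.0 means the same documents)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


def check_reproducibility(search: Callable[[str], list[str]], query: str, runs: int = 5) -> dict:
    """Run the same query several times and compare every run to the first."""
    baseline = search(query)
    repeats = [search(query) for _ in range(runs - 1)]
    return {
        # Same documents in the same order on every run
        "exact_match": all(r == baseline for r in repeats),
        # Worst-case overlap of the returned document sets
        "min_jaccard": min((jaccard(baseline, r) for r in repeats), default=1.0),
    }


def toy_search(query: str) -> list[str]:
    """Hypothetical stand-in for a real index lookup (keyword match only)."""
    corpus = {"doc1": "carbon grid forecast", "doc2": "grid hardware", "doc3": "retrieval"}
    hits = [doc for doc, text in corpus.items() if any(w in text for w in query.split())]
    return sorted(hits)


report = check_reproducibility(toy_search, "grid forecast")
print(report)
```

Reporting both an exact-match flag and a set-overlap score separates two failure modes: reordered rankings versus entirely different documents.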

This finding was a pivotal moment. The non-reproducibility that researchers observe in practice is real, but it doesn’t come from where we expected. The problem isn’t the algorithm itself, but the environment it runs in. Our investigation immediately shifted to find the real culprits.

Pinpointing the True Sources of Non-Determinism

Our framework quickly helped us identify the true sources of inconsistency:

Hardware-Induced Variation (CPU vs. GPU): This is the most significant factor. Running the exact same retrieval code can produce different document rankings and even different document sets when executed on a CPU versus a GPU. This is likely due to subtle differences in floating-point arithmetic and library optimizations in the hardware stack.
The Impact of Numerical Precision: We also confirmed that changing the floating-point precision of the data (e.g., from FP32 to FP16) can introduce small numerical variations that are just large enough to reorder the results, potentially changing the evidence the LLM receives.
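
The precision effect can be illustrated with a small, self-contained sketch. This is a simplification under stated assumptions: it rounds two made-up similarity scores directly, whereas a real pipeline would apply the precision change to the embeddings. The document names and score values are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def rank(scores: dict, cast) -> list:
    """Rank documents by descending score; sorted() is stable, so ties keep insertion order."""
    return [doc for doc, _ in sorted(scores.items(), key=lambda kv: -cast(kv[1]))]


# Two hypothetical similarity scores that differ only in the fourth decimal place
scores = {"doc_a": 0.5001, "doc_b": 0.5002}

fp32_ranking = rank(scores, to_fp32)
fp16_ranking = rank(scores, to_fp16)
print(fp32_ranking, fp16_ranking)
```

Near 0.5, adjacent FP16 values are about 0.0005 apart, so both scores collapse to the same number and the ranking falls back to tie-breaking order.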

Our Mission Refined: Building Tools for Environmental Control

This discovery has sharpened our project’s mission. The challenge is not to “fix” a supposedly random algorithm, but to develop the tools and best practices to control for the entire experimental environment. Our focus for the second half of the project is to:

Develop a Hardware-Aware Configuration Tracker: We are building a tool that goes beyond logging software versions. It will capture the critical details of the hardware environment—CPU/GPU model, CUDA version, etc.—and link them directly to an experiment’s results.
Create a Cross-Environment Validation Suite: Our open-source benchmarking suite will empower researchers to test their own pipelines. Crucially, it will help them identify and diagnose inconsistencies when moving workflows between different machines, such as from a local laptop to a cloud-based GPU.
Establish New Best Practices: We will distill our findings into clear, actionable guidance. The key recommendation is no longer just about choosing the right algorithm, but ensuring a consistent and well-documented hardware and software environment to guarantee reproducible outcomes.
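
As a rough idea of what a hardware-aware record could look like, here is a minimal sketch. It is not the tracker under development: the function names, the record layout, and the example result are assumptions made for illustration. It uses only the standard library and queries `nvidia-smi` when that tool is present.

```python
import hashlib
import json
import platform
import shutil
import subprocess


def capture_environment() -> dict:
    """Collect basic software and hardware details for the current machine."""
    env = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "gpus": [],
    }
    # GPU details are optional: only query them when nvidia-smi is installed
    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True, text=True, check=False, timeout=10,
            )
            if out.returncode == 0:
                env["gpus"] = [ln.strip() for ln in out.stdout.splitlines() if ln.strip()]
        except (OSError, subprocess.SubprocessError):
            pass
    return env


def fingerprint(env: dict) -> str:
    """Short, stable identifier so results can be grouped by environment."""
    blob = json.dumps(env, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]


env = capture_environment()
# Hypothetical experiment record linking results to the environment they came from
record = {"env_id": fingerprint(env), "environment": env, "results": {"top_k": ["doc1", "doc2"]}}
print(json.dumps(record, indent=2))
```

Storing the fingerprint next to each result makes it easy to spot when two runs that disagree also came from different environments.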

By following the evidence, we’ve uncovered the root cause of a critical problem in AI-driven research. We are now developing the solutions needed to manage it, paving the way for a future where scientific discoveries powered by AI are built on a foundation of verifiable trust.

Midterm Update: Building Intelligent Observability for NRP

Fri, 01 Aug 2025 00:00:00 +0000

I’m pleased to share the progress we’ve made on my OSRE 2025 project, “Intelligent Observability for Seam: A GenAI Approach” since my initial announcement. We’re working toward our core goal: building an ML-powered service for NRP that analyzes monitoring data, detects anomalies, and provides trustworthy GenAI explanations.

How Our Agents Support the Observability Mission

We’ve been developing specialized agents and tools that work together to support our original project vision:

1. Prometheus Metrics Analysis Agent

Function: Continuously ingests and processes NRP’s Prometheus metrics
Progress: We’ve implemented initial data pipelines for key system metrics
Purpose: Provides the foundation for anomaly detection by establishing normal behavior baselines

2. Query Refinement Agent

Function: Clarifies ambiguous metrics or patterns before generating explanations
Progress: We’ve implemented a basic version of Conformal Revision of Questions to resolve metric ambiguities
Purpose: Aims to ensure explanations address the right system behaviors (e.g., distinguishing CPU saturation from memory pressure)
Deliverable Impact: We hope this will improve accuracy of GenAI explanations by eliminating misinterpretations

3. Explanation Generation Agent (AIS)

Function: Creates human-readable explanations and root-cause analysis
Progress: We’ve built a prototype of the Automated Information Seeker with a Plan→Validate→Execute→Assess→Revise cycle
Purpose: Transforms technical anomalies into actionable insights for operators
Deliverable Impact: Intended to directly deliver on the GenAI explanation component of our tool

Integration Progress

We’re working to connect our agents into a unified observability pipeline:

Data Collection: Prometheus metrics → Analysis Agent
Anomaly Detection: With statistical confidence bounds (in development)
Query Refinement: Resolving ambiguities before explanation
Explanation Generation: Human-readable analysis with uncertainty awareness
Feedback Loop: System learning from operator interactions (planned)
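
To illustrate what "statistical confidence bounds" could mean at the anomaly detection stage, here is a minimal sketch. The post does not specify the method, so the rolling mean plus or minus k standard deviations used here is an assumption, and the CPU usage values are made up.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, k=3.0):
    """Flag points that fall outside mean +/- k*stdev of the preceding window."""
    anomalies = []
    for i in range(window, len(values)):
        baseline = values[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        lower, upper = mu - k * sigma, mu + k * sigma
        if not (lower <= values[i] <= upper):
            anomalies.append((i, values[i], (lower, upper)))
    return anomalies


# Hypothetical CPU utilisation samples with one obvious spike at index 6
cpu_usage = [0.41, 0.43, 0.40, 0.42, 0.44, 0.43, 0.95, 0.42]
anomalies = find_anomalies(cpu_usage)
for index, value, (lower, upper) in anomalies:
    print(f"sample {index}: {value:.2f} outside [{lower:.3f}, {upper:.3f}]")
```

Returning the bounds along with the flagged value gives a downstream explanation step something concrete to cite.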

Hardware Testing Opportunity

This project has given us a valuable opportunity to test our observability framework on Qualcomm Cloud AI 100 Ultra hardware. We’re beginning to port different LLM architectures specifically for:

Exploring anomaly detection performance on specialized AI hardware
Testing explanation generation quality across different model architectures
Comparing GLM-4.5 against other models for observability-specific tasks

Next Phase: Completing the Observability Tool

For the remainder of OSRE 2025, we’re focused on:

Finalizing integration of all agents into a cohesive anomaly detection tool integrated with Matrix
Validating that our GenAI explanations help operators resolve issues faster for users, which we plan to test on the Nautilus Matrix platform
Optimizing performance on specialized hardware for NRP’s scale
Preparing the open-source release of our intelligent observability tool

Acknowledgments

I’m deeply grateful to my lead mentor Mohammad Firas Sada for his guidance in keeping our work focused on NRP’s observability needs. His insights have been invaluable in navigating the challenges of this project.

While we’ve developed several agents and frameworks, everything we’re building serves the original mission: creating an intelligent observability tool that helps NRP operators solve problems faster and keep complex research systems running smoothly.

I look forward to sharing more progress on our observability tool with GenAI explanations in the coming weeks!

Midterm Report: Streamlining Reproducible Machine Learning Research with Automated MLOps Workflows

Wed, 30 Jul 2025 00:00:00 +0000

Refresher about the Project

Hi everyone! For the last month, I have been working with my mentors, Professor Fraida Fund and Mohamed Saeed, on our project, Applying MLOps to overcome reproducibility barriers in machine learning research. As a refresher, our goal is to build a template generator for reproducible machine learning training workflows on the Chameleon testbed. We want to provide our users with the necessary environment configuration in a handy way, so they won’t be overwhelmed by all the intricate details of setting up the environment. This will allow for validation and further development of their setup.

What we have done so far

The current workflow begins in JupyterHub, where the user provides basic details such as project name, site, and node type. The notebooks handle key setup tasks, like creating storage buckets, provisioning and configuring a server with GPU support, and mounting buckets locally via rclone. Once the host environment is ready, the user connects to that machine over SSH, generates the necessary variables via a script, and launches a containerized virtual lab that integrates Jupyter and MLflow. Inside the container, users authenticate with GitHub, connect or initialize their repositories, and can immediately begin training models, with all metrics, artifacts, and environment details logged for reproducibility.

The progress on the project so far is as follows:

We finalized the selection of frameworks and storage options.

Artifacts are now logged directly from the MLflow server to the Chameleon object store, without relying on a database backend or an intermediate MinIO S3 layer.

Different JupyterLab images for each framework.

We’ve started with the top ML frameworks — PyTorch Lightning, Keras/TensorFlow, and Scikit-Learn. Each framework now has its own image, which will later be tailored to the user’s selection.

GitHub CLI and Hugging Face integration inside the container.

The Jupyter container now integrates both the GitHub CLI and Hugging Face authentication. Users can manage their code repositories via GitHub CLI commands and authenticate with Hugging Face tokens to download/upload models and datasets. This eliminates the need for manual credential setup and streamlines ML experimentation within the environment.

Custom Logging Utility

To ensure robust tracking of code versioning and environment details, we added a custom logging utility.
These logs are stored alongside metrics and model artifacts in MLflow, ensuring every experiment is fully documented and reproducible. Here is a summary of its functions:

`log_git()` — Captures Code Versioning

Uses Git commands (via subprocess) to log:

Current branch name
Commit hash
Repository status (clean or dirty)

Example Output:

commit: a7c3e9d
branch: main
status: dirty (1 file modified)
# and git diff output
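
For readers curious how such a function might gather this information, here is a minimal sketch using `subprocess`. It is not the utility's actual code: the helper names `_git`, `summarize_status`, and `collect_git_info` are hypothetical, and the status wording is only modeled on the example output above.

```python
import subprocess


def _git(*args: str) -> str:
    """Run a git command and return its trimmed output ('' on any failure)."""
    try:
        out = subprocess.run(["git", *args], capture_output=True, text=True, check=False)
    except OSError:  # git is not installed
        return ""
    return out.stdout.strip() if out.returncode == 0 else ""


def summarize_status(porcelain: str) -> str:
    """Turn `git status --porcelain` output into a short clean/dirty label."""
    changed = [line for line in porcelain.splitlines() if line.strip()]
    if not changed:
        return "clean"
    noun = "file" if len(changed) == 1 else "files"
    return f"dirty ({len(changed)} {noun} changed)"


def collect_git_info() -> dict:
    """Gather commit, branch, working-tree status, and the uncommitted diff."""
    commit = _git("rev-parse", "--short", "HEAD")
    return {
        "commit": commit,
        "branch": _git("rev-parse", "--abbrev-ref", "HEAD"),
        # Outside a repository there is no meaningful status to report
        "status": summarize_status(_git("status", "--porcelain")) if commit else "unknown",
        "diff": _git("diff"),
    }


info = collect_git_info()
print({key: value for key, value in info.items() if key != "diff"})
```

A dictionary like this could then be attached to a run, for example with `mlflow.log_params` for the short fields and `mlflow.log_text` for the diff.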

`log_python()` — Tracks the Python Environment

Platform information + Python environment info (version)
Exports a full pip freeze list to a .txt file
Saved as an MLflow artifact to guarantee exact package version reproducibility

Example Output (pip freeze extract):

numpy==1.26.4
pandas==2.2.1
scikit-learn==1.4.2
torch==2.2.0
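
A minimal sketch of how the environment capture could work is shown here. The function name `collect_python_info` and the file name `pip_freeze.txt` are assumptions for illustration, not the utility's real interface.

```python
import platform
import subprocess
import sys
import tempfile
from pathlib import Path


def collect_python_info(out_dir: str) -> dict:
    """Record interpreter and platform details and write `pip freeze` to a text file."""
    freeze = subprocess.run(
        [sys.executable, "-m", "pip", "freeze"],  # use the running interpreter's pip
        capture_output=True, text=True, check=False,
    )
    req_path = Path(out_dir) / "pip_freeze.txt"
    req_path.write_text(freeze.stdout)
    return {
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "requirements_file": str(req_path),
    }


with tempfile.TemporaryDirectory() as tmp:
    info = collect_python_info(tmp)
    file_was_written = Path(info["requirements_file"]).exists()
    print(info["python_version"], info["platform"])
```

Calling pip through `sys.executable` ensures the package list matches the interpreter that actually runs the training code. The resulting file could be uploaded with `mlflow.log_artifact`.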

`log_gpu()` - Records GPU Information

Detects available GPU devices
Collects details using NVIDIA’s pynvml or AMD’s ROCm tools
Logs:
GPU name
Driver version
CUDA/ROCm version
Captures nvidia-smi or rocm-smi output for deeper inspection

These utilities ensure that each run can be traced back with:

The exact code version
The full Python environment
The hardware details used

Initial customizable template

We’ve prototyped an initial customizable template using Cookiecutter. It provides an interactive CLI in which users supply key project details (e.g., project name, frameworks, GPU type, and any integrations). Cookiecutter then generates a ready-to-use project structure with pre-configured integrations, reducing manual setup and ensuring consistency across environments.

The user will have notebooks to communicate with Chameleon testbed resources, a containerized environment, and custom training scripts into which they can plug their code.

What’s Next

Template Generation via Config + Interactive Widgets
We are exploring different ways to generate experiment templates using configuration files and interactive widgets in Jupyter notebooks. This would let users quickly customize logging setups and should be more user-friendly.
AMD-Compatible Images
Extend support by building and testing Docker images optimized for AMD GPUs. Up to now, our development efforts have focused on NVIDIA GPUs using CUDA-based images.
End-to-End Lifecycle Example
Provide a larger example demonstrating the entire ML workflow:
- Data preparation
- Training with GPU logging
- Tracking metrics, artifacts, and environment info in MLflow
- Model evaluation and logging
- Reproducing results on different hardware backends

Working on this project so far has been both challenging and eye-opening. I’ve seen how many moving parts need to come together for a smooth workflow. The support from my mentors has been key in helping me turn challenges into real progress.

Thank you for following along — I’m looking forward to sharing more concrete results soon.

Robot Manipulation with Scenic-RoboSuite

Wed, 30 Jul 2025 00:00:00 +0000

I’m Sahil, continuing work on the Scenic-RoboSuite integration for GSoC 2025. This project is mentored by Daniel Fremont and Eric Vin.

Since the last update, the Scenic-RoboSuite interface has made significant progress. The bidirectional bridge is now functional - robots can read sensor data and execute behaviors based on observations. However, these features are still in early stages and we’re working on making them more stable and consistent.

We’ve integrated RoboSuite’s Operational Space Control into Scenic. This control method lets you command the robot’s hand directly in 3D space (like “move 10cm left”) instead of calculating complex joint rotations. While the integration works, it’s rough around the edges and we’re currently focused on stabilizing it across different scenarios.

The main challenge was architectural - RoboSuite expects all robot commands bundled together each timestep, while Scenic processes them one by one. We solved this with a pending actions system that collects everything first, then executes in one go. Time synchronization was another challenge, matching Scenic’s steps with MuJoCo’s physics.
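
The pending actions idea can be sketched in a few lines. This is an illustration only, not the code in the Scenic-RoboSuite interface: the class name `PendingActionBuffer`, the robot names, and the three-value action format are all hypothetical.

```python
class PendingActionBuffer:
    """Collect per-robot actions during a step, then emit one bundled action vector."""

    def __init__(self, robot_names, action_dim):
        self.robot_names = list(robot_names)  # fixed order used when bundling
        self.action_dim = action_dim
        self._pending = {}

    def submit(self, robot, action):
        """Record one robot's action; a later submit in the same step overrides it."""
        if len(action) != self.action_dim:
            raise ValueError(f"expected {self.action_dim} values, got {len(action)}")
        self._pending[robot] = list(action)

    def flush(self):
        """Concatenate all actions in robot order; robots with no action get zeros."""
        bundled = []
        for name in self.robot_names:
            bundled.extend(self._pending.get(name, [0.0] * self.action_dim))
        self._pending.clear()
        return bundled


buffer = PendingActionBuffer(["robot0", "robot1"], action_dim=3)
buffer.submit("robot1", [0.0, -0.1, 0.0])  # actions arrive one at a time
bundled = buffer.flush()                   # one vector for the whole timestep
print(bundled)
# A simulator step would then receive the bundle, e.g. env.step(bundled)
```

Filling missing robots with zeros keeps the bundled vector a fixed length, which is one simple way to handle robots that issue no command in a given step.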

We’ve implemented a basic pick-and-place behavior for testing. The robot reads sensor data, calculates where to move, and adjusts continuously. It can successfully grasp and lift objects, though consistency varies between runs. The system supports three robot models and works with RoboSuite’s pre-built environments.

Custom world building is currently on hold. We’ve decided to focus on integrating existing RoboSuite features into Scenic first, then build Scenic’s capabilities like dynamic scenario randomization on top. For our first prototype, we’re aiming to extend the pick-and-place behavior into a full randomization demo - Scenic will randomly position the cube each run, and the robot will adapt to find and grasp it regardless of location.

The next two weeks focus on stabilizing current features and preparing this randomized scenario prototype. Expanding the behavior library and supporting additional environments will come in future phases after we have a solid foundation.

The core bridge between Scenic and RoboSuite is operational, but there’s significant work ahead to make it reliable and user-friendly.

Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits

Tue, 29 Jul 2025 00:00:00 +0000

Hello! I’m Siva Sathyaseelan D N, a pre-final year B.Tech + M.Tech Engineering student at IIT BHU, Varanasi, India. With a deep-rooted passion for software development and scientific computing, I thrive at the intersection of code and real-world problem-solving. For two years, I’ve engaged in open-source work across scientific simulation, blockchain, and cloud-native technologies through hobby projects, hackathons, internships, and an LFX mentorship. I’m contributing to Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits under the mentorship of Ben Greenman. My proposal can be viewed here!

Project Overview

Gradual typing enhances untyped languages like JavaScript and Python with static type checkers in systems like TypeScript, Flow, Mypy, Pyright, and Typed Racket. These systems use type narrowing to refine types via runtime checks (e.g., typeof item["price"] === "number"). Designs vary: TypeScript permits unverified predicates, Flow ensures soundness, and Typed Racket tracks types compositionally. This variation prompted the If-T benchmark, which evaluates narrowing across five languages. However, it omits tools like Sorbet, Hack, Luau, Pyre, Cinder/Static Python, Typed Clojure, and Elixir, and the risks of unsound narrowings remain unclear.

Objectives

Extend the If-T benchmark to Sorbet, Hack, Luau, Pyre, Cinder/Static Python, Typed Clojure, and potentially Elixir.
Analyze their type narrowing precision, expressiveness, and soundness.
Conduct a corpus study of TypeScript or Python code using GitHub or Software Heritage APIs.
Assess the prevalence and exploit potential of unsound narrowings.
Link corpus findings to benchmark results for broader insights.

Progress So Far

During the first half of the SoR 2025 period, I focused on extending the If-T benchmark to Sorbet, Pyre, Cinder/Static Python, and Typed Clojure. These are the PRs that extend the If-T benchmark:

Sorbet -> https://github.com/utahplt/ifT-benchmark/pull/20
Pyre -> https://github.com/utahplt/ifT-benchmark/pull/26
Typed Clojure -> https://github.com/utahplt/ifT-benchmark/pull/27
Cinder -> https://github.com/utahplt/ifT-benchmark/pull/28

What’s Next

Next, I will conduct a corpus study of TypeScript or Python code using GitHub or Software Heritage APIs and assess the prevalence and exploit potential of unsound narrowings. I will also link the corpus findings to the benchmark results for broader insights.

Final Thoughts

Working on Type Narrowing has been incredibly rewarding. It’s more than just code: it’s studying the type systems of different programming languages, which matters for large-scale software systems and software security, and I’m honored to be a part of that.

Big thanks to my mentor Ben Greenman for his support and thoughtful feedback throughout. I’ve learned a ton already, and I can’t wait to keep building.

AIDRIN Privacy-Centric Enhancements: Backend & UX Upgrades

Fri, 25 Jul 2025 00:00:00 +0000

⏱️ Reading time: 5–6 minutes

Hey everyone,

If you’ve ever wondered what it takes to make AI data pipelines not just smarter, but safer and more transparent, you’re in the right place. The last few weeks working on AIDRIN for GSoC have been a deep dive into the engine room of privacy and backend systems that power the AIDRIN project. My focus has been on building out the core privacy infrastructure and backend features that power AIDRIN’s ability to give users real, actionable insights about their data. It’s been challenging, sometimes messy, but incredibly rewarding to see these changes make a tangible difference.

Having Dr. Jean Luca Bez and Prof. Suren Byna as mentors, along with the support of the entire team, has truly made all the difference. Their guidance, encouragement, and collaborative spirit have been a huge part of this journey, whether I’m brainstorming new ideas or just trying to untangle a tricky bug.

Privacy Metrics: Making Data Safer

A major part of my work has been putting data privacy at the front and center in AIDRIN. I focused on integrating essential privacy metrics like k-anonymity, l-diversity, t-closeness, and more, making sure they’re not just theoretical checkboxes, but real tools that users can interact with and understand. Now, these metrics are fully wired up in the backend and visualized in AIDRIN, so privacy risks are no longer just a vague concern. They are something AI data preparers can actually see and act on. Getting these metrics to work seamlessly with different datasets and ensuring their accuracy took some serious backend engineering, but the payoff has been worth it.
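
For readers new to these metrics, here is a minimal sketch of k-anonymity using its textbook definition: the size of the smallest group of records that share the same quasi-identifier values. This is not AIDRIN's implementation, and the records and column names are made up.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical, already-generalized records
records = [
    {"age": "30-39", "zip": "950**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "950**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "950**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "950**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "951**", "diagnosis": "diabetes"},
]

k = k_anonymity(records, ["age", "zip"])
k_age_only = k_anonymity(records, ["age"])
print(k, k_age_only)
```

Here the single record with zip `951**` is unique on (age, zip), so k drops to 1. Using fewer quasi-identifiers raises k at the cost of less detail.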

Speeding Things Up (So You Don’t Have To Wait Around)

As AIDRIN started handling bigger datasets, some of the calculations became time-consuming because the data has to be accessed every time a metric is computed. To address this, I added caching for previously computed metrics, like class imbalance and privacy checks, and set up asynchronous execution with Celery and Redis. This should make the app super responsive. Rather than waiting for heavy computations to finish, one can start taking notes about other metrics or explore different parts of the app while their results are loading in the background. It’s a small change, but it helps keep the workflow moving smoothly.
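
The cache-then-compute-in-the-background pattern can be sketched with the standard library alone. To be clear about the substitution: this uses a plain dictionary and `concurrent.futures` as stand-ins for Redis and Celery, and the imbalance ratio shown (majority count divided by minority count) is an assumed definition, not necessarily AIDRIN's.

```python
from concurrent.futures import ThreadPoolExecutor

_cache: dict[tuple[str, str], float] = {}   # stand-in for a Redis cache
_executor = ThreadPoolExecutor(max_workers=2)  # stand-in for Celery workers
compute_calls = 0


def class_imbalance(labels) -> float:
    """Assumed metric: majority class count divided by minority class count."""
    global compute_calls
    compute_calls += 1
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return max(counts.values()) / min(counts.values())


def submit_metric(dataset_id, metric_name, fn, data):
    """Schedule a metric in the background, reusing a cached value when one exists."""
    key = (dataset_id, metric_name)

    def task():
        if key not in _cache:
            _cache[key] = fn(data)
        return _cache[key]

    return _executor.submit(task)  # returns immediately with a Future


labels = ["pos"] * 8 + ["neg"] * 2
first = submit_metric("demo", "class_imbalance", class_imbalance, labels).result()
second = submit_metric("demo", "class_imbalance", class_imbalance, labels).result()
_executor.shutdown()
print(first, second, compute_calls)
```

The second request is answered from the cache, so the metric function runs only once.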

Small Touch Ups That (Hopefully) Make a Big Difference

I also spent time on the details that make the app easier to use. Tooltips now explain what the privacy metrics actually mean, error messages are clearer, and there’s a new cache info page where you can see and clear your cached data. The sensitive attribute dropdown is less confusing now, especially if you’re working with quasi-identifiers. These tweaks might seem minor, but they add up and make the app friendlier for everyone.

Docs, Docs, Docs

I’m a big believer that good documentation is just as important as good code. I updated the docs to cover all the new features, added citations for the privacy metrics, and made the install process a bit more straightforward. Hopefully, this means new users and contributors can get up to speed without too much hassle.

Huge Thanks to My Mentors and the Team

I really want to shine a light on Dr. Bez, Prof. Byna, and the entire AIDRIN team here. Their encouragement, practical advice, and collaborative spirit have been a huge part of my progress. Whether I’m stuck on a bug, brainstorming a new feature, or just need a second opinion, there’s always someone ready to help me think things through. Their experience and support have shaped not just the technical side of my work, but also how I approach problem-solving and teamwork.

What’s Next?

Looking ahead, I’m planning to expand AIDRIN’s support for multimodal datasets and keep refining the privacy and fairness modules. There’s always something new to learn or improve, and I’m excited to keep building. If you’re interested in data quality, privacy, or open-source AI tools, I’d love to connect and swap ideas.

Thanks for reading and for following along with my GSoC journey. I’ll be back soon with more updates!

This is the second post in my 3-part GSoC series with AIDRIN. Stay tuned for the final update.

Halfway Blog - WildBerryEye: Mechanical Design & Weather-Resistant Enclosure

Fri, 25 Jul 2025 00:00:00 +0000

Hi everyone! My name is Teodor Langan, and I am an undergraduate studying Robotics Engineering at the University of California, Santa Cruz. I’m happy to share the progress I have made over the last six weeks on my GSoC 2025 project, developing the hardware for the WildBerryEye project, mentored by Carlos Isaac Espinosa.

Project Overview

The WildBerryEye project enables AI-powered ecological monitoring using Raspberry Pi cameras and computer vision models. However, achieving this requires a reliable enclosure that can support long-term deployment in the wild. The goal for my project is to address this need by designing a modular, 3D-printable camera casing that protects WildBerryEye’s electronics from outside factors such as rain, dust, and bugs, while remaining easy to print and assemble. To achieve this, my main responsibilities for this project include:

Implementing a modular design and development-friendly features for ease of assembly and flexible use across hardware setups
Prototyping and testing enclosures outdoors to assess durability, water resistance, and ventilation—then iterating based on results
Developing clear documentation, assembly instructions, and designing with open-source tools
Exploring material options and print techniques to improve outdoor lifespan and environmental resilience

Designed largely with FreeCAD and tested in real outdoor conditions, the open-source enclosure will ensure WildBerryEye hardware can be deployed in natural environments for continuous, low-maintenance data collection.

Progress So Far

Over the past 6 weeks, great progress has been made on the design of the WildBerryEye camera enclosure. Some key accomplishments include:

Full 3D Assembly Model of Electronics: Modeled all core components used in the WildBerryEye system to serve as a reference for enclosure design. For parts without existing CAD models, accurate measurements were taken and custom models were created in FreeCAD.
Initial Enclosure Prototype: Designed and 3D-printed a first full prototype featuring a hinge-latch mechanism to allow tool-free easy access to internal electronics for development and maintenance.
Design Iteration Based on Testing: Based on the results of the first print, created an improved version with better electronics integration, port alignment, and more functionality.

Challenges & Next Steps

Field-Ready Integration: Preparing for field testing with upcoming prototypes by making sure that all internal electronics are securely mounted and fully accessible within the enclosure.
Latch Mechanism Refinement: Finalizing a reliable hinge-latch design that can keep the enclosure sealed during outdoor use while remaining easy to open for maintenance.
Balancing Modularity, Size, and Weatherproofing: Maintaining a compact form factor without compromising on modularity or weather resistance—especially when routing cables and mounting components.
Material Experimentation: Beginning test prints with TPU, a flexible filament that may provide improved seals or gaskets for added protection.
Ventilation Without Exposure: Exploring airflow solutions such as labyrinth-style vents to enable heat dissipation without letting in moisture or debris.

Final Thoughts

These past 6 weeks have helped me immensely to grow my skills in mechanical design, CAD modeling, and field-focused prototyping. The WildBerryEye system can help researchers monitor pollinators and other wildlife in their natural habitats without requiring constant in-person observation or high-maintenance setups. By enabling long-term, autonomous data collection in outdoor environments, it opens new possibilities for low-cost, scalable ecological monitoring.

I’m especially grateful to my mentor Carlos Isaac Espinosa and the WildBerryEye team for their ongoing support. Excited for the second half, where the design will face real-world testing and help bring this impactful system one step closer to field deployment!

Mid-term Blog: Building a Simulator for Benchmarking Replicated Systems

Fri, 25 Jul 2025 00:00:00 +0000

Introduction

Hello there, I’m Michael. In this report, I’ll be sharing my progress as part of the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges project under the mentorship of Fadhil Kurnia.

About the Project

The goal of the project is to build a language-agnostic interface that enables communication between clients and any consensus protocol such as MultiPaxos, Raft, Zookeeper Atomic Broadcast (ZAB), and others. Currently, many of these protocols implement their own custom mechanisms for the client to communicate with the group of peers in the network. An implementation of MultiPaxos from the MultiPaxos Made Complete paper, for example, uses a custom Protobuf definition for the packets that clients send to the MultiPaxos system. With the support of a generalized interface, different consensus protocols can be tested under the same workload to compare their performance objectively.

Progress

Literature Study: Reviewed papers and implementations of various protocols including GigaPaxos, Raft, Viewstamped Replication (VSR), and ZAB. Analysis focused on their log replication strategies, fault handling, and performance implications.
Development of Custom Protocol: Two custom protocols are currently under development and will serve as initial test subjects for the testbed:
- A modified GigaPaxos protocol
- A Primary-Backup Replication protocol with strict log ordering similar to ZAB (logs are ordered based on the sequence proposed by the primary)

Most of my time has been spent working on the two protocols, particularly on snapshotting and state transfer functionality in the Primary-Backup protocol. Ideally, the testbed should be able to evaluate protocol performance in scenarios involving node failure or a new node being added. In these scenarios, different protocol implementations often vary in their decision of whether to take periodic snapshots or to roll forward whenever possible and generate a snapshot only when necessary.

Challenges

Early in the project, the initial goal was to benchmark different consensus protocols using arbitrary full-stack web applications as their workload. Different protocols would replicate a full-stack application running inside Docker containers across multiple nodes, and the testbed would send requests for them to coordinate between those nodes. In fact, the two custom protocols under development were designed specifically to fit these constraints.

Developing a custom protocol that supports the replication of a Docker container is in itself already a difficult task. Abstracting away the functionality that allows communicating with the Docker containers, as well as handling entry logs and snapshotting the state, is an order of magnitude more complicated.

As mentioned in the first blog, an application can be categorized into two types: deterministic and non-deterministic applications. The coordination of these two types of applications is handled in very different ways. Most consensus protocols support only deterministic systems, such as key-value stores, and can’t easily handle coordination of complex services or external side effects. Supporting non-deterministic applications would require abstracting over protocol-specific log structures. This effectively restricts the interface to only support protocols that conform to the abstraction, defeating the goal of making the interface broadly usable and protocol-agnostic.

Furthermore, allowing any existing protocol to run something as complex as a stateful Docker container, without the protocol itself being aware of it, adds another layer of complexity to the system.

Future Goals

Given these challenges, I decided to pivot to using only key-value stores as the benchmark application. This aligns with most existing protocol implementations, which typically use key-value stores. The main focus now is to implement an interface that supports HTTP requests from clients to any arbitrary protocol.
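
As a rough picture of what such an interface might look like, here is a minimal sketch. Every name in it (`ReplicatedKVStore`, `InMemoryStore`, `handle_request`) is hypothetical, and the in-memory store only stands in for an adapter that would forward operations to a real protocol.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ReplicatedKVStore(ABC):
    """Operations every protocol adapter would need to provide."""

    @abstractmethod
    def put(self, key: str, value: str) -> None: ...

    @abstractmethod
    def get(self, key: str) -> Optional[str]: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...


class InMemoryStore(ReplicatedKVStore):
    """Stand-in adapter; a real one would submit each operation to a consensus protocol."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


def handle_request(store: ReplicatedKVStore, method: str, key: str, body: Optional[str] = None):
    """Map an HTTP-style request onto the store; returns (status code, response body)."""
    if method == "GET":
        value = store.get(key)
        return (200, value) if value is not None else (404, "")
    if method in ("PUT", "POST"):
        store.put(key, body or "")
        return (200, "")
    if method == "DELETE":
        store.delete(key)
        return (200, "")
    return (405, "")


store = InMemoryStore()
responses = [
    handle_request(store, "PUT", "x", "1"),
    handle_request(store, "GET", "x"),
    handle_request(store, "DELETE", "x"),
    handle_request(store, "GET", "x"),
]
print(responses)
```

Because the request handler depends only on the abstract class, the same workload could be replayed against any protocol that ships an adapter.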

Midterm Blog: Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges

Fri, 25 Jul 2025 00:00:00 +0000

Hello! I’m Panji Sri Kuncara Wisma and I want to share my midterm progress on the “Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges” project under the mentorship of Fadhil I. Kurnia.

Project Overview

The goal of our project is to create an open testbed that enables fair, reproducible evaluation of different consensus protocols (Paxos variants, EPaxos, Raft, etc.) when deployed at network edges. Currently, researchers struggle to compare these systems because they lack standardized evaluation environments and often rely on mock implementations of proprietary systems.

XDN (eXtensible Distributed Network) is one of the important consensus systems we plan to evaluate in our benchmarking testbed. Built on GigaPaxos, it allows deployment of replicated stateful services across edge locations. As part of preparing our benchmarking framework, we need to ensure that the systems we evaluate, including XDN, are robust for fair comparison.

Progress

As part of preparing our benchmarking tool, I have been working on refactoring XDN’s FUSE filesystem from C++ to Rust. This work is essential for creating a stable and reliable XDN platform.

The diagram above illustrates how the FUSE filesystem integrates with XDN’s distributed architecture. On the left, we see the standard FUSE setup where applications interact with the filesystem through the kernel’s VFS layer. On the right, the distributed replication flow is shown: Node 1 runs fuselog_core which captures filesystem operations and generates statediffs, while Nodes 2 and 3 run fuselog_apply to receive and apply these statediffs, maintaining replica consistency across the distributed system.

This FUSE component is critical for XDN’s operation as it enables transparent state capture and replication across edge nodes. By refactoring this core component from C++ to Rust, we’re hopefully strengthening the foundation for fair benchmarking comparisons in our testbed.

Core Work: C++ to Rust FUSE Filesystem Migration

XDN relies on a FUSE (Filesystem in Userspace) component to capture filesystem operations and generate “statediffs” - records of changes that get replicated across edge nodes. The original C++ implementation worked but had memory safety concerns and limited optimization capabilities.

I worked on refactoring from C++ to Rust, implementing several improvements:

New Features Added:

Zstd Compression: Reduces statediff payload sizes
Adaptive Compression: Intelligently chooses compression strategies
Advanced Pruning: Removes redundant operations (duplicate chmod/chown, created-then-deleted files)
Bincode Serialization: Helps avoid manual serialization code and reduces the risk of related bugs
Extended Operations: Added support for additional filesystem operations (mkdir, symlink, hardlinks, etc.)
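To make the pruning idea concrete, here is a minimal Python sketch (the real component is written in Rust). The `Op` record and the `prune` function are hypothetical names introduced only for illustration; the sketch assumes a batch of operations is an ordered list and ignores renames and hardlinks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Op:
    kind: str     # e.g. "create", "write", "chmod", "unlink" (illustrative subset)
    path: str
    arg: int = 0  # mode for chmod, payload size for write


def prune(ops: List[Op]) -> List[Op]:
    """Drop operations that cannot affect the final replicated state."""
    kept: List[Optional[Op]] = []
    created_at: Dict[str, int] = {}  # path -> index in `kept` of its pending create
    last_chmod: Dict[str, int] = {}  # path -> index in `kept` of its latest chmod
    for op in ops:
        if op.kind == "create":
            created_at[op.path] = len(kept)
        elif op.kind == "unlink" and op.path in created_at:
            # The file was created and deleted inside this batch:
            # erase every operation on it since its create.
            start = created_at.pop(op.path)
            for i in range(start, len(kept)):
                if kept[i] is not None and kept[i].path == op.path:
                    kept[i] = None
            last_chmod.pop(op.path, None)
            continue
        elif op.kind == "chmod":
            prev = last_chmod.get(op.path)
            if prev is not None:
                kept[prev] = None  # superseded by this newer chmod
            last_chmod[op.path] = len(kept)
        kept.append(op)
    return [op for op in kept if op is not None]


batch = [
    Op("create", "/tmp/a"),
    Op("write", "/tmp/a", 10),
    Op("chmod", "/db", 0o600),
    Op("unlink", "/tmp/a"),
    Op("chmod", "/db", 0o644),
    Op("write", "/db", 4),
]
pruned = prune(batch)
```

In this example only the final chmod and the write to `/db` survive, so the temporary file never reaches the replicas.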

Architectural Improvements:

Memory Safety: Rust’s ownership system helps prevent common memory management issues
Type Safety: Using Rust enums instead of integer constants for better type checking

Findings

The optimizations performed as expected:

Statediff Size Reductions:

MySQL workload: 572MB → 29.6MB (95% reduction)
PostgreSQL workload: 76MB → 11.9MB (84% reduction)
SQLite workload: 4MB → 29KB (99% reduction)

The combination of write coalescing, pruning, and compression proves especially effective for database workloads, where many operations involve small changes to large files.
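As a rough illustration of why coalescing and compression help here, the Python sketch below merges overlapping writes to one file and then compresses the result only when that actually shrinks it. It is not the project's code: `coalesce`, `encode` and `decode` are hypothetical names, the "compress only if smaller" rule is just one possible reading of adaptive compression, and the standard-library `zlib` stands in for Zstd.

```python
import zlib
from typing import List, Tuple

Write = Tuple[int, bytes]  # (offset, data) for a single file


def coalesce(writes: List[Write]) -> List[Write]:
    """Merge overlapping or adjacent writes; later writes win on overlap."""
    merged: List[Tuple[int, bytearray]] = []
    for offset, data in writes:
        start, buf = offset, bytearray(data)
        rest: List[Tuple[int, bytearray]] = []
        for m_off, m_buf in merged:
            m_end = m_off + len(m_buf)
            if m_end < start or m_off > start + len(buf):
                rest.append((m_off, m_buf))  # disjoint extent, keep as is
                continue
            # Overlapping or touching: lay the old extent down, then the new one on top.
            new_start = min(start, m_off)
            new_end = max(start + len(buf), m_end)
            combined = bytearray(new_end - new_start)
            combined[m_off - new_start:m_end - new_start] = m_buf
            combined[start - new_start:start - new_start + len(buf)] = buf
            start, buf = new_start, combined
        rest.append((start, buf))
        merged = rest
    return sorted((off, bytes(b)) for off, b in merged)


def encode(payload: bytes) -> bytes:
    """Compress only when it helps; a 1-byte tag records the choice."""
    packed = zlib.compress(payload)
    return b"\x01" + packed if len(packed) < len(payload) else b"\x00" + payload


def decode(blob: bytes) -> bytes:
    """Inverse of encode()."""
    return zlib.decompress(blob[1:]) if blob[:1] == b"\x01" else blob[1:]


extents = coalesce([(0, b"aaaa"), (2, b"bbbb"), (100, b"zz")])
```

Repeated small writes to the same page of a database file collapse into a single extent before compression, which is the pattern the workloads above exhibit.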

Performance Comparison: Remarkably, the Rust implementation matches or exceeds C++ performance:

POST operations: 30% faster (10.5ms vs 15ms)
DELETE operations: 33% faster (10ms vs 15ms)
Overall latency: Consistently better (9ms vs 11ms)

Current Challenges

While the core implementation is complete and functional, I’m currently debugging occasional latency spikes that occur under specific workload patterns. These edge cases need to be resolved before moving on to the benchmarking phase, as inconsistent performance could compromise the reliability of the evaluation.

Next Steps

With the FUSE filesystem foundation nearly complete, next steps include:

Resolve latency spike issues and complete XDN stabilization
Build the benchmarking framework: a comparison tool that can systematically evaluate different consensus protocols with standardized metrics
Run systematic evaluation across protocols

The optimized filesystem will hopefully provide a stable base for reproducible performance comparisons between distributed consensus protocols.

Reproducibility of Interactive Notebooks in Distributed Environments

Fri, 25 Jul 2025 00:00:00 +0000

I am sharing an overview of my project, Reproducibility of Interactive Notebooks in Distributed Environments, and an update at the midway mark.

Project Overview

Generate execution logs for a notebook program.
Generate code and data dependencies for notebook programs in an automated manner.
Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.
Audit and package the notebook code running in a distributed environment.

Progress So Far

Here are the details of the progress made so far.

Generation of Execution Logs

Extracting Software Dependencies

When a library such as the Python package NumPy is used by the notebook program, an entry is made in the execution log containing the complete path of the accessed library file(s) along with additional information. We analyze the execution logs of both the manager and the workers to find and list all dependencies. So far, we are limited to Python packages, though the methodology is general and can be used to find dependencies for any programming language. For Python packages, version numbers are also obtained by querying package managers such as pip or Conda on the local system.
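The idea can be sketched in a few lines of Python. The log format here is an assumption (each line is taken to contain the absolute path of an accessed file), and `packages_from_log` and `installed_version` are hypothetical helper names; only `importlib.metadata` is a real standard-library API.

```python
import re
from importlib import metadata
from typing import Iterable, Optional, Set

# Assumed log format: every line mentions the absolute path of an accessed file.
_SITE_PKG = re.compile(r"/(?:site|dist)-packages/([A-Za-z0-9_]+)")


def packages_from_log(lines: Iterable[str]) -> Set[str]:
    """Collect top-level package names seen under site-packages/dist-packages."""
    found: Set[str] = set()
    for line in lines:
        match = _SITE_PKG.search(line)
        if match:
            found.add(match.group(1))
    return found


def installed_version(name: str) -> Optional[str]:
    """Ask the local installation metadata for a version, or None if unknown."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


log = [
    "open /usr/lib/python3.11/site-packages/numpy/core/x.so",
    "open /etc/hosts",
    "read /opt/conda/lib/python3.11/site-packages/pandas/__init__.py",
]
deps = {name: installed_version(name) for name in packages_from_log(log)}
```

One caveat of this simplification is that an import name and its distribution name can differ (for example `yaml` versus `PyYAML`), so a real pipeline would need a mapping step before the version lookup.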

Extracting Data Dependencies

Testing the Pipeline

Next Steps

The next steps in this project are as follows:

Generate the execution logs and dependencies in a notebook at the level of each cell of code.
Utilize the dependencies at multiple levels of granularities with the goal of automating the deployment and execution of notebooks in a parallel and distributed environment.
Audit notebook program execution in a distributed environment and package it into a container on a single node.

I am very happy about the experience I have had so far in this project and I am excited about the milestones to come. Stay tuned!

[MidTerm] Building PeerSky’s Settings System

Thu, 24 Jul 2025 00:00:00 +0000

Hi everyone, I’m Hanzhong Liu. My project focuses on building a secure and extensible peersky://settings system for the PeerSky browser, a decentralized and privacy-first browser built on Electron.

This post is a midterm check-in covering what’s been implemented so far — from IPC architecture to real-time theme and wallpaper updates — and a preview of what’s coming next.

Project Overview

Peersky’s new settings system is designed to unify browser preferences (themes, search engine, appearance, extensions, etc.) into a single modular interface. It’s accessible via a protocol route (peersky://settings) and built using web-standard HTML/CSS, scoped APIs, and Electron’s context isolation model.

Key Design Goals:

Secure preload-based API exposure via contextBridge
Fast access to user preferences with zero-flicker wallpaper updates
Extensibility for bookmarks, future plugins, and privacy tools

Midterm Progress Highlights

Electron Integration

Rather than using webFrame.executeJavaScript(), I implemented preload-scoped APIs using contextBridge and ipcRenderer to prevent injection vulnerabilities and ensure synchronous availability during early page load. Each internal protocol (settings, home, bookmarks) is granted its own API access level.

Code: src/pages/unified-preload.js

Modular Settings Page

The UI lives in a single HTML file with sidebar-based navigation (Appearance, Search, Bookmarks, Extensions). Each section updates independently using event-driven IPC and live sync.

Wallpaper & Theme Switching

Supports both built-in wallpapers and custom uploads
Background applies instantly using sendSync() during preload
Themes (light, dark, system) are controlled using root-level CSS variables and real-time IPC events

Cache & Search Engine

Added IPC handler to clear both Electron session and P2P cache directories (ipfs/, hyper/)
Settings API allows switching between DuckDuckGo, Ecosia, and Startpage via dropdown

Example: Adding a New Setting (`autoSave`)

I also documented how developers can add new settings like autoSave using:

settings-manager.js for default values and validation
Preload event listeners (onAutoSaveChanged)
UI toggles and save logic in settings.js

Documentation link: Settings Guide

Reflection

I’m really thankful for the mentorship I’ve received from Akhilesh Thite. His guidance has been the perfect balance of autonomy and support. He challenged me to reason clearly about technical choices, especially the ones I thought were minor and not worth paying attention to. His feedback helped me write cleaner, better-scoped code. This project has helped me grow as a software engineer in ways I didn’t fully anticipate, and I’ve enjoyed it so much.

You can explore the project here:
https://github.com/p2plabsxyz/peersky-browser

Midterm for Smart Environments

Thu, 24 Jul 2025 00:00:00 +0000

What is EnvGym?

EnvGym is a general multi-agent framework designed to automate the construction of executable environments for reproducing research prototypes from top-tier conferences and journals. While reproducibility has become a growing concern in the research community, the process of setting up environments remains time-consuming, error-prone, and often poorly documented.

EnvGym addresses this gap by leveraging LLM-powered agents to analyze project instructions, resolve dependencies, configure execution environments, and validate results—thereby reducing human overhead and improving reproducibility at scale.

Progress

New Tools

Initially, our agent had access to only one tool: the command line. This constrained the agent’s ability to decompose complex tasks and respond flexibly to failures. Over the last few weeks, we introduced a modular tool system, enabling the agent to handle specific subtasks more effectively.

The new toolset includes:

dockerrun: Executes Dockerfiles.
hardware_checking, hardware_adjustment: Tailor builds to available resources.
history_manager, stats: Tracks historical data for improvement and reproducibility.
planning: Generates high-level execution plans.
summarize: Interprets build results to adjust subsequent iterations.
writing_docker_initial, writing_docker_revision: Generate and refine Dockerfiles.

While some of those tools, such as dockerrun, run programmatic scripts, other tools such as planning are more complex and use LLMs themselves.

Agent Re-Architecture: Moving Beyond Codex

We transitioned away from OpenAI’s Codex agent implementation. While powerful, Codex’s framework was overly reliant on its CLI frontend, which added unnecessary complexity and limited customizability for our research context.

We implemented our own lightweight, customizable agent pipeline that integrates LLM-based planning with iterative execution. Conceptually, the agent executes the following loop:

Repo Scanning
Hardware Check
Planning & Initial Dockerfile Generation
Docker Execution
Progress Summarization & Adjustment
Iterative Dockerfile Refinement (up to 20 rounds)
Success Check & Logging

This new agent design is easier to control, extend, and debug—aligning better with the needs of reproducibility research.
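The control flow of this loop can be sketched as follows. This is an illustrative skeleton, not EnvGym's code: `build_environment`, `BuildResult` and the four callables are hypothetical names standing in for the writing, dockerrun, summarize and revision tools, and only the 20-round cap comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_ROUNDS = 20  # the loop above refines the Dockerfile for up to 20 rounds


@dataclass
class BuildResult:
    ok: bool
    log: str


def build_environment(
    plan: str,
    write_initial: Callable[[str], str],
    run_docker: Callable[[str], BuildResult],
    summarize: Callable[[BuildResult], str],
    revise: Callable[[str, str], str],
    max_rounds: int = MAX_ROUNDS,
) -> Tuple[bool, str, List[str]]:
    """Iterate build -> summarize -> revise until success or the round limit."""
    dockerfile = write_initial(plan)
    history: List[str] = []
    for _ in range(max_rounds):
        result = run_docker(dockerfile)
        if result.ok:
            return True, dockerfile, history
        feedback = summarize(result)       # condense the failure for the next round
        history.append(feedback)
        dockerfile = revise(dockerfile, feedback)
    return False, dockerfile, history


# Stub tools: the "build" succeeds once git has been installed.
def fake_run(dockerfile: str) -> BuildResult:
    return BuildResult("git" in dockerfile, "" if "git" in dockerfile else "git: not found")


ok, final_dockerfile, history = build_environment(
    "install git",
    write_initial=lambda plan: "FROM ubuntu:22.04\n",
    run_docker=fake_run,
    summarize=lambda result: result.log,
    revise=lambda df, feedback: df + "RUN apt-get update && apt-get install -y git\n",
)
```

Passing the tools in as plain callables keeps the loop itself free of LLM details, which is one way to get the controllability described above.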

Prompt Engineering

For each tool that requires LLMs to function, we created a set of custom prompts that outline the task and break down the goals. For instance, the prompt used in summarize differs from the one in planning, allowing us to optimize the behavior of LLM agents per context.

Performance Gains

With these improvements, EnvGym now successfully replicates 9 repositories, surpassing our baseline Codex agent which struggled with the same set. We’ve observed more reliable planning, better handling of edge-case dependencies, and faster convergence in iterative Dockerfile revisions.

Next Steps

Granular Evaluation Metric

We plan to adopt a tree-structured rubric-based evaluation, inspired by PaperBench. Instead of binary success/failure, each repo will be assigned a reproducibility score from 0–100.

Key tasks include:

Rubric Design: Define a hierarchical rubric with criteria like dependency resolution, test success rate, runtime match, etc.
Manual Annotation: Build a dataset of ground-truth rubrics for a subset of repos to calibrate our automatic judge.
Judge Implementation: Develop an LLM-based judge function that takes (i) rubric and (ii) environment state, and returns a reproducibility score.

Source: Starace, Giulio, et al. “PaperBench: Evaluating AI’s Ability to Replicate AI Research.” arXiv preprint arXiv:2504.01848 (2025).

This will make EnvGym suitable for benchmarking. We will run our new method and obtain a score to compare with baseline methods!
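A tree-structured rubric score of this kind could be aggregated roughly as below. This is a sketch of the general weighted-tree idea, not our final design: `RubricNode`, its fields, the weights and the example criteria are all hypothetical, and leaf judgments would come from the LLM judge rather than being hard-coded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    name: str
    weight: float = 1.0
    judgment: Optional[float] = None  # leaf score in [0, 1], e.g. from the judge
    children: List["RubricNode"] = field(default_factory=list)


def score(node: RubricNode) -> float:
    """Leaf: its judgment. Inner node: weighted mean of its children's scores."""
    if not node.children:
        return float(node.judgment or 0.0)
    total = sum(child.weight for child in node.children)
    return sum(child.weight * score(child) for child in node.children) / total


def reproducibility_score(root: RubricNode) -> float:
    """Map the root score onto the 0-100 scale."""
    return round(100 * score(root), 1)


rubric = RubricNode("repo", children=[
    RubricNode("dependency resolution", weight=2, judgment=1.0),
    RubricNode("tests", weight=2, children=[
        RubricNode("unit tests pass", judgment=1.0),
        RubricNode("integration tests pass", judgment=0.0),
    ]),
    RubricNode("runtime match", weight=1, judgment=0.0),
])
result = reproducibility_score(rubric)
```

Here the repo scores (2·1.0 + 2·0.5 + 1·0.0) / 5 = 0.6, i.e. 60 out of 100, instead of a flat failure.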

Conclusion

EnvGym has made strong progress toward automating reproducibility in computational research. Through modularization, agentic design, and prompt optimizations, we’ve surpassed existing baselines and laid the groundwork for even more improvement.

The upcoming focus on metrics and benchmarking will elevate EnvGym from a functional prototype to a standardized reproducibility benchmark tool, and will let us show quantitatively how our new agentic method compares with existing tools such as Codex. Excited for what’s to come!

; 20250724-Sam_Huang

Midway Through GSoC: ENTS

Thu, 24 Jul 2025 00:00:00 +0000

Midway Through GSoC

Hi everyone! I’m Devansh Kukreja, and I’m excited to share a midterm update on my Google Summer of Code 2025 project with the University of California, Santa Cruz Open Source Program Office (UC OSPO) under the Open Source Research Experience (OSRE). I’m contributing to ENTS, a platform that supports real-time monitoring and visualization of environmental sensor networks.

Project Overview

The Environmental NeTworked Sensor (ENTS) platform is an open-source web portal designed to collect, visualize, and analyze data from distributed sensor networks. It’s used by researchers and citizen scientists to monitor field-deployed sensors measuring soil moisture, temperature, voltage, and current—supporting critical research on sustainability and environmental change.

My project focuses on improving the platform’s stability, usability, and extensibility through:

Fixing bugs in the data visualization components.
Enhancing real-time chart synchronization and data point selection.
Improving overall system error handling and reliability.
Building a Logger Registration System that enables users to register and configure their logging devices.
Exploring integration with The Things Network (TTN) to support LoRaWAN-based wireless sensor connectivity.

Progress So Far

During the first half of the GSoC period, I focused on laying the groundwork for a more robust and user-friendly system. Highlights include:

Enhanced date range logic: Improved the way the dashboard selects time periods by automatically choosing a recent two-week window with valid sensor data. This ensures charts always display meaningful insights and avoids showing blank states.
Improved chart rendering: Refined how charts behave when there’s no data or when unusual values (like negatives) are present. This includes smoother axis alignment and fallback messaging when data is unavailable.
Refactored cell management UI: Cleaned up and improved the modals used to manage cells and sensors, fixing several UI/UX issues and bugs to make interactions more intuitive and consistent.
Enabled smart URL syncing: The dashboard state now stays in sync with the URL, making it easier to share specific views or navigate back to previous states without losing context.
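The date range logic from the first item above can be illustrated with a small Python sketch. It assumes the window is anchored at the most recent reading; `default_window` is a hypothetical name, not a function from the ENTS codebase.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def default_window(
    timestamps: Iterable[datetime], days: int = 14
) -> Optional[Tuple[datetime, datetime]]:
    """Return (start, end) ending at the latest reading, or None if there is no data."""
    latest = max(timestamps, default=None)
    if latest is None:
        return None  # caller can show a fallback message instead of a blank chart
    return latest - timedelta(days=days), latest


readings = [datetime(2025, 6, 1), datetime(2025, 7, 10, 12, 0)]
window = default_window(readings)
```

Anchoring on the latest reading rather than on the current date is what keeps the default view from being empty for sensors that have been offline for a while.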

What’s Next

In the second half of the program, I’ll be focusing on:

Building out and polishing the Logger Registration UI based on the backend schema and wireframes.
Finalizing the onboarding flow for field loggers, linking registration data to ingestion and dashboard views.
Continuing work on LoRaWAN support with TTN, aiming to enable basic OTA provisioning for future deployments.
Exploring an admin dashboard that helps visualize device health, sync status, and alert on any anomalies.

Final Thoughts

Working on ENTS has been incredibly rewarding—it’s more than just code. It’s about making tools that help scientists and conservationists understand our changing environment, and I’m honored to be a part of that.

Big thanks to my mentors Colleen Josephson, John Madden, and Alec Levy for their support and thoughtful feedback throughout. I’ve learned a ton already, and I can’t wait to keep building.

Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL

Wed, 23 Jul 2025 00:00:00 +0000

Enabling NCCL GPU-GPU Communication in PyLops-MPI - Google Summer of Code Project (2025) - Part 1

Hello all! 👋 My name is Tharit, and I’m a computer science student at the University of Texas at Austin. This summer, I am fortunate to participate in the Google Summer of Code (GSoC) 2025 program, hosted by UC OSPO and the PyLops team. My project focuses on enabling NCCL GPU-to-GPU communication in PyLops-MPI, under the guidance of mentors Matteo Ravasi and Yuxi Hong.

You might have come across this post if you’re a PyLops user interested in scaling PyLops-MPI with GPU/NCCL support, or if you’re exploring GSoC projects and wondering what we are up to. Either way, I hope this post gives you useful insights.

What is PyLops-MPI?

If you’ve worked with inverse problems, you’ve likely come across PyLops. It’s a Python library that provides an extensive suite of linear operators and solvers. Operators are designed with a clear focus on the forward and adjoint pair (A and A.T) whilst solvers take operators and data to solve the associated inverse problem. In fields such as geophysics, astrophysics, or medical imaging, inverse problems are solved routinely to image the Earth, space, or the human body from remote measurements. In all cases, real-life problems tend to be compute-intensive and require a lot of memory. PyLops allows users to express these problems in an abstract manner that is reminiscent of the underlying equations whilst not compromising on efficiency.

PyLops-MPI is the distributed extension of PyLops, introduced during GSoC 2023. It enables users to scale their computations over CPU and GPU clusters via MPI. However, up until now, even GPU-based communications were routed through MPI, introducing potential performance bottlenecks.

The Goal of the Project

Our goal is to take PyLops-MPI to the next level by enabling GPU-to-GPU collective communications directly using NVIDIA NCCL. This allows full utilization of high-bandwidth interconnects like NVLink, and avoids unnecessary memory transfers through the host CPU. This blog marks the midpoint of the program (week 6 of 12), and I’d like to reflect on the progress so far, challenges faced, and what’s coming next.

What is a Collective Communication anyway?

In PyLops-MPI, distributed computations require nodes to exchange information, for example, during gradient computations or reductions in iterative solvers. A naive implementation (useful for a thought experiment) would involve each node taking turns broadcasting data, which can be quite slow. NVIDIA’s NCCL abstracts away the complexity of topology-aware communication. For example, in the image below, if a ring layout is the most efficient way for the GPUs to perform an all-reduce, NCCL will automatically pick that layout and leave the GPU 01-GPU 04 and GPU 02-GPU 03 links unused.

Example of a compute node with 4 GPUs attached, directly connected to each other with NVLink

What we have achieved so far

The story is probably best told through the sequence of pull requests.

Core Changes in DistributedArray (PR #130)

This PR introduces NCCL support into the DistributedArray class. The design allows users to optionally pass both a NcclCommunicator and an MPI.Comm. By doing so, small control data (e.g., shape, dtype) is still exchanged via MPI, leveraging Python’s flexibility and minimizing performance impact. As you will see, this decision to keep two communicators turns out to be a good call. This is what the __init__ method of DistributedArray looks like, with the new argument marked by a comment:

    def __init__(self, global_shape: Union[Tuple, Integral],
                 base_comm: Optional[MPI.Comm] = MPI.COMM_WORLD,
                 base_comm_nccl: Optional[NcclCommunicatorType] = None,  # new argument
                 partition: Partition = Partition.SCATTER, axis: int = 0,
                 local_shapes: Optional[List[Union[Tuple, Integral]]] = None,
                 mask: Optional[List[Integral]] = None):
        ...

CuPy’s NCCL API

NCCL’s API (mirroring its C++ origins) is minimalistic, requiring manual memory management. One prominent example is the implementation of allGather(). Previously, with mpi4py, we could rely on Python’s dynamic typing (everything is an object, so one just sends the object), which means mpi4py allows different ranks to send arrays of different sizes. NCCL requires every rank in the communicator to send the same size. To work around this, we implemented padding and reshaping logic for multi-dimensional arrays. NCCL treats arrays as contiguous byte streams, so padding must be handled carefully ¹.
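The padding workaround can be illustrated without any GPU. The sketch below simulates the ranks with plain Python lists; `pad_to`, `simulated_allgather` and `allgather_uneven` are hypothetical names and not the PyLops-MPI implementation, which operates on CuPy arrays and real communicators.

```python
from typing import List, Sequence


def pad_to(chunk: Sequence[float], size: int, fill: float = 0.0) -> List[float]:
    """Right-pad a local chunk so that every rank sends `size` elements."""
    return list(chunk) + [fill] * (size - len(chunk))


def simulated_allgather(send_bufs: List[List[float]]) -> List[List[float]]:
    """Stand-in for an equal-size all-gather: every rank receives all buffers, concatenated."""
    assert len({len(buf) for buf in send_bufs}) == 1, "all ranks must send the same size"
    flat = [value for buf in send_bufs for value in buf]
    return [list(flat) for _ in send_bufs]


def allgather_uneven(local_chunks: List[List[float]]) -> List[List[List[float]]]:
    """Gather chunks of different lengths by padding, gathering, then stripping the padding."""
    sizes = [len(chunk) for chunk in local_chunks]  # small metadata, cheap to share
    width = max(sizes)
    received = simulated_allgather([pad_to(chunk, width) for chunk in local_chunks])
    return [
        [recv[rank * width: rank * width + n] for rank, n in enumerate(sizes)]
        for recv in received
    ]


chunks = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]  # one uneven chunk per simulated rank
gathered = allgather_uneven(chunks)
```

Each rank ends up with the original, unpadded chunks, at the cost of sending `max(sizes)` elements per rank instead of the exact amount.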

Moreover, we had to accommodate NCCL’s lower-level API, which lacks conveniences such as the communicator split variants. Internally, we introduced unified abstractions such as _allgather(), _allreduce(), send(), recv(), etc. to DistributedArray and modified the communication model to work seamlessly whether MPI or NCCL is used. This way, other developers can focus on developing new operators that suit their needs, while the existence of different communicators is abstracted away.

Example of a challenge coming from having an unevenly distributed array

Keep things small: Dependency management (PR #132 and PR #135)

Despite adding this new capability, we are fully aware that not every user has access to a cluster of GPUs, and therefore we don’t make NCCL and CuPy mandatory dependencies. Someone installing and experimenting with PyLops-MPI for the first time is likely to run it on a single-node desktop, and we don’t want to introduce such complexity early on. This means that our code has to accommodate an “optional dependency”, or have a “protected import”. If we had import cupy as cp at the beginning of DistributedArray, users without a GPU would encounter an error before doing anything useful at all. In other words, our library should import CuPy and NCCL automatically, but only when the system supports them and the user asks for them. The pattern looks like this:

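import os
from importlib import util

# Enabled only if CuPy is installed and the environment variable does not disable it
nccl_test = util.find_spec("cupy") is not None and int(os.getenv("NCCL_PYLOPS_MPI", 1)) == 1
if nccl_test:
    try:
        # try to import CuPy and then check for NCCL
        import cupy as cp
        from cupy.cuda import nccl
        nccl_enabled = True   # success
    except ImportError:
        nccl_enabled = False  # unable to import although the package is installed
else:
    nccl_enabled = False      # package is not installed or the environment variable disables it
# Finally, other modules use the nccl_enabled flag for their own protected imports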

This helps preserve PyLops-MPI’s minimal base installation. It required carefully isolating imports and adapting the module resolution logic using a backend dispatching mechanism. This is something I had never considered before.

The Basic Operator with NCCL (PR #137)

We chose MPIVStack as the first operator to implement NCCL support due to its simplicity. Several design choices emerged:

Implicit Communicator Propagation

We updated forward and adjoint calls to propagate the base_comm_nccl from input to output automatically. This way, if x is NCCL-enabled, then y = A @ x or A.H @ x will also be NCCL-enabled. This avoids mismatches and keeps operator pipelines consistent.

Interestingly, and contrary to our initial expectation, the operator itself did not need to take base_comm_nccl as an explicit argument the way DistributedArray does. This is good news: it reduces the number of communication cases that other developers have to handle when adding new operators to PyLops-MPI.
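The propagation pattern can be shown with two toy classes. `Array` and `Scale` below are hypothetical stand-ins, not PyLops-MPI classes; only the attribute name base_comm_nccl is taken from the text above.

```python
from typing import List, Optional


class Array:
    """Toy stand-in for a distributed array that may carry an NCCL communicator."""

    def __init__(self, data: List[float], base_comm_nccl: Optional[object] = None):
        self.data = data
        self.base_comm_nccl = base_comm_nccl


class Scale:
    """Toy operator y = a * x. It stores no communicator of its own."""

    def __init__(self, a: float):
        self.a = a

    def __matmul__(self, x: Array) -> Array:
        # The output inherits whatever communicator the input carries.
        return Array([self.a * v for v in x.data], base_comm_nccl=x.base_comm_nccl)


comm = object()  # placeholder for a real NCCL communicator
y = Scale(2.0) @ Array([1.0, 2.0], base_comm_nccl=comm)
z = Scale(2.0) @ Array([1.0])  # no NCCL communicator in, none out
```

Because the operator only reads the communicator from its input, the same operator instance works unchanged for MPI-only and NCCL-enabled arrays.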

Optional Dual-Communicator Design

As with DistributedArray, the ability to pass both an MPI communicator and an NCCL communicator proved to be a sound decision. By maintaining NCCL as an optional backend, we gain fine-grained control over which communication paths use NCCL versus MPI. This flexibility allowed us to optimize performance-critical paths while retaining MPI for control messages and small metadata transfers.

In particular, in the communication of ghost cells, which are used for computation around the boundary (as in derivative calculations), small metadata such as cell_fronts (typically lists of rank-sized integers) continues to be transmitted efficiently via MPI. This metadata is needed to allocate the send/receive buffers. Doing so leverages Python’s object serialization model (list[int]) without incurring GPU synchronization costs. The actual cell arrays, however, are communicated with NCCL, since they can be large.

What’s Next?

Aside from enabling NCCL support for the remaining operators with full test coverage, some more exciting upcoming updates are:

Complex-number type support for NCCL
Benchmarking results on a real HPC system

Stay tuned for Part 2, and thanks for reading!

¹ For the best performance, mpi4py would require buffer memory allocation as well. The mpi4py package provides two interfaces: buffered and non-buffered. Currently, PyLops-MPI takes the non-buffered approach, which suggests room for optimization.

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Tue, 22 Jul 2025 10:15:56 -0700

Midway Through OSRE

My Journey with LLMSeqRec

Hello from the Midpoint!

Hi everyone! I’m Connor Lee, a student at NYU studying Computer Science and Mathematics, and I’m excited to share the progress I’ve made halfway through the Open Source Research Experience (OSRE) with my project: LLMSeqRec – a large language model-enhanced sequential recommender system.

Over the past several weeks, I’ve had the opportunity to explore the intersection of recommender systems and large language models (LLMs), and it’s been a deep, challenging, and rewarding dive into building smarter, more contextual recommendation engines.

What is LLMSeqRec?

LLMSeqRec stands for LLM-Enhanced Contextual Sequential Recommender. Traditional sequential recommendation systems like SASRec are great at capturing patterns from user-item interactions, but they often fall short in two areas: understanding semantic context (e.g., item descriptions, reviews) and dealing with cold-start problems.

LLMSeqRec aims to address this by incorporating pretrained LLM embeddings into the recommendation pipeline. The goal is to enhance models like SASRec with semantic signals from text (like product reviews or titles), allowing them to better model user intent, long-range dependencies, and generalize to new items or users.

Progress So Far

✅ Baseline SASRec Runs

To establish a benchmark, I successfully ran the original SASRec implementation (in PyTorch) using both the MovieLens 1M and Amazon Beauty datasets. After debugging initial data formatting issues and adjusting batch sizes for local CPU/GPU compatibility, I automated training with scripts that let me scale to 200+ epochs to achieve the best performance, both in Colab and on my MacBook via CPU.

Note: At this stage, we have not yet integrated LLMs into the model. These baseline runs (SASRec) serve as the control group for evaluating the future impact of LLM-based enhancements.

What’s Next

As I enter the second half of the OSRE, I’ll be shifting gears toward LLM integration, model evaluation, and running LLM-powered sequential recommendations using product metadata and contextual information. Here’s what’s ahead:

Designing pipelines to extract and align textual metadata with item sequences
Integrating LLM-generated embeddings into the recommender model
Evaluating performance changes across different dataset characteristics

📊 Experimental Results

We have not yet utilized LLMs in our current experiments. The results below reflect our reproduced baseline performance of SASRec across datasets.

Below are the performance curves on different test sets, where we evaluate model performance every 20 epochs during training:

Beauty Dataset Performance

Hit@10 performance on the test set for the Beauty dataset (every 20 epochs)

Training loss for the Beauty dataset

NDCG@10 performance on the test set for the Beauty dataset (every 20 epochs)

ML-1M Dataset Performance

Training loss for the ML-1M dataset

Hit@10 performance on the test set for the ML-1M dataset (every 20 epochs)

NDCG@10 performance on the test set for the ML-1M dataset (every 20 epochs)

These results demonstrate that our baseline SASRec reproductions are converging as expected and will serve as a solid foundation for comparison once LLM integration is complete.
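For readers unfamiliar with the two metrics, here is a minimal sketch of how Hit@10 and NDCG@10 are computed for a single held-out item, using the standard single-target definitions. The function names are hypothetical, and the candidate scores are made up for illustration rather than taken from our runs.

```python
import math
from typing import Sequence


def rank_of_target(scores: Sequence[float], target_index: int) -> int:
    """0-based rank of the target: how many candidates score strictly higher."""
    target = scores[target_index]
    return sum(1 for s in scores if s > target)


def hit_at_k(rank: int, k: int = 10) -> float:
    """1 if the held-out item appears in the top k, else 0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank: int, k: int = 10) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


scores = [0.1, 0.9, 0.5, 0.7]        # model scores for four candidate items
rank = rank_of_target(scores, 2)     # the held-out item is candidate 2
hit, ndcg = hit_at_k(rank), ndcg_at_k(rank)
```

The reported curves are the averages of these per-user values over the test set, so Hit@10 rewards any top-10 placement equally while NDCG@10 also rewards placing the item closer to the top.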

Closing Thoughts

This project has been an exciting journey into both research and engineering and I’m excited to explore LLM-powered embedding integration in the upcoming phase.

I’m incredibly grateful to my mentors Dr. Linsey Pang and Dr. Bin Dong for their support and guidance throughout the project so far. I’m looking forward to sharing more technical results as we work toward building smarter, more adaptable recommender systems.

Midterm Report: KAN Integration into LLMs

Fri, 18 Jul 2025 00:00:00 +0000

Imagine if we could make neural networks that are not just more efficient, but smarter in how they learn. That’s the promise behind Kolmogorov–Arnold Networks (KANs)—a fascinating new architecture that replaces the usual “weighted sums and activation functions” with more mathematical finesse. Instead of processing all inputs in one big lump, KANs treat each input dimension individually, transforming them with elegant functions like B-splines or simpler polynomials. The idea is simple but powerful: do more with less.

For my project, I set out to explore what happens when we integrate these KAN layers into a lightweight language model called SmolLM2, training and testing it on the smol-smoltalk dataset.

Setting the Stage: SmolLM2 Meets KAN

The original SmolLM2 has 135 million parameters and 30 transformer blocks—plenty of moving parts. To keep things manageable during the initial phase, I created a mini version of the model with just 3 blocks and a trimmed-down vocabulary. This setup let me test dozens of KAN variations quickly, using a simple text classification task (AGNews) as a playground before moving on to full-scale language modeling.

Despite working with a simplified model, I managed to successfully train a full 30-block KAN-based SmolLM2. That model even passed challenging language benchmarks with flying colors—matching the performance of the original, linear-layer version. That’s a big win.

What Worked (and What Didn’t)

Along the way, I tried out a variety of KAN flavors: spline-based, radial basis functions (RBF), rational functions, and no fewer than eight types of orthogonal polynomials—like Chebyshev, Legendre, and Hermite. Each one brings its own quirks, strengths, and training times.

Some key takeaways:

Chebyshev (second kind) with a low polynomial degree (just 2!) delivered the best speed/accuracy trade-off.
Jacobi and Gegenbauer polynomials edged slightly ahead in raw accuracy but required much longer training times.
Replacing each linear layer with a KAN version (keeping parameter count similar) worked fine—but layering them in parallel or sequence didn’t add much.
A baseline with regular linear layers still performed slightly better (60.8% vs. 60.3%), but KANs showed they can come close with room for optimization.
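To show what a polynomial KAN layer computes, here is a dependency-free Python sketch using Chebyshev polynomials of the second kind at degree 2. It is an illustration under assumptions, not the project's PyTorch code: `chebyshev_u` and `kan_layer` are hypothetical names, and squashing inputs with tanh so they fall inside (-1, 1) is one common normalization choice that I assume here.

```python
import math
from typing import List


def chebyshev_u(x: float, degree: int) -> List[float]:
    """Return [U_0(x), ..., U_degree(x)] via U_{n+1}(x) = 2x U_n(x) - U_{n-1}(x)."""
    values = [1.0, 2.0 * x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]


def kan_layer(x: List[float], coeffs: List[List[List[float]]], degree: int = 2) -> List[float]:
    """coeffs[j][i][d] weights U_d applied to input i when producing output j.

    Each input dimension gets its own learnable 1-D function (a polynomial),
    and an output is the sum of those functions over the inputs.
    """
    basis = [chebyshev_u(math.tanh(v), degree) for v in x]  # assumed normalization
    return [
        sum(c * b for c_i, b_i in zip(c_j, basis) for c, b in zip(c_i, b_i))
        for c_j in coeffs
    ]


# Two inputs, one output, degree 2: 2 * 1 * 3 = 6 coefficients.
weights = [[[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]]
out = kan_layer([0.3, -1.2], weights)
```

A linear layer with the same shapes would hold 2 weights, so matching the parameter count of the original model means shrinking other dimensions or keeping the degree low, which fits the degree-2 result above.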

Why This Matters

What’s compelling is not just that KANs can work, but that they bring some appealing properties:

Parameter efficiency: Good performance with fewer or similarly-sized layers.
Flexibility: They adapt well to existing hyperparameters—less fine-tuning needed.
Stability: They run smoothly in fp16 (a lower-precision format), which is critical for efficient training.
Potential for richer activations: Some existing projects still rely on activations like ReLU or SiLU alongside KANs. But I found KANs alone could learn well without them, opening up more dynamic architectures in the future.

What’s Next

With the heavy lifting done (code written, models trained, ideas tested), the remainder of the project is focused on refinement. That means more training on generative tasks, better tuning of polynomial degrees, smarter initialization strategies, and potentially making KAN-based layers more plug-and-play.

The fact that a fully KAN-powered SmolLM2 can hold its own on tough language benchmarks is more than just a proof of concept. It’s a hint that we might not have to keep scaling models indefinitely to get better performance. Instead, we can get more from each parameter, by changing how the model thinks.

Halfway Through GSoC: My Experience and Progress

Thu, 17 Jul 2025 00:00:00 +0000

As part of the RAG-ST project, my proposal under the mentorship of Ziheng Duan aims to build a retrieval-augmented generation framework to predict spatial gene expression from histology images.

🚀 Achievements

✅ Ran the HEST-1K Pipeline

I successfully ran gene expression prediction models on the HEST-1K dataset, reproducing baseline image-to-expression workflows and setting up data loaders, evaluation metrics, and visual inspection of outputs.

✅ Explored Tangram’s Alignment Code

I studied and ran Tangram, a well-known scRNA-seq to ST alignment method, gaining key insights into cross-modality mapping. These ideas will inform our strategy to align histology images to scRNA-seq data.

✅ Designed the RAG-ST Architecture

I drafted the architecture for the RAG-ST pipeline, including:

Vision encoder to process image patches.
Retrieval module to find relevant examples from a curated database.
Generation head that conditions predictions on the retrieved examples — allowing transparency and context-aware outputs.
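
The retrieval step and the way it keeps predictions traceable can be sketched as follows. This is only a toy stand-in under stated assumptions: embeddings are plain lists, the "generation head" is replaced by a similarity-weighted average of the retrieved expression profiles, and the function names (`retrieve`, `predict_expression`) are hypothetical rather than part of RAG-ST.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, database, k=2):
    """Return the k most similar entries as (similarity, index) pairs."""
    scored = [(cosine(query, emb), idx) for idx, (emb, _) in enumerate(database)]
    scored.sort(reverse=True)
    return scored[:k]


def predict_expression(query, database, k=2):
    """Average the retrieved expression profiles, weighted by similarity.

    Returns the prediction and the indices of the retrieved examples,
    so every output can be traced back to its sources.
    """
    hits = retrieve(query, database, k)
    weights = [max(score, 0.0) for score, _ in hits]
    total = sum(weights) or 1.0
    n_genes = len(database[0][1])
    prediction = [
        sum(w * database[idx][1][g] for w, (_, idx) in zip(weights, hits)) / total
        for g in range(n_genes)
    ]
    return prediction, [idx for _, idx in hits]
```

Each database entry here is a pair of (patch embedding, expression vector). Returning the retrieved indices alongside the prediction is what makes the output inspectable.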

🧠 Challenges

Data Alignment: Spatial transcriptomics datasets often lack perfect alignment between histology, gene expression, and scRNA-seq, requiring custom preprocessing and normalization.
Trade-off Between Interpretability and Accuracy: Retrieval-augmented designs allow us to trace the origin of predictions but require care to avoid overfitting or performance drops.
Computation: High-resolution images and large-scale retrieval can be computationally expensive. I’ve begun exploring downsampling and vector database indexing strategies.

🔜 What’s Next

🔧 Build the end-to-end retrieval-generation pipeline
🧬 Prototype histology-to-scRNA-seq alignment using adapted Tangram ideas
📊 Benchmark RAG-ST vs. MLP baselines
👁️ Develop interpretability visualizations to show which samples were retrieved for each prediction

🧾 Deliverables Progress

Deliverable	Status
HEST-1K Baseline Pipeline	✅ Completed
Tangram Exploration	✅ Completed
Data Curation	🟡 In Progress
RAG-ST Architecture	✅ Drafted
Full Pipeline	⏳ Planned
Evaluation & Comparison	⏳ Planned

🙌 Closing Thoughts

It’s been a rewarding first half of GSoC. I’ve gained hands-on experience with spatial transcriptomics datasets, explored state-of-the-art tools like Tangram, and laid the groundwork for a new interpretable gene prediction model.

I’m excited to continue building RAG-ST and look forward to sharing more results soon. Huge thanks to my mentor Ziheng Duan for the guidance and support throughout!

If you have questions or want to discuss spatial modeling, feel free to reach out.

Midterm Blog - WildBerryEye User Interface

Wed, 16 Jul 2025 00:00:00 +0000

Hi, my name is Sophie Tao. I am a graduate of the University of Washington, where I majored in Electrical and Computer Engineering. I’m happy to share the progress I have made over the last six weeks on my GSoC 2025 project, WildBerryEye, mentored by Carlos Isaac Espinosa.

Project Overview

WildBerryEye is an open-source initiative to support ecological monitoring of pollinators such as bees and hummingbirds using edge computing and computer vision. The project leverages a Raspberry Pi and YOLO for object detection and aims to provide an accessible, responsive, and real-time web interface for researchers, ecologists, and citizen scientists.

This project specifically focuses on building the frontend and backend infrastructure for WildBerryEye’s user interface, enabling:

Real-time pollinator detection preview
- Real-time image capture
- Real-time video capture
Responsive, user-friendly UI
Object detection
Researcher-friendly configuration and usability

Progress So Far

✅ Phase 1: Setup

Frontend: Completed React + TypeScript project initialization with routing and base components. Pages include:
- Home page (with image preview)
- Dashboard page (pollinator image & video)
Backend: Flask server initialized with modular structure. Basic API endpoints stubbed as per the proposal.

✅ Phase 2: Core Features

Real-Time Communication: Frontend successfully receives image stream using WebSocket.
UI Components:
- Implemented image carousel preview on homepage.
- Image Capture (Image download)
- Video Capture (Video Preview, Video Recording)
- Sidebar-based navigation and page structure fully integrated.
API Development:
- Implemented core routes such as /home and /dashboard.
- Backend handlers structured for image and video capture.

Challenges Encountered

⚠️ Real-time Image Testing: Lack of consistent live camera input made local testing inconsistent.
⚠️ Camera Sharing: Allocating the single camera module to both image capture and video capture.
⚠️ Video Format: Obtaining the recorded video in the proper format.

Next Steps

Enable more features for video capture
Integrate with the machine learning model
Conduct at least one usability test (self + external user) and incorporate feedback.
Final Testing & Docs

Summary

At this midterm stage, the WildBerryEye UI project is on track with core milestones completed, including real-time communication, component setup, and backend API structure. The remaining work focuses on refinement, visualizations, testing, and documentation to ensure a polished final product by the end of GSoC 2025.

Mid-term Blog: StatWrap: Cross-Project Searching and Classification using Local Indexing

Tue, 15 Jul 2025 00:00:00 +0000

Introduction

About the Project

As part of the project, I am working on enhancing the usability of StatWrap by enabling efficient cross-project search capabilities. The goal is to make it easier for investigators to discover relevant projects, notes, and assets—across both current and archived work—using information that is either user-entered or passively collected by StatWrap.

Given the sensitivity of the data involved, one of the key requirements is that all indexing and search operations must be performed locally. To address this, my responsibilities include:

Evaluating open-source search libraries suitable for local indexing and retrieval
Building the full-text search functionality directly into the StatWrap UI to allow seamless querying across projects
Ensuring reliability through the development of unit tests and comprehensive system testing
Implementing a classification system to label projects as “Active,” “Pinned,” or “Past” within the user interface

Progress

It has been more than six weeks since the project began, and significant progress has been made. Here’s a breakdown:

1. Descriptive Comparison of Open-Source Libraries

2. The Libraries

Lunr.js
A small, client-side full-text search engine that mimics Solr capabilities.
- Field-based search, boosting
- Supports TF-IDF, inverted index
- No built-in fuzzy search (only basic wildcards)
- Can serialize/deserialize index
- Not designed for large datasets
- Moderate memory usage and indexing speed
- Good documentation
- Best for: Static websites or SPAs needing simple in-browser search
ElasticLunr.js
A lightweight, more flexible alternative to Lunr.js.
- Dynamic index (add/remove docs)
- Field-based and weighted search
- No advanced fuzzy matching
- Faster and more customizable than Lunr
- Smaller footprint
- Easy to use and maintain
- Best for: Developers wanting Lunr-like features with simpler customization
Fuse.js
A fuzzy search library ideal for small to medium datasets.
- Fuzzy search with typo tolerance
- Deep key/path searching
- No need to build index
- Highly configurable (threshold, distance, etc.)
- Linear scan = slower on large datasets
- Not full-text search (scoring-based match)
- Extremely easy to set up and use
- Best for: Fuzzy search in small in-memory arrays (e.g., auto-suggest, dropdown filters)
FlexSearch
A blazing-fast, modular search engine with advanced indexing options.
- Extremely fast search and indexing
- Supports phonetic, typo-tolerant, and partial matching
- Asynchronous support
- Multi-language + Unicode-friendly
- Low memory footprint
- Configuration can be complex for beginners
- Best for: High-performance search in large/multilingual datasets
MiniSearch
A small, full-text search engine with balanced performance and simplicity.
- Fast indexing and searching
- Fuzzy search, stemming, stop words
- Field boosting and prefix search
- Compact, can serialize index
- Clean and modern API
- Lightweight and easy to maintain
- Best for: Balanced, in-browser full-text search for moderate datasets
Search-Index
A persistent, full-featured search engine for Node.js and browsers.
- Persistent storage with LevelDB
- Real-time indexing
- Fielded queries, faceting, filtering
- Advanced queries (Boolean, range, etc.)
- Slightly heavier setup
- Good for offline/local-first apps
- Browser usage more complex than others
- Best for: Node.js apps, not directly compatible with the Electron + React environment of StatWrap

3. Developer Experience and Maintenance

We analyzed the download trends of the search libraries using npm trends, and also reviewed their maintenance statistics to assess how frequently they are updated.

4. Comparative Analysis After Testing

Each search library was benchmarked against a predefined set of queries based on the same evaluation criteria.
We have yet to finalize the weights for each criterion; this will be done during the end-term evaluation.

5. The User Interface

The user interface includes options to search using three search modes (Basic, Advanced, Boolean operators) with configurable parameters. Results are sorted based on relevance score (highest first), and also grouped by category.

6. Overall Functioning

Indexing Workflow
- Projects are processed sequentially
- Metadata, files, people, and notes are indexed (larger files are queued for later)
- Uses a “brute-force” recursive approach to walk through project directories
  - Skips directories like node_modules, .git, .statwrap
  - Identifies eligible text files for indexing
  - Logs progress every 10 files
Document Creation Logic
- Reads file content as UTF-8 text
- Builds searchable documents with filename, content, and metadata
- Auto-generates tags based on content and file type
- Adds documents to the search index and document store
- Handles errors gracefully with debug logging
Search Functionality
- Uses field-weighted search
- Enriches results with document metadata
- Supports filtering by type or project
- Groups results by category (files, projects, people, etc.)
- Implements caching for improved performance
- Search statistics are generated to monitor performance
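
The indexing workflow above can be sketched roughly as follows. StatWrap itself is an Electron + React application, so this Python version is only an illustration of the traversal logic. The extension list, the size cutoff for "larger files", and the document fields are assumptions for the example, not StatWrap's actual values.

```python
import os

SKIP_DIRS = {"node_modules", ".git", ".statwrap"}
TEXT_EXTENSIONS = {".txt", ".md", ".r", ".py", ".csv"}  # assumed list
MAX_BYTES = 1_000_000  # assumed cutoff; larger files are queued for later


def collect_documents(project_root, log_every=10):
    """Walk a project directory and build searchable documents."""
    documents, deferred = [], []
    for dirpath, dirnames, filenames in os.walk(project_root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in sorted(filenames):
            extension = os.path.splitext(name)[1].lower()
            if extension not in TEXT_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > MAX_BYTES:
                deferred.append(path)
                continue
            try:
                with open(path, encoding="utf-8") as handle:
                    content = handle.read()
            except (OSError, UnicodeDecodeError):
                continue  # skip unreadable files instead of aborting the run
            documents.append({
                "filename": name,
                "path": path,
                "content": content,
                "tags": [extension.lstrip(".")],  # simple tag derived from file type
            })
            if len(documents) % log_every == 0:
                print(f"indexed {len(documents)} files")
    return documents, deferred
```

The returned documents would then be added to whichever search library is selected, while the deferred list holds the large files to be processed afterwards.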

Challenges and End-Term Goals

In-Memory Indexing and Metadata Storage
Most JavaScript search libraries (like Fuse.js, Lunr, MiniSearch) store indexes entirely in memory, which can become problematic for large-scale datasets. A key challenge is designing a scalable solution that allows for disk persistence or lazy loading to prevent memory overflows.
Deciding the Weights Accordingly
An important challenge is tuning the relevance scoring by assigning appropriate weights to different aspects of the search, such as exact word matches, prefix matches, and typo tolerance. For instance, we prefer exact matches to be ranked higher than fuzzy or partial matches.
Implementing the Selected Library
Once a library is selected (based on speed, features, and compatibility with Electron + React), the next challenge is integrating it into StatWrap efficiently—ensuring local indexing, accurate search results, and smooth performance even with large projects.
Classifying Active and Past Projects in the User Interface
To improve navigation and search scoping, we plan to introduce three project sections in the interface: Pinned, Active, and Past projects. This classification will help users prioritize relevant content while enabling smarter indexing strategies.
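
As an illustration of the relevance-weighting challenge described above, the sketch below ranks exact matches above prefix matches and prefix matches above fuzzy (one-edit) matches. The weight values are placeholders chosen for the example, since the real weights have not been finalized, and the helper names are hypothetical.

```python
# Placeholder weights: exact > prefix > fuzzy. The real values are still undecided.
WEIGHTS = {"exact": 1.0, "prefix": 0.6, "fuzzy": 0.3}


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_score(query, token, weights=WEIGHTS):
    """Score one indexed token against the query term."""
    q, t = query.lower(), token.lower()
    if q == t:
        return weights["exact"]
    if t.startswith(q):
        return weights["prefix"]
    if edit_distance(q, t) <= 1:
        return weights["fuzzy"]
    return 0.0


def rank_tokens(query, tokens):
    """Return matching tokens, best score first; non-matches are dropped."""
    scored = [(match_score(query, token), token) for token in tokens]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: -pair[0])
    return [token for _, token in scored]
```

For the query "index", this ordering puts "index" first, then "indexing", then the typo "indez".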

Stay tuned for the next blog!

Midway Through GSoC

Mon, 14 Jul 2025 00:00:00 +0000

Hello everyone! I’m Pratham Devadiga, and I’m thrilled to share a midterm progress update on my GSoC 2025 project with the Open Source Research Experience (OSRE). My project is focused on building the first open-source billion-scale vector embeddings dataset from real-world open source code to support benchmarking of Approximate Nearest Neighbor (ANN) algorithms and facilitate research in Retrieval-Augmented Generation (RAG).

Project Overview

The goal of this project is to address a critical gap in the ecosystem: existing ANN benchmarks are either synthetic or limited in scale. With the explosion of code-focused LLMs and embedding models, there’s a pressing need for:

High-volume, high-dimensional vector datasets built from real-world data (open-source codebases).
Open, reproducible benchmarks that reflect realistic RAG workloads.
A dataset that can be used to evaluate ANN libraries like FAISS, HNSW, and Annoy on massive and practical retrieval tasks.

Our approach is to use high-quality open-source code repositories to extract meaningful code chunks, encode them into vector embeddings using open models, and make these datasets publicly available with metadata for downstream benchmarking and analysis.

Progress So Far

We’ve made substantial foundational progress in the first half of the coding period. Key highlights:

Tested multiple embedding models such as CodeBERT, MiniLM-L6-v2, and all-mpnet-base-v2, evaluating trade-offs in speed, dimensionality, and GPU memory.
Selected codebert-base (768d) as the current model for phase one due to its stable performance and manageable resource footprint.
Implemented and validated a complete script pipeline to:
- Traverse large open-source repositories.
- Extract and chunk code intelligently (functions, classes, modules).
- Encode code into embeddings and attach metadata (repo, file path, license).
- Store results efficiently in parquet and NumPy formats.
Tested all components of the pipeline on sample datasets using multi-GPU setups, ensuring compatibility and robustness.
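
The chunking step of the pipeline can be illustrated with a small sketch. It handles Python source only and uses the standard-library `ast` module to split a file into top-level functions and classes with attached metadata. The function name and the record layout are assumptions for this example, and the embedding step itself is left out.

```python
import ast


def chunk_python_source(source, repo, path, license_id):
    """Split Python source into top-level function/class chunks with metadata."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                # get_source_segment recovers the exact text of this node.
                "code": ast.get_source_segment(source, node),
                "repo": repo,
                "file_path": path,
                "license": license_id,
            })
    return chunks
```

Each returned record would then be passed to the embedding model, and the resulting vector stored next to the same metadata fields.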

Challenges and Learnings

Building a billion-scale dataset from real-world codebases is no small task. Here’s what we’ve encountered and learned along the way:

1. Multi-GPU Pipeline Design

Naively parallelizing the embedding process caused memory overflow and deadlocks due to model reloading across processes. We refactored the code using torch.multiprocessing and pinned GPU contexts to avoid such issues, improving throughput on multi-GPU machines.

2. Embedding Trade-offs

We experimented with larger models but found that their generation time and memory use were too high to be practical in early phases. This helped us narrow down to scalable configurations for initial dataset generation.

3. Preparing for Scale

Although the embeddings are not generated yet, all scripts are now modular, parallelized, and reproducible, ensuring a smooth transition to billion-scale data generation in the second half.

What’s Next

The second half of the project will focus on:

Scaling up embedding generation to >1B code chunks across hundreds of open-source repositories.
Running benchmarks using FAISS, HNSW, and Annoy on these embeddings.
Releasing the dataset on Hugging Face and AWS S3 with sharded access and metadata.
Writing a detailed benchmarking report comparing speed, accuracy, and memory trade-offs across ANN algorithms.
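
For the accuracy side of such a benchmark, a common measure is recall@k: the fraction of the true k nearest neighbors that an approximate index returns. The sketch below computes it against a brute-force ground truth. It is a generic illustration on tiny lists, not the project's benchmarking code, and the function names are hypothetical.

```python
def exact_neighbors(query, vectors, k):
    """Brute-force k nearest neighbors by squared Euclidean distance."""
    distances = sorted(
        (sum((q - v) ** 2 for q, v in zip(query, vector)), index)
        for index, vector in enumerate(vectors)
    )
    return [index for _, index in distances[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbors that the approximate search found."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)
```

In a real run, `approx_ids` would come from a library such as FAISS or Annoy, and the brute-force pass would be computed once on a query sample because it is too slow at billion scale.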

Final Thoughts

This journey so far has taught me a lot about building large-scale ML pipelines, managing real-world compute constraints, and ensuring reproducibility for research-grade datasets. I’m grateful to my mentor Jayjeet Chakraborty and the OSRE team for their continuous support and guidance.

Excited for the next half, where the real scale begins!

Stay tuned for updates. You can find more about the project on my OSRE project page.

CarbonCast

Thu, 10 Jul 2025 00:00:00 +0000

As part of the CarbonCast project, my proposal under the mentorship of Professor Abel Souza aims to build an API that makes carbon intensity forecasts more accessible and actionable, so that users can access and use energy data to optimize their electricity consumption. Before diving into the details of the project, I’d like to share a bit about my background.

About Me

Hi, I’m Tanush—a rising senior at the University of Massachusetts Amherst, majoring in Computer Science and Mathematics and graduating in Spring 2026. Currently, I’m an AI Intern for the Commonwealth of Massachusetts Department of Unemployment Assistance, where I’m developing an end-to-end retrieval-augmented generation (RAG) chatbot on AWS.

In the past, I’ve contributed to CarbonCast in a different capacity, designing a user interface to help visualize carbon intensity forecasts. I also worked at MathWorks as a Machine Learning Intern, where I collaborated in an Agile environment to design and deploy predictive models that improved precision torque control and dynamic responsiveness in motor-driven robotic and industrial systems.

I’m excited to bring these experiences to this year’s GSoC project, where I’ll be building tools to make carbon data more accessible and actionable for everyone.

What is CarbonCast?

CarbonCast is a Python-based machine-learning library designed to forecast the carbon intensity of electrical grids. Carbon intensity refers to the amount of carbon emitted per kilowatt-hour (kWh) of electricity consumed. The current version of CarbonCast delivers accurate forecasts in numerous regions by using each region's historical energy production data, time of day/year, and weather forecasts as features.

However, there is no easy way to access, visualize, and utilize the data through a standard interface. In addition, much important information is left out and is not available to users. For instance, electricity grids often import electricity from neighboring regions, and so electricity consumption depends on both electricity generation and imports. Moreover, it is imperative for each energy source to utilize a tailored predictive mechanism. Consequently, any carbon optimization solution trying to reduce carbon emissions due to its electricity consumption will benefit more from following a consumption-based carbon intensity signal.
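
To show why imports change the picture, here is a small worked sketch of a generation-based versus a consumption-based carbon intensity. It is a simplified illustration, not CarbonCast's model: exports are ignored, the function names are hypothetical, and the emission factors in the usage note are round placeholder numbers rather than real data.

```python
def generation_intensity(generation_mwh, emission_factors):
    """Energy-weighted average emission factor of local generation (gCO2/kWh)."""
    total = sum(generation_mwh.values())
    emissions = sum(mwh * emission_factors[source]
                    for source, mwh in generation_mwh.items())
    return emissions / total


def consumption_intensity(generation_mwh, emission_factors, imports):
    """Blend local generation with imports.

    `imports` is a list of (mwh, intensity) pairs, one per neighboring region,
    where intensity is that region's own carbon intensity in gCO2/kWh.
    """
    local_total = sum(generation_mwh.values())
    local_emissions = sum(mwh * emission_factors[source]
                          for source, mwh in generation_mwh.items())
    import_total = sum(mwh for mwh, _ in imports)
    import_emissions = sum(mwh * intensity for mwh, intensity in imports)
    return (local_emissions + import_emissions) / (local_total + import_total)
```

With 100 MWh from a source at 1000 gCO2/kWh and 100 MWh from a zero-carbon source, the generation-based value is 500 gCO2/kWh. Adding 200 MWh of imports at 100 gCO2/kWh lowers the consumption-based value to 300 gCO2/kWh, which is the signal a consumer would actually want to follow.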

Unlike other third-party carbon services, CarbonCast’s model is open-sourced, allowing users to study, understand, and improve its behavior. This transparency invites public collaboration and innovation. It also contrasts sharply with proprietary services that often withhold both the logic behind their models and the data they are trained on.

Why This Matters

Electricity usage is one of the largest contributors to carbon emissions globally. Carbon intensity—the amount of carbon emitted per kilowatt-hour of electricity consumed—varies based on how electricity is generated and demanded (for example, coal versus solar). With better visibility into when the grid is cleaner, individuals and organizations can shift their energy consumption to lower-carbon periods and lower prices. This enables everyday energy optimizations without compromising comfort or productivity.

By improving CarbonCast’s accessibility and functionality, we are helping people and institutions answer questions like:

When is the best time to charge my EV to reduce environmental impact?
Can I run my energy-hungry server jobs when the electricity is cheaper?
How do I actually reduce my emissions without guessing?

By providing clear, accurate forecasts of carbon intensity, CarbonCast can help users make informed decisions to optimize their energy footprint and reduce emissions without sacrificing convenience or productivity.
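
A question like "when should I charge my EV?" reduces to finding the lowest-carbon window in a forecast. A minimal sketch, assuming the forecast is an hourly list of carbon intensity values and the load needs a fixed number of consecutive hours (the function name and input format are hypothetical, not part of the CarbonCast API):

```python
def cleanest_window(forecast, hours_needed):
    """Return (start_hour, average_intensity) of the lowest-carbon window."""
    if hours_needed <= 0 or hours_needed > len(forecast):
        raise ValueError("hours_needed must be between 1 and len(forecast)")
    best_start, best_average = 0, float("inf")
    for start in range(len(forecast) - hours_needed + 1):
        average = sum(forecast[start:start + hours_needed]) / hours_needed
        if average < best_average:
            best_start, best_average = start, average
    return best_start, best_average
```

A multi-day forecast simply makes the list longer, which gives flexible loads more candidate windows to choose from.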

What I’m Building

This summer, I am developing the backend API services for CarbonCast, with a focus on two major goals:

Geographical Expansion

I am extending CarbonCast’s compatibility to support more regional electricity grids. Each model will be customized for local grid behavior and renewable energy characteristics. This involves tuning the model pipeline to adapt to each region’s energy mix, weather patterns, and reporting granularity.

System Refactoring and Modularity

The original CarbonCast system was built as a research artifact. To refine it into production-grade infrastructure, I am refactoring the codebase to improve modularity. This makes it easier to plug in new regions, update forecasting algorithms, and integrate new data sources.

Impact Beyond Research

The paper that inspired this project, Multi-day Forecasting of Electric Grid Carbon Intensity using Machine Learning, pioneered the idea of forecasting carbon intensity over multiple days using a hierarchical machine learning model. This goes beyond the typical 24-hour day-ahead models that are common in the industry and allows for better planning and longer-term decision-making.

CarbonCast builds directly on that foundation by transforming research into practical, real-world infrastructure. It is an open-source library that anyone can run, contribute to, and benefit from. Whether you’re a developer building carbon-aware applications, a policymaker working on grid decarbonization strategies, or a sustainability-conscious individual looking to reduce your carbon footprint, CarbonCast provides the tools to make informed, impactful choices.

Looking Ahead

I am excited to contribute to a project that blends machine learning, systems engineering, sustainability, and public impact. My goal is to help make it easier for everyone to see, understand, and act on their carbon footprint while also providing the “visibility” people need to take meaningful, informed actions.

Rectilinear Floorplans in OpenROAD

Thu, 10 Jul 2025 00:00:00 +0000

Google Summer of Code ‘25: Enabling Rectilinear Floorplanning in OpenROAD

This summer, under the guidance of my mentors Eder Monteiro and Augusto Berndt at the OpenROAD project, I am implementing support for polygonal (specifically rectilinear) die shapes in OpenROAD’s floorplanning flow.

Here’s a link to my original proposal

What is OpenROAD and why polygonal floorplans?

OpenROAD is a fully autonomous RTL-to-GDS digital layout toolchain. The OpenROAD flow delivers an autonomous, No-Human-In-Loop (NHIL) flow with 24-hour turnaround from RTL to GDSII for rapid design exploration and physical design implementation.

Until now, OpenROAD primarily supported rectangular die shapes in its floorplanning. This limits its use for advanced packaging, 2.5D/3D ICs, or irregular chiplet-based designs, where non-rectangular dies are increasingly common.

By extending the floorplanner to handle rectilinear (non-rectangular, but still axis-aligned) dies, we open the door for a broader class of cutting-edge VLSI layouts.

Motivation and what gap does this fill?

From my background in electronics engineering, I’ve seen that advanced packaging has been pushing toward unusual die shapes, whether for stacking, interposers, or simply to optimize area and thermal profiles.

Right now, OpenROAD’s inability to handle these non-rectangular dies is a blocker for certain modern flows. My project directly addresses this by:

Extending the Tcl and internal APIs to accept and validate polygonal die/core shapes.
Modifying the row generation and site placement algorithms to conform to polygonal boundaries.
Ensuring that all downstream modules (global placement, detailed placement, routing) can consume the new floorplan data structures without any issues.

My approach & engineering plan

My work focuses on maintaining robustness and backward compatibility, while introducing this major new feature.

Floorplan Input: Users can now specify die/core shapes as sequences of x/y coordinates (-die_polygon and -core_polygon) directly from the Tcl interface.
Data structures: Extend OpenROAD’s internal representations to store arbitrary rectilinear polygons and propagate these safely through the design pipeline.
Row Generation: Develop a new function (make_polygon_rows) to fill polygonal die areas with standard cell rows, properly clipped to the die shape.
Verification: Build rigorous regression and sanity checks. This includes both internal checks and external DEF writeouts that can be visualized in other tools.
Testing & Benchmarks: Prepare a suite of testcases (including complex T-shaped and L-shaped dies) to validate correctness.
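
The idea behind clipping rows to a rectilinear shape can be sketched in a few lines. This is a standalone illustration, not OpenROAD's make_polygon_rows implementation: it assumes integer coordinates, a simple polygon given as a vertex list, and rows stacked upward from the lowest edge. All function names are hypothetical.

```python
def is_rectilinear(points):
    """True if every edge is axis-aligned and has non-zero length."""
    if len(points) < 4:
        return False
    for i, (x0, y0) in enumerate(points):
        x1, y1 = points[(i + 1) % len(points)]
        if (x0 != x1) == (y0 != y1):  # diagonal edge or repeated vertex
            return False
    return True


def inside_intervals(points, y):
    """x-intervals inside the polygon along the horizontal line at height y."""
    crossings = []
    for i, (x0, y0) in enumerate(points):
        x1, y1 = points[(i + 1) % len(points)]
        if x0 == x1 and min(y0, y1) < y < max(y0, y1):
            crossings.append(x0)
    crossings.sort()
    return list(zip(crossings[0::2], crossings[1::2]))


def intersect(first, second):
    """Pairwise intersection of two interval lists."""
    result = []
    for lo1, hi1 in first:
        for lo2, hi2 in second:
            lo, hi = max(lo1, lo2), min(hi1, hi2)
            if lo < hi:
                result.append((lo, hi))
    return sorted(result)


def sketch_polygon_rows(points, row_height, site_width):
    """Return rows as (x_start, y, site_count), clipped to the polygon."""
    if not is_rectilinear(points):
        raise ValueError("polygon must be rectilinear")
    ys = sorted({y for _, y in points})
    x_min = min(x for x, _ in points)
    rows, y = [], ys[0]
    while y + row_height <= ys[-1]:
        top = y + row_height
        # The shape can only change at vertex heights, so test each sub-band.
        cuts = [y] + [v for v in ys if y < v < top] + [top]
        spans = None
        for lo, hi in zip(cuts, cuts[1:]):
            band = inside_intervals(points, (lo + hi) / 2)
            spans = band if spans is None else intersect(spans, band)
        for lo, hi in spans:
            # Snap the row start up to the site grid anchored at x_min.
            start = x_min + -((x_min - lo) // site_width) * site_width
            sites = (hi - start) // site_width
            if sites > 0:
                rows.append((start, y, sites))
        y = top
    return rows
```

For an L-shaped die with vertices (0,0), (4,0), (4,2), (2,2), (2,4), (0,4), unit row height, and unit site width, the two lower rows get four sites each and the two upper rows get two sites each. A plain rectangle yields the same rows a rectangular flow would produce.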

I’m especially careful to keep the existing rectangular flow untouched. The new features only engage when the user explicitly specifies polygonal options.

Acknowledgments

I’m deeply grateful to my mentors from the OpenROAD community who have given invaluable guidance - Eder Monteiro and Augusto Berndt. I’m also excited to contribute this to an open-source EDA project that’s shaping the future of accessible hardware design.

Auditing Skin Tone Bias in Text-to-Image Models

Wed, 09 Jul 2025 00:00:00 +0000

As part of the Stable Diffusion Bias Project, my proposal focuses on evaluating bias in visual outputs of generative AI models, particularly skin tone bias in Stable Diffusion.

The goal is to analyze how models render people based on prompts like “a doctor” or “a homeless person,” and whether certain prompts systematically result in lighter or darker skin tones—even when race isn’t explicitly mentioned.

🧪 What I’ve Done So Far

Designed a prompt template covering six social categories (e.g., criminal justice, profession, socioeconomic)
Generated image datasets using Stable Diffusion with varied seeds
Built a preprocessing pipeline to estimate melanin values from generated faces
Created early visualizations showing distributional trends in skin tone
Identified early evidence of bias in prompts linked to status or wealth

⚒️ Tools and Methods

Stable Diffusion for controlled image generation
BioSkin pipeline to extract melanin metrics
Fitzpatrick skin type approximation (in development as a validation method)
Python-based data analysis and prompt auditing
openai/CLIP and BLIP for optional image-text alignment scoring

🔍 What I’m Seeing

Preliminary results show that even neutral prompts like “a portrait of a professor” tend to favor lighter skin tones, while prompts such as “a manual laborer” or “a homeless person” skew toward darker tones. These trends are not always obvious to the human eye, which is why quantitative skin tone analysis is essential.

I’m now exploring whether prompt engineering (e.g., adding “fair,” “dark-skinned,” or “diverse” descriptors) can help mitigate these imbalances.

🚧 What’s Next

Expand dataset to 60 prompts across 6 categories
Incorporate alternate T2I models (Midjourney, DALL·E 3)
Write a technical report and reproducible evaluation framework
Submit a short paper or workshop proposal to a fairness or ethics venue

Benchmarking the Future: Exploring High-Speed Scientific Data Streaming

Sun, 06 Jul 2025 00:00:00 +0000

Hello! I’m Ankit Kumar, and although I’m a bit late with this introduction post due to a busy period filled with interviews and college formalities, I’m excited to share my journey with the OSRE 2025 program and the fascinating world of scientific data streaming.

About Me

I’m currently pursuing my BTech degree at the Indraprastha Institute of Information Technology Delhi (IIIT Delhi) and am based in New Delhi, India. As I approach graduation, I’m thrilled to be working on a project that perfectly aligns with my interests in systems and networking.

My passion for technology has led me through various experiences:

Software Developer at CloudLabs: I worked at a platform founded by Dr. Sumit J Darak that facilitates remote access to actual FPGA boards on a slot basis, making hardware experimentation accessible to students worldwide.
Data Mining Intern at TaskTracker.in: This experience gave me insights into large-scale data processing and analysis.
Undergraduate Researcher: Currently working under Dr. Mukulika Maity on benchmarking QUIC and TCP protocols across different environments including bare metal, virtual machines, and containers.

I chose this OSRE project because it represents an incredible opportunity to work with some of the best minds in the industry at Argonne National Laboratory (ANL) while diving deep into cutting-edge networking technologies.

My Project: SciStream Performance Analysis

As part of the SciStream project, I’m focusing on two critical aspects of high-performance scientific data streaming:

1. TCP/UDP Performance Benchmarking

I’m conducting comprehensive benchmarking of SSH and TLS tunnels using various open-source tools and parameters. This work is crucial for understanding how different protocols and their overhead impact the performance of real-time scientific data streaming. The goal is to provide researchers with evidence-based recommendations for moving and processing their high-speed data without compromising performance.

2. QUIC Proxy Exploration

I’m exploring different QUIC proxy implementations to understand their potential advantages over traditional TCP+TLS proxies in scientific workflows. QUIC, the protocol that powers modern web applications like YouTube, offers promising features for scientific data streaming, but comprehensive benchmarking is needed to validate its benefits.

Working with Cutting-Edge Testbeds

Currently, I’m conducting experiments using both the FABRIC testbed and ESnet testbed. These platforms provide access to real high-speed network infrastructure, allowing me to test protocols and configurations under realistic conditions that mirror actual scientific computing environments.

The Team Experience

These past two weeks have been incredibly rewarding, working alongside:

Alain Zhang - My project mate from UC San Diego, cool guy.
Flavio Castro - My project mentor and manager, and the go-to person for my issues. He is currently at ANL as a research development software engineer.
Joaquin Chung - Super mentor and the brains behind the project. His guidance on the project is super valuable.
Rajkumar Kettimuthu - Lead Scientist on our project, whose comments on our paper critique are invaluable.
Seena Vazifedunn - Graduate Research Assistant at the University of Chicago. He asks very relevant and important questions during our report presentations, and his feedback is very insightful.

The collaborative nature of this project has been fantastic, combining perspectives from different institutions and backgrounds to tackle complex networking challenges.

Stay tuned for updates!

This work is part of the SciStream project at Argonne National Laboratory, reimagining how scientific data moves across modern research infrastructure.

Enhancing Reproducibility in RAG Frameworks for Scientific Workflows

Wed, 25 Jun 2025 00:00:00 +0000

Hello, I’m Baiqiang. As part of the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project, I am excited to introduce my work on a crucial challenge in modern computational science. My proposal under the mentorship of Luanzheng “Lenny” Guo at Pacific Northwest National Laboratory and Dongfang Zhao at the University of Washington aims to enhance the reproducibility of AI-driven scientific workflows.

The Problem: A Crisis of Confidence in AI for Science

Large Language Models (LLMs) are transforming scientific research, from accelerating literature reviews to generating novel hypotheses. However, their power is matched by their pitfalls: a tendency to “hallucinate” facts and a lack of transparency. Retrieval-Augmented Generation (RAG) was developed as a powerful solution, grounding LLM outputs in factual evidence retrieved from a specific knowledge base (like a database of scientific papers).

But a hidden problem lurks within RAG: non-determinism. The very first step of a RAG system—the similarity search that finds relevant documents—can produce different results even when asked the same question. Variations in indexing algorithms, data updates, or even the underlying software can change which documents are retrieved. For science, this is a critical flaw. If an experiment cannot be repeated with the same results, its conclusions cannot be trusted. This project tackles that challenge head-on.

Our Mission: Forging a Path to Reproducible RAG

This project proposes a comprehensive solution to systematically identify, measure, and mitigate non-determinism in RAG frameworks. Our goal is to empower researchers to build and use AI tools with confidence.

Our approach is built on four key pillars:

Systematic Analysis: We will conduct a deep dive into popular RAG components (like FAISS, ScaNN, and HNSW) to pinpoint the exact sources of randomness and variability.
Rigorous Benchmarking: We will develop a public, open-source benchmarking suite using standardized scientific datasets (from PubMed, arXiv, etc.). This will allow anyone to quantitatively measure the reproducibility of their own RAG pipeline using clear metrics like retrieval overlap and rank correlation.
Targeted Enhancements: Based on our findings, we will implement practical solutions, including:
- Promoting deterministic algorithms and configurations.
- Building robust data versioning and provenance tracking tools (inspired by DVC and Git LFS).
- Creating tools for precise configuration management to capture the entire experimental setup.
Practical Guidance and Open Source Tools: We will distill our insights into comprehensive documentation, reusable code examples, and best practices. All tools and findings will be contributed back to the open-source community.
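
As a rough illustration of the two metrics named above, the following sketch compares the ranked document IDs returned by two runs of the same query. The function names and the sample document IDs are assumptions made for this example, not part of the planned benchmarking suite, and the code assumes each run contains no duplicate IDs.

```python
from itertools import combinations


def retrieval_overlap(run_a, run_b):
    """Jaccard overlap of the document IDs returned by two runs (order ignored)."""
    set_a, set_b = set(run_a), set(run_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


def kendall_tau(run_a, run_b):
    """Kendall rank correlation over the documents that both runs returned."""
    in_b = set(run_b)
    common = [doc for doc in run_a if doc in in_b]  # kept in run_a order
    if len(common) < 2:
        return None  # correlation is undefined for fewer than two shared documents
    rank_b = {doc: position for position, doc in enumerate(run_b)}
    concordant = discordant = 0
    for first, second in combinations(common, 2):
        # `first` precedes `second` in run_a; check whether run_b agrees.
        if rank_b[first] < rank_b[second]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical results for the same query issued twice.
run_1 = ["d1", "d2", "d3", "d4"]
run_2 = ["d2", "d1", "d3", "d5"]
overlap = retrieval_overlap(run_1, run_2)
tau = kendall_tau(run_1, run_2)
```

An overlap of 1.0 together with a correlation of 1.0 would mean the two runs retrieved the same documents in the same order; anything lower signals a reproducibility gap worth investigating.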

From Friction to Flow: Why I'm Building Widgets for Reproducible Research

Tue, 24 Jun 2025 00:00:00 +0000

This summer, I’m building Jupyter Widgets to reduce friction in reproducible workflows on Chameleon. Along the way, I’m reflecting on what usability teaches us about the real meaning of reproducibility.

Supercomputing Competition: Reproducibility Reality Check

My first reproducibility experience threw me into the deep end—trying to recreate a tsunami simulation with a GitHub repository, a scientific paper, and a lot of assumptions. I was part of a student cluster competition at the Supercomputing Conference, where one of our challenges was to reproduce the results of a prior-year paper. I assumed “reproduce” meant something like “re-run the code and get the same numbers.” But what we actually had to do was rebuild the entire computing environment from scratch—on different hardware, with different software versions, and vague documentation. I remember thinking: If all these conditions are so different, what are we really trying to learn by conducting reproducibility experiments? That experience left me with more questions than answers, and those questions have stayed with me. In fact, they’ve become central to my PhD research.

Summer of Reproducibility: Lessons from 100+ Experiments on Chameleon

I’m currently a PhD student and research software engineer exploring questions around what computational reproducibility really means, and when and why it matters. I also participated in the Summer of Reproducibility 2024, where I helped assess over 100 public experiments on the Chameleon platform. Our analysis revealed key friction points—especially around usability—that don’t necessarily prevent reproducibility in the strictest sense, but introduce barriers in terms of time, effort, and clarity. These issues may not stop an expert from reproducing an experiment, but they can easily deter others from even trying. This summer’s project is about reducing that friction—some of which I experienced firsthand—by improving the interface between researchers and the infrastructure they rely on.

From Psychology Labs to Jupyter Notebooks: Usability is Central to Reproducibility

My thinking shifted further when I was working as a research software engineer at Purdue, supporting a psychology lab that relied on a complex statistical package. For most researchers in the lab, using the tool meant wrestling with cryptic scripts and opaque parameters. So I built a simple Jupyter-based interface to help them visualize input matrices, validate settings, and run analyses without writing code. The difference was immediate: suddenly, people could actually use the tool. It wasn’t just more convenient—it made the research process more transparent and repeatable. That experience was a turning point for me. I realized that usability isn’t a nice-to-have; it’s critical for reproducibility.

Since that first experience, I’ve leaned into building better interfaces for research workflows—especially using Jupyter Widgets. Over the past few years, I’ve developed and taught tutorials on how to turn scientific notebooks into interactive web apps, including at the SciPy conference in 2023 and 2024. These tutorials go beyond the basics: I focus on building real, multi-tab applications that reflect the complexity of actual research tools. Teaching others how to do this has deepened my own knowledge of the widget ecosystem and reinforced my belief that good interfaces can dramatically reduce the effort it takes to reproduce and reuse scientific code. That’s exactly the kind of usability work I’m continuing this summer—this time by improving the interface between researchers and the Chameleon platform itself.

Making Chameleon Even More Reproducible with Widgets

This summer, I’m returning to Chameleon with a more focused goal: reducing some of the friction I encountered during last year’s reproducibility project. One of Chameleon’s standout features is its Jupyter-based interface, which already goes a long way toward making reproducibility more achievable. My work builds on that strong foundation by improving and extending interactive widgets in the Python-chi library — making tasks like provisioning resources, managing leases, and tracking experiment progress on Chameleon even more intuitive. For example, instead of manually digging through IDs to find an existing lease, a widget could present your current leases in a dropdown or table, making it easier to pick up where you left off and avoid unintentionally reserving unnecessary resources. It’s a small feature, but smoothing out this kind of interaction can make the difference between someone giving up or trying again. That’s what this project is about.

Looking Ahead: Building for People, Not Just Platforms

I’m excited to spend the next few weeks digging into these questions—not just about what we can build, but how small improvements in usability can ripple outward to support more reproducible, maintainable, and accessible research. Reproducibility isn’t just about rerunning code; it’s about supporting the people who do the work. I’ll be sharing updates as the project progresses, and I’m looking forward to learning (and building) along the way. I’m incredibly grateful to once again take part in this paid experience, made possible by the 2025 Open Source Research Experience team and my mentors.

Applying MLOps to overcome reproducibility barriers in machine learning research

Sun, 22 Jun 2025 00:00:00 +0000

About the Project

Hello! I’m Ahmed, an undergraduate Computer Science student at the University of Khartoum. I’m working on making machine learning research more reproducible on open-access research facilities like the Chameleon testbed, under the project Applying MLOps to overcome reproducibility barriers in machine learning research, mentored by Prof. Fraida Fund and Mohamed Saeed. As part of this project, my proposal aims to build a template generator that generates repositories for reproducible model training on the Chameleon testbed.

Reproducibility

We argue that unless reproducing research becomes as vital and mainstream part of scientific exploration as reading papers is today, reproducibility will be hard to sustain in the long term because the incentives to make research results reproducible won’t outweigh the still considerable costs

— Three Pillars of Practical Reproducibility Paper

By reproducibility in science, we refer to the ability to obtain consistent results using the same methods and conditions as the previous study. In simple words, if I use the same data and methodology that were used before, I should obtain the same results. This principle applies to almost every scientific field, including both machine learning research in science and core machine learning.

Challenges in Reproducibility

Just as the famous paper about the reproducibility crisis in science was published in 2016, similar discussions have been published about machine learning research. The paper State of the Art: Reproducibility in Artificial Intelligence analyzed 400 papers from top AI conferences and found that around 6% shared code and approximately 33% shared test data. In contrast, 54% shared only pseudocode (a summary of the algorithm).

The lack of software dependency management, proper version control, log tracking, and effective artifact sharing has made it very difficult to reproduce research in machine learning.

Reproducibility in machine learning is largely supported by MLOps practices. This is the case in industry, where most researchers are backed by software engineers who are responsible for setting up experimental environments or developing tools that streamline the workflow. In academic settings, however, reproducibility remains a great challenge: researchers prefer to focus on coding and worry little about the complexities involved in configuring their experimental environment. As a result, the adoption and standardization of MLOps practices in academia progresses slowly. The best way to ensure a seamless experience with MLOps is to make these capabilities easily accessible within the researchers’ workflow, by developing a tool that streamlines provisioning resources, environment setup, model training, and artifact tracking, and so ensures reproducible results.

Proposed Solution

We want researchers to spin up ML research instances or bare-metal nodes on the Chameleon testbed while the technical complexity of configuring and stitching everything together stays abstracted away. Users simply answer a few questions about their project info, frameworks, tools, features, and integrations (if there are any), and get a fully generated, reproducible project. It contains a provisioning/infrastructure config layer for provisioning resources on the cloud, a Dockerfile to spin up services, and persistent storage for data. The ML tracking server logs the artifacts, metadata, environment configuration, system specification (GPU type), and Git status using MLflow, backed by PostgreSQL for storing metadata and an S3 MinIO bucket for storing artifacts. The ML code at its core runs in a containerized training environment backed by persistent storage for the datasets and the artifacts generated from the experiment, and all of these pieces are containerized to ensure reproducibility. We aim to make the cloud experience easier by handling the configuration needed to set up the environment with third-party frameworks, so that benchmarking datasets or any other necessary components from services like Hugging Face and GitHub are easily accessible from the container. For more technical details about the solution, you can read my proposal here.
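
To make the "answer a few questions, get a generated project" idea concrete, here is a minimal sketch using only the Python standard library. The template text, the question keys (project_name, framework), the placeholder image name, and the render_project function are all assumptions made for illustration; the actual generator and its templates may look quite different.

```python
from string import Template

# Hypothetical template for one generated file; a real project would contain
# several files (provisioning config, Dockerfile, tracking setup, and so on).
COMPOSE_TEMPLATE = Template(
    "# Generated for project: $project_name\n"
    "services:\n"
    "  tracking:\n"
    "    image: $tracking_image\n"
    "  training:\n"
    "    build: .\n"
    "    environment:\n"
    "      FRAMEWORK: $framework\n"
)

# Placeholder default, not a real published image.
DEFAULTS = {"tracking_image": "example/tracking-server:latest"}


def render_project(answers):
    """Turn the user's answers into a mapping of file name to file content."""
    values = {**DEFAULTS, **answers}
    # substitute() raises KeyError if a required answer is missing, so an
    # incomplete questionnaire fails loudly instead of producing a broken file.
    return {"docker-compose.yaml": COMPOSE_TEMPLATE.substitute(values)}


files = render_project({"project_name": "demo", "framework": "pytorch"})
```

Keeping every generated file as a template filled from one dictionary of answers means the same answers always produce the same project layout, which is the property the generator is after.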

By addressing these challenges, we can accelerate scientific discovery. This benefits not only those who are conducting the research but also the ones building on top of it in the future. I look forward to sharing more updates as the project progresses, and I welcome feedback from others interested in advancing reproducibility in ML research.

Building a Benchmarking Suite for Cache Performance Evaluation

Sat, 21 Jun 2025 00:00:00 +0000

Hi! I’m Haocheng Xia, a Computer Science student at the University of Illinois Urbana-Champaign, passionate about the intersection of machine learning and storage systems. Specifically, I’m keen on workload analysis and KV cache management for large language models.

This summer, I’m happy to be a part of SoR 2025 and OSRE 2025. I’m contributing to the CacheBench project. My initiative, ‘Building a Benchmarking Suite for Cache Performance Evaluation,’ will create a robust platform. This involves extensive simulation of existing eviction algorithms using libCacheSim, developing microbenchmarks, and building a user-friendly platform for researchers to effortlessly evaluate novel cache designs. The ultimate goal is to establish a competitive leaderboard.

My contributions will include a comprehensive dataset detailing simulated miss ratios and throughput of current cache eviction algorithms, an extension to libCacheSim for executing microbenchmarks both locally and on our online platform, and the creation and ongoing maintenance of a public web leaderboard. I’m grateful to be mentored by Juncheng Yang and Yazhuo Zhang.
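
For readers unfamiliar with miss-ratio simulation, the sketch below replays a request trace against a fixed-size cache and reports the fraction of requests that miss. It is plain Python written for this post and does not use libCacheSim's API; the function name, the two policies shown, and the toy trace are assumptions for illustration.

```python
from collections import OrderedDict


def simulate_miss_ratio(trace, capacity, policy="lru"):
    """Replay `trace` against a cache of `capacity` objects and return the miss ratio."""
    if policy not in ("lru", "fifo"):
        raise ValueError("policy must be 'lru' or 'fifo'")
    cache = OrderedDict()  # oldest entry first
    misses = 0
    for key in trace:
        if key in cache:
            if policy == "lru":
                cache.move_to_end(key)  # a hit refreshes recency under LRU only
            continue
        misses += 1
        cache[key] = True
        if len(cache) > capacity:
            cache.popitem(last=False)  # evict the entry at the front of the queue
    return misses / len(trace) if trace else 0.0


# Toy trace of object IDs, replayed with room for three objects.
trace = ["a", "b", "c", "a", "d", "a", "b"]
lru_ratio = simulate_miss_ratio(trace, capacity=3, policy="lru")
fifo_ratio = simulate_miss_ratio(trace, capacity=3, policy="fifo")
```

On this trace LRU misses 5 of 7 requests and FIFO misses 6 of 7, because LRU keeps the frequently reused object "a" while FIFO evicts it. A leaderboard would compare algorithms on exactly this kind of number, measured over much larger real traces.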

I’m thrilled to be part of building tools that empower users and advance the vision of a more decentralized web. Looking forward to a productive summer!

RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics

Thu, 19 Jun 2025 00:00:00 +0000

Hi everyone! My name is Zeyu, and I will be working on a project for a retrieval-enhanced generative framework for spatial transcriptomics during Google Summer of Code 2025. My project is called RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics and is supervised by Ziheng Duan. The goal is to develop a retrieval-enhanced generative framework for predicting spatial gene expression from histological images, making spatial transcriptomics more affordable and easier to implement. You can view my full proposal here!

Spatial transcriptomics enables the capture of gene expression profiles with spatial resolution, providing unprecedented insights into cellular organization and the tissue microenvironment. However, its widespread application is limited by high costs and technical complexity. In contrast, histological imaging is inexpensive and widely accessible. If we can accurately predict gene expression from histology images, then high-resolution spatial information can be inferred without costly experiments.

My project will:

Create a large-scale paired dataset combining HEST histology images with reference gene expression profiles from CellxGene.
Design a novel RAG-ST architecture that enables both interpretable and controllable generation of spatial gene expression.
Benchmark RAG-ST against current state-of-the-art models for image-based gene expression inference.
Open-source the full codebase and provide comprehensive tutorials to support future research and development.

I am excited to contribute to this project and help broaden access to spatial transcriptomics insights through machine learning–powered predictions!

Zeyu Zou

Northeastern University Graduate

Zeyu Zou is a graduate student at Northeastern University, where he is majoring in Analytics.

EnvGym – An AI System for Reproducible Custom Computing Environments

Mon, 16 Jun 2025 00:00:00 +0000

Hello, my name is Yiming Cheng. I am a Pre-doc researcher in Computer Science at the University of Chicago. I’m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community as a project leader. My project, EnvGym, focuses on developing an AI-driven system to automatically generate and configure reproducible computing environments based on natural language descriptions from artifact descriptions, Trovi artifacts, and research papers.

The complexity of environment setup often hinders reproducibility in scientific computing. My project aims to bridge the knowledge gap between experiment authors and reviewers by translating natural language requirements into actionable, reproducible configurations using AI and NLP techniques.

Project Overview

EnvGym addresses fundamental reproducibility barriers by:

Using AI to translate natural language environment requirements into actionable configurations
Automatically generating machine images deployable on bare metal and VM instances
Bridging the knowledge gap between experiment authors and reviewers
Standardizing environment creation across different hardware platforms

June 10 – June 16, 2025

Getting started with the project setup and initial development:

I began designing the NLP pipeline architecture to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into structured environment “recipes”
I set up the initial project repository and development environment
I met with my mentor Prof. Kexin Pei to discuss the project roadmap and technical approach
I started researching existing artifact descriptions from conferences and Trovi to understand common patterns in environment requirements
I began prototyping the backend environment builder logic that will convert parsed requirements into machine-image definitions
I explored Chameleon’s APIs for provisioning servers and automated configuration
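
To show what a structured environment "recipe" could look like for the example sentence above, here is a small rule-based sketch. It uses regular expressions as a stand-in for the NLP pipeline, which is still being designed; the parse_requirements function and the name/version recipe format are assumptions for illustration only.

```python
import re

# A requirement is a package-like name optionally followed by a dotted version.
REQUIREMENT = re.compile(
    r"(?P<name>[A-Za-z][A-Za-z0-9_\-]*)(?:\s+(?P<version>\d+(?:\.\d+)*))?"
)


def parse_requirements(text):
    """Parse a sentence like 'I need X 1.2, Y 3, and Z' into a list of recipe entries."""
    body = re.sub(r"^\s*I need\s+", "", text.strip(), flags=re.IGNORECASE)
    # Split on commas (optionally followed by "and") or on a bare "and".
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", body)
    recipe = []
    for part in parts:
        part = part.strip().rstrip(".")
        match = REQUIREMENT.fullmatch(part) if part else None
        if match:
            recipe.append(
                {"name": match["name"].lower(), "version": match["version"]}
            )
    return recipe


recipe = parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn")
```

For this input the sketch yields three entries: python pinned to 3.9, cuda pinned to 11, and scikit-learn with no version. Real artifact descriptions are far messier than this, which is why the project targets AI and NLP techniques rather than fixed patterns.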

Next Steps

Continue developing the NLP component for requirement parsing
Implement the core backend logic for environment generation
Begin integration with Chameleon Cloud APIs
Start building the user interface for environment specification

This is an exciting and challenging project that combines my interests in AI systems and reproducible research. I’m looking forward to building a system that will help researchers focus on their science rather than struggling with environment setup issues.

Thanks for reading, I will keep you updated as I make progress on EnvGym!

Smart Environments – An AI System for Reproducible Custom Computing Environments

Mon, 16 Jun 2025 00:00:00 +0000

Hi everyone, I’m Sam! I’m excited to be working with Argonne National Laboratory and SoR this summer on Smart Environments. Have you ever encountered a great open-source project and wanted to run it or use it locally, only to find that it’s such a headache to set up all the dependencies? Maybe your system version wasn’t correct, or a piece of software was outdated, or the dependencies were incompatible with something you already had on your machine?

In comes EnvGym to save the day! We want EnvGym to be an agent that helps reproduce open-source projects by automatically setting up the environmental dependencies required to get them running. That’s what I will be working on for the rest of the summer! To make EnvGym work, we will leverage LLM agents that read documentation, understand code structures, run commands to set up environments, and reflectively react to any errors and warnings.
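
The "run commands and react to errors" loop can be sketched as follows. This is only an outline under stated assumptions: propose_fix is a hypothetical callback standing in for the LLM agent, and the toy commands simply invoke the current Python interpreter so the example is self-contained. None of these names come from the EnvGym codebase.

```python
import subprocess
import sys


def run_step(command):
    """Run one setup command and return (succeeded, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def build_with_repair(command, propose_fix, max_attempts=3):
    """Retry a setup command, letting `propose_fix` rewrite it after each failure."""
    history = []
    for _ in range(max_attempts):
        succeeded, output = run_step(command)
        history.append((command, succeeded))
        if succeeded:
            return True, history
        # In a real agent, the error output would be sent to an LLM here.
        command = propose_fix(command, output)
    return False, history


# Toy example: the first command fails, and the "fix" replaces it with one that works.
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
working = [sys.executable, "-c", "print('ok')"]
succeeded, history = build_with_repair(failing, lambda command, output: working)
```

Capping the number of attempts keeps a confused agent from looping forever, and the recorded history gives a trace of what was tried.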

To build EnvGym, I have the following to-do’s in mind:

Building a dataset that includes repos to be reproduced
Establishing a baseline using current methods
Implementing the actual EnvGym algorithm
Testing EnvGym against baseline performance and iteratively improving it
Deploying EnvGym to real-world use cases and gathering feedback

Here is the repo that we are working on: https://github.com/EaminC/EnvGym/tree/main

More updates to come, thanks for reading!

Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments

Sun, 15 Jun 2025 00:00:00 +0000

Hello, my name is Zahra Temori. I am a rising senior in Computer Science at the University of Delaware. I’m excited to be working with the Summer of Reproducibility and the Chameleon Cloud community. My project, cc-snapshot, focuses on enhancing features that help researchers capture and share reproducible experimental environments within the Chameleon Cloud testbed.

Detailed information about my project and my plans for the summer can be found in my proposal.

June 10 – June 14, 2025

Getting started with the first milestone and beginning to explore the Chameleon Cloud and the project:

I began familiarizing myself with the Chameleon Cloud platform. I created an account and successfully accessed a project.
I learned how to launch an instance and create a lease for using computing resources.
I met with my mentor to discuss the project goals and outline the next steps.
I experimented with the environment and captured a snapshot to understand the process.

It has been less than a week, and I have already learned a lot, especially about Chameleon Cloud and how it differs from other clouds like AWS. I am excited to learn more and make progress.

Thanks for reading, I will keep you updated as I work :)

Building a Billion-Scale Vector Embeddings Dataset

Sun, 15 Jun 2025 00:00:00 +0000

Billion Vector Embeddings Dataset

As part of the Billion-Scale Embeddings Dataset project, my proposal under the mentorship of Jayjeet Chakraborty aims to create the first large-scale, real-world vector embeddings dataset—bridging the critical gap in Approximate Nearest Neighbor (ANN) benchmarks and Retrieval-Augmented Generation (RAG) systems.

Motivation

Existing ANN benchmarks often fall short—they’re either synthetic (like SIFT) or too small-scale (≤1M vectors). With the rapid evolution of LLM-based vector search systems (e.g., OpenAI’s 3072d text-embedding-3-large), there’s a growing need for:

High-dimensional (>1000d), large-scale (>100M) embeddings
Real-world distributions (Wikipedia-scale text)
Open, reproducible benchmarks for the community

Project Goals

Generate 1 billion embeddings from English Wikipedia using open-source models.
Create multiple dimensional variants: 1024d, 4096d, and 8192d.
Deduplicate, compress, and store embeddings with rich metadata (URL, timestamps, models).
Benchmark ANN performance on FAISS, HNSW, and Annoy.
Distribute the dataset via HuggingFace & AWS S3 with shard-level access.
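
One common way to benchmark an approximate index is recall@k against exact brute-force search. The sketch below shows that measurement in plain Python on toy two-dimensional vectors; the function names and data are assumptions for illustration, and a real benchmark would call FAISS, HNSW, or Annoy to produce the approximate results instead of the hand-written list used here.

```python
import math


def exact_top_k(query, vectors, k):
    """Return the indices of the k vectors closest to `query` by Euclidean distance."""
    distances = sorted((math.dist(query, vec), index) for index, vec in enumerate(vectors))
    return [index for _, index in distances[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate search returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data standing in for high-dimensional embeddings.
vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
query = (0.1, 0.0)
ground_truth = exact_top_k(query, vectors, k=2)

# Pretend an ANN index returned these neighbors for the same query.
approx_result = [0, 2]
recall = recall_at_k(approx_result, ground_truth)
```

Here the exact neighbors are vectors 0 and 1, so an approximate result of 0 and 2 scores a recall of 0.5. At billion scale the brute-force ground truth is the expensive part, which is one reason a shared dataset is valuable.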

Open Source Impact

ANN Libraries: Enable reproducible benchmarking for real-world workloads.
RAG Systems: Evaluate and optimize retrieval at scale using real Wikipedia text.
Researchers: Conduct large-scale studies on dimensionality, ANN accuracy, and compression trade-offs.

Develop a clean and intuitive web-based interface for WildberryEye

Sun, 15 Jun 2025 00:00:00 +0000

As part of the WildberryEye project, my proposal under the mentorship of Isaac Espinosa aims to develop a clean, intuitive, and responsive web-based interface to support real-time pollinator detection, data visualization, and system configuration.

WildberryEye leverages edge computing (Raspberry Pi 5) and object detection (YOLO) to monitor pollinators like bees and hummingbirds. This project focuses on developing a full-stack web interface to support real-time pollinator detection, data visualization, and system configuration. The development also includes real-time data extraction from the Raspberry Pi 5. The final result will empower researchers and contributors to engage with environmental data in an accessible and meaningful way.

Developing an Open Testbed for Edge Replication System Evaluation

Sun, 15 Jun 2025 00:00:00 +0000

Hi, I’m Panji. I’m currently contributing to the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges under the mentorship of Fadhil I. Kurnia. You can find more details on the project proposal here.

The primary challenge we’re addressing is the current difficulty in fairly comparing different edge replication systems. To fix this, we’re trying to build a testing platform with four key parts. We’re collecting real data about how people actually use edge services, creating a tool that can simulate realistic user traffic across many locations, building a system that mimics network delays between hundreds of edge servers, and packaging everything into an open-source toolkit.

This will let researchers test different coordination methods like EPaxos, Raft, and others using the same data and conditions. We hope this will help provide researchers with a more standardized way to evaluate their systems. We’re working with multiple programming languages and focusing on making complex edge computing scenarios accessible to everyone in the research community.

One of the most interesting aspects of this project is tackling the challenge of creating realistic simulations that accurately reflect the performance characteristics different coordination protocols would exhibit in actual edge deployments. The end goal is to provide the research community with a standardized, reproducible environment for edge replication.

Implement Web Extensions & System Settings Integration

Sun, 15 Jun 2025 00:00:00 +0000

Hi! I’m Hanzhong Liu, a Computer Science student at Fordham University with a minor in Business Administration. My interests lie in distributed systems, backend engineering, and decentralized tools—especially systems that prioritize user autonomy and privacy.

This summer, I’m contributing to the Peersky project as part of OSRE 2025 through Google Summer of Code. My project, “Implement Web Extensions & System Settings Integration,” will add full support for local browser extensions in Peersky, allowing users to customize their experience without relying on centralized extension stores.

Deliverables include an extension loader, drag-and-drop installation for .zip and Git-based extensions, manifest validation, sandboxing, and a unified peersky://settings page for managing everything from themes to privacy tools. Pre-installed extensions like uBlock Origin and DScan will be bundled by default.

You can read my full proposal here. My mentor for this project is Akhilesh Thite.

I’m excited to help build tools that empower users to take control of their browsing experience—and to contribute to the vision of a more decentralized web. Looking forward to the summer ahead!

Into the VR-Verse: My GSoC Adventure Begins!

Sun, 15 Jun 2025 00:00:00 +0000

Hello! I’m Kajal Jotwani, an undergraduate Computer Science student from India who is passionate about building creative, interactive technologies and contributing to open source. This summer, as part of Google Summer of Code 2025, I will be working on the Brahma / Allocentric WebXR Interfaces project under the mentorship of Samir Ghosh. You can read my complete proposal here.

This project focuses on creating a formalized framework for building collaborative and cross-platform WebXR-based experiences. As part of the first public release of Brahma, a lightweight open-source toolkit, our goal is to formalize the framework, create documentation, and implement example applications like multi-user games and scientific visualizations. This will help make Brahma extensible and accessible for a wider developer community.

I’m excited to be working on this project and will be documenting my journey, learnings, and progress here throughout the summer.

Introducing Scenic-RoboSuite Interface

Sun, 15 Jun 2025 00:00:00 +0000

Hey! I’m Sahil, working on integrating Scenic with RoboSuite for GSoC 2025. My project is mentored by Daniel Fremont and Eric Vin.

I’m connecting Scenic (a probabilistic programming language for scenarios) with RoboSuite (a robotics simulation framework). Basically, you write simple scenario descriptions and get complex 3D robot simulations automatically.

Currently, as I’m building things and learning how Scenic works, I have been able to get the basic skeleton for the simulator interface working. I’ve implemented the simulator class and built a world model that can translate Scenic objects into RoboSuite’s simulator (which is MuJoCo-based). The interface now handles precise object placement in the world pretty well.

One of the trickier parts was figuring out the translation logic between Scenic and RoboSuite. I managed to overcome this by building a system that automatically detects the shape of objects when moving between the two frameworks, which lays a foundation for more complex object mapping later on.

I’ve also built some basic example scenarios to run and test with. Currently working on more complex examples and testing Scenic’s features like probabilistic object placement, constraint satisfaction, and spatial relationships between objects.

In summary, the “Scenic to RoboSuite” part of the interface is pretty much done. For next week, I need to work on the “RoboSuite to Scenic” part - basically getting feedback and state information flowing back from the simulation. Achieving this will make a complete bridge and give us a working simulator interface, which is the first major milestone for the project.

Kolmogorov-Arnold-based Transformer for LLMs

Sun, 15 Jun 2025 00:00:00 +0000

Project: KALLM

Proposal: proposal

Mentors:

Sai Suman Lamba Karanam
Prof. Zahmeeth Sakkaff

I am modifying existing large language models to make them more efficient by replacing some of their layers with Kolmogorov-Arnold Network (KAN) modules. These KAN layers use compact univariate polynomial approximations, which can reduce parameter count and improve interpretability. The project explores how to integrate these layers into Transformers, and how far we can push this idea by combining or stacking KAN modules with different polynomial bases. The goal is to keep performance competitive while lowering computational costs.
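
To illustrate the basic idea of a KAN layer, the sketch below places a learnable univariate polynomial on every input-to-output edge and sums the edge outputs at each output node. It is a simplified, framework-free illustration: it uses plain monomial coefficients, whereas actual KAN implementations typically use splines or other polynomial bases, and the names and coefficient layout are assumptions made for this example.

```python
def polynomial(coeffs, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... using Horner's rule."""
    result = 0.0
    for coefficient in reversed(coeffs):
        result = result * x + coefficient
    return result


def kan_layer(inputs, edge_coeffs):
    """Apply one KAN-style layer.

    edge_coeffs[j][i] holds the polynomial coefficients of the univariate
    function on the edge from input i to output j. Each output is the sum of
    its incoming edge functions, with no separate weight matrix or activation.
    """
    return [
        sum(polynomial(edge_coeffs[j][i], x) for i, x in enumerate(inputs))
        for j in range(len(edge_coeffs))
    ]


# Two inputs, two outputs, with hand-picked coefficients.
edge_coeffs = [
    [[0.0, 1.0], [0.0, 0.0, 1.0]],  # output 0: x0 + x1**2
    [[1.0], [0.0, -1.0]],           # output 1: 1 - x1
]
outputs = kan_layer([1.0, 2.0], edge_coeffs)
```

With inputs 1.0 and 2.0, the layer returns 5.0 and -1.0. In a linear layer each edge carries a single scalar weight; here each edge carries a small coefficient vector, so the nonlinearity lives on the edges themselves.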

Beyond just speeding up training, I am exploring several other promising directions. One is testing whether transfer learning remains effective when replacing the linear layers of a pretrained LLM with KAN modules, or when swapping between different KAN configurations. I am also considering curriculum learning strategies that gradually increase KAN complexity during training. I have studied all major KAN implementations and early experiments with a custom Transformer architecture show encouraging results. However, I have found that most LLMs rely on functional-style activation definitions in PyTorch, which makes it difficult to build a universal wrapper. Because of this, KAN-based models will likely need to be integrated manually on a case-by-case basis.

Open Source Repository Browser

Sun, 15 Jun 2025 00:00:00 +0000

Hi! I’m Param Arora, a Computer Science student at Manipal Institute of Technology. My interests lie in backend engineering and AI.

This summer, I’m contributing to the ORB project as part of OSRE 2025 through Google Summer of Code.

My project, “UC Open Source Repository Browser [UC ORB]”, is a discovery platform that maps and categorizes open source projects across the UC system. It offers a comprehensive web interface with intuitive search, advanced filtering capabilities, responsive design, and integrated visualizations of project metrics.

You can read my full proposal here. My amazing mentor for this project is Juanita Gomez.

Looking forward to the summer ahead!

Scaling Sensor Networks for Environmental Research

Sun, 15 Jun 2025 00:00:00 +0000

Hi! I’m Devansh Kukreja, a researcher, indie developer, and Computer Science undergrad. I’m interested in distributed systems, orchestration services, and real-time data platforms. I enjoy working on systems that help different components connect and run smoothly at scale.

This summer, I’m contributing to the ENTS (Environmental NeTworked Sensor) platform with the University of California, Santa Cruz Open Source Program Office as part of Google Summer of Code 2025.

ENTS is an open-source web portal designed to collect, visualize, and analyze data from large-scale environmental sensor networks. It helps researchers and citizen scientists monitor sensors like soil moisture, temperature, current, and voltage supporting real-time environmental research in outdoor settings.

My work this summer focuses on improving the platform’s reliability and usability. I’ll be fixing visualization bugs, enhancing chart synchronization, making data point selection more intuitive, and improving error handling. Alongside that, I’m building a Logger Registration System that lets users easily add and configure their data loggers, with potential support for over-the-air provisioning via The Things Network (TTN) for LoRaWAN-based devices.

You can check out my full proposal here. I’m grateful to be mentored by Colleen Josephson, John Madden, and Alec Levy, who are guiding the project with incredible insight and support.

By the end of the summer, ENTS will be a more stable, user-friendly, and extensible platform—better equipped to support environmental research at scale. I’m super excited to learn, build, and contribute to something meaningful!

Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits

Sun, 15 Jun 2025 00:00:00 +0000

Hello! I’m Siva Sathyaseelan D N, a pre-final year B.Tech + M.Tech Engineering student at IIT BHU, Varanasi, India. With a deep-rooted passion for software development and scientific computing, I thrive at the intersection of code and real-world problem-solving. For two years, I’ve engaged in open-source work across scientific simulation, blockchain, and cloud-native technologies through hobby projects, hackathons, internships, and time as an LFX mentee. I will be working on Type Narrowing: Evaluate New Gradual Languages and Do Unsound Narrowings Lead to Exploits under the mentorship of Ben Greenman. My proposal can be viewed here!

Building a Simulator for Benchmarking Replicated Systems

Sat, 14 Jun 2025 00:00:00 +0000

Hi, I’m Michael. I’m currently contributing to the Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges under the mentorship of Fadhil Kurnia. You can find more details on the project proposal here.

What we are trying to achieve is to create a system to test and evaluate the performance of different consensus protocols and consistency models under the same application and workload. The consensus protocols and consistency models are both tested on various replicated black-box applications. Essentially, the testbed itself is able to deploy any arbitrary stateful application on multiple machines (nodes) as long as it is packaged in the form of a docker image. The consensus protocol is used to perform synchronization between the stateful part of the application (in most cases, the database). The goal is that by the end of this project, the testbed we are building has provided the functionality and abstraction to support the creation of new consensus protocols to run tests on.

One major challenge in implementing this is handling replication across the running Docker containers. Generally, the services that can be deployed in this system are of two types:

A Deterministic Application (an application that always returns the same output when given the same input, e.g., a simple CRUD app)
A Non-Deterministic Application (an application that may return different outputs when given the same input, e.g., an LLM that may return different responses to the same prompt)

These two application types require different implementations of the consensus protocol. For a deterministic application, every request always yields the same response (and the same changes to the application’s database), so the replication protocol can replicate the request itself to all nodes. For a non-deterministic application, the same request may produce a different response on each node, so the replication protocol instead synchronizes the database state directly.
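The difference between the two replication styles can be sketched in a few lines of Python. This is only a toy illustration, not the testbed's code: the `Replica` class, the two app functions, and the primary/backup layout are all hypothetical names chosen for the example, and a real consensus protocol would also have to order requests and handle failures.

```python
import random


class Replica:
    """A toy node holding the application's state (here, a plain dict)."""

    def __init__(self):
        self.state = {}


def deterministic_app(state, request):
    """Same request on the same state always produces the same change."""
    key, value = request
    state[key] = value
    return value


def nondeterministic_app(state, request, rng):
    """The stored result depends on a random source, much like an LLM reply."""
    key, _ = request
    state[key] = rng.random()
    return state[key]


def replicate_request(replicas, request):
    """Request replication: every replica executes the request itself."""
    for replica in replicas:
        deterministic_app(replica.state, request)


def replicate_state(replicas, request, rng):
    """State replication: one node executes, then ships its resulting state."""
    primary, *backups = replicas
    nondeterministic_app(primary.state, request, rng)
    for backup in backups:
        backup.state = dict(primary.state)


# Hypothetical three-node deployments of each application type.
nodes_a = [Replica() for _ in range(3)]
replicate_request(nodes_a, ("x", 1))

nodes_b = [Replica() for _ in range(3)]
replicate_state(nodes_b, ("x", None), random.Random(0))
```

In both cases all replicas end up with identical state, but only the first approach relies on re-executing the request everywhere.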

Kicking Off Intelligent Observability for Seam: My OSRE 2025 Journey

Sat, 14 Jun 2025 00:00:00 +0000

Hi! I’m Manish K Reddy (@kredd2506), a graduate student based in the United States, and I’m excited to join the OSRE 2025 cohort. This summer, I’ll be working with the San Diego Supercomputer Center (SDSC) and the National Research Platform (NRP) on a project that blends my interests in machine learning, cloud systems, and real-world impact.

The National Research Platform (NRP) has moved beyond its original vision as a “ScienceDMZ data freeway” and evolved into a distributed cloud supercomputer, empowering research and education across more than 50 institutions. SDSC, located at UC San Diego, is recognized internationally for driving innovation in data, supercomputing, and advanced cyberinfrastructure.

My project, “Intelligent Observability for Seam: A GenAI Approach” focuses on building an ML-powered service for NRP. The goal is to analyze monitoring data (starting with Prometheus metrics), automatically detect anomalies, and use generative AI (GenAI) for human-readable explanations and root-cause analysis. This will help researchers and operators solve problems faster and keep complex research systems running smoothly.
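To make the anomaly-detection step concrete, here is a minimal sketch of one very simple baseline: flagging samples that deviate strongly from a rolling window of recent values. The detection method for the project is not specified above, so the rolling z-score, the function name, the window size, the threshold, and the sample `cpu_usage` series are all assumptions for illustration only.

```python
from statistics import mean, pstdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value lies more than `threshold` standard
    deviations away from the mean of the preceding `window` samples."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical metric samples, e.g. scraped at a fixed interval.
cpu_usage = [0.50, 0.52, 0.49, 0.51, 0.50, 0.51, 0.95, 0.50, 0.49]
flagged = rolling_zscore_anomalies(cpu_usage)
print(flagged)
```

A flagged index like this is the kind of signal that could then be handed to a generative model for a human-readable explanation.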

I am especially grateful to my lead mentor Mohammad Firas Sada, who is personally guiding me throughout this project. I also want to thank Jeffrey Weekley and Derek Weitzel for their support and guidance.
You can read my initial proposal here (PDF).

GenAI-Driven Observability for NRP

Topics: Machine Learning, Observability, DevOps, High Performance Computing, LLMs, GenAI, Distributed Systems
Skills: Python, Prometheus, Docker, Kubernetes, FastAPI, PyTorch, Pandas, LLM APIs, scikit-learn, PostgreSQL
Difficulty: Medium
Size: 350 hours
Mentors: Mohammad Firas Sada, Jeffrey Weekley, Derek Weitzel

This summer, I’m looking forward to:

Delivering an open-source anomaly detection tool for NRP
Building GenAI features for better explanations and root-cause analysis
Learning from my mentors and contributing to a vibrant open science community

Thanks for reading, and I’m looking forward to sharing my journey and progress in the coming weeks!

LINQS: Autograder (LLM Detection)

Sat, 14 Jun 2025 00:00:00 +0000

LINQS: Autograder (GSoC ‘25)

As part of the LINQS: Autograder (LLM Detection) project, my proposal, under the mentorship of Eriq Augustine, Lucas Ellenberger, and Lise Getoor, aims to build a tool for detecting AI plagiarism in code.

Problem Statement

Academic institutions are facing new sets of challenges in maintaining academic integrity with the rise of Large Language Models and tools like ChatGPT and GitHub Copilot, and their easier accessibility to students. Students are increasingly using these tools for assistance with their coursework, especially in programming assignments.

While these tools are useful for purposes such as brainstorming, research, and drafting, their use in completing assignments often crosses ethical boundaries. The use of these tools by students makes it difficult to uphold fairness in grading and to ensure that students are truly learning.

AI-generated code often lacks unique identifiers, rendering traditional plagiarism detectors like MOSS ineffective in detecting AI-generated code. That’s why there is a need for better systems that can assess whether code was AI generated by spotting underlying patterns.

Project Overview:

This is the problem that I am working to address with my project ‘LLM Detection’.

I aim to build a system that helps academic institutions ensure fairness and integrity in students’ work. To accomplish this goal, I will be working on 2 tasks:

Building a tool which determines whether a given piece of code was written by AI or not.
Designing and implementing a mechanism to compute a confidence score that indicates the likelihood of AI involvement in the code.

This tool can discourage students from copying or completing entire assignments using AI tools, encouraging honest and independent work.

(Read my full GSoC proposal here: Proposal)

About me:

Hey there!

My name is Anvi Kohli, I am a senior majoring in Computer Science and AI from India. This summer I will be contributing to the Autograder project by the LINQS Lab, under the guidance of Eriq Augustine, Lucas Ellenberger, and Lise Getoor.

A problem-solver at heart, I love to brainstorm, solve, and optimize complex issues. One example is reaching the grand finals of the Smart India Hackathon, where our team placed third nationwide with our app, “PM Poshan”. This app was built to digitize the monitoring and functioning of the mid-day meal scheme in India. It gave me the opportunity to improve my versatility and exposed me to all stages of the product development cycle.

I have hands-on experience in a multitude of domains such as AI/Data Science, cloud, full-stack development, and DevOps. Within AI, I have worked in GenAI, Computer Vision, Deep Learning and Classical Machine Learning. Apart from this, I have a strong interest in entrepreneurship, travelling, and cooking.

MPI Appliance for HPC Research on Chameleon

Sat, 14 Jun 2025 00:00:00 +0000

Hi Everyone,

I’m Rohan Babbar from Delhi, India. This summer, I’m excited to be working with the Argonne National Laboratory and the Chameleon Cloud community. My project focuses on developing an MPI Appliance to support reproducible High-Performance Computing (HPC) research on the Chameleon testbed.

For more details about the project and the planned work for the summer, you can read my proposal here.

👥 Community Bonding Period

Although the project officially started on June 2, 2025, I made good use of the community bonding period beforehand.

I began by getting access to the Chameleon testbed, familiarizing myself with its features and tools.
I experimented with different configurations to understand the ecosystem.
My mentor, Ken Raffenetti, and I had regular check-ins to align our vision and finalize our milestones, many of which were laid out in my proposal.

🔧 June 2 – June 14, 2025

Our first milestone was to build a base image with MPI pre-installed. For this:

We decided to use Spack, a flexible package manager tailored for HPC environments.
The image includes multiple MPI implementations, allowing users to choose the one that best suits their needs and switch between them using simple Lua Module commands.

📌 That’s all for now! Stay tuned for more updates in the next blog.

Thanks for reading!

StatWrap: Cross-Project Searching and Classification using Local Indexing

Sat, 14 Jun 2025 00:00:00 +0000

Hello👋! I am Debangi Ghosh, currently pursuing a degree in Mathematics and Computing at IIT (BHU) Varanasi, India. This summer, I will be working on the StatWrap: Cross-Project Searching and Classification using Local Indexing project under the mentorship of Luke Rasmussen. You can view my project proposal for more details.

My project aims to address the challenges in project navigation and discoverability by integrating a robust full-text search capability within the user interface. Instead of relying on basic keyword-based search—where remembering exact terms can be difficult—we plan to implement a natural language-based full-text search. This approach involves two main stages: indexing, which functions like creating a searchable map of the content, and searching, which retrieves relevant information from that map. We will evaluate and compare available open-source libraries to choose and implement the most effective one. In addition, my project aims to enhance project organization by introducing a new classification system that clearly distinguishes between “Active” and “Past” projects in the user interface. This will improve clarity, reduce clutter, and provide a more streamlined experience as the number of projects grows.
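The two stages described above, indexing and searching, can be illustrated with a tiny inverted index. This is a simplified sketch only: it does plain term matching rather than natural-language search, it is not tied to whichever library the project ends up choosing, and the function names and sample project descriptions are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexing stage: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Searching stage: rank documents by how many query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken alphabetically by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical project descriptions.
projects = {
    "proj-a": "Survival analysis of clinical trial data",
    "proj-b": "Clinical notes text mining",
    "proj-c": "Image segmentation pipeline",
}
index = build_index(projects)
results = search(index, "clinical trial")
print(results)
```

The index is built once and reused for every query, which is what makes searching fast even as the number of projects grows.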

Stay tuned for updates on my progress in the coming weeks! 🚀

WildBerryEye: Mechanical Design & Weather-Resistant Enclosure

Sat, 14 Jun 2025 00:00:00 +0000

Hello! My name is Teodor Langan, an undergraduate student currently pursuing a Robotics Engineering degree at the University of California, Santa Cruz. This summer, I’ll be working on developing the hardware for the WildBerryEye project, mentored by Carlos Isaac Espinosa. Here is my project proposal!

My project focuses on tackling the hardware challenge for WildBerryEye, an open-source ecological monitoring platform built on Raspberry Pi. To support the real-time object detection provided by the system, it requires a robust, weather-resistant camera enclosure that can reliably protect its electronics in the field. To address this, I will be designing and prototyping a modular, 3D-printable camera case this summer. The case will protect electrical components from rain and dust while incorporating proper ventilation and heat dissipation features. Designed using FreeCAD, the entire model will be fully open-source, allowing for easy adoption and modification by the community. Over the summer, this work will include multiple rounds of field testing to evaluate and refine the design under realistic field conditions. Ultimately, my project aims to deliver a detailed open-source FreeCAD model, full assembly documentation, and a user guide.

I’m excited to see what we can learn throughout the development of my project!

Improving AI Data Pipelines in AIDRIN: A Privacy-Centric and Multimodal Expansion

Thu, 12 Jun 2025 00:00:00 +0000

⏱️ Reading time: 4–5 minutes

Hi 👋

I’m Harish Balaji, a Master’s student at NYU with a focus on Artificial Intelligence, Machine Learning, and Cybersecurity. I’m especially interested in building scalable systems that reflect responsible AI principles. For me, data quality isn’t just a technical detail. It’s a foundational aspect of building models that are reliable, fair, and reproducible in the real world.

This summer, I’m contributing to AIDRIN (AI Data Readiness Inspector) as part of Google Summer of Code 2025. I’m grateful to be working under the mentorship of Dr. Jean Luca Bez and Prof. Suren Byna from the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL).

AIDRIN is an open-source framework that helps researchers and practitioners evaluate whether a dataset is truly ready to be used in production-level AI workflows. From fairness to privacy, it provides a structured lens through which we can understand the strengths and gaps in our data.

Why this work matters

In machine learning, one principle always holds true:

“Garbage in, garbage out.”

Even the most advanced models can underperform or amplify harmful biases if trained on incomplete, imbalanced, or poorly understood data. This is where AIDRIN steps in. It provides practical tools to assess datasets across key dimensions like privacy, fairness, class balance, interpretability, and support for multiple modalities.

By making these characteristics measurable and transparent, AIDRIN empowers teams to make informed decisions early in the pipeline. It helps ensure that datasets are not only large or complex, but also trustworthy, representative, and purpose-fit.

My focus this summer

As part of my GSoC 2025 project, I’ll be focusing on extending AIDRIN’s evaluation capabilities. A big part of this involves strengthening its support for privacy metrics and designing tools that can handle non-tabular datasets, such as image-based data.

The goal is to expand AIDRIN’s reach without compromising on interpretability or ease of use. More technical insights and updates will follow in the next posts as the summer progresses.

What comes next

As the AI community continues to evolve, there’s a growing shift toward data-centric practices. I believe frameworks like AIDRIN are essential for helping us move beyond the question of “Does the model work?” toward a deeper and more meaningful one: “Was the data ready in the first place?”

Over the next few weeks, I’ll be working on development, testing, and integration. I’m excited to contribute to a tool that emphasizes transparency and reproducibility across the AI lifecycle, and to share lessons and ideas with others who care about responsible AI.

If you’re exploring similar challenges or working in the space of dataset evaluation and readiness, I’d love to connect and exchange thoughts. You can also read my full GSoC 2025 proposal below for more context around the project scope and vision:

👉 Read my GSoC 2025 proposal here

This is the first in a 3-part blog series documenting my GSoC journey with AIDRIN. Stay tuned for technical updates and behind-the-scenes insights as the summer unfolds!

Reproducibility of Interactive Notebooks in Distributed Environments

Thu, 12 Jun 2025 00:00:00 +0000

Hello! I am Raza, currently a Ph.D. student in Computer Science at DePaul University. This summer, I will be working on reproducibility of notebooks in distributed environments, mentored by Prof. Tanu Malik. Here is a summary of my project proposal.

Interactive notebooks are web-based systems which enable encapsulating code, data, and their outputs for sharing and reproducibility. They have gained wide popularity in scientific computing due to their ease of use and portability. However, reproducing notebooks in different target environments remains challenging because notebooks do not carry the computational environment in which they are executed. This becomes even more challenging in distributed cluster environments where a notebook must be prepared to run on multiple nodes. In this project, we plan to (i) extend FLINC, an open-source user-space tool for distributed environments, such that it can package notebook executions into notebook containers for execution and sharing across distributed environments, and (ii) integrate the extended FLINC with TaskVine, which provides the framework and orchestration to enable distributed notebook execution in high performance computing environments.

You can read my complete proposal here.

I am excited to work on this project and learn from the experience here!

Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL

Sun, 08 Jun 2025 00:00:00 +0000

Google Summer of Code ‘25: Optimizing and Benchmarking GPU Collective Communication of PyLops-MPI with NCCL

My project aims to introduce GPU-to-GPU collective communication calls using Nvidia’s NCCL to PyLops-MPI, an extension of the powerful PyLops library.

I’m incredibly grateful for this opportunity and excited to be mentored by two HPC experts, Yuxi Hong from Lawrence Berkeley National Laboratory and Matteo Ravasi from ShearWater GeoServices.

Here’s also the link to my original proposal

What are PyLops-MPI and NCCL?

PyLops is a Python library that provides a rich collection of linear operators to solve inverse problems. Its MPI extension, PyLops-MPI, takes this a step further by enabling these operations to run on large-scale, distributed computing systems like HPC using the Message-Passing Interface (MPI).

Where does NCCL fit in? The NVIDIA Collective Communication Library (NCCL) is a library of highly optimized routines for collective communication between GPUs. It offers the opportunity to close the performance gap in PyLops-MPI. As we now offload more and more computationally intensive tasks to GPUs, the communication between them can become a bottleneck. NCCL offers a powerful solution to this problem, enabling high-bandwidth, low-latency communication that can significantly boost performance.

Motivation and What was Missing

As a student with a background in geophysics (B.Sc) and now pursuing computer science (M.Sc), I’ve experienced firsthand the challenges of scaling scientific computing research from a personal desktop to a high-performance computing (HPC) cluster. It can be a significant hurdle. My project aims to ease this transition for PyLops-MPI users. PyLops-MPI is something I wish had existed while I was doing my undergraduate research!

Currently, PyLops-MPI is “CUDA-aware,” meaning it can offload computations to GPUs. However, the communication between those GPUs is still handled by the underlying MPI implementation, which isn’t always optimal. This project will address this gap by integrating NCCL to handle GPU-to-GPU communication directly. If the computation happens on the GPU, the data shouldn’t have to be copied from GPU to CPU, transferred with MPI, and then copied back to the GPU.

This will be especially impactful for memory-bound problems where high-bandwidth communication is critical. By the end of this project, we’ll have a clear, quantifiable understanding of the performance gains achieved.

My Best-Laid Plan

My approach is grounded in good software engineering practices to ensure that this new feature is both robust and genuinely useful. I was impressed by the code quality of the repository (it is an enjoyable read), and I am committed to keeping it that way.

First and foremost, the goal is to seamlessly integrate NCCL without breaking what already works. A significant part of my effort will be dedicated to rigorous testing. This means not only ensuring that all existing tests pass but also developing a new, comprehensive test suite to validate the correctness of the GPU-to-GPU communication across different hardware setups.

Once we’re confident that the integration is solid, the exciting part begins: benchmarking (or you may call it “Moment of Truth”)! The plan is to measure the performance of end-to-end iterative solvers. These solvers are a perfect test case because they involve a mix of intensive gradient computations on the GPU and frequent AllReduce calls to sync up processes. This will give us a clear picture of the speedup and efficiency gains from using NCCL.
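The communication pattern of such a solver (local gradient computation followed by an AllReduce) can be mimicked without any GPU or MPI at all. The sketch below simulates two ranks inside a single Python process; `allreduce_sum` is a stand-in written for this example, not a PyLops-MPI, MPI, or NCCL call, and the toy least-squares problem is an assumption chosen to keep the arithmetic small.

```python
def local_gradient(w, xs, ys):
    """Gradient of 0.5 * sum((w * x - y) ** 2) over this rank's data shard."""
    return sum((w * x - y) * x for x, y in zip(xs, ys))


def allreduce_sum(values):
    """Stand-in for AllReduce: every rank receives the global sum."""
    total = sum(values)
    return [total for _ in values]


# Data generated from y = 3x, split across two simulated ranks.
shards = [([1.0, 2.0], [3.0, 6.0]), ([3.0, 4.0], [9.0, 12.0])]
weights = [0.0 for _ in shards]  # each rank keeps its own copy of w
step = 0.01

for _ in range(200):
    # Compute phase: each rank works only on its local shard.
    grads = [local_gradient(w, xs, ys) for w, (xs, ys) in zip(weights, shards)]
    # Communication phase: one AllReduce per iteration keeps ranks in sync.
    reduced = allreduce_sum(grads)
    weights = [w - step * g for w, g in zip(weights, reduced)]

print(weights)
```

Because the reduction runs once per iteration, its latency is paid over and over, which is why the choice of communication backend can dominate the runtime of an iterative solver.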

Finally, to make sure this work benefits the entire community, I will create clear documentation and tutorials. The goal is to make it easy for any user to leverage this new GPU-accelerated communication in their own research and applications.

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Fri, 06 Jun 2025 10:15:56 -0700

Project Description

Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains.

Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios.

To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning. By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.

Project Objectives

Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on plain sequences of item IDs. To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:

Step 1: Data Preprocessing & Feature Creation: Develop a data processing pipeline to parse users’ sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling. Extract user behavior sequences, item metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing. The data source can be the Amazon open public data or the MovieLens dataset. The data point creation can follow SASRec (reference 1).
Step 2: Model Development: Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context. Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions and reviews, into the sequential recommendation process. The baseline model can be the SASRec PyTorch implementation.
Step 3: Evaluation: Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating accuracy, NDCG, and cold-start performance. Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality. Optimize model inference speed and efficiency for real-time recommendation scenarios.
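As a rough illustration of Step 1, the sketch below groups raw interaction rows into time-ordered per-user sequences and applies a leave-one-out split (last item for testing, second-to-last for validation), which is the kind of protocol commonly associated with SASRec-style evaluation. The row format, function names, and the minimum-length rule of three are assumptions made for this example, not the project's actual pipeline.

```python
from collections import defaultdict


def build_sequences(interactions):
    """Group (user, item, timestamp) rows into per-user item lists ordered by time."""
    by_user = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))
    return {user: [item for _, item in sorted(rows)] for user, rows in by_user.items()}


def leave_one_out_split(sequences):
    """Hold out each user's last item for testing and second-to-last for validation."""
    train, valid, test = {}, {}, {}
    for user, items in sequences.items():
        if len(items) < 3:
            # Too short to hold anything out: keep everything for training.
            train[user], valid[user], test[user] = items, [], []
        else:
            train[user] = items[:-2]
            valid[user] = [items[-2]]
            test[user] = [items[-1]]
    return train, valid, test


# Hypothetical interaction log: (user id, item id, timestamp).
rows = [
    ("u1", "i3", 30),
    ("u1", "i1", 10),
    ("u1", "i2", 20),
    ("u1", "i4", 40),
    ("u2", "i9", 5),
]
sequences = build_sequences(rows)
train, valid, test = leave_one_out_split(sequences)
print(train, valid, test)
```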

Project Deliverables

This project will deliver three components: software; model training, validation, and performance evaluation; and a demo. The software implementing the LLMSeqRec model described above will be hosted on GitHub as an open-access repository. The evaluation results and demo will be published alongside the GitHub repository.

LLMSeqRec

Topics: LLM Enhanced Contextual Sequential Recommender
Skills: Proficiency in Python, PyTorch, GitHub, Self-attention, Transformer
Difficulty: Difficult
Size: Large (350 hours)
Mentors: Linsey Pang, Bin Dong

References:

Self-Attentive Sequential Recommendation (SASRec)
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Amazon Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews
MovieLens Data: https://grouplens.org/datasets/movielens/

Introduction

I’m Connor, a student at NYU studying CS and Math. This summer I’ve gotten the opportunity to work on LLMSeqRec under Dr. Bin Dong and Dr. Linsey Pang.

In today’s digital age, sequential recommender systems power everything from e-commerce suggestions to personalized content everywhere. However, traditional models fall short in capturing user intent, adapting to dynamic behavior, or tackling cold-start problems. That’s where LLMSeqRec comes in.

Problem Statement

Most sequential recommender systems rely heavily on historical user-item interactions and predefined embeddings. This approach limits their ability to understand nuanced user preferences, struggles to scale across domains, and performs poorly in scenarios like new users or sparse data. The absence of semantic and contextual modeling is a major gap in current solutions.

Overview of project

LLMSeqRec is a novel, LLM-enhanced sequential recommender framework that bridges this gap. By leveraging large language models (LLMs), it incorporates semantic embeddings and prompt-based contextual modeling to understand both user behavior and item metadata at a deeper level. The system explores two core approaches:

Embedding-based: LLMs generate embeddings from item attributes.
Prompt-based: LLMs receive full transaction history in natural language format and infer recommendations.

These techniques are tested using well-known datasets (e.g., Amazon, MovieLens), and evaluated with ranking metrics like NDCG@10 and Hit@10. The goal: deliver more accurate, context-rich, and explainable recommendations.
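For readers unfamiliar with the two ranking metrics, here is a minimal sketch of how Hit@10 and NDCG@10 can be computed for a single user when exactly one held-out item counts as relevant. In that special case the ideal DCG is 1, so NDCG reduces to 1 / log2(rank + 1). The function names and the sample ranking are illustrative assumptions, not code from the project.

```python
import math


def hit_at_k(ranked_items, target, k=10):
    """1.0 if the held-out item appears in the top k recommendations, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k=10):
    """NDCG with a single relevant item: 1 / log2(rank + 1), or 0.0 if missed."""
    for position, item in enumerate(ranked_items[:k]):
        if item == target:
            # position is 0-based, so rank = position + 1.
            return 1.0 / math.log2(position + 2)
    return 0.0


# Hypothetical model output for one user, best guess first.
ranking = ["i7", "i2", "i5", "i9"]
print(hit_at_k(ranking, "i5"), ndcg_at_k(ranking, "i5"))
```

Averaging these per-user values over all test users gives the reported Hit@10 and NDCG@10. Hit@10 only asks whether the item was found, while NDCG@10 also rewards placing it near the top.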

Next Steps

The project is currently progressing through stages including model training, embedding integration, and evaluation. Upcoming tasks include:

Fine-tuning enhanced models
Designing zero-/few-shot prompts
Running comparative experiments
Publishing findings and writing technical blogs

This work is part of the LLMSeqRec project, under the mentorship of Dr. Bin Dong and Dr. Linsey Pang.

Final Blog: SS_Bench - Benchmarking SciStream

Fri, 31 Jan 2025 00:00:00 +0000

Introduction

Hello! My name is Acheme, and I’m thrilled to have collaborated with my mentors Joaquin Chung and Flavio Castro under the SciStream project. This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.

In the first half of the project, I focused on describing scientific streaming profiles based on use cases experienced at Argonne National Lab. The necessary Python scripts were developed to generate bursty and constant-rate streaming traffic profiles.

In the second half, I built upon this foundation by conducting experiments with the traffic profiles and measuring performance through metrics of latency, jitter and throughput. These experiments were conducted with different message sizes across LAN and WAN network topology.
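To give a feel for what constant-rate and bursty profiles look like, the sketch below generates message send times for each and computes one simple jitter measure (the mean absolute difference between consecutive latencies). It is a hedged illustration only: the function names, parameters, and the jitter definition are assumptions for this example and are not taken from the SciStream-bench scripts.

```python
def constant_rate_schedule(rate_hz, duration_s):
    """Send times (in seconds) for evenly spaced messages."""
    count = int(rate_hz * duration_s)
    return [i / rate_hz for i in range(count)]


def bursty_schedule(burst_size, burst_interval_s, intra_gap_s, bursts):
    """Send times for bursts of closely spaced messages separated by idle gaps."""
    times = []
    for burst in range(bursts):
        start = burst * burst_interval_s
        times.extend(start + i * intra_gap_s for i in range(burst_size))
    return times


def mean_jitter(latencies):
    """Mean absolute difference between consecutive per-message latencies."""
    diffs = [abs(later - earlier) for earlier, later in zip(latencies, latencies[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


steady = constant_rate_schedule(rate_hz=4, duration_s=1)
bursty = bursty_schedule(burst_size=3, burst_interval_s=1.0, intra_gap_s=0.25, bursts=2)
jitter = mean_jitter([10.0, 12.0, 11.0, 11.0])  # hypothetical latencies in ms
print(steady, bursty, jitter)
```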

Key Achievements

Streaming Traffic Profile:
- Developed scripts to generate streaming traffic profiles with configurable parameters.
Created an Artifact:
- I created an artifact using a Jupyter notebook to document an easy-to-follow integration of SciStream with the FABRIC testbed for future experimenters.

Conclusion and Future Work

The work demonstrated that SciStream offers tolerable overhead for secure data streaming, and that experimentation with this middlebox is possible on publicly available testbeds such as FABRIC. Future work would be a comparative analysis of SciStream’s performance with and without hardware acceleration or offloading.

Deliverables

SciStream on FABRIC Demo: A demo of how to integrate SciStream on the FABRIC testbed can be found here: SciStream on FABRIC.
Jupyter Notebook: An Artifact on FABRIC portal: FABRIC Artifact.

Final Report: Deriving Realistic Performance Benchmarks for Python Interpreters

Tue, 12 Nov 2024 00:00:00 +0000

Hi, I am Mrigank. As a Summer of Reproducibility 2024 fellow, I have been working on deriving realistic performance benchmarks for Python interpreters with Ben Greenman from the University of Utah. In particular, we want to benchmark Meta’s Static Python interpreter (which is a part of their Cinder project) and compare its performance with CPython on different levels of typing. In this post, I will share updates on my work since my last update. This post forms my final report for the Summer of Reproducibility 2024.

Since Last Time: Typing Django Files

Based on the profiling results from load testing a Wagtail blog site, I identified three modules in Django that were performance bottlenecks and added shallow types to them. These are available on our GitHub repository.

I also wrote a script to mix untyped, shallow-typed, and advanced-typed versions of a Python module and create a series of such gradually typed versions.
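The idea behind such a mixing script can be sketched as follows, under the assumption that a module is split into named units (for example, functions or classes) and that each unit exists in all three typing levels. Everything here is hypothetical: the unit names, the placeholder source strings, and the helper functions are for illustration and do not reproduce the actual script in the repository.

```python
from itertools import product

LEVELS = ("untyped", "shallow", "advanced")


def gradual_configurations(units):
    """Yield every assignment of a typing level to each named unit."""
    for choice in product(LEVELS, repeat=len(units)):
        yield dict(zip(units, choice))


def assemble(config, sources):
    """Join the chosen version of each unit into one module's source text."""
    return "\n\n".join(sources[unit][level] for unit, level in config.items())


# Placeholder source text for two hypothetical units in three typing levels.
sources = {
    unit: {level: f"# {unit}: {level} version" for level in LEVELS}
    for unit in ("add", "greet")
}

configs = list(gradual_configurations(list(sources)))
print(len(configs))
print(assemble(configs[1], sources))
```

With n units there are 3^n configurations, so larger modules would typically need sampling rather than exhaustive enumeration.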

Summary of Experience and Contributions

I tried to set up different versions of Zulip to make them work with Static Python. My setup scripts are available in our repository. Unfortunately, Zulip’s Zerver did not run with Static Python due to the incompatibility of some Django modules. A few non-Django modules also initially threw errors when run with Static Python due to a bug in Cinder, but I was able to work around it with a hack (which I have described in the linked GitHub issue I opened on Cinder’s repository).
I created a Locust version of the small Django-related benchmarks available in pyperformance and skybison. This helped me confirm that Django by itself is compatible with Static Python, and helped me get started with Locust. This too is available in our repository.
As described in the midterm report, I created a complete pipeline with Locust to simulate real-world load on a Wagtail blog site. The instructions and scripts for running these load tests as well as profiling the Django codebase are available (like everything else!) in our repository.
We added shallow types to the three Django modules mentioned above, and I created scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python module to create a series of gradually typed versions to be tested for performance. We found that advanced-typed code may often be structurally incompatible with shallow-typed code and are looking for a solution for this. We are tracking some examples of this in a GitHub issue.

Going Forward

I had a great time exploring Static Python, typing in Python, load testing, and all other aspects of this project. I was also fortunate to have a helpful mentor along with other amazing team members in the group. During this project, we hit several roadblocks like the challenges in setting up real-world applications with Static Python and the difficulty in adding advanced types – but are managing to work around them. I will be continuing to work on this project until we have a complete set of benchmarks and a comprehensive report on the performance of Static Python.

Our work will continue to be open-sourced and available on our GitHub repository for anyone interested in following along or contributing.

[Final Report] Automated Reproducibility Checklist support within StatWrap

Sat, 02 Nov 2024 00:00:00 +0000

Namaste🙏🏻! I’m Adi Akhilesh Singh, and I’m excited to share my final updates on the Reproducibility Checklists project by StatWrap, under the mentorship of Luke Rasmussen.

Project Overview

This project introduces customizable reproducibility checklists in StatWrap, enabling metadata-driven and user-guided generation of checklists. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.

Project Links

Explore the StatWrap project repository and my contributions during GSoC ‘24:

Progress And Achievements

During the timeline of this project, I worked on designing the interface for the checklist page and the data structure to support the project needs.

The interface was designed with user needs in mind, featuring components such as:

URLs component to manage external links or file URIs, attached to the project.
Images component to display project image files.
Checklist Notes component to manage user-added notes.

All these assets (Files, URLs, Images) can be attached to each checklist statement from the existing assets and external resources (URLs) present in the project.

Additionally, for each checklist item, StatWrap runs relevant scans to provide meaningful data based on its requirements. For example, for the item, “All the software dependencies for the project are documented,” StatWrap scans project files to list the languages and dependencies detected. For each checklist statement supported in StatWrap, we implement methods to retrieve specific information by scanning project data. StatWrap currently supports six such checklist statements identified as foundational for ensuring research reproducibility. Additionally, the checklist can be exported as a PDF summary, generated by StatWrap using the checklist data, with options to include notes.
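To make the scanning idea concrete, here is a minimal sketch of a language scan based on file extensions. It is written in Python purely for illustration and is not StatWrap's implementation; the extension table and the function name `detect_languages` are assumptions.

```python
from collections import Counter
from pathlib import PurePath

# Hypothetical extension-to-language table; StatWrap's real mapping may differ.
EXTENSION_LANGUAGES = {
    ".py": "Python",
    ".r": "R",
    ".sas": "SAS",
    ".do": "Stata",
}


def detect_languages(file_paths):
    """Count project files per detected language, ignoring unknown extensions."""
    counts = Counter()
    for path in file_paths:
        suffix = PurePath(path).suffix.lower()
        language = EXTENSION_LANGUAGES.get(suffix)
        if language is not None:
            counts[language] += 1
    return counts


# Made-up project file list.
files = ["analysis/model.R", "scripts/clean.py", "scripts/plot.py", "README.md"]
print(dict(detect_languages(files)))
```

A checklist item such as "All the software dependencies for the project are documented" could then show this summary next to the Yes/No answer.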

Future Prospects

As the project concludes, several areas for growth have emerged:

Expanding language support within StatWrap. While StatWrap already includes key languages used in research, there is always scope to extend compatibility to cover even more technologies.
Options to export a data-extensive report that includes the checklist items and their associated scan results.

These and other enhancements, like adding new checklist statements with their scanning methods, will extend StatWrap’s impact on reproducibility in research.

Earlier Blogs

If you’re interested in seeing the project’s evolution, check out my earlier posts:

Thank you for reading!

Writing a blog about your OSRE 2025 project

Mon, 21 Oct 2024 00:00:00 +0000

OSRE participants are required to blog three times during their summer program. The first blog is a chance to introduce yourself and your project. The second blog occurs around the midpoint of the project, and a final blog post is expected as part of your final project deliverable. The organization administrator will send emails with specific dates. Instructions for the blog are indicated below. All blogs should include links to proposals, presentations, links to any deliverables/products as well as an overview of the student’s experience. Check out the student pages from previous years to get an idea of content / size.

We will also ask students and contributors to provide regular status updates which will help track your activities. The organization administrator will provide more details once the program work begins.

Making a pull request for your blog

Fork the git repository
If you haven’t already done so, add your profile using these instructions
- IMPORTANT: Under user_groups: add - 2025 Contributors (as opposed to any of the two mentor groups)
- The short bio and any other information goes below the frontmatter
Post your blog
- Add /content/report/osre25/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md
- Add a frontmatter to index.md, using the labels below
- Blog text goes below the frontmatter
- In that same directory include a picture and call it featured.png (also supports .jpg, .jpeg)
Commit to your fork and make a pull request. Email OSRE Admins with questions.

Example frontmatter and text body

---
title: "YOUR TITLE"
subtitle: "YOUR SUBTITLE (OPTIONAL)"
summary:
authors:
 - USERNAME1
 - USERNAME2
tags: ["osre25"]
categories: []
date: YYYY-MM-DD
lastmod: YYYY-MM-DD
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
 caption: ""
 focal_point: ""
 preview_only: false
---

As part of the [PROJECTNAME](/project/osre25/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...

Writing a blog about your OSRE 2026 project

Mon, 21 Oct 2024 00:00:00 +0000

Making a pull request for your blog

Fork the git repository
If you haven’t already done so, add your profile using these instructions
- IMPORTANT: Under user_groups: add - 2026 Contributors (as opposed to any of the two mentor groups)
- The short bio and any other information goes below the frontmatter
Post your blog
- Add /content/report/osre26/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md
- Add a frontmatter to index.md, using the labels below
- Blog text goes below the frontmatter
- In that same directory include a picture and call it featured.png (also supports .jpg, .jpeg)
Commit to your fork and make a pull request. Email OSRE Admins with questions.

Example frontmatter and text body

---
title: "YOUR TITLE"
subtitle: "YOUR SUBTITLE (OPTIONAL)"
summary:
authors:
 - USERNAME1
 - USERNAME2
tags: ["osre26"]
categories: []
date: YYYY-MM-DD
lastmod: YYYY-MM-DD
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
 caption: ""
 focal_point: ""
 preview_only: false
---

As part of the [PROJECTNAME](/project/osre26/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...

ML-Powered Problem Detection in Chameleon

Fri, 18 Oct 2024 00:00:00 +0000

Hello! My name is Syed Mohammad Qasim, a PhD candidate at the Department of Electrical and Computer Engineering, Boston University. This summer I worked on the project ML-Powered Problem Detection in Chameleon as part of the Summer of Reproducibility (SoR) program with the mentorship of Ayse Coskun and Michael Sherman.

Chameleon is an open testbed that has supported over 5,000 users working on more than 500 projects. It provides access to over 538 bare metal nodes across various sites, offering approximately 15,000 CPU cores and 5 petabytes of storage. Each site runs independent OpenStack services to deliver its offerings. Currently, Chameleon Cloud comprehensively monitors the sites at the Texas Advanced Computing Center (TACC) and the University of Chicago. Metrics are collected using Prometheus at each site and fed into a central Mimir cluster. All logs are sent to a central Loki, with Grafana used for visualization and alerting. Chameleon currently collects around 3,000 metrics. Manually reviewing and setting alerts for them is time-consuming and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service to augment the existing alerting framework.

Over the summer, we focused on analyzing the data and, after discussions with Chameleon operators, identified 33 key metrics from the Prometheus Node Exporter that serve as leading indicators of resource usage on the nodes. For example:

CPU usage: Metrics like node_load1, node_load5, and node_load15.
Memory usage: Including buffer utilization.
Disk usage: Metrics for I/O time, and read/write byte rates.
Network activity: Rate of bytes received and transmitted.
Filesystem metrics: Such as inode_utilization_ratio and node_procs_blocked.
System-level metrics: Including node forks, context switches, and interrupts.

Collected every 5 minutes, these metrics provide a comprehensive view of node performance and resource consumption. After finalizing the metrics we wanted to monitor, we selected the following four anomaly detection methods, primarily due to their popularity in academia and recent publication at high-impact conferences such as SIG-KDD and SC.

OmniAnomaly, [KDD 2019] [without POT selection as it requires labels.]
USAD, [KDD 2020]
TranAD, [KDD 2022]
Prodigy, [SC 2023] [Only the VAE, not using their feature selection as it requires labels.]

We collected 75 days of healthy data from Chameleon, and after applying min-max scaling, we trained the models. We then used these models to run inference on the metrics collected during outages, as marked by Chameleon operators. The goal was to determine whether the outage data revealed something interesting or anomalous. We can verify our approach by manually reviewing the results generated by these four anomaly detection methods. Below are the results from the four methods on different outages, followed by an example of how these methods identified the root cause of an anomaly.
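The scaling and evaluation steps can be sketched as follows. This is a simplified illustration, not the project's code: the min-max parameters are fitted on healthy data only and reused on outage data, and a simple out-of-range score stands in for the four trained models. The metric values, the threshold, and all function names are made up.

```python
def fit_min_max(rows):
    """Learn per-metric (min, max) from healthy samples only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale(row, params):
    """Apply min-max scaling using the healthy-data parameters."""
    scaled = []
    for value, (lo, hi) in zip(row, params):
        span = hi - lo
        scaled.append((value - lo) / span if span else 0.0)
    return scaled


def anomaly_score(scaled_row):
    """Stand-in score: how far any metric falls outside the healthy [0, 1] range."""
    return max(max(0.0, v - 1.0, -v) for v in scaled_row)


def percent_flagged(rows, params, threshold):
    """Percentage of samples whose score exceeds the threshold."""
    flagged = sum(anomaly_score(scale(r, params)) > threshold for r in rows)
    return 100.0 * flagged / len(rows)


# Invented samples: [load average, network bytes per second].
healthy = [[0.5, 100.0], [1.0, 200.0], [1.5, 300.0]]
outage = [[1.0, 150.0], [4.0, 250.0], [1.2, 900.0], [0.6, 120.0]]

params = fit_min_max(healthy)
print(percent_flagged(outage, params, threshold=0.1))
```

Fitting the scaler on healthy data alone means outage values can land outside [0, 1]. A learned model would instead flag samples whose reconstruction error is unusually high.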

The above figure shows the percentage of outage data that was flagged as anomalous by different models.

The two plots above show two examples of the top 5 metrics that contributed to the anomaly score for each anomaly detection model.

Although the methods seem to indicate anomalies during outages, they are not able to pinpoint the affected service or the exact cause. For example, the first partial authentication outage was due to a DNS error, which can manifest in various ways, such as reduced CPU, memory, or network usage. This work is still in progress, and we are conducting the same analysis on container-level metrics for each service, allowing us to narrow the scope to the affected service and more effectively identify the root cause of anomalies. We will share the next set of results soon.

Thanks for your time, please feel free to reach out to me for any details or questions.

Data Leakage in Applied ML: model uses features that are not legitimate

Tue, 24 Sep 2024 00:00:00 +0000

Hello everyone!

I have been working on reproducing the results from Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches. This study aimed to distinguish COVID-19 cases from normal and pneumonia cases using chest X-ray images. Since my last blog post, we have successfully reproduced the results using the VGG19 model, achieving a 92% accuracy on the test set. However, a significant demographic inconsistency exists: normal and pneumonia chest X-ray images were from pediatric patients, while COVID-19 chest X-ray images were from adults. This allowed the model to achieve high accuracy by learning features that were not clinically relevant.

In Reproducing “Identification of COVID-19 samples from chest X-Ray images using deep learning: A comparison of transfer learning approaches” without Data Leakage, we followed the methodology outlined in the paper, but with a key change: we used datasets containing adult chest X-ray images. This time, the model achieved an accuracy of 51%, a drop of 41 percentage points from the earlier results, confirming that the metrics reported in the paper were overly optimistic due to data leakage, where the model learned illegitimate features.

To further illustrate this issue, we created a toy example demonstrating how a model can learn illegitimate features. Using a small dataset of wolf and husky images, the model achieved an accuracy of 90%. We then revealed that this performance was due to a data leakage issue: all wolf images had snowy backgrounds, while husky images had grassy backgrounds. When we trained the model on a dataset where both wolf and husky images had white backgrounds, the accuracy dropped to 70%. This shows that the accuracy obtained earlier was an overly optimistic measure due to data leakage.
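The mechanism can be shown with a purely synthetic sketch that has no connection to the image models or the accuracy figures above. It assumes a single "background brightness" feature and a one-threshold classifier, and all data values are invented.

```python
def is_correct(brightness, label, threshold):
    """The classifier predicts 'wolf' whenever the background is bright enough."""
    return (brightness >= threshold) == (label == "wolf")


def fit_threshold(samples):
    """Pick the brightness threshold that classifies the most training samples."""
    best_threshold, best_correct = None, -1
    for candidate, _ in samples:
        correct = sum(is_correct(b, label, candidate) for b, label in samples)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold


def accuracy(threshold, samples):
    return sum(is_correct(b, label, threshold) for b, label in samples) / len(samples)


# Leaky data: wolves on bright (snowy) backgrounds, huskies on dark (grassy) ones.
leaky = [(0.90, "wolf"), (0.95, "wolf"), (0.85, "wolf"),
         (0.30, "husky"), (0.20, "husky"), (0.35, "husky")]
# Same classes, but every image has a bright background.
same_background = [(0.90, "wolf"), (0.92, "wolf"), (0.88, "husky"), (0.91, "husky")]

threshold = fit_threshold(leaky)
print(accuracy(threshold, leaky), accuracy(threshold, same_background))
```

The classifier looks perfect on the leaky data because the background predicts the label. Once the background carries no information, it does no better than guessing.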

You can explore our work on the COVID-19 paper here.

Lastly, I would like to thank Fraida Fund and Mohamed Saeed for their support and guidance throughout my SoR journey.

Towards Scalable Performance Benchmarking of Genomics Workflows

Thu, 19 Sep 2024 00:00:00 +0000

Project Background

Optimizing genomics workflow execution on a large-scale, heterogeneous cluster requires an in-depth understanding of the resource requirements and utilization patterns of each application in the workflow. Such information can be obtained by using a benchmarking tool. However, the performance data generated by such a tool should represent the scale of its target system, lest the design decisions made from it be misguided. My project aims to build GenScale, the first benchmarking tool that can rapidly generate genomics workload performance data at a scale representative of production systems.

As Summer of Reproducibility (SoR) 2024 comes to an end, I took the time to reflect on my time working on GenScale, the challenges I faced, and the future work & impact I hope GenScale will create for our community.

Milestones & Challenges

The time I spent working on GenScale during SoR can be classified into three phases:

1. Per-Application Container & Input Creation.

Containerization is the current de facto standard for genomics workflow execution, so I designed GenScale to execute applications as containers. This requires me to package each application included in the benchmark as a container. I use state-of-the-art DNA-Seq & RNA-Seq alignment workflows as references for the list of applications & workflow structure. The container images & source files I created are publicly available on GitHub (Deliverables #1).

I also prepare sample inputs for each application to ease the burden of users who do not have sufficient familiarity with genomics applications. The effort is not trivial, because in a workflow, the inputs for a certain step depend on the outputs of previous step(s). Simply speaking, to prepare inputs for the last application in a workflow, we need to get the outputs of applications executed before it, which also requires the outputs of another set of applications, and so on until we arrive at the beginning of workflow. This translates into significant manual labor of carefully tracing & collecting intermediate files from each step of the reference workflows.
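The backward tracing described above amounts to walking a dependency graph. The sketch below is illustrative only: it assumes the workflow is acyclic, and the four-step chain is a simplified, hypothetical example rather than GenScale's actual workflow definition.

```python
def upstream_steps(workflow, target):
    """Return the steps that must run before `target`, in a valid execution order.

    `workflow` maps each step to the list of steps whose outputs it consumes.
    """
    order, seen = [], set()

    def visit(step):
        for dependency in workflow.get(step, []):
            if dependency not in seen:
                seen.add(dependency)
                visit(dependency)         # first collect what the dependency needs
                order.append(dependency)  # then the dependency itself

    visit(target)
    return order


# Hypothetical, simplified DNA-Seq style chain.
workflow = {
    "Trimmomatic": [],
    "BWA": ["Trimmomatic"],
    "Picard": ["BWA"],
    "GATK": ["Picard"],
}

print(upstream_steps(workflow, "GATK"))
```

Preparing a sample input for the last step means running every step in the returned list and keeping the intermediate files each one produces.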

All inputs are hosted in a public Google Drive and ChameleonCloud object store (Deliverables #2). In total, I prepared containers and inputs for 7 popular genomics applications: BWA, FastQC, Fastq Cleaner, GATK, Picard, STAR, and Trimmomatic.


Figure 1. Production-grade software used in GenScale: Kubernetes for task orchestration, and Prometheus + Grafana for real-time resource monitoring.

2. Components Development.

In this phase, GenScale’s main components were developed. GenScale consists of three components: (a) Workflow Manager, (b) Task Orchestrator, and (c) Resource Monitor. The Workflow Manager is built from scratch to allow a high degree of freedom when scheduling workflows. I use industry-grade solutions for the other components, namely Kubernetes for orchestrating tasks / containers, and Prometheus + Grafana for real-time resource monitoring. My deliverables include semi-automatic installation scripts & easy-to-follow instructions to set up all three components. (Deliverables #3)

3. Performance Data Generation.

The last phase is to use the GenScale prototype to generate performance data for each application. I focused on collecting data for three types of resources: compute (CPU utilization), memory (resident set size), and I/O (read & write operations over time). GenScale exports this information into a single CSV file to facilitate easy analysis. My deliverables include performance data for DNA-Seq and RNA-Seq workflows. I also provide a sample Python Notebook which analyzes the CPU utilization pattern of each application in the DNA-Seq workflow. (Deliverables #4)
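A per-application CPU summary over such a CSV might look like the sketch below. The column names (`application`, `elapsed_s`, `cpu_percent`) and the sample rows are assumptions made for illustration; the actual GenScale export format may differ.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in a hypothetical layout; cpu_percent is (num. cores) x 100%.
SAMPLE_CSV = """application,elapsed_s,cpu_percent
bwa,0,380.0
bwa,5,400.0
fastqc,0,95.0
fastqc,5,105.0
"""


def cpu_summary(csv_text):
    """Return the mean and peak CPU utilization for each application."""
    samples = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        samples[row["application"]].append(float(row["cpu_percent"]))
    return {
        app: {"mean": sum(values) / len(values), "peak": max(values)}
        for app, values in samples.items()
    }


print(cpu_summary(SAMPLE_CSV))
```

The same grouping works for memory and I/O columns if the export includes them.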


Figure 2. CPU utilization pattern of 9 applications in DNA-Seq Alignment workflow collected by GenScale. y-axis: (num. cores) x 100%, x-axis: time elapsed in seconds.

Deliverables

This project’s deliverables can be found in the following GitHub repo: https://github.com/martinluttap/sor24-genscale/tree/main. In summary, the deliverables include:

Container Images
Input Dataset
Source Code
Performance Data & Sample Analysis Notebook

Future Works, Broader Impacts

Understanding workload characteristics is a crucial step for designing efficient scheduling policies & resource management techniques. GenScale and the performance data it can generate might be a starting point for such an effort. Furthermore, I hope GenScale will catalyze meaningful engagement between the computer systems community and the bioinformatics community. I believe state-of-the-art systems techniques can greatly aid the computing efforts of the bioinformatics community. Similarly, domain-specific knowledge & problems within bioinformatics provide unique grounds for the systems community to further advance their field.

[Final] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Wed, 18 Sep 2024 00:00:00 +0000

Hello everyone,

My SoR 2024 project, ScaleRep, was carried out under the mentorship of Bogdan "Bo" Stoica and Yang Wang. I’m excited to share the final progress and insights we’ve gathered. The project aimed to tackle the reproducibility challenges posed by scalability bugs in large-scale distributed systems, and below is a detailed summary of our investigations and findings.

Project Overview

As you may recall, our project, ScaleRep, aimed to tackle the challenge of scalability bugs—those insidious issues that often arise in large-scale distributed systems under heavy workloads. These bugs, when triggered, can lead to significant system issues such as downtime, performance bottlenecks, and even data loss. They are particularly difficult to catch using traditional testing methods.

Our primary focus was on reproducing these bugs, documenting the challenges involved, and providing insights into how these bugs manifest under various conditions. This documentation will help researchers identify, benchmark, and resolve similar issues in the future.

Progress

Since the midterm update, several Apache Ignite bugs have been investigated, some of which have been successfully reproduced and uploaded to Trovi for the research community to access and reuse. Below is the progress on the bugs investigated:

Bugs Investigated

Key Insights & Challenges

Complexity of Scalability Bugs: Many scalability bugs involve subtle and complex interactions that are not easily detected in standard testing environments. For instance, IGNITE-20602 only manifested under certain high-load conditions and required a specific workload and environment to reliably trigger the issue. This highlights the importance of large-scale testing when investigating scalability issues.
Dependency and Documentation Gaps: We encountered significant challenges with outdated dependencies and incomplete documentation, particularly in older bugs like IGNITE-16072. In these cases, reproducing the bug required extensive modifications or wasn’t feasible without investing disproportionate effort in updating dependencies.
Effectiveness of Trovi and Chameleon: Packaging and sharing our reproducible investigations through Trovi and Chameleon have proven highly effective. By providing researchers with pre-configured environments and detailed documentation, we’ve laid the groundwork for future collaboration and further research on these bugs. We expect this to greatly benefit others attempting to reproduce similar issues.
Impact of Speed-Based Throttling: Our investigation into IGNITE-16600 revealed several important insights into speed-based throttling and its impact on system performance under high-load conditions. By analyzing the checkpoint starvation and thread throttling mechanisms, we were able to identify areas for improvement in the latest Ignite releases.

Next Steps

Expanding Collaboration: The packaged bugs and replayable Trovi experiments will be made available to the broader research community, encouraging further investigation and enhancements to large-scale distributed systems.

The ScaleRep project has been an exciting journey into the world of scalability bugs, pushing the boundaries of what’s possible in terms of reproducibility and benchmarking. Through this project, we’ve demonstrated the importance of rigorous testing and comprehensive documentation in improving the reliability of distributed systems.

Final Blog: Enhancing User Experience Reproducibility through TROVI Redesign

Wed, 18 Sep 2024 00:00:00 +0000

Hello! My name is Alicia Esquivel Morel, and I’m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I worked on a project to improve user experience reproducibility through a redesign of TROVI, as part of the Summer of Reproducibility (SoR) program.

Even before starting this project, as a rising researcher, I saw reproducibility as one of the biggest challenges in research: being able to consistently replicate experiments and share them in a way that others can follow.

TROVI is a platform designed to help with this. However, as I joined the project, I knew it had room for improvement, not only in the user interface, but also in the ease of integrating code and data.

This project aimed to address these challenges by redesigning TROVI to streamline experiment replication, making the platform more intuitive and accessible. The goal was simple: create a user-friendly experience that eliminates confusion and frustration, allowing researchers to focus on their work instead of the technical aspects of running a research experiment.

Our goals at the beginning of the summer:

We wanted to simplify TROVI’s interface for intuitive navigation, inspired by platforms like Google Colab.
We wanted to make uploading and sharing code and data easier, with seamless integration with tools like GitHub.
We wanted to create a mechanism for users to provide feedback, allowing TROVI to evolve based on real user needs.

Progress and what we have achieved

I started by conducting thorough UX research and a literature review on reproducibility platforms, establishing a solid foundation for the redesign. With user feedback guiding the process, I created wireframes and low-fidelity prototypes, focusing on making the platform more intuitive.

As the project progressed, I built a higher-fidelity prototype that connected various components of the platform, ensuring a seamless user journey. I then tackled the back-end integration, which tied together the front-end flows with TROVI’s API.

Throughout this project, I received valuable support and guidance from my mentors. Mark Powers walked me through TROVI’s architecture and helped me understand exactly what was needed for a successful redesign. Thanks to his mentorship, I not only completed the project but learned a great deal along the way. Thanks Mark Powers!!

Through iterations and feedback from initial user testing, and with the help of Kate Keahey, I refined the design to ensure it met the needs of the research community. By the end of the program, TROVI had evolved into a cohesive, user-friendly platform that leads to enhanced experiment reproducibility.

Accomplishments

A simplified interface that makes navigating, uploading, and collaborating much easier.
GitHub integration that streamlines the process of sharing code and data with collaborators.
A built-in feedback loop that enables TROVI to grow with its users, adapting to their needs as they arise.

The platform is also getting ready to move into production and will soon be available for the research community.

What’s Next?

While the core objectives have been successfully met, future improvements could further enhance the platform’s capabilities, such as additional integrations and more advanced collaboration features. User testing will continue to provide insights for ongoing development.

I’m grateful for this opportunity! Thank you for following along!

[MidTerm] StatWrap: Automated Reproducibility Checklists Generation

Mon, 16 Sep 2024 00:00:00 +0000

Namaste🙏🏻! I’m Adi Akhilesh Singh, and I’m excited to share progress updates on the Reproducibility Checklists project by StatWrap, under the mentorship of Luke Rasmussen.

Project Overview

The project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.

Progress

Over the past few months, my mentors and I have worked on developing the interface for the checklists page and designed key components to support our project goals. We’ve implemented logic that iterates over each checklist item, displaying its statement along with Boolean controls (Yes/No buttons) for user interaction.

We’ve also developed components to display attached images and URLs linked to each checklist item. Additionally, we’ve integrated a notes feature that allows users to add, edit, and view project-related notes. Currently, we are writing methods to integrate real-time project data into the checklists. For example, one method we’ve implemented scans project files (assets) to detect the languages used.

What’s Next?

As we move closer to the final evaluation phase, our focus will be on the following objectives:

Implement methods for each checklist item, integrating real-time project data to auto-populate checklist answers.
Enhance the Attached Images component to allow users to select and attach existing image assets from the project.
Display the results of the scans for each checklist item, providing users with detailed outputs based on the automated analysis.

Stay tuned for further updates as we continue developing this feature set! 🚀

Final Post: Enhancing Reproducibility and Portability in Network Experiments

Thu, 05 Sep 2024 00:00:00 +0000

Introduction

As my project with the Summer of Reproducibility (SoR) 2024 comes to a close, I’d like to reflect on the journey and the outcomes achieved. My project focused on enhancing the reproducibility and portability of network experiments by integrating the RO-Crate standard into the TUM-internal testbed pos (plain orchestrating service), and deploying this testbed on the Chameleon cloud infrastructure. The aim was to ensure that experiments conducted on one platform could be seamlessly reproduced on another, adhering to the FAIR principles (Findable, Accessible, Interoperable, Reusable) for research data.

Project Recap

The core goal was to make the experiments reproducible and portable between different testbeds like TUM’s pos and Chameleon. To achieve this, I integrated the RO-Crate standard, which ensures that all experiment data is automatically documented and stored with metadata, making it easier for others and especially for machines to understand, replicate, and build on the results. Additionally, deploying a lightweight version of pos on the Chameleon testbed enabled cross-testbed execution, allowing experiments to be replicated across both environments without significant modifications.

Key Achievements

Over the course of the project, several key milestones were achieved:

RO-Crate Integration: The first step was restructuring the results folder and automating the generation of metadata using RO-Crate. This ensured that all experiment data was comprehensively documented with details like author information, hardware configurations, and experiment scripts, resulting in comprehensive ro-crate-metadata.json files as an important part of each result folder.
Improved Data Management: The integration of RO-Crate greatly simplified the process of organizing and retrieving experiment data and metadata with information about the experiment and the result files. All metadata was automatically generated, making it easier to share and document the experiments for other researchers to replicate.
Automatic Upload to Zenodo: Another crucial achievement was the implementation of automatic uploading of pos experiment result folders to Zenodo, an open-access repository. This step significantly improved the reproducibility and sharing of experiment results, making them easily accessible to the broader scientific community. By utilizing Zenodo, we ensured that experiment results, along with their RO-Crate metadata, could be archived and referenced, fostering greater transparency and collaboration in scientific research.
Chameleon Deployment: Deploying the pos testbed within the Chameleon environment required managing various complexities, particularly related to Chameleon’s OpenStack API, networking setup, and hardware configurations. Coordinating the network components and infrastructure to support pos functionality in this testbed environment demanded significant adjustments to ensure smooth integration and operation.
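For readers unfamiliar with the format, the sketch below builds a minimal ro-crate-metadata.json for a result folder following the general RO-Crate 1.1 layout. It is not the pos implementation, which is not public. The function name, the `#author` identifier, and the example values are hypothetical, and the sketch leaves out properties the specification expects on the root dataset, such as a description, publication date, and license.

```python
import json


def build_ro_crate_metadata(name, author, files):
    """Assemble a minimal RO-Crate style metadata document as a dict."""
    graph = [
        {   # Descriptor that identifies this file as RO-Crate metadata.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: the experiment result folder itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "author": {"@id": "#author"},
            "hasPart": [{"@id": path} for path in files],
        },
        {"@id": "#author", "@type": "Person", "name": author},
    ]
    # One File entity per result file listed in hasPart.
    graph.extend({"@id": path, "@type": "File"} for path in files)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Invented example values.
metadata = build_ro_crate_metadata(
    name="Example experiment run",
    author="Jane Doe",
    files=["experiment.sh", "results/output.csv"],
)
print(json.dumps(metadata, indent=2))
```

Generating this document automatically at the end of each run keeps author, hardware, and script information attached to the results without manual bookkeeping.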

Challenges

Like any project, this one came with its own set of challenges:

Balancing Automation and Flexibility: While automating the generation of RO-Crate metadata, it was crucial to ensure that the flexibility required by researchers for customizing their documentation was not compromised. Finding this balance required in-depth adjustments to the testbed infrastructure.
Complexity of Testbed Systems: Integrating RO-Crate into a complex system like pos, and ensuring it works seamlessly with Chameleon, involved understanding and adapting to the complexities of both testbeds.

Future Directions

As I move forward with my master’s thesis working on these challenges, we plan to expand on this work by:

Extending the Chameleon Deployment: We aim to deploy the full version of pos on Chameleon, supporting more complex and larger-scale experiments.
Supporting Complex Experiment Workflows: Future work will focus on handling more intricate and larger datasets, ensuring reproducibility for complex workflows. Only by executing more complex experiments will we be able to thoroughly analyze and compare the differences between executions in pos and the pos deployed on Chameleon, helping us better understand the impact of different testbed environments on experiment outcomes.
Automation: The ultimate goal is to fully automate the process of experiment execution, result documentation, and sharing across testbeds, reducing manual intervention and further enhancing reproducibility.

Reflections

By integrating the RO-Crate standard and deploying pos on the Chameleon testbed, we have made significant steps toward enhancing the reproducibility, accessibility, and portability of network experiments across research platforms. These efforts contribute to more shareable and replicable research processes in the scientific community.

I am excited about the future work ahead and am grateful for the mentorship and support I received during this project.

Deliverables and Availability

Due to the current non-public status of the pos framework, the code and deliverables are not publicly available at the moment.

Previous Blogs

Make sure to check out my other blogs to see how I started this project and the challenges I faced along the way:

Servus!

Understanding Data Leakage in Machine Learning: A Focus on TF-IDF

Thu, 05 Sep 2024 00:00:00 +0000

Hello again!

This is my final blog post, and I will be discussing the second material I created for the 2024 Summer of Reproducibility Fellowship. As you may recall from my first post, I am working on the Exploring Data Leakage in Applied ML: Reproducing Examples of Irreproducibility project with Fraida Fund and Mohamed Saeed as my mentors.

This blog post will explore how data leakage can occur during feature extraction, particularly with the commonly used TF-IDF vectorizer, and its impact on model generalization.

Introduction

In machine learning, data leakage is a critical issue that can severely impact model performance. It occurs when information from outside the training dataset is improperly used to create the model, leading to overly optimistic performance during evaluation. One common source of leakage comes from how features, such as those extracted using TF-IDF (Term Frequency-Inverse Document Frequency), are handled. In this post, we’ll explore how data leakage can happen during feature extraction with TF-IDF and how it affects model accuracy.

What is TF-IDF?

TF-IDF is a method used to evaluate how important a word is in a document relative to a collection of documents. It consists of two components:

Term Frequency (TF): Measures how frequently a term appears in a document.
Inverse Document Frequency (IDF): Reduces the importance of terms that appear frequently across many documents.

Together, they provide a weighted value for each word, reflecting its importance relative to the dataset.
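
As a small worked example, the two components can be computed by hand. This sketch uses the textbook definition idf = log(N / df) and whitespace tokenization; both are simplifying assumptions, and libraries such as scikit-learn use a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter


def term_frequency(doc):
    """Relative frequency of each token in one document."""
    tokens = doc.lower().split()
    counts = Counter(tokens)
    return {t: c / len(tokens) for t, c in counts.items()}


def inverse_document_frequency(docs):
    """Textbook IDF: log(N / df), where df counts documents containing the term."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.lower().split()))
    return {t: math.log(n / c) for t, c in df.items()}


docs = ["the banana is ripe", "the apple is red", "the sky is blue"]
idf = inverse_document_frequency(docs)

# TF-IDF weights for the first document
tfidf = {t: tf * idf[t] for t, tf in term_frequency(docs[0]).items()}
```

Here "the" occurs in every document, so its IDF (and therefore its TF-IDF weight) is zero, while "banana" occurs in only one document and gets the largest weight in that document.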

How Data Leakage Occurs with TF-IDF

Data leakage with TF-IDF happens when the inverse document frequency (IDF) is calculated using the entire dataset (including the test set) before splitting it into training and test sets. This means the model has access to information from the test set during training, leading to artificially inflated results. This is a subtle form of data leakage, as it often goes unnoticed.

For example, suppose the word “banana” appears far more often in the test set than in the training set. If the test documents are included when the IDF is computed, “banana” looks like a common word and receives a lower weight, so the model downplays its significance. As a result, the model may fail to predict correctly when “banana” is important in the test data.
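
The effect on the IDF weight can be seen with a few lines of Python. The sentences below are made up for illustration, and the textbook formula log(N / df) is assumed.

```python
import math


def idf(term, docs):
    """Textbook IDF log(N / df); assumes the term occurs in at least one document."""
    df = sum(term in d.lower().split() for d in docs)
    return math.log(len(docs) / df)


train = ["banana bread recipe", "apple pie recipe", "cherry tart recipe", "plum cake recipe"]
test = ["banana smoothie", "banana split", "frozen banana", "lemon sorbet"]

idf_train_only = idf("banana", train)      # fitted without the test set: log(4 / 1)
idf_leaky = idf("banana", train + test)    # test documents shift the weight: log(8 / 4)
```

With the training set alone, "banana" is rare and gets a high weight. Once the test documents are mixed in, the weight drops by half, even though the training data has not changed.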

Why Does This Matter?

If the test data is included when calculating the IDF, the model gains unintended insight into the test set’s word distribution. In real-world scenarios, the test data is supposed to be unseen during training. By allowing the model to see this information, you’re essentially reducing the uncertainty that the model should have about future data.

Impact of Data Leakage on Model Performance

Let’s consider two cases to understand the impact of data leakage in detail:

When a word is rare in the training set but common in the test set: The model will underestimate the importance of this word during training, leading to poor performance when the word is critical in test documents.
When a word is common in the training set but rare in the test set: The model will overemphasize the word during training, leading to poor predictions when the word doesn’t appear as often in unseen data.

Case Study: Data Leakage in TF-IDF

To see this effect in action, consider a small toy dataset where the presence of the word “banana” determines the label. If the word “banana” appears in a sentence, the label is 1; otherwise, the label is 0. Using TF-IDF to vectorize the text, we train a machine learning model to predict this label.

In the first scenario, we calculate the TF-IDF using the entire dataset before splitting it into training and testing sets. This causes data leakage since the model now knows the distribution of words across both sets. For instance, if “banana” is more common in the test set than the training set, the IDF score for “banana” will be lower across the entire dataset, leading the model to downplay its importance.

In the second scenario, we calculate TF-IDF only on the training set, ensuring that the test set remains unseen. This preserves the integrity of the test set, giving us a more realistic evaluation of the model’s performance.

The model’s accuracy differs drastically between the two scenarios. When leakage is present, performance is artificially high during training but poor when tested on unseen data. Without leakage, the model generalizes better, as it is evaluated on truly unseen data.

Avoiding Data Leakage

Avoiding data leakage is essential for building reliable machine learning models that generalize well to new data. Here are a few guidelines to help prevent leakage:

Split the dataset before feature extraction: Always divide your data into training and test sets before applying any feature engineering techniques.
Ensure proper cross-validation: When using cross-validation, ensure that the training and test splits do not overlap in any way that can leak information between them.
Be cautious with time-series data: In time-series models, avoid using future data to predict past events, as this can lead to leakage.
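
The first guideline can be sketched as "split, then fit on the training part only, then transform both parts". TinyTfidf below is a hypothetical toy class written for this post, not a library API, and the smoothed IDF formula is an assumption chosen so that every term keeps a positive weight.

```python
import math
from collections import Counter


class TinyTfidf:
    """Toy TF-IDF vectorizer (hypothetical helper, not a library class)."""

    def fit(self, docs):
        n = len(docs)
        df = Counter(t for d in docs for t in set(d.lower().split()))
        # Smoothed IDF so every term keeps a finite, positive weight
        self.idf_ = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        self.vocab_ = sorted(self.idf_)
        return self

    def transform(self, docs):
        rows = []
        for d in docs:
            counts = Counter(d.lower().split())
            # Terms unseen during fit are ignored, as they would be in production
            rows.append([counts[t] * self.idf_[t] for t in self.vocab_])
        return rows


docs = ["banana bread", "apple pie", "banana split", "lemon tart", "plum cake", "banana cake"]
labels = [1, 0, 1, 0, 0, 1]

# 1. Split first (a fixed split here; shuffle or stratify in practice)
train_docs, test_docs = docs[:4], docs[4:]
train_y, test_y = labels[:4], labels[4:]

# 2. Fit the vectorizer on the training documents only
vec = TinyTfidf().fit(train_docs)

# 3. Reuse the fitted vocabulary and IDF weights for the test documents
X_train = vec.transform(train_docs)
X_test = vec.transform(test_docs)
```

With scikit-learn the same pattern is usually written as fit_transform on the training set and transform on the test set, or by placing the vectorizer inside a Pipeline so that each cross-validation fold refits it on that fold's training data.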

Conclusion

Avoiding data leakage is crucial for building robust machine learning models. In the case of TF-IDF, ensuring that feature extraction is done only on the training set and not on the entire dataset is key to preventing leakage. Properly addressing this issue leads to better generalization and more reliable models in real-world applications.

This blog post provided a case study on how TF-IDF can introduce data leakage and why it’s important to carefully handle your dataset before feature extraction. By splitting your data properly and ensuring that no test data “leaks” into the training process, you can build models that truly reflect real-world performance.

Thanks for reading!

AutoAppendix: Towards One-Click reproducibility of high-performance computing experiments

Wed, 04 Sep 2024 00:00:00 +0000

Hi everyone,

I’m excited to wrap up the AutoAppendix project with our final findings and insights. Over the course of this initiative, we’ve worked to assess the reproducibility of artifacts submitted to the SC24 conference and create guidelines that aim to improve the standard for reproducible experiments in the future. Here’s a summary of the project’s final phase and what we’ve learned.

Project Goals and Progress

The goal of AutoAppendix was to evaluate the computational artifacts provided by SC24 paper submissions, focusing on reproducibility. These artifacts accompany papers applying for the “Artifact Replicable” badge in the conference’s reproducibility initiative. Volunteer members of this initiative assess 1-2 paper appendices each. In this project, we analyzed a larger portion of artifacts to gain a broader perspective on potential improvements to the reproducibility process.

We selected 18 out of 45 submissions, focusing on experiments that could be easily replicated on Chameleon Cloud. Our evaluation criteria were based on simplicity (single-node setups) and availability of resources. The final analysis expanded on the earlier midterm findings, shedding light on various challenges and best practices related to artifact reproducibility.

Artifact Evaluation Process

During the evaluation process, we focused on examining the completeness and clarity of the provided artifacts, looking closely at documentation, setup instructions, and the degree of automation.

Our first step was to replicate the environments used in the original experiments as closely as possible using the resources from Chameleon. Many papers included instructions for creating the necessary software environments, but the clarity of these instructions varied significantly across submissions. In some cases, we even encountered challenges in reproducing results due to unclear instructions or missing dependencies, which reinforced the need for standardized, clear documentation as part of the artifact submission process.

We observed that containerization and semi-automated setups (with scripts that break down the experiment into smaller steps) were particularly effective in enhancing the reproducibility of the artifacts. One artifact particularly caught our attention due to its usage of the Chameleon JupyterHub platform, making it reproducible with a single click. This highlighted the potential for streamlining the reproducibility process and showcased that, with sufficient effort and the right tools, experiments can indeed be made replicable by anyone.

Results

Throughout the evaluation, we observed that reproducibility could vary widely based on the clarity and completeness of the documentation and the automation of setup procedures. Artifacts that were structured with clear, detailed steps for installation and execution tended to perform well in terms of replicability.

From our evaluation, we derived a set of guidelines (intended as must-haves) and best practices (recommended) for artifact reproducibility, which can be found below.

Given our fascination with the potential of the Chameleon JupyterHub platform and its adjacent Trovi artifact repository, we decided to create several templates that authors can use as a starting point to make integrating their artifacts with the platform easier. In designing these templates, we made sure that artifacts structured according to our guidelines are particularly easy to integrate.

Guidelines

Clear Documentation: Provide clear and detailed documentation for the artifact in the corresponding appendix, such that the artifact can be replicated without the need for additional information. For third-party software, it is acceptable to refer to the official documentation.
Software Setup: Clearly specify the versions of all (necessary) software components used in the creation of the artifact. This includes the operating system, libraries, and tools. In particular, state all software setup steps needed to replicate the software environment.
Hardware Specifications: Specify the hardware the experiment was conducted on. Importantly, state the architecture the experiments are intended to run on, and ensure that provided software (e.g. Docker images) is compatible with commonly available architectures.
Expected Results: Always provide the expected outputs of the experiment, especially when run on different hardware, to make it easier for reviewers to assess the success of the replication.
Public Data: Publish the experiment data to a public repository, and make sure the data is available for download to reviewers and readers, especially during the evaluation period. Zenodo is a recommended repository for this purpose.
Automated Reproducibility: For long-running experiments, provide progress output to the reviewer to ensure the experiment is running as expected. Give an idea in the documentation of

how much time long-running steps in the reproduction will take
what the progress output looks like or how frequently it is emitted

Sample Execution: Conduct a sample evaluation with hardware and software as similar as possible to the intended reproduction environment.
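
For the progress-output guideline, a sketch along the following lines may help. The function names and the line format are hypothetical; the point is only that each line tells the reviewer how far the run is, how long it has taken, and roughly how long it will still take.

```python
import time


def format_progress(done, total, elapsed_s):
    """Build one progress line with percentage, elapsed time, and a rough ETA."""
    rate = done / elapsed_s if elapsed_s > 0 else 0.0
    eta_s = (total - done) / rate if rate > 0 else float("inf")
    eta = f"{eta_s:.0f}s" if eta_s != float("inf") else "unknown"
    return f"[{done}/{total}] {100 * done / total:.0f}% elapsed={elapsed_s:.0f}s eta={eta}"


def run_with_progress(steps, every=1, clock=time.monotonic):
    """Run callables in order, printing a progress line every `every` steps."""
    start = clock()
    for i, step in enumerate(steps, start=1):
        step()
        if i % every == 0 or i == len(steps):
            # flush=True so the line shows up immediately in redirected logs
            print(format_progress(i, len(steps), clock() - start), flush=True)
```

The ETA is a linear extrapolation from the steps completed so far, so it is only a rough guide when steps differ a lot in cost; stating this in the documentation is part of setting expectations.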

Best Practices

Reproducible Environment: Use a reproducible environment for the artifact. This can come in several forms:

Containerization: Provide instructions for building the environment, or, ideally, provide a ready-to-use image. For example, Docker, Singularity, or VirtualBox images can be used for this purpose.
Reproducible Builds: Package managers like Nix or Guix have recently spiked in popularity and allow their users to create reproducible environments, matching the exact software versions across different systems.

Partial Automation: It often makes sense to break an experiment down into smaller, more manageable steps. For Linux-based systems, bash scripts are particularly viable for this purpose. We recommend prefixing the scripts for each step with a number, such that the order of execution is clear.
X11 Availability: Usually, reviewers will not have access to a graphical user interface on the system where the artifact is evaluated. If the artifact requires a graphical user interface, provide a way to run the artifact without it. For example, save matplotlib plots to disk instead of showing them with plt.show().
Experiment output: Do not provide output files of the experiment in your artifact, unless explicitly intended. If provided output files are intended for comparison, they should be marked as such (e.g. in their filename). Similarly, any output logs or interactive outputs in Jupyter notebooks should not be part of the artifact, but should instead be generated during the artifact evaluation.
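
The numbered-script convention from the Partial Automation practice can be paired with a tiny driver. This is a sketch under assumptions: the file names (such as 1_setup.sh) and the helper names are hypothetical, and the scripts are assumed to be bash scripts in one directory.

```python
import re
import subprocess
from pathlib import Path


def ordered_steps(names):
    """Return script names that start with a numeric prefix, sorted by that number."""
    numbered = []
    for name in names:
        match = re.match(r"(\d+)[_-]", name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def run_steps(directory):
    """Run each numbered bash script in order and stop at the first failure."""
    names = [p.name for p in Path(directory).glob("*.sh")]
    for name in ordered_steps(names):
        print(f"== running {name}", flush=True)
        subprocess.run(["bash", name], cwd=directory, check=True)
```

Sorting by the numeric value rather than by the raw file name avoids the classic pitfall where 10_plot.sh would run before 2_build.sh. Reviewers can still run any single step by hand when something fails.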

Trovi Templates

Our templates share a common base that features a central configuration file for modifying the Chameleon experiment parameters (such as node type). Building on this base, we provide three templates with sample experiments that each use different environments:

Docker template: This template is designed for containerized experiments and supports NVIDIA GPUs via the nvidia-container-toolkit integration.
Nix template: Sets up the Nix package manager with a shell.nix file that can be used to configure the environment.
Guix template: Installs the Guix package manager and executes a sample experiment from an existing reproducible paper that hinges on the reproducibility of the software environment.

Conclusion

In summary, the AutoAppendix project has been an insightful journey into the complexities of artifact reproducibility. Our evaluations highlight both the challenges and potential solutions for future reproducibility initiatives. By following these essential guidelines and implementing best practices, we aim for the research community to achieve higher standards of transparency and reliability in scientific research and help to ensure that the results of experiments can be replicated by others.

Thanks for following along with our progress! We’re excited to see the positive impact these findings will have on the research community.

If you are interested in the full project report, you can find it here, together with the Trovi templates.

Reflecting on the ScaleRep Project: Achievements and Insights

Mon, 02 Sep 2024 00:00:00 +0000

Hello everyone,

As we reach the conclusion of our ScaleRep project, I want to take a moment to reflect on the journey we’ve undertaken and the significant milestones we’ve achieved. Throughout this project, our primary focus was on identifying, reproducing, and analyzing scalability bugs in cloud systems such as Cassandra, HDFS, and Hadoop. Under the mentorship of Professor Yang Wang and Bogdan “Bo” Stoica, we have gained valuable insights into the complexities of scalability issues and their impact on large-scale distributed systems.

Key Accomplishments

Over the course of the project, we delved into various aspects of scalability bugs, reproducing some of the most challenging issues faced by cloud systems. One of our notable accomplishments was the successful reproduction and validation of developer fixes for several critical bugs in HDFS. These included:

1. Throttling Bugs in HDFS:

We investigated HDFS-17087, where the absence of a throttler in DataXceiver#readBlock led to unregulated data reads, causing potential performance degradation. By reproducing the bug and applying the developer’s patch, we were able to observe significant improvements in system stability.
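
To illustrate what a throttler does on a read path, here is a minimal sketch in Python. It is not Hadoop's implementation and not the HDFS-17087 patch; the class name, the period-based accounting, and the injectable clock and sleep functions are assumptions made for this example.

```python
import time


class BandwidthThrottler:
    """Limit throughput to `bytes_per_sec`, accounted in fixed-length periods."""

    def __init__(self, bytes_per_sec, period_s=0.5, clock=time.monotonic, sleep=time.sleep):
        self.period_s = period_s
        self.budget_per_period = bytes_per_sec * period_s
        self.clock, self.sleep = clock, sleep
        self.period_start = clock()
        self.remaining = self.budget_per_period

    def throttle(self, num_bytes):
        """Account for `num_bytes` just transferred; block while over budget."""
        self.remaining -= num_bytes
        while self.remaining < 0:
            now = self.clock()
            period_end = self.period_start + self.period_s
            if now < period_end:
                self.sleep(period_end - now)   # wait out the current period
                now = period_end
            # Start a new period and top the budget up
            self.period_start = now
            self.remaining += self.budget_per_period
```

A reader would call throttle(len(chunk)) after sending each chunk. Without such a call, as in the bug described above, nothing bounds how fast reads are served.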

2. Reducing DataNode Load:

HDFS-16386 was another crucial bug we worked on, which involved reducing the load on DataNodes when FsDatasetAsyncDiskService was working. By analyzing the effects of high CPU and memory usage, we proposed and validated a solution that reduced the number of concurrent threads, ultimately improving the DataNode’s performance.

3. Improving Log Throttling:

In HDFS-16872, we addressed excessive logging caused by unshared instances of LogThrottlingHelper. By making LogThrottlingHelper a static member, we were able to share throttling across instances, reducing unnecessary log entries and improving system efficiency.
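
The difference between per-instance and shared throttling can be sketched in a few lines of Python. This is an analogy, not the Hadoop code: LogThrottler, PerInstanceWorker, and SharedWorker are hypothetical names, and a class-level attribute plays the role of the static member.

```python
import time


class LogThrottler:
    """Allow at most one log record per `min_interval_s` seconds."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.last_logged = None

    def should_log(self):
        now = self.clock()
        if self.last_logged is None or now - self.last_logged >= self.min_interval_s:
            self.last_logged = now
            return True
        return False


class PerInstanceWorker:
    """Each worker owns a throttler, so N workers can emit N records per interval."""

    def __init__(self, clock=time.monotonic):
        self.throttler = LogThrottler(5.0, clock)


class SharedWorker:
    """One class-level throttler is shared, so all workers emit one record per interval."""

    throttler = LogThrottler(5.0)
```

With many short-lived instances, the per-instance variant barely throttles at all, because every new instance starts with a fresh throttler. Sharing one throttler restores the intended rate limit.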

Insights and Learnings

1. Systematic Bug Reproduction:

One of the most critical aspects of our work was developing a systematic approach to bug reproduction. This involved carefully setting up the environment, applying patches, and validating results through detailed monitoring and analysis. Our reproducible artifacts and investigation scripts will serve as a resource for future researchers and developers.

2. Impact of Throttling Mechanisms:

Our exploration of throttling bugs highlighted the importance of accurate throttling mechanisms in maintaining system performance and stability. Small issues, such as incorrect data rate calculations, can have significant ripple effects on system behavior, emphasizing the need for precise and effective solutions.

3. Collaboration and Open Source Contribution:

Working on an open-source project like ScaleRep underscored the importance of collaboration within the community. The bugs we analyzed and fixed not only improved the systems we worked on but also contributed to the broader effort of enhancing the reliability of cloud systems.

Conclusion

As we wrap up the ScaleRep project, I am proud of the progress we have made and the contributions we have delivered to the open-source community. The knowledge and experience gained from this project will undoubtedly shape our future endeavors in the field of distributed systems and cloud computing. I am grateful for the guidance and support provided by Professor Yang Wang and Bogdan “Bo” Stoica throughout this journey.

Thank you for following along, and I look forward to continuing to explore the future of scalable and reliable cloud systems!

Final Report: Stream processing support for FasTensor

Fri, 30 Aug 2024 00:00:00 +0000

Project Description

FasTensor is a scientific computing library specialized in performing computations over dense matrices that exhibit spatial locality, a characteristic often found in physical phenomena data. Our GSoC'24 project aimed to enhance FasTensor by enabling it to ingest and process live data streams from sensors and scientific equipment.

What is FasTensor?

Imagine you’re working on a physical simulation or solving partial differential equations (PDEs). You’ve discretized your PDE, but now you face a new challenge: you need to run your computations fast and parallelize them across massive compute clusters.

At this point, you find yourself describing a stencil [1] operation. But should you really spend your time tinkering with loop orders, data layouts, and countless other side-quests unrelated to your core problem?

This is where FasTensor comes in: Describe your computation as a stencil, and it takes care of ensuring optimal execution. FasTensor lets you focus on the science, not the implementation details.
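
For readers who have not met the term, a stencil computes each output cell from a fixed neighborhood of input cells. The sketch below shows the idea in plain Python; it is not FasTensor's C++ API, and apply_stencil and five_point_average are names made up for this example.

```python
def apply_stencil(grid, kernel):
    """Apply `kernel` to every interior cell of a 2D grid.

    `kernel` receives an accessor `at(di, dj)` that returns the neighbor at a
    relative offset, which is the essence of a stencil computation.
    """
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]          # boundary cells are copied unchanged
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            out[i][j] = kernel(lambda di, dj: grid[i + di][j + dj])
    return out


def five_point_average(at):
    """Average of a cell and its four direct neighbors (one Jacobi-style step)."""
    return (at(0, 0) + at(-1, 0) + at(1, 0) + at(0, -1) + at(0, 1)) / 5


grid = [
    [0.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
    [0.0, 0.0, 0.0],
]
smoothed = apply_stencil(grid, five_point_average)
```

The user only writes the kernel in terms of relative offsets. Loop order, data layout, and parallel distribution are exactly the parts a library can take over.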

Repository Links

FasTensor: https://github.com/BinDong314/FasTensor
My fork: https://github.com/my-name/FasTensor/tree/ftstream

PR(s)

Work done this summer

Develop Streaming simulator: FTStream

I was first tasked by Dr. Bin to develop a stream simulator for testing the streaming capability of FasTensor. For testing purposes, a stream is characterized by file size, count, and arrival interval. FTStream can generate streams of various sizes and intervals, up to the theoretical limits of disk and filesystem. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe!

Writing this tool was an adventure in throughput testing and exploring APIs. I wrote multiple drivers, each for a different whim and hijinks of systems in the HPC world. Here’s a brief journey through the APIs we explored:

HDF5 APIs: Pretty fast in flush-to-disk operation, but the API design strongly binds to file handles, which inhibits high throughput duplication.
HDF5 VFL and VOL: We dabbled in these dark arts, but there be dragons! Keeping a long-term view of maintenance, we dropped the idea.
POSIX O_DIRECT: This involved getting your buffers aligned right and handling remainders correctly. A step up, but not quite at the theoretical limits.
Linux AIO: Streaming is a latency-sensitive domain; to reach the theoretical limits, every syscall saved matters. Linux AIO allowed us to batch syscalls with io_submit(). It took a few testing sessions to get the right combination of queue depth, buffer size, and alignment.

We settled on O_DIRECT + Linux AIO. Feel free to modify ftstream/fastflush.h to suit your needs.
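
The "aligned buffers and remainders" part of O_DIRECT comes down to a bit of arithmetic, sketched here in Python for brevity. The 4096-byte alignment is an assumption; the real requirement depends on the device and filesystem, and the helper names are hypothetical rather than taken from ftstream/fastflush.h.

```python
def split_for_direct_io(length, alignment=4096):
    """Split a payload length into an aligned bulk part and an unaligned remainder."""
    bulk = (length // alignment) * alignment
    return bulk, length - bulk


def padded_length(length, alignment=4096):
    """Round `length` up to the next multiple of `alignment`."""
    return -(-length // alignment) * alignment
```

For a 10000-byte payload, the first 8192 bytes can go through the direct path as-is. The remaining 1808 bytes either need a padded 4096-byte write (followed by truncating the file to the true size) or a fallback to buffered I/O.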

Stream Support

FasTensor has just one simple paradigm: you give it a data source, an output data store, and your transform, and it handles all the behind-the-scenes grunt work of computing over big datasets so you can focus on your research.

We aimed to achieve the same for streaming: Drop in the STREAM keyword, append a pattern identifying your stream, and use your usual transform.

Voila! Now your previous FasTensor code supports live data streams.

Technical tidbits:

Implements a manager-worker pattern, which gives us the flexibility to implement different stream semantics in the future, such as windowing and CPU-memory-based load balancing
Supports streams of indefinite size
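
The manager-worker idea can be sketched with Python threads and a bounded queue. This illustrates the pattern only, not FasTensor's actual C++ implementation; process_stream and the sentinel object are names invented for the example.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker the stream has ended


def process_stream(chunks, transform, num_workers=2):
    """Manager feeds chunks from an iterator of unknown length to worker threads."""
    tasks = queue.Queue(maxsize=num_workers * 2)   # bounded: applies backpressure
    results, lock = {}, threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is _STOP:
                break
            index, chunk = item
            value = transform(chunk)
            with lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in enumerate(chunks):       # manager loop: works for any iterator
        tasks.put(item)
    for _ in threads:
        tasks.put(_STOP)                 # one sentinel per worker
    for t in threads:
        t.join()
    return [results[i] for i in sorted(results)]
```

Because the manager only ever pulls the next chunk from an iterator, the stream does not need a known length. Keeping the dispatch logic in one place is also where policies such as windowing or load balancing could later be added.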

Challenges

HPC has its fair share of challenges. Things you take for granted might not be available there, and it takes a while to adjust to paradigms of scale and parallelization.

For example, when developing FTStream, we found O_DIRECT is available on some parallel file systems like GPFS but not supported on Lustre/CFS. We developed a separate MPIO driver for FTStream that will be upstreamed once thoroughly tested on Lustre.

Future Work

Implement windowing and explore more advanced stream semantics.
Implement support for defining workload policies.
Optimize interleaving IO and Compute.

References

[1] Anshu Dubey. 2014. Stencils in Scientific Computations. In Proceedings of the Second Workshop on Optimizing Stencil Computations (WOSC ‘14). Association for Computing Machinery, New York, NY, USA, 57. https://doi.org/10.1145/2686745.2686756

Acknowledgement

I struck gold when it comes to mentors.

Dr. Bin Dong was really kind and supportive throughout the journey. From the very first steps of giving a tour around the codebase to giving me a lot of freedom to experiment, refactor, and refine.

Dr. John Wu was encouraging and nurturing of budding talent. We had great research presentations every Monday apart from usual mentor interactions, where different research groups presented their talks and students were invited to present their progress.

I’ve come across Quantum computing many times in the news, but I never thought I’d get a frontline preview from the researchers working at the bleeding edge at the Lawrence Berkeley National Laboratory (LBL).

This GSoC experience, made possible by Google and UC OSPO, has been invaluable for my growth as a developer and researcher.

For people interested in HPC, ML, Systems, or Reproducibility, I encourage you all to apply to UC OSPO. It’s been an incredible journey, and I’m grateful for every moment of it!

Static and Interactive Visualization Capture

Fri, 30 Aug 2024 00:00:00 +0000

Introduction

Hello! My name is Arya Sarkar, a machine learning engineer and researcher based out of Kolkata, a city in Eastern India dubbed the City of Joy. During the summer of 2024, I worked closely with Professor David Koop on the project titled Reproducibility in Data Visualization. We explored multiple existing solutions, tested different strategies, and made great progress in capturing visualizations using a relatively little-used method: embedding the visualization's meta-information as a JSON object in the final image file.

Progress and Challenges

Static Visualization Capture

We successfully developed a method to capture static visualizations as .png files along with embedded metadata in a JSON format. This approach enables seamless reproducibility of the visualization by storing all necessary metadata within the image file itself. Our method supports both Matplotlib and Bokeh libraries and demonstrated near-perfect reproducibility, with only a minimal 1-2% pixel difference in cases where jitter (randomness) was involved.

Interactive Visualization Capture

For interactive visualizations, our focus shifted to capturing state changes in Plotly visualizations on the web. We developed a script that tracks user interactions (e.g., zoom, box, lasso, slider) using event listeners and automatically captures the visualization state as both image and metadata files. This script also maintains a history of interactions to ensure reproducibility of all interaction states.

The challenge of capturing web-based visualizations from platforms like ObservableHq remains, as iframe restrictions prevent direct access to SVG elements. Further exploration is needed to create a more robust capture method for these environments.

Future Work

We aim to package our interactive capture script into a Google Chrome extension.

Temporarily store interaction session files in the browser’s local storage.

Enable users to download captured files as a zip archive, using base64 encoding for images.

Conclusion

This past summer, we made significant strides in enhancing data visualization reproducibility. Our approach of embedding metadata directly into visualization files offers a streamlined method for recreating static visualizations. The progress in capturing interactive visualization states opens new possibilities for tackling a long-standing challenge in the field of reproducibility.

Final Blog: BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking

Thu, 29 Aug 2024 00:00:00 +0000

Hello! I’m Qianru! I have been contributing to the BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking project under the mentorship of Ziheng Duan. My project aims to provide a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics.

Motivation and Overview

The “BenchmarkST” project was driven by the need to address a critical challenge in spatial transcriptomics: the impact of sparse data on downstream tasks, such as spatial domain identification. Sparse data can significantly degrade the performance of these tasks. For example, in a 10X Visium dataset of human brain Dorsolateral Prefrontal Cortex (DLPFC), using the complete dataset with GraphST (a state-of-the-art clustering method) for clustering resulted in an ARI (Adjusted Rand Index) of 0.6347. However, when using only 20% of the data—a common scenario—the performance dropped dramatically to 0.1880. This stark difference highlights the importance of effective gene imputation, which can help restore the lost information and improve the accuracy of downstream analyses.

To tackle this issue, the BenchmarkST project led to the creation of the Impeller package. This package provides a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics, offering preprocessed datasets, reproducible evaluation methods, and flexible inference interfaces. It spans across different platforms, species, and organs, aiming to enhance the integrity and usability of spatial transcriptomics data.

What Was Accomplished

Development of the Impeller Package

Data Aggregation and Preprocessing:

We aggregated and preprocessed spatial transcriptomic datasets from multiple platforms (10X Visium, StereoSeq, SlideSeqV2), species (human, mouse), and organs (Dorsolateral Prefrontal Cortex, olfactory bulb). These datasets are readily available for download within the package.

Unified Evaluation Framework:

A reproducible framework was developed, integrating methods such as K-Nearest Neighbors (KNN) and the deep learning-based Impeller method, enabling users to easily evaluate the performance of different gene imputation techniques.
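
As a rough idea of what a KNN baseline for gene imputation looks like, here is a minimal sketch that fills in a missing expression value from the spatially nearest spots. It does not use the Impeller package's interfaces; the function name, the plain averaging, and the toy data are assumptions for illustration.

```python
import math


def knn_impute(coords, values, target, k=3):
    """Estimate a missing expression value at spot `target` from its k nearest spots.

    coords: list of (x, y) spot positions; values: expression of one gene per
    spot, with None where the value is missing.
    """
    observed = [
        (math.dist(coords[target], coords[i]), values[i])
        for i in range(len(coords))
        if i != target and values[i] is not None
    ]
    nearest = sorted(observed)[:k]           # closest observed spots first
    return sum(v for _, v in nearest) / len(nearest)


coords = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
expression = [2.0, 4.0, None, 6.0, 100.0]
imputed = knn_impute(coords, expression, target=2, k=3)
```

The distant spot at (5, 5) with its outlier value is ignored because it is not among the three nearest neighbors. An evaluation framework then masks known values, imputes them, and compares the result against the held-out truth.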

Inference Interfaces:

We provided interfaces that allow users to apply gene imputation on custom datasets, offering the flexibility to predict any gene in any cell, maximizing the utility for diverse research needs.

Code Contributions and Documentation

Repository:

All code related to the Impeller package has been committed to the Impeller repository.

Link to Versions:

Here you can find all the versions made during the project, with detailed descriptions of each change.

README.md:

Detailed documentation on how to use the Impeller package, including installation instructions, usage examples, and explanations of the key components.

Final Blog: ML in Detecting and Addressing System Drift

Thu, 29 Aug 2024 00:00:00 +0000

Hello! I’m Joanna! I have been contributing to the ML in Detecting and Addressing System Drift project under the mentorship of Ray Andrew Sinurat and Sandeep Madireddy. My project aims to design a pipeline to evaluate drift detection algorithms on system traces.

Methodology

Here is some background on my project: Model drift, or the degradation of model performance, is typically caused by data drift, which is a shift in the input distribution, and concept drift, which is a change in the relationship between input and output. This project focuses specifically on data drift, aiming to design a pipeline for evaluating drift detection algorithms on system traces. The goal is to benchmark different drift detection algorithms and have a better understanding of the features of system traces. The project is divided into two main parts: dataset construction and algorithm benchmarking.

PART 1: Dataset Construction

To benchmark drift detection algorithms in system data, it’s important to recognize that system trace data is inherently different from other data types, often containing more noise, which can complicate detection efforts. Therefore, constructing a labeled dataset specific to system data is crucial. In our case, we utilize the Tencent I/O block trace data as the dataset. This raw data was processed to extract timestamps along with various features such as IOPS, write size ratio, and read/write ratio, which were then used to create a data drift dataset.

I constructed this dataset by labeling segments of the trace data as either exhibiting drift or not. To identify where the drift occurs and to help construct the dataset, I employed several offline drift detection algorithms, including Kolmogorov-Smirnov, Cramer-von Mises, KL-Divergence, and Jensen-Shannon Distance.
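
To show what one of these offline detectors computes, here is a minimal sketch of the Jensen-Shannon distance between the histograms of two windows. The binning helper, the base-2 logarithm, and the function names are choices made for this example, not necessarily those used in the project.

```python
import math


def histogram(samples, edges):
    """Normalized histogram of `samples` over shared bin edges."""
    counts = [0] * (len(edges) - 1)
    for x in samples:
        for b in range(len(counts)):
            last = b == len(counts) - 1
            # The last bin also includes its right edge
            if edges[b] <= x < edges[b + 1] or (last and x == edges[-1]):
                counts[b] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def js_distance(p, q):
    """Jensen-Shannon distance (base 2): sqrt of the JS divergence, in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))
```

A reference window and a current window of, say, IOPS values are binned with the same edges, and drift is flagged when js_distance exceeds a chosen threshold. Identical distributions give 0 and completely disjoint ones give 1.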

To enhance the accuracy of the drift detection, especially in the presence of noise common in trace data, I applied additional preprocessing steps such as Fourier transform and moving average. These techniques help to smooth the data, making it easier to detect true drift signals. Finally, a voting strategy was used in combination with post-processing methods to build and refine the final datasets.
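The smoothing and voting steps can be sketched as follows. This is only an illustration under assumptions: the window length, the trailing (rather than centered) average, and the simple-majority rule are choices made for the example, and the Fourier-based filtering is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Trailing moving average: output k is the mean of x[k .. k + window - 1].
// Returns an empty vector when the window is zero or longer than the input.
std::vector<double> moving_average(const std::vector<double>& x, std::size_t window) {
    std::vector<double> out;
    if (window == 0 || x.size() < window) return out;
    out.reserve(x.size() - window + 1);
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i];
        if (i >= window) sum -= x[i - window];  // drop the sample that left the window
        if (i + 1 >= window) out.push_back(sum / window);
    }
    return out;
}

// Hypothetical voting rule: a segment is labeled as drift when a strict
// majority of the detectors flagged it.
bool majority_vote(const std::vector<bool>& votes) {
    const std::size_t yes =
        static_cast<std::size_t>(std::count(votes.begin(), votes.end(), true));
    return 2 * yes > votes.size();
}
```

A longer window suppresses more noise but also blurs the exact point where a drift begins, so the window length trades label precision against robustness.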

The first figure below illustrates the segments of IOPS where drift has been detected. The second figure shows the segments of data where no drift occurs.

PART 2: Benchmark Drift Detection Algorithms

This part focuses on benchmarking the Jensen-Shannon and Wasserstein drift detection methods using system trace data. The evaluation metrics are categorized into three main areas:

Detection Accuracy Metrics

True Positive Rate (Recall)
True Negative Rate (Specificity)
Precision
F1-Score

Detection Overhead Metrics

Time Taken: The computational time required to detect drifts.

Stability Metrics

False Positive Rate
False Negative Rate

(Additional) Comparative Analysis:

Accuracy Across Different Features: How well the detection algorithms perform when applied to various features within the system trace data.
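The accuracy and stability metrics listed above all derive from the same confusion matrix. The sketch below shows the standard definitions; the struct and function names are illustrative assumptions, and a zero denominator is reported as 0 by convention here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Confusion {
    std::size_t tp = 0, fp = 0, tn = 0, fn = 0;
};

// Compare ground-truth drift labels with detector output, segment by segment.
Confusion tally(const std::vector<bool>& truth, const std::vector<bool>& predicted) {
    Confusion c;
    for (std::size_t i = 0; i < truth.size() && i < predicted.size(); ++i) {
        if (truth[i] && predicted[i]) ++c.tp;
        else if (!truth[i] && predicted[i]) ++c.fp;
        else if (truth[i] && !predicted[i]) ++c.fn;
        else ++c.tn;
    }
    return c;
}

// Ratio that returns 0 instead of dividing by zero.
double ratio(std::size_t num, std::size_t den) {
    return den == 0 ? 0.0 : static_cast<double>(num) / den;
}

struct Metrics {
    double recall, specificity, precision, f1, fpr, fnr;
};

Metrics compute_metrics(const Confusion& c) {
    Metrics m;
    m.recall = ratio(c.tp, c.tp + c.fn);           // true positive rate
    m.specificity = ratio(c.tn, c.tn + c.fp);      // true negative rate
    m.precision = ratio(c.tp, c.tp + c.fp);
    m.f1 = ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn);
    m.fpr = ratio(c.fp, c.fp + c.tn);              // 1 - specificity
    m.fnr = ratio(c.fn, c.fn + c.tp);              // 1 - recall
    return m;
}
```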

Discussion

The results clearly demonstrate that the Jensen-Shannon distance method outperforms the Wasserstein distance method in detecting drift. Additionally, the write size ratio proves to be a more effective feature for representing the variations in the data, offering a more nuanced understanding of the underlying changes.
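For reference, the two distances compared here can be sketched as below. The sketch assumes the feature values have already been binned into normalized histograms for the Jensen-Shannon distance, and that both windows hold the same number of samples for the one-dimensional Wasserstein distance; the function names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Jensen-Shannon distance between two histograms that each sum to 1.
// Base-2 logarithms keep the result in [0, 1].
double js_distance(const std::vector<double>& p, const std::vector<double>& q) {
    if (p.size() != q.size()) {
        throw std::invalid_argument("histograms must have the same number of bins");
    }
    double divergence = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        const double m = 0.5 * (p[i] + q[i]);
        if (p[i] > 0.0) divergence += 0.5 * p[i] * std::log2(p[i] / m);
        if (q[i] > 0.0) divergence += 0.5 * q[i] * std::log2(q[i] / m);
    }
    return std::sqrt(std::max(divergence, 0.0));
}

// Wasserstein-1 distance between two equally sized 1-D samples:
// the mean absolute difference of the sorted values.
double wasserstein_1d(std::vector<double> a, std::vector<double> b) {
    if (a.size() != b.size() || a.empty()) {
        throw std::invalid_argument("samples must be non-empty and equally sized");
    }
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());
    double total = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) total += std::fabs(a[i] - b[i]);
    return total / a.size();
}
```

One difference worth noting: the Jensen-Shannon distance is bounded, while the Wasserstein distance scales with the units of the feature, so the two need different thresholds.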

Conclusion and Next Steps

In conclusion, this project establishes a pipeline that encompasses data labeling, data processing, and the benchmarking of drift detection algorithms. This serves as just the first step in detecting drift in system data.

There is significant potential for further improvement. Future work should focus on enhancing dataset construction by incorporating large language models (LLMs) and other advanced techniques to further clean and refine the datasets. Additionally, the evaluation of drift detection methods should be expanded beyond the current benchmarks, which only include two statistical methods. Incorporating additional statistical methods, as well as machine learning (ML) and deep learning (DL) approaches, could provide a more comprehensive analysis. Furthermore, exploring a broader range of evaluation metrics will ensure a more robust and accurate assessment of drift detection performance. These steps will help to advance the accuracy and reliability of drift detection in system trace data.

Deliverables

The following are the deliverables of this project:

Trovi Artifact
GitHub Repository: This repository contains the code for generating labeled drift datasets and the notebooks with benchmarking results.

Final Blogpost: Reproducibility in Data Visualization

Wed, 28 Aug 2024 00:00:00 +0000

Hello everyone!

I’m Triveni, a Master’s student in Computer Science at Northern Illinois University (NIU). I’m excited to share my progress on the OSRE 2024 project Categorize Differences in Reproduced Visualizations focusing on data visualization reproducibility. Working under the mentorship of David Koop, I’ve made some significant strides and faced some interesting challenges.

Reproducibility in data visualization

Reproducibility is crucial in data visualization, ensuring that two visualizations accurately convey the same data. This is essential for maintaining transparency and trust in data-driven decision-making. When comparing two visualizations, the challenge is not just spotting differences but determining which differences are meaningful. Tools like OpenCV are often used for image comparison, but they may detect all differences, including those that do not impact the data’s interpretation. For example, slight shifts in labels might be flagged as differences even if the underlying data remains unchanged, making it challenging to assess whether the visualizations genuinely differ in terms of the information they convey.

A Breakthrough with ChartDetective

Among various tools like ChartOCR and ChartReader, ChartDetective proved to be the most effective. This tool enabled me to extract data from a range of visualizations, including bar charts, line charts, box plots, and scatter plots. To enhance its capabilities, I modified the codebase to capture pixel values alongside the extracted data and store both in a CSV file. This enhancement allowed for a direct comparison of data values and their corresponding pixel coordinates between two visualizations, focusing on meaningful differences that truly impact data interpretation.

Example: Comparing Two Bar Plots with ChartDetective

Consider two bar plots that visually appear similar but have slight differences in their data values. Using ChartDetective, I extracted the data and pixel coordinates from both plots and stored this information in a CSV file. The tool then compared these values to identify any discrepancies.

For instance, in one bar plot, the heights of specific bars were slightly increased. By comparing the CSV files generated by ChartDetective, I was able to pinpoint these differences precisely. The final step involved highlighting these differences on one of the plots using OpenCV, making it clear where the visualizations diverged. This approach ensures that only meaningful differences, those that reflect changes in the data, are considered when assessing reproducibility.

ChartDetective: An SVG or PDF file of the visualization is uploaded to extract data.
Data Extraction: Data values along with pixel details are stored in CSV files.
Highlighting the differences: Differences are highlighted on one of the plots using OpenCV.
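The comparison step can be sketched as follows, assuming each extracted CSV row has already been parsed into a record holding a label, a data value, and pixel coordinates. The `Bar` struct, the function name, and the tolerance parameter are hypothetical; ChartDetective's actual output format may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// One extracted bar: its label, data value, and pixel position.
struct Bar {
    std::string label;
    double value;
    double pixel_x;
    double pixel_y;
};

// Return the indices of bars whose label or data value differs.
// Pixel coordinates are deliberately ignored when deciding whether a bar
// changed; they are only needed later to draw the highlight.
std::vector<std::size_t> meaningful_differences(const std::vector<Bar>& original,
                                                const std::vector<Bar>& reproduced,
                                                double value_tolerance) {
    std::vector<std::size_t> changed;
    const std::size_t shared = std::min(original.size(), reproduced.size());
    for (std::size_t i = 0; i < shared; ++i) {
        const bool label_differs = original[i].label != reproduced[i].label;
        const bool value_differs =
            std::fabs(original[i].value - reproduced[i].value) > value_tolerance;
        if (label_differs || value_differs) changed.push_back(i);
    }
    // Bars present in only one of the plots always count as differences.
    for (std::size_t i = shared; i < std::max(original.size(), reproduced.size()); ++i) {
        changed.push_back(i);
    }
    return changed;
}
```

Comparing on data values rather than pixels is what keeps a slightly shifted label or bar from being reported when the underlying data is unchanged.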

Understanding User Perspectives on Reproducibility

To complement the technical analysis, I created a pilot survey to understand how users perceive reproducibility in data visualizations. The survey evaluates user interpretations of two visualizations and explores which visual parameters impact their decision-making. This user-centered approach is crucial because even minor differences in visual representation can significantly affect how data is interpreted and used.

Pilot Survey Example:

Pixel Differences: In one scenario, the height of two bars was altered slightly, introducing a noticeable yet subtle change.

Label Swapping: In another scenario, the labels of two bars were swapped without changing their positions or heights.

Participants are asked to evaluate the reproducibility of these visualizations and to consider whether the differences affected their interpretation of the data. The goal is to determine which visual parameters, such as bar height or label positioning, users find most critical when assessing the similarity of visualizations.

Future Work and Conclusion

Going forward, I plan to develop a proof of concept based on these findings and implement an extensive survey to further explore the impact of visual parameters on users’ perceptions of reproducibility. Understanding this will help refine tools and methods for comparing visualizations, ensuring they not only look similar but also accurately represent the same underlying data.

ORAssistant - LLM Assistant for OpenROAD

Tue, 27 Aug 2024 00:00:00 +0000

Introduction

Hello! I’m Palaniappan R, an undergraduate student at BITS Pilani, India. Over the past few months, I’ve been working as a GSoC contributor on the LLM Assistant for OpenROAD - Model Architecture and Prototype project, under the mentorship of Indira Iyer and Jack Luar.

The primary objective of my project is to improve the user experience within OpenROAD and OpenROAD-flow-scripts by using Large Language Models (LLMs) to offer fast, relevant answers to FAQs and common issues. The ORAssistant chatbot aims to act as a first line of support, addressing basic queries in domains such as installation and command usage. Its goal is to resolve simple issues before they escalate to public forums, thereby reducing the number of support tickets on platforms like GitHub Issues.

Architecture Overview

Retrieval-augmented generation (RAG) is a technique that improves the Q&A capabilities and reliability of LLMs by incorporating factual information from external sources. When a user submits a query, the RAG process begins by fetching relevant information from a knowledge base. The retrieved content, combined with the original query, is then provided to the LLM to generate a relevant, informed response.

The Knowledge Base

ORAssistant is designed to answer queries about all the major tools in the OR flow. The knowledge base primarily consists of official documentation from OpenROAD, OpenROAD-flow-scripts, and their respective manpages. Instead of scraping these primary sources from their websites, the docs are built to the desired markdown format directly from the respective GitHub repositories, using specific commit hashes for reproducibility. The knowledge base also includes documentation from other essential applications in the EDA flow, such as Yosys and OpenSTA. Additionally, it includes scraped and annotated conversational data from discussions on the OpenROAD and OpenROAD-flow-scripts GitHub pages.

The entire dataset building process has been automated, allowing for dynamic updates to accommodate any live changes.

The Tool-Based Architecture

After experimenting with multiple RAG approaches, a tool-based setup proved to be the most effective solution. Data from various domains are embedded into vector databases, and hybrid search retriever functions are applied to these vector stores. These functions are organized as individual tools that can be called by the chatbot. To maintain context, each query is rephrased while considering the chat history. This ensures a more precise and context-rich query. Please refer to my previous blog post for more information on the retrieval tools.

As depicted in the flowchart, a preliminary LLM call analyzes the input query, rephrases it based on the chat history and picks the appropriate tools for the rephrased query. Subsequently, documents are retrieved using the tool and sent to the LLM, which produces a relevant, context-aware response.
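The control flow described above can be sketched in a few lines. Everything here is a stand-in: the `Assistant` struct, its callback members, and the tool names are hypothetical, and the LLM calls are represented by plain function objects. For readability the sketch splits the preliminary LLM call into separate rephrase and tool-selection callbacks, although the text describes them as one call.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Documents = std::vector<std::string>;
using Retriever = std::function<Documents(const std::string&)>;

struct Assistant {
    // Tool name -> retrieval function over one domain's vector store.
    std::map<std::string, Retriever> tools;
    // Stand-ins for LLM calls (assumptions for this sketch).
    std::function<std::string(const std::string&, const Documents&)> rephrase;
    std::function<std::string(const std::string&)> pick_tool;
    std::function<std::string(const std::string&, const Documents&)> generate;

    std::string answer(const std::string& query, Documents& history) const {
        // 1. Rewrite the query so it stands alone given the chat history.
        const std::string rephrased = rephrase(query, history);
        // 2. Route the rephrased query to a retrieval tool.
        Documents docs;
        const auto it = tools.find(pick_tool(rephrased));
        if (it != tools.end()) docs = it->second(rephrased);
        // 3. Generate the reply from the query plus the retrieved context.
        const std::string reply = generate(rephrased, docs);
        history.push_back(query);
        history.push_back(reply);
        return reply;
    }
};
```

Keeping each retriever behind a named tool means a new documentation domain can be added by registering one more entry, without touching the routing logic.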

Using ORAssistant

ORAssistant is currently hosted at this link.

To set up ORAssistant locally, find detailed instructions in the GitHub Repo. Both cloud-based LLM providers (Gemini, VertexAI) and local options (Ollama) are supported.

Here’s an example of ORAssistant in action,

Future Plans

To further enhance the usability of ORAssistant, there are plans to add support for flow script generation. This will become possible after adding a dedicated script generation tool to the current tool-based workflow. Support for more tools in the EDA flow, such as KLayout, will also be added in the near future.

Additionally, ORAssistant is planned to be integrated directly into OpenROAD’s CLI and GUI interfaces.

As I near the end of my GSoC, I’d like to thank the GSoC Organizing Committee, UC OSPO and The OpenROAD Project for this incredible opportunity. I’m immensely grateful to Indira Iyer and Jack Luar for their support and guidance throughout my GSoC journey. Thank You.

Final Blogpost: Drift Management Strategies Benchmark

Sat, 24 Aug 2024 00:00:00 +0000

Background

Hello there! I’m William and this is my final blog for my proposal “Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches”, part of the LAST project, under the mentorship of Ray Andrew Sinurat and Sandeep Madireddy.

If you’re not familiar with it, this project aims to address the issue of model aging, where machine learning (ML) models experience a decline in effectiveness over time due to environmental changes, known as drift. My goal is to design an extensible pipeline that evaluates and benchmarks the robustness of state-of-the-art algorithms in addressing these drifts.

Deliverables

You can find my list of deliverables here:

Final report: this blog is a summarized version of my final report, so do take a look if you’d like to know more!
GitHub repository: contains the code as well as the raw experiment results.
Trovi artifact

Evaluation

Here are some of the graphs that show the performance of every algorithm on the created datasets. For more graphs and figures, you can check out my final report:

CIRCLE: AUE demonstrates stability, maintaining high accuracy even as the data drifts, which may be due to its ensemble nature. It is even more stable than the baseline retraining algorithms. Matchmaker also recovers quickly after drift, which may again be because it ranks the highest-performing models for inference, and it recovers faster than RetrainWin. DriftSurf, on the other hand, experiences several random drops in accuracy, indicating that it can be somewhat unstable.
SINE: As with CIRCLE, AUE demonstrates stability throughout the dataset, maintaining high accuracy even as the data drifts. Matchmaker, however, struggled to adapt as quickly to such a sudden drift and needed several windows to recover from the drop. DriftSurf performed notably better than the baselines: unlike them, it recovered fairly quickly after experiencing drift.
CovCon: Matchmaker achieved the best accuracy on CovCon, as it is able to select the models most relevant to each incoming batch (the models trained on the most similar features), performing comparably to retrain window. Most of the other algorithms suffered on this dataset, particularly AUE, whose performance here is only comparable to the rest of the algorithms and the baselines.
IOAdmission: AUE led on this dataset, maintaining the most stable accuracy of all the algorithms, followed closely by Matchmaker. The other algorithms show large fluctuations in accuracy.

Findings / Discussion

From the experiments conducted, the findings are as follows:

Matchmaker performed particularly well on the CovCon dataset. This may be due to its ability to choose the most relevant trained model from its ensemble at inference time. Its training time is also the best of all the algorithms, especially considering that it keeps data to train an additional random forest model for ranking the models. However, its inference time was the longest of all the algorithms. This may be because, at inference time, it must traverse all of the leaf nodes of the random forest used for ranking (to compute covariate shift).
AUE performed particularly well on the CIRCLE and IOAdmission datasets, and it is quite competitive on the other datasets too. Its weighting function, which rewards highly relevant models and evicts less relevant ones, may be key. Its inference time is decent: slower than most baselines and DriftSurf, but faster than Matchmaker. However, its training time was the longest of all the competitors, as it runs an expensive weighting function to weight, evict, or retrain models on every retraining.
DriftSurf performed very similarly to the RetrainWindow baseline on almost all datasets, except for IOAdmission and SINE, where it did better. This may be because it maintains at most two models in every iteration, so its performance was not competitive with the multi-model approach used in Matchmaker and AUE. On the plus side, its inference time is comparable to the single-model baseline, with almost no inference overhead compared to most of the competitors. Another plausible explanation for the weaker performance is the lack of tuning of parameters such as the number of windows retained, the length of its reactive period, and its reactivity sensitivity threshold. Better performance could be achieved if these parameters were tuned further.

Next Steps

These are some of the potential extensions for this project:

Optimize Matchmaker’s inference time: improving Matchmaker’s efficiency, especially in covariate shift ranking, could reduce inference time. Simplifying the random forest traversal could make Matchmaker faster without impacting performance.
Extend the work to other frameworks such as TensorFlow or PyTorch, as the pipeline currently supports only scikit-learn base models.

Thank you for reading!

Hardware Hierarchical Dynamical Systems

Sat, 24 Aug 2024 00:00:00 +0000

Hi everyone! I am Ujjwal Shekhar, a Computer Science student at the International Institute of Information Technology - Hyderabad. I am excited to share my work on the project titled “Hardware Hierarchical Dynamical Systems” as part of the Open Source Research Experience (OSRE) program and Google Summer of Code. This project has been an incredible journey, and I’ve had the privilege of working with my mentors, Jose Renau and Sakshi Garg.

Project Overview and Goals

Abstract Syntax Trees (ASTs) are fundamental to modern compilers, serving as the backbone for parsing and transforming code. When compiling hardware code, the sheer volume of data can make compilation times a significant bottleneck. My project focuses on building a memory-optimized tree data structure specifically tailored for AST-typical queries.

The LiveHD repository, developed by the Micro Architecture Lab at UCSC, offers a compiler infrastructure optimized for hardware synthesis and simulation. The existing LHTree data structure provides a foundation, but there was significant potential for further optimization, which I explored throughout this project.

Key AST Queries

The core queries that the tree is optimized for include:

Finding the parent of a node.
Finding the first and last child of a node.
Locating the previous and next sibling of a node.
Adding a child to a node.
Inserting a sibling to a node.
Performing preorder, postorder, and sibling order traversal.
Removing a leaf or an entire subtree from the tree.

The primary goal was to create a tree class that excels at handling these queries efficiently, while still being robust enough to support less frequent operations. The new HHDS tree structure has demonstrated superior performance for specific tree configurations and continues to show potential across other types, particularly in memory consumption and cache efficiency, compared to the current LHTree.

The benchmarks were run with Google Benchmark to test the tree for scalability and performance. The new version of the tree is currently being integrated into the LiveHD core repository. Profiling to find bottlenecks in the tree was done with Callgrind and KCachegrind.

Background and Motivation

Naive approach

A straightforward method for storing an n-ary tree is to maintain pointers from each node to its parent, children, and immediate siblings. While simple, this approach is memory-intensive and has poor cache efficiency due to the non-contiguous nature of nodes in memory. The variable memory usage per node, depending on the number of children, can also introduce significant overhead.

Enhancements to the Naive Approach

To reduce memory overhead, one optimization is to store only pointers to the first and last child within each node. This reduces memory usage to a constant per node. Additionally, since many AST-related queries focus on the tree’s structure rather than the data itself, we can separate the data from the structure. The tree would store only pointers to the data, allowing the tree structure to be optimized independently of the data storage.
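A minimal sketch of this layout is shown below, assuming 32-bit indices in place of raw pointers and a parallel vector that holds the payload at the same index as its bookkeeping. The names `Node` and `SimpleTree` are illustrative only; this is neither the LHTree nor the HHDS implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::int32_t kInvalid = -1;

// Fixed-size bookkeeping per node, independent of the number of children.
struct Node {
    std::int32_t parent = kInvalid;
    std::int32_t first_child = kInvalid;
    std::int32_t last_child = kInvalid;
    std::int32_t prev_sibling = kInvalid;
    std::int32_t next_sibling = kInvalid;
};

template <typename T>
class SimpleTree {
public:
    std::int32_t add_root(const T& data) { return new_node(kInvalid, data); }

    std::int32_t add_child(std::int32_t parent, const T& data) {
        const std::int32_t id = new_node(parent, data);
        Node& p = nodes_[ix(parent)];  // taken after push_back, so it stays valid
        if (p.first_child == kInvalid) {
            p.first_child = id;
        } else {
            // Link the new node behind the current last child.
            nodes_[ix(p.last_child)].next_sibling = id;
            nodes_[ix(id)].prev_sibling = p.last_child;
        }
        p.last_child = id;
        return id;
    }

    const Node& node(std::int32_t id) const { return nodes_[ix(id)]; }
    const T& data(std::int32_t id) const { return data_[ix(id)]; }

private:
    static std::size_t ix(std::int32_t id) { return static_cast<std::size_t>(id); }

    std::int32_t new_node(std::int32_t parent, const T& data) {
        Node n;
        n.parent = parent;
        nodes_.push_back(n);
        data_.push_back(data);  // payload lives apart from the structure
        return static_cast<std::int32_t>(nodes_.size() - 1);
    }

    std::vector<Node> nodes_;  // structure only
    std::vector<T> data_;      // payload, same index as its node
};
```

Structural queries such as finding a parent or a sibling touch only the compact bookkeeping vector, so the payload type can grow without affecting traversal cost.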

While separating the data and the structure may seem like an obvious improvement, we will see that it can be extended to provide greater benefits.

Improving the cache efficiency

While reducing memory consumption is beneficial, the tree’s cache efficiency can still be suboptimal if the children of a node are scattered in memory. To enhance cache efficiency, storing children in contiguous memory locations is crucial. This improves spatial locality, which in turn boosts cache performance. Additionally, this approach eliminates the need to explicitly store data pointers in the tree, as the data resides at a contiguous memory index aligned with the bookkeeping.

By storing children contiguously, we can also eliminate the need for previous and next sibling pointers, as siblings are inherently adjacent in memory. Similarly, we can avoid storing the parent pointer for every child, since all children share the same parent.

Optimizations in LHTree (Old method)

The LHTree class in LiveHD was designed with these optimizations in mind. It groups siblings into chunks of four, storing the parent pointer only in the first sibling of each chunk. The last sibling in each chunk points to the next chunk, minimizing the number of pointers required and thus reducing memory overhead.

LHTree organizes the entire tree as a 2-dimensional array, where the first dimension represents the tree level and the second dimension represents the node index at that level. This structure improves cache efficiency by storing nodes contiguously in memory. Each tree position is a 48-bit ID, with the last 32 bits representing the node’s index and the first 16 bits indicating the tree level.
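The 48-bit position described above can be illustrated with a small packing helper. The sketch assumes that "first 16 bits" means the high-order bits and stores the value in a 64-bit integer; the function names are hypothetical and not taken from LiveHD.

```cpp
#include <cassert>
#include <cstdint>

// Pack a (level, index) pair: level in bits 32..47, index in bits 0..31.
constexpr std::uint64_t make_position(std::uint16_t level, std::uint32_t index) {
    return (static_cast<std::uint64_t>(level) << 32) | index;
}

constexpr std::uint16_t level_of(std::uint64_t position) {
    return static_cast<std::uint16_t>(position >> 32);
}

constexpr std::uint32_t index_of(std::uint64_t position) {
    return static_cast<std::uint32_t>(position & 0xFFFFFFFFull);
}

// A 16-bit level field can address at most 65,536 levels (0..65535).
static_assert(level_of(make_position(65535, 7)) == 65535,
              "the largest representable level is 65535");
```

With a fixed-width level field, a tree deeper than the field allows cannot be represented at all, which is the scalability limit discussed next.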

This explicit maintenance of level separately limits the tree’s scalability for deeper trees, due to the fixed number of bits allocated for the level.

Despite these optimizations, LHTree has some limitations, particularly in cache alignment and flexibility, which the HHDS tree aims to address.

Unfortunately, the number of bits required by each “chunk” happens to be slightly bigger than a single cache line (512 bits). This means that the cache efficiency of the tree is not optimal.

HHDS Tree: A New Approach

Eliminating Levels

The HHDS tree stores everything in a single vector, removing the need for explicit level information. This simplification not only improves cache efficiency but also eliminates restrictions on the number of nodes per level and the total number of levels.

Enhanced Cache Alignment

In the HHDS tree, each node has a 46-bit ID. Chunks in the HHDS tree contain up to eight children, with the first 43 bits of the absolute ID serving as the chunk ID and the last three bits indicating the node’s offset within the chunk.

For each chunk, which is exactly 64 bytes (or 512 bits) long—matching the size of a cache line—the following information is stored:

A 46-bit parent pointer (absolute ID).
A 43-bit first child long pointer (chunk ID).
A 43-bit last child long pointer (chunk ID).
43-bit previous and next sibling chunk pointers.
Seven 21-bit short delta pointers for the first child.
Seven 21-bit short delta pointers for the last child.

NOTE: The 0th chunk is an INVALID node, the real nodes start from the 1st chunk, with the node at an absolute ID of 8 (chunk ID of 1) being the root node.
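The split of a 46-bit absolute ID into a chunk ID and an offset can be written as two bit operations. The constants below follow the 43-bit and 3-bit widths given above; the helper names are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kChunkShift = 3;   // low bits: slot within the chunk (0..7)
constexpr int kChunkBits = 43;   // high bits: chunk ID
constexpr std::uint64_t kOffsetMask = (std::uint64_t{1} << kChunkShift) - 1;

constexpr std::uint64_t chunk_id(std::uint64_t absolute) {
    return absolute >> kChunkShift;
}

constexpr std::uint64_t chunk_offset(std::uint64_t absolute) {
    return absolute & kOffsetMask;
}

constexpr std::uint64_t absolute_id(std::uint64_t chunk, std::uint64_t offset) {
    return (chunk << kChunkShift) | (offset & kOffsetMask);
}

static_assert(kChunkBits + kChunkShift == 46, "an absolute ID is 46 bits wide");
static_assert(chunk_id(8) == 1 && chunk_offset(8) == 0,
              "the root (absolute ID 8) is slot 0 of chunk 1");
```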

Refer to the next section for more information on the short delta pointers.

Because a chunk always occupies the full 512 bits (64 bytes, exactly one cache line), the memory cost per node is 512 bits in the worst case, when a chunk holds a single node, and 64 bits in the best case, when all 8 slots of the chunk are occupied.
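The field widths listed above can be checked to add up to exactly one cache line. This is just the arithmetic written as compile-time checks; the constant names are illustrative.

```cpp
#include <cassert>

constexpr int kParentBits = 46;        // absolute ID of the parent
constexpr int kChunkPointerBits = 43;  // each long pointer is a chunk ID
constexpr int kShortDeltaBits = 21;    // each short delta pointer

constexpr int chunk_bits() {
    return kParentBits
         + 2 * kChunkPointerBits      // first and last child long pointers
         + 2 * kChunkPointerBits      // previous and next sibling chunk pointers
         + 7 * kShortDeltaBits        // short deltas for the first child
         + 7 * kShortDeltaBits;       // short deltas for the last child
}

constexpr int chunk_bytes() { return chunk_bits() / 8; }

// 46 + 172 + 294 = 512 bits.
static_assert(chunk_bits() == 512, "a chunk fills one 512-bit cache line");
static_assert(chunk_bytes() == 64, "512 bits are 64 bytes");
```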

We used the GCC/Clang __attribute__((packed, aligned(64))) extension in C++ to ensure that each chunk aligns perfectly with a cache line. Bitfields were employed to pack the data efficiently within the chunk.

class __attribute__((packed, aligned(64))) Tree_pointers {
private:
 // We only store the exact ID of parent
 Tree_pos parent : CHUNK_BITS + CHUNK_SHIFT;
 Tree_pos next_sibling : CHUNK_BITS;
 Tree_pos prev_sibling : CHUNK_BITS;

 // Long child pointers
 Tree_pos first_child_l : CHUNK_BITS;
 Tree_pos last_child_l : CHUNK_BITS;

 // Short (delta) child pointers
 // You cannot make an array of bitfields inside a packed
 // struct, since the compiler will align each bitfield to the
 // size of the nearest power of two.
 Short_delta first_child_s_0 : SHORT_DELTA;
 Short_delta first_child_s_1 : SHORT_DELTA;
 Short_delta first_child_s_2 : SHORT_DELTA;
 Short_delta first_child_s_3 : SHORT_DELTA;
 Short_delta first_child_s_4 : SHORT_DELTA;
 Short_delta first_child_s_5 : SHORT_DELTA;
 Short_delta first_child_s_6 : SHORT_DELTA;

 Short_delta last_child_s_0 : SHORT_DELTA;
 Short_delta last_child_s_1 : SHORT_DELTA;
 Short_delta last_child_s_2 : SHORT_DELTA;
 Short_delta last_child_s_3 : SHORT_DELTA;
 Short_delta last_child_s_4 : SHORT_DELTA;
 Short_delta last_child_s_5 : SHORT_DELTA;
 Short_delta last_child_s_6 : SHORT_DELTA;
};

Build Append - Short Delta Heuristic

Empirical observations show that children are often added to a node shortly after the parent, meaning they are stored close to the parent in memory. This allows children to be stored as a delta from the parent, reducing the need for full chunk IDs.

When adding a child:

Attempt to store the child as a delta from the parent.
If not feasible, allocate a new chunk for the parent and store the pointer to the child chunk in the newly created parent chunk.

Implementing chunk breaking required careful handling to ensure that when a parent moves to a new chunk, its new chunk can still be referenced efficiently by its parent, potentially requiring recursive adjustments.

This is because the grandparent might not be able to store the parent as a delta from itself after the parent moves to a new chunk.
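The feasibility check behind the heuristic can be sketched as a range test on the distance between two chunk IDs. The sketch assumes the 21-bit short delta is a signed two's-complement value, which the text does not state, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kShortDeltaBits = 21;
// Range of a signed 21-bit field (assumption: deltas are signed).
constexpr std::int64_t kDeltaMax = (std::int64_t{1} << (kShortDeltaBits - 1)) - 1;
constexpr std::int64_t kDeltaMin = -(std::int64_t{1} << (kShortDeltaBits - 1));

// Try to express the child's chunk as a short delta from the parent's chunk.
// Returns false when the distance does not fit, in which case the caller
// would fall back to a long (full chunk ID) pointer.
bool try_short_delta(std::uint64_t parent_chunk, std::uint64_t child_chunk,
                     std::int32_t& delta_out) {
    const std::int64_t delta = static_cast<std::int64_t>(child_chunk)
                             - static_cast<std::int64_t>(parent_chunk);
    if (delta < kDeltaMin || delta > kDeltaMax) return false;
    delta_out = static_cast<std::int32_t>(delta);
    return true;
}
```

When the check fails for a node, that node is moved to a fresh chunk, and the same check then has to be repeated one level up for the pointer from its own parent, which is where the recursive adjustment comes from.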

Compliance with the LiveHD core repository

Since the HHDS tree is an evolution of the LHTree, it was crucial to maintain compatibility with the LiveHD core repository. All necessary methods were implemented in the HHDS tree to ensure seamless integration. Naming conventions and syntax were kept consistent with the LHTree to facilitate a smooth transition.

Exposed methods in the HHDS tree are:

/**
 * Query based API (no updates)
 */
 Tree_pos get_parent (const Tree_pos& curr_index) const;
 Tree_pos get_last_child (const Tree_pos& parent_index) const;
 Tree_pos get_first_child (const Tree_pos& parent_index) const;
 bool is_last_child (const Tree_pos& self_index) const;
 bool is_first_child (const Tree_pos& self_index) const;
 Tree_pos get_sibling_next (const Tree_pos& sibling_id) const;
 Tree_pos get_sibling_prev (const Tree_pos& sibling_id) const;
 bool is_leaf (const Tree_pos& leaf_index) const;


/**
 * Update based API (Adds and Deletes from the tree)
 */
 // FREQUENT UPDATES
 Tree_pos append_sibling(const Tree_pos& sibling_id, const X& data);
 Tree_pos add_child(const Tree_pos& parent_index, const X& data);
 Tree_pos add_root(const X& data);

 void delete_leaf(const Tree_pos& leaf_index);
 void delete_subtree(const Tree_pos& subtree_root);

 // INFREQUENT UPDATES
 Tree_pos insert_next_sibling(const Tree_pos& sibling_id,
 const X& data);

Benchmarking Results

Preliminary benchmarks indicate that the HHDS tree outperforms the LHTree in both runtime efficiency (for certain cases, more on this in a later section) and memory consumption. The HHDS tree demonstrates enhanced performance across various tests, offering a more optimized solution for handling Abstract Syntax Tree (AST) operations.

I constructed identical trees using both the LHTree and HHDS tree structures and executed a series of queries on each. The benchmarks were performed using Google Benchmark to ensure accurate and consistent results. Below, I detail the specific tests conducted.

Benchmark Tests Overview

Deep Tree Test
This test simulates a line graph by repeatedly adding a child to the last node in the tree. It is designed to assess the tree’s performance when handling deep structures, where each node has a single child.
Wide Tree Test
In this scenario, a single root node is created, followed by the addition of numerous child nodes directly under the root. This test evaluates the tree’s efficiency in managing wide structures with many immediate children.
Chip-Typical Tree Test
This test models a tree commonly seen in hardware design. For each node, a random number of children (ranging from 1 to 7) are added, and the process is recursively applied to the leaf nodes up to a certain depth. This test measures the tree’s performance in realistic, varied conditions.
Chip-Typical (Long) Tree Test
Similar to the Chip-Typical Tree Test, but with a broader range of children per node (1 to 20). This test is particularly useful for examining performance when the tree is more complex and chunk splitting is more likely.

These tests provide a comprehensive analysis of the HHDS tree’s capabilities, highlighting its superiority over the LHTree for deeper trees.
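The tree shapes used in these tests can be sketched with a simple parent-index vector. This representation is a stand-in chosen for brevity, not the HHDS or LHTree API, and the function names, the meaning of the size arguments (total node count), and the seed are assumptions.

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Each function returns parent[i] for every node i; the root has parent -1.

// Deep tree: a chain where every node has exactly one child.
std::vector<int> build_deep_tree(int nodes) {
    std::vector<int> parent;
    for (int i = 0; i < nodes; ++i) parent.push_back(i - 1);
    return parent;
}

// Wide tree: every non-root node is a direct child of the root.
std::vector<int> build_wide_tree(int nodes) {
    std::vector<int> parent;
    for (int i = 0; i < nodes; ++i) parent.push_back(i == 0 ? -1 : 0);
    return parent;
}

// Chip-typical tree: every node on the current frontier receives a random
// number of children in [1, max_children], repeated for `depth` levels.
std::vector<int> build_chip_typical_tree(int depth, int max_children, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> children(1, max_children);
    std::vector<int> parent{-1};
    std::vector<int> frontier{0};
    for (int level = 0; level < depth; ++level) {
        std::vector<int> next;
        for (int node : frontier) {
            const int count = children(rng);
            for (int c = 0; c < count; ++c) {
                parent.push_back(node);
                next.push_back(static_cast<int>(parent.size()) - 1);
            }
        }
        frontier = std::move(next);
    }
    return parent;
}
```

Calling `build_chip_typical_tree(depth, 7, seed)` mirrors the Chip-Typical test and `build_chip_typical_tree(depth, 20, seed)` mirrors the long variant; the node count grows roughly geometrically with depth, which is why those tests are parameterized by a small integer rather than by a node count.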

Add/Append Benchmarks

Deep Tree Test

test_deep_tree_100_hhds indicates the time taken to run a benchmark on a deep tree of 100 nodes using the HHDS tree structure. This nomenclature is consistent across all tests.

Disabled compiler optimizations

------------------------------------------
Benchmark Time
------------------------------------------
test_deep_tree_10_hhds 11704 ns
test_deep_tree_10_lh 19541 ns
test_deep_tree_100_hhds 85317 ns
test_deep_tree_100_lh 163058 ns
test_deep_tree_1000_hhds 760260 ns
test_deep_tree_1000_lh 1442391 ns
test_deep_tree_10000_hhds 9889199 ns
test_deep_tree_10000_lh 16215232 ns
test_deep_tree_100000_hhds 84650074 ns
test_deep_tree_100000_lh 163255882 ns
test_deep_tree_1000000_hhds 877646208 ns
test_deep_tree_1000000_lh 1659725904 ns
test_deep_tree_10000000_hhds 9256118059 ns
test_deep_tree_10000000_lh 1.4431e+10 ns

Enabled compiler optimizations

------------------------------------------
Benchmark Time
------------------------------------------
test_deep_tree_10_hhds 1443 ns
test_deep_tree_10_lh 1462 ns
test_deep_tree_100_hhds 7398 ns
test_deep_tree_100_lh 17455 ns
test_deep_tree_1000_hhds 79544 ns
test_deep_tree_1000_lh 165656 ns
test_deep_tree_10000_hhds 1337406 ns
test_deep_tree_10000_lh 1494153 ns
test_deep_tree_100000_hhds 12288324 ns
test_deep_tree_100000_lh 14897463 ns
test_deep_tree_1000000_hhds 116810846 ns
test_deep_tree_1000000_lh 188815892 ns
test_deep_tree_10000000_hhds 2338596582 ns
test_deep_tree_10000000_lh 2238844395 ns

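Here, the HHDS tree outperforms the LHTree in almost every Deep Tree Test configuration, showcasing its efficiency in handling deep tree structures. The one exception is the largest run with compiler optimizations enabled (10,000,000 nodes), where the LHTree is roughly 4% faster.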

Wide Tree Test

Disabled compiler optimizations

------------------------------------------
Benchmark Time
------------------------------------------
test_wide_tree_10_hhds 6581 ns
test_wide_tree_10_lh 6235 ns
test_wide_tree_100_hhds 34911 ns
test_wide_tree_100_lh 35734 ns
test_wide_tree_1000_hhds 323228 ns
test_wide_tree_1000_lh 312755 ns
test_wide_tree_10000_hhds 3547963 ns
test_wide_tree_10000_lh 2975894 ns
test_wide_tree_100000_hhds 33800125 ns
test_wide_tree_100000_lh 32538424 ns
test_wide_tree_1000000_hhds 332509041 ns
test_wide_tree_1000000_lh 336261868 ns
test_wide_tree_10000000_hhds 3527352810 ns
test_wide_tree_10000000_lh 8774024963 ns

Enabled compiler optimizations

------------------------------------------
Benchmark Time
------------------------------------------
test_wide_tree_10_hhds 837 ns
test_wide_tree_10_lh 512 ns
test_wide_tree_100_hhds 3394 ns
test_wide_tree_100_lh 2675 ns
test_wide_tree_1000_hhds 26019 ns
test_wide_tree_1000_lh 20141 ns
test_wide_tree_10000_hhds 319068 ns
test_wide_tree_10000_lh 245964 ns
test_wide_tree_100000_hhds 3369183 ns
test_wide_tree_100000_lh 2910862 ns
test_wide_tree_1000000_hhds 39243340 ns
test_wide_tree_1000000_lh 26777306 ns
test_wide_tree_10000000_hhds 454508781 ns
test_wide_tree_10000000_lh 331688046 ns

Without compiler optimizations, the two structures are close in the Wide Tree Test for most sizes, and the HHDS tree pulls clearly ahead only for the largest tree (10,000,000 nodes), where it is roughly 2.5x faster. With compiler optimizations enabled, however, the LHTree is faster at every size tested.

The HHDS tree’s advantage on very large trees can likely be attributed to its larger chunk size, which allows for better cache utilization and reduced memory overhead. However, the LHTree has been through more tuning and has been in use for longer, which could explain its better performance with compiler optimizations. In the future, the HHDS tree could be optimized further to match or exceed the LHTree’s performance.

Chip Typical Tree Test

Disabled compiler optimizations

--------------------------------------------------------
Benchmark Time
--------------------------------------------------------
test_chip_typical_tree_1_hhds 7109 ns
test_chip_typical_tree_1_lh 6803 ns
test_chip_typical_tree_2_hhds 22728 ns
test_chip_typical_tree_2_lh 22064 ns
test_chip_typical_tree_3_hhds 75398 ns
test_chip_typical_tree_3_lh 70910 ns
test_chip_typical_tree_4_hhds 270062 ns
test_chip_typical_tree_4_lh 254423 ns
test_chip_typical_tree_5_hhds 1110254 ns
test_chip_typical_tree_5_lh 1074439 ns
test_chip_typical_tree_6_hhds 5024264 ns
test_chip_typical_tree_6_lh 3900709 ns
test_chip_typical_tree_7_hhds/iterations:5 13290739 ns
test_chip_typical_tree_7_lh/iterations:5 22145462 ns
test_chip_typical_tree_8_hhds/iterations:5 83438683 ns
test_chip_typical_tree_8_lh/iterations:5 105475664 ns

Enabled compiler optimizations

--------------------------------------------------------
Benchmark Time
--------------------------------------------------------
test_chip_typical_tree_1_hhds 938 ns
test_chip_typical_tree_1_lh 387 ns
test_chip_typical_tree_2_hhds 1877 ns
test_chip_typical_tree_2_lh 1351 ns
test_chip_typical_tree_3_hhds 7095 ns
test_chip_typical_tree_3_lh 5052 ns
test_chip_typical_tree_4_hhds 35019 ns
test_chip_typical_tree_4_lh 21569 ns
test_chip_typical_tree_5_hhds 130915 ns
test_chip_typical_tree_5_lh 78010 ns
test_chip_typical_tree_6_hhds 522385 ns
test_chip_typical_tree_6_lh 278223 ns
test_chip_typical_tree_7_hhds/iterations:5 4015636 ns
test_chip_typical_tree_7_lh/iterations:5 1648426 ns
test_chip_typical_tree_8_hhds/iterations:5 9873724 ns
test_chip_typical_tree_8_lh/iterations:5 4607773 ns

For the Chip Typical test, the HHDS tree’s performance is better for larger tree sizes, while the LHTree performs better for smaller tree sizes. However, with compiler optimizations, the LH Tree performs better than the HHDS tree.

Chip Typical (long) Tree test

Disabled compiler optimizations

-------------------------------------------------------------
Benchmark Time
-------------------------------------------------------------
test_chip_typical_long_tree_1_hhds 8875 ns
test_chip_typical_long_tree_1_lh 8479 ns
test_chip_typical_long_tree_2_hhds 62490 ns
test_chip_typical_long_tree_2_lh 64620 ns
test_chip_typical_long_tree_3_hhds 625064 ns
test_chip_typical_long_tree_3_lh 654787 ns
test_chip_typical_long_tree_4_hhds 6128047 ns
test_chip_typical_long_tree_4_lh 6528778 ns
test_chip_typical_long_tree_5_hhds 71345448 ns
test_chip_typical_long_tree_5_lh 77170587 ns
test_chip_typical_long_tree_6_hhds/iterations:5 656595039 ns
test_chip_typical_long_tree_6_lh/iterations:5 860193491 ns

Enabled compiler optimizations

-------------------------------------------------------------
Benchmark Time
-------------------------------------------------------------
test_chip_typical_long_tree_1_hhds 1139 ns
test_chip_typical_long_tree_1_lh 692 ns
test_chip_typical_long_tree_2_hhds 8666 ns
test_chip_typical_long_tree_2_lh 5238 ns
test_chip_typical_long_tree_3_hhds 90856 ns
test_chip_typical_long_tree_3_lh 48758 ns
test_chip_typical_long_tree_4_hhds 1034346 ns
test_chip_typical_long_tree_4_lh 472964 ns
test_chip_typical_long_tree_5_hhds 13040238 ns
test_chip_typical_long_tree_5_lh 5025192 ns
test_chip_typical_long_tree_6_hhds/iterations:3 131143411 ns
test_chip_typical_long_tree_6_lh/iterations:3 68739573 ns

Similar to the previous case, the HHDS tree performs better in debug mode (without compiler optimizations). However, the LH Tree performs better with compiler optimizations.

We see that the HHDS tree has shown overall better performance without compiler optimizations; with compiler optimizations, however, the LH Tree has shown better performance. For the Deep Tree test, the HHDS tree has shown better performance regardless. This indicates an inherent trade-off in the choice between the two trees. To further investigate this behaviour, I conducted some profiling, which is covered in a later section.

Iterators Benchmarks

Deep Tree test

Disabled compiler optimizations

--------------------------------------------------------
Benchmark Time
-------------------------------------------------------
test_deep_tree_10_hhds 884 ns
test_deep_tree_10_lh 1356 ns
test_deep_tree_100_hhds 7987 ns
test_deep_tree_100_lh 11191 ns
test_deep_tree_1000_hhds 86991 ns
test_deep_tree_1000_lh 105809 ns
test_deep_tree_10000_hhds 894127 ns
test_deep_tree_10000_lh 1076983 ns
test_deep_tree_100000_hhds 7927102 ns
test_deep_tree_100000_lh 11177187 ns
test_deep_tree_1000000_hhds/iterations:4 80470145 ns
test_deep_tree_1000000_lh/iterations:4 145763040 ns
test_deep_tree_10000000_hhds/iterations:3 1055529435 ns
test_deep_tree_10000000_lh/iterations:3 995416880 ns

Enabled compiler optimizations

------------------------------------------
Benchmark Time
------------------------------------------
test_deep_tree_10_hhds 202 ns
test_deep_tree_10_lh 93.1 ns
test_deep_tree_100_hhds 1595 ns
test_deep_tree_100_lh 1039 ns
test_deep_tree_1000_hhds 15663 ns
test_deep_tree_1000_lh 11000 ns
test_deep_tree_10000_hhds 164778 ns
test_deep_tree_10000_lh 107293 ns
test_deep_tree_100000_hhds 1615928 ns
test_deep_tree_100000_lh 1260507 ns
test_deep_tree_1000000_hhds 19582402 ns
test_deep_tree_1000000_lh 15954697 ns
test_deep_tree_10000000_hhds 214887559 ns
test_deep_tree_10000000_lh 179118729 ns

Wide Tree test

Disabled compiler optimizations

-------------------------------------------------------
Benchmark Time
-------------------------------------------------------
test_wide_tree_10_hhds 7171 ns
test_wide_tree_10_lh 7098 ns
test_wide_tree_100_hhds 6204 ns
test_wide_tree_100_lh 10372 ns
test_wide_tree_1000_hhds 62762 ns
test_wide_tree_1000_lh 106132 ns
test_wide_tree_10000_hhds 622999 ns
test_wide_tree_10000_lh 1124283 ns
test_wide_tree_100000_hhds 6118490 ns
test_wide_tree_100000_lh 9550170 ns
test_wide_tree_1000000_hhds/iterations:10 59438777 ns
test_wide_tree_1000000_lh/iterations:10 97842431 ns
test_wide_tree_10000000_hhds/iterations:7 778347697 ns
test_wide_tree_10000000_lh/iterations:7 1163215808 ns

Enabled compiler optimizations

------------------------------------------
Benchmark Time
------------------------------------------
test_wide_tree_10_hhds 2103 ns
test_wide_tree_10_lh 1284 ns
test_wide_tree_100_hhds 1563 ns
test_wide_tree_100_lh 632 ns
test_wide_tree_1000_hhds 15627 ns
test_wide_tree_1000_lh 6410 ns
test_wide_tree_10000_hhds 149588 ns
test_wide_tree_10000_lh 56030 ns
test_wide_tree_100000_hhds 1511278 ns
test_wide_tree_100000_lh 563926 ns
test_wide_tree_1000000_hhds 17056051 ns
test_wide_tree_1000000_lh 7754815 ns
test_wide_tree_10000000_hhds 143994848 ns
test_wide_tree_10000000_lh 55040231 ns

Chip typical test

Disabled compiler optimizations

--------------------------------------------------------
Benchmark Time
--------------------------------------------------------
test_chip_typical_tree_1_hhds 344 ns
test_chip_typical_tree_1_lh 892 ns
test_chip_typical_tree_2_hhds 2192 ns
test_chip_typical_tree_2_lh 1691 ns
test_chip_typical_tree_3_hhds 13628 ns
test_chip_typical_tree_3_lh 14235 ns
test_chip_typical_tree_4_hhds 34049 ns
test_chip_typical_tree_4_lh 84096 ns
test_chip_typical_tree_5_hhds 206482 ns
test_chip_typical_tree_5_lh 203680 ns
test_chip_typical_tree_6_hhds 848996 ns
test_chip_typical_tree_6_lh 708212 ns
test_chip_typical_tree_7_hhds/iterations:5 3645372 ns
test_chip_typical_tree_7_lh/iterations:5 6657982 ns
test_chip_typical_tree_8_hhds/iterations:5 7375050 ns
test_chip_typical_tree_8_lh/iterations:5 4577351 ns

Enabled compiler optimizations

-------------------------------------------
Benchmark Time
-------------------------------------------
test_chip_typical_tree_1_hhds 93.1 ns
test_chip_typical_tree_1_lh 50.1 ns
test_chip_typical_tree_2_hhds 149 ns
test_chip_typical_tree_2_lh 212 ns
test_chip_typical_tree_3_hhds 1166 ns
test_chip_typical_tree_3_lh 554 ns
test_chip_typical_tree_4_hhds 7385 ns
test_chip_typical_tree_4_lh 3138 ns
test_chip_typical_tree_5_hhds 54477 ns
test_chip_typical_tree_5_lh 10643 ns
test_chip_typical_tree_6_hhds 215050 ns
test_chip_typical_tree_6_lh 53043 ns
test_chip_typical_tree_7_hhds 492555 ns
test_chip_typical_tree_7_lh 577120 ns
test_chip_typical_tree_8_hhds 2630675 ns
test_chip_typical_tree_8_lh 1278702 ns

Chip typical (long) test

Disabled compiler optimizations

------------------------------------------------
Benchmark Time
------------------------------------------------
test_chip_typical_long_tree_1_hhds 911 ns
test_chip_typical_long_tree_1_lh 1435 ns
test_chip_typical_long_tree_2_hhds 8161 ns
test_chip_typical_long_tree_2_lh 8619 ns
test_chip_typical_long_tree_3_hhds 76618 ns
test_chip_typical_long_tree_3_lh 132467 ns
test_chip_typical_long_tree_4_hhds 1644808 ns
test_chip_typical_long_tree_4_lh 1962406 ns
test_chip_typical_long_tree_5_hhds 7199648 ns
test_chip_typical_long_tree_5_lh 9195894 ns
test_chip_typical_long_tree_6_hhds 169002499 ns
test_chip_typical_long_tree_6_lh 207296570 ns

Enabled compiler optimizations

------------------------------------------------
Benchmark Time
------------------------------------------------
test_chip_typical_long_tree_1_hhds 223 ns
test_chip_typical_long_tree_1_lh 101 ns
test_chip_typical_long_tree_2_hhds 2270 ns
test_chip_typical_long_tree_2_lh 719 ns
test_chip_typical_long_tree_3_hhds 38291 ns
test_chip_typical_long_tree_3_lh 12547 ns
test_chip_typical_long_tree_4_hhds 294222 ns
test_chip_typical_long_tree_4_lh 187010 ns
test_chip_typical_long_tree_5_hhds 4721230 ns
test_chip_typical_long_tree_5_lh 835256 ns
test_chip_typical_long_tree_6_hhds 30302468 ns
test_chip_typical_long_tree_6_lh 10057136 ns

Overall, the add/append and iterator benchmarks show the same pattern. Without compiler optimizations, the HHDS tree generally performs better than the LH Tree. With compiler optimizations, the LH Tree is generally faster, and the traversal benchmarks show similar differences. We will now look at some profiling that was done to identify the bottlenecks in the HHDS tree.

Exceptions, and a reminder of why they are slow.

When looking at the performance difference between the HHDS tree and LH tree (after enabling compiler optimizations), I was shocked to see that the HHDS tree was performing worse than the LH tree by multiple orders of magnitude upon using exceptions. This was a surprise to me, as I had not expected exceptions to have such a large impact on performance.

This happens because throwing exceptions is slow. When an exception is thrown, the stack is unwound, and the program has to jump to the catch block. This is a slow process, and should be avoided in performance-critical code. Moreover, the compiler cannot optimize code with exceptions as well as it can without them. This is why the HHDS tree performs so much worse than the LH tree when exceptions are enabled. Even without exceptions, however, the HHDS tree still wasn’t performing as well as it should have been.
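As a minimal sketch of the trade-off described above (this is not the HHDS code; all names here are illustrative), the same bounds check can report failure by throwing, by returning an empty std::optional, or by a debug-only assert. Only the first variant involves stack unwinding on failure.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical accessor that signals a bad index by throwing.
inline int get_or_throw(const std::vector<int>& data, std::size_t idx) {
  if (idx >= data.size()) {
    throw std::out_of_range("index out of range");
  }
  return data[idx];
}

// Same check, but failure is an ordinary return value: no stack unwinding.
inline std::optional<int> get_checked(const std::vector<int>& data,
                                      std::size_t idx) noexcept {
  if (idx >= data.size()) {
    return std::nullopt;
  }
  return data[idx];
}

// Debug-only check: the assert is compiled out when NDEBUG is defined.
inline int get_unchecked(const std::vector<int>& data, std::size_t idx) noexcept {
  assert(idx < data.size());
  return data[idx];
}

// Counts how many probed indices are valid, using the non-throwing path.
inline std::size_t count_valid(const std::vector<int>& data,
                               const std::vector<std::size_t>& probes) {
  std::size_t valid = 0;
  for (std::size_t idx : probes) {
    if (get_checked(data, idx).has_value()) {
      ++valid;
    }
  }
  return valid;
}
```

Which variant fits depends on whether an invalid index is a programming error (assert) or an expected runtime condition (optional); the sketch assumes C++17 for std::optional.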

Profiling

I used callgrind to profile the HHDS tree and identify potential bottlenecks. The profiling results provided valuable insights into the tree’s performance and areas for optimization. I generated a call graph using KCachegrind and analyzed the function calls to determine the most time-consuming operations.

The call graph clearly shows that the bottleneck is the _create_space call that is tasked with creating space for a new node. This function is called when a new node is added to the tree, and its performance directly impacts the tree’s efficiency.

inline Tree_pos _create_space(const X& data) {
  // Make space for CHUNK_SIZE number of entries at the end
  data_stack.emplace_back(data);
  for (int i = 0; i < CHUNK_MASK; i++) {
    data_stack.emplace_back();
  }

  // Add the single pointer node for all CHUNK_SIZE entries
  pointers_stack.emplace_back();

  return pointers_stack.size() - 1;
}

However, the _create_space function is relatively simple and should not be causing such a significant performance hit. This indicates that the issue may lie in the memory allocation process or the data structure itself. One possible way of dealing with this would be to increase chunk sizes, or enable dynamic chunk sizing, which would allow for more efficient memory allocation.
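One thing that could be tried on the allocation side, sketched here under assumptions rather than measured: grow the data vector by a whole chunk with a single resize() instead of CHUNK_SIZE separate emplace_back() calls. The function name create_space_bulk is hypothetical, and CHUNK_SIZE = 8 is taken from the fixed 8-sized chunks mentioned later in this post.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t CHUNK_SIZE = 8;  // assumed: fixed 8-entry chunks

// Hypothetical variant of chunk allocation: one resize() per chunk.
// Returns the index of the newly created chunk.
template <typename X>
std::size_t create_space_bulk(std::vector<X>& data_stack, const X& data) {
  // The stack is expected to always hold a whole number of chunks.
  assert(data_stack.size() % CHUNK_SIZE == 0);
  const std::size_t first = data_stack.size();
  data_stack.resize(first + CHUNK_SIZE);  // value-initializes all new entries
  data_stack[first] = data;               // first slot holds the new node's data
  return first / CHUNK_SIZE;
}
```

Whether this helps in practice depends on how the vector grows; calling reserve() up front when the final tree size is roughly known would be a complementary option.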

Another possible bottleneck seems to be the computation needed to find the next vacant space in the chunk (like in get_last_child()). Because the chunk has a fixed size, if the chunk is full, the program has to search for the next chunk that has space. This is a linear operation, and can be slow for wide trees. To fix this, I tried to add extra bookkeeping in the Tree_pointers node structure:

class __attribute__((packed, aligned(64))) Tree_pointers {
private:
  // We only store the exact ID of parent
  Tree_pos parent : CHUNK_BITS + CHUNK_SHIFT;
  Tree_pos next_sibling : CHUNK_BITS;
  Tree_pos prev_sibling : CHUNK_BITS;

  // Long child pointers
  Tree_pos first_child_l : CHUNK_BITS;
  Tree_pos last_child_l : CHUNK_BITS;

  // Storing the last occupied index in the short delta
  // This is to avoid iterating over all short deltas
  // to find the last occupied index
  unsigned short last_occupied : CHUNK_SHIFT;

  // Short (delta) child pointers
  Short_delta first_child_s_0 : SHORT_DELTA;
  Short_delta first_child_s_1 : SHORT_DELTA;
  ...

However, the improvement in performance was marginal after making this change. This indicates that the issue may be more complex and require further investigation. This tree has also been added to the repository, in case a future contributor might be able to make use of it.
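The bookkeeping idea can be sketched in isolation as follows. This is a simplified stand-in, not the actual Tree_pointers layout: SLOTS, ChunkScan, and ChunkCached are hypothetical names, and a value of 0 is assumed to mean "empty slot".

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int SLOTS = 7;  // illustrative number of short-delta slots

// Baseline: find the last occupied slot by scanning from the end.
struct ChunkScan {
  std::array<std::int32_t, SLOTS> child{};  // 0 means "empty"

  int last_occupied() const {
    for (int i = SLOTS - 1; i >= 0; --i) {
      if (child[i] != 0) return i;
    }
    return -1;  // nothing stored yet
  }
};

// With bookkeeping: the index is cached and updated on every insert.
struct ChunkCached {
  std::array<std::int32_t, SLOTS> child{};
  int last = -1;  // cached index of the last occupied slot

  // Appends a value; returns false when the chunk is already full.
  bool push(std::int32_t v) {
    assert(v != 0);  // 0 is reserved for "empty"
    if (last + 1 >= SLOTS) return false;
    ++last;
    child[last] = v;
    return true;
  }

  int last_occupied() const { return last; }  // O(1), no scan
};
```

With only a handful of slots per chunk, the scan is already short, which may be part of why the measured gain was marginal.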

There are other possible bottlenecks that might be coming from storing separate short deltas instead of reducing the size of the delta and packing it into a single large integer type. I will be implementing this idea in the future.
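A rough sketch of what packing could look like, assuming unsigned 8-bit deltas and eight of them per 64-bit word (both widths are assumptions for illustration, not the HHDS values):

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned DELTA_BITS = 8;  // assumed width of one delta
constexpr unsigned NUM_DELTAS = 8;  // 8 slots * 8 bits = 64 bits
constexpr std::uint64_t DELTA_MASK = (std::uint64_t{1} << DELTA_BITS) - 1;

// Returns a copy of `packed` with slot `slot` replaced by `value`.
inline std::uint64_t set_delta(std::uint64_t packed, unsigned slot,
                               std::uint64_t value) {
  assert(slot < NUM_DELTAS && value <= DELTA_MASK);
  const unsigned shift = slot * DELTA_BITS;
  return (packed & ~(DELTA_MASK << shift)) | (value << shift);
}

// Extracts the delta stored in slot `slot`.
inline std::uint64_t get_delta(std::uint64_t packed, unsigned slot) {
  assert(slot < NUM_DELTAS);
  return (packed >> (slot * DELTA_BITS)) & DELTA_MASK;
}
```

Signed deltas would need an extra sign-extension step on read, which this sketch leaves out.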

Code contributions

All of my pull requests and code changes were made on the HHDS repository. Each contribution has undergone thorough review and been successfully merged into the main repository.

Additionally, we are planning to integrate these changes into the LiveHD repository in the near future.

Conclusion and Future Work

Working on this project has been a valuable learning experience, particularly in applying core C++ features. I discovered that simple, fundamentally sound optimizations often outperform more complex ones. The greatest challenge for me was to steer through the changes in our original Plan of Action. However, thanks to the support and guidance of my mentors, I was able to make it.

There are still areas where the HHDS tree can be improved to make it more robust. One area of future exploration is dynamic chunk sizing:

Dynamic Chunk Sizing: Instead of using fixed 8-sized chunks as we did, we could implement multiple chunk sizes. This would allow users to “hint” the HHDS tree to use specific chunk types, potentially reducing memory consumption further.

Overall, the HHDS tree has shown promise in handling deep tree structures efficiently. With further optimization and enhancements, it can become a powerful tool for handling complex tree operations.

Acknowledgements

I would like to thank my mentors, Jose Renau and Sakshi Garg, for their guidance and support throughout the project. It would not have been possible without their help. Their insights and mentorship have significantly contributed to my learning and the success of this work.

Reproducing and addressing Data Leakage issue: Duplicates in dataset

Fri, 23 Aug 2024 00:00:00 +0000

Hello!

In this blog post, I will explore a common issue in machine learning called data leakage, using an example from the paper:

Benedetti, P., Perri, D., Simonetti, M., Gervasi, O., Reali, G., Femminella, M. (2020). Skin Cancer Classification Using Inception Network and Transfer Learning. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science, vol 12249. Springer, Cham. https://doi.org/10.1007/978-3-030-58799-4_39

Overview of the Paper

In this paper, the authors use transfer learning on a pretrained convolutional neural network (CNN) to classify skin lesions in dermatoscopic images from the HAM10000 (Human Against Machine with 10,000 training images) dataset. The paper reports a final accuracy of 78.9% on the validation set.

While this reported result appears to be impressive, there are concerns regarding the validity of this performance metric due to data leakage. Data leakage occurs when the model is trained or evaluated on data that it would not have access to during real-world deployment, leading to an overestimation of the model’s true performance.

Identifying Data Leakage in the Original Paper

Upon closer inspection, it appears that the original experiment suffers from data leakage in two significant ways:

Duplicate Images in Training and Validation Sets:

The HAM10000 dataset contains near-duplicate images of the same lesions, and in the original experiment these end up in both the training and validation sets. This results in the model seeing very similar images during training and then again during validation. Consequently, the model’s performance is artificially inflated because it has already been “trained” on images similar to those in the validation set, making the task easier than it should be.

Using the Validation Set for Early Stopping and Final Evaluation:

Another critical issue is the use of the validation set for both early stopping and final model evaluation. Early stopping is a technique where training is halted when the model’s performance on a validation set no longer improves, preventing overfitting. However, if this same validation set is later used to evaluate the model’s final performance, it can lead to overfitting on the validation data itself, resulting in an overly optimistic estimate of model accuracy.
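The early-stopping rule described above can be sketched as a small function over per-epoch validation losses. The patience parameter and the function name are assumptions for illustration; the paper's exact stopping criterion is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the epoch whose weights would be kept: the best
// validation loss seen before `patience` epochs pass without improvement.
inline std::size_t early_stop_epoch(const std::vector<double>& val_loss,
                                    std::size_t patience) {
  assert(!val_loss.empty());
  std::size_t best = 0;
  for (std::size_t epoch = 1; epoch < val_loss.size(); ++epoch) {
    if (val_loss[epoch] < val_loss[best]) {
      best = epoch;  // new best: keep training
    } else if (epoch - best >= patience) {
      break;  // no improvement for `patience` epochs: stop
    }
  }
  return best;
}
```

Because the chosen epoch is selected using the validation losses, the accuracy at that epoch on the same validation set is a biased estimate; the final number should come from a separate test set that played no role in this selection.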

Our Reproduction and Results

To demonstrate the impact of these data leakage issues, we reproduced the experiment with corrected methodologies:

Corrected Data Split: We ensured that there were no duplicate images between the training and validation sets. This setup is crucial to simulate a realistic scenario where the model encounters completely unseen data during validation.
Separate Validation and Test Sets: We introduced a distinct test set to evaluate the final model performance, independent of the data used for early stopping.
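One way to get a duplicate-free split is to split by lesion rather than by image, so that all images of one lesion land on the same side. The sketch below assumes each image record carries a lesion identifier (HAM10000's metadata provides a lesion_id alongside image_id); the Image and Split types and the every-k-th-group rule are illustrative, and a real pipeline would typically shuffle the groups with a fixed seed instead.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Image {
  std::string image_id;
  std::string lesion_id;
};

struct Split {
  std::vector<Image> train;
  std::vector<Image> val;
};

// Group-aware split: every `val_every`-th lesion (in sorted id order) goes to
// validation, together with all of its images.
inline Split split_by_lesion(const std::vector<Image>& images,
                             std::size_t val_every) {
  assert(val_every >= 2);
  std::map<std::string, std::vector<Image>> by_lesion;
  for (const Image& img : images) {
    by_lesion[img.lesion_id].push_back(img);
  }

  Split out;
  std::size_t group = 0;
  for (const auto& entry : by_lesion) {
    const bool to_val = (group % val_every == val_every - 1);
    std::vector<Image>& dest = to_val ? out.val : out.train;
    dest.insert(dest.end(), entry.second.begin(), entry.second.end());
    ++group;
  }
  return out;
}
```

The same grouping can be applied a second time to carve a test set out of the training portion.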

Results Comparison

	Original results	Our results
Accuracy	78.9%	78.6%
Number of epochs	Approx. 42 epochs	40 epochs
Training size	Unknown	7000 samples
Validation size	478 samples	478 samples
Confusion matrix

Analysis of the Results

While our reproduced accuracy of 78.6% is close to the original reported accuracy, it is based on properly separated training and validation sets, avoiding the data leakage pitfalls of the original paper. The slight drop in accuracy is consistent with the original model’s performance being overestimated due to data leakage.

Moreover, using a separate test set for final evaluation provides a more reliable measure of the model’s ability to generalize to new, unseen data. The confusion matrices show that our model’s performance is consistent across different lesion classes, confirming the robustness of the evaluation.

Conclusion

Data leakage is a common and often overlooked problem in applied machine learning, leading to misleading performance metrics and irreproducible results. By carefully examining and correcting these issues in our reproduction, we hope to provide a clearer understanding of the importance of proper data handling and validation practices.

It is crucial for researchers and practitioners to be vigilant about data leakage and ensure that their models are trained, validated, and tested under realistic conditions. This not only ensures the credibility of their results but also enhances the real-world applicability of their models.

Thank you for reading, and stay tuned for more insights on machine learning reproducibility!

Final blog: Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon

Thu, 22 Aug 2024 00:00:00 +0000

Introduction

Hello everyone,

I’m Archit from India, an undergraduate student at the Indian Institute of Technology, Banaras Hindu University (IIT BHU), Varanasi. As part of the Automatic Reproducibility of COMPSs Experiments through the Integration of RO-Crate in Chameleon project, my proposal, under the mentorship of Raül Sirvent, aims to develop a service that facilitates the automated replication of COMPSs experiments within the Chameleon infrastructure.

About the Project

The project proposes to create a service that can take a COMPSs crate (an artifact adhering to the RO-Crate specification) and, through analysis of the provided metadata, construct a Chameleon-compatible image for replicating the experiment on the testbed.

Final Product

The basic workflow of the COMPSs Reproducibility Service can be explained as follows:

The service takes the workflow path or link as the first argument from the user.
The program shifts the execution to a separate sub-directory, reproducibility_service_{timestamp}, to store the results from the reproducibility process.
Two main flags are required:
- Provenance flag: If you want to generate the provenance of the workflow via the runcompss runtime.
- New Dataset flag: If you want to reproduce the experiment with a new dataset instead of the one originally used.
If there are any remote datasets, they are fetched into the sub-directory.
The main work begins with parsing the metadata from ro-crate-metadata.json and verifying the files present inside the dataset, as well as any files downloaded as remote datasets. This step generates a status table for the user to check if any files are missing or have modified sizes.

The final step is to transform the compss-command-line.txt and all the paths specified inside it to match the local environment where the experiment will be reproduced. This includes:
- Mapping the paths from the old machine to new paths inside the RO-Crate.
- Changing the runtime to runcompss or enqueue_compss, depending on whether the environment is a SLURM cluster.
- Detecting if the paths specified in the command line are for results, and redirecting them to new results inside the reproducibility_service_{timestamp}\Results directory.
After this, the service prompts the user to add any additional flags to the final command. Upon final verification, the command is executed via Python’s subprocess pipe.
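The path-mapping step can be illustrated with a small prefix rewrite. The actual service is written in Python and its logic is more involved; this is only a sketch of the idea, and the function name and the example roots are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Rewrites every whitespace-separated token that starts with `old_root`
// so that it points under `new_root`; other tokens are left unchanged.
inline std::string remap_paths(const std::string& command,
                               const std::string& old_root,
                               const std::string& new_root) {
  assert(!old_root.empty());
  std::istringstream in(command);
  std::string token;
  std::string out;
  while (in >> token) {
    if (token.compare(0, old_root.size(), old_root) == 0) {
      token = new_root + token.substr(old_root.size());
    }
    if (!out.empty()) out += ' ';
    out += token;
  }
  return out;
}
```

A plain prefix match like this would also rewrite a sibling directory that merely shares the prefix (for example /home/old2), and it does not handle quoted paths containing spaces, so a real implementation needs stricter matching.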

Logging System: All logs related to the Reproducibility Service are stored inside the reproducibility_service_{timestamp}\log.

You can view the basic pseudocode of the service.

Conclusion and Future Work

It’s been a long journey since I started this project, and now it’s finally coming to an end. I have learned a lot from this experience, from weekly meetings with my mentor to working towards long-term goals—it has all been thrilling. I would like to thank the OSRE community and my mentor for providing me with this learning opportunity.

This is only version 1.0.0 of the Reproducibility Service. If I have time from my coursework, I would like to fix any bugs or improve the service further to meet user needs.

However, the following issues still exist with the service and can be improved upon:

Third-party software dependencies: Automatic detection and loading of these dependencies on a SLURM cluster are not yet implemented. Currently, these must be handled manually by the user.
Support for workflows with data_persistence = False: There is no support for workflows where all datasets are remote files.

Deliverables

Reproducibility Service Repository: This repository contains the main service along with guidelines on how to use it. The service will be integrated with the COMPSs official distribution in its next release.
Chameleon Appliance: This is a single-node appliance with COMPSs 3.3.1 installed, so that anyone with access to Chameleon can reproduce experiments.

Previous Blogs

Make sure to check out my other blogs to see how I started this project and the challenges I faced along the way:

Thank you for reading the blog, have a nice day!!

Final Blogpost: HDEval's LLM Benchmarking for HDL Design

Wed, 21 Aug 2024 00:00:00 +0000

Introduction

Hello everyone! I’m Ashwin Bardhwaj, an undergraduate student studying at UC Berkeley. As part of Micro Architecture Santa Cruz (MASC), my proposal, under the mentorship of Jose Renau and Sakshi Garg, looks to create a suite of benchmark programs for HDEval.

The goal of this project is to create large-scale Verilog programs in order to benchmark the capability of LLMs to develop HDL code. Throughout this project, I have created 3 large Verilog testbenches called 3-Stage-RISC_V processor, Gameboy Emulator, and Sorts. The benchmark programs will lose their effectiveness if LLMs such as ChatGPT scrape GitHub repositories and learn from them. As a result, the code itself cannot be made public; instead, this file will cover the test report for all 3 of these projects.

3 Stage RISC V Processor

This is a pipelined RISC processor developed to handle RV32I instructions. A 3-stage processor will typically contain a Fetch, Decode, and Execute cycle. As a result, every instruction will take exactly 3 clock cycles. For this processor, instructions can be formatted into R, I (Load), S (Store), B (Cond), and J (Jump and Link) type instructions. Once a 32-bit instruction is fetched from the location in memory specified by the pc (Program Counter) register, it is sent to be decoded by the “decode unit”. By decoding an instruction, we can determine the exact operation code, the register locations of the 2 operands (rs1 and rs2), and the destination register (rd) to which the calculated result is written. After decoding, an activation flag is sent to the execution cycle, which accesses the register file at addresses rs1 and rs2 to get the correct operand data. The data and operation are then sent to the ALU to compute the result based on the opcode. The result is then written back into the register file at the rd address, the program counter is incremented, and the next instruction is fetched.
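To make the decode step concrete, here is a small software model (not the Verilog from the benchmark, which is not public) that extracts the fields of an R-type RV32I instruction using the standard bit positions. The struct and function names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Fields of an R-type RV32I instruction.
struct RType {
  std::uint8_t opcode;  // bits 6..0
  std::uint8_t rd;      // bits 11..7
  std::uint8_t funct3;  // bits 14..12
  std::uint8_t rs1;     // bits 19..15
  std::uint8_t rs2;     // bits 24..20
  std::uint8_t funct7;  // bits 31..25
};

// Splits a 32-bit instruction word into its R-type fields.
inline RType decode_rtype(std::uint32_t inst) {
  RType f{};
  f.opcode = static_cast<std::uint8_t>(inst & 0x7F);
  f.rd     = static_cast<std::uint8_t>((inst >> 7) & 0x1F);
  f.funct3 = static_cast<std::uint8_t>((inst >> 12) & 0x07);
  f.rs1    = static_cast<std::uint8_t>((inst >> 15) & 0x1F);
  f.rs2    = static_cast<std::uint8_t>((inst >> 20) & 0x1F);
  f.funct7 = static_cast<std::uint8_t>((inst >> 25) & 0x7F);
  return f;
}

// Inverse of decode_rtype, useful for building test vectors.
inline std::uint32_t encode_rtype(const RType& f) {
  assert(f.opcode < 128 && f.rd < 32 && f.funct3 < 8);
  assert(f.rs1 < 32 && f.rs2 < 32 && f.funct7 < 128);
  return (static_cast<std::uint32_t>(f.funct7) << 25) |
         (static_cast<std::uint32_t>(f.rs2) << 20) |
         (static_cast<std::uint32_t>(f.rs1) << 15) |
         (static_cast<std::uint32_t>(f.funct3) << 12) |
         (static_cast<std::uint32_t>(f.rd) << 7) |
         static_cast<std::uint32_t>(f.opcode);
}
```

For example, the word 0x002081B3 is add x3, x1, x2: opcode 0x33, rd 3, rs1 1, rs2 2, with funct3 and funct7 both zero.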

The prompts for each module in this processor have been generated and tested against the GPT 3 turbo and GPT 4o models as an example. In the RISC V tab in my test report, I have provided the exact prompts and results after running on MASC’s HDLAgent tool, which can access the APIs of many LLMs.

Gameboy Emulator

The Gameboy Emulator is a Verilog implementation of the classic GameBoy console that was widely popular in the 1990s. The main aspects of the GameBoy that were focused on in this project were the Z-80 like CPU, memory objects like RAM, VRAM, and ROM, the PPU (Picture Processing Unit), and other peripherals. The instructions are given to the CISC (variable-length instructions) CPU, where they are decoded and executed based on the details and expectations of that specific instruction. In some cases, timing becomes a concern, and there is significant effort made to ensure that instructions can be parsed and run predictably and effectively. Instructions from the ROM may take between 1 and 4 clock cycles to run depending on the requirements. For example, the instruction “LD B, HL”, which loads the data found at the 16-bit address given by registers H and L into register B, is a 2-cycle instruction. The first cycle decodes the HL address and fetches the data at that location, while the second cycle takes the new input data and writes it into register B. This requires accurate timing control between different aspects of the GameBoy.
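The two-cycle behaviour of this load can be modelled in software as a tiny state machine. This is an illustrative sketch, not the Verilog from the benchmark; the Cpu struct, the latch field, and the flat 64 KiB memory array are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct Cpu {
  std::uint8_t b = 0, h = 0, l = 0;
  std::uint8_t latch = 0;  // holds the fetched byte between the two cycles
  int cycle = 0;           // which cycle of the instruction runs next
};

// Advances the load of B from address HL by one cycle.
// Returns true when the instruction has finished.
inline bool step_ld_b_hl(Cpu& cpu, const std::array<std::uint8_t, 65536>& mem) {
  if (cpu.cycle == 0) {
    // Cycle 1: form the 16-bit address from H and L and fetch the byte.
    const std::uint16_t addr = static_cast<std::uint16_t>((cpu.h << 8) | cpu.l);
    cpu.latch = mem[addr];
    cpu.cycle = 1;
    return false;
  }
  assert(cpu.cycle == 1);
  // Cycle 2: write the fetched byte into register B.
  cpu.b = cpu.latch;
  cpu.cycle = 0;
  return true;
}
```

Register B only changes at the end of the second call, which is the timing property the hardware description has to preserve.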

The Picture Processing Unit is also an integral feature of the Gameboy. Three frames called Background, Window, and Sprite are combined into the classic Gameboy screens we know today. While the Background and Window data are consistently called from the VRAM after certain clock cycle times, the Sprite and sprite attributes are accessed using DMA (Direct Memory Access) from OAM (Object Attribute Memory). This reduces the CPU load and speeds up access to sprite data.

Deliverables

HDEval Test Report: The HDEval Test Report contains the module prompts for each testbench, the results after testing on GPT 3 turbo and 4o, and test cases to ensure code correctness and reliability.
HDEval Repo: HDEval contains the encrypted version of the yaml files that encapsulate the code, prompts, and additional data.

Next Steps

Given these benchmarks, it is important to track the abilities of these LLMs to generate HDL code. Therefore, beyond GPT 3 turbo and 4o, I would like these benchmarks to be applied to more models so that we can track their growth and stay informed on their effectiveness in HDL and hardware.

Previous Blogs

Please feel free to check out my previous blogs!

Thank you for reading!

Deriving Realistic Performance Benchmarks for Python Interpreters

Sat, 17 Aug 2024 00:00:00 +0000

Hi, I am Mrigank. I am one of the Summer of Reproducibility fellows for 2024, and I will be working on deriving realistic performance benchmarks for Python interpreters with Ben Greenman from the University of Utah.

Background and Motivation

Recent work by Meta on a statically typed variant of Python, Static Python, has shown immense promise in moving towards gradually typed languages without compromising on performance due to lack of complete soundness. Lu et al.¹ provide an evaluation of Static Python and conclude that the enhancement in performance reported by Meta on their web servers for Instagram is reasonable and is not just the result of refactoring. In fact, the study notes that very little refactoring is typically required for converting existing Python programs to Static Python. However, this study depends on a limited model of the language and does not represent real-world software applications.

In our project, we aim to create a realistic performance benchmark to reproduce performance improvements reported by Meta and to evaluate the performance of Static Python in real-world software applications. In addition, we will analyze partially-typed code to understand the performance implications of gradual typing in Python.

Key Objectives

We will use widely-used open-sourced applications to derive realistic performance benchmarks for evaluating Static Python. In particular, we will focus on projects that utilize the Python framework Django, which is also known to power the backend of Instagram. We plan to begin with Wagtail, a popular CMS built on Django. We have also identified other potential projects like Zulip, Plane and LibrePhotos. These are all actively maintained projects with significantly large codebases.

Further, we will analyze the performance of partially-typed code. This will be of value to the Python community as it will provide confidence in gradually moving towards Static Python for improving performance. We will make our benchmarks publicly available for the community to use, reproduce, and extend.

Methodology

Load Testing

For each project that we derive benchmarks from, we will design user pipelines that simulate real-world usage and implement them to create load tests using the open-sourced Locust framework. This will allow us to evaluate the performance of Static Python in real-world loads and scenarios. Locust can spawn thousands of users, each of which independently bombards the system with HTTP requests for a range of tasks that are defined in their user pipeline. We will host each project on a server (local or cloud) to run these load tests.

We will profile each project to ensure that our tests cover different parts of the codebase and to identify performance bottlenecks. We can then focus on these bottlenecks while gradually typing the codebase.

Gradual Typing

For typing the code in these projects, we will create two typed versions of each project: one with the so-called “shallow” type annotations and another with “advanced” type annotations. The former is relatively easier to implement and we can use tools like MonkeyType to generate stubs that can be quickly verified manually. The latter is quite non-trivial and will require manual effort. We will then mix-and-match the three versions of each project (untyped, shallow, and advanced) to create different combinations of typed and untyped code. Note that this mix-and-match can be done at both the module level and also at the function or class level.
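To give a sense of scale for module-level mix-and-match: with three versions per module, a project with n modules has 3^n possible configurations, so exhaustive measurement is only feasible for small n and sampling is needed beyond that. The sketch below enumerates configurations as base-3 numbers; the names are illustrative and not part of the project's tooling.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Typing { Untyped = 0, Shallow = 1, Advanced = 2 };

// Number of module-level configurations: 3^modules.
inline std::uint64_t config_count(unsigned modules) {
  assert(modules <= 40);  // 3^40 still fits in 64 bits
  std::uint64_t n = 1;
  for (unsigned i = 0; i < modules; ++i) n *= 3;
  return n;
}

// Decodes configuration number `index` into one typing level per module,
// reading `index` as a base-3 number (least significant digit first).
inline std::vector<Typing> nth_config(std::uint64_t index, unsigned modules) {
  assert(index < config_count(modules));
  std::vector<Typing> levels(modules);
  for (unsigned i = 0; i < modules; ++i) {
    levels[i] = static_cast<Typing>(index % 3);
    index /= 3;
  }
  return levels;
}
```

Drawing random indices below config_count(n) and decoding them with nth_config gives a simple way to sample configurations uniformly.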

Conclusion

This is my first time working on performance-benchmarking and I am excited to pick up new skills in the process. I am also looking forward to interacting with people from the Python community, people from Meta’s Static Python team, and also with the maintainers of the projects we will be working on. I will be posting more updates on this project as we make progress. Stay tuned!

Kuang-Chen Lu, Ben Greenman, Carl Meyer, Dino Viehland, Aniket Panse, and Shriram Krishnamurthi. Gradual soundness: Lessons from static python. The Art, Science, and Engineering of Programming.

Midterm Report: Deriving Realistic Performance Benchmarks for Python Interpreters

Sat, 17 Aug 2024 00:00:00 +0000

Hi, I am Mrigank. As a Summer of Reproducibility 2024 fellow, I am working on deriving realistic performance benchmarks for Python interpreters with Ben Greenman from the University of Utah. In this post, I will provide an update on the progress we have made so far.

Creating a Performance Benchmark

We are currently focusing on applications built on top of Django, a widely used Python web framework. For our first benchmark, we chose Wagtail, a popular content management system. We created a pipeline with Locust to simulate real-world load on the application. All of our work is open source and available on our GitHub repository.

This load-testing pipeline creates hundreds of users who independently create many blog posts on a Wagtail blog site. At the same time, thousands of users are spawned to view these blog posts. Wagtail does not have a built-in API and so it took some initial effort to figure out the endpoints to hit, which I did by inspecting the network logs in the browser while interacting with the Wagtail admin interface.

A snapshot from a run of the load test with Locust is shown in the featured image above. This snapshot was generated by spawning users from 24 parallel Locust processes. This was done on a local server, and we plan to perform the same experiments on CloudLab soon.

Profiling

On running the load tests with a profiler, we found that the bottlenecks in the performance arose not from the Wagtail codebase but from the Django codebase. In particular, we identified three modules in Django that consumed the most time during the load tests: django.db.backends.sqlite3._functions, django.utils.functional, and django.views.debug. Dibri, a graduate student in Ben’s lab, is helping us add types to these modules.
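
One way to attribute profiling time to modules is sketched below with the standard library's `cProfile` and `pstats`. This is an assumption about tooling, since the post does not say which profiler was used, and `busy_work` is a hypothetical stand-in for the code exercised by a load test.

```python
import cProfile
import pstats
from collections import defaultdict


def time_per_module(func, *args, **kwargs):
    """Profile one call and return (filename, own_time_seconds) pairs, slowest first."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()

    stats = pstats.Stats(profiler)
    totals = defaultdict(float)
    # Keys are (filename, line, function); values start with call counts,
    # followed by the function's own time and its cumulative time.
    for (filename, _line, _name), (_cc, _nc, own_time, _cum, _callers) in stats.stats.items():
        totals[filename] += own_time
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def busy_work(n):
    # Hypothetical workload, used only so there is something to profile.
    return sum(i * i for i in range(n))


ranking = time_per_module(busy_work, 10_000)
for filename, seconds in ranking[:3]:
    print(f"{seconds:.6f}s  {filename}")
```

Grouping by file rather than by function gives a coarse view similar to the per-module ranking described above; built-in functions appear under the placeholder filename `~`.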

Next Steps

Based on these findings, we are now working on typing these modules to see if we can improve the performance of the application by using Static Python. Typing Django is a non-trivial task, and while there have been some efforts to do so, previous attempts like django-stubs are incomplete for our purpose.

We are also writing scripts to mix untyped, shallow-typed, and advanced-typed versions of a Python file, and run each mixed version several times to obtain a narrow confidence interval for the performance of each version.
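
As a rough sketch of the repeated-runs idea, the mean run time and an approximate 95% confidence interval can be computed as follows. The run times are made-up numbers, not measurements from the project, and the normal approximation (z = 1.96) is an assumption; with only a few runs, a Student-t quantile would be more appropriate.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(samples)
    # Standard error of the mean, scaled by the normal quantile.
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


# Hypothetical run times in seconds for one mixed configuration.
runs = [10.2, 10.4, 10.1, 10.3, 10.0]
mean, half_width = mean_confidence_interval(runs)
print(f"{mean:.2f} s +/- {half_width:.2f} s")
```

Because the half-width shrinks with the square root of the number of runs, repeating each configuration more often narrows the interval.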

We will be posting more updates as we make progress. Thank you for reading!

Final Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Fri, 16 Aug 2024 00:00:00 +0000

Background

Hello, I’m Lihaowen (Jayce) Zhu, a 2024 SoR contributor for the FEP-bench project, under the mentorship of Yuyang (Roy) Huang. Before we start, let’s recap the goal of our project and our progress up to the midterm. The FEP-Bench project addresses the significant bottlenecks encountered during the feature engineering and preprocessing phase of machine learning, focusing in particular on data retrieval from data lakes and computational inefficiencies in data operations. To tackle these challenges, we have collected basic information about common datasets for different machine learning tasks, along with their corresponding preprocessing pipelines.

Methodology

Our goal is to improve the efficiency of the machine learning preprocessing pipeline and keep the training process of the Deep Learning model busy. This means we need to increase the preprocessing throughput, which is the feed rate from the preprocessing stage to the training stage. Following previous work, we view a Deep Learning preprocessing pipeline as two parts. The first part contains the steps that are run once (S1-Sm); we call it the “offline” part. The second part contains all the remaining steps, which are run at every training iteration; we call it the “online” part. After the offline steps, the output data is written back to disk. The online steps then load that data from storage and apply the remaining operations. We can split the pipeline at any step, and each split point is a preprocessing strategy. Some strategies achieve a much higher final preprocessing throughput than others. Our project adopts this method to profile the performance of different strategies, with the goal of maximizing the final preprocessing throughput into training for a given pipeline. We want this to be an automatic process that does not require extra user instructions or parameters.
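
The split-point idea can be sketched with a toy cost model. Everything here is hypothetical: the step names, costs, sizes, and read bandwidth are invented for illustration, and the model assumes that the online time per sample is simply the time to load the stored sample plus the compute time of the remaining steps.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    seconds_per_sample: float  # compute cost of this step
    output_bytes: int          # size of one sample after this step


def evaluate_splits(steps, raw_bytes, read_bytes_per_second):
    """Return (split_index, stored_bytes, online_throughput) for every split point.

    Split k means steps[:k] run once offline and steps[k:] run online.
    """
    results = []
    for k in range(len(steps) + 1):
        stored_bytes = raw_bytes if k == 0 else steps[k - 1].output_bytes
        load_time = stored_bytes / read_bytes_per_second
        online_time = load_time + sum(s.seconds_per_sample for s in steps[k:])
        results.append((k, stored_bytes, 1.0 / online_time))
    return results


# Hypothetical pipeline: decoding inflates the data, the last step shrinks it.
steps = [
    Step("decode", 0.010, 1_000_000),
    Step("transform", 0.002, 1_000_000),
    Step("spectrogram", 0.004, 200_000),
]
results = evaluate_splits(steps, raw_bytes=250_000, read_bytes_per_second=100e6)
best = max(results, key=lambda r: r[2])
for k, stored, throughput in results:
    print(f"split={k} stored={stored} B throughput={throughput:.1f} samples/s")
```

Split 0 corresponds to the fully online baseline. A real system would measure these costs rather than assume them, and would also weigh the storage overhead of each strategy against its throughput.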

Experiment

Next, we ran the data preprocessing strategy experiment on the LibriSpeech dataset, an audio dataset for ML tasks such as Automatic Speech Recognition. The dataset size is 6.3 GB with almost 30000 samples. Each audio file is stored in the binary FLAC format. As a result, the first step of the preprocessing pipeline is decoding, which converts the binary data into arrays of floats. We then applied typical audio preprocessing steps of transformation (normalization, padding, extracting the loudest section) and augmentation (random cut, random shift, random mask, random added noise). Finally, the audio data is converted to a Log-Mel Spectrogram (LMS), which is commonly used in audio tasks such as Speech Recognition and Speaker Identification.

We have benchmarked the throughput performance and storage overhead of all possible strategy split points, and have seen some trade-offs between them. Both storage overhead and throughput speed-up use the fully online method as the baseline. What we’ve observed from our results is that the speed-up keeps increasing when we put operations into the offline part, and the storage consumption is very low for the strategies after audio decoding. Also, we analysed the performance of individual methods of transformation and augmentation steps. We find that the speed-up performance is quite stable between 1.0 and 1.2 across these methods, but some methods can have a high storage overhead, like normalization and random noise.

We also observed that dataset size influences the throughput of the preprocessing pipeline. The throughput speed-up for 10000 samples is almost double the speed-up for 5000 samples, which suggests that a larger dataset may lead to a higher speed-up. To check whether every operation follows this pattern or only certain ones, we measured the throughput speed-up of each operation in the audio preprocessing pipeline at different dataset sizes. The results showed that only the audio decoding step gains a large increase in speed-up for larger dataset sizes. For transformation, augmentation, and LMS, the throughput stays at a steady level. This indicates that only the audio decoding step becomes faster as the dataset size grows.

Conclusion

In our work, we have built up a collection of common datasets and their preprocessing pipelines for different machine-learning tasks. For the audio dataset LibriSpeech, we ran experiments on the trade-off between throughput speed-up and storage overhead, and on the effect of dataset size. We found that the speed-up keeps increasing as more operations are moved into the offline part, and that only the audio decoding step becomes faster as the dataset size grows.

Future works

In the near future, we still want to find the optimal preprocessing strategy by profiling only a small part of the original enormous dataset. The second thing is that besides the audio dataset, we must expand the range of our experiments on other datasets and ML tasks. Finally, we need to implement our goal of building an automatic system that decides the optimal strategy of a preprocessing pipeline.

Final Blog: FSA - Benchmarking Fail-Slow Algorithms

Wed, 14 Aug 2024 00:00:00 +0000

Introduction

Hello! I hope you’re enjoying the summer as much as I am. I’m excited to join the SOR community as a 2024 contributor. My name is Xikang Song, and I’m thrilled to collaborate with mentors Ruidan Li and Kexin Pei on the FSA-Benchmark project. This project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. Throughout this journey, we tested a broad range of algorithms, from traditional approaches to state-of-the-art techniques, using a robust evaluation system to compare their effectiveness.

In the first half of the project, I focused on implementing and testing different machine learning models for detecting disks at high risk of fail-slow anomalies. This involved setting up initial models such as the Cost-Sensitive Ranking Model and Multi-Prediction Models, and beginning to explore LSTM networks for analyzing input disk data.

In the second half, I built upon this foundation by refining the evaluation processes, exploring advanced models like PatchTST, and investigating the potential of large language models (LLMs) for detecting subtle fail-slow conditions in storage systems. This blog post will summarize the key achievements, findings, and comparisons with baseline models from this phase.

Key Achievements

Comprehensive Benchmarking and Evaluation:
- I extended the benchmarking framework to evaluate multiple algorithms across 25 different data clusters on PERSEUS. This process involved generating and analyzing heatmaps that visualized the precision and recall of each model under various settings, providing a clear understanding of each approach’s strengths and limitations.
Exploration of Advanced Machine Learning Models:
- LSTM Model: I implemented the Long Short-Term Memory (LSTM) model, specifically designed for sequential data, to capture temporal dependencies in disk performance metrics. This model was used to predict potential fail-slow anomalies by analyzing historical data. Using Mean Squared Error (MSE) as a risk indicator, the LSTM model outperformed baseline approaches like the Cost-Sensitive Ranking Model and Multi-Prediction Models, especially in clusters where latency patterns between faulty and normal disks were distinct, such as in Cluster_P. This resulted in higher precision and fewer false positives. However, in clusters with more complex and overlapping data distributions, like Cluster_L, the LSTM model’s performance diminished, similar to that of the baseline models.
- PatchTST Model: I also introduced and evaluated the PatchTST model, which is built on a transformer-based architecture known for its ability to handle sequential data by capturing long-range dependencies and intricate temporal patterns. Unlike traditional models, PatchTST processes time series data in segments or “patches,” enhancing its ability to predict disk behavior over extended periods. Like the LSTM model, PatchTST uses outlier MSE values to assess disk risk. In clusters with a clear separation between faulty and normal disks, PatchTST outperformed baseline models by effectively identifying faulty patterns. However, similar to the LSTM model, PatchTST encountered difficulties in clusters with significant data overlap, such as Cluster_L.
Investigation into Large Language Models (LLMs):
- I explored the use of GPT-4o-mini for fail-slow detection. While large language models (LLMs) showed potential, particularly in reducing false positives and improving precision over baseline models, they did not consistently outperform specialized models like LSTM and PatchTST in this context. LLMs struggled with recall, especially as thresholds increased, revealing the challenges of adapting LLMs to time series data. This limitation arises because LLMs are primarily trained for natural language generation tasks, not for analyzing time series data. As a result, their ability to fully capture anomalies is limited. To improve their effectiveness, we need to develop methods that help LLMs better understand time series data. For example, incorporating statistical information about each disk’s performance could enhance LLMs’ understanding, leading to better precision in fail-slow detection.
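
The MSE-based risk scoring described for the LSTM and PatchTST models above, together with the precision and recall used to compare approaches, can be sketched as follows. The post does not say how outlier MSE values are determined, so the median plus k times MAD rule is an assumption, and the disk names and error values are invented.

```python
import statistics


def flag_risky_disks(mse_by_disk, k=3.0):
    """Flag disks whose prediction MSE exceeds median + k * MAD (assumed outlier rule)."""
    values = list(mse_by_disk.values())
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    threshold = median + k * mad
    return {disk for disk, mse in mse_by_disk.items() if mse > threshold}


def precision_recall(predicted, actual):
    """Compute precision and recall of a predicted set against ground truth."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical per-disk MSE between predicted and observed latency.
mse_by_disk = {"d1": 0.10, "d2": 0.12, "d3": 0.11, "d4": 0.95, "d5": 0.09}
flagged = flag_risky_disks(mse_by_disk)

# Hypothetical ground truth: d2 is also faulty but looks like a normal disk.
actual_faulty = {"d2", "d4"}
precision, recall = precision_recall(flagged, actual_faulty)
print(flagged, precision, recall)
```

In this toy example the faulty disk whose error overlaps with the normal disks is missed, which mirrors the difficulty reported for clusters with overlapping distributions.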

Conclusion and Future Work

The work in this project demonstrated that while advanced machine learning models like LSTM and PatchTST offer significant potential for detecting fail-slow conditions, challenges remain in ensuring consistent performance across diverse clusters. Compared to baseline models, these advanced approaches generally provided better precision and recall, especially in clusters with distinct data patterns between faulty and normal disk performance time series. However, the persistent difficulties in more complex clusters indicate the need for further refinement.

Moving forward, future work will focus on refining these models, particularly in improving their performance in challenging clusters like Cluster_L. Additionally, I plan to further explore techniques such as prompt engineering for LLMs to better tailor them for time series analysis and fail-slow detection tasks.

Deliverables

Repository: All comprehensive analysis code and source code can be found in the FSA_BENCHMARK GitHub Repository.
Jupyter Notebook: A notebook to reproduce the experiments and benchmarks on Chameleon: Chameleon Experiment Notebook.
Final Report: Comprehensive algorithm performance evaluation for all methods in FSA-Benchmarking Final Report.

Data Leakage in Applied ML

Tue, 13 Aug 2024 00:00:00 +0000

Hello everyone!

I have been working on reproducing the results from Characterization of Term and Preterm Deliveries using Electrohysterograms Signatures. This paper aims to predict preterm birth using a Support Vector Machine with an RBF kernel. However, there is a major flaw in the methodology: preprocessing is applied jointly to the training and test sets. This happens when preprocessing is performed on the entire dataset before splitting it into training and test sets.
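
A minimal sketch of this kind of leakage is shown below. Standardization is used purely as an illustration, since the post does not specify which preprocessing step is involved, and the data values are invented.

```python
import statistics


def standardize(values, mean, stdev):
    """Scale values using the given mean and standard deviation."""
    return [(v - mean) / stdev for v in values]


# Hypothetical one-feature dataset; the last two samples form the test set.
data = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0]
train, test = data[:4], data[4:]

# Leaky: statistics come from the full dataset, so test values shape the training features.
leaky_mean, leaky_std = statistics.mean(data), statistics.stdev(data)
leaky_train = standardize(train, leaky_mean, leaky_std)

# Correct: statistics come from the training split only and are reused for the test split.
train_mean, train_std = statistics.mean(train), statistics.stdev(train)
clean_train = standardize(train, train_mean, train_std)
clean_test = standardize(test, train_mean, train_std)

print(leaky_mean, train_mean)
```

The two training sets differ even though the raw training samples are identical, because the leaky version has already seen the test data.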

Reproducing the published results came with its own challenges, including updating EHG-Oversampling to extract meaningful features from EHG signals and finding optimal hyperparameters for the model. Through our work on reproducing the published results and creating toy example notebooks, we have been able to demonstrate that data leakage leads to overly optimistic measures of model performance and that models trained with data leakage fail to generalize to real-world data. In such cases, performance on the test set doesn’t translate to performance in the real world.

Next, I’ll be reproducing the results published in Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches.

You can follow my work on the EHG paper here.

Stay tuned for more insights on data leakage and updates on our progress!

Midterm Report: Halfway through medicinal data visualization using PolyPhy/Polyglot

Mon, 12 Aug 2024 00:00:00 +0000

Introduction

Hello! My name is Ayush Sharma, a machine learning engineer and researcher based out of Chandigarh, a beautiful city in Northern India known for its modern architecture and green spaces. For the last month and a half I have been working closely with my mentors Oskar Elek and Kiran Deol on the project titled Unveiling Medicine Patterns: 3D Clustering with Polyphy/Polyglot as part of GSoC 2024.

Progress and Challenges

The project focuses on developing effective clustering algorithms to visualize medicine data in three dimensions using PolyPhy and Polyglot. My journey began with data preprocessing and cleaning, where unnecessary data points were removed, and missing values were addressed.

One of the primary techniques we’ve employed is UMAP (Uniform Manifold Approximation and Projection). UMAP’s ability to preserve the global structure of the data while providing meaningful clusters proved advantageous. Initial experiments with UMAP on datasets of various sizes (ranging from 1,500 to 15,000 medicines) provided valuable insights into the clustering patterns. By iteratively halving the dimensions and refining the parameters, we achieved more accurate clustering results.

To complement UMAP, we explored t-SNE (t-distributed Stochastic Neighbor Embedding). t-SNE’s focus on local relationships helped in understanding finer details within the clusters. By adjusting t-SNE parameters and conducting perturbations, we could better comprehend the data’s behavior. Combining UMAP with t-SNE in a loop, halving dimensions iteratively, showed promise, allowing us to leverage the strengths of both techniques to enhance clustering accuracy.

We also experimented with pre-trained models like BERT and Glove to create embeddings for the medicines. BERT’s splitting of salt names into subparts and Glove’s limitations in recognizing specific salts led to inaccurate clustering, which we are currently working to improve.

Next Steps

Moving forward, I will focus on refining our clustering and embedding techniques to enhance overall accuracy. This involves integrating Jaccard distance alongside other distance measures to improve similarity assessments between medicines and clusters. Additionally, I’ll continue experimenting with models such as GPT, CLIP, and Gemini for better embeddings, while addressing the limitations of BERT and Glove by leveraging custom embeddings created with transformers and one-hot encoding. Optimization of the UMAP and t-SNE algorithms will also be crucial to ensure their effectiveness in clustering and visualization. These steps aim to overcome current challenges and further advance the project’s goals.
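
As a small illustration of the Jaccard distance mentioned above, the sketch below compares two medicines by their sets of salts. Representing a medicine as a set of salt names is an assumption made for this example, and the compositions are hypothetical.

```python
def jaccard_distance(a, b):
    """Return 1 - |A intersect B| / |A union B|; two empty sets have distance 0."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical salt compositions.
med_a = {"paracetamol", "caffeine"}
med_b = {"paracetamol", "ibuprofen"}
distance = jaccard_distance(med_a, med_b)
print(round(distance, 3))
```

A distance of 0 means identical compositions and 1 means no shared salts, which makes the measure easy to combine with embedding-based distances.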

Midterm Check-In: Progress on the AutoAppendix Project

Sat, 03 Aug 2024 00:00:00 +0000

Hi all,

I’m happy to share a quick update on the AutoAppendix project as we’re about halfway through. We’ve made some steady progress on evaluating artifacts from SC24 papers, and we’re starting to think about how we can use what we’ve learned to improve the artifact evaluation process in the future.

What We’ve Been Up To

As a quick reminder, the goal of our project is to develop a set of guidelines that researchers can use to improve the reproducibility of their work. We’re focusing on papers from the Supercomputing Conference 2024 that applied for an “Artifact Replicable” badge, and we’re evaluating their artifacts to see how well the experiments can be replicated. As it was difficult to make assumptions about the exact outcomes of the project besides detailed experiment recreation, our main goal of this midterm check-in is to share what insights we have gathered so far and to set the stage for the final outcomes.

Our main task so far has been making a selection of submissions with experiments designed for Chameleon Cloud, or those that could be easily adapted to run on Chameleon. As there were 45 submissions that applied for an “Artifact Replicable” badge, it was not easy to choose which ones to evaluate, but we managed to narrow it down to 18 papers that we thought would be a good fit for our project.

We’ve chosen to focus on papers that do not require special hardware (like a specific supercomputer) or complex network setups, as it would be difficult to generalize the insights from these kinds of experiments. Instead, we’ve been looking at those that require only a single computation node, and could theoretically be run with the available hardware on Chameleon.

Observations and Learning Points

At the moment, we’re about halfway through the evaluation process. So far, we’ve noticed a range of approaches to documenting and setting up computational experiments. Even without looking at the appendices in detail, it’s clear that there’s a lot of room for standardization of the documentation format and software setup, which could make life easier for everyone involved. This particularly applies to software setups, which are often daunting to replicate, especially when there are specific version requirements, version incompatibilities or outright missing dependencies. Since the main goal of this project is to develop a set of guidelines that researchers can use to improve the reproducibility of their work, suggesting a way to deal with software versions and dependencies will be a key part of our results.

We’ve observed that submissions with well-structured and detailed appendices tend to fare better in reproducibility checks. This includes those that used containerization solutions like Docker, which encapsulate the computing environment needed to run the experiments and thus eliminate the need to install specific software packages. It’s these kinds of practices that we think could be encouraged more broadly.

Looking Ahead

The next steps are pretty exciting! We’re planning to use what we’ve learned to draft some guidelines that could help future SC conference submissions be more consistent. This might include templates or checklists that ensure all the necessary details are covered.

Additionally, we’re thinking about ways to automate some parts of the artifact evaluation process. The goal here is to make it less labor-intensive and more objective. A particularly convenient route to reproducible artifact evaluation is Chameleon’s JupyterHub interface, which in conjunction with the Trovi artifact sharing platform makes it easy to share artifacts and allows interested parties to reproduce the experiments with minimal effort. We are thus looking into ways to use and contribute to these tools in a way that could benefit the broader research community.

Wrapping Up

That’s it for now! We are working towards getting as many insights as possible from the rest of the artifact evaluations, and hopefully, by the end of this project, we’ll have some solid recommendations and tools to show for it. Thanks for keeping up with our progress, and I’ll be back with more updates as we move into the final stages of our work.

[MidTerm] ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Thu, 01 Aug 2024 00:00:00 +0000

Hey there, scalability enthusiasts and fellow researchers! I’m excited to share my progress on the ScaleRep project for SoR 2024 under the mentorship of Bogdan "Bo" Stoica and Yang Wang. Here’s a glimpse into how we’re tackling scalability bugs in large-scale distributed systems.

Project Overview

Large-scale distributed systems are the backbone of modern computing, powering various applications and services. However, these systems often face challenges related to reliability and performance, particularly scalability bugs. These bugs manifest in large-scale deployments, causing issues such as system downtime, reduced responsiveness, and data loss. Traditional bug-finding methods fall short in detecting these bugs, which are triggered by factors like component count, system load, workload size, recovery protocol reliability, and intermediate failure magnitude.

Our project, ScaleRep, aims to address these challenges by analyzing recent scalability issues from ten popular open-source large-scale systems. We are providing detailed accounts of bug reproduction experiences, identifying common challenges, and developing protocols for triggering and quantifying the impact of scalability bugs.

Progress Highlights

So far, I have been working on the following bugs and have successfully uploaded some of them to Trovi. Here’s a brief overview of my progress:

Bugs Worked On:

IGNITE-20614: Uploaded to Trovi Trovi Link
IGNITE-17407: Uploaded to Trovi Trovi Link
IGNITE-20692
IGNITE-16600
IGNITE-16072

What is Chameleon and Trovi?

Chameleon is a configurable experimental environment for large-scale cloud research. It provides a platform for running and testing distributed systems at scale, allowing researchers to reproduce and study scalability issues in a controlled setting.

Trovi is a platform that facilitates the sharing of reproducible artifacts. By uploading our bug reproduction artifacts to Trovi, we enable other researchers to easily reproduce scalability bugs, fostering collaboration and advancing the field of distributed systems research.

Short Description of the Bugs

IGNITE-20614 This bug refers to an issue where the Ignite service grid experiences degradation or hangs under specific conditions related to service deployment and node restarts.

Root Causes: The root cause is a race condition during the deployment and undeployment of services in the service grid, particularly when nodes are restarted or when there is a significant amount of concurrent service deployment and undeployment activity.

Impact: The impact of this bug includes potential service grid hangs, degraded performance, and possible inability to deploy or undeploy services as expected, which can disrupt the overall operation of the Ignite cluster.

Fix: The fix involves adding proper synchronization mechanisms to handle concurrent service deployment and undeployment operations more gracefully, ensuring that race conditions are avoided.

IGNITE-17407 This issue pertains to the incorrect behavior of the Ignite thin client protocol, particularly when dealing with binary objects and schema changes.

Root Causes: The root cause lies in the way the thin client handles binary object schema changes. The thin client was not correctly updating the schema cache, leading to inconsistencies and incorrect behavior when deserializing binary objects.

Impact: Users of the thin client may experience issues with binary object deserialization, leading to potential data corruption, incorrect query results, and overall application instability.

Fix: The fix involves updating the thin client protocol to properly handle schema changes by ensuring that the schema cache is correctly updated and synchronized with the server.

IGNITE-20692 This bug is related to the performance degradation observed in the Ignite SQL engine when executing certain complex queries.

Root Causes: The root cause is identified as inefficient query planning and execution strategies for specific types of complex SQL queries, leading to excessive resource consumption and slow query performance.

Impact: Users running complex SQL queries may experience significant performance degradation, leading to slower response times, increased CPU and memory usage, and potentially impacting the overall performance of the Ignite cluster.

Fix: The fix involves optimizing the SQL query planner and executor to handle complex queries more efficiently, including better indexing strategies, improved query plan caching, and more effective resource management during query execution.

IGNITE-16600 This bug involves an issue with speed-based throttling in the checkpoint process, leading to possible starvation of the checkpoint thread under heavy load.

Root Causes: The root cause is the absence of proper mechanisms to wake up throttled threads when they no longer need to be throttled, resulting in unnecessary waiting and potential starvation of the checkpoint thread.

Impact: Under heavy load, the checkpoint process can be significantly delayed, leading to slower checkpoint completion times, increased risk of data loss, and overall degraded performance of the Ignite cluster.

Fix: The fix includes implementing methods to wake up throttled threads when they no longer need to be throttled (tryWakeupThrottledThreads and shouldThrottle), ensuring that the checkpoint process can proceed without unnecessary delays.
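
The wake-up pattern described in this fix can be illustrated with a small Python analogy. This is not Ignite's Java implementation: only the method names `should_throttle` and `try_wakeup_throttled_threads` echo the ones mentioned above, and the class and the remaining methods are hypothetical.

```python
import threading


class CheckpointThrottle:
    """Toy throttle: threads park while throttling is on and are woken when it ends."""

    def __init__(self):
        self._cond = threading.Condition()
        self._throttling = False

    def should_throttle(self):
        with self._cond:
            return self._throttling

    def set_throttling(self, enabled):
        with self._cond:
            self._throttling = enabled
            if not enabled:
                # Without this wake-up, parked threads would keep waiting needlessly.
                self.try_wakeup_throttled_threads()

    def try_wakeup_throttled_threads(self):
        # Wake every parked thread; each one re-checks the flag before continuing.
        with self._cond:
            self._cond.notify_all()

    def park_while_throttled(self, timeout=None):
        """Block until throttling is off; return False if the timeout expired first."""
        with self._cond:
            return self._cond.wait_for(lambda: not self._throttling, timeout)


throttle = CheckpointThrottle()
throttle.set_throttling(True)
released = []
worker = threading.Thread(
    target=lambda: released.append(throttle.park_while_throttled(timeout=5))
)
worker.start()
throttle.set_throttling(False)  # ends throttling and wakes the parked thread
worker.join()
print(released)
```

The key point is that turning throttling off is paired with an explicit notification, so waiting threads resume promptly instead of sleeping until a timeout.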

IGNITE-16072 This issue pertains to the incorrect handling of SQL queries involving NULL values in the Ignite SQL engine, leading to unexpected query results.

Root Causes: The root cause is an incorrect implementation of SQL semantics for handling NULL values in certain query conditions, particularly in the presence of complex joins and subqueries.

Impact: Users may experience incorrect query results when NULL values are involved, leading to potential data inconsistencies and incorrect application behavior.

Fix: The fix involves correcting the SQL engine’s implementation to properly handle NULL values according to the SQL standard, ensuring that queries involving NULL values produce the expected results.
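
For readers unfamiliar with why NULL handling is easy to get wrong, the sketch below shows standard three-valued NULL semantics using Python's built-in `sqlite3` module. It uses SQLite rather than Ignite and does not reproduce the bug itself; the table and values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, val INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 10), (2, None), (3, 30)])

# NULL = NULL evaluates to NULL (unknown), not true.
null_equals_null = conn.execute("SELECT NULL = NULL").fetchone()[0]

# A comparison with NULL never matches, so row 2 is absent from both results.
equal_rows = conn.execute("SELECT id FROM t WHERE val = 10").fetchall()
not_equal_rows = conn.execute("SELECT id FROM t WHERE val <> 10").fetchall()

# NOT IN against a list that contains NULL matches no rows at all.
not_in_rows = conn.execute("SELECT id FROM t WHERE id NOT IN (1, NULL)").fetchall()

# IS NULL is the explicit test for missing values.
null_rows = conn.execute("SELECT id FROM t WHERE val IS NULL").fetchall()
conn.close()

print(null_equals_null, equal_rows, not_equal_rows, not_in_rows, null_rows)
```

Queries that combine such predicates with joins and subqueries are where an engine's NULL handling is most likely to diverge from the standard.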

What’s Next?

Continued Bug Reproduction:

Focus on reproducing more scalability bugs

Documentation of Challenges:

Break down specific challenges encountered during attempts to reproduce scalability bugs.
Categorize challenges, including technical complexities, environmental dependencies, and lack of documentation in bug reports.

Finalizing Project Deliverables:

Package artifacts using Jupyter notebook scripts for convenient replay of investigation steps.
Upload the package to Trovi for replayable artifacts, enabling other researchers to easily reproduce scalability bugs for our benchmark applications.

Conclusion

The ScaleRep project has made significant strides in reproducing and benchmarking scalability bugs in large-scale distributed systems. By successfully reproducing and documenting scalability bugs, we are contributing valuable insights to the research community, aiding in the development of more robust distributed systems. The protocols and methodologies devised in this project will serve as valuable tools for researchers exploring similar issues.

Stay tuned for more updates as we continue to tackle scalability bugs and improve the reliability and performance of large-scale distributed systems.

Midway Through GSoC

Wed, 31 Jul 2024 00:00:00 +0000

Hello everyone! I’m Joel Tony, and I’m excited to share my progress update on the Drishti project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I’ve been diving deep into the world of I/O visualization for scientific applications, and I’m thrilled to tell you about the strides we’ve made.

What is Drishti?

For those unfamiliar with Drishti, it’s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using Darshan, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application’s I/O patterns, making it easier to identify bottlenecks and optimize performance.

Progress and Challenges

Export Directory Feature

One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn’t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.
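
The idea behind an export directory can be sketched as follows. The function name, parameters, and file suffix are hypothetical and do not reflect Drishti's actual interface; the sketch only shows falling back to the log's own directory when no export location is given.

```python
from pathlib import Path


def resolve_output_path(log_path, export_dir=None, suffix=".html"):
    """Place the report in export_dir if given, otherwise next to the log file."""
    log_path = Path(log_path)
    target_dir = Path(export_dir) if export_dir else log_path.parent
    return target_dir / (log_path.stem + suffix)


# Hypothetical paths: the log lives in a read-only location.
default_path = resolve_output_path("/readonly/logs/run1.darshan")
exported_path = resolve_output_path("/readonly/logs/run1.darshan", "/tmp/reports")
print(default_path, exported_path)
```

With an explicit export directory, the report no longer has to be written next to a log that may sit on a read-only file system.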

CI Improvements and Cross-Project Dependencies

While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don’t adequately test the interactions between different branches of these interconnected tools. This is an area we’ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.

Refactoring for Multi-File Support

The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti’s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing performance.

The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:

Better separation of computation and condition checking
Easier parallelization of processing multiple traces
Finer-grained profiling of performance bottlenecks
More flexibility in data manipulation and memory management
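
As a rough illustration of the data class idea, the sketch below keeps derived metrics (computation) apart from the rule that turns them into an insight (condition checking). The class name, its fields, and the 10% threshold are hypothetical and are not taken from Drishti’s codebase.

```python
from dataclasses import dataclass


@dataclass
class TraceStats:
    """Hypothetical per-trace counters; not Drishti's actual schema."""
    total_ops: int
    small_ops: int  # operations below some size cutoff

    @property
    def small_ratio(self) -> float:
        # Computation: derive a metric from the raw counters.
        return self.small_ops / self.total_ops if self.total_ops else 0.0


def has_too_many_small_ops(stats: TraceStats, threshold: float = 0.1) -> bool:
    # Condition checking: kept separate so it can be tested and profiled alone.
    return stats.small_ratio > threshold


def merge(traces: list[TraceStats]) -> TraceStats:
    # One possible aggregation across several traces: sum the counters.
    return TraceStats(
        total_ops=sum(t.total_ops for t in traces),
        small_ops=sum(t.small_ops for t in traces),
    )
```

With this split, each trace can be parsed into its own TraceStats (possibly in parallel) and merged afterwards, while the checks run once on the merged result.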

Learnings and Skills Gained

Through this process, I’ve gained valuable insights into:

Refactoring large codebases
Understanding and improving cross-project dependencies
Implementing data classes in Python for better code organization
Balancing performance with code readability and maintainability

Next Steps

As I move forward with the project, my focus will be on:

Adding unit tests for individual methods to ensure functionality
Exploring alternative data frame implementations like Polars for better performance
Developing aggregation methods for different types of data across multiple Darshan files
Optimizing memory usage and computational efficiency for large datasets

Conclusion

Working on Drishti has been an incredible learning experience. I’ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I’m excited about the potential impact of these improvements on the scientific community’s ability to optimize their applications’ I/O performance.

I’m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!

If you have any questions or would like to learn more about the project, feel free to reach out to me. Let’s keep pushing the boundaries of scientific computing together!

Streaming into the Future: Adding Real-Time Processing to FasTensor

Tue, 30 Jul 2024 00:00:00 +0000

Hey there, HPC enthusiasts and fellow coders! I’m excited to share my progress on this summer’s Google Summer of Code project under UC OSPO’s FasTensor. Here’s a glimpse into how we’re pushing the boundaries of real-time data processing.

The Big Picture: FasTensor and HPC Challenges

First, a quick refresher: FasTensor is our go-to tool for handling dense arrays in scientific computing. It tackles three major HPC challenges:

Optimizing computations
Distributing data efficiently
Balancing workloads across computing cores

FasTensor excels at these tasks, especially when dealing with data that has structural locality - a common feature in scientific computing. Here, the Stencil computations come in handy, capturing data locality for operations like solving partial differential equations in physical simulations.

The Mission: Bringing FasTensor into Real-Time

While FasTensor is great at processing existing data, the next frontier is handling live data streams from scientific instruments and sensors. That’s where my GSoC project comes in: adding stream processing capabilities to FasTensor.

Progress Highlights:

Building a Stream Simulator

We’ve created FTstream, a nifty tool that simulates data streams. It can generate streams of various sizes and intervals, pushing the limits of what your disk can handle. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe! This tool is crucial because many scientific instruments, from particle accelerators to radio telescopes, generate massive amounts of data at incredible speeds, and we need to be able to simulate that. For context, that’s faster than a 10 MP RGB camera shooting at 35 frames per second, which generates data at ~1 GiB/s.
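
The camera comparison can be checked with a quick back-of-the-envelope calculation. The sketch assumes uncompressed frames at 3 bytes per pixel (8 bits per RGB channel), which is an assumption on my part and not a detail of any specific camera.

```python
GIB = 2**30  # bytes in one GiB


def camera_rate_gib_s(megapixels: float, bytes_per_pixel: int, fps: float) -> float:
    """Raw data rate of an uncompressed video stream in GiB/s."""
    return megapixels * 1e6 * bytes_per_pixel * fps / GIB


rate = camera_rate_gib_s(10, 3, 35)
print(f"{rate:.2f} GiB/s")  # prints "0.98 GiB/s"
```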

Optimizing I/O Strategies

We’ve been experimenting with various I/O approaches to optimize high-speed data stream handling.

Exploring Streaming Semantics

We’re investigating various ways to express and execute stream transformations, to ensure that FasTensor can handle a wide range of streaming computations.

Developing I/O Drivers

We’ve developed two new I/O drivers based on Linux AIO and MPI-IO to ingest incoming data smoothly and maintain stream consistency.

What’s Next?

Putting It All Together

We’re in the final stretch of integrating all these components into a seamless stream processing system.

Rigorous Testing

We’ll push our stream processing to its limits, simulating diverse data flows to ensure rock-solid performance in any scientific setting.

HPC Environment Validation

The ultimate test will be running our new streaming capabilities in real HPC environments, checking how they perform with different I/O setups and computing paradigms.

Wrapping Up

This summer has been a whirlwind of coding, testing, and learning. We’re making significant strides in bringing real-time processing capabilities to FasTensor, which could open up exciting new possibilities in scientific computing and data analysis. Stay tuned for more updates as we finalize this feature. If you’re interested in the nitty-gritty technical details or want to check out the code, feel free to reach out or check our project repository. Happy coding, and may your computations be ever faster!

Mid-term Blog: Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon

Mon, 29 Jul 2024 00:00:00 +0000

Introduction

Hello everyone, I’m Archit from India, an undergraduate student at the Indian Institute of Technology, Banaras Hindu University, IIT (BHU), Varanasi. As part of the Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon project, my proposal, under the mentorship of Raül Sirvent, aims to develop a service that facilitates the automated replication of COMPSs experiments within the Chameleon infrastructure.

About the project:

The project proposes to create a service that can take a COMPSs crate (an artifact adhering to the RO-Crate specification) and, through analysis of the provided metadata, construct a Chameleon-compatible image for replicating the experiment on the testbed.

Progress

It has been more than six weeks since the ReproducibilityService project began, and significant progress has been made. You can test the actual service from my GitHub repository: ReproducibilityService. Let’s break down what the ReproducibilityService is capable of doing now:

Support for Reproducing Basic COMPSs Experiments: The RS program is now fully capable of reproducing basic COMPSs experiments with no third-party dependencies on any device with the COMPSs Runtime installed. Here’s how it works:
- Getting the Crate: The RS program can accept the COMPSs workflow from the user either as a path to the crate or as a link from WorkflowHub. In either case, it creates a sub-directory for further execution named reproducibility_service_{timestamp} and stores the workflow as reproducibility_service_{timestamp}/Workflow.
- Address Mapping: The ro-crate contains compss_submission_command_line.txt, which is the command originally used to execute the experiment. This command may include many paths such as runcompss flag1 flag2 ... flagn <main_workflow_file.py> input1 input2 ... inputn output. The RS program maps all the paths for <main_workflow_file.py> input1 input2 ... inputn output to paths inside the machine where we want to reproduce the experiment. The flags are dropped as they may be device-specific, and the service asks the user for any new flags they want to add to the COMPSs runtime.
- Verifying Files: Before reproducing an experiment, it’s crucial to check whether the inputs or outputs have been tampered with. The RS program cross-verifies the contentSize from the ro-crate-metadata.json and generates warnings in case of any abnormalities.
- Error Logging: In case of any problems during execution, the std_out and std_err are stored inside reproducibility_service_{timestamp}/log.
- Results: If the experiment generates any results, the RS program stores them inside reproducibility_service_{timestamp}/Results. If the user also requests provenance for the workflow, the newly generated ro-crate is stored there as well.
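
The file verification step above can be sketched as follows. This is a minimal illustration and not the ReproducibilityService code: it assumes the metadata has the usual "@graph" list of entities, that local files are identified by a relative "@id", and that remote entities (with a URL as "@id") are skipped. The function name is hypothetical.

```python
import json
import os


def size_mismatches(crate_dir: str) -> list[str]:
    """Return warnings for local files whose size differs from contentSize."""
    with open(os.path.join(crate_dir, "ro-crate-metadata.json")) as fh:
        graph = json.load(fh).get("@graph", [])
    warnings = []
    for entity in graph:
        # Skip entities without a recorded size and remote (URL) entities.
        if "contentSize" not in entity or "://" in entity["@id"]:
            continue
        path = os.path.join(crate_dir, entity["@id"])
        if not os.path.isfile(path):
            warnings.append(f"{entity['@id']}: missing")
        elif os.path.getsize(path) != int(entity["contentSize"]):
            warnings.append(f"{entity['@id']}: size differs from metadata")
    return warnings
```

A size check only catches changes that alter the file length; a checksum comparison would be stricter, at the cost of reading every file.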

Support for Reproducing Remote Datasets: If a remote dataset is specified inside the metadata file, the RS program fetches the dataset from the specified link using wget, stores the remote dataset inside the crate, and updates the path in the new command line it generates.

Challenges and End-Term Goals

Support for DATA_PERSISTENCE_FALSE: The RS program still needs to support crates with dataPersistence set to false. After weeks of brainstorming ideas on how to implement this, we recently concluded that since the majority of DATA_PERSISTENCE_FALSE crates are run on SLURM clusters, and the dataset that must be fetched in such a case is located somewhere inside the cluster, the RS program will support this case for such clusters. Currently, I am working with the Nord3v2 cluster to further enhance the functionality of ReproducibilityService.
Chameleon Cluster Setup: I have made some progress towards creating a new COMPSs 3.3 Appliance on Chameleon to test the service. However, creating the cluster setup script needed for the service to run on a COMPSs 3.3.1 cluster to execute large experiments has been challenging.
Integrating with COMPSs Repository: After completing the support for dataPersistence false cases, we aim to launch this service as a tool inside the COMPSs repository. This will be a significant milestone in my developer journey as it will be the first real-world project I have worked on, and I hope everything goes smoothly.

Stay tuned for the next blog!!

Enhancing h5bench with HDF5 Compression Capability

Sat, 27 Jul 2024 00:00:00 +0000

Introduction

As part of the h5bench project, my proposal, Enhancing h5bench with HDF5 Compression Capability, under the mentorship of Dr. Jean Luca Bez and Dr. Suren Byna, aims to allow users of h5bench to incorporate compression features in their simulations by creating custom benchmarks with common scientific lossless and lossy compression algorithms such as SZ, SZ3, ZFP, and GZIP.

The problem I am trying to solve is to implement multiple data compression algorithms in the h5bench core access patterns through HDF5 filters. This capability should give users the flexibility to configure the parameters and methods of compression applied to their datasets according to their specific needs and preferences. My solution primarily involves using the user-defined HDF5 filter mechanism to implement lossless and lossy compression algorithms, such as ZFP, SZ, and cuSZ. Throughout the process, I will deliver one C source file implementing the compression configuration settings, one C source file implementing the lossless and lossy algorithms, a set of performance reports before and after data compression in CSV and standard output files, and technical documentation on the h5bench user manual website.

Midterm Blog

This summer, after completing my junior year, I was honored to have the opportunity to work with Dr. Jean Luca Bez and Dr. Suren Byna on h5bench, an open-source benchmarking project designed to simulate running sync/async HDF5 I/O on HPC machines. This post mostly covers what I have learned, produced, and planned, along with my thoughts, over the first six weeks.

First of all, let’s define some of the terms here. HDF5 stands for Hierarchical Data Format version 5. Unlike other data storage formats (JSON, CSV, XML…), HDF5 is not only a container that manages data much like a file system, but also a powerful library that lets you perform I/O (input/output) operations between memory and file. One of the reasons this tool is commonly used by HPC applications is that it also supports MPI-IO, the parallel I/O interface of the MPI standard (you can think of it as the parallel version of POSIX I/O). With exabytes of data and frequent access for analysis in scientific studies, HDF5 is well suited to the job. Essentially, h5bench is software that tests the hardware’s performance through HDF5 (it also provides other benchmark kernels such as AMReX, E3SM-IO, MACSio, and openPMD-api, but my job focuses on using vanilla HDF5 I/O).

So, what have I done so far? First, my job is to allow users to tune input parameters regarding data compression, and to make sure h5bench prints accurate benchmark results with the intended compression algorithm applied to their datasets. h5bench’s frontend is written in Python; it takes a JSON file from the user and parses it into a CFG configuration file that is later read by the backend, which is written in C. I created a new enum and made it possible for users to select one of a range of compression algorithms (SZ3, ZFP, LZ4, GZIP, and other pre-defined algorithms). I also made it possible to apply these algorithms to the datasets, so the .h5 file (an HDF5 file) contains chunks of compressed data after multiple H5Dwrite calls.
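
The frontend step described above, turning a user’s JSON into a flat configuration file while checking the requested compression algorithm, could look roughly like the sketch below. The COMPRESS_FILTER key, the KEY=VALUE layout, and the set of accepted names are assumptions made for illustration; the real names live in h5bench’s own sources.

```python
import json

# Hypothetical set of accepted names; the real list is defined by h5bench.
ALLOWED = {"GZIP", "SZ3", "ZFP", "LZ4"}


def json_to_cfg(json_text: str) -> str:
    """Flatten a JSON object into KEY=VALUE lines after validating the filter."""
    params = json.loads(json_text)
    name = str(params.get("COMPRESS_FILTER", "")).upper()
    if name and name not in ALLOWED:
        raise ValueError(f"unsupported compression filter: {name}")
    return "\n".join(f"{key}={value}" for key, value in params.items())


print(json_to_cfg('{"COMPRESS_FILTER": "SZ3", "NUM_DIMS": "1"}'))
```

Rejecting unknown names in the frontend means a typo fails fast, before the C backend is launched on a compute node.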

Next, the challenges and gains. Throughout the first six weeks, 30% of the time was spent on understanding the newest version of h5bench and HDF5 by reading through the C source code and documentation, and asking many dumb questions to my mentors (thanks to their patience and great answers :D). Writing code is fairly easy once I really understand what the program is doing. By that I mean you have to understand every line in almost all functions and how each and every variable changes. 40% of the time was spent debugging and testing the compression algorithms, mainly SZ3. Making the code behave correctly is another level of difficulty. Most of the issues resulted from failing to configure the application and dependent libraries correctly. Without the necessary macros enabled during the build process, features like the compression filter plugin will not run. As I was also new to CMake and HPC environments, I learned that environment variables set in one session are not carried over to a new session, even after you have requested a compute node. Besides getting used to the standard build sequence (“cmake ..”, “make”, “make install”), I also learned to use “ccmake ..” to examine the build configuration flags. For the rest of the time, I learned more about parallel computing, HDF5, and compression algorithms by reading papers and documentation. A lot of notes were taken (I must say a good note-taking system is a game changer). Last but not least, I also spent time syncing online and offline with my mentors to discuss problems. Without their help, I could never have made it this far.

My next phase will tackle the following problems:

Test applying filter with other compression algorithms, and with different dimension layout of the dataset
Add decompression capability
Allow users to tune the auxiliary parameters that control the behavior of a given compression filter, i.e. the cd_nelmts and cd_values[] arguments that are currently passed as 0 and NULL in H5Pset_filter(COMPRESS_INFO.dcpl_id, H5Z_FILTER_SZ3, H5Z_FLAG_MANDATORY, 0, NULL);
Print additional benchmark results to indicate what and how the compression filter is applied, and the compression ratio

Final Blog: FetchPipe: Data Science Pipeline for ML-based Prefetching

Sat, 27 Jul 2024 00:00:00 +0000

Introduction

Hello, I’m Peiran Qin, a CS student at the University of Chicago. This summer I worked on the project FetchPipe: Data Science Pipeline for ML-based Prefetching under the mentorship of Prof. Haryadi S. Gunawi. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction policies and ML-based prefetchers under this simulator. Through this project, we make the following contributions and share several insights with the community:

We built a simulator to evaluate various prefetchers under a unified framework, using production-level traces from Alibaba, Microsoft Research, and Tencent.
Through the evaluation, we discovered several downsides of existing heuristic-based prefetchers.
We drew several insights that can guide the design of future prefetchers.

Methodology

In the first half of the SoR project, I mainly focused on building the I/O prefetcher simulator. The simulator should mimic real OS-level prefetching as closely as possible. First, we developed a mechanism that mimics users sending I/O requests to the underlying system. Then, we simulated page division and memory management inside the system. Finally, we designed a sleep-based mechanism to mimic the I/O latency of backend storage. The resulting system can simulate the data path of I/O requests and prefetching in real systems, and collect crucial metrics such as hit rate, total prefetched data, bandwidth usage, prefetch accuracy, total cache evictions, etc.
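
To make the metrics concrete, here is a heavily simplified sketch of such a simulator. It is not the FetchPipe code: it assumes page-granular requests, an LRU cache, and a fixed-depth sequential readahead, and it leaves out the sleep-based latency model entirely.

```python
from collections import OrderedDict


def simulate(trace, cache_size, readahead=0):
    """Replay page numbers through an LRU cache with optional readahead."""
    cache = OrderedDict()  # page -> True while prefetched but not yet used
    hits = prefetched = useful = 0

    def insert(page, by_prefetch):
        if page in cache:
            return False
        if len(cache) >= cache_size:
            cache.popitem(last=False)  # evict the least recently used page
        cache[page] = by_prefetch
        return True

    for page in trace:
        if page in cache:
            hits += 1
            if cache[page]:  # first demand access to a prefetched page
                useful += 1
                cache[page] = False
            cache.move_to_end(page)
        else:
            insert(page, False)
        for nxt in range(page + 1, page + 1 + readahead):
            if insert(nxt, True):
                prefetched += 1

    return {
        "hit_rate": hits / len(trace),
        "prefetched": prefetched,
        "prefetch_accuracy": useful / prefetched if prefetched else 0.0,
    }


print(simulate(list(range(10)), cache_size=8, readahead=2))
```

On a purely sequential trace the readahead turns every access after the first into a hit, while the last two prefetched pages are never used and lower the prefetch accuracy.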

In the second half of the SoR project, I concentrated on evaluating existing prefetchers. First, we surveyed the existing state-of-the-art prefetchers and divided them into two categories: (1) heuristic-based prefetchers and (2) ML-based prefetchers. Next, for each category, we picked several representative prefetchers and implemented them within our simulator. Then, we evaluated those prefetchers using over 600 production-level traces from Alibaba, Tencent, and Microsoft Research. Finally, we analyzed the performance of those prefetchers and discovered some interesting insights that might guide the design of future prefetchers.

Finally, building on the achievements of the SoR project, I will continue working on this interesting project with Prof. Haryadi S. Gunawi. We are leveraging our current insights to build an I/O prefetcher that mitigates the downsides of existing prefetchers.

Insights

Based on our experiments on the existing prefetchers, we would like to share the following insights:

Heuristic-based prefetchers, including Linux Readahead and the Stride prefetcher, rely on strict predefined rules and detect only straightforward access patterns. As a result, they are too conservative to recognize increasingly complex access patterns. In real-world applications in particular, sequential accesses are interleaved with random accesses, adding a level of complexity that Linux Readahead and Stride prefetchers struggle to recognize.
Offline learning-based prefetchers learn access patterns by training machine learning models on pre-collected historical accesses. Thanks to the representational power of machine learning, these prefetchers excel at recognizing complex access patterns. However, their effectiveness is constrained by their dependence on the patterns encountered during offline training, making them less adaptable to previously unseen patterns in online scenarios. Moreover, because they do not rely on predefined prefetching rules, offline learning-based prefetchers are more prone to prefetching useless data, which causes cache pollution and extra pressure on backend storage.
We argue that a good prefetcher for today’s complex and changing workloads should have three properties: (1) Complexity-Recognition: the prefetcher should be able to recognize the complex access patterns of a complex workload. (2) Reliability: the prefetcher should reduce the likelihood of prefetching useless data and causing cache pollution. (3) Adaptability: the prefetcher should adapt itself to the changing workload.
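
The first insight can be illustrated with a toy stride detector. The rule used here (predict only when the last two address deltas agree) is a simplified stand-in for a Stride prefetcher, not the implementation evaluated in the project, and the traces are made up.

```python
def stride_predictions(trace):
    """Count correct next-address predictions of a two-delta stride rule."""
    correct = 0
    for i in range(2, len(trace) - 1):
        d1 = trace[i - 1] - trace[i - 2]
        d2 = trace[i] - trace[i - 1]
        # Strict rule: predict only when the last two deltas agree.
        if d1 == d2 and trace[i] + d2 == trace[i + 1]:
            correct += 1
    return correct


sequential = list(range(100, 110))
# The same sequential stream with unrelated accesses interleaved.
noise = [907, 13, 552, 41, 768, 230, 615, 89, 344, 971]
interleaved = [x for pair in zip(sequential, noise) for x in pair]

print(stride_predictions(sequential))   # prints 7
print(stride_predictions(interleaved))  # prints 0
```

The sequential stream is still fully present in the interleaved trace, but the strict rule never fires because consecutive deltas no longer match.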

Future Works

Based on the above insights, we are now designing our own prefetchers that can mitigate the downsides of existing prefetchers. We will make our code public after we finalize our design.

Conclusion

Through the SoR project, I delved into the research area of I/O prefetching by reproducing related works, characterizing their performance, and designing our own prefetcher. We contribute to the community a comprehensive simulator, evaluation results for related prefetchers, and insights that can guide the design of future prefetchers. In the future, I will continue working on prefetching research and keep making contributions.

Mid Term Blog: FetchPipe: Data Science Pipeline for ML-based Prefetching

Sat, 27 Jul 2024 00:00:00 +0000

Introduction

Hello, I’m Peiran Qin, a CS student at the University of Chicago, currently working on the project FetchPipe: Data Science Pipeline for ML-based Prefetching under the mentorship of Prof. Haryadi S. Gunawi. The FetchPipe project focuses on building a unified Python simulator and evaluating existing cache-eviction algorithms and ML-based prefetchers under this simulator.

Motivation

Existing prefetching algorithms can be categorized into (a) heuristic-based methods such as the Linux readahead prefetcher and (b) machine learning-based methods like Long Short-Term Memory (LSTM) models. However, there is a research gap in comprehensively comparing all existing ML solutions, such as Leap and LSTM Prefetcher, under a consistent evaluation setup. To ensure the fairness of evaluations, it is essential to integrate all baselines and our prefetcher into a homogeneous evaluation environment. Additionally, there is a need to evaluate cache eviction algorithms under prefetching scenarios.

Therefore, in this project, we aim to build a fair simulator, deploy state-of-the-art prefetchers and cache eviction algorithms onto this platform, and then evaluate them using comprehensive metrics. The state-of-the-art prefetchers we consider include Pythia (MICRO'21), SGDP (arXiv), and the Markov-Chain prefetcher. For cache eviction algorithms, we consider S3FIFO (SOSP'23) and SIEVE (NSDI'24). Our focus is on implementing these algorithms on our simulator and evaluating their performance using block storage datasets from Alibaba, Tencent, and MSR. Besides evaluating the prefetchers and eviction algorithms individually, we also aim to combine prefetchers with cache eviction algorithms to test overall performance.

Current Progress

In the past one and a half months, I have focused on (1) implementing our Python simulator and (2) deploying state-of-the-art prefetchers and cache eviction algorithms on this simulator. The implementation phase is now complete. The detailed progress is as follows:

The Python simulator for evaluating both ML-based and heuristic-based prefetchers, as well as cache eviction algorithms, is done.
Collection of evaluation metrics, such as hit rate, total prefetched data, prefetch overhead, and prefetch accuracy, is implemented in the simulator.
The SGDP, Pythia, and Markov-Chain prefetchers are deployed on the simulator. SGDP is a graph neural network-based prefetcher, and Pythia is a reinforcement learning-based prefetcher.
State-of-the-art heuristic-based eviction algorithms are implemented in the simulator, including S3FIFO and SIEVE.
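
For readers unfamiliar with Markov-chain prefetching, the sketch below shows the general idea in its simplest form: a first-order model that counts observed block-to-block transitions and predicts the most frequent successor. It is a generic illustration and not the implementation deployed in the simulator; the class and method names are hypothetical.

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """First-order Markov predictor over block addresses (illustrative only)."""

    def __init__(self):
        self.transitions = defaultdict(Counter)  # block -> successor counts
        self.last = None

    def access(self, block):
        """Record the access and return the predicted next block, or None."""
        if self.last is not None:
            self.transitions[self.last][block] += 1
        self.last = block
        successors = self.transitions.get(block)
        if not successors:
            return None  # block never seen with a successor yet
        return successors.most_common(1)[0][0]


p = MarkovPrefetcher()
print([p.access(b) for b in [1, 2, 3, 1, 2, 3, 1]])
# prints [None, None, None, 2, 3, 1, 2]
```

Because the table is updated on every access, this variant learns online, but it can only predict transitions it has already seen at least once.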

With the simulator and state-of-the-art ML-based prefetchers and eviction algorithms in place, the next steps are to (1) organize a large-scale dataset (including over 600 traces from real storage servers) for testing performance and (2) evaluate the implemented prefetchers and eviction algorithms on this dataset. Finally, I will analyze the evaluation results and provide insights from the experimental outcomes. For the ML-based prefetchers, I will analyze both ML-related metrics such as accuracy and F1-score, and system metrics such as hit rate and various overheads.

Challenges

The biggest challenge is implementing existing prefetchers correctly and fairly. Since some state-of-the-art prefetchers are designed for DRAM prefetching, adapting them for SSD prefetching in the simulator is challenging. Additionally, the lack of source code for some works makes it difficult to reproduce their algorithms accurately based solely on their paper descriptions.

Improving Usability and Performance in cc-snapshot: My Midterm Update

Wed, 24 Jul 2024 00:00:00 +0000

Hi! I’m Zahra Temori, a rising junior studying Computer Science at the University of Delaware. This summer, I’ve had the exciting opportunity to participate in the Chameleon Summer Reproducibility Program, where I’ve been working under the mentorship of Paul Marshall. In this blog post, I’d love to share a midterm update on my project cc-snapshot and highlight what I’ve accomplished so far, what I’ve learned, and what’s coming next. It’s been a challenging but rewarding experience diving into real-world research and contributing to tools that help make science more reproducible!

Project Overview

CC-Snapshot is a powerful tool on the Chameleon testbed that enables users to package their customized environments for reproducibility and experiment replication. In research, reproducibility is essential. It allows scientists to run experiments consistently, share complete setups with others, and avoid environment-related errors. However, the current snapshotting mechanism has limitations that make it unreliable and inefficient, particularly in terms of usability and performance. These issues can slow down workflows and create barriers for users trying to reproduce results. Our goal is to improve both the usability and performance of the cc-snapshot tool. A more user-friendly and optimized system means that users can create and restore snapshots more quickly and easily, without needing to manually rebuild environments, ultimately saving time and improving reliability in scientific computing.

Progress So Far

To structure the work, we divided the project into two main phases:

Improving usability, and
Optimizing performance.

I’ve nearly completed the first phase and have just started working on the second.

Phase One – Usability Improvements

The original version of the cc-snapshot tool had several usability challenges that made it difficult for users to interact with and for developers to maintain. These issues included a rigid interface, lack of flexibility, and limited testing support, all of which made the tool harder to use and extend. To address these, I worked on the following improvements:

Problem: The command-line interface was limited and inflexible. Users couldn’t easily control features or customize behavior, which limited their ability to create snapshots in different scenarios.

Solution: I enhanced the CLI by adding:

A flag to disable automatic updates, giving users more control.
A --dry-run flag to simulate actions before actually running them, which is useful for testing and safety.
Support for a custom source path, allowing snapshots of specific directories. This makes the tool much more useful for testing smaller environments.
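
The three CLI additions can be pictured with the interface sketch below. It is written in Python purely for illustration and says nothing about how cc-snapshot itself is implemented; apart from --dry-run, the flag names (--no-update, --source) are hypothetical placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cc-snapshot")
    # Hypothetical flag name for disabling automatic updates.
    parser.add_argument("--no-update", action="store_true")
    parser.add_argument("--dry-run", action="store_true",
                        help="print the planned actions without executing them")
    # Hypothetical flag name for the custom source path.
    parser.add_argument("--source", default="/")
    return parser


def plan(args: argparse.Namespace) -> list[str]:
    """List the steps a run would take; a dry run prints these and stops."""
    steps = [] if args.no_update else ["update tool"]
    steps.append(f"snapshot {args.source}")
    return steps


args = build_parser().parse_args(["--dry-run", "--source", "/tmp/env"])
if args.dry_run:
    print("\n".join(plan(args)))
```

Building the plan as data first is what makes a dry run cheap: the same list is either printed or executed.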

Problem: The code lacked automated tests. Without tests, developers have to manually verify everything, which is time-consuming and error-prone.

Solution: I implemented a basic test suite and integrated it with GitHub Actions, so the tool is automatically tested on every pull request.

Problem: The tool didn’t follow a modular design. The logic was tightly coupled, making it hard to isolate or extend parts of the code.

Solution: I refactored the code by extracting key functions. This makes the code cleaner, easier to understand, and more maintainable in the long term.

Next Steps – Phase Two: Performance Optimization

After improving the usability of the cc-snapshot tool, the next phase of the project focuses on addressing key performance bottlenecks. Currently, the snapshotting process can be slow and resource-intensive, which makes it less practical for frequent use, especially with large environments.

Problem 1: Slow Image Compression. The current implementation uses the qcow2 image format with zlib compression, which is single-threaded and often inefficient for large disk images. This leads to long snapshot creation times and high CPU usage.

Solution: I will benchmark and compare different compression strategies, specifically:

qcow2 with no compression
qcow2 with zstd compression, which is faster and multi-threaded
raw image format, which has no compression but may benefit from simpler processing

These tests will help determine which method provides the best tradeoff between speed, size, and resource usage.

Problem 2: Suboptimal Storage Backend. Snapshots are currently uploaded to Glance, which can be slow and unreliable. Uploading large images can take several minutes, and this slows down the user workflow.

Solution: I will compare Glance with a potentially faster alternative, the Object Store. Smaller, compressed images may upload significantly faster to the Object Store (for example, 30 seconds vs. 2 minutes). By measuring upload speeds and reliability, I can recommend a better default or optional backend for users.

How I will Measure Performance

To understand the impact of different strategies, I will try to collect detailed metrics across three stages:

Image creation: How long it takes to build the image, depending on compression and format
Image upload: How quickly the snapshot can be transferred to Glance or Object Store
Instance boot time: How fast a new instance can start from that image (compressed formats must be decompressed)

I will run multiple tests for each scenario and record performance metrics like CPU usage, memory usage, disk throughput, and total time for each step. This will help identify the most efficient and practical configuration for real-world use.
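
A small harness along these lines could drive the repeated measurements. This is only a sketch under simplifying assumptions: it times in-memory zlib compression from the Python standard library as a stand-in, whereas the real experiments would time image creation, upload, and boot; the payload and repeat count are arbitrary.

```python
import statistics
import time
import zlib


def benchmark(fn, data: bytes, repeats: int = 3) -> dict:
    """Run fn(data) several times; report median wall time and size ratio."""
    times = []
    out = b""
    for _ in range(repeats):
        start = time.perf_counter()
        out = fn(data)
        times.append(time.perf_counter() - start)
    return {"median_s": statistics.median(times), "ratio": len(data) / len(out)}


payload = b"disk image block " * 50_000  # stand-in for image data
results = {
    "none": benchmark(lambda d: d, payload),
    "zlib-1": benchmark(lambda d: zlib.compress(d, 1), payload),
    "zlib-9": benchmark(lambda d: zlib.compress(d, 9), payload),
}
for name, r in results.items():
    print(f"{name}: {r['median_s']:.4f} s, ratio {r['ratio']:.1f}x")
```

Reporting the median over several runs keeps a single slow outlier from dominating the comparison.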

Conclusion

Addressing the current usability and performance issues in cc-snapshot is essential to improving the overall user experience. By making the tool easier to use, faster, and more flexible, we can support researchers and developers who depend on reproducible computing for their work. So far, I’ve worked on enhancing the tool’s interface, adding testing support, and refactoring the codebase for better maintainability. In the next phase, I’ll be focusing on benchmarking different compression methods, image formats, and storage backends to improve speed and efficiency. These improvements will help make cc-snapshot a more powerful and user-friendly tool for the scientific community.

Stay tuned for the next update and thank you for following my journey!

Halfway Blog: FSA: Benchmarking Fail-Slow Algorithms

Tue, 23 Jul 2024 00:00:00 +0000

Introduction

Hi, I’m Xikang Song, a 2024 SoR contributor to the project, working with mentors Ruidan Li and Kexin Pei. Our FSA-Benchmark project is dedicated to exploring and benchmarking various machine learning models to identify disks at high risk of fail-slow anomalies. We will benchmark a range of machine learning algorithms, from traditional to advanced methods, and compare the results using a comprehensive evaluation system. This will provide a clear view of how machine learning impacts critical error detection in RAID systems.

Motivation

Fail-slow issues in storage systems, where a disk operates at a significantly reduced speed without completely failing, are subtle and can manifest as consistently higher latency compared to peer disks or recurrent abnormal latency spikes. These issues are challenging to detect but can significantly degrade overall system performance over time. Fixed thresholds are ineffective because latency distributions vary across different clusters, leading to thresholds that are either too low or too high, resulting in numerous false alerts or missed detections. Therefore, we are enthusiastic about using machine learning models to analyze disk performance data. Machine learning algorithms can learn the trends in the data, providing better detection capabilities.
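
The weakness of fixed thresholds can be seen in a toy example. The latencies, the 10 ms limit, and the "twice the cluster median" rule are all made up for illustration; the peer-relative rule is a simple baseline, not one of the models benchmarked in this project.

```python
import statistics


def fixed_threshold(latencies: dict, limit_ms: float) -> set:
    """Flag every disk whose latency exceeds one global limit."""
    return {disk for disk, ms in latencies.items() if ms > limit_ms}


def peer_relative(latencies: dict, factor: float = 2.0) -> set:
    """Flag disks whose latency exceeds `factor` times the cluster median."""
    median = statistics.median(latencies.values())
    return {disk for disk, ms in latencies.items() if ms > factor * median}


# d3 is slow relative to its peers, but still below the global limit.
fast_cluster = {"d0": 1.0, "d1": 1.1, "d2": 0.9, "d3": 4.0}
# Uniformly slower hardware with no outlier.
slow_cluster = {"d0": 12.0, "d1": 11.0, "d2": 13.0, "d3": 12.5}

print(fixed_threshold(fast_cluster, 10.0))  # set(): misses d3
print(len(fixed_threshold(slow_cluster, 10.0)))  # 4: every disk alerts
print(peer_relative(fast_cluster))  # {'d3'}
print(peer_relative(slow_cluster))  # set()
```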

Current Progress and Challenges

Algorithm Implementation:

Cost-Sensitive Ranking Model: Inspired by the paper “Improving Service Availability of Cloud Systems by Predicting Disk Error” presented at the USENIX ATC ’18 conference, this model ranks disks based on fail-slow risk.
Multi-Prediction Models: Drawing from “Improving Storage System Reliability with Proactive Error Prediction” presented at the USENIX ATC ’17 conference, this approach uses multiple traditional machine learning models to evaluate disk health using diverse features. Various models were tested, with the Random Forest classifier proving most effective.
LSTM Model: This model employs Long Short-Term Memory (LSTM) networks, trained on the first day’s data for each cluster and evaluated on data spanning all days. It captures temporal dependencies to accurately predict fail-slow anomalies over time.

Comprehensive Evaluation:

Collected outputs from all algorithms on Chameleon for Perseus data A to Y (25 clusters).
Parsed the outputs through a comprehensive evaluation system, recording the true/false positives/negatives.
Plotted heat maps to show precision and recall with different look-back days and alert threshold settings.
Compared the performance across different clusters to draw conclusions.
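
As a rough illustration of the evaluation step above, the sketch below derives true/false positives/negatives, precision, and recall for one cluster from a set of flagged disks and a set of ground-truth fail-slow disks. The disk names and the `Confusion` helper are hypothetical and not taken from the project's code.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0


def confusion(predicted: set, actual: set, all_disks: set) -> Confusion:
    """Count true/false positives/negatives for one cluster."""
    return Confusion(
        tp=len(predicted & actual),              # flagged and truly fail-slow
        fp=len(predicted - actual),              # flagged but healthy
        fn=len(actual - predicted),              # fail-slow but missed
        tn=len(all_disks - predicted - actual),  # healthy and not flagged
    )


def precision(c: Confusion) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Confusion) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


# Toy data (hypothetical disk IDs).
all_disks = {"d0", "d1", "d2", "d3", "d4"}
actual = {"d1", "d3"}
predicted = {"d1", "d2"}
c = confusion(predicted, actual, all_disks)
print(c, precision(c), recall(c))
```

Computing these values once per (look-back days, alert threshold) pair gives the grid of numbers that a precision or recall heat map would display.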

Packaging Code:

Packaged all the code into a Trovi Jupyter notebook, including the Chameleon server setup, to provide clear steps for running the code and reproducing the experiments. All algorithm testing and result parsing can be easily done here.

Challenges

Initially, I was unsure how to evaluate the performance of different algorithms. Ruidan Li provided comprehensive guidance on collecting all the results uniformly and parsing them to gather true/false positives/negatives. This approach enabled us to derive meaningful metrics and plot heatmaps for precision and recall. I learned the scientific method of benchmarking performance, and I am grateful for the guidance.

Future Steps

Further Investigation of Advanced Algorithms

We plan to explore advanced algorithms such as PatchTST. This will involve systematically collecting outputs and conducting comprehensive benchmarking to assess their performance in identifying fail-slow anomalies.

Transition to Large Language Models (LLMs)

Recognizing the limitations of traditional machine learning methods, we intend to transition to utilizing Large Language Models (LLMs). LLMs have demonstrated superior capabilities in understanding complex patterns and making accurate predictions. We anticipate that incorporating LLMs into our analysis will enhance our ability to detect and predict fail-slow anomalies more accurately, leading to better overall system reliability.

Exploring Throttling Bugs in HDFS: Reproducing Developer Fixes

Mon, 22 Jul 2024 00:00:00 +0000

Scalability is a critical concern for large-scale distributed systems like the Hadoop Distributed File System (HDFS). Throttling bugs, which affect the system’s ability to manage data transfer rates effectively, can lead to performance issues and system instability. In my recent work, I focused on reproducing the effects of two specific throttling bugs in HDFS, which were fixed by developers. This blog provides an overview of these bugs and the process of reproducing their effects to validate the fixes.

HDFS-17087: Missing Throttler in DataXceiver#readBlock

One of the throttling bugs I explored was HDFS-17087. The DataXceiver#readBlock function in HDFS lacked a throttler, resulting in unregulated data reads. This absence could lead to potential performance degradation under heavy loads. The developer fixed this issue by adding a throttler to regulate the data transfer rate. In my work, I reproduced the bug and observed the system’s behavior both before and after applying the developer’s patch. The results showed a significant improvement in stability and performance post-fix.
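
To make the idea of a throttler concrete, here is a minimal sketch of a rate limiter that tells the caller how long to wait so that the average transfer rate stays under a cap. It is an illustration only: the `Throttler` class and its `delay_needed` method are hypothetical and are not the HDFS implementation or the developer's patch.

```python
class Throttler:
    """Caps the average throughput at `bytes_per_sec` (illustrative only)."""

    def __init__(self, bytes_per_sec: float) -> None:
        self.bytes_per_sec = bytes_per_sec
        self.total_bytes = 0

    def delay_needed(self, num_bytes: int, elapsed_s: float) -> float:
        """Return how many seconds the caller should sleep after sending
        `num_bytes`, given `elapsed_s` seconds since the transfer started."""
        self.total_bytes += num_bytes
        # Earliest time at which this many bytes is allowed to have been sent.
        earliest_allowed_s = self.total_bytes / self.bytes_per_sec
        return max(0.0, earliest_allowed_s - elapsed_s)


throttler = Throttler(bytes_per_sec=1000)
# 500 bytes after 0.1 s is too fast for 1000 B/s, so the caller must wait.
first_delay = throttler.delay_needed(500, elapsed_s=0.1)
# 1000 bytes in total after 2.0 s is under the cap, so no wait is needed.
second_delay = throttler.delay_needed(500, elapsed_s=2.0)
print(first_delay, second_delay)
```

A read path without such a check simply sends data as fast as the disk and network allow, which is the unregulated behavior described above.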

HDFS-17216: Incorrect Data Rate Calculation

Another crucial bug was HDFS-17216. The issue stemmed from the use of integer division in the getBytesPerSec function, which caused incorrect speed calculations and failed to trigger the throttle, resulting in overspeed. The developer addressed this by switching from integer to float for calculating the elapsed time, ensuring accurate speed measurements. I reproduced the conditions that highlighted the bug’s effects and compared the system’s performance with and without the fix. The post-fix results confirmed that the throttling mechanism worked correctly, effectively preventing overspeed.
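
The effect of integer division on a rate calculation can be shown with a small sketch. This is a simplified illustration and not the actual HDFS source: the function names, the fallback used when the truncated elapsed time is zero, and the numbers are all assumptions made for the example.

```python
def bytes_per_sec_int(bytes_read: int, elapsed_ms: int) -> int:
    # Integer division truncates sub-second durations to 0 seconds.
    elapsed_s = elapsed_ms // 1000
    if elapsed_s == 0:
        # Assumed fallback: report the byte count as if one second had passed.
        return bytes_read
    return bytes_read // elapsed_s


def bytes_per_sec_float(bytes_read: int, elapsed_ms: int) -> int:
    # Floating-point elapsed time keeps the sub-second part.
    elapsed_s = elapsed_ms / 1000.0
    if elapsed_s == 0.0:
        return bytes_read
    return int(bytes_read / elapsed_s)


LIMIT = 100_000_000                      # hypothetical cap: 100 MB/s
bytes_read, elapsed_ms = 50_000_000, 250  # 50 MB read in a quarter second

buggy_rate = bytes_per_sec_int(bytes_read, elapsed_ms)
fixed_rate = bytes_per_sec_float(bytes_read, elapsed_ms)
print(buggy_rate > LIMIT, fixed_rate > LIMIT)
```

In this example the integer version reports a rate below the cap, so a throttle check based on it would never fire, while the floating-point version reports a rate above the cap.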

Conclusion

Reproducing these throttling bugs and validating the developer fixes was a vital step in understanding their impact on HDFS’s scalability. The improvements observed in system stability and performance underscore the importance of accurate throttling mechanisms. This work contributes to the broader effort of maintaining robust and scalable distributed systems, ensuring they can handle increasing loads efficiently.

Trovi redesign process and low fidelity prototype in Figma

Mon, 22 Jul 2024 00:00:00 +0000

Hello! My name is Alicia Esquivel Morel, and I’m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I’m working on a project to improve user experience reproducibility through a redesign of TROVI, as part of the Summer of Reproducibility (SoR) program. I’m excited to be working with two fabulous mentors, Kate Keahey and Mark Powers.

Research Reproducibility with a TROVI Redesign

As researchers, we constantly face challenges replicating experiments due to limitations in current tools. TROVI, a platform designed to facilitate experiment replication, can be hindered by hard-to-follow interfaces and difficulties integrating code and data. This leads to confusion and frustration.

My SoR project tackles these issues by redesigning TROVI to enhance user experience reproducibility. Imagine a user-friendly platform where uploading code, sharing data, and collaborating with colleagues becomes easy and straightforward.

The Redesign’s Goals

Enhanced User Experience: Inspired by user-friendly platforms like Google Colab, we’ll simplify TROVI’s interface for intuitive navigation and ease of use.
Uploads and Sharing: Uploading code and data, as well as collaborating with researchers, are key goals. Integration with platforms like GitHub will further streamline collaboration.
Continuous Improvement: A built-in feedback loop will allow users to provide input and suggestions, ensuring TROVI constantly evolves based on user needs.

Progress I have made so far

The first stage of my project began with conducting User Experience (UX) research and identifying user requirements for TROVI. I then conducted a literature review on reproducibility platforms to learn about efficient methodologies and platforms for reproducibility. This helped establish a clearer project scope. Additionally, I analyzed TROVI end-user feedback to understand redesign needs.

In summary, during the first weeks of the project, I focused on research and requirements gathering, including the literature review on state-of-the-art reproducibility platforms. Before the midterm assessment, my work also involved the redesign process, prioritizing improved usability and user experience. I designed wireframes following the requirements and user feedback and later translated them into low-fidelity prototypes. Front-end and back-end considerations were made, such as selecting a front-end framework (Vue.js) and a collaborative design tool (Figma).

What do I plan to do over the next weeks?

During the next two weeks, I will address challenges encountered in the design process and make the necessary adjustments to ensure the success of the next steps of the project. A higher-fidelity prototype will be completed, including connections between the different objects and frames. This will facilitate the creation of a front-end with multiple flows in the prototype. Additionally, this will provide a preview of the end-user experience through the design process, without requiring the back-end to be functional or connected yet. I’m also investigating design tool API integrations to access TROVI’s APIs. This will give us the ability to access and isolate any TROVI artifact properties associated with it.

I’m halfway through the redesign process. Next steps will include the integration of both the backend and frontend components to create a cohesive and functional system. We will also facilitate initial user interactions and testing to gather valuable feedback and ensure that the system meets the needs and expectations of end users.

In addition, as I progress, my focus will shift towards enhancing the user experience and refining the final product based on the feedback received. The final two weeks of the program will be dedicated to this critical phase, where I will implement user experience techniques and conduct thorough testing to polish the product. This period will involve close analysis and iteration to address any issues and optimize functionality.

By the end of the program, I aim to deliver a functional and user-friendly product that not only meets the initial project goals but also exceeds user expectations.

Stay tuned to see how TROVI is built for reproducible research!!

Data Engineering and Automated Evaluation for OpenROAD's Chat Assistant: Midterm Update

Sun, 21 Jul 2024 00:00:00 +0000

Hello everyone! We’ve reached the halfway point of our Google Summer of Code 2024 journey, and it’s time for an update on our project to build a conversational chat assistant for OpenROAD. Under the guidance of our mentors, Indira Iyer and Jack Luar, we’re making significant strides in enhancing OpenROAD’s user support capabilities.

Project Focus

My project focuses on two crucial aspects of our chat assistant:

Data Engineering: Ensuring our assistant has access to comprehensive and relevant information.
Evaluation: Developing robust methods to assess and improve the assistant’s performance.

The ultimate goal is to create a more responsive and accurate chat assistant capable of aiding users with troubleshooting, installation, and general queries about OpenROAD. I’m working in tandem with Palaniappan R, who is developing the RAG architecture for our assistant.

Progress

Since our initial deployment, I’ve been concentrating on implementing automated evaluation systems for our RAG architecture. We’ve developed two primary evaluation methods:

Basic Abbreviation Evaluation

This method assesses the model’s ability to accurately identify and explain common abbreviations used within the OpenROAD community. It ensures that our assistant can effectively communicate using domain-specific terminology.

LLM Judge-Based Evaluation

For this more comprehensive evaluation, we:

Prepared a dataset of question-answer pairs relevant to OpenROAD.
Queried our model with these questions to generate answers.
Employed LLMs (including GPT-4o and Gemini 1.5 Flash) to act as judges.
Evaluated our model’s responses against ground truth answers.
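
The four steps above can be sketched as a small evaluation loop. This is a minimal sketch under stated assumptions: `answer_fn` stands in for the RAG pipeline, `judge_fn` stands in for a call to a judge LLM, and both are replaced here by toy functions so the example runs offline. None of these names come from the project's code.

```python
from typing import Callable, Iterable, Tuple


def judge_accuracy(
    qa_pairs: Iterable[Tuple[str, str]],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Fraction of model answers the judge accepts as matching ground truth."""
    verdicts = []
    for question, ground_truth in qa_pairs:
        candidate = answer_fn(question)  # query the model
        verdicts.append(judge_fn(question, ground_truth, candidate))
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Toy stand-ins (hypothetical): a real setup would call the assistant and
# an LLM judge here instead.
def toy_answer(question: str) -> str:
    known = {"What does DRC stand for?": "DRC stands for Design Rule Check."}
    return known.get(question, "I don't know.")


def toy_judge(question: str, ground_truth: str, candidate: str) -> bool:
    return ground_truth.lower() in candidate.lower()


pairs = [
    ("What does DRC stand for?", "Design Rule Check"),
    ("What does CTS stand for?", "Clock Tree Synthesis"),
]
score = judge_accuracy(pairs, toy_answer, toy_judge)
print(score)
```

Keeping the judge behind a plain callable makes it easy to swap judges and compare their verdicts on the same question-answer pairs.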

Here’s a glimpse of our early benchmark results:

Exploratory Data Analysis (EDA) on GitHub OpenROAD issues

To gather more data, I performed Exploratory Data Analysis (EDA) on GitHub OpenROAD issues using GitHub’s GraphQL API. This allowed us to:

Filter data based on parameters such as:
- Minimum number of comments
- Date range
- Mentioned PRs
- Open or closed status
Structure the data, focusing on issues tagged with Build, Query, Installation, and Runtime.
Process the data into JSONL format with key fields including:
- url: URL of the GitHub issue
- id: Unique issue number
- title: Issue title
- author: Username of the issue creator
- description: Initial issue description
- content: Array of messages related to the issue
- category: General category of the issue
- subcategory: More specific category of the issue
- tool: Relevant tools or components
- date: Issue creation timestamp
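
For reference, a record with the fields listed above could be written to and read back from JSONL as sketched below. All values are placeholders invented for the example, including the URL, the subcategory, and the tool name.

```python
import io
import json

# One issue per line; every value here is a made-up placeholder.
record = {
    "url": "https://example.com/issues/1",
    "id": 1,
    "title": "Build fails on a fresh checkout",
    "author": "example-user",
    "description": "Initial description of the problem.",
    "content": ["First reply.", "Second reply."],
    "category": "Build",
    "subcategory": "Compilation error",
    "tool": "example-tool",
    "date": "2024-07-01T12:00:00Z",
}


def write_jsonl(records, fp) -> None:
    """Write one JSON object per line."""
    for rec in records:
        fp.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(fp) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in fp if line.strip()]


buffer = io.StringIO()  # in-memory file so the sketch has no side effects
write_jsonl([record], buffer)
buffer.seek(0)
loaded = read_jsonl(buffer)
print(loaded[0]["title"])
```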

After curating this dataset, I was able to run an analysis of the OpenROAD GitHub issues and summarize the issue categories in a pie chart.

Looking Ahead

As we move into the second half of the GSOC period, our plans include:

Incorporating GitHub Discussions data into our knowledge base.
Utilizing this expanded dataset to enhance our RAG architecture.
Continually refining and improving our model’s performance based on evaluation results.

We’re excited about the progress we’ve made and look forward to delivering an even more capable and helpful chat assistant for the OpenROAD community. Stay tuned for more updates as we continue this exciting journey!

Halfway Through SoR24: Building a Scalable Performance Benchmarking Tool for Genomics Workflows

Sun, 21 Jul 2024 00:00:00 +0000

Project Overview

Hi! I’m Martin Putra, and I’m working on the “Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster” project under the supervision of In Kee Kim. We are building GenScale, a scalable benchmarking tool for genomics workloads that leverages an industrial-grade cluster manager and monitoring system. GenScale will allow us to generate performance data under a setup that is representative of large-scale production settings. Ultimately, we hope GenScale and the datasets it produces will catalyze engagement between the computer systems and bioinformatics communities, thus accelerating the pace of discovery in both fields.

Progress and Challenges

We have built a prototype using Kubernetes as the cluster manager and Prometheus as the monitoring system. In its current state, the prototype can support an arbitrary number of compute nodes, owing to Kubernetes’ notable scaling capability. This provides a suitable environment for small- to mid-scale experiments. We leverage ChameleonCloud to provide the necessary computational and reproducibility infrastructure. The monitoring system supports cluster-level, node-level, and container-level metrics collection and failure detection. We integrated Grafana dashboards for visualizations.

The prototype also supports the execution of user-defined workflows. During the design process, we considered integrating one of the existing workflow execution systems, such as cwltool, Nextflow, or Cromwell. Each system has its own pros and cons when placed within the context of how we envision GenScale. However, we ultimately decided to build our own workflow execution system in order to provide maximum flexibility for the capabilities we plan to add in the future. For example, we believe it will be interesting to study how hardware heterogeneity affects the performance of each application in the workflow (a well-known workflow scheduling problem). Studying the problem requires the capability to schedule execution on specific machines. In addition, if we want to study contention, we may need to execute on machines which are currently running specific workflows, too. While there are ways to do this with a stack of an existing workflow execution system plus Kubernetes, we believe it will be hugely simplified if we build our own workflow execution system.


Figure 1. Proportion of execution time for DNA Alignment applications, executed on Chameleon’s cascadelake_r node with 1500MB paired-end input. y-axis: proportion of application’s exec. time out of the whole workflow’s exec. time, x-axis: top 10 applications accounting for 97% exec. time, sorted by proportion. Other applications are aggregated.

We confirmed GenScale’s capability to produce useful data by executing a DNA alignment workflow and capturing its runtime resource usage. We use Genomics Data Commons’ (GDC) DNA alignment workflow as reference, which has a total of 27 applications covering quality checks, read trimming, the alignment itself, indexing, and various metrics collection. We wrote our own simplified version of the workflow by first analyzing the execution time and resource usage of each application, then choosing the 10 applications which represent 97% of the workflow execution time. We took into account that containerization is the de facto standard for workflow execution among the bioinformatics community. Thus, we packaged each application as its own separate container, then hosted their Dockerfiles and containers in a private GitHub Container Registry (GHCR). We plan to make them public in the future. Our monitoring system is able to show resource usage in real time. We also built sidecar containers which use pidstat to generate a CSV of core, memory, and storage utilization throughout each workflow’s execution. This will allow easier analysis and data sharing for GenScale’s users.
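
The application selection described above can be sketched as a greedy pick of the longest-running applications until a target share of the total execution time is covered. The application names and timings below are invented for illustration and are not GenScale measurements.

```python
def top_apps_by_coverage(exec_seconds: dict, coverage: float) -> list:
    """Pick the longest-running applications until `coverage` (0..1) of the
    total execution time is accounted for."""
    total = sum(exec_seconds.values())
    chosen, covered = [], 0.0
    ranked = sorted(exec_seconds.items(), key=lambda kv: kv[1], reverse=True)
    for app, secs in ranked:
        if covered / total >= coverage:
            break
        chosen.append(app)
        covered += secs
    return chosen


# Hypothetical per-application execution times in seconds.
times = {"align": 700, "sort": 200, "index": 75, "qc": 15, "metrics": 10}
selected = top_apps_by_coverage(times, coverage=0.97)
print(selected)
```

With these toy numbers the first three applications account for 97.5% of the total, so the remaining two would be aggregated as "other".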


Figure 2. CPU utilization pattern of BWA, Picard’s CollectWGSMetrics, and Picard’s ValidateSamFile collected by GenScale. y-axis: (num. cores) x 100%, x-axis: time elapsed in seconds.

One technical challenge is automating the creation of the Kubernetes cluster and keeping it alive. We believe GenScale’s users would be interested in the performance of workflows under dynamic cluster sizes, either due to intentional scaling or machine failures. While the current prototype supports creating a cluster with an arbitrary number of nodes, there are still steps which require a reboot when adding nodes. This means cluster creation and horizontal scaling are not fully automated yet. Keeping a cluster alive is also expensive. Since we use ChameleonCloud as our testbed, we have a choice of either keeping the cluster alive at the cost of significant service unit (SU) usage, or saving SUs by terminating our leases at the cost of rebuilding the cluster from scratch later. We chose a middle ground by keeping only Kubernetes’ control plane alive. The approach works well so far.

Next Steps

For the remaining weeks, we plan to work on the second workflow, namely RNA Alignment. We would also like to add simple user interfaces if time permits. Finally, we plan to package GenScale’s source code, container images, and sample benchmark results for the open-source community. We look forward to the second half of Summer of Reproducibility!

Midterm Blog: ML in Detecting and Addressing System Drift

Sun, 21 Jul 2024 00:00:00 +0000

Hello! I’m Joanna! Over the past month, I have been contributing to the ML in Detecting and Addressing System Drift project under the mentorship of Ray Andrew Sinurat and Sandeep Madireddy. My project aims to design a pipeline to evaluate drift detection algorithms on system traces. The goal is to characterize different drifts, understand how they affect model performance, and evaluate the performance of state-of-the-art (SOTA) drift detection algorithms.

Progress

Over the past month, I’ve primarily been constructing a data drift dataset from the Tencent I/O block trace, which includes both drift and non-drift data. By combining offline drift detection algorithms such as Maximum Mean Discrepancy, Cramér-von Mises, and Kolmogorov-Smirnov, I am developing a dataset that contains segments with and without drifts for features such as IOPS (Input/Output Operations Per Second), read/write size ratio, write size, and other relevant performance metrics. The diagrams below illustrate the data segments identified with and without drifts, respectively.
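
As an illustration of how an offline test such as Kolmogorov-Smirnov can label segments, the sketch below computes the two-sample KS statistic between consecutive windows of a series and marks a window as drifted when the statistic exceeds a cutoff. The window size, the cutoff, and the IOPS values are assumptions for the example. A real pipeline would typically use a p-value rather than a raw cutoff, and this is not the project's actual labeling code.

```python
from bisect import bisect_right


def ks_statistic(a, b) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def label_segments(series, window: int, threshold: float) -> list:
    """Compare each window with the previous one; True marks a drift."""
    labels = []
    for start in range(window, len(series) - window + 1, window):
        prev = series[start - window:start]
        cur = series[start:start + window]
        labels.append((start, ks_statistic(prev, cur) > threshold))
    return labels


# Toy IOPS trace: two stable windows followed by a shifted one.
iops = [100, 102, 98, 101, 99, 103, 97, 100, 250, 260, 255, 248]
labels = label_segments(iops, window=4, threshold=0.5)
print(labels)
```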

In addition to constructing the datasets, I have begun evaluating some online drift detection algorithms and designing metrics to assess their performance. I have tested the performance of online drift detection algorithms such as Online Maximum Mean Discrepancy and Online Cramér-von Mises under various settings, including different window lengths and sensitivity levels. The following diagrams illustrate the drift points detected for the IOPS feature under these different settings.

Next Steps

Here are my plans for the next month:

Complete the experiments on data drift and generate improved visualizations to summarize the performance of these online drift detection algorithms, including their overhead and accuracy over time.
Characterize drifts by identifying the types of drifts that lead to model performance degradation.
Evaluate drift detection algorithms in the context of concept drifts.

Stay tuned for my future updates on this project!

Enabling VAA Execution: Environment and VAA Preparation and/or Reproducibility for Dynamic Bandwidth Allocation (CONCIERGE)

Sat, 20 Jul 2024 00:00:00 +0000

Hi there!

I am Rafael Sinjunatha Wulangsih, a Telecommunication Engineering graduate from the Bandung Institute of Technology (ITB), Bandung, Indonesia. I’m currently contributing to the “EdgeRep: Reproducing and benchmarking edge analytic systems” project under the mentorship of Yuyang (Roy) Huang and Prof. Junchen Jiang. You can find more details about the project proposal here.

This project addresses the challenges posed by the massive deployment of edge devices, such as traffic or security cameras, in smart cities and other environments. In the previous Edgebench project, the team proposed a solution to dynamically allocate bandwidth and compute resources to video analytic applications (VAAs) running on edge devices. However, that project was limited to a single VAA, which may not represent the diverse applications running on edge devices. Therefore, the main goal of this project, “EdgeRep,” is to diversify the VAAs running on edge devices while utilizing a solution similar to that of the Edgebench project. EdgeRep aims to reproduce state-of-the-art self-adaptive VAAs (with seven candidates) and maintain self-adaptation in these video analytics pipelines. We will implement it ourselves if the video analytics applications do not support self-adaptation.

Halfway Through GSOC: Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Sat, 20 Jul 2024 00:00:00 +0000

Hello, I’m Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University. I’m currently working on the AIIO / Graph Neural Network project under the guidance of Bin Dong and Suren Byna. Our project focuses on enhancing the AIIO framework to automatically diagnose I/O performance bottlenecks in high-performance computing (HPC) systems using Graph Neural Networks (GNNs).

Project Overview

Our primary goal is to tackle the persistent issue of I/O bottlenecks in HPC applications. Identifying these bottlenecks manually is often labor-intensive and prone to errors. By integrating GNNs into the AIIO framework, we aim to create an automated solution that can diagnose these bottlenecks with high accuracy, ultimately improving the efficiency and reliability of HPC systems.

Progress and Challenges

Over the past few weeks, my work has been centered on developing a robust data pre-processing pipeline. This pipeline is crucial for converting raw I/O log data into a graph format suitable for GNN analysis. The data pre-processing involves extracting relevant features from Darshan I/O logs, which include job-related information and performance metrics. One of the main challenges has been dealing with the heterogeneity and sparsity of the data, which can affect the accuracy of our models. To address this, we’ve focused on using correlation analysis to identify and select the most relevant features, ensuring that the dataset is well-structured and informative for GNN processing.
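
One simple form of correlation-based feature selection is sketched below: compute the Pearson correlation between each counter and a target performance metric, then keep the counters whose absolute correlation passes a cutoff. Selecting against a single target, the cutoff value, and the toy numbers are all assumptions. The counter names are used only as labels.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation; returns 0.0 when either column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns: dict, target, min_abs_corr: float) -> list:
    """Keep the columns whose |correlation with target| passes the cutoff."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= min_abs_corr
    ]


# Toy data: four jobs, three counters, one target metric.
target = [10, 20, 30, 40]
columns = {
    "POSIX_BYTES_WRITTEN": [1, 2, 3, 4],  # moves with the target
    "POSIX_OPENS": [5, 5, 5, 5],          # constant, carries no signal
    "POSIX_SEEKS": [3, 1, 1, 3],          # uncorrelated with the target
}
kept = select_features(columns, target, min_abs_corr=0.5)
print(kept)
```

Constant columns, which are common in sparse logs, get a correlation of zero here and are dropped.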

We’ve also started constructing the GNN model. The model is designed to capture the complex relationships between different I/O operations and their impact on system performance. This involves defining nodes and edges in the graph that represent job IDs, counter types, and their values. We explored different graph structures, including those that focus on counter types and those that incorporate more detailed information. While more detailed graphs offer better accuracy, they also require more computational resources.
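
A minimal sketch of one possible graph layout is shown below: one node type for jobs, one for counter types, and a weighted edge from a job to a counter carrying the counter's value. The dictionary layout, the choice to skip zero-valued counters, and the sample values are assumptions for illustration, not the structure used in AIIO.

```python
def build_hetero_graph(jobs: dict) -> dict:
    """Build a two-node-type graph from {job_id: {counter_name: value}}."""
    job_nodes = sorted(jobs)
    counter_nodes = sorted({c for counters in jobs.values() for c in counters})
    job_index = {job: i for i, job in enumerate(job_nodes)}
    counter_index = {c: i for i, c in enumerate(counter_nodes)}

    edges = []  # (job node index, counter node index, edge weight)
    for job, counters in jobs.items():
        for counter, value in counters.items():
            if value != 0:  # assumption: zero counters add no edge
                edges.append((job_index[job], counter_index[counter], float(value)))

    return {"job_nodes": job_nodes, "counter_nodes": counter_nodes, "edges": edges}


# Toy input with made-up values.
jobs = {
    "job_1": {"POSIX_READS": 10, "POSIX_WRITES": 0},
    "job_2": {"POSIX_WRITES": 4},
}
graph = build_hetero_graph(jobs)
print(graph["edges"])
```

A more detailed structure would add further node or edge types, which increases the size of the graph and therefore the compute cost noted above.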

Current Achievements

Data Pre-processing Pipeline: We have successfully developed and tested the pipeline to transform Darshan I/O logs into graph-structured data. This was a significant milestone, as it sets the foundation for all subsequent GNN modeling efforts.
GNN Model Construction: The initial version of our GNN model has been implemented. This model is now capable of learning from the graph data and making predictions about I/O performance bottlenecks.
Correlation Analysis for Graph Structure Design: We have used correlation analysis on the dataset to understand the relationships between I/O counters. This analysis has been instrumental in designing a more effective graph structure, helping to better capture the dependencies and interactions critical for accurate performance diagnosis.

Training for Different Graph Structures: We are currently training our model using various graph structures to determine the most effective configuration for accurate I/O performance diagnosis. This ongoing process aims to refine our approach and improve the model’s predictive accuracy.

Next Steps

Looking ahead, we plan to focus on several key areas:

Refinement and Testing: We’ll continue refining the GNN model, focusing on improving its accuracy and efficiency. This includes experimenting with different graph structures and training techniques.
SHAP Analysis: To enhance the interpretability of our model, we’ll incorporate SHAP (SHapley Additive exPlanations) values. This will help us understand the contribution of each feature to the model’s predictions, making it easier to identify critical factors in I/O performance.
Documentation and Community Engagement: As we make progress, we’ll document our methods and findings, sharing them with the broader community. This includes contributing to open-source repositories and engaging with other researchers in the field.

This journey has been both challenging and rewarding, and I am grateful for the support and guidance from my mentors and the community. I look forward to sharing more updates as we continue to advance this exciting project.

Hardware Hierarchical Dynamical Systems

Sat, 20 Jul 2024 00:00:00 +0000

Hi everyone! I am Ujjwal Shekhar, a Computer Engineering student at the International Institute of Information Technology - Hyderabad. I am excited to share my current progress on the project titled “Hardware Hierarchical Dynamical Systems” as part of the Open Source Research Experience (OSRE) program and Google Summer of Code. I am working with my mentors, Jose Renau and Sakshi Garg, on this project.

Project Overview

It is not uncommon for hardware compilers to handle code whose size runs into the millions. We aim to improve the efficiency of the tree data structure used to represent the Abstract Syntax Tree (AST) of the input program. The tree data structure is optimized for typical AST traversal and queries. Some queries that are made to this tree are much more frequent than others.

Thus, the goal of this project is to be able to optimize the tree for frequent queries while still providing support for other infrequent queries. We use Google Bench to benchmark the tree for scalability and performance and expect it to outperform the current version of the tree. Finally, the new version of the tree will be integrated into the LiveHD core repository.

Progress and Challenges

Over the past month and a half, I have successfully finished working on the add/append methods of the tree. Moreover, I have finished writing the iterators on the tree too. There are preliminary tests already in place and the HHDS repository now has a working Bazel build system.

As shown in the figure, we can see that the tree went from storing pointers to everything that it could to only storing pointers to the nodes that are absolutely necessary. Moreover, by not maintaining multiple levels in the tree, we have been able to reduce the memory footprint of the tree. This is a significant improvement from the LHtree that was being used earlier.

Furthermore, we have also been able to improve the cache friendliness of each node of the tree. By realizing that most of the time, new children are added soon after the parent is added, we have been able to store the children in a contiguous memory location whenever possible, or access them using a shorter delta from the parent node. This has significantly improved the cache friendliness of the tree by allowing the packing of the book-keeping of up to 8 children in a single 512-bit word. A 512-bit chunk is 64 bytes, which matches the cache-line size of most current CPUs, so an aligned chunk can be fetched in a single cache-line access.

Highlights

Finished working on the add/append methods of the tree.
Finished writing the iterators on the tree.
Preliminary tests are in place.
HHDS repository now has a working Bazel build system.

Challenges

Working out a new plan: The initial plan was to use a flattening policy to optimize the tree for frequent queries. However, this plan has been revised: we flattened the tree not with a tour-based flattening policy but by still storing pointers to various nodes in the tree. This has been done to ensure that the tree is still able to support infrequent queries.
Benchmarking: The benchmarking of the tree is still in progress. I am working on creating a benchmarking suite that will be able to test the tree for scalability and performance. This will allow future developers to test the tree for performance and scalability after they make changes.

Next Steps

From here, a lot of testing and benchmarking is still left to be done. Moreover, we need to add the delete methods and make sure that the integration with the LiveHD core repository is smooth. The next steps involve:

Adding the delete methods to the tree.
Benchmarking the tree for scalability and performance.
Ensuring that the syntax of the tree is in line with the LiveHD core repository.
Integrating the tree into the LiveHD core repository.
Adding documentation to the tree.
Integrating the testing of the tree into the LiveHD testing suite.

Conclusions

My experience so far has been amazing. I have been able to work on a project that is at the intersection of hardware and software. Moreover, I have been able to work with a team that is very supportive and has been able to guide me through the project. I am looking forward to the next steps and am excited to see the final version of the tree in the LiveHD core repository.

Acknowledgements

I would like to thank my mentors, Jose Renau and Sakshi Garg for their guidance and support throughout the project. It would not have been possible without their help.

Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream

Sat, 20 Jul 2024 00:00:00 +0000

Hello there! I’m Acheme and I’m thrilled to share the progress on my project, “Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream” under the mentorship of Joaquin Chung and Flavio Castro under the SciStream project.

Project Overview

This project aims to develop SciStream-bench, a set of benchmarks and artifacts designed to precisely evaluate the performance of scientific streaming applications across diverse traffic patterns when running over the SciStream framework.

Progress

One of the first steps in the project was consulting SciStream team members at Argonne to identify use cases in scientific streaming applications and the traffic profiles they typically represent. The goal was to simulate these profiles using traffic generator tools and appropriate configuration of network resources on the FABRIC/Chameleon testbed. The following traffic profiles were identified to cover many use cases, including one of ESnet’s broad categorizations in integrated research workflows, “The Time-Sensitive Pattern”:

Throughput intensive startup
Intermittent burst of traffic for a duration of time
Constant rate traffic
Latency sensitive

Since data streaming applications have some unique requirements for optimum performance, the following metrics were selected as important for testing streaming performance.

Latency
Jitter
Packet loss / message loss
Throughput

Subsequently, about seventeen open-source traffic generator applications were identified and compared to find a few that can generate our defined traffic profiles and expose the desired performance metrics. We ultimately settled on iperf3 and pvaPy (a scientific streaming application developed at Argonne National Lab).

So far, the first set of benchmarking tools has been developed, using iperf3 as the traffic generator with constant-rate and intermittent-burst profiles. The tools generate traffic, collect the metrics that iperf3 exposes (including throughput, jitter, and datagram loss), and save them to a CSV file for further analysis. A Jupyter notebook is used to set up a FABRIC slice and configure a four-node experiment suitable for benchmarking the SciStream base architecture. After running the experiments on the FABRIC nodes and collecting the results in a CSV file, cells in the Jupyter notebook analyze the data. The analysis includes the average, minimum, maximum, and standard deviation of each metric.
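
The analysis step can be sketched with the standard library alone. The column names (`throughput_mbps`, `jitter_ms`, `lost_percent`) and the sample rows are hypothetical, since the actual CSV layout produced by the tools is not shown here.

```python
import csv
import io
import statistics


def summarize(csv_text: str, metrics: list) -> dict:
    """Return mean/min/max/stdev for each metric column in the CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for metric in metrics:
        values = [float(row[metric]) for row in rows]
        summary[metric] = {
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
            # Sample standard deviation needs at least two points.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Made-up results from three runs.
sample = """throughput_mbps,jitter_ms,lost_percent
900,0.10,0.0
950,0.20,0.5
1000,0.30,1.0
"""
stats = summarize(sample, ["throughput_mbps", "jitter_ms", "lost_percent"])
print(stats["throughput_mbps"])
```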

Findings

From the experiments conducted so far, the findings are as follows:

We could not properly simulate some of the traffic profiles initially defined. For example, simulating a latency-sensitive traffic profile requires the ability to set timeouts in iperf3, which is not available at the moment.
It is not straightforward to implement SciStream on the Chameleon testbed at the moment.
iperf3 does not expose a latency metric, and its jitter computation is suspect.

Next Steps

Having developed the iperf3-based benchmarking and analysis tools, I will focus next on pvaPy:

Fully develop traffic generator and metric collection tools for pvaPy that cover the defined traffic profiles and expose the chosen metrics
Perform initial experiments, as was done for iperf3
Repeat both the iperf3- and pvaPy-based benchmarks in multiple scenarios (LAN, METRO, WAN), compare performance, and explain the results.

Stay tuned for my final blog as I present deeper results and insights!

Architecture Updates - LLM Assistant for OpenROAD

Fri, 19 Jul 2024 00:00:00 +0000

Hi again! I’m Palaniappan R, a GSoC contributor working on the OpenROAD chat assistant project under the mentorship of Indira Iyer and Jack Luar. My project aims to build an LLM-powered chat assistant designed to provide seamless access to existing online resources, thereby reducing support overhead. Over the past month, I’ve been collaborating with Aviral Kaintura on data engineering to deliver on our common project goal of an OpenROAD assistant and an open-EDA dataset that promotes further research and collaboration.

Progress

The retrieval architecture is at the heart of any retrieval-augmented generation (RAG) setup. Our current setup employs a hybrid-search technique, combining a traditional keyword search method with more advanced vector search methods. As illustrated in the diagram, we combine a simple semantic search, a Maximal Marginal Relevance (MMR) search and a text-based BM25 ranking technique to build our hybrid retriever.

flowchart LR
    id0([Query]) --> id1
    id1([Vectorstore]) --- id2([Semantic Retriever])
    id1([Vectorstore]) --- id3([MMR Retriever])
    id1([Vectorstore]) --- id4([BM25 Retriever])
    id2([Semantic Retriever]) -- Retrieved Docs ---> id5([Reranking])
    id3([MMR Retriever]) -- Retrieved Docs ---> id5([Reranking])
    id4([BM25 Retriever]) -- Retrieved Docs ---> id5([Reranking])
    id5([Reranking]) ---> id6(top-n docs)

Upon receiving a query, relevant documents are sourced from each retriever, resulting in a broad set of results. We feed these results into a cross-encoder re-ranker model to get the top-n documents with maximum relevance.
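The pool-then-rerank idea can be sketched in a few lines. This is only an illustration under stated assumptions: the two retrievers and the `overlap_score` function are toy stand-ins (hypothetical names) for the semantic/MMR/BM25 retrievers and the cross-encoder model, and the documents are made up.

```python
def hybrid_retrieve(query, retrievers, score, top_n=3):
    """Pool documents from several retrievers, then rerank by a relevance score."""
    pooled = []
    seen = set()
    for retrieve in retrievers:
        for doc in retrieve(query):
            # The same document may come back from several retrievers; keep one copy.
            if doc not in seen:
                seen.add(doc)
                pooled.append(doc)
    # Highest score first; sorted() is stable, so ties keep their pooled order.
    ranked = sorted(pooled, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:top_n]


def overlap_score(query, doc):
    """Toy stand-in for a cross-encoder: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words) if query_words else 0.0


DOCS = [
    "global placement command options",
    "how to install OpenROAD from source",
    "detailed routing command reference",
]


def keyword_retriever(query):
    """Stand-in for a keyword (BM25-style) retriever."""
    words = set(query.lower().split())
    return [doc for doc in DOCS if words & set(doc.lower().split())]


def first_two_retriever(query):
    """Stand-in for a vector-store retriever; simply returns the first two docs."""
    return DOCS[:2]


top = hybrid_retrieve(
    "install OpenROAD", [keyword_retriever, first_two_retriever], overlap_score, top_n=2
)
print(top)
```

Pooling first and scoring once keeps the reranker as the single place where relevance is decided, no matter how many retrievers feed it.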

After building the retriever, we utilized the LangGraph framework to develop a stateful, multi-agent workflow tailored to our use case. This allows flexibility in servicing a diverse set of user questions in an efficient and accurate manner, given the sparse nature of our dataset.

Our current dataset can be broadly classified into the following categories:

OpenROAD Documentation
OpenROAD-flow-scripts Documentation
OpenSTA Documentation
OpenROAD Manpages

These data sources are embedded into separate FAISS vector databases using open-source embedding models (we’ve been working on fine-tuning an embedding model for better retrieval accuracy). The hybrid search retrievers are then applied to these vector databases, creating internal tools that can be queried by our LLM as needed. Each tool has access to different data sources in various domains. For instance, the retrieve_cmds tool selectively has access to information detailing the multiple commands in the OpenROAD framework, while the retrieve_install tool deals with installation-related documentation. As depicted in the flowchart, a routing LLM call classifies the input query and forwards it to the appropriate retriever tool. Relevant documents are then sent back to the LLM for response generation.

graph TD
    __start__ --> router_agent
    router_agent -.-> retrieve_cmds
    router_agent -.-> retrieve_general
    router_agent -.-> retrieve_install
    router_agent -.-> retrieve_opensta
    retrieve_cmds --> generate
    retrieve_general --> generate
    retrieve_install --> generate
    retrieve_opensta --> generate
    generate --> __end__
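The route-retrieve-generate flow in the graph can be mimicked with plain functions. In this sketch the tool names come from the graph, but everything else is an assumption: `route` uses simple keyword rules in place of the routing LLM call, the tools return placeholder documents, and `generate` is a stub rather than a real LLM response. It does not use LangGraph.

```python
def route(query):
    """Stand-in for the routing LLM call: choose a tool name from keywords."""
    text = query.lower()
    if "install" in text or "build" in text:
        return "retrieve_install"
    if "opensta" in text or "timing" in text:
        return "retrieve_opensta"
    if "command" in text or "flag" in text:
        return "retrieve_cmds"
    return "retrieve_general"


def make_tool(name):
    """Create a placeholder retriever tool that tags its output with its own name."""
    def tool(query):
        return [f"[{name}] document relevant to: {query}"]
    return tool


TOOLS = {
    name: make_tool(name)
    for name in ("retrieve_cmds", "retrieve_general", "retrieve_install", "retrieve_opensta")
}


def generate(query, docs):
    """Stub for the response-generation step."""
    return f"Answer to {query!r} using {len(docs)} retrieved document(s)."


def answer(query):
    """Run one pass through the graph: router -> chosen tool -> generate."""
    tool_name = route(query)
    docs = TOOLS[tool_name](query)
    return tool_name, generate(query, docs)


print(answer("How do I install OpenROAD?"))
```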

Feel free to try out our chat assistant here. Instructions to set up and run our chatbot can be found here.

Here’s an example of our chatbot in action.

Future Plans

In the upcoming weeks, we aim to enhance our dataset by incorporating actionable information filtered from GitHub issues and discussions. We’ll be adding support to keep track of the conversation history as well.

Stay tuned for more updates!

Reproducibility in Data Visualization

Fri, 19 Jul 2024 00:00:00 +0000

Hello everyone!

Initial Approach and Challenges

I began my work by comparing original visualizations with reproduced ones using OpenCV for pixel-level comparison. This method helped highlight structural differences but also brought to light some challenges. Different versions of libraries rendered visualizations slightly differently, causing minor positional changes that didn’t affect the overall message but were still flagged as discrepancies.
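To illustrate why a small positional change gets flagged, here is a minimal sketch of pixel-level comparison. It uses plain nested lists of grayscale values instead of OpenCV so it stays self-contained; the tiny "image" and the helper names are made up for illustration.

```python
def changed_pixels(img_a, img_b, threshold=0):
    """Return (row, col) positions where two equally sized grayscale images differ."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    return [
        (r, c)
        for r, (row_a, row_b) in enumerate(zip(img_a, img_b))
        for c, (a, b) in enumerate(zip(row_a, row_b))
        if abs(a - b) > threshold
    ]


def shift_right(img, background=255):
    """Shift every row one pixel to the right, filling with the background value."""
    return [[background] + row[:-1] for row in img]


# A white 4x5 image with a short dark bar in column 1.
original = [
    [255, 255, 255, 255, 255],
    [255, 0, 255, 255, 255],
    [255, 0, 255, 255, 255],
    [255, 255, 255, 255, 255],
]

# The "reproduction" shows the same bar, rendered one pixel further right.
reproduced = shift_right(original)

diff = changed_pixels(original, reproduced)
print(diff)
```

The bar is identical in both images, yet every pixel it used to cover and every pixel it now covers is reported as a difference.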

To address this, I experimented with machine learning models like VGG16, ResNet, and Detectron2. These models are excellent for general image recognition but fell short for our specific needs with charts and visualizations. The results were not as accurate as I had hoped, primarily because these models aren’t tailored to handle the unique characteristics of data visualizations.

Shifting Focus to Chart-Specific Models

Recognizing the limitations of general ML models, I shifted my focus to chart-specific models like ChartQA, ChartOCR, and ChartReader. These models are designed to understand and summarize chart data, making them more suitable for our goal of comparing visualizations based on the information they convey.

Generating Visualization Variations and Understanding Human Perception

Another exciting development in my work has been generating different versions of visualizations. This will allow me to create a survey to collect human categorization of visualizations. By understanding how people perceive differences, whether in outliers, shapes, data points, or colors, we can gain insights into which parameters impact human interpretation of visualizations.

Next Steps

Moving forward, I’ll continue to delve into chart-specific models to refine our comparison techniques. Additionally, the survey will provide valuable data on human perception, which can be used to improve our automated comparison methods. By combining these approaches, I hope to create a robust framework for reliable and reproducible data visualizations.

I’m thrilled about the progress made so far and eager to share more updates with you all. Stay tuned for more insights and developments on this exciting journey!

Reproducibility in Data Visualization

Thu, 18 Jul 2024 00:00:00 +0000

Introduction

Hello! My name is Arya Sarkar, a machine learning engineer and researcher based out of Kolkata, a city in Eastern India dubbed the City of Joy. For the last month and a half I have been working closely with Professor David Koop on the project titled Reproducibility in Data Visualization. I’m thrilled to be able to make my own little mark on this amazing project and aid in exploring solutions to capture visualizations in hopes of making reproducibility easier in this domain.

Progress and Challenges

The last month and a half have mostly been spent exploring the best possible solutions for reproducing STATIC visualizations from local sources and/or the web. We have taken inspiration from existing work in the domain and successfully captured the meta-information required to regenerate visualizations reproducibly. The extracted metadata is saved into the generated .png figure of the visualization, allowing reproducibility as long as you have (a) the original dataset and (b) the generated .png of the visualization. All other information is stored inside the .png file as a JSON object and can be used to regenerate the original image with very high accuracy.
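One way such metadata can travel inside a .png is as a text chunk. The sketch below is a standard-library illustration of that idea, not the project's actual implementation: the `viz-metadata` keyword, the function names, and the example metadata fields are all assumptions. It adds a `tEXt` chunk holding a JSON object just before the final `IEND` chunk and reads it back.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for every chunk that follows the PNG signature."""
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def embed_metadata(png, metadata, keyword="viz-metadata"):
    """Return a copy of the PNG with the metadata stored as JSON in a tEXt chunk."""
    payload = keyword.encode("latin-1") + b"\x00" + json.dumps(metadata).encode("ascii")
    out = [PNG_SIGNATURE]
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"IEND":
            out.append(make_chunk(b"tEXt", payload))  # insert before the end marker
        out.append(make_chunk(chunk_type, data))
    return b"".join(out)


def read_metadata(png, keyword="viz-metadata"):
    """Return the embedded JSON object, or None if no matching chunk exists."""
    for chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, text = data.partition(b"\x00")
            if key == keyword.encode("latin-1"):
                return json.loads(text.decode("ascii"))
    return None


def tiny_png():
    """Create a valid 1x1 grayscale PNG to experiment with."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # filter byte + one black pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


# Hypothetical metadata describing how a figure was produced.
meta = {"plot": "histogram", "column": "sepal_length", "bins": 10}
stamped = embed_metadata(tiny_png(), meta)
print(read_metadata(stamped))
```

Because the pixel data is untouched, the stamped file still opens as a normal image while carrying the recipe needed to redraw it.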

The problem, however, remains with visualizations that involve randomness, such as jitter. Capturing the randomness has not been 100% successful so far, and we are looking into options to ensure the capture of plots that contain randomness.

The following images can be used to highlight some results from our reproducibility experiments: Original Histogram using Matplotlib on the iris dataset:

Reproduced Histogram using metainformation from the original:

The next steps

We have already started looking into ways to capture visualizations from the web, i.e., from platforms such as ObservableHq, and to use these experiments to transition into capturing interactive visualizations from the web.

Capturing user interactions and all states in an interactive visualization can prove very useful, as it is a well-known pain point in the reproducibility community and a challenge that still needs to be solved. My next steps involve working on a solution to capture these interactive visualizations, especially those living on the web, and ensuring their reproducibility.

Halfway Through GSOC: My Experience and Learnings

Thu, 18 Jul 2024 00:00:00 +0000

Hello there! I’m Qianru, and this is my mid-term blog post for the 2024 Google Summer of Code. I am working on the BenchmarkST project, focusing on benchmarking gene imputation methods in spatial transcriptomics. My goal is to create a comprehensive, reproducible platform for evaluating these methods across various datasets and conditions.

In this post, I will share some of the progress I have made so far, the challenges I have faced, and how I overcame them. I will also highlight some specific accomplishments and what I plan to do next.

Achievements:

Developed the Python Package: I created the “Impeller” Python package, which includes tools for downloading example data, processing it, and training models. This package aims to standardize gene imputation tasks in spatial transcriptomics.
Example Data Integration: Successfully integrated various spatial transcriptomics datasets into the package for benchmarking purposes.
Benchmarking Framework: Established a framework for objective comparison of different gene imputation methodologies.

Python Package: Installation and Usage

You can install the package using pip:

pip install Impeller

Download Example Data

from Impeller import download_example_data
download_example_data()

Load and Process Data

from Impeller import load_and_process_example_data
data, val_mask, test_mask, x, original_x = load_and_process_example_data()

Train Model

from Impeller import create_args, train
args = create_args()
test_l1_distance, test_cosine_sim, test_rmse = train(args, data, val_mask, test_mask, x, original_x)
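The three values returned by the training call are named after common imputation metrics. As a rough guide to what such metrics measure, here is a self-contained sketch that computes a mean L1 distance, cosine similarity, and RMSE over the masked (held-out) entries only. This is an assumption about the general idea, not Impeller's implementation; the function name, the averaging choice, and the toy numbers are hypothetical.

```python
import math


def masked_metrics(predicted, original, mask):
    """Compare predicted and original values only where the mask is True."""
    pairs = [
        (p, o)
        for pred_row, orig_row, mask_row in zip(predicted, original, mask)
        for p, o, keep in zip(pred_row, orig_row, mask_row)
        if keep
    ]
    if not pairs:
        raise ValueError("mask selects no entries")
    # Mean absolute error over the held-out entries.
    l1 = sum(abs(p - o) for p, o in pairs) / len(pairs)
    # Root mean squared error over the same entries.
    rmse = math.sqrt(sum((p - o) ** 2 for p, o in pairs) / len(pairs))
    # Cosine similarity between the two vectors of held-out values.
    dot = sum(p * o for p, o in pairs)
    norm = math.sqrt(sum(p * p for p, _ in pairs)) * math.sqrt(sum(o * o for _, o in pairs))
    cosine = dot / norm if norm else 0.0
    return l1, cosine, rmse


# Toy 2x2 expression matrix (rows: spots, columns: genes); made-up values.
original_values = [[1.0, 2.0], [3.0, 4.0]]
imputed_values = [[1.0, 0.0], [3.0, 2.0]]
held_out = [[False, True], [False, True]]  # only the second gene was masked

l1, cosine, rmse = masked_metrics(imputed_values, original_values, held_out)
print(l1, cosine, rmse)
```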

Challenges:

Reproducing the results of various gene imputation methods was not an easy task. I faced several challenges along the way:

Reproducibility Issues: Some methods had incomplete or missing code, making it difficult to reproduce their results accurately.
Lack of Standardized Data: The datasets used by different methods were not standardized, which made integrating them into the package for benchmarking more difficult.
Resource Limitations: Running large-scale experiments required significant computational resources, which posed constraints on the project timeline.

Future Work:

Moving forward, I plan to:

Extend the package’s functionalities to include more datasets and imputation methods.
Enhance the benchmarking framework for more comprehensive evaluations.
Collaborate with other researchers to validate and improve the package’s utility in the bioinformatics community.

I hope you found this update informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and support!

Mid Term Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Thu, 18 Jul 2024 00:00:00 +0000

Introduction

Hello, I’m Lihaowen (Jayce) Zhu, a 2024 SoR contributor for the FEP-Bench project, under the mentorship of Yuyang (Roy) Huang. The FEP-Bench project proposes to address the significant bottlenecks encountered during the data preprocessing phase of machine learning, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the resources required by ML projects.

Motivation

Our research project is set in the context of Deep Neural Networks. To train a DNN, we first need a large amount of data. All raw data must pass through a data preprocessing pipeline, which is specific to each ML task. Typically, in a preprocessing pipeline, the data must be loaded from disk, converted to the correct format, transformed, and augmented before it can be fed into the training stage. In common ML training tasks and datasets, the data preprocessing stage can consume almost 65% of the total training time. However, compared with the fast development of computing hardware including GPUs and TPUs, the speed of data preprocessing pipelines has not improved much and cannot keep up with these hardware innovations, which leads to a bottleneck in the efficiency of Deep Neural Network training.

The bottlenecks can be divided into two categories: the data side and the computation side. The data-side bottleneck is mainly caused by data transfer in the system, including data fetching, I/O-bound operations, the huge size of the data, and complex data formats. The computation-side bottleneck, in contrast, arises during data preprocessing operations and data shuffling. For distributed machine learning training systems, gathering the distributed data can also lead to a computation-side bottleneck.

Current Progress

In order to improve the efficiency of the machine learning preprocessing pipeline, we first need to understand and document the preprocessing workflows commonly used in machine learning, including pipelines of Natural Language Processing, Computer Vision, and Audio datasets. As a result, for the past month, we have built up a collection of common datasets for different machine learning tasks. The dataset types include NLP, CV, Audio, Linear Regression, Video and LiDAR. The machine learning job types are collected based on the dataset types, such as sentiment analysis for NLP, and image classification for CV. The data has either a structured or unstructured format. In addition, our collection contains the following attributes:

Data/Sample size
Typical preprocessing operations
Preprocessing difficulty: hard/easy
Input splittable
Output reusable
CPU/GPU/IO Bound
Dataset and preprocessing links.

By collecting all this data, we can gain an overview of all common preprocessing pipelines in the current machine learning research field, and build up a solid basis for the next phase of our project, which requires hard work on benchmark profiling. For example, for the Audio datasets, we focus on the LibriSpeech dataset. It contains 1000 hours of speech sampled at 16kHz, making it one of the largest publicly available datasets for speech recognition tasks. The typical preprocessing steps of the LibriSpeech dataset include feature extraction, label to integer conversion, and padding.
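Two of the preprocessing steps named above, label-to-integer conversion and padding, are easy to show in miniature. The sketch below assumes character-level labels and a padding value of 0, both of which are illustrative choices rather than details of any particular LibriSpeech pipeline; feature extraction is left out because it depends on an audio library.

```python
def build_vocab(transcripts):
    """Map each character to a positive integer; 0 is reserved for padding."""
    chars = sorted({ch for text in transcripts for ch in text})
    return {ch: index + 1 for index, ch in enumerate(chars)}


def encode(text, vocab):
    """Label-to-integer conversion: replace each character by its vocabulary index."""
    return [vocab[ch] for ch in text]


def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one in the batch."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (longest - len(seq)) for seq in sequences]


# Made-up transcripts standing in for dataset labels.
transcripts = ["hello", "hi"]
vocab = build_vocab(transcripts)
batch = pad_batch([encode(text, vocab) for text in transcripts])
print(vocab)
print(batch)
```

Padding is what lets variable-length utterances be stacked into one rectangular batch for training.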

Challenges

During the first phase of the project, I met a lot of challenges as I had not been exposed to topics similar to this project. The first big problem was that I needed to learn the concepts of some machine learning tasks from scratch, such as NLP, so that I could have a better understanding of the common datasets and pipelines. Also, I needed to deeply review a lot of different preprocessing pipelines for each machine learning task, to make the table more comprehensive.

Midterm Blogpost: Drift Management Strategies Benchmark

Thu, 18 Jul 2024 00:00:00 +0000

Hello there! I’m William and I’m thrilled to share the progress on my project, “Developing A Comprehensive Pipeline to Benchmark Drift Management Approaches”, under the mentorship of Ray Andrew Sinurat and Sandeep Madireddy as part of the LAST project.

Project Overview

Progress

So far, I’ve generated various synthetic datasets, which include:

CIRCLE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. Each data point is labeled as per the condition (x1 − c1)^2 + (x2 − c2)^2 <= r, where the center (c1, c2) and radius r of the circular decision boundary change gradually over a period of time, introducing (gradual) concept drift.
COVCON: This 2-dimensional dataset has covariate shift and concept drift. The decision boundary at each point is given by α ∗ sin(πx1) > x2. We use 10000 points (100 batches, 1000 points per batch). Covariate shift is introduced by changing the location of x1 and x2 (for batch t x1 and x2). Concept drift is introduced by alternating the value of α.
SINE: This dataset contains two features x1, x2 drawn uniformly from the interval [0, 1]. In the first context, all points below the curve y = sin(x) are classified as positive. The class labels are flipped after the context changes.
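As an illustration of how a gradually drifting dataset like CIRCLE can be generated, here is a minimal sketch. The function names, the linear interpolation schedule, and the start and end parameters are assumptions made for the example; the sketch labels a point positive when it lies inside the circle of radius r, so it compares the squared distance against r squared.

```python
import random


def circle_batch(n, center, radius, rng):
    """Draw n points uniformly from the unit square and label those inside the circle."""
    c1, c2 = center
    batch = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        inside = (x1 - c1) ** 2 + (x2 - c2) ** 2 <= radius ** 2
        batch.append((x1, x2, int(inside)))
    return batch


def circle_stream(n_batches, batch_size, start, end, seed=0):
    """Generate batches while the boundary moves linearly from start to end.

    start and end are (c1, c2, r) tuples; interpolating between them moves the
    decision boundary a little in every batch, which gives gradual concept drift.
    """
    rng = random.Random(seed)
    stream = []
    for t in range(n_batches):
        weight = t / (n_batches - 1) if n_batches > 1 else 0.0
        c1, c2, r = (s + weight * (e - s) for s, e in zip(start, end))
        stream.append(circle_batch(batch_size, (c1, c2), r, rng))
    return stream


# Hypothetical drift: the circle slides to the right and grows.
stream = circle_stream(5, 20, start=(0.2, 0.5, 0.15), end=(0.6, 0.5, 0.25), seed=1)
for t, batch in enumerate(stream):
    positives = sum(label for _, _, label in batch)
    print(f"batch {t}: {positives} positive of {len(batch)}")
```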

Additionally, I’ve also curated drifting data from the Tencent I/O block trace. These datasets will be used to benchmark model performance under different drift conditions.

The pipeline can receive a base scikit-learn model and evaluate its performance on these datasets prequentially. Here are some of the initial results for the performance of the models on these drifting datasets under two strategies: never retraining, and retraining using the past 1 or 7 windows. As you can see, model performance degrades upon encountering extreme drift.
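Prequential (test-then-train) evaluation can be sketched without any dependencies. In the sketch below, `MajorityClassifier` is a toy model that only mimics the `fit`/`predict` interface of a scikit-learn estimator, and the `window` parameter and the four tiny batches are assumptions chosen to show an abrupt label flip; none of this is the pipeline's actual code.

```python
class MajorityClassifier:
    """Toy model with a scikit-learn-like interface: predicts the most common label."""

    def __init__(self):
        self.label = 0

    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label for _ in X]


def prequential(make_model, batches, window=None):
    """Test on each new batch first, then optionally retrain.

    window=None never retrains after the first batch; window=k retrains on the
    k most recent batches (including the one just tested).
    """
    first_X, first_y = batches[0]
    model = make_model().fit(first_X, first_y)
    accuracies = []
    for t in range(1, len(batches)):
        X, y = batches[t]
        predictions = model.predict(X)  # test before the model sees this batch
        accuracies.append(sum(p == a for p, a in zip(predictions, y)) / len(y))
        if window is not None:
            recent = batches[max(0, t - window + 1): t + 1]
            train_X = [x for batch_X, _ in recent for x in batch_X]
            train_y = [label for _, batch_y in recent for label in batch_y]
            model = make_model().fit(train_X, train_y)
    return accuracies


# Four made-up batches; the label flips abruptly from 0 to 1 at the third batch.
points = [(0.1, 0.2), (0.4, 0.5), (0.6, 0.3), (0.9, 0.8)]
batches = [(points, [0] * 4), (points, [0] * 4), (points, [1] * 4), (points, [1] * 4)]

never = prequential(MajorityClassifier, batches, window=None)
last_one = prequential(MajorityClassifier, batches, window=1)
last_seven = prequential(MajorityClassifier, batches, window=7)
print(never, last_one, last_seven)
```

In this toy stream, the one-window strategy recovers after the flip, while the seven-window strategy keeps training on stale pre-drift labels and does not.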

Findings

From the experiments conducted so far, the findings are as follows:

A model without retraining struggles to maintain performance when drift occurs.
Retraining on data from previous windows after drift, whether the drift is abrupt (SINE) or gradual (CIRCLE), leads to poorer performance. This is especially evident for the retraining strategy that incorporates data from up to 7 windows prior.
However, retraining on previous data proves beneficial in cases of covariate shift (COVCON), allowing the model to better align with the evolving real-world feature distributions.

Next Steps

With the base template for the pipeline and the dataset curation done, my focus going forward will be on:

Implementing three advanced algorithms: AUE (Accuracy Updated Ensemble), MATCHMAKER, and Driftsurf, then integrating them into the pipeline.
Enhancing the benchmarking process by adding more metrics and plots, such as training time and inference time, to better evaluate the strategies.
Packaging the entire experiment into a Chameleon Trovi Artifact, ensuring ease of reproducibility and extension.

Stay tuned for my final blog as I delve deeper into this project!

Midterm Blogpost: HDEval's LLM Benchmarking for HDL Design

Thu, 18 Jul 2024 00:00:00 +0000

Introduction

Hello! My name is Ashwin Bardhwaj, an electrical engineering and computer science student based in San Diego, CA. For the past 6 weeks, I have been working closely with Professor Jose Renau on the HDEval project. The aim of this project is to create multiple project-sized HDL benchmarks to evaluate how well existing LLMs can generate Verilog/Chisel code. These benchmarks will include my own “golden” HDL implementation of the project as well as respective English prompts to guide the LLM. I am excited to be able to work with these tools that have the potential to become a valuable resource for HDL design. So far, I have been successful in creating the first benchmark, a pipelined 3-stage RISC-V core, as well as working through my second project, a Gameboy Emulator.

RISC-V Implementation

Over this past month and a half, I have successfully completed my first benchmark, which focuses on creating, modeling, and testing a pipelined 3-stage RISC-V core. The core uses the fetch, decode, and execute structure and is functional for most RV32I instructions. I synthesized and simulated my Verilog using Icarus Verilog and displayed the waveforms on GTKWave. After development, a good section of time was spent creating and tuning the English explanation of each Verilog module. After running these benchmark files through several LLM APIs, we compared the existing “golden” modules with the generated ones and noticed that more recent versions of LLMs such as GPT-4o and Claude 3 perform much better at creating syntactically correct and efficient code.

In addition, I have also created a tool that will parse the Verilog and instruction files into the necessary json structure to then test on various models.
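A tool of this kind could work roughly as sketched below. This is a hypothetical illustration, not the HDEval tool itself: the JSON layout (`modules`, `name`, `prompt`, `golden`), the function name, and the regex-based module splitting are all assumptions, and the regex would be confused by the word "module" inside comments or strings.

```python
import json
import re

# Lazily match from a "module <name>" header up to the next "endmodule".
MODULE_RE = re.compile(r"\bmodule\s+(\w+).*?\bendmodule\b", re.DOTALL)


def build_benchmark(verilog_source, prompts):
    """Pair each Verilog module with its English prompt and return a JSON string."""
    entries = []
    for match in MODULE_RE.finditer(verilog_source):
        name = match.group(1)
        entries.append({
            "name": name,
            "prompt": prompts.get(name, ""),  # empty if no prompt was written yet
            "golden": match.group(0),         # the reference implementation text
        })
    return json.dumps({"modules": entries}, indent=2)


# Made-up source and prompt for illustration.
SOURCE = """module alu(input a, output b);
  assign b = a;
endmodule

module pc(input clk);
endmodule
"""

benchmark_json = build_benchmark(SOURCE, {"alu": "Build a pass-through ALU."})
print(benchmark_json)
```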

Gameboy Emulator

I am also in the process of developing the second benchmark, which targets a Gameboy emulator. This will challenge the LLMs much more than the RISC-V project because apart from the custom CISC CPU, the model should also understand how to handle various other blocks of the hardware system including memory, picture processing unit (PPU), sound processing unit (SPU), various input/output systems like the buttons and cartridge, and interrupt handlers. As a result, it will challenge the model to understand the system as a whole when creating each individual module.

Next Steps

As we continue into the second half of the project, I will keep working on my Gameboy emulator. I have already fully developed and tested the Z80-esque CPU, DMA, and interrupt handler, but need to continue working on the display and sound interfaces. I will also continue to evaluate and run these tests over a wider range of LLMs to get a better picture of which models and versions are best suited for HDL design, as well as the direction these models are going in.

Halfway Through OSRE24: My Experience and Learnings

Mon, 15 Jul 2024 00:00:00 +0000

Hello there! I’m Kilian Warmuth, a computer science student from Germany. This summer, I’m part of the 2024 Summer of Reproducibility (SoR) initiative. My project, “Reproducible Experiment Workflows in SLICES/pos,” aims to enhance reproducibility in scientific research, aligning with the FAIR principles (Findable, Accessible, Interoperable, Reusable).

Project Overview

The “Reproducible Experiment Workflows in SLICES/pos” project is part of the larger SLICES-RI initiative, designed to improve the reproducibility and reusability of large-scale experimental research. The project focuses on integrating the RO-Crate standard into the pos testbed to organize and document experiment results systematically. This integration will enhance the accessibility and comprehensibility of research findings, ensuring they adhere to the FAIR principles. Additionally, the project aims to improve the portability of pos experiments to the Chameleon testbed, facilitating collaboration and seamless execution across different research environments.

Progress and Challenges

The first half of the project is done, marked by significant progress and learnings. My initial focus was on familiarizing myself with the pos framework and the RO-Crate standard. This foundational knowledge was crucial for the subsequent steps of restructuring the results folder and integrating automated RO-Crate generation into the pos framework.

Key Achievements:

Restructured Results Folder: The structure of the results folder has been redesigned to streamline navigation and enable systematic storage of result data.
Automated RO-Crate Generation: Successfully integrated the basics of the RO-Crate standard into the pos framework, enabling the automated generation of comprehensive results documentation.
Metadata Documentation: Added comprehensive documentation to the results data, including essential metadata such as author details, user scripts, and hardware information, enhancing reproducibility and interpretability.
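For readers unfamiliar with RO-Crate, the sketch below builds a small `ro-crate-metadata.json` structure by hand. It follows the general shape of an RO-Crate 1.1 metadata file (a metadata descriptor plus a root dataset in an `@graph`), but the file names, descriptions, author, and the `build_ro_crate` helper are hypothetical, and a complete crate would carry further properties such as a license and publication date. It does not reflect how pos generates its crates.

```python
import json


def build_ro_crate(name, author, files):
    """Return a minimal RO-Crate-style metadata document as a dictionary.

    files maps a relative file path to a short description.
    """
    graph = [
        {
            # Descriptor entity that identifies this file as RO-Crate metadata.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # Root dataset: the results folder itself.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "author": {"@id": "#author"},
            "hasPart": [{"@id": path} for path in files],
        },
        {"@id": "#author", "@type": "Person", "name": author},
    ]
    graph += [
        {"@id": path, "@type": "File", "description": description}
        for path, description in files.items()
    ]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Hypothetical results folder contents.
crate = build_ro_crate(
    name="Example experiment results",
    author="Jane Doe",
    files={
        "experiment.sh": "User script that ran the experiment",
        "hardware.json": "Hardware information of the experiment nodes",
    },
)
print(json.dumps(crate, indent=2))
```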

Challenges Encountered:

Balancing Automation with Flexibility: Ensuring that the automated generation of RO-Crates did not compromise the flexibility researchers need to customize their experiment documentation, while still coping with the complex requirements of a testbed.
Complexity of Testbed Systems: Integrating the RO-Crate implementation into a complex system like a testbed has required deep dives into the code base of the testbed.

Despite these challenges, the progress made has been rewarding, laying a solid foundation for the next phase of the project.

Learnings and Skills Gained

Understanding the Complexity of Testbeds: One of the key learnings from this project has been the realization that testbeds are complex systems. Despite their complexity, the process became manageable thanks to well-documented software and the invaluable support of top mentors who provided detailed answers to in-depth questions. Their guidance was crucial in navigating the challenges of the project.

Open Source Development in an Educational Environment: My experience in open source development has been enriched by working within an educational context. This skill is particularly important when adapting and simplifying code to ensure that users can follow along and gain a deeper understanding of the experiments, improving the quality of research experiments.

Next Steps

As we move into the second half of this project, our primary focus will be on enhancing the portability of pos experiments to the Chameleon testbed. Key tasks include:

Fine-tune RO-Crate Implementation: Continue refining the RO-Crate integration to handle the complexities of testbed systems, such as special edge cases, more effectively.
Enhance Portability: Refine the integration with Trovi, ensuring seamless upload and retrieval of experiment results across testbeds.
Develop Introductory Examples: Create examples demonstrating the use of pos in various testbed environments to guide researchers.
Execute and Analyze Experiments: Design and execute a complex network experiment on both SLICES/pos and Chameleon, validating and refining portability features.

These steps are crucial to achieving our goal of making pos experiments more accessible and reproducible across different research environments.

Conclusion

Reflecting on the first half of my OSRE24 journey, I am incredibly grateful for the opportunity to work on the “Reproducible Experiment Workflows in SLICES/pos” project. The experience has been both challenging and rewarding, providing valuable insights into open-source development, machine learning techniques, and the creation of educational resources.

As we move forward, I am excited about the coming weeks. The completion of the portability enhancements and the execution of complex experiments lie ahead, marking significant milestones in our project. The skills and lessons I have acquired will guide me in future endeavors.

Data leakage in applied ML: reproducing examples from genomics, medicine and radiology

Mon, 01 Jul 2024 00:00:00 +0000

Hello everyone! I’m Shaivi Malik, a computer science and engineering student. I am thrilled to announce that I have been selected as a Summer of Reproducibility Fellow. I will be contributing to the Data leakage in applied ML: reproducing examples of irreproducibility project under the mentorship of Fraida Fund and Mohamed Saeed. You can find my proposal here.

This summer, we will reproduce studies from medicine, radiology and genomics. Through these studies, we’ll explore and demonstrate three types of data leakage:

Pre-processing on train and test sets together
Model uses features that are not legitimate
Feature selection on training and test sets

For each paper, we will replicate the published results with and without the data leakage error, and present performance metrics for comparison. We will also provide explanatory materials and example questions to test understanding. All these resources will be bundled together in a dedicated repository for each paper.
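The first leakage type, pre-processing on train and test sets together, can be shown with a toy standardization example. The numbers are made up and the sketch is only an illustration of the error pattern, not a reproduction of any of the studies.

```python
import statistics


def standardize(values, mean, std):
    """Scale values to zero mean and unit variance using the given statistics."""
    return [(v - mean) / std for v in values]


# Made-up feature values; the test set looks quite different from the training set.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: statistics are computed on train and test together, so information
# about the test distribution influences how the training data is scaled.
all_values = train + test
leaky_mean = statistics.mean(all_values)
leaky_std = statistics.pstdev(all_values)
leaky_train = standardize(train, leaky_mean, leaky_std)

# Correct: statistics come from the training set only and are then reused,
# unchanged, to transform the test set.
clean_mean = statistics.mean(train)
clean_std = statistics.pstdev(train)
clean_train = standardize(train, clean_mean, clean_std)
clean_test = standardize(test, clean_mean, clean_std)

print("leaky mean:", leaky_mean, "clean mean:", clean_mean)
print("clean test values:", clean_test)
```

With the leaky statistics, the training features are shifted by values the model should never have seen, which is exactly the kind of difference the with/without comparison is meant to expose.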

This project aims to address the need for accessible educational material on data leakage. These materials will be designed to be readily adopted by instructors teaching machine learning in a wide variety of contexts. They will be presented in a clear and easy-to-follow manner, catering to a broad range of backgrounds and raising awareness about the consequences of data leakage.

Stay tuned for updates on my progress! You can follow me on GitHub and watch out for my upcoming blog posts.

FetchPipe: Data Science Pipeline for ML-based Prefetching

Tue, 25 Jun 2024 00:00:00 +0000

Hello, I’m Peiran Qin, a first-year Pre-Doctoral student in Computer Science at the University of Chicago. This summer I will focus on the project FetchPipe: Data Science Pipeline for ML-based Prefetching under the mentorship of Prof. Haryadi S. Gunawi. This is my proposal.

Caching and prefetching are integral components of modern storage systems, aimed at reducing I/O latency by utilizing faster but less dense memory for storing data that is accessed frequently. Traditional prefetching strategies, which primarily rely on heuristic-based methods, often fall short in performance, particularly in complex scenarios. To address such scenarios, machine learning solutions have emerged in recent years as a promising alternative, offering the ability to learn and predict complicated data access patterns. However, each existing ML prefetcher may be biased toward different scenarios and distinct evaluation metrics. There is still a need to evaluate state-of-the-art machine-learning-based prefetchers from the literature comprehensively and fairly under an aligned evaluation framework with extensive performance metrics. That is my motivation for spending my summer on this interesting project!

Developing an Efficient CMS for Polyphy Project

Fri, 21 Jun 2024 00:00:00 +0000

Hello everyone,

My name is Mohit, and I am currently a sophomore at NIT Jalandhar. As part of the PolyPhy project’s team, I am determined to make data management much easier for everyone involved. My project also aims to increase the project’s social presence.

As part of the PolyPhy my proposal under the mentorship of MENTOR aims to …. You might wonder, why create a new CMS when we could use existing solutions like Strapi, Contentful, or WordPress? The answer lies in the specific requirements of our project, which I’ll cover in a separate blog post about the selection of the tech stack and code architecture.

Returning to my programming journey, while the CMS is a significant part of the project, it also includes refactoring existing React code and migrating it to Next.js, among other cool tasks. The first two weeks of my project were primarily focused on this. Now, I am shifting more towards the CMS development.

How did we start? Initially, we created a curated list of essential features needed, through discussions with my mentors. They ensured that I wouldn’t face the burden of unnecessary features, focusing instead on what was truly beneficial for both me and the project. I began by experimenting with various WYSIWYG editors such as React Quill, Tiptap, Draft.js, and Slate.

By the end of this week, I successfully created a small working prototype of the CMS. As the coding period progresses, things are beginning to take shape, and I am really excited about creating something that will help people in the long run.

Thank you for reading, and stay tuned for more updates!

Causeway: A New Approach to Web Development Teaching

Thu, 20 Jun 2024 00:00:00 +0000

As part of the Causeway team, my proposal under the mentorship of Professor David Lee aims to enhance web development education through situated learning.

Causeway addresses shortcomings in current online coding tutorials by offering a comprehensive approach to web development using an Angular, RxJS, NgRx, and Firebase stack. By breaking the complex task of creating a website down into discrete chunks (micro-roles) and tracking individual progress, students can be assured they are achieving their desired learning goals. With this project, our team hopes to demonstrate the potential of situated learning – tacit knowledge picked up within a real-world context – instead of content-based learning approaches used in sites like Khan Academy and Coursera.

Over the course of this summer, we plan on reinvigorating the pre-existing v1 platform through the addition of new features such as dashboards, quizzes, and in-depth walkthroughs of new potential projects for users to implement. The platform will also leverage the Stackblitz WebContainer API and Firebase Cloud Functions to run full applications in the browser for interactive and secured learning.

Unveiling Medicine Patterns: 3D Clustering with Polyphy/Polyglot

Wed, 19 Jun 2024 00:00:00 +0000

Hello! My name is Ayush and this summer I’ll be contributing to Polyphy and Polyglot, a GPU-oriented agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data, under the mentorship of Oskar Elek and Kiran Deol.

For reference, here’s my proposal for this project.

Polyglot offers an immersive 3D visualization experience, enabling users to zoom, rotate, and delve into complex datasets. My project aims to harness these capabilities to unlock hidden connections in the realm of medicine, specifically focusing on the relationships between drugs based on their shared salt compositions, rather than just their active ingredients. This approach promises to reveal intricate patterns and relationships that have the potential to revolutionize drug discovery, pharmacology, and personalized medicine.

In this project, I will create custom embeddings for a vast dataset of over 600,000 medicines, capturing the relationships between their salt compositions. By visualizing these embeddings in Polyglot’s 3D space, researchers can identify previously unknown connections between medicines, leading to new insights and breakthroughs. The dynamic and interactive nature of Polyglot will empower researchers to explore these complex relationships in a very efficient and cool way, potentially accelerating the discovery of new drug interactions and therapeutic applications.
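
As a rough illustration of the idea, here is a minimal sketch that assumes each medicine is represented as a multi-hot vector over the salts it contains and compared by cosine similarity. The medicine names, the salts, and the `embed` and `cosine` helpers are hypothetical; the embedding method actually used in the project is not specified here.

```python
import math

# Hypothetical toy data: medicine name -> set of salts it contains.
MEDICINES = {
    "med_a": {"paracetamol", "caffeine"},
    "med_b": {"paracetamol"},
    "med_c": {"ibuprofen"},
}

# Fixed salt vocabulary so every vector has the same dimensions.
VOCAB = sorted(set().union(*MEDICINES.values()))


def embed(salts):
    """Return a multi-hot vector: 1.0 where the salt is present, else 0.0."""
    return [1.0 if salt in salts else 0.0 for salt in VOCAB]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vectors = {name: embed(salts) for name, salts in MEDICINES.items()}
sim_ab = cosine(vectors["med_a"], vectors["med_b"])  # share one salt
sim_ac = cosine(vectors["med_a"], vectors["med_c"])  # share no salts
```

Medicines that share salts end up closer together than medicines that share none, which is the property a 3D projection of such vectors would make visible.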

I am really excited to work on this project. Keep following the blogs for further updates!

Assessing the Computational Reproducibility of Jupyter Notebooks

Tue, 18 Jun 2024 00:00:00 +0000

Like so many authors before me, my first reproducibility study and very first academic publication started with the age-old platitude, “Reproducibility is a cornerstone of the scientific method.” My team and I participated in a competition to replicate the performance improvements promised by a paper presented at last year’s Supercomputing conference. We weren’t simply re-executing the same experiment with the same cluster; instead, we were trying to confirm that we got similar results on a different cluster with an entirely different architecture. From the very beginning, I struggled to wrap my mind around the many reasons for reproducing computational experiments, their significance, and how to prioritize them. All I knew was that there seemed to be a consensus that reproducibility is important to science and that the experience left me with more questions than answers.

Not long after that, I started a job as a research software engineer at Purdue University, where I worked heavily with Jupyter Notebooks. I used notebooks and interactive components called widgets to create a web application, which I turned into a reusable template. Our team was enthusiastic about using Jupyter Notebooks to quickly develop web applications because the tools were accessible to the laboratory researchers who ultimately needed to maintain them. I was fortunate to receive the Better Scientific Software Fellowship to develop tutorials to teach others how to use notebooks to turn their scientific workflows into web apps. I collected those and other resources and established the Jupyter4Science website, a knowledgebase and blog about Jupyter Notebooks in scientific contexts. That site aims to improve the accessibility of research data and software.

There seemed to be an important relationship between improved accessibility and reuse of research code and data and computational reproducibility, but I still had trouble articulating it. In pursuit of answers, I moved to sunny Arizona to pursue a History and Philosophy of Science degree. My research falls at the confluence of my prior experiences; I’m studying the reproducibility of scientific Jupyter Notebooks. I have learned that questions about reproducibility aren’t very meaningful without considering specific aspects such as who is doing the experiment and replication, the nature of the experimental artifacts, and the context in which the experiment takes place.

I was fortunate to have found a mentor for the Summer of Reproducibility, Tanu Malik, who shares the philosophy that the burden of reproducibility should not solely rest on domain researchers who must develop other expertise. She and her lab have developed FLINC, an application virtualization tool that improves the portability of computational notebooks. Her prior work demonstrated that FLINC provides efficient reproducibility of notebooks and takes significantly less time and space to execute and repeat notebook execution than Docker containers for the same notebooks. My work will expand the scope of this original experiment by adding more notebooks to FLINC’s test coverage and showing robustness across even more diverse computational tasks. We expect to show that infrastructural tools like FLINC improve the success rate of automated reproducibility.

I’m grateful to both the Summer of Reproducibility program managers and my research mentor for this incredible opportunity to further my dissertation research in the context of meaningful collaboration.

Exploring Reproducibility in High-Performance Computing Publications with the Chameleon Cloud

Sat, 15 Jun 2024 00:00:00 +0000

Hello everyone,

I’m Klaus Kraßnitzer and am currently finishing up my Master’s degree at the Technical University of Vienna. This summer, under the guidance of Sascha Hunold, I’m excited to dive into a project that aims to enhance reproducibility in high-performance computing research.

Our project, AutoAppendix, focuses on the rigorous evaluation and potential automation of Artifact Description (AD) and Artifact Evaluation (AE) appendices from publications submitted to this year’s Supercomputing Conference (SC). Because a sizeable chunk of SC publications utilize Chameleon Cloud, a platform known for its robust and scalable experiment setups, the project will focus on creating guidelines (and potentially, software tools) that users of the Chameleon Cloud can utilize to make their research more easily reproducible. You can learn more about the project and read the full proposal here.

My fascination with open-source development and research reproducibility was sparked during my undergraduate studies and further nurtured by my role as a teaching assistant. Hands-on projects and academic courses, like those in chemistry emphasizing precise experimental protocols, have deeply influenced my approach to computational science.

Project Objectives

Analyze and Automate: Assess current AE/AD appendices submitted for SC24, focusing on their potential for automation.
Develop Guidelines: Create comprehensive guidelines to aid future SC conferences in artifact submission and evaluation.
Build Tools (Conditionally): Develop automation tools to streamline the evaluation process.

The ultimate aim of the project is to work towards a more efficient, transparent, and reproducible research environment, and I’m committed to making it simpler for researchers to demonstrate and replicate scientific work. I look forward to sharing insights and progress as we move forward.

Thanks for reading, and stay tuned for more updates!

Reproducibility in Data Visualization

Fri, 14 Jun 2024 00:00:00 +0000

Hello! My name is Arya Sarkar and I will be contributing to the research project titled Reproducibility in Data Visualization, with a focus on investigating and coming up with novel solutions to capture both static and dynamic visualizations from different sources. My project is titled Investigate Solutions for Capturing Visualizations and I am mentored by Prof. David Koop.

Open-source has always piqued my interest, but as a junior in university I often found it hard to get started. I spent a lot of time working with data visualizations but had never dived into the problem of reproducibility before this project. When I saw the plethora of unique and interesting projects during the contribution phase of OSRE-2024, I was confused at the beginning. However, the more I dived into this project and understood the significance of research in this domain to ensure reproducibility, the more I found myself drawn towards it. I am glad to be presented with this amazing opportunity to work in the open-source space as a researcher in reproducibility.

This project aims to investigate, augment, and/or develop solutions to capture visualizations that appear in formats including websites and Jupyter notebooks. We have a special interest in capturing the state of interactive visualizations and preserving the user interactions required to reach a certain visualization in an interactive environment to ensure reproducibility. My proposal can be viewed here!

Artificial Intelligence Explainability Accountability

Fri, 14 Jun 2024 00:00:00 +0000

Hey! I’m Sarthak Chowdhary (Shaburu), and I am thrilled to share my incredible journey with the Open Source Program Office of UC Santa Cruz as part of Google Summer of Code (GSoC) 2024. This experience marks a pivotal milestone in my career, offering me the chance to delve into an intriguing project while learning from the brightest minds in the open-source community. Allow me to guide you through my adventure thus far, from the nerve-wracking wait for results to the exhilarating commencement of the coding period.

Before we start, here’s my proposal.

Pre-GSoC Application

I had shortlisted three organizations that I was working on:

OSPO UC Santa Cruz - Amplifying Research Impact Through Open Source
CVAT.AI - Computer Vision Data Annotation for AI
Emory University - Biomedical Research to Advance Medical Care

On the 1st of May, like many students eagerly anticipating the results of the Google Summer of Code (GSoC) 2024, I found myself glued to my screen, anxiously awaiting the clock to strike 11:30 PM IST. After what felt like an eternity of waiting, I finally received the email that changed everything: I had been selected for GSoC 2024 with the Open Source Program Office of UC Santa Cruz!

The first month of GSoC, known as the community bonding period, is for establishing rapport with the people working on the project. I researched my mentor, Dr. Leilani H. Gilpin, and built a good rapport with her. She is an Assistant Professor in Computer Science and Engineering and an affiliate of the Science & Justice Research Center at UC Santa Cruz. She is also a part of the AI group @ UCSC and leads the AI Explainability and Accountability (AIEA) Lab. Her research focuses on the design and analysis of methods for autonomous systems to explain themselves. Her work has applications to robust decision-making, system debugging, and accountability. Her current work examines how generative models can be used in iterative XAI stress testing. She guided me through the necessary documentation and explained the project demands and requirements in detail, which was invaluable for my project.

Project

The project aims to build a system that takes a student’s code as input and explains their mistakes to them, from low-level syntax and compilation errors to high-level issues such as overloaded variables.

My proposal aims to create custom, novel basic questions and take them up a notch by creating custom drivers for each problem and common drivers that detect low-level errors and give baseline explanations for various error cases, then combining these drivers into a robust system that uses third-party open-source software (like the Monaco code editor, the editor of the web) where necessary. I will write uniform and consistent feedback and explanations for each coding problem, covering all the possible edge cases, along with a pipeline that iterates over the test cases and feedback. This benchmark suite will be used for testing the system.

Additionally, I plan on building an interface that has a roadmap from basics such as arrays and hashmaps to advanced topics such as trees, heaps, and backtracking, along with progress bars, and that throws confetti on successful unit tests (important). These will use the same benchmark suite under the hood. I will be utilizing Judge0 (an open-source online code execution system) for code execution and Monaco (the open-source “Editor of the Web”) as the code editor.
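
To make the idea of a common driver concrete, here is a minimal sketch that assumes the student submissions are Python source code. It checks whether the code parses and, if not, returns a baseline explanation. The function name `explain_syntax_error` and the wording of the message are hypothetical, not part of the actual system.

```python
import ast


def explain_syntax_error(source):
    """Return None if the code parses, else a short beginner-friendly message."""
    try:
        ast.parse(source)
    except SyntaxError as err:
        return f"Line {err.lineno}: Python could not parse this line ({err.msg})."
    return None


# A valid snippet and one with a missing colon after the function signature.
good = explain_syntax_error("total = 1 + 2\n")
bad = explain_syntax_error("def add(a, b)\n    return a + b\n")
```

A per-problem driver could then run only when this low-level check passes, so that feedback about logic is never mixed with feedback about syntax.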

Project goals:

Project Objective: By the end of summer, the software should be a novel and robust tool that helps beginner and advanced programmers alike learn programming by hyper-focusing on the mistakes they make and using AI to explain the how, what, and why of their code. It should provide clear and concise explanations accompanied by actionable suggestions for debugging and improvement.
Expected deliverables: A robust eXplainable AI benchmark suite that will be used extensively for the undergraduate AI courses and possibly the graduate courses as well, along with anyone interested in learning programming with the help of personalized AI.
Future work based on project: A beautiful gamified interface that utilizes the above benchmark suite and gets people excited to learn programming would be awesome to build!

When I started my programming journey (before ChatGPT😨), I personally encountered problems that were way above my skill set, and I had no way of knowing it, which resulted in countless hours spent without proper feedback as to where I was going wrong. This project has a real impact on people in an innovative way that I wish I had access to at the start of my programming journey, so working on it comes from a place of passion. This specific project will also test my own understanding of programming, and spending the summer solidifying it under the guidance of Leilani H. Gilpin is a dream come true for me.

Data leakage in applied ML: reproducing examples of irreproducibility

Fri, 14 Jun 2024 00:00:00 +0000

Hello,

I am Kyrillos Ishak, and I am happy to be part of SOR 2024, where I am working on the Data leakage in applied ML: reproducing examples of irreproducibility project. My proposal was accepted.

I am excited to work with Fraida Fund and Mohamed Saeed as my mentors. The objective of the project is to develop educational resources that can be adjusted by professors/instructors to explain specific data leakage problems. This involves ensuring the reproducibility of certain research papers that contain data preprocessing issues, then fixing these issues to demonstrate how they can affect the results.

Data leakage is a problem caused when information from outside the training dataset is used to create the model. This issue can lead to overly optimistic performance estimates and, ultimately, models that do not perform well on new, unseen data.
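
One common form of leakage in preprocessing can be sketched as follows. The example assumes a single numeric feature and a fixed train/test split; the values and the `standardize` helper are hypothetical and only illustrate the difference between computing normalization statistics on all data versus on the training split alone.

```python
from statistics import mean, pstdev

# Hypothetical feature values; the last two rows are held out as the test set.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = values[:4], values[4:]


def standardize(xs, mu, sigma):
    """Scale values using the given mean and standard deviation."""
    return [(x - mu) / sigma for x in xs]


# Leaky: statistics are computed on train + test, so test information
# influences how the training data is transformed.
leaky_mu, leaky_sigma = mean(values), pstdev(values)
leaky_train = standardize(train, leaky_mu, leaky_sigma)

# Correct: statistics come from the training split only and are then
# reused unchanged on the test split.
clean_mu, clean_sigma = mean(train), pstdev(train)
clean_train = standardize(train, clean_mu, clean_sigma)
clean_test = standardize(test, clean_mu, clean_sigma)
```

In the leaky variant the training features already encode the scale of the held-out rows, which is the kind of outside information the definition above refers to.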

Despite the importance of addressing data leakage, many people from fields not closely related to computer science are unfamiliar with it, even if they are aware of best practices for data preprocessing. Developing educational materials on this topic will greatly benefit them.

I am excited to dive into the topic of data leakage in machine learning. Throughout the summer, I will be sharing regular updates and insightful blog posts on this subject. Stay tuned for more information!

Developing Trustworthy Large Language Models

Fri, 14 Jun 2024 00:00:00 +0000

Hi! Thanks for stopping by.

In this first blog post of a series of three, I’d like to introduce myself, my mentor, and my project.

My name is Nikhil. I am an ML researcher who works at the intersection of NLP, ML, and HCI. I previously worked as a Machine Learning Engineer II at VMware and spent some wonderful summers interning with ML teams at NVIDIA and IIT Bombay. I also recently graduated from the University of Southern California (USC) with honors in Computer Science and a master’s thesis.

This year at Google Summer of Code (GSoC 24), I will be working on developing trustworthy large language models. I’m very grateful to be mentored by Leilani H. Gilpin at the AIEA lab, UC Santa Cruz. I truly admire the flexibility and ownership she allows me in pursuing my ideas independently within this project. Please feel free to peruse my accepted GSoC proposal here.

Project: My project has a tangible outcome: An open-source, end-to-end, full-stack web app with a hybrid trustworthy LLM in the backend.

This open-source web app will be a lightweight tool that not only has the ability to take diverse textual prompts and connect with several LLMs and a database but also the capability to gather qualitative and quantitative user feedback. Users will be able to see how this feedback affects the LLMs’ responses and impacts its reasoning and explanations (xAI). The tool will be thoroughly tested to ensure that the unit tests are passing and there is complete code coverage.

At the moment, we are investigating LLMs and making them more trustworthy in constraint satisfaction tasks like logical reasoning and misinformation detection tasks. However, our work has applicability in other areas of Responsible AI, such as Social Norms (toxicity detection and cultural insensitivity), Reliability (misinformation, hallucination, and inconsistency), Explainability & Reasoning (lack of interpretability, limited logical, and causal reasoning), Safety (privacy violation and violence), and Robustness (prompt attacks and distribution shifts).

Impact:

Responsible AI research teams across industry and academia can use this as a boilerplate for their user study projects.
Diverse PhD students and academic researchers looking to study LLM and user interaction research will find this useful.
LLM alignment researchers and practitioners can find this a useful resource, as user feedback affects the inherent rewards model of the internal LLMs.
Explainable AI (xAI) researchers can find value in the explanations that this tool generates, which reveal interpretable insights into how modern LLMs think and use their memory. These are just a few use cases; however, there are several others that we look forward to describing in the upcoming posts.

This was my first blog in the series of three for the UC OSPO. Stay tuned for the upcoming blogs, which will detail my progress at the halfway mark and the final one concluding my work.

If you find this work interesting and would love to share your thoughts, I am happy to chat! :) Feel free to connect on LinkedIn and mention that you are reaching out from this blog post.

It is great to meet the UC OSPO community, and thanks for reading. Bye for now.

Enhancing Usability and Expandability of the Open Sensing Platform project

Fri, 14 Jun 2024 00:00:00 +0000

Greetings everyone,

I am Ahmed Falah and I am delighted to be part of the 2024 Google Summer of Code program, where I am contributing to the Open Sensing Platform project.

My proposal was accepted, and I am fortunate to have Colleen Josephson and John Madden as my mentors. The objective of my project is to enhance usability and expandability of the Open Sensing Platform, a hardware solution for deploying sensor networks in outdoor environments. This platform utilizes low-power, long-range communication to transmit data from various sensors to a visualization dashboard. While the platform effectively collects data, its configuration process currently requires modifying source code, so the goal is to make it more user-friendly. My first steps to enhance usability of the project:

Improve User Interface (UI): Develop a user-friendly interface to interact with the platform, enabling researchers to configure the device without modifying code.
Conversion of user configuration: convert user configuration data to the Protobuf format for efficient storage and transmission.

Additionally, I will explore updating the NVRAM functions to interact with Protobuf messages instead of directly writing/reading raw data to NVRAM. I will also implement functions to serialize user configuration data into a Protobuf message and deserialize the message back into a data structure for use within the firmware.
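
The serialize/deserialize round trip can be sketched in a language-neutral way by hand-encoding the Protobuf wire format in Python. This is only an illustration: the two fields (`upload_interval_s` as field 1, `sensor_name` as field 2) and all function names are hypothetical, and a real implementation would normally rely on code generated from a `.proto` schema rather than hand-written encoding.

```python
def encode_varint(value):
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def serialize_config(upload_interval_s, sensor_name):
    """Field 1: varint interval; field 2: length-delimited UTF-8 name."""
    name = sensor_name.encode("utf-8")
    return (
        encode_varint((1 << 3) | 0) + encode_varint(upload_interval_s)
        + encode_varint((2 << 3) | 2) + encode_varint(len(name)) + name
    )


def deserialize_config(data):
    """Inverse of serialize_config for the two hypothetical fields."""
    config, pos = {}, 0
    while pos < len(data):
        tag, pos = decode_varint(data, pos)
        field, wire_type = tag >> 3, tag & 0x07
        if field == 1 and wire_type == 0:
            config["upload_interval_s"], pos = decode_varint(data, pos)
        elif field == 2 and wire_type == 2:
            length, pos = decode_varint(data, pos)
            config["sensor_name"] = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError(f"unexpected field {field} / wire type {wire_type}")
    return config


payload = serialize_config(300, "soil_probe")
restored = deserialize_config(payload)
```

The resulting byte string is what would be written to and read back from NVRAM in place of a raw struct, which keeps the stored layout independent of the firmware's in-memory representation.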

I will be posting regular updates and informative blogs throughout the summer, so stay tuned!

Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Fri, 14 Jun 2024 00:00:00 +0000

Hello, I am Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University, specializing in Artificial Intelligence. This summer, I will be working on the project AIIO / Graph Neural Network under the mentorship of Bin Dong and Suren Byna.

High-Performance Computing (HPC) applications often face performance issues due to I/O bottlenecks. Manually identifying these bottlenecks is time-consuming and error-prone. My project aims to enhance the AIIO framework by integrating a Graph Neural Network (GNN) model to automatically diagnose I/O performance bottlenecks at the job level. This involves developing a comprehensive data pre-processing pipeline, constructing and validating a tailored GNN model, and rigorously testing the model’s accuracy using test cases from the AIIO dataset.
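
For readers unfamiliar with GNNs, the core operation is message passing: each node updates its features from its neighbors. The sketch below assumes a tiny hypothetical job graph and a plain mean aggregation with no learned weights or node types, so it is far simpler than the model planned for AIIO; the node names and feature meanings are invented for illustration.

```python
# Hypothetical job-level graph: a job node connected to the files it accessed.
# Each node carries a small feature vector (e.g. bytes read, number of seeks).
features = {
    "job": [0.0, 0.0],
    "file_a": [4.0, 2.0],
    "file_b": [2.0, 6.0],
}
edges = [("file_a", "job"), ("file_b", "job")]  # (source, target)


def message_passing_step(features, edges):
    """Update each node to the mean of its own and its in-neighbors' features."""
    updated = {}
    for node, own in features.items():
        incoming = [features[src] for src, dst in edges if dst == node]
        stacked = [own] + incoming
        updated[node] = [sum(col) / len(stacked) for col in zip(*stacked)]
    return updated


hidden = message_passing_step(features, edges)
```

After one step the job node summarizes the I/O behavior of the files it touched, which is the kind of job-level representation a diagnosis model could then classify.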

Through this project, I seek to provide a sophisticated, AI-driven approach to understanding and improving I/O performance in HPC systems, ultimately contributing to more efficient and reliable HPC applications.

StatWrap: Automated Reproducibility Checklists Generation

Fri, 14 Jun 2024 00:00:00 +0000

Namaste🙏🏻! I am Adi Akhilesh Singh, currently pursuing a degree in Computer Science and Engineering at IIT(BHU). This summer, I will be working on the StatWrap: Automated Reproducibility Checklists Generation project under the mentorship of Luke Rasmussen. You can view my project proposal for more details.

My project aims to integrate customizable reproducibility checklists into StatWrap, using metadata and user input to automate their generation. The goal is to enhance the reproducibility of research projects by providing researchers with structured and comprehensive checklists to ensure their work is reproducible.
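
A minimal sketch of the idea, assuming project metadata is available as a simple dictionary: some checklist items can be answered automatically, and the rest are left for the researcher. The metadata keys, the questions, and `build_checklist` are hypothetical and do not reflect StatWrap's actual data model.

```python
def build_checklist(metadata):
    """Derive (question, answered) checklist items from project metadata."""
    languages = metadata.get("languages", [])
    dependencies = metadata.get("dependencies", [])
    return [
        ("Are the programming languages documented?", bool(languages)),
        ("Are software dependencies listed?", bool(dependencies)),
        ("Is a license specified?", bool(metadata.get("license"))),
    ]


# Hypothetical metadata; items the metadata cannot answer are left for the user.
project = {"languages": ["R", "Python"], "dependencies": [], "license": "MIT"}
checklist = build_checklist(project)
unanswered = [question for question, done in checklist if not done]
```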

Stay tuned for updates on my progress in the coming weeks! 🚀

LLM Assistant for OpenROAD - Data Engineering and Testing

Thu, 13 Jun 2024 00:00:00 +0000

Hello! My name is Aviral Kaintura, and I will be contributing to OpenROAD, a groundbreaking open-source toolchain for digital integrated circuit automation (RTL to GDSII) during GSoC 2024.

My project, LLM Assistant for OpenROAD - Data Engineering and Testing, is jointly mentored by Indira Iyer and Jack Luar.

The aim of this project is to develop a chat assistant to improve the user experience with OpenROAD. My focus will be on developing a well-curated dataset from OpenROAD’s knowledge base. This dataset will be fundamental for another project led by Palaniappan R, which involves building the chatbot’s architecture. It will be used for training and validating the model and ensuring efficient context retrieval to generate accurate user responses, aiding in troubleshooting, installation, and other common issues to reduce the maintainers’ workload.

In addition to dataset creation, I will be working on testing and evaluation. This includes developing metrics for model evaluation, incorporating both human and automated techniques.

Our human evaluation framework will utilize chatbot feedback for valuable insights, enhancing the model and dataset. An automated batch testing application is also used to further enhance the evaluation process.

Here is an early build of the evaluation framework.

By leveraging advanced data engineering and testing methodologies, we aim to build an assistant that combines high accuracy with optimal response times. Additionally, we will collaborate with research teams at NYU and ASU to contribute to the research on AI-based chat assistants for electronic design automation.

I am thrilled to be part of this journey and look forward to making a meaningful impact on the OpenROAD project.

Stay tuned for more updates on the project!

LLM Assistant for OpenROAD - Model Architecture and Prototype

Thu, 13 Jun 2024 00:00:00 +0000

Hi there!

I’m Palaniappan R, currently an undergraduate student at the Birla Institute of Technology & Science, Pilani, India.

I’ll be working on the LLM Assistant for OpenROAD - Model Architecture and Prototype project, under the mentorship of Indira Iyer and Jack Luar.

My project aims to develop the architecture for a chat assistant built for OpenROAD and its native flow, designed to assist beginners and experienced users by giving easy access to existing resources, offering troubleshooting assistance, and providing fast and accurate responses to common questions. I plan to do this by leveraging state-of-the-art retrieval and fine-tuning techniques.
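
The retrieval step can be illustrated with a deliberately simple sketch that ranks documentation snippets by word overlap with the question. This is a stand-in technique (Jaccard similarity), not the state-of-the-art retrieval the project plans to use, and the snippets and function names are hypothetical.

```python
import re

# Hypothetical snippets standing in for OpenROAD documentation passages.
DOCS = [
    "Install dependencies before building from source.",
    "Global placement spreads standard cells across the core area.",
    "Clock tree synthesis inserts buffers to balance clock skew.",
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, docs, k=1):
    """Return the k docs with the highest Jaccard word overlap with the query."""
    q = tokenize(query)

    def score(doc):
        d = tokenize(doc)
        return len(q & d) / len(q | d) if q | d else 0.0

    return sorted(docs, key=score, reverse=True)[:k]


context = retrieve("How does clock tree synthesis work?", DOCS)
```

The retrieved passages would then be passed to the language model as context, so that answers stay grounded in existing documentation.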

As part of this project, I will be working alongside another project to build and test on a valid dataset for training and deployment. We will also be collaborating with other research teams at NYU and ASU, working on similar projects related to OpenROAD chat assistants and flow generation using Generative AI. Our primary objective is to minimize support overhead, improve user experience by reducing response times, and provide access to updated information about OpenROAD.

Upon completion, my project will offer a viable chat assistant architecture as part of OpenROAD that benefits both the users and tool developers of OpenROAD.

An early prototype developed along with a human evaluation framework shows promising results.

Here are some responses generated by the prototype,

I’m excited about the potential of ORAssistant as part of the OpenROAD tool suite to accelerate innovation in EDA and chip design by utilizing open-source tools along with Generative AI.

Stay tuned for more updates!

Memory Compiler in OpenROAD

Thu, 13 Jun 2024 00:00:00 +0000

Greetings! I’m Yash Kumar, working on the OpenROAD Memory Compiler Project. My proposal, under the mentorship of Matt and Austin, aims to enhance the OpenROAD flow by integrating a DFFRAM generator that extensively uses the OpenDB database to build and lay out various memory components such as bits, bytes, 32x32 configurations, and more.

Taking inspiration from the work of the AUCOHL repository’s DFFRAM memory compiler, the goal is to develop a DFF/Latch-based RAM that utilizes standard cell libraries. The compiler will generate different views (HDL netlist, functional models, LEF, DEF, Timing, etc.) for specified size configurations, targeting compact design and optimal routing. The compiler should work across various PDKs, starting with Sky130. My initial work tests the bit- and byte-level designs.

Optimizing Scientific Data Streaming: Developing Reproducible Benchmarks for High-Speed Memory-to-Memory Data Transfer over SciStream

Thu, 13 Jun 2024 00:00:00 +0000

Hello, I am Acheme, currently a PhD student in Computer Engineering at Clemson University. I will be working on SciStream, mentored by Joaquin Chung and Flavio Castro over this summer. Here is my proposal for this project.

I am excited to meet everyone and contribute to this project!

Reproducibility in Data Visualization

Thu, 13 Jun 2024 00:00:00 +0000

Hello everyone!

I’m Triveni, a Master’s student in Computer Science at Northern Illinois University (NIU). When I came across the OSRE 2024 project Categorize Differences in Reproduced Visualizations focusing on data visualization reproducibility, I was excited because it aligned with my interest in data visualization. While my initial interest was in geospatial data visualization, the project’s goal of ensuring reliable visualizations across all contexts really appealed to me. So, I actively worked on understanding the project’s key concepts and submitted my proposal, which can be viewed here, to join the project under the mentorship of David Koop.

Early Steps and Challenges:

I began working on the project on May 27th, three weeks ago. Setting up the local environment initially presented some challenges, but I persevered and successfully completed the setup process. The past few weeks have been spent exploring the complexities of reproducibility in visualizations, particularly focusing on capturing the discrepancies that arise when using different versions of libraries to generate visualizations. Working with Dr. David Koop as my mentor has been an incredible experience. Our weekly report meetings keep me accountable and focused. While exploring different algorithms and tools to compare visualizations can be challenging at times, it’s a fantastic opportunity to learn cutting-edge technologies and refine my problem-solving skills.
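
One simple way to quantify such discrepancies is a pixel-level comparison of two renders of the same chart. The sketch below assumes the images are already loaded as equal-sized grids of grayscale values; the data and the `pixel_difference` helper are hypothetical, and the project may well use more sophisticated comparison algorithms.

```python
def pixel_difference(img_a, img_b):
    """Return the fraction of pixels that differ between two same-sized images."""
    if len(img_a) != len(img_b) or any(
        len(ra) != len(rb) for ra, rb in zip(img_a, img_b)
    ):
        raise ValueError("images must have the same dimensions")
    total = sum(len(row) for row in img_a)
    changed = sum(
        pa != pb
        for ra, rb in zip(img_a, img_b)
        for pa, pb in zip(ra, rb)
    )
    return changed / total


# Hypothetical 2x3 grayscale renders of the same chart from two library versions.
render_v1 = [[0, 0, 255], [0, 255, 255]]
render_v2 = [[0, 0, 255], [0, 0, 255]]
diff_ratio = pixel_difference(render_v1, render_v2)
```

A ratio of zero means the two versions rendered identically; anything above zero flags the pair for closer inspection or categorization.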

Looking Ahead:

I believe this project can make a valuable contribution to the field of reproducible data visualization. By combining automated comparison tools with a user-centric interface, we can empower researchers and data scientists to make informed decisions about the impact of visualization variations. In future blog posts, I’ll share more about the specific tools and techniques being used, and how this framework will contribute to a more reliable and trustworthy approach to data visualization reproducibility.

Stay tuned!

I’m excited to embark on this journey and share my progress with all of you.

Stream Processing support for FasTensor

Thu, 13 Jun 2024 00:00:00 +0000

Hi, I’m Aditya Narayan,👋

I’m a frequent visitor to the town square of theoretical CS, operations (Ops), and robust high-performance systems. Sometimes I indulge myself with insights on Computing and Biology, and other times I enjoy the accounts of minefield experiences in the systems world. Luckily, this summer, OSRE offered an opportunity that happened to be at the perfect intersection of my interests.

This summer, I will be working on a scientific computing library called FasTensor that offers a parallel computing structure called Stencil, widely popular in the scientific computing world to solve PDEs for Physical Simulations and Convolutions on Signals, among its many uses. I am excited to introduce my mentors, Dr. Bin Dong and Dr. John Wu of the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL). They bring invaluable expertise to the project.

They recognized the need for a tensor processing library that provided dedicated support for big datasets with inherent structural locality, often found in the scientific computing world, which was lacking in popular open-source MapReduce or Key-Value based frameworks.

More often than not, the operations performed on these datasets are composed of computations involving neighboring elements. This motivated the development of the FasTensor library.
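
A stencil computation of this kind can be sketched in a few lines. The example below is plain Python and is not FasTensor's API; `apply_stencil` and `three_point_mean` are hypothetical names, and clamping at the edges is an assumption made to keep the example short.

```python
def apply_stencil(data, kernel, radius=1):
    """Apply kernel to each element's neighborhood; edges are clamped."""
    n = len(data)
    out = []
    for i in range(n):
        window = [
            data[min(max(i + d, 0), n - 1)] for d in range(-radius, radius + 1)
        ]
        out.append(kernel(window))
    return out


def three_point_mean(window):
    """Average an element with its left and right neighbors."""
    return sum(window) / len(window)


signal = [0.0, 3.0, 6.0, 3.0, 0.0]
smoothed = apply_stencil(signal, three_point_mean)
```

Because each output element depends only on a small neighborhood of the input, the work can be split across chunks of the array and run in parallel, which is what makes the pattern attractive for large structured datasets.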

I will be working on providing a Stream Processing interface that enables online data processing of large-scale datasets as they arrive from Data Producers. The project focuses on offering rich interfaces for managing and composing streams, supporting common scientific data formats like HDF5, and integrating fault tolerance and reliability mechanisms.

I am thrilled to work on the FasTensor project because I believe it has the potential to make a significant impact by enabling researchers to implement a rich set of computations on their big datasets in an easy and intuitive manner.

After all, FasTensor has just one simple paradigm: A -> Transform(F(x), B),

and it handles all the behind-the-scenes grunt work of handling big datasets so you can focus on your research.

Stay tuned for updates and feel free to collaborate!

Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon

Wed, 12 Jun 2024 00:00:00 +0000

About the project:

How it all started

This journey began amidst our college’s cultural fest, in which I was participating, just 15 days before the proposal submission deadline. Many of my friends had been working for months to get selected for GSoC. I didn’t think I could participate this year because I was late, so I thought, “Better luck next year.” But during the fest, I kept hearing about UC OSPO and that a senior had been selected within a month. So, I was in my room when my friend told me, “What’s the worst that can happen? Just apply,” and so I did. I chose this project and wrote my introduction in Slack without knowing much. After that, it’s history. I worked really hard for the next 10 days learning about the project, making the proposal, and got selected.

First few weeks:

I started the project a week early from June 24, and it’s been two weeks since. The start was a bit challenging since it required setting up a lot of things on my local machine. For the past few weeks, the majority of my time has been dedicated to learning about COMPSs, RO-Crate, and Chameleon, the three technologies this project revolves around. The interaction with my mentor has also been great. From the weekly report meetings to my daily barrage of questions, he has been really helpful. It is my first time working with Chameleon or any cloud computing software, so it can be a bit overwhelming sometimes, but it is getting better with practice.

Stay tuned for progress in the next blog!!

FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I’m Lihaowen (Jayce) Zhu, currently pursuing my Master of Science in Computer Science at the University of Chicago. I will be spending my summer working on the project FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning under the mentorship of Yuyang (Roy) Huang and Swami Sundararaman. The details are in my proposal.

The landscape of machine learning (ML) is profoundly impacted by the initial stages of feature engineering and data preprocessing. This phase, critical for the success of ML projects, is often the most time-consuming, representing about 80% of the effort in typical ML workflows. The FEP-Bench project proposes to address the significant bottlenecks encountered during this phase, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the required resources of ML projects.

First Steps in Enhancing User Experience Reproducibility through TROVI Redesign

Wed, 12 Jun 2024 00:00:00 +0000

Hello! My name is Alicia Esquivel Morel, and I’m a graduate research assistant at the University of Missouri – Columbia, pursuing a PhD in Computer Science. This summer, I’m working on a project to improve user experience reproducibility through a redesign of TROVI, as part of the Summer of Reproducibility (SoR) program. I’m excited to be working with two fabulous mentors: Kate Keahey and Mark Powers.

Research Reproducibility with a TROVI Redesign

Researchers constantly face challenges replicating experiments due to limitations in current tools. TROVI, a platform designed to facilitate experiment replication, can be hindered by hard-to-follow interfaces and difficulties integrating code and data. This leads to confusion and frustration.

The Redesign’s Goals

Enhanced User Experience: Inspired by user-friendly platforms like Google Colab, we’ll simplify TROVI’s interface for intuitive navigation and ease of use.
Uploads and Sharing: Uploading code and data and collaborating with other researchers are key goals. Integration with platforms like GitHub will further streamline collaboration.
Continuous Improvement: A built-in feedback loop will allow users to provide input and suggestions, ensuring TROVI constantly evolves based on user needs.

The Road Ahead

We’re at the beginning of the redesign process. In the next blog post, I’ll describe the project’s specific goals and the deliverables you can expect.

Stay tuned to see how TROVI is built for reproducible research!!

FSA: Benchmarking Fail-Slow Algorithms

Wed, 12 Jun 2024 00:00:00 +0000

Hi everyone! I’m Xikang, a master’s CS student at UChicago. As part of the FSA benchmarking project, I’m thrilled to be a contributor to OSRE 2024, collaborating with Kexin Pei, Assistant Professor of Computer Science at UChicago, and Ruidan, a talented PhD student at UChicago.

This summer, I will focus on integrating some advanced ML into our RAID slowdown analysis. Our aim is to assess whether LLMs can effectively identify RAID slowdown issues and to benchmark their performance against our current machine learning algorithms. We will test the algorithms on Chameleon Cloud and benchmark them.

Additionally, we will explore optimization techniques to enhance our pipeline and improve response quality. We hope this research will be a starting point for future work, utilizing LLMs to overcome the limitations of existing algorithms and provide a comprehensive analysis that enhances RAID and other storage system performance.

I’m excited to work with all of you and look forward to your suggestions. If you are interested, here is my proposal.

ML-Powered Problem Detection in Chameleon

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I am Syed Mohammad Qasim, a PhD candidate in Electrical and Computer Engineering at Boston University. I will be spending my summer working on the project ML-Powered Problem Detection in Chameleon under the mentorship of Ayse Coskun and Michael Sherman.

Currently, Chameleon Cloud monitors sites at the Texas Advanced Computing Center (TACC), University of Chicago, Northwestern University, and Argonne National Lab. They collect metrics using Prometheus at each site and feed them all to a central Mimir cluster. All the logs go to a central Loki, and Grafana is used to visualize and set alerts. Chameleon currently collects around 3000 metrics. Manually reviewing and setting alerts on them is time-consuming and labor-intensive. This project aims to help Chameleon operators monitor their systems more effectively and improve overall reliability by creating an anomaly detection service that can augment the existing alerting framework.
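As a rough illustration of what an anomaly detection service adds on top of fixed alert thresholds, the sketch below flags samples that deviate sharply from their recent history. The post does not say which detection method the project uses; the rolling z-score here, the function name, the window size, and the sample values are all assumptions for illustration.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, List


def rolling_zscore_anomalies(samples: Iterable[float], window: int = 5,
                             threshold: float = 3.0) -> List[int]:
    """Return indices whose value deviates from the trailing window mean
    by more than `threshold` standard deviations."""
    history = deque(maxlen=window)
    flagged = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma == 0:
                # A perfectly flat window: treat any change as anomalous.
                if value != mu:
                    flagged.append(i)
            elif abs(value - mu) / sigma > threshold:
                flagged.append(i)
        history.append(value)
    return flagged


# Made-up samples standing in for one scraped metric.
cpu_load = [10, 11, 10, 12, 11, 10, 11, 95, 11, 10]
anomalies = rolling_zscore_anomalies(cpu_load)
print(anomalies)
```

Because the baseline is learned per metric from its own history, the same code can be applied across thousands of metrics without hand-tuning a threshold for each one.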

OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I’m Jiajun Mao, a BS/MS student at the University of Chicago studying Computer Science. I will be spending this summer working on the project OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS under the mentorship of Meng Wang and Anjus George. The details are in my proposal.

How to increase the durability and reliability of data while decreasing storage cost has always been an interesting research topic. In recent years, erasure-coded storage systems have been seen as strong candidates to replace replication for colder storage tiers. In the paper “Design Considerations and Analysis of Multi-Level Erasure Coding in Large-Scale Data Centers”, the authors used theory and simulation to explore how a multi-level erasure-coded system can outperform systems using single-level erasure codes in areas such as encoding throughput and network bandwidth consumed for repair, addressing a few pain points in adopting erasure-coded storage systems. I will put the paper’s theoretical and simulation results into practice by building on top of HDFS and ZFS and benchmarking the resulting system’s performance.

The project aims to achieve the following:

HDFS understands the underlying characteristics of ZFS as the filesystem
HDFS understands the failure reports from ZFS and uses new, MLEC-specific repair logic to execute parity repair
ZFS can accept repair data from HDFS to repair a suspended pool caused by catastrophic data corruption
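For readers new to erasure coding, the sketch below shows the basic repair idea with the simplest possible code: one XOR parity block over a stripe. This is a single-level toy and an assumption made for illustration; the multi-level scheme in the paper, and whatever HDFS and ZFS do internally, is considerably more involved.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data_blocks: List[bytes]) -> List[bytes]:
    """Append one parity block to the data blocks (a k+1 code)."""
    return data_blocks + [xor_blocks(data_blocks)]


def repair(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild a single missing block (marked None) from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity can repair exactly one lost block")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of all surviving blocks equals the lost block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


stripe = encode([b"abcd", b"efgh", b"ijkl"])
damaged = [stripe[0], None, stripe[2], stripe[3]]
restored = repair(damaged)
print(restored[1])
```

Storing one parity block for three data blocks costs about 33% extra space, compared with 200% extra for three-way replication, which is why erasure coding is attractive for colder tiers.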

Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster

Wed, 12 Jun 2024 00:00:00 +0000

Hi! I’m Martin, and I will be working on Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster under the mentorship of In Kee Kim. Our work is driven by the scale of the computing systems that host data commons: we believe that performance characterization of genomics workloads should be done rapidly and at a scale similar to production settings. Feel free to check our proposal for more details!

We propose GenScale, a genomics workload benchmarking tool that can achieve both the scale and speed necessary for characterizing performance under large-scale settings. GenScale will be built on top of an industrial-grade cluster manager (e.g. Kubernetes) and metrics collection and monitoring systems (e.g. Prometheus), and will support a comprehensive set of applications used in state-of-the-art genomics workflows. The initial version developed during this project will include DNA and RNA alignment workflows.

Finally, we believe that open access and reproducible research will greatly accelerate the pace of scientific discovery. We aim to package our artefacts and generated datasets in ways that make them as easy as possible to replicate, analyze, and build upon. I personally look forward to learning from and contributing to the open source community!

LAST: ML in Detecting and Addressing System Drift

Tue, 11 Jun 2024 00:00:00 +0000

Hello! I am Joanna, currently an undergraduate student studying Computer Science and Applied Mathematics and Statistics at Johns Hopkins University. I will be working on ML in Detecting and Addressing System Drift, mentored by Ray Andrew Sinurat and Sandeep Madireddy, over this summer. Here is my proposal for this project.

This project aims to build a data analysis pipeline to analyze various datasets, both system and non-system, that have shown notable changes over time. The goal is to understand the characteristics of these datasets (specifically their drifts), evaluate the efficacy of Aging Detection Algorithms, and identify their limitations in computer system tasks.

I am excited to meet everyone and contribute to this project!

Developing a Pipeline to Benchmark Drift Management Strategies

Mon, 10 Jun 2024 00:00:00 +0000

With guidance from mentors Ray Andrew Sinurat and Sandeep Madireddy under the LAST project, I aim to develop a pipeline to benchmark the efficacy of various drift management algorithms.

Despite the abundance of literature on this subject, reproducibility remains a challenge due to the lack of available source code. As such, by crafting this pipeline, I aim to create a standardized platform for researchers and practitioners to compare several state-of-the-art drift management approaches. Through rigorous testing and benchmarking, we seek to identify the most effective algorithms across a spectrum of drift scenarios, including gradual, sudden, and recurring drift.
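The three drift scenarios can be pictured with synthetic streams. The sketch below is an illustration under stated assumptions, not part of the project's pipeline: gradual drift is approximated by a linear ramp of the mean, and `first_drift` is a deliberately naive window-mean detector invented here so that the scenarios can be compared.

```python
from typing import List


def sudden(n: int, change_at: int) -> List[float]:
    """Mean jumps from 0 to 1 at a single point."""
    return [0.0 if t < change_at else 1.0 for t in range(n)]


def gradual(n: int, start: int, end: int) -> List[float]:
    """Mean ramps linearly from 0 to 1 between start and end."""
    return [min(max((t - start) / (end - start), 0.0), 1.0) for t in range(n)]


def recurring(n: int, period: int) -> List[float]:
    """Mean alternates between 0 and 1 every `period` steps."""
    return [float((t // period) % 2) for t in range(n)]


def first_drift(stream: List[float], window: int = 10, tol: float = 0.5) -> int:
    """Index at which the mean of the latest window first differs from the
    mean of the first window by more than tol; -1 if it never does."""
    reference = sum(stream[:window]) / window
    for t in range(2 * window, len(stream) + 1):
        current = sum(stream[t - window:t]) / window
        if abs(current - reference) > tol:
            return t - 1
    return -1


print(first_drift(sudden(100, 50)))       # change begins at t = 50
print(first_drift(gradual(100, 40, 80)))  # change begins at t = 40
print(first_drift(recurring(100, 25)))    # first change begins at t = 25
```

Even this toy shows why scenarios must be benchmarked separately: the same detector reacts within a few steps to a sudden shift but lags well behind the onset of a gradual one.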

The final deliverable of this pipeline will be packaged into a Chameleon Trovi Artifact. The pipeline will also be made easily extensible to cater to additional datasets or any custom-made drift-mitigation methods. This is my proposal for the project.

See you around!

Reproducing and benchmarking scalability bugs hiding in cloud systems

Mon, 10 Jun 2024 00:00:00 +0000

Hello there!

I am Shuang Liang, a third-year student studying Computer and Information Science at The Ohio State University. My passion lies in cloud computing and high-performance computing, areas I have explored extensively during my academic journey. I have participated in various projects and competitions, which have honed my technical skills and deepened my interest in distributed systems.

I am contributing to ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems. My proposal, under the mentorship of Professor Yang Wang and Bogdan "Bo" Stoica, aims to tackle the critical challenges posed by scalability bugs in systems like Cassandra, HDFS, and Hadoop. These bugs can lead to severe operational issues such as system downtime and data loss, particularly as systems scale up.

The project goals include systematically analyzing and documenting scalability bugs, developing protocols to effectively trigger and quantify the impact of these bugs, and creating reproducible artifacts and detailed investigation scripts to aid in bug analysis.

Our project will involve rigorous bug report analysis, reproduction of scalability bugs, and a comparative study of system behaviors before and after bug fixes. We aim to develop methodologies that enhance the reliability and performance of large-scale distributed systems, providing valuable insights and resources to the open-source community.

Stay tuned to explore the future of reliable and scalable distributed systems!

BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking

Sun, 09 Jun 2024 00:00:00 +0000

Hello! My name is Qianru, and I will be working on a project to improve spatial transcriptomics during Google Summer of Code 2024. My project, Benchmarking Gene Imputation Methods for Spatial Transcriptomics, is mentored by Ziheng Duan and Cormac Flanagan. The goal is to create a standard platform to evaluate methods for filling in missing gene data, which is a big challenge in spatial transcriptomics. My proposal can be viewed here!

Spatial transcriptomics lets us see where genes are active in tissues, giving us insight into how cells interact in their natural environment. However, current methods often miss some gene data, making it hard to get a complete picture. Gene imputation can help fill in these gaps.

My project will:

Create a benchmark dataset to standardize gene imputation tasks across different platforms, species, and organs.

Compare various gene imputation methods to see how well they work in different scenarios.

Develop a user-friendly Python package with tools for gene imputation to help researchers improve their data.

I’m excited to contribute to this project and help advance the field of spatial transcriptomics by making data analysis more accurate and comprehensive.

ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems

Sat, 08 Jun 2024 00:00:00 +0000

Hi! I’m Zahra, an undergraduate at Universitas Dian Nuswantoro, Indonesia. As part of ScaleRep, my proposal, under the mentorship of Bogdan "Bo" Stoica and Yang Wang, aims to systematically understand, characterize, and document the challenges associated with scalability bugs in large-scale distributed systems.

ScaleRep proposes a two-fold strategy to address scalability bugs in large-scale distributed systems. First, Bug Analysis and Documentation involves studying recent scalability issues across popular open-source systems such as Cassandra, Hadoop, HDFS, Ignite, and Spark to understand bug causes, symptoms, and solutions. This includes pinpointing common challenges hindering bug reproduction and devising protocols to trigger and measure scalability bug impacts. Second, Implementation and Artifact Packaging focuses on identifying, reproducing, and documenting scalability bugs, then packaging artifacts with Chameleon Trovi. This method emphasizes precise bug analysis, establishing reproducible environments, and detailed documentation to ensure artifact reliability and usability.

Drishti

Thu, 06 Jun 2024 00:00:00 +0000

Namaste everyone! 🙏🏻

I’m Joel Tony, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I’m truly honored to be part of this year’s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I’m particularly grateful to be working under the mentorship of Dr. Jean Luca Bez, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. Suren Byna, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.

My project, “Drishti: Visualization and Analysis of AI-based Applications”, aims to extend the Drishti framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer’s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.

Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I’m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.

Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I’ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computing often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn’t just mean more computation; it means handling vastly more data. This shift makes I/O optimisation not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.

As outlined in my proposal, my tasks are threefold:

Modularize Drishti’s codebase: Currently, it’s a single 1700-line file that handles multiple functionalities. I’ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.
Enable multi-trace handling: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I’ll build a layer to aggregate these, providing a comprehensive view of the application’s I/O behavior.
Craft AI/ML-specific recommendations: Current suggestions often involve MPI-IO or HDF5, which aren’t typical in ML frameworks like PyTorch or TensorFlow. I’ll create targeted recommendations that align with these frameworks’ data pipelines.
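The multi-trace task above can be pictured as a reduction over per-process summaries. The sketch below is only an illustration: the dictionaries are hypothetical stand-ins for parsed trace summaries, and the counter names and `aggregate_traces` are invented here rather than taken from Drishti or Darshan.

```python
from collections import Counter
from typing import Dict, Iterable


def aggregate_traces(traces: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Sum per-trace counters into one application-wide view."""
    total: Counter = Counter()
    for counters in traces:
        total.update(counters)  # adds counts key by key
    return dict(total)


# Hypothetical per-process summaries, e.g. one per data-loader worker.
worker_traces = [
    {"bytes_read": 4096, "read_ops": 4, "files_opened": 2},
    {"bytes_read": 8192, "read_ops": 16, "files_opened": 5},
    {"bytes_read": 1024, "read_ops": 1, "files_opened": 1},
]
summary = aggregate_traces(worker_traces)
mean_read_size = summary["bytes_read"] / summary["read_ops"]
print(summary, round(mean_read_size, 1))
```

A derived figure such as the mean read size only becomes meaningful after aggregation, since any single worker's trace may look healthy while the job as a whole issues many small reads.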

This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it’s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.

From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.

In today’s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we’re not just tweaking code. We’re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.

I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I’m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.

Causeway: Learning Web Development Through Micro-Roles

Mon, 03 Jun 2024 00:00:00 +0000

Hello! My name is Rishi and I will be contributing to Causeway, a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack, during Google Summer of Code 2024. My project is Causeway: Improving the Core Infrastructure and Experience!, mentored by David Lee. This project aims to modernize the platform by adding various login options (Google, GitHub, email/password, passwordless) using Firebase Authentication, enhancing the landing page with an about section and improved UI, and introducing section quizzes via Firebase Firestore and Cloud Functions. It also involves developing user and learning dashboards with Angular Material UI and Firebase Cloud Functions, improving the overall UI design with application walkthroughs, providing an introductory demo for new users, incorporating generative AI features, automating deployment and monitoring with Vercel Bot, and adding contact and feedback options. These enhancements will boost user engagement, usability, and the overall learning experience. My proposal can be viewed here!

Causeway is a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack. It aims to bridge the gap in online coding tutorials by providing a holistic approach to web application development, breaking down the process into a hierarchy of micro-roles. This structure offers learners a clear pathway for learning and translates into a clear process for developing an application. In the longer term, this approach will enable learners to contribute to projects by taking on micro-roles for yet-to-be-developed projects. The platform leverages the Stackblitz WebContainer API to run full applications in the browser for interactive learning.

Enhancing h5bench with HDF5 Compression Capability

Mon, 27 May 2024 00:00:00 +0000

SLICES/pos: Reproducible Experiment Workflows

Fri, 17 May 2024 00:00:00 +0000

Servus everyone!

I’m Kilian Warmuth, currently pursuing my M.Sc. in Computer Science at the Technical University of Munich (TUM) after completing my B.Sc. in Computer Science at the same institution. Throughout my academic education, I have taken courses in Advanced Computer Networks, which have deepened my understanding and expertise in the field. I was involved in an interdisciplinary project where I created a testing toolchain for the packet generator MoonGen using the SLICES/pos testbed. This experience provided me with extensive hands-on exposure to pos, increasing my interest in reproducible testbeds and the enhancement of pos.

As part of the SLICES/pos: Reproducible Experiment Workflows project, my proposal, under the mentorship of Sebastian Gallenmüller, Kate Keahey, and Georg Carle, aims to address the challenges of managing experiment results within the pos framework.

The project leverages the RO-Crate open standard to organize result data systematically, enhancing accessibility and comprehensibility of research findings. We aim to improve experiment documentation for the pos testbed, providing clear setup and execution instructions to ensure reproducibility. We also aim to simplify the dissemination of research findings by automating the creation of RO-Crates, allowing researchers to focus on experiment design without needing to be familiar with RO-Crate standards. Implementing these standards will enhance the sharing of results by automating publication processes for open repositories, promoting transparency and collaboration.
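To give a sense of what automating RO-Crate creation involves, here is a minimal sketch that builds the skeleton of an `ro-crate-metadata.json` document following the RO-Crate 1.1 layout. The function name, the experiment name, and the file paths are hypothetical, and a complete crate would carry more properties (for example a publication date and a license); this is not how pos generates its crates.

```python
import json
from typing import Dict, List


def build_ro_crate(name: str, description: str,
                   files: List[Dict[str, str]]) -> Dict:
    """Build a minimal RO-Crate 1.1 metadata skeleton for result files."""
    file_entities = [
        {"@id": f["path"], "@type": "File", "name": f["name"]} for f in files
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor: points at the root dataset.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset: describes the experiment as a whole.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "hasPart": [{"@id": e["@id"]} for e in file_entities],
            },
            *file_entities,
        ],
    }


# Hypothetical experiment results.
crate = build_ro_crate(
    "Example throughput experiment",
    "Results of one experiment run",
    [{"path": "results/run1.csv", "name": "Run 1 measurements"},
     {"path": "setup.sh", "name": "Setup script"}],
)
print(json.dumps(crate, indent=2))
```

Generating this structure from the testbed's own record of an experiment is what lets researchers publish results without learning the JSON-LD details themselves.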

We also aim to enhance the portability of experiments across different testbeds, with a particular focus on the Chameleon Testbed. We will develop introductory examples demonstrating how to use pos in various testbed environments. Additionally, we will design and execute a portable complex network experiment based on SLICES/pos. To validate the portability enhancements, we will perform experiments on the Chameleon testbed. Finally, we will refine the portability of pos experiments within Chameleon to ensure seamless execution.

Stay tuned to explore the future of reproducible testbeds!

Hardware Hierarchical Dynamical Systems

Tue, 14 May 2024 00:00:00 +0000

As part of Micro Architecture Santa Cruz (MASC), my proposal, under the mentorship of Jose Renau and Sakshi Garg, aims to develop a tree data structure under HHDS to replace the current one offered by LHTree.

The tree data structure is to be optimized for typical AST traversal and queries. Some queries that are made to this tree are much more frequent than others. Thus a flattening policy will be used to optimize the tree for these queries, at the potential cost of becoming slow for the infrequent queries. The tree will be benchmarked for scalability and performance and is expected to outperform the current version of the tree. Once the implementation is complete, the tree will be integrated into the LiveHD core repository.
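One common way to flatten a tree is to store it as parallel arrays indexed by node id instead of pointer-linked nodes, which makes the frequent queries (first child, next sibling, parent) single array lookups. The sketch below illustrates that general idea only; the class, its fields, and the example AST are assumptions and do not reflect the actual HHDS layout or flattening policy.

```python
from typing import Iterator, List

NONE = -1  # sentinel for "no node"


class FlatTree:
    """Tree stored as parallel arrays instead of pointer-linked nodes."""

    def __init__(self) -> None:
        self.parent: List[int] = []
        self.first_child: List[int] = []
        self.last_child: List[int] = []
        self.next_sibling: List[int] = []
        self.label: List[str] = []

    def add(self, label: str, parent: int = NONE) -> int:
        """Append a node as the last child of `parent`; return its id."""
        node = len(self.label)
        self.label.append(label)
        self.parent.append(parent)
        self.first_child.append(NONE)
        self.last_child.append(NONE)
        self.next_sibling.append(NONE)
        if parent != NONE:
            if self.first_child[parent] == NONE:
                self.first_child[parent] = node
            else:
                self.next_sibling[self.last_child[parent]] = node
            self.last_child[parent] = node
        return node

    def children(self, node: int) -> Iterator[int]:
        child = self.first_child[node]
        while child != NONE:
            yield child
            child = self.next_sibling[child]

    def preorder(self, root: int = 0) -> Iterator[int]:
        stack = [root]
        while stack:
            node = stack.pop()
            yield node
            # Push children reversed so the first child is visited first.
            stack.extend(reversed(list(self.children(node))))


# AST for the statement "a = b + c".
tree = FlatTree()
assign = tree.add("assign")
tree.add("a", assign)
plus = tree.add("+", assign)
tree.add("b", plus)
tree.add("c", plus)
order = [tree.label[n] for n in tree.preorder()]
print(order)
```

The trade-off mentioned above shows up here too: appending and walking are cheap, while an infrequent operation such as deleting a node from the middle of a sibling list would require a scan.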

HDEval: Benchmarking LLMs that Generate Verilog/Chisel Modules From Natural Language

Tue, 14 May 2024 00:00:00 +0000

Hi everyone!

I’m Ashwin Bardhwaj, currently pursuing a bachelor’s in Electrical Engineering and Computer Science at UC Berkeley. I was recently involved in a project to implement a secure hardware encryption enclave in Verilog. That’s why I was excited to work with the MASC group to evaluate how existing generalized LLMs (such as ChatGPT 4 or StarCoder) can generate accurate Verilog/Chisel code from English and assist in the hardware development process.

As part of Micro Architecture Santa Cruz (MASC), my proposal, under the mentorship of Jose Renau and Sakshi Garg, looks to create a suite of benchmark programs for HDEval.

The deliverable of this project is a set of large HDL benchmarks, each with a corresponding set of prompts. Using yosys to run a Logic Equivalence Check, we can formally verify that the generated code exhibits the same behavior as the benchmark. In addition, we can consider the performance and resource utilization of the generated code as metrics.
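The idea behind an equivalence check can be shown on a tiny combinational circuit by comparing two implementations on every possible input. This exhaustive truth-table comparison is only a stand-in chosen for illustration: it does not use yosys, it only scales to a handful of inputs, and the function names and the full-adder example are assumptions.

```python
from itertools import product
from typing import Callable, Optional, Tuple

Circuit = Callable[..., Tuple[int, ...]]


def find_mismatch(reference: Circuit, candidate: Circuit,
                  num_inputs: int) -> Optional[Tuple[int, ...]]:
    """Return an input vector on which the circuits differ, or None."""
    for bits in product((0, 1), repeat=num_inputs):
        if reference(*bits) != candidate(*bits):
            return bits
    return None


def full_adder_ref(a: int, b: int, cin: int) -> Tuple[int, int]:
    """Benchmark behavior: arithmetic definition, returns (sum, carry)."""
    total = a + b + cin
    return total & 1, total >> 1


def full_adder_gates(a: int, b: int, cin: int) -> Tuple[int, int]:
    """A correct gate-level version, as generated code might look."""
    return a ^ b ^ cin, (a & b) | (cin & (a ^ b))


def full_adder_buggy(a: int, b: int, cin: int) -> Tuple[int, int]:
    """A plausible but wrong version that forgets the carry-in term."""
    return a ^ b ^ cin, a & b


print(find_mismatch(full_adder_ref, full_adder_gates, 3))
print(find_mismatch(full_adder_ref, full_adder_buggy, 3))
```

A counterexample input is more useful than a pass/fail flag when scoring an LLM, because it shows exactly where the generated module diverges from the benchmark.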

These 4 new features will change the way you use OpenROAD

Sun, 29 Oct 2023 00:00:00 +0000

Introduction

Welcome to the final blog post for my GSoC’23! Once again, my name is Jack and I am working under the open-source electronic design automation project OpenROAD. We are a fast-growing, leading open-source foundational application for semiconductor digital design, as evidenced by our consistent star growth since inception. You may check us out at this link. Allow me to share the four significant contributions I made in this GSoC project.

1) Improving Ease of Installation

Firstly, OpenROAD is now able to support multiple operating systems. This is essential, as one of our primary goals is to democratise chip implementation. Installation is often one of the hardest steps to get right, so it was one of our priorities. Today, we provide options for different types of installation:

Prebuilt binaries: Local installations can often be riddled with incompatibilities or unexpected bugs, and compilation takes a long time. We sidestepped this by providing semi-regular updates to the OpenROAD binary, reducing the time to installation.
Docker: Echoing the previous concerns, we also enabled Docker installation for 9 major operating systems. Docker is extremely flexible and runs on many host operating systems.

With these changes, we have observed a 10% reduction in installation-related GitHub issues posted on a weekly basis.

Figure 1: Supported OS matrix

2) Filling Missing Documentation

Next, we have made considerable improvements to over 20 tool-specific documentation pages, introducing consistent formatting styles for each page. We added default values and datatypes so that users can use the tools with greater ease.

Figure 2: Helpful documentation defaults and datatype

Rather than keeping all arguments for a function in a common table, we separated them out into developer arguments and developer commands. This makes our documentation more beginner-friendly to read, while not alienating our technical userbase. We have also added sections for example scripts and regression tests, to help onboard newcomers to each tool of the flow.

Figure 3: Useful developer commands, example scripts, and regression test instructions

3) Extensible Documentation Framework

Thirdly, we have introduced extensible documentation frameworks. Now, what do we mean by extensible? It means we have created an infrastructure which is easy to use for developers, and allows for greater maintainability. Our goal is to create something that requires minimal changes to add content for documentation.

So, how did we do this?

We introduced four initiatives. The first is the warning/error messages glossary. We noticed that people were searching for error and warning messages, but our documentation did not have them. So we added a page where all the error/warning messages, along with the relevant code line numbers, can be generated automatically. On top of that, developers can add useful debug information to help the end user.

Figure 4: Warning/Error messages glossary.
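A glossary like this can be generated by scanning source files for logging calls and recording where each one occurs. The sketch below shows the general approach only: the call shape in the regular expression, the tool name `ABC`, the file name, and the message texts are all hypothetical and are not taken from the OpenROAD code base or its scripts.

```python
import re
from typing import Dict, List

# Assumed call shape for illustration: logger->error(TOOL, 42, "message");
MESSAGE_CALL = re.compile(
    r'logger->(?P<level>warn|error)\(\s*(?P<tool>\w+)\s*,'
    r'\s*(?P<id>\d+)\s*,\s*"(?P<text>[^"]*)"'
)


def build_glossary(filename: str, source: str) -> List[Dict[str, str]]:
    """Collect one glossary entry per warning/error call in `source`."""
    entries = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        match = MESSAGE_CALL.search(line)
        if match:
            entries.append({
                "code": f"{match['tool']}-{int(match['id']):04d}",
                "level": match["level"],
                "message": match["text"],
                "location": f"{filename}:{line_no}",
            })
    return entries


# Hypothetical source snippet.
sample = '''void run() {
  logger->warn(ABC, 12, "Input has no pins.");
  if (failed) {
    logger->error(ABC, 118, "Step did not converge.");
  }
}'''
glossary = build_glossary("example.cpp", sample)
for entry in glossary:
    print(entry)
```

Because the page is regenerated from the source, new messages appear in the documentation without anyone editing the glossary by hand.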

Next, we also introduced automatically generated Doxygen pages, which integrate nicely into our C++/Tcl source code framework. Developers can now simply insert comments into their source code and let Doxygen generate the documentation automatically.

Figure 5: Doxygen pages.

Next, we introduced cloud-based packaging. It is important that our framework can run in the cloud and in the ever-popular notebook format. Our Colab-based notebook was created with this in mind, and allows for easy transfer to other notebook providers with some modifications. Check out the notebooks here!

Figure 6: Google Colab can now run OpenROAD scripts.

Lastly, we have the changelog workflow, which can be triggered manually. For our open-source project, we have chosen not to do software releases. This means it can be difficult to track the changes between commits. Adding this workflow helps newcomers track the changes more easily, month by month.

Figure 7: Sample output of GitHub changelog

4) OpenROAD Chatbot

Finally, we are also discussing the potential of creating a chatbot whose purpose is to answer user queries. Our thinking is that there is a lot of domain knowledge in Slack channels, GitHub repos, and so on, so why not create an LLM-based chatbot? Stay tuned for updates!

Personal Reflections

To me, the most valuable takeaway concerns code quality. Oftentimes, we as coders tend to opt for the quickest solution and “hack” something out. Hacking is fine as a proof of concept, but not for long-term code development. Working in open-source projects like this, I have learnt to avoid creating unnecessary files, to shorten code, and to optimise runtime. In doing our job, we also wish to make life easier, not harder, for future developers.

Final Words

I would like to express my gratitude to my mentors Indira and Vitor for their guidance and insight throughout the project, as well as the OpenROAD dev team for their assistance. I would also like to thank the Google Summer of Code organising committee and UCSC for creating such a wonderful program. Being able to contribute to real open-source projects with real needs is truly the best of both worlds for aspiring programmers.

Final Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings

Wed, 25 Oct 2023 00:00:00 +0000

In my final blog, I will first introduce the project, then describe the achievements after the midterm and summarize our experiments. As part of Measuring Research Prototypes under Unreported Settings, my proposal, under the mentorship of Yang Wang and Miao YU, aims to understand the impact of missing settings in artifact evaluation.

In my midterm blog(/report/osre23/osu/missingsettings/20230802-ren.450/), I took three paratmeters as the PostgreSQL config to test the performance of TPC-C benchmark and got some initial results about the effect of different parameters separately on throughput performance. After the midterm, I continue doing experiments on these four parameters (shared_buffer, min_wal_size, max_wal_size and effective_cache_size) with more values and associate them to measure the effect on performance. These parameters are related to memory consumption, checkpoints and planner cost in the database server. You can refer to my previous blog for details.

For the experiment, we continue to measure the throughput of the benchmark by setting the scale factor to 10 and incrementing the number of worker terminals. The settings for the database server are all default values except for the four parameters we chose to tune. For the shared_buffers parameter, we choose 6 values, from the initial 128MB up to 8GB. Then, for each shared_buffers setting, effective_cache_size takes three values, from the initial 4GB up to 16GB. Next, for each effective_cache_size setting, we tune min_wal_size and max_wal_size as a tuple: min_wal_size has two values and max_wal_size has four values, giving 6 tuples in total. We run three rounds for each setting, record all three throughput numbers, and calculate their average.
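To give a sense of the size of this sweep, the sketch below enumerates the settings with `itertools.product`. The post states only the ranges and the number of values per parameter, so the intermediate values and the exact (min_wal_size, max_wal_size) pairs here are assumptions made for illustration, not the values used in the experiment.

```python
from itertools import product

# PostgreSQL's name for the first parameter is shared_buffers.
# Only the end points and counts come from the post; the values in
# between are hypothetical placeholders.
shared_buffers = ["128MB", "512MB", "1GB", "2GB", "4GB", "8GB"]  # 6 values
effective_cache_size = ["4GB", "8GB", "16GB"]                    # 3 values
wal_sizes = [                                                    # 6 (min, max) tuples
    ("80MB", "1GB"), ("80MB", "2GB"), ("80MB", "4GB"), ("80MB", "8GB"),
    ("1GB", "4GB"), ("1GB", "8GB"),
]
ROUNDS = 3  # each setting is run three times and averaged


def build_grid():
    """Return one dict of PostgreSQL settings per configuration."""
    grid = []
    for sb, ecs, (min_wal, max_wal) in product(
        shared_buffers, effective_cache_size, wal_sizes
    ):
        grid.append({
            "shared_buffers": sb,
            "effective_cache_size": ecs,
            "min_wal_size": min_wal,
            "max_wal_size": max_wal,
        })
    return grid


def average_throughput(samples):
    """Average the throughput numbers measured for one setting."""
    return sum(samples) / len(samples)


grid = build_grid()
total_runs = len(grid) * ROUNDS
print(len(grid), "settings,", total_runs, "benchmark runs per terminal count")
```

Under these assumptions the sweep covers 6 x 3 x 6 = 108 settings, or 324 benchmark runs for each number of worker terminals, which is why the full experiment is costly in time.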

Based on the results, our observations match the conclusions from the midterm blog. The throughput of the benchmark can be affected by tuning shared_buffers and max_wal_size. Effective_cache_size and min_wal_size do not have an obvious effect for this benchmark. The improvement is limited after shared_buffers and max_wal_size reach a certain value.

In our experiment, we only choose three possible parameters for one benchmark. The experiment is expensive in terms of time. There are also more values of the above-mentioned parameters to test. This experiment also indicates that we may need to sample a subset of settings to generate observations that match those from a full, extensive artifact evaluation.

Public Artifact and Data Visualization: A Journey to Empower

Tue, 24 Oct 2023 00:00:00 +0000

Hola Amigos! As we draw the curtains on our project titled Public Artifact and Data Visualization we’re thrilled to present the incredible advancements we’ve achieved since our mid-term update. Our mission has been to foster a deeper understanding of data and empower users to make informed decisions. Let’s delve into the remarkable evolution of our project.

Unveiling New Functionalities

Modular Architecture: Your Way, Your Choice

At the core of our project is a modular architecture designed to cater to your unique preferences. We firmly believe that choice empowers users. Thus, we’ve given you the option to select between a Graphical User Interface (GUI) and a Command-Line Interface (CLI). It’s about providing a platform that adapts to your specific requirements and style of interaction.

Real-time Backend Environment Monitoring: Data as it Happens

Real-time monitoring of backend environment data is at the heart of our project. It’s not just about collecting data; it’s about providing continuous insights into system performance. This feature empowers you to make real-time, data-driven decisions—an essential capability in today’s fast-paced computing landscape.

Visualizing Environment Variables: Clarity Amidst Complexity

We’ve placed a strong emphasis on user-friendly data visualization. Our enhancements enable you to navigate through detected variables effortlessly and compare iterations within different buckets. The result is a visual representation of complex data, making it easier to comprehend and analyze.

Predefined Monitoring Commands: Your Head Start

We understand that monitoring can be a daunting task. To simplify the process, we’ve introduced predefined monitoring commands such as mpstat and iostat. These templates serve as a launchpad for monitoring common system metrics, helping you get started quickly and efficiently.

Comprehensive Customization: Tailoring the Experience

Recognizing that every user has unique needs, our platform now offers extensive documentation. This documentation serves as a guide, enabling users to fine-tune their monitoring commands. It’s about tailoring the platform to match your specific requirements and preferences. The power to customize is firmly in your hands.

Import and Export Functionality: Seamless Collaboration

In an era where collaboration and data management are essential, we’ve introduced the capability to import and export environment data. This feature simplifies data management and supports collaborative efforts, making it easy to share monitoring data and conduct analysis across various environments.

Exploring Our Repositories

As mentioned earlier, we have completed the core functionalities of our platform, and we would love to have you try it out and provide us with valuable feedback. Here are the links to our repositories where you can explore and experiment with our platform:

GUI Repository and CLI Repository
- The journey begins with a choice. Our repositories cater to a diverse range of user preferences. Inside the README.md file of the GUI repository, you’ll find meticulous installation instructions to guide you through setting up the Graphical User Interface (GUI). It’s your portal to a user-friendly experience.
Sample Repository
- For those eager to embark on their monitoring journey, our Sample Repository is a valuable resource. It provides scripts that not only enable you to run our program but also serve as templates. These templates are designed to simplify the monitoring of your own programs, tailored to your unique requirements.

Project Demo

To provide you with a glimpse of what our project can do, here are some demo images showcasing the capabilities and features of “Public Artifact and Data Visualization.”

Thank You for Joining Us

We appreciate your support and participation in this journey of data visualization and empowerment. Our commitment to enhancing the world of data comprehension remains unwavering. As we mark the end of this chapter, we eagerly anticipate the exciting future that awaits in the realm of data visualization. The path doesn’t end here; it’s just the beginning of a new chapter in our collective exploration of data’s potential.

Final Blog on Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery

Fri, 20 Oct 2023 00:00:00 +0000

Hello Again!

I’m excited to present my final blog post summarizing the progress and achievements made over the 2023 Summer of Reproducibility Fellowship. I will be sharing the work I’ve created for the Teaching Computer Networks with Reproducible Research: Developing a ‘classroom competition’ for adaptive video delivery project.

Recap of the Journey

In my mid-term evaluation, I discussed the initial milestones and challenges I encountered during this program. At that point, I studied the key figures from the research paper ‘Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming’. My primary objectives were to ensure compatibility with both Python 2 and Python 3 and to incorporate an ‘Estimated Download Rate’ metric into the output file generated by the adaptive video client. Furthermore, I expanded the project to include two crucial visualizations: buffer occupancy vs. time and estimated download rate vs. time.

Final Project Progress

In the final weeks of my internship, I worked towards my ultimate goal, which was to reproduce existing work and create a clear guide for future students. I aimed to enable them to build upon and improve this work. To achieve this, I created a new experiment based on an existing one, which I titled “Compare Adaptive Video Policies”.

This experiment compares two policies: rate-based (basic) policy and buffer-based (Netflix) policy. In the experiment, I covered the following key aspects:

How Both Policies Work: I detailed the workings of both the rate-based and buffer-based policies, explaining how each policy selects the next bitrate, among other relevant information.

Instructions for Execution of Policies: After conducting several experiments with different settings, I determined the most appropriate settings for this experiment. These settings have been added to the instructions for executing both policies, with a focus on ensuring similar “high” network rates, “low” data rates, similar durations of the “high” data rate before the interruption, and similar durations of the “interruption.” This setup allows for an easy and clear comparison of the two policies.

Discussion Part: In the discussion section, I addressed the differences that students can observe after conducting the experiment and visualising the graphs and videos.
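As a rough illustration of how the two policies pick the next bitrate, here is a minimal sketch. It is not the code of the adaptive video client used in the experiment: the function names, the bitrate ladder, and the `reservoir_s` and `cushion_s` parameters are assumptions, and the buffer-based rule is a simplified linear mapping in the spirit of the buffer-based approach rather than the full algorithm from the paper.

```python
def rate_based_choice(bitrates, estimated_rate):
    """Pick the highest bitrate that does not exceed the estimated download rate."""
    candidates = [b for b in sorted(bitrates) if b <= estimated_rate]
    # Fall back to the lowest bitrate when even that exceeds the estimate.
    return candidates[-1] if candidates else min(bitrates)


def buffer_based_choice(bitrates, buffer_s, reservoir_s=5.0, cushion_s=20.0):
    """Pick a bitrate from buffer occupancy alone (simplified linear mapping).

    Below the reservoir the lowest bitrate is used; above reservoir + cushion
    the highest; in between the target grows linearly with the buffer level.
    """
    rates = sorted(bitrates)
    if buffer_s <= reservoir_s:
        return rates[0]
    if buffer_s >= reservoir_s + cushion_s:
        return rates[-1]
    fraction = (buffer_s - reservoir_s) / cushion_s
    target = rates[0] + fraction * (rates[-1] - rates[0])
    # Highest available bitrate at or below the target.
    return [b for b in rates if b <= target][-1]


ladder_kbps = [300, 750, 1500, 3000]  # hypothetical bitrate ladder
print(rate_based_choice(ladder_kbps, estimated_rate=1000))
print(buffer_based_choice(ladder_kbps, buffer_s=15.0))
```

The contrast is that the rate-based policy reacts directly to the throughput estimate, so a network interruption changes its choice at once, while the buffer-based policy only lowers the bitrate as the buffer drains.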

In conclusion, I would like to thank my mentor, Fraida Fund, who has given me excellent guidance, and I would like to express my gratitude to OSRE23, where I have learned so much. This experience has been amazing for my personal and professional growth.

Final Blog on Using Reproducibility in Machine Learning Education

Wed, 18 Oct 2023 00:00:00 +0000

Welcome back!

In my final blog post for the 2023 Summer of Reproducibility Fellowship, I’ll be sharing my experiences and the materials I’ve created for the Using Reproducibility in Machine Learning Education project. As a quick reminder, my mentor Fraida Fund and I have been working on developing interactive open-source educational resources that teach reproducibility and reproducible research in machine learning. You can find my proposal here.

In this post, I’ll give you a rundown of my experience and share the materials I’ve created. If you haven’t checked out my previous blog posts, definitely take a look before diving into this one. Let’s get started!

Why is this project important 🤔

Reproducibility is an essential aspect of scientific research, and it’s becoming increasingly important in the field of computer science. However, most efforts to promote reproducibility in education focus on students who are actively involved in research, leaving a significant gap in the curriculum for introductory courses. Our project aims to address this issue by incorporating reproducibility experiences into machine learning education.

Why Reproducibility Matters in Education 🎓

There are two primary reasons why we believe reproducibility belongs in the computer science classroom. Firstly, it allows students to experience the process of reproducing research firsthand, giving them a deeper understanding of the scientific method and its importance in the field. This exposure can inspire students to adopt reproducible practices in their future careers, contributing to a more transparent and reliable scientific community.

Source: Fund, Fraida. “We Need More Reproducibility Content Across the Computer Science Curriculum.” Proceedings of the 2023 ACM Conference on Reproducibility and Replicability. 2023.

Secondly, as shown in the figure, involving students in reproducibility efforts can have a significant impact on the reproducibility ecosystem itself. Students can create reproducibility artifacts, such as replicable experiments or data analysis, that can be used by other researchers, including authors and graduate students. Additionally, students can consume reproducibility artifacts created by the research community, provide feedback, and suggest improvements. Authors appreciate this type of engagement, as it adds value to their work and promotes open science.

Focusing on Machine Learning 🧐

Given the growing interest in machine learning and its relevance to reproducibility, our project decided to focus on this area. Machine learning already has a strong culture of reproducibility, with initiatives like Papers with Code and the ML Reproducibility Challenge. These efforts encourage researchers to share their code and reproduce recent machine learning papers, validating their results. By leveraging these existing resources, we can create learning materials that utilize real-world examples and foster hands-on reproducibility experiences for students.

The Interactive Notebooks 📖

We have created two learning materials that focus on machine learning and reproducibility. The first material looks at a paper titled “On Warm Starting Neural Network Training” by Jordan T. Ash and Ryan P. Adams. This paper discusses the concept of warm-starting, which involves using weights from a model previously trained on a subset of the dataset to train a new model. The authors compare the performance of warm-started models with randomly initialized models and find that the warm-started models perform worse, as shown in the figure below.

Our material takes students through the process of identifying the different claims made in the paper and finding the corresponding experiments that support them. They will also learn how to use open-source code and available data to reproduce these experiments and understand the computational complexity associated with reproducing each experiment. This material can be found on both GitHub and Chameleon, where you can use Chameleon to run the material on the required resources.

The second material examines the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Dosovitskiy et al., which introduces a novel way of applying the transformer architecture, which was originally designed for natural language processing, to image recognition tasks. The paper shows that transformers can achieve state-of-the-art results on several image classification benchmarks, such as ImageNet, when trained on large-scale datasets as shown in the following table.

Our material guides students through the process of understanding which claims can and cannot be validated based on the available datasets and how complex it can be to validate each claim. Additionally, they will learn how to use pre-trained models to replicate computationally expensive experiments. Again, this material can be found on both GitHub and Chameleon.

Both materials are designed to be easy to understand and interactive, allowing students to engage with the content and gain a deeper understanding of the concepts. Instructors can use these materials to assess their students’ understanding of machine learning and reproducibility.

Reflecting on the Journey

As we wrap up our journey of creating beginner-friendly learning materials for machine learning using reproducibility, it’s time to reflect on the rewarding experiences and valuable lessons learned along the way. Our deep dive into the world of machine learning and reproducibility not only enriched our knowledge but also provided us with an opportunity to contribute to the community at the UC Open Source Symposium 2023 at UCSC.

The symposium was a memorable event where we presented our work in a poster session. The diversity of the audience, ranging from professors and researchers to students, added depth to our understanding through their valuable feedback and insights. It was intriguing to see the potential applications of our work in various contexts and its capacity to benefit the broader community.

This project has been a personal journey of growth, teaching me much more than just machine learning and reproducibility. It honed my skills in collaboration, communication, and problem-solving. I learned to distill complex ideas into simple, accessible language and create engaging, interactive learning experiences. The most fulfilling part of this journey has been seeing our work come alive and realizing its potential to positively impact many people. The gratification that comes from creating something useful for others is unparalleled, and we are thrilled to share our materials with the world.

Your time and interest in our work are greatly appreciated! Hope you enjoyed this blog!

GPU Emulator for Easy Reproducibility of DNN Training -- Final Blog Post

Fri, 06 Oct 2023 00:00:00 +0000

Introduction

For the second half of the project, I spent some time reproducing the figures and then focused on digging into the PyTorch source code to characterize the inter-GPU computation (1 GPU vs. 2 GPUs).

Summary

Finished reproducing Figures 3, 4, 5, and 6 from GPU Emulator for Easy Reproducibility of DNN Training.
Explored inter-GPU computation in order to reproduce Figure 9.

Repository for Reproducing Figures

I have placed the repository of the deliverable here: https://github.com/FarFlyField/OSRE_DELIVERABLE/tree/main To use the repository, you can use any machine with just a CPU, or you can check the results by renting a GPU from Chameleon and comparing its results with the emulator’s.

The repository covers how to set it up and how to understand the data it produces. You will need to understand the spreadsheet and some of the graphing files.

Study of Inter-GPU Computation

I have dissected the PyTorch source code to identify the differences in computation time between using 1 GPU and 2 GPUs (the inter-GPU computation time). The most significant difference arises during the forward pass. Here are the main features that make the computation time longer when using 2 GPUs:

When using 1 GPU, PyTorch puts the images used to train the model onto the GPU once, in the main application. With 2 GPUs, however, PyTorch transfers the images before running the forward pass, using a parallel function. This function does two things:
- It splits the images into multiple sections and places each section on its respective GPU.
- It copies the model being trained onto each GPU and creates threads to train the replicas in parallel on these separate GPUs.
These two steps account for a major part of the difference in computation time when training on multiple GPUs. They also make the measured transfer time smaller when using 2 GPUs, because the time spent transferring images is counted toward computation time.
After the forward pass finishes, the parallel function gathers the outputs from the two GPUs and sends them to the first GPU.

After gathering the output to the first GPU, the code will train the next batch and repeat the steps of transferring data, copying images, running parallel forwarding, and gathering the outputs once again.
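The per-batch sequence described above (split the inputs, copy the model, run the copies in parallel threads, gather the outputs) can be sketched in plain Python. This is a CPU-only simulation written for illustration, not PyTorch's actual code: `ToyModel` and all function names are hypothetical, and lists stand in for tensors and devices.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class ToyModel:
    """Stand-in for a network: multiplies every input by a weight."""

    def __init__(self, weight):
        self.weight = weight

    def __call__(self, inputs):
        return [self.weight * x for x in inputs]


def scatter(batch, num_devices):
    """Split a non-empty batch into at most num_devices contiguous chunks."""
    chunk = (len(batch) + num_devices - 1) // num_devices
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]


def replicate(model, num_devices):
    """Copy the model once per device."""
    return [copy.deepcopy(model) for _ in range(num_devices)]


def parallel_apply(replicas, chunks):
    """Run each replica on its own chunk in a separate thread."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda pair: pair[0](pair[1]), zip(replicas, chunks)))


def gather(outputs):
    """Concatenate the per-device outputs, as if collecting them on device 0."""
    merged = []
    for out in outputs:
        merged.extend(out)
    return merged


def data_parallel_forward(model, batch, num_devices=2):
    chunks = scatter(batch, num_devices)
    replicas = replicate(model, len(chunks))
    return gather(parallel_apply(replicas, chunks))


print(data_parallel_forward(ToyModel(2), [1, 2, 3, 4, 5]))
```

Every call to `data_parallel_forward` repeats the scatter, replicate, and gather steps, which mirrors why these costs show up on every batch in the 2-GPU measurements.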

The second significant difference, which I’m working on right now, arises when PyTorch runs the backward pass, which is broadly similar to the forward pass but not identical. I have located the loss.backward() call in our application code as the only contributor to the difference in computation time. Here are a few tasks I did after locating it:

Recorded the functions’ call stack when using 1 GPU and 2 GPUs.
Recorded the time spent in each function in the call stack.
Identified inconsistencies in the measurements, then repeated and verified them until the results were consistent.
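A measurement loop of this kind can be sketched as follows. This is an illustrative pattern using only the Python standard library, not the instrumentation actually used in the project; `timed`, `summarize`, and `backward_step` are hypothetical names, and `backward_step` is a CPU placeholder for the real call. For real GPU work, kernels run asynchronously, so one would typically synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock.

```python
import time
import traceback
from collections import defaultdict
from functools import wraps

timings = defaultdict(list)  # function name -> list of durations in seconds
call_stacks = {}             # function name -> names of the calling frames


def timed(fn):
    """Record the caller stack once and the duration of every call."""

    @wraps(fn)
    def wrapper(*args, **kwargs):
        if fn.__name__ not in call_stacks:
            frames = traceback.extract_stack()[:-1]  # drop the wrapper frame
            call_stacks[fn.__name__] = [frame.name for frame in frames]
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__].append(time.perf_counter() - start)

    return wrapper


def summarize(name):
    """Return (mean, spread) so inconsistent measurements stand out."""
    samples = timings[name]
    mean = sum(samples) / len(samples)
    spread = max(samples) - min(samples)
    return mean, spread


@timed
def backward_step(n):
    # Placeholder workload standing in for the real backward pass.
    return sum(i * i for i in range(n))


for _ in range(5):  # repeat to check that the measurements are consistent
    backward_step(10_000)

mean, spread = summarize("backward_step")
print(f"mean={mean:.6f}s spread={spread:.6f}s")
```

A large spread relative to the mean is a signal to repeat the measurement before comparing the 1-GPU and 2-GPU numbers.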

I have finished the basic measurements and drafted the call stack, but I haven’t figured out the exact differences yet. Because most of the functions are implemented in C++, printing out their inputs to evaluate them will be slightly harder, but doable.

The data recorded and analyzed are placed here: https://docs.google.com/spreadsheets/d/1vFj-UE3mjtsHIc5OesKX1sDvr6fpPwtPUMl0pM3V8SA/edit?usp=sharing
Summarized doc: https://docs.google.com/document/d/10XWNwCZ3kLzy4i6WgJ6KPsujEs2X1gzXblZtUoqMuJw/edit

Learning Machine Learning by Reproducing Vision Transformers

Fri, 06 Oct 2023 00:00:00 +0000

Hello again!

In this blog post, I will be discussing the second material I created for the 2023 Summer of Reproducibility Fellowship. As you may recall from my first post, I am working on the Using Reproducibility in Machine Learning Education project with Fraida Fund as my mentor. My goal is to create interactive open-source educational resources that teach reproducibility and reproducible research in machine learning (ML), as outlined in my proposal.

In this post, I will share with you my second material, and how it can be helpful in machine learning class to teach students about vision transformers and reproducibility at the same time. If you haven’t seen my first work, be sure to check out my previous blog post. Without further ado, let’s dive in!

Reproducing “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”

This material is a reproduction of Dosovitskiy et al.’s 2020 paper, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. This paper introduces the Vision Transformer (ViT), a novel architecture that applies the transformer model, originally designed for natural language processing tasks, to image recognition. The ViT model achieves state-of-the-art performance on several image classification benchmarks, demonstrating the potential of transformers for computer vision tasks.

The figure illustrates the key idea behind ViT, which is to treat an image as a sequence of patches, similar to how a transformer treats a sentence as a sequence of words. Each patch is flattened into a vector and fed into the transformer encoder, which learns to capture the complex relationships between these patches. The resulting representation is then fed into an MLP head, which produces a final prediction for the image. This approach allows ViT to handle large input images and capture both global context and fine-grained details. ViT models can also be pre-trained on large datasets and fine-tuned on smaller datasets for improved performance.
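The patch-splitting step can be illustrated with a small, dependency-free sketch. It handles a single-channel image stored as nested lists and a hypothetical `patchify` helper; the real model works on multi-channel images with 16x16 patches and then applies a learned linear projection and position embeddings, which are omitted here.

```python
def patchify(image, patch_size):
    """Split a 2D image into non-overlapping patches, each flattened to a vector.

    Patches are returned in row-major order, like words in a sentence.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 toy image whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), "patches of length", len(patches[0]))
print(patches[0])
```

By the same arithmetic, a 224x224 image cut into 16x16 patches yields a sequence of 14 x 14 = 196 patch vectors.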

To reproduce this paper, I followed a systematic approach to ensure reliable results:

Critically analyze the paper’s qualitative and quantitative claims.
Identify the necessary experiments to verify each claim.
Determine the required data, code, and hyperparameters for each experiment.
Utilize pre-trained models for validating claims that require high computational resources.
Investigate resources shared by the authors, such as code, data, and models.
Assess the feasibility of verifying different types of claims.
Design new experiments for validating qualitative claims when certain models or datasets are unavailable.

I utilized Chameleon as my platform for conducting and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental environment that supports computer science systems research. It enables users to create and share Jupyter notebooks capable of running Python code on Chameleon’s cloud servers. Running the notebooks on a GPU requires one with 24GB or more of memory, which is available among the GPUs that Chameleon offers.

I have set up a GitHub repository where you can access all of my reproduction work. The repository contains interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research. These notebooks provide a hands-on approach to understanding the concepts and techniques presented in my reproduction work.

Challenges

Reproducing a paper can be a challenging task, and I encountered several obstacles during the process, including:

The unavailability of pretraining datasets and pretrained models
Inexact or unspecified hyperparameters
The need for expensive resources for some hyperparameters
The use of different frameworks for baseline CNNs and Vision Transformers

These issues posed significant difficulties in replicating the following table, a key result from the Vision Transformer paper that demonstrates its superiority over prior state-of-the-art models.

To overcome these challenges, I used the same models mentioned in the paper but pretrained on different datasets, experimented with various hyperparameter combinations to achieve the best results, and wrote my own code to ensure that both the baseline and Vision Transformer were fine-tuned using the same framework. I also faced other challenges, which I discussed in my notebooks along with the solutions I applied.

How to use this material?

This material consists of a series of notebooks that guide you through the paper, its claims, experiments, and results. You will learn how to analyze, interpret, and validate the authors’ claims. To get started, I recommend briefly skimming the original paper to gain an understanding of the main ideas and public information. This will help you see how the authors could have been more transparent and clear in certain sections. The notebooks provide clear instructions and explanations, as well as details on how I addressed any missing components.

Conclusion

In this blog post, I’ve walked you through the contents of this material and the insights users can gain from it. This material is particularly intriguing as it replicates a paper that has significantly influenced the field of computer vision. The interactive nature of the material makes it not only educational but also engaging and enjoyable. I believe users will find this resource both fun and beneficial.

I hope you found this post informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for reading and stay tuned for more updates!

Writing a blog about your OSRE 2024 project

Fri, 06 Oct 2023 00:00:00 +0000

As was the case last year, the Organization Admins will be asking students and contributors to provide regular status updates, which will help us better highlight the work you are doing and track activities within our OSRE projects. These progress reports will also form the basis of blog reports prepared by students in the course of their summer. Blog reports should include links to proposals, presentations, reports, and an overview of the student’s experience.

Your experience is invaluable for future OSRE candidates and for improving the program every year.

Size and content

Keep it short and crisp. Include a short description of your project, a link to your project proposal, and, later in the program, links to the GSoC reports you provided.

Making a pull request for your blog

Fork the git repository
If you haven’t already done so, add your profile using these instructions
- IMPORTANT: Under user_groups: add - 2024 Contributors (as opposed to any of the two mentor groups)
- The short bio and any other information goes below the frontmatter
Post your blog
- Add /content/report/osre24/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md
- Add a frontmatter to index.md, using the labels below
- Blog text goes below the frontmatter
- In that same directory include a picture and call it featured.png (also supports .jpg, .jpeg)
Commit to your fork, make a pull request, and email the OSRE Admins (currently: Stephanie Lieggi, Carlos Maltzahn).

Example frontmatter and text body

---
title: "YOUR TITLE"
subtitle: "YOUR SUBTITLE (OPTIONAL)"
summary:
authors:
 - USERNAME1
 - USERNAME2
tags: ["osre24"]
categories: []
date: YYYY-MM-DD
lastmod: YYYY-MM-DD
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
 caption: ""
 focal_point: ""
 preview_only: false
---

As part of the [PROJECTNAME](/project/osre24/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...

Final GSoC Blog - Polyglot

Mon, 25 Sep 2023 00:00:00 +0000

As I send in my final work submission for the final GSoC evaluation, I’m excited to share with you the progress we’ve made this summer (and future plans for Polyglot!). You can view the repository and web app here: https://polyphyhub.github.io/PolyGlot/. As a quick reminder of the project, we sought to extend the Polyglot web app, as developed by Hongwei (Henry) Zhou. For context, the web app follows this methodology:

Given a set of words, use an embedding model (such as Word2Vec, BERT, etc.) to generate a set of high dimensional points associated with each word.
Use a dimensionality reduction method (such as UMAP) to reduce the dimensionality of each word-vector point to 3 dimensions
Use the novel MCPM (Monte Carlo Physarum Machine) to compute the similarities between a set of anchor points and the rest of the point cloud. You could use any similarity metric here, too, such as the Euclidean distance.
The web app then displays the point cloud of 3-dimensional embeddings, but uses coloring to indicate the level of MCPM similarity each word has with the anchor point (e.g., if the anchor point is the word “dog”, the rest of the point cloud is colored such that words identified as similar to “dog” by the MCPM metric are brighter, whereas dissimilar words are darker).
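The coloring step can be sketched with the Euclidean distance, which the methodology mentions as an alternative similarity metric. This sketch does not implement MCPM, and the point coordinates, the `brightness_from_anchor` helper, and the linear distance-to-brightness mapping are assumptions made for illustration rather than the web app's actual code.

```python
import math


def brightness_from_anchor(points, anchor_word):
    """Map each word to a brightness in [0, 1] based on distance to the anchor.

    The anchor itself gets 1.0 (brightest) and the farthest word gets 0.0.
    """
    anchor = points[anchor_word]
    distances = {word: math.dist(p, anchor) for word, p in points.items()}
    max_distance = max(distances.values()) or 1.0  # avoid dividing by zero
    return {word: 1.0 - d / max_distance for word, d in distances.items()}


# Hypothetical 3D embeddings after dimensionality reduction.
points = {
    "dog": (0.0, 0.0, 0.0),
    "cat": (1.0, 0.0, 0.0),
    "car": (3.0, 4.0, 0.0),
}
brightness = brightness_from_anchor(points, "dog")
for word, value in brightness.items():
    print(word, round(value, 2))
```

Here “cat” ends up much brighter than “car” because it lies closer to the anchor “dog” in the reduced space.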

The main results since the last blog are summarized as follows:

Novel timeline feature in which users can track the importance of certain words over time by watching the change in size of points (computes the TF-IDF metric for a word across all documents in a given year). Uses linear interpolation for years which do not have an explicit importance score.
An industrial collaboration with UK startup Lautonomy, where we have pre-processed and entered their data into Polyglot. Pre-processing consisted of first computing a high dimensional embedding of their set of words using OpenAI’s CLIP model https://openai.com/research/clip and the CLIP-as-service Python package https://clip-as-service.jina.ai. Next, we used UMAP to reduce the dimensionality of these embeddings to 3D. We computed the Euclidean distance on this data (in place of MCPM metric). Finally, we formatted the data to enter into Polyglot.
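The linear interpolation used by the timeline feature for years without an explicit importance score can be sketched as below. The `interpolate_importance` helper and the example scores are hypothetical; this shows the general technique rather than the Polyglot implementation.

```python
def interpolate_importance(scores):
    """Fill in missing years between known years by linear interpolation.

    `scores` maps a year to a known importance score (e.g. a TF-IDF value).
    Years before the first or after the last known year are left out.
    """
    years = sorted(scores)
    filled = {}
    for start, end in zip(years, years[1:]):
        for year in range(start, end):
            t = (year - start) / (end - start)
            filled[year] = scores[start] + t * (scores[end] - scores[start])
    filled[years[-1]] = scores[years[-1]]
    return filled


known = {2000: 0.2, 2004: 0.6, 2005: 0.1}  # hypothetical importance scores
timeline = interpolate_importance(known)
for year in sorted(timeline):
    print(year, round(timeline[year], 3))
```

In the app, the interpolated value for each year would then drive the size of the word's point, so the point grows or shrinks smoothly between years that have real scores.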

Although the app has developed a lot over the summer, we are planning to continue working on Polyglot, particularly with respect to one of our original goals: to set up a pipeline from PolyPhy to Polyglot. Unfortunately, with PolyPhy undergoing refactoring this summer, we weren’t able to set this pipeline up. However, that is one of our goals for the next few months. We are also moving forward with the industrial collaboration with legal analytics startup Lautonomy. We hope to release an output together soon!

If you’re curious about Polyglot or are interested in getting involved, please feel free to reach out to me, Oskar Elek, and Jasmine Otto!

noWorkflow as an experiment management tool - Final Report

Thu, 14 Sep 2023 00:00:00 +0000

This post describes our final work status and the achievements we have made in our project proposal for noWorkflow.

For a friendlier introduction to our work, please refer to the available tutorial.

Our final code to merge is available in this repository.

Different ways of managing experiments

From our starting point at the midterm, and from our initial aspirations for the SoR, we kept on track with the goal of adding features to noWorkflow related to managing DS/ML experimental setups focusing on reproducibility.

With the emergence of AI across multiple fields in industry and academia, the subject of reproducibility has become increasingly relevant. In [1] we have an interesting description of the sources of irreproducibility in Machine Learning. All these sources are present at different stages during the project's experimental phases and may even persist in production environments, leading to the accumulation of technical debt [2]. The problem of irreproducibility is also discussed in [3], [4], which point out that speed of delivery usually comes at the expense of reproducibility, among other victims.

The CRISP-DM process as reviewed in [5] demonstrates that Data Science experiments follow a typical path of execution. In the same manner, [3], [6], [7] point out that Machine Learning pipelines are composed of well-defined layers (or stages) throughout their lifecycle. The emergence of AI in real-world applications has put a strain on the almost artisanal ways of creating and managing analytical experiments and reinforced that there is room to make things more efficient.

In the search for possible approaches to the problem, we came across several projects that aimed to address these issues. Not surprisingly, multiple authors pursued the same goal, for instance [9], [10]. In these references, and as confirmed in our survey, we found everything from targeted solutions for specific modeling steps to services aiming for end-to-end AIOps management. Some are available as software packages, others as SaaS in cloud environments. In general terms, all of them end up offering features in different layers of the workflow (i.e. data, feature, scoring, and evaluation) or with different conceptualizations of reproducibility/replicability/repeatability, as noticed by [11]. On one hand, this lack of standards makes any assessment difficult. On the other hand, it suggests a community in the exploratory phase of a hot topic.

Specifically for this project, our focus is on the initial stages of computational scientific experiments. As studied in [8], in this phase, experiments are i) implemented by people as prototypes, ii) with minor focus on pipeline design and iii) in tools like Notebooks, which mix documentation, visualization and code with no required sequential structure. These three practices impact reproducibility and efficiency and are prone to create technical debt. However, tools like noWorkflow show huge potential in such scenarios. They are promising because they i) demand a minimal setup to be functional, ii) work well with almost nonexistent workflows, iii) require minimal additional intrusive code alongside the experimental code and iv) integrate well with Notebooks, which are the typical artifact in these experiments.

According to its core team, the primary goal of noWorkflow is to "...allow scientists to benefit from provenance data analysis even when they don't use a workflow system." Unlike other tools, "noWorkflow captures provenance from Python scripts without needing a version control system or any other environment". This is particularly interesting in the scenario described above, where we lack any structured system at the beginning of experiments. In fact, after going through the docs, we can verify that noWorkflow provides:

Command-line accessibility
Seamless integration with Jupyter Notebooks
Minimal setup requirements in your environment
Elimination of the need for virtual machines or containers in its setup
Workflow-free operation
Open source license
Framework-agnostic position

Finally, in our research, we confirmed that there is a gap in the management of scientific experiments that reproducibility tooling needs to fill. Provenance tools can help academic and industry groups toward this goal, and this summer we focused on adding relevant features to move noWorkflow in this direction.

Different tools for different needs

In our research phase, we didn't find any taxonomy that fully accommodated our review of the different categories of tools providing reproducibility and experiment management. So, we describe the tools in the following categories (freely adapted from these online references [here] and [here]):

Data and Pipeline Versioning: Platforms dealing with ingestion, processing, and exposure of features for model training and inference. They enable collaboration and discoverability of already existing Feature Sets across teams and organizations. They provide provenance and lineage for data at different levels of complexity.

Metadata Stores/Experiment Trackers: They are specifically built to store metadata about ML experiments and expose it to stakeholders. They help with debugging, comparing, and collaborating on experiments. It is possible to divide them into Experiment Trackers and a Model Registry. Moreover, there are projects offering reproducibility features like hyperparameter search, experiment versioning, etc. However, they demand more robust workflows and are better suited for projects in the production/monitoring phases.

Pipeline frameworks: They operate within the realm of production, similar to Data Engineering workflows. Their usual goal is to allow any ML/AI products to be served across a wide range of architectures, and integrate all the low-hanging fruits along the way. For instance, pipelines adding hyperparameter optimization tasks, experiment tracking integrations, boilerplate containerized deployment, etc.

Deployment and Observability: They focus on deploying models for real-time inference and monitoring model quality once they are deployed in production. Their aim is to facilitate post-deployment control tasks such as monitoring feature drifts, conducting A/B testing, facilitating fast model shifts, and more.

The most remarkable aspect of this survey is that there are different tools for different phases in the life cycle of AI products. There are tools like DVC and Pachyderm that are Metadata Stores, allowing Experiment Tracking with features for tagging variables, as well as Data and Pipeline tracking. They are the most similar tools to noWorkflow in functionality. However, DVC has a more complex framework for dealing with different 'types' of tags, and relies on command-line tools to extract and analyze tagged variables. It also depends strongly on git and replicates git's logic. Pachyderm requires a more sophisticated setup at the start, relying on containers and a server. This is an obstacle to small and lean prototypes, as it requires installing a Docker image and brings all the friction of managing it.

There are other tools, like MLFlow and Neptune, that position themselves as Model Experiment Versioning tools with Monitoring and Deployment features. They also have elements of pipeline frameworks, offering boilerplate for seamless integration with cloud platforms.

Pipelines are a vast field. Examples include AWS SageMaker, Google Vertex, DataRobot, and Weights & Biases, among others. All of them offer features helping in all categories, with a strong focus on exploring all the automation that can be offered to the final user, such as automatic parameter tuning, model selection, retraining, data lineage, metadata storing, etc.

Finally, Deployment and Observability frameworks are in the deployment realm, which is another stage far removed from the prototypical phases of experiments. They come into the scene when all experimental and inferential processes are done, and there is an AI artifact that needs to be deployed and monitored. Tools such as Seldon, H2O, and DataRobot do this job, again with some features of hyperparameter tuning, pipeline frameworks, and data and pipeline tracking.

In light of this, when considering the management and operation of experiments, we have a reduced sample of alternatives. Among them, Notebook integration and management are rare. Some rely on other tools like Git or impose an overhead in coding and setup with reserved keywords, tags, and managerial workflows that hinder the process.

At first sight, our "informal" taxonomy positions noWorkflow as a Data/Pipeline Versioning tool and a Metadata Store/Experiment Tracker. It is not a Pipeline Framework, which works like a building block, facilitating the integration of artifacts at production stages. Nor is it a Deployment and Observability framework, because those belong to the post-deployment realm, which is another stage far removed from the prototypical phases of experiments.

Desiderata

As mentioned earlier, a typical workflow in DS/ML projects is well described by CRISP-DM [5] and precedes the deployment and production phases in the whole lifecycle of DS/ML projects.

Fig 1: CRISP-DM example of trajectory through a data science project

Briefly speaking, a workflow starts when a user creates a Jupyter Notebook and starts writing code. Usually, they import or select data from a source, explore the features expected to have the highest inference potential, tune some parameters to set up training, then train the model and evaluate its predictive power through different metrics. At this final step, we have delineated a trial. This trial result can suggest further improvements and new hypotheses about data, features, model types and hyperparameters. Then, we have a new experiment in mind that will result in a new trial.

When this process repeats multiple times, a researcher may end up with different notebooks, each one storing a different experiment. Each notebook has multiple hyperparameters, modeling choices, and modeling hypotheses. Alternatively, the experimenter may have a single notebook where different experiments were executed, in a nonlinear order between the cells. This latter case is pointed out in [8], where Notebook flexibility makes it difficult to understand which execution order resulted in a specific output.

In an ideal scenario, any researcher or team would benefit most if they could:

a) In a running Notebook, retrieve all the operations that contributed to the result of a variable of interest. In this case, modifications applied to the inputs or to the order of operations would be easily detectable, as would any nonlinear execution that interferes with a control result.

b) Compare trials after different experiments. After experimenting with different hypotheses about hyperparameters, features or operation order, the user should easily compare the history of two trials and spot differences.

c) Retrieve a target variable among different trials that were executed in the context of an experiment. After proceeding with multiple experimental trials, users should be able to compare the results, whether or not they are stored in different Notebooks.

d) Be as much "no workflow" as possible. All the former requisites should be possible with minimal code intervention, tags, reserved words or any active coding effort.

With these goals in mind, we worked on our deliverables and used the experiment carried out by [12] as a guideline to validate the new noWorkflow features.

Deliverables

In this section, we describe what we implemented during this summer.

We started with tagging cells and variables and then navigating through their pre-dependencies, that is, all other variables and function calls that contributed to their final value. This was a fundamental step that allowed us to go on to create features that are really useful in day-to-day practice.

From the features of tagging a cell and tagging a variable, we evolved to the following features (an interactive notebook is available here):

backwards_deps('var_name', glanularity_level) : returns a dictionary storing the operations/function calls and their associated values that contributed to the final value of the tagged variable. glanularity_level sets whether the internal operations of the functions are included or not.

global_backwards_deps('var_name', glanularity_level) : does the same as backwards_deps, but across all tagging and re-tagging events in the notebook. It allows retrieval of the complete operation history of a tagged variable across all executed cells in the notebook.
store_operations(trial_id, dictionary_ops) : saves the current trial in order to make further comparisons with other experiments. The dictionaries aren't stored in .noworkflow/db.sqlite, but in a shelve object named *ops.db* in the current notebook's local folder.
resume_trials() : to support the management of experiments, lists the trial_ids of all experiments stored in ops.db that are available for comparison/analysis.
trial_intersection_diff(trial_id1, trial_id2) : compares the scalar values of all variables/function calls that the two experiments have in common.

trial_diff(trial_id1, trial_id2) : The values of variables and function calls are exhibited in a diff file format, emphasizing the operations' order. The goal here is to show that between the two experiments, the order of operations was different. Again, only scalar values are exhibited. More complex data structures (matrices, vectors, tensors, etc.) are only signaled as 'complex_type'

var_tag_plot('var_name') : Chart the evolution of a given variable across multiple trials in the database. In this case, all experiments stored in ops.db and tagged as *target_var* have their values plotted

var_tag_values('var_name') : Provides access to a pandas DataFrame of var_name entries with their corresponding values across different trials.
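To make the store-and-compare idea concrete, here is a minimal standalone sketch that uses only the Python standard library. It is not noWorkflow's implementation: the helper names (store_ops, list_trials, diff_trials), the ops_sketch file name, and the sample operations are all hypothetical. It only assumes what is described above, namely that each trial's operations are kept in a shelve object and that non-scalar values are reported as 'complex_type' in an order-sensitive diff.

```python
import difflib
import os
import shelve
import tempfile

SCALARS = (int, float, str, bool)


def store_ops(db_path, trial_id, ops):
    """Persist one trial's ordered (name, value) operations under trial_id."""
    with shelve.open(db_path) as db:
        db[trial_id] = ops


def list_trials(db_path):
    """Return the ids of all trials stored in the shelve file."""
    with shelve.open(db_path) as db:
        return sorted(db.keys())


def render(ops):
    """Turn operations into text lines; non-scalars become 'complex_type'."""
    lines = []
    for name, value in ops:
        shown = repr(value) if isinstance(value, SCALARS) else "complex_type"
        lines.append(f"{name} = {shown}")
    return lines


def diff_trials(db_path, trial_a, trial_b):
    """Unified diff of two trials, sensitive to the order of operations."""
    with shelve.open(db_path) as db:
        ops_a, ops_b = db[trial_a], db[trial_b]
    return list(difflib.unified_diff(
        render(ops_a), render(ops_b),
        fromfile=trial_a, tofile=trial_b, lineterm=""))


# Hypothetical trials: the second one changes a hyperparameter and the order.
db_path = os.path.join(tempfile.mkdtemp(), "ops_sketch")
store_ops(db_path, "trial_1",
          [("n_estimators", 100), ("X_train", [[0.1, 0.2]]), ("accuracy", 0.91)])
store_ops(db_path, "trial_2",
          [("X_train", [[0.1, 0.2]]), ("n_estimators", 200), ("accuracy", 0.93)])

trial_ids = list_trials(db_path)
diff_lines = diff_trials(db_path, "trial_1", "trial_2")
print("\n".join(diff_lines))
```

A line-based diff keeps the order of operations visible, which is the point of trial_diff; comparing only the shared names would instead correspond to the idea behind trial_intersection_diff.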

Challenges

As expected, we had unexpected findings along the way. Below, we delve into the most significant challenges we had to face:

Jupyter notebooks allow a nonlinear execution of small parts of code through cells. More than once, we had to agree on how to design functionality to handle scenarios we had not anticipated. One example was the backwards_deps() and global_backwards_deps() functions. The latter function was born to cover the case where the user wants all dependencies rather than the local cell dependencies.
Despite the high quality of the current version of the package, the project needs more documentation; its absence slows down the analysis of any new development. In this project, the help of the mentors was crucial at some points where deeper knowledge was needed.
What is the vocation of noWorkflow? At some points in the project, we had to discuss forcing some kind of workflow on the user, which would go against the philosophy of the project.
When working on comparing results, especially in DS/ML fields, complex types arise. Numerical vectors, matrices, and tensors from NumPy and other frameworks, as well as data frames, can't be properly manipulated based on our current approach.
The dilemma of focusing on graphic visual features versus more sophisticated APIs. More than once, we needed to choose between making a visual add-on to Jupyter or implementing a more complete API.
The current Jupyter support in noWorkflow doesn’t integrate well with Jupyter Lab. Also, IPython itself keeps releasing new versions, and noWorkflow needs to adapt to them.

Future Improvements

Given our current achievements and the insights gained along the project, we would highlight the following points as crucial future roadmap improvements:

Add complex type handling for comparisons. Today, visualizing and navigating through matrices, data frames, and tensors isn't possible with noWorkflow, although users can do so by their own means.
Move the dictionaries storing sequences of operations from shelve objects to a more efficient means of storage and retrieval.
Make it easier for users to manage (store, retrieve, and navigate through) different trials.
Add graphical management instead of relying upon API calls only.
Evolve the feature of tagging cells.
When tagging a model, save its binary representation to be recovered in the future.
Add the capability of tracking local dataset reads. Currently, it is possible to track changes in the name/path of the dataset. However, a change that affects the integrity of a dataset is not traceable.

What I've learned

This was a great summer with two personal discoveries. The first was my first formal contact with the subject of Reproducibility. The second was contributing fully to an Open Source project. In the research phase, I got in touch with the state of the art of reproducibility research and some of its nuances. In the Open Source contribution experience, I was mentored by the core team of noWorkflow and exercised all the skills required to build a high-quality software product.

Acknowledgments

I would like to thank the Summer of Reproducibility organization for providing this wonderful opportunity for interested people to engage with Open Source software. Also, thanks to the core team of noWorkflow for supporting me in doing this work.

Bibliography

[1] [O. E. Gundersen, K. Coakley, C. Kirkpatrick, and Y. Gil, “Sources of irreproducibility in machine learning: A review,” arXiv preprint arXiv:2204.07610, 2022.]

[2] [D. Sculley et al., “Machine Learning: The High Interest Credit Card of Technical Debt,” in SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), 2014.]

[3] [P. Sugimura and F. Hartl, “Building a reproducible machine learning pipeline,” arXiv preprint arXiv:1810.04570, 2018.]

[4] [D. Sculley et al., “Hidden technical debt in machine learning systems,” Adv. Neural Inf. Process. Syst., vol. 28, 2015.]

[5] [F. Martínez-Plumed et al., “CRISP-DM twenty years later: From data mining processes to data science trajectories,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 8, pp. 3048–3061, 2019.]

[6] [N. A. Lynnerup, L. Nolling, R. Hasle, and J. Hallam, “A Survey on Reproducibility by Evaluating Deep Reinforcement Learning Algorithms on Real-World Robots,” in Proceedings of the Conference on Robot Learning, L. P. Kaelbling, D. Kragic, and K. Sugiura, Eds., in Proceedings of Machine Learning Research, vol. 100. PMLR, 30 Oct--01 Nov 2020, pp. 466–489.]

[7] [A. Masood and A. Hashmi, “AIOps: predictive analytics & machine learning in operations,” Cognitive Computing Recipes: Artificial Intelligence Solutions Using Microsoft Cognitive Services and TensorFlow, pp. 359–382, 2019.]

[8] [J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire, “Understanding and improving the quality and reproducibility of Jupyter notebooks,” Empirical Software Engineering, vol. 26, no. 4, p. 65, 2021.]

[9] [D. Kreuzberger, N. Kühl, and S. Hirschl, “Machine Learning Operations (MLOps): Overview, Definition, and Architecture,” IEEE Access, vol. 11, pp. 31866–31879, 2023.]

[10] [N. Hewage and D. Meedeniya, “Machine learning operations: A survey on MLOps tool support,” arXiv preprint arXiv:2202.10169, 2022.]

[11] [H. E. Plesser, “Reproducibility vs. replicability: a brief history of a confused terminology,” Front. Neuroinform., vol. 11, p. 76, 2018.]

[12] [Z. Salekshahrezaee, J. L. Leevy, and T. M. Khoshgoftaar, “The effect of feature extraction and data sampling on credit card fraud detection,” Journal of Big Data, vol. 10, no. 1, pp. 1–17, 2023.]

KV store final Blog

Fri, 25 Aug 2023 00:00:00 +0000

Hello again! Before we get started, take a look at my previous blogs, Introduction and Mid Term. The goal of the project was to implement an io_uring-based backend driver for the client side, which at that time used traditional sockets. The objective was to improve performance through the zero-copy capabilities of io_uring. In the process, I learnt many things about libkinetic and KV stores in general.

I started by writing a separate driver using io_uring in libkinetic/src in ktli_uring.c, most of which is similar to the sockets backend in ktli_sockets.c. The only difference was in the send and receive functions. For a more detailed description of the implementation, refer to the mid term blog.

After the implementation, it was time to put it to the test. We ran extensive benchmarks with a tool called fio, which is generally used to test filesystems and other IO-related workloads. Thanks to Philip, who had already written an IO engine for testing the kinetic KV store (link), I didn’t have much trouble setting up the testbench. Thanks again to Philip, who set up an Ubuntu server with the kinetic server and gave me access through ssh. We ran the tests on that server, with both the socket and uring backends, and with several different block sizes. The link to the benchmarks sheet can be found here.

We spent a lot of time reading and discussing the numbers, which was probably the most time-consuming part of the project. We had several long discussions analyzing the numbers and their implications. For example, in the initial tests we were getting a very high standard deviation in mean send times; we then figured out this was due to a network bottleneck, as we were using large block sizes and quickly filling up the 2.5G network bandwidth.

In conclusion, we found that many other major factors affect the performance of the KV store, for example the network and the server side of the KV store. Thus, although io_uring offers a performance benefit at the userspace-kernel level, in this case other factors had a more significant effect than the kernel IO stack on the client side. To increase performance, we need to look at the server side.

I would like to thank Philip and Aldrin for their unwavering support and in-depth discussions on the topic in our weekly meetings. I learned a lot from them throughout the entire duration of the project.
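As a rough illustration of why large blocks saturate the link, the back-of-the-envelope calculation below gives an upper bound on requests per second for a few block sizes. It assumes that "2.5G" means 2.5 Gbit/s and ignores protocol overhead; the block sizes are illustrative and are not taken from the benchmark sheet.

```python
LINK_BITS_PER_SECOND = 2.5e9  # assumption: "2.5G" means 2.5 Gbit/s


def max_requests_per_second(block_size_bytes, link_bps=LINK_BITS_PER_SECOND):
    """Upper bound on requests/s if the link carried only payload bytes."""
    return link_bps / 8 / block_size_bytes


# Illustrative block sizes (4 KiB, 64 KiB, 1 MiB), not the ones we benchmarked.
for size in (4 * 1024, 64 * 1024, 1024 * 1024):
    bound = max_requests_per_second(size)
    print(f"{size:>8} B blocks: at most {bound:,.0f} requests/s")
```

Once the request rate approaches this bound, send times are dominated by queueing on the network rather than by the client-side IO path, which would be consistent with the variance we observed.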

Grammar, Parsers, and Queries

Sat, 12 Aug 2023 00:00:00 +0000

Update on tree-sitter-pyrope

The pyrope hardware description language now has syntax highlighting available for neovim users. The repository includes a guide to installing the parser, and activating highlights. After we have tested the syntax highlighting, a pull request will be made to the nvim-treesitter repository. In this post, I will outline the highlighting process and reflect on a useful feature of neovim.

Syntax Trees

The pyrope language is described by a grammar. A grammar is a set of rules that describes the allowed structure of a language. A parser uses the grammar to generate a syntax tree. For example, consider this line of pyrope code.

var a:u32 = 0

Using the pyrope parser, we can get a syntax tree for this statement. The command tree-sitter parse file.prp gives us the following output.

(statement [1, 0] - [1, 13]
  (assignment_or_declaration_statement [1, 0] - [1, 13]
    decl: (var_or_let_or_reg [1, 0] - [1, 3])
    lvalue: (complex_identifier [1, 4] - [1, 5]
      (identifier [1, 4] - [1, 5]))
    type: (type_cast [1, 5] - [1, 9]
      type: (primitive_type [1, 6] - [1, 9]
        (sized_integer_type [1, 6] - [1, 9])))
    operator: (assignment_operator [1, 10] - [1, 11])
    rvalue: (constant [1, 12] - [1, 13])))

The nvim-treesitter syntax highlighting is based on this tree structure.

Queries

A query is an expression that selects nodes from the tree. For example,

(complex_identifier (identifier))

matches any identifier that is the child of a complex_identifier. Color schemes in neovim assign colors to different highlight groups. So, we can assign highlight groups to tree queries.

(constant) @number

Now, when a constant shows up in the syntax tree, it will highlight according to the @number group. Most of the work I did on this project involved studying the pyrope grammar, and writing queries based on it.

neovim

The text editor neovim is a popular choice among programmers. It allows advanced user control with configuration files. It also has an active community working on plugins to extend its functionality. Tools such as lazyvim allow for features like code completion and file management that give neovim the same functionality as IDEs. However, because neovim configuration is unique to each user, this may make it difficult to reproduce neovim instructions. For example, Professor Renau was going to test pyrope syntax highlighting in neovim. However, I did not know what configuration was necessary for him to see highlights in neovim. While I knew that syntax highlighting worked on my setup, I have lots of configuration files that may have contributed to that success. There is no guarantee that Professor Renau, or other potential users, have the same neovim configuration that I do.

NVIM_APPNAME

So, Professor Renau suggested I use the $NVIM_APPNAME variable to test the process on a fresh configuration. This feature allows the user to specify the configuration files used to launch neovim. For example, I installed lazyvim to the folder ~/.config/lazy. Then, I launched neovim with NVIM_APPNAME=lazy nvim. So instead of using my default configuration from ~/.config/nvim, the lazyvim configuration was used. This allowed me to use a neovim instance that was unaffected by my configuration files. I was able to preview the process of setting up syntax highlighting from the perspective of a lazyvim user. Similarly, the process can be done with an empty folder to mimic a brand new neovim installation. The point is, configuration files can impact reproducibility in neovim. However, this feature allows us to bypass our individual configurations and create reproducible guidelines.

Conclusion

In conclusion, most of my work involved writing queries for the pyrope tree-sitter grammar. This was for the purpose of syntax highlighting in neovim. However, an important part of any open source project is communicating the results and providing documentation. The NVIM_APPNAME feature helps view neovim from the perspective of different users, which helps for writing useful documentation.

Reproducible Evaluation of Multi-level Erasure Coding (Midterm)

Sat, 05 Aug 2023 00:00:00 +0000

Hi Everyone,

I hope everything goes well! This is my second blog post for my project Reproducible Evaluation of Multi-level Erasure Coding under the mentorship of John Bent, Anjus George, and Meng Wang. In summary, my project aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations. The details are in this proposal.

In the course of these few weeks, I’ve completed several tasks to achieve the aim of this project, including

Literature Review
Studying the Erasure Coding Simulator and Creating Reproducible Evaluations, with the following policies
- Clustered/Declustered Local-level SLEC
- Clustered/Declustered Network-level SLEC
- MLEC with C/C, C/D, D/C, D/D configuration

Literature Review

Prior to developing the simulator, my first step was to delve into the literature on distinct Erasure Coding policies. To understand a simulator for a complex erasure coding policy such as MLEC, I wanted to start from simpler EC policies and then extend my knowledge to more complex ones. Moreover, I also aimed to contrast the durability of MLEC with other comparable EC policies like LRC in my evaluations, making it vital to understand the implementation of these policies.

Over the first week, I read several papers on different chunk placement policies for erasure coding, including LRC (Local Reconstruction Codes), CL-LRC (Combined Locality for Local Reconstruction Codes), SODP (Single Overlap declustered parity), and MLEC (Multi-Level Erasure Coding). These papers offered a fundamental understanding of each policy, their respective advantages and drawbacks, and their practical usage in production environments.

Simulator Reproduction

After gaining some understanding from the papers I read, I started to study the EC simulator by building the simulator myself. I got the MLEC simulator from the mentors. However, the simulator lacks documentation and guides, making it hard for others to reproduce evaluation results. The simulator is also complicated to understand, as it simulates various EC schemes, chunk placements, and rebuild policies, amounting to 13,000 LOC. Therefore, my goal is to understand the design and implementation details of the simulator, after which I will create guides for reproducible evaluations.

The best way to fully understand the simulator was to rebuild it myself. The simulator is designed to mimic disk failures over the span of a year under varying chunk placement policies. Once successfully rebuilt, the simulator will enable me to assess the durability of MLEC in relation to other widely-used chunk placement policies. I followed the given simulator and rewrote it on my own in Python.

Based on the skeleton of the given simulator, I first rebuilt a simple simulator that simulates SLEC (single level erasure coding, in both local and network settings) with clustered parities. With the arguments given, the simulator can run an arbitrary number of iterations, each simulating disk failures in one year. The simulator then collects the iterations in which there is a data loss. The ratio of failed iterations to total executed iterations estimates the probability of data loss under the erasure coding policy, from which its durability follows. This simulation allows us to evaluate the durability of SLEC, laying foundations for later evaluation of MLEC.
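The iteration-counting idea can be sketched in a few lines of Python. This is a deliberately simplified toy model, not the mentors' simulator or my rebuilt one: it considers a single clustered stripe group, lets each disk fail at most once per year at a uniformly random time, and uses a fixed repair time. All function names and parameter values are hypothetical.

```python
import random

HOURS_PER_YEAR = 365 * 24


def iteration_loses_data(n_disks, parity, afr, repair_hours, rng):
    """Simulate one year for one clustered stripe group; True means data loss."""
    # Each disk independently fails with probability afr, at a random time.
    failure_times = sorted(
        rng.uniform(0, HOURS_PER_YEAR)
        for _ in range(n_disks)
        if rng.random() < afr
    )
    # Data is lost when parity + 1 failures fall within one repair window,
    # because the group then has more failed disks than it can tolerate.
    for i in range(len(failure_times) - parity):
        if failure_times[i + parity] - failure_times[i] < repair_hours:
            return True
    return False


def estimate_loss_fraction(iterations, n_disks, parity, afr, repair_hours, seed=0):
    """Fraction of simulated years in which the stripe group lost data."""
    rng = random.Random(seed)
    failed = sum(
        iteration_loses_data(n_disks, parity, afr, repair_hours, rng)
        for _ in range(iterations)
    )
    return failed / iterations


# Hypothetical configuration: 8+2 stripe group, 5% annual failure rate.
loss_fraction = estimate_loss_fraction(
    20_000, n_disks=10, parity=2, afr=0.05, repair_hours=72)
print(f"fraction of iterations with data loss: {loss_fraction:.5f}")
```

With realistic failure rates, data loss is so rare that a plain loop like this needs a very large number of iterations, which is one reason real simulators use more elaborate techniques.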

Next, I extended my simulator from local-level SLEC implementation by adding more policies. I began by introducing a network-level SLEC policy with clustered parities. This differs slightly from the local-level EC as it necessitates the consideration of factors like network bandwidth within the simulator.

In addition, I have delved deeper into simulating declustered parities and successfully discovered a method to simulate disk failures. Basically, the simulator generates failures within a one-year timeframe and subsequently repairs them using priority queues. The disks associated with stripes experiencing the most failures are given the highest repair priority. With this construction, the simulator is capable of simulating local-level declustered parities, with the ability to specify parameters.
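The repair ordering described above can be illustrated with Python's heapq module. This is only a sketch of the prioritization step, under the assumption that we already know how many chunks each stripe has lost; the stripe ids and the repair_order helper are hypothetical, and the sketch orders stripes rather than individual disks.

```python
import heapq


def repair_order(stripe_failures):
    """Return stripe ids in repair order, most failed chunks first.

    stripe_failures maps a (hypothetical) stripe id to its number of
    currently failed chunks; stripes without failures need no repair.
    """
    # heapq is a min-heap, so counts are negated to pop the largest first.
    heap = [(-count, stripe_id)
            for stripe_id, count in stripe_failures.items() if count > 0]
    heapq.heapify(heap)
    order = []
    while heap:
        _neg_count, stripe_id = heapq.heappop(heap)
        order.append(stripe_id)
    return order


order = repair_order({"s0": 1, "s1": 3, "s2": 0, "s3": 2})
print(order)
```

Stripes with the same number of failures are ordered by their id here; a real simulator would likely break ties by failure time.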

Upon successfully simulating local-level declustered parities, the construction of the simulator for network-level declustered parities was rather straightforward. I then validated it using the simulator and math models provided by the mentors. The results agree with each other, which supports the correctness of my understanding of the SLEC declustered placements. By implementing the simulator myself, I strengthened my understanding of erasure coding designs and the simulation techniques, which equipped me with a solid foundation to continue to reproduce MLEC simulations.

Based on the knowledge gained from implementing SLEC simulators myself, I then reverse-engineered the MLEC simulator provided by the mentors from their MLEC paper. I chose to start from the simplest policy, which is clustered parities in both levels. After spending considerable time digging into the simulator source code, I was able to understand the simulation workflows, the different repair methods that it implements, and the splitting method that it uses to simulate very high durability. I then revised my simulator based on my understanding. I also ran a few experiments using the same configuration setups as specified in the paper. The results agree well with those in the paper, which verified the success of my reproduction.

Technical Issues

In the process of building the MLEC simulator, I encountered many issues, both conceptual and technical. The mentors were very helpful and responsive throughout, so I was able to make steady progress.

Summary

Overall, I’ve rebuilt a Python simulator for various EC policies, and the simulator can successfully reproduce the results from the paper.

Next Steps

My next step is to package the simulator into a reproducible Trovi artifact, so that others can reproduce evaluations of the performance and durability of various EC policies, in particular MLEC.

Mid Term Blog : Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions

Fri, 04 Aug 2023 00:00:00 +0000

Hey,

I am Shekhar and I am one of several students who are working on developing materials for reproducibility in machine learning education, under the mentorship of Fraida Fund. My Proposal aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. Our goal is to help students and researchers (1) understand some of the challenges they may face when trying to reproduce someone else’s published result, and (2) in their own publications, to specify the methodology so that the result will be more easily reproduced by others.

Motivation

My work is inspired by my participation in the 2022 Machine Learning Reproducibility Challenge, where I was reproducing a result related to bias in hate speech classifiers. The paper seemed at first to have complete methodology details. However, when I tried to implement their approach based on the description of the paper, I realized some important details were missing - for example, in the part where they replaced swear words in the text with other words having similar meaning. I wasn’t able to identify the exact list of swear words they used, or what approach they followed if the selected replacement was also a swear word. The choices I made when the authors’ approach was left ambiguous had a significant impact on the magnitude of the final result.

Milestones and Accomplishments

To inform researchers and students about this problem, I created a fictitious machine learning research paper, and a sequence of accompanying Python notebooks to highlight various choices that can be made to fill in the gaps, and explore how these choices can impact the overall results of the research. Our “research paper” is about the impact of data augmentation on few-shot learning for intent classification. We implemented a basic data augmentation strategy with synonym replacement using the HWU64 dataset and a BERT classifier, and the results suggest that synonym replacement as a data augmentation technique leads to only minor improvement in accuracy. In the fictitious paper, we left some of the methodology details ambiguous. When reproducing the results using the accompanying notebooks, the reader follows a “Choose Your Own Adventure” format, selecting a path through a tree, where each node represents ambiguous methodology details and branches out to different choices that are made at that instance. The leaf nodes will represent the final results, providing insights into the magnitude of the differences resulting from each node selection. Some of the choices that the reader makes are -

what subset of the source dataset to use.
some of the details of data pre-processing.
some of the details of the synonym replacement data augmentation strategy.
some training hyperparameters and the details of the hyperparameter search.

During the first phase of our project, we have implemented an initial draft of these notebooks, to explore various scenarios and see their impact on results. Next, we will further develop the interactive educational material around them.

Challenges

During the first half of the project, I faced two main challenges. First, I had to come up with a hypothetical research scenario that was realistic, yet easy for students without much expertise to understand. Attaining the right balance was essential to make it engaging and educational. The second challenge was to deliberately leave some details unclear in a realistic way while ensuring that the choices based on that ambiguity had a significant impact on the results. Fortunately, I had the guidance and support of my mentor, which allowed me to successfully tackle these challenges.

Throughout this project, I faced various challenges and obstacles, but it turned out to be an incredible learning experience. I had the opportunity to dive deep into the domains of few-shot learning and meta-learning, which were entirely new to me. Moreover, I was able to find ambiguous methodologies present in academic papers and explore diverse scenarios related to them. Looking ahead, I am eager to continue working on this project throughout the summer, as it promises further learning and personal growth.

Midpoint Blog Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot

Thu, 03 Aug 2023 00:00:00 +0000

The last few months of my GSoC project have been very exciting and I hope to share why with you here in this blog post! To briefly summarize, my project has been focused on further developing the Polyglot app, a tool for visualizing 3D language embeddings. One important part of Polyglot is its utilization of the novel MCPM metric, where points are colored according to their MCPM similarity to a user-chosen “anchor point” (e.g., if “hat” is our anchor point, then similar words like “cap” or “fedora” will be colored more prominently).

The first issue we wanted to tackle was actually navigating the point cloud. With hundreds of thousands of points, it can be difficult to find what you’re looking for! Thus, the first few features added were a search bar for points and anchor points and a “jump to point” feature which changes a user’s center of rotation and “jumps” to a chosen point. There were a few hiccups with implementing these features, mainly due to the large number of points and the particular quirks of the graphics library Polyglot uses. In the end though, these simple features made it feel a lot easier to use Polyglot.

The next set of features related to our desire to actually annotate the point cloud. Similar to how one might annotate a Google doc (i.e., highlight a chunk of text and leave a comment), we wanted to set up something similar, but with points! This led to the development of a cool brush tool for coloring points, named and commented annotations (up to 5), a search bar within annotations, and finally a button to export annotations and comments to a CSV.

The next few weeks are looking bright as we strive to finish the PolyPhy-Polyglot pipeline (a notebook for quickly formatting MCPM data from PolyPhy and getting it into Polyglot). We also hope to add a unique “timeline” feature in which users can analyze sections of the point cloud based on the associated time of each point. Overall, it’s been a very stimulating summer and I’m excited to push this project even further!

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)

Thu, 03 Aug 2023 00:00:00 +0000

Introduction

As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, our goal is to characterize the tools in genomic workflows in terms of system metrics and data quality, and then to build machine learning models that predict the elapsed time of those workflows. While Shayantan (another contributor) analyzed the data quality metrics, I contributed to the system metrics analysis. We are getting closer to that goal: we have collected datasets and completed some initial analysis.

Steps

In this project, we selected the DNA-Seq Pipeline as the workflow to be analyzed. This pipeline consists of four tools for processing single-end reads, namely BWA-mem, Samtool-view, Picard-SortSam, and Picard-MarkDuplicates. We executed each tool under various configurations and stored system metrics for each execution. Doing this takes two steps:

Step 1: Building the tools execution environment.
Step 2: Developing a program to execute the tools under a set of configurations and automatically collect runtime parameters (e.g., CPU, RSS, VSZ, and IO).

Execution Environment

Tools are executed on Chameleon instances by submitting them through Slurm. The machine used to collect system metrics is a Haswell instance on the Chameleon Texas site. This instance uses an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz with the following specifications.

Number of CPUs	48
Number of threads per core	2
Number of cores per socket	12
Number of sockets	2

In this experiment, we use n+1 instances: n compute nodes and 1 master node. Each execution is done by submitting a job (a tool with a certain configuration) from the master node, and the job is processed by one of the compute nodes. For the tools to be executable on every node, we set up the master node as a common container shared over NFS. This common container stores the input files and the commands for executing the tools, so that all nodes can access them without having to download and install them separately.

Executing and Collecting System Metrics

Tools are executed in various configurations by varying parameters such as input size, number of allocated CPUs, allocated memory, and number of threads. For example, for BWA-mem we use 5 values for the CPU allocation, 4 values for the memory allocation, and 5 values for the number of threads, on 10 different input files, which gives 5 x 4 x 5 x 10 = 1000 configuration combinations. Each configuration is executed 8 times, so there are 8000 data points. Configuration details are given in the following table.

	#repetitions	#files	#allocated CPU	#allocated memory	#threads	total
BWA-mem	8	10	2, 4, 8, 16, 32	8, 16, 32, 64	2, 4, 8, 16, 32	8000
Samtool-view	10	10	2, 4, 8, 16, 32	8, 16, 32, 64	-	2000
Picard-Sortsam	10	10	2, 4, 8, 16, 32	8, 16, 32, 64	-	2000
Picard-MarkDuplicates	10	10	2, 4, 8, 16, 32	8, 16, 32, 64	-	2000
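The BWA-mem row of the table can be checked by enumerating the configuration grid. The sketch below is only an illustration: the parameter values come from the table, while the file names and the memory unit are placeholders that the post does not specify.

```python
from itertools import product

# Parameter values taken from the BWA-mem row of the table.
cpus = [2, 4, 8, 16, 32]
memory = [8, 16, 32, 64]                   # unit not stated in the post
threads = [2, 4, 8, 16, 32]
files = [f"file_{i}" for i in range(10)]   # placeholder names for the 10 inputs
repetitions = 8

# Every combination of file, CPU allocation, memory allocation, and threads.
configs = list(product(files, cpus, memory, threads))

# Each configuration is repeated, giving one planned run per data point.
runs = [(rep, *cfg) for rep in range(repetitions) for cfg in configs]

print(len(configs))  # 1000
print(len(runs))     # 8000
```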

Meanwhile, to run the tools, we use the following commands:

BWA-mem

$BWA mem -t $threads $REF_DIR/hg19.fa ${INPUT_DIR}/${sra_id}*.fastq > ${OUTPUT_DIR}/${sra_id}.sam

Samtool-view

$SAMTOOLS view $INPUT_DIR/${sra_id}.sam -Shb -o $OUTPUT_DIR/${sra_id}.bam

Picard-SortSam

java -jar $PICARD SortSam \
CREATE_INDEX=true \
INPUT=$INPUT_DIR/${sra_id}.bam \
OUTPUT=$OUTPUT_DIR/${sra_id}.bam \
SORT_ORDER=coordinate \
VALIDATION_STRINGENCY=STRICT

Picard-MarkDuplicates

java -jar $PICARD MarkDuplicates \
CREATE_INDEX=true \
INPUT=$INPUT_DIR/${sra_id}.bam \
OUTPUT=$OUTPUT_DIR/${sra_id}.bam \
METRICS_FILE=$OUTPUT_DIR/${sra_id}_rmd.txt \
VALIDATION_STRINGENCY=STRICT

In Slurm, each job has a job id, and the scontrol listpids command shows the mapping from job id to PID. Using this mapping, we can obtain system metrics for a job by reading the files under /proc/$PID. These files expose CPU usage, physical memory, virtual memory, read bytes, and write bytes at a particular moment. To collect the data, we record these features together with a timestamp at 1-second intervals throughout the execution.
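A minimal sketch of such a sampler is shown below. It is not the project's collector: the function names are hypothetical, it assumes a Linux /proc filesystem, and it covers only memory and I/O counters (CPU usage would additionally need the tick counters in /proc/$PID/stat). Reading /proc/$PID/io for another user's process may require extra permissions.

```python
import time
from pathlib import Path


def parse_key_value(text):
    """Parse 'Key: value' lines, as found in /proc/<pid>/status and /proc/<pid>/io."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def sample(pid):
    """Take one snapshot of memory and I/O counters for a process."""
    proc = Path("/proc") / str(pid)
    status = parse_key_value((proc / "status").read_text())
    io = parse_key_value((proc / "io").read_text())
    return {
        "timestamp": time.time(),
        "rss_kb": int(status["VmRSS"].split()[0]),    # physical memory
        "vsz_kb": int(status["VmSize"].split()[0]),   # virtual memory
        "read_bytes": int(io["read_bytes"]),
        "write_bytes": int(io["write_bytes"]),
    }


def monitor(pid, interval=1.0):
    """Yield one snapshot per interval until the process disappears."""
    while True:
        try:
            yield sample(pid)
        except (FileNotFoundError, ProcessLookupError, KeyError):
            return
        time.sleep(interval)
```

Iterating over `monitor(pid)` with a PID obtained from scontrol listpids would then produce one row per second for as long as the job runs.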

Results

We have also calculated the correlation of each feature with the elapsed time. For BWA-mem, the features whose absolute correlation exceeds 0.5 are input size, average CPU usage, and output file size (in SAM format). For Samtool-view, they are input size, average CPU usage, and BAM output size. For Picard-SortSam, they are input size, write bytes, and BAM output size. For Picard-MarkDuplicates, they are input size and BAM output size.

Features\Tools	BWA-mem	Samtool-view	Picard-SortSam	Picard-MarkDuplicates
Allocated CPU	-0.145	-0.095	-0.179	-0.156
Allocated physical memory	-0.010	-0.038	-0.069	0.132
Input size	0.583	0.651	0.937	0.922
Threads	-0.072	-	-	-
Average CPU	-0.607	-0.567	-0.479	-0.480
Peak CPU	-0.175	0.174	-0.170	0.046
Average RSS	0.040	0.034	0.131	0.182
Peak RSS	0.068	0.046	0.314	0.175
Average VSZ	0.032	-0.349	-0.127	0.090
Peak VSZ	0.048	0.074	-0.130	0.088
Write bytes	0.037	0.190	0.735	0.244
Read bytes	-0.031	0.109	0.070	0.110
Output SAM size	0.589	-	-	-
Output BAM size	-	0.763	0.934	0.923
Output BAI size	-	-	0.400	0.399
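The post does not say which correlation coefficient was used. Assuming Pearson correlation, the selection of features with an absolute score above 0.5 can be sketched as follows; the measurements here are made up for illustration and are not the project's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up measurements, for illustration only.
elapsed = [10.0, 20.0, 30.0, 40.0]
features = {
    "input_size": [1.0, 2.0, 3.0, 4.0],
    "allocated_cpu": [4.0, 2.0, 4.0, 2.0],
}

# Keep only the features whose absolute correlation with elapsed time exceeds 0.5.
strong = {}
for name, values in features.items():
    r = pearson(values, elapsed)
    if abs(r) > 0.5:
        strong[name] = round(r, 3)

print(strong)
```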

Future Works

For further work, we will analyze the correlation between elapsed time and the features whose scores are below an absolute value of 0.5. These features may in fact be correlated with the elapsed time but appear uncorrelated because the correlation is computed over the whole dataset. We therefore also need to calculate the feature correlations separately for the data grouped by input file. After that, we will create a machine learning model to predict elapsed time.
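The effect described above can be illustrated with a small made-up example: a feature that looks weakly correlated over the pooled data can be strongly correlated within each input file. The numbers below are invented and Pearson correlation is an assumption.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# (input file, threads, elapsed seconds): invented values, not measurements.
records = [
    ("small.fastq", 2, 20.0), ("small.fastq", 4, 10.0), ("small.fastq", 8, 5.0),
    ("large.fastq", 2, 200.0), ("large.fastq", 4, 100.0), ("large.fastq", 8, 50.0),
]

# Correlation over the pooled data: the file size dominates the elapsed time.
overall = pearson([r[1] for r in records], [r[2] for r in records])

# Correlation within each input file.
by_file = defaultdict(list)
for name, threads, elapsed in records:
    by_file[name].append((threads, elapsed))

per_file = {
    name: pearson([t for t, _ in rows], [e for _, e in rows])
    for name, rows in by_file.items()
}

print(round(overall, 2))
print({name: round(r, 2) for name, r in per_file.items()})
```

In this toy data the pooled score stays under the 0.5 threshold, while the per-file scores are strongly negative.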

[FLASHNET]: Leveraging ML-augmented I/O in Linux

Wed, 02 Aug 2023 00:00:00 +0000

Hello everyone,

This is my second blog post for SoR 2023. As you may recall from my initial blogpost, I am working on the Flashnet project under the mentorship of Haryadi S. Gunawi.

I’ve been assigned two major tasks under Flashnet:

Perform post-training quantization (PTQ) on existing Flashnet models
Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication

Task 1: Perform post-training quantization (PTQ) on existing Flashnet models

Since all of our models are currently built using the keras API, I decided to use the tensorflow-lite library, which supports direct conversion. Unfortunately, I encountered several persistent bugs while attempting to apply full-integer quantization on our binary neural network model:

Shape/dimension distortion:

Bug description: The quantized tflite model produces outputs of shape (8, 1), the same as the input shape, when the original model produces single-value outputs of shape (1, 1).

Status: Resolved

The original model has an input dimension of 8 for each input/x-value and there could be several inputs grouped in a single batch.
Input/batch size is also determined implicitly in the normalization layer of the original model
However, the “interpreter” in the quantized model runs inference one by one, and so batch size needs to be explicitly set to “1” i.e. a shape of single input, (1,8)
Doing so resolves the model distortion

Incorrect y-value range:

Bug description: There is no variation in the quantized model outputs (i.e. it spits out the same value for each input row).

In the original model, each inference output is a floating point value between 0 and 1. Outputs also vary according to input. This output is rounded towards 0 or 1 using a 0.5 standard cutoff (i.e. x > 0.5 → x = 1). Since the quantized model condenses 32-bit floats into 8-bit integers, we should expect a similar variation in output values across an 8-bit integer range.
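To make that expectation concrete, here is a small sketch of affine int8 quantization, the scheme commonly used for full-integer models, where a real value is approximated as scale * (q - zero_point). The scale and zero point below are chosen by hand to cover the range [0, 1]; they are not values taken from the Flashnet model.

```python
def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 value using an affine scale and zero point."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float value from its int8 representation."""
    return scale * (q - zero_point)


# Hand-picked parameters that spread [0.0, 1.0] over the int8 range [-128, 127].
scale = 1.0 / 255
zero_point = -128

outputs = [0.0, 0.2, 0.6, 1.0]          # example float outputs of the original model
quantized = [quantize(x, scale, zero_point) for x in outputs]
print(quantized)                         # [-128, -77, 25, 127]

# The 0.5 cutoff of the float model becomes a cutoff near 0 in the int8 domain.
cutoff_q = quantize(0.5, scale, zero_point)
labels = [1 if q > cutoff_q else 0 for q in quantized]
print(labels)                            # [0, 0, 1, 1]
```

Different float outputs land on different int8 values, which is the variation a healthy quantized model should show.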

After printing the quantized model weights, I discovered that something like weight burst/exploding gradients may be occurring during the quantization process, i.e. the weight values are exploding to infinity or vanishing to 0 and are therefore unable to deliver any meaningful value. The likely consequence is that the inference output always equals the bias matrix (since the Wx term in y = Wx + B gets zeroed out).

Status: Open

Multiple potential causes were considered, without any success:
- Improper quantization of inputs/outputs
- Insufficient training time/number of epochs
- Incompatible model type/structure
- Incompatible tensorflow-lite version
At this point, I concluded that tensorflow-lite is too bug-ridden for any further attempts with the library to be worthwhile.

Task 2: Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication

rocksdb is an embedded database for key-value data. Our Flashnet team is currently implementing a Flashnet client in ceph, and so they have tasked me to explore an implementation in rocksdb as an alternative.

I’ve started on this segment of the project only recently, so my current work is still in its formative stages. As of writing, I’ve been primarily concerned with setup of software (on a new chameleon instance), running toy db examples, and educating myself on basic terminology/rocksdb documentation.

Future work

I expect to continue working on Task 1 (do quantization from ground-up or use a different library) and Task 2 as detailed above. I also hope to implement a transformer-based model to supplement our existing suite of Flashnet models.

[Midterm] FlashNet: Towards Reproducible Continual Learning for Storage System

Wed, 02 Aug 2023 00:00:00 +0000

Mid-Term Report

As part of the FlashNet project, my proposal, under the mentorship of Haryadi S. Gunawi and Daniar Kurniawan, aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques. We focus on predicting I/O latency to decide whether an I/O should be failed over to another SSD. The following sections describe the work, the major milestones achieved, the accomplishments, and the challenges during the first half of the summer.

Work Description, Major Milestones Achieved, and Accomplishments

For the first half of the summer, I implemented the continual learning pipeline for the model and several drift detection algorithms. After that, I evaluated their effectiveness. Below is a detailed description of each subtask.

1. Continual Learning pipeline

Firstly, I designed the pipeline. As shown on the graph below, the pipeline contains 4 main modules, namely initial train, retrain, inference, and monitor.

The modules were first developed in Python using a linear regression model. It turned out that linear regression was not good enough and gave poor accuracy. To overcome this problem, I introduced more models and learning tasks.

Hence, in the final implementation, we have random forest and neural network models for both regression and classification tasks. These models outperform linear regression. The pipeline has also been optimized.

2. Drift detection algorithms

Sometimes, the trained model’s performance may degrade when it faces recent I/Os whose characteristics differ from the data it was trained on. Hence, there should be a retraining process, and something has to trigger it. The trigger could be as simple as a periodic schedule, or it could use a technique called drift detection. Retraining too often causes a large computation overhead, while retraining too seldom causes performance degradation. Hence, we should build a good and reliable drift detection algorithm that can sense the presence of concept and covariate drift in recent data.

To build a good algorithm, I first used heuristics derived from an understanding of how latency and throughput change over time. However, the results were not very good. Thus, I have been relying on statistical tests as the drift detector. So far, the Kolmogorov-Smirnov test, commonly known as the KS test, is the best drift detector.
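As a rough illustration of how a two-sample KS test can act as a drift detector, the sketch below compares a reference window with a recent window of some feature (for example, I/O latency). It is not the FlashNet implementation: the function names, the choice of feature, and the 0.05 significance level are assumptions. In practice a library routine such as `scipy.stats.ks_2samp` could be used instead.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def drift_detected(reference, recent, c_alpha=1.358):
    """Flag drift when the statistic exceeds the large-sample critical value.

    c_alpha = 1.358 corresponds to a significance level of about 0.05.
    """
    n, m = len(reference), len(recent)
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, recent) > threshold


reference = [i / 100 for i in range(100)]   # window the model was trained on
shifted = [x + 10 for x in reference]       # recent window with a different distribution

print(drift_detected(reference, reference))  # False
print(drift_detected(reference, shifted))    # True
```

A retraining trigger would call `drift_detected` on each new window and retrain only when it returns True.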

3. Evaluation

The featured image in the headline of this blog, also shown below, is the result of the evaluation. I evaluated the models and drift detection algorithms using Cumulative Distribution Function (CDF) graph, to see if any tail cut is made.

Challenges

During the implementation, I encountered several challenges as follows,

1. Choice of Model

Since we want to integrate the pipeline into real storage systems, we had to be mindful of model choice. Classical machine learning models are lighter than deep learning models. However, deep learning models offer higher accuracy, which makes them preferable. Hence, I implemented both and examined their effectiveness.

2. Choice of Drift Detection Algorithm

The continual learning technique chosen for this task may require the model to be retrained, since the workload may change over time. This implies that we need a condition that triggers the retraining. As training a model is costly, we need to retrain judiciously. Thus, we use a drift detection algorithm to decide whether retraining is needed.

There are two types of drift detection algorithms, namely statistical tests and model-based drift detection. To minimize overhead, we picked statistical tests. There are various algorithms to choose from; I picked 5 of them to implement and evaluate.

Plan

For the second half of the summer, I am going to study Riak and create a Chameleon Trovi artifact for deploying Riak in a cluster.

Introducing Levels of Reproduction and Replication in Machine Learning

Wed, 02 Aug 2023 00:00:00 +0000

Hello again,

I am Mohamed Saeed and this is my second blog post for the 2023 Summer of Reproducibility Fellowship. As you may recall from my previous post, I am working on the Using Reproducibility in Machine Learning Education project with Fraida Fund as my mentor. My goal is to create interactive open educational resources that teach reproducibility and reproducible research in machine learning (ML) as I proposed.

In this post, I will share with you some of the progress I have made so far, as well as some of the challenges I have faced and how I overcame them. I will also highlight some of the specific accomplishments that I am proud of and what I plan to do next.

Reproducing “On Warm Starting Neural Network Training”

This material is a reproduction of the paper “On Warm Starting Neural Network Training” by Jordan T. Ash and Ryan P. Adams (2020). This paper investigates the effect of warm-starting neural networks, which means using the weights of previous models trained on a subset of the data, to train on a new dataset that has more data.

The figure illustrates how the new model uses the weights from the previous model as its initial values. This allows the new model to train on both the “Original” data, which it has already seen, and the new data, which it has not encountered before. In contrast, the randomly initialized model treats the entire data as unfamiliar and starts from scratch.

The paper also shows that this method can lead to lower test accuracy than starting from scratch with random weights, even though the training loss is similar. The paper also proposes a simple way to improve the test accuracy of warm-starting by adding some noise to the previous weights.
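As I understand the paper, the proposed fix combines shrinking the previous weights toward zero with adding a small amount of random noise before training continues. The sketch below illustrates that idea on a plain list of weights; the function name and the default values for the shrink factor and noise scale are placeholders, not the settings used in the paper or in my notebooks.

```python
import random


def shrink_and_perturb(weights, shrink=0.5, noise_std=0.01, rng=None):
    """Scale previous weights toward zero and add Gaussian noise to each one."""
    rng = rng or random.Random()
    return [shrink * w + rng.gauss(0.0, noise_std) for w in weights]


previous_weights = [2.0, -4.0, 0.5]   # weights of the model trained on the subset
warm_start = shrink_and_perturb(previous_weights, rng=random.Random(0))
print(warm_start)
```

With `shrink=1.0` and `noise_std=0.0` this reduces to a plain warm start, while `shrink=0.0` discards the previous weights entirely.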

To reproduce this paper, I followed a systematic approach that ensured reliable results. This approach involved:

Reading the paper and its main claims carefully.
Finding out what resources the authors shared, such as code, data, and models.
Looking for additional materials online that could help me save time and fill in the gaps left by the authors.
Setting up the environment and dependencies needed to run the code smoothly.
Writing code and updating any outdated functions that might cause errors.
Running the code and verifying that it matched the results reported in the paper.
Analyzing and interpreting the results and comparing them with the paper’s findings.

I used Chameleon as my platform for running and documenting my reproduction experiments. Chameleon is a large-scale, reconfigurable experimental platform that supports computer science systems research. It allows users to create and share Jupyter notebooks that can run Python code on Chameleon’s cloud servers.

I created a GitHub repository where you can find everything related to my reproduction work, in the form of interactive Jupyter notebooks that will help you learn more about machine learning and the reproducibility of machine learning research.

Challenges

Reproducing a paper is not an easy task. I faced several challenges along the way. One of the biggest was the lack of code and pretrained models from the authors, which is a common problem for many reproducibility projects. Fortunately, I found a previous reproducibility publication for this paper in the ReScience journal. I used some of their code and added some new functions and modifications to match the original paper’s descriptions. I also encountered other challenges, which I discuss in the notebooks along with the solutions that I applied.

How to use this material?

This material is a series of notebooks that walk you through the paper and its claims, experiments, and results. You will learn how to analyze, explain, and validate the authors’ claims. To get started, I suggest you skim the original paper briefly to get the main idea and the public information. This will help you understand how the authors could have been more clear and transparent in some sections. I have given clear instructions and explanations in the notebooks, as well as how I dealt with the missing components. You can use this material for self-learning or as an assignment by hiding the final explanation notebook.

Conclusion and Future Work

In this blog post, I have shared with you some of my work on reproducing warm starting neural network training. I have learned a lot from this experience and gained a deeper understanding of reproducibility and reproducible research principles in ML.

I am very happy with what I have achieved so far, but I still have more work to do. I am working on reproducing the Vision Transformer: An Image is Worth 16x16 Words paper by Alexey Dosovitskiy et al. This time my approach is to use the available pretrained models provided by the authors to verify the claims made in the paper. However, there are some challenges that I face in reproducing the paper. For example, some of the datasets and code that the authors used are not publicly available, which makes it hard to replicate their experiments exactly. These challenges are common in reproducing research papers, especially in computer vision. Therefore, it is important to learn how to deal with them and find ways to validate some of the claims.

I hope you enjoyed reading this blog post and found it informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!

Midterm Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings

Wed, 02 Aug 2023 00:00:00 +0000

As part of the Measuring Research Prototypes under Unreported Settings project, my proposal, under the mentorship of Yang Wang and Miao YU, aims to understand the impact of missing settings in artifact evaluation.

Based on our project proposal, the first step is to test the benchmark application on the targeted systems. We picked the open-source database system PostgreSQL as the target system and ran the TPC-C benchmark on it under default settings. We measure the throughput of the benchmark by setting scalefactor to 10 and incrementing the number of worker terminals. All database server settings are left at their default values. We take these results as the baseline. To test more parameters and system settings, we need to choose a combination of parameters that gives optimal throughput.

We use an online tool, PGTune, which aims to tune the PostgreSQL configuration according to the hardware. We selected shared_buffers, min/max_wal_size, and effective_cache_size as the first set of parameters to measure. They are related to memory consumption, checkpoints, and planner cost in the database server. According to the official PostgreSQL documentation, shared_buffers sets the amount of memory the database server uses for shared memory buffers. The max_wal_size parameter sets the maximum size to let the WAL grow during automatic checkpoints. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time. The effective_cache_size parameter sets the planner’s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely that index scans will be used, and a lower value makes it more likely that sequential scans will be used.

We conduct the experiments by increasing the parameters step by step and comparing the throughput of each setting with the others and with the baseline. Based on the results, the throughput of the benchmark with larger shared_buffers and max_wal_size is up to 1.5X that of the default settings. The improvement from tuning max_wal_size is larger than that from tuning shared_buffers. Increasing effective_cache_size has no effect on this benchmark workload compared to the system default.

There are more values of the above-mentioned parameters to test. Next, I will test those parameters with further increments of their values. Furthermore, we need to choose a combination of more parameters to get optimal throughput. Also, according to its description, the tuning tool may not generate optimal values for systems with very large memory. This requires us to test more parameters and values for better performance.

Midterm: High Fidelity UAV Simulation Using Unreal Engine with specular reflections

Wed, 02 Aug 2023 00:00:00 +0000

As part of the Open Source Autonomous Vehicle Controller project, my proposal, under the mentorship of Aaron Hunter and Carlos Espinosa, aims to develop an Unreal Engine based simulator for testing. The simulator will use Unreal Engine for the physics and visualization.

What we have done so far

We found that we can use Unreal Engine as a physics simulator and co-simulate with Simulink using the tools provided by MathWorks.
I simulated an example provided by MathWorks, but I wasn’t getting the expected behaviour and there were very few resources available.
So we decided to use Gazebo and ROS for simulation instead of Unreal Engine and Simulink, using the example of a balancing bot that had been designed in Solidworks.
To use Gazebo, I converted the Solidworks model into a URDF and imported it into Gazebo.

Future Work

Currently, I am working on using Gazebo and ROS to control a balancing bot with a PID control algorithm. Afterwards, I will document the process of importing a model into Gazebo for testing a control algorithm.
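For readers unfamiliar with PID control, here is a minimal, generic sketch of the algorithm. It is not the controller used on the balancing bot: the gains are arbitrary, and the "plant" is a toy model in which the command directly sets the rate of change of the tilt, which is far simpler than a real inverted pendulum simulated in Gazebo.

```python
class PID:
    """Textbook PID controller with proportional, integral, and derivative terms."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt
        # No derivative on the first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid = PID(kp=2.0, ki=0.5, kd=0.1)   # arbitrary gains for the toy plant
tilt = 0.2                          # initial tilt, arbitrary units
dt = 0.01                           # time step in seconds

for _ in range(2000):
    command = pid.update(setpoint=0.0, measurement=tilt, dt=dt)
    tilt += command * dt            # toy plant: command sets the tilt rate

print(abs(tilt) < 0.01)
```

In a ROS setup, the measurement would come from a sensor topic and the command would be published to the motor controller on each control cycle.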

ScaleBugs: Reproducible Scalability Bugs

Wed, 02 Aug 2023 00:00:00 +0000

Introduction

As part of the ScaleBugs project, we have worked on building a dataset of reproducible scalability bugs. To achieve this, we go through existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. Workloads are designed to reproduce these scalability bugs by triggering certain functionalities of the system under different configurations (e.g., different numbers of nodes), and we then observe the impact on performance.

So far we have worked on packaging the buggy and fixed versions of scalability systems, a runtime environment that ensures reproducibility, and the workloads used to trigger the symptoms of the bug inside docker containers. By packaging these versions together, we are simplifying the process of deployment and testing. This enables us to switch between different versions efficiently, aiding in the identification and comparison of the bug’s behavior. For each scalability system, we have carefully built a runtime environment that is consistent and reproducible. This approach ensures that each time we run tests or investigations, the conditions remain identical.

New Terms

In order to make sense of the various bug reports, we had to learn some terminologies associated with scalability systems:

Clusters: Clusters are groups of related or connected items, often found in various fields such as computer science, data analysis, or even social sciences. For example, in data analysis, clusters might represent groups of data points with similar characteristics, making it easier to understand patterns or trends in the data.

Cluster Membership: Cluster membership refers to the process of determining which items or entities belong to a particular cluster. This task can be done based on various criteria, such as similarity in attributes, spatial proximity, or shared characteristics.

Locks: In computer programming, locks are mechanisms used to manage access to shared resources, such as files, data structures, or hardware devices. When multiple processes or threads need to access a shared resource simultaneously, locks ensure that only one process or thread can access it at a time, preventing data corruption or conflicts.

Lock Contentions: Lock contention occurs when multiple processes or threads attempt to acquire the same lock simultaneously. When this happens, one process or thread must wait until the lock becomes available, leading to potential delays and reduced performance.
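A small Python sketch of the two terms above: four threads update a shared counter, and a lock makes sure only one of them modifies it at a time. While one thread holds the lock, the others wait, which is exactly the contention described. The example is generic and is not taken from any of the systems we studied.

```python
import threading

counter = 0
lock = threading.Lock()


def add(n):
    """Increment the shared counter n times, taking the lock for each update."""
    global counter
    for _ in range(n):
        with lock:          # only one thread at a time may run this block
            counter += 1


threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

The longer the work done while holding the lock, the longer the other threads wait, which is why expensive operations inside a critical section show up as scalability bugs.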

Critical Paths: In project management or process analysis, a critical path is the longest chain of dependent tasks that determines the overall duration of the project or process. Any delay in tasks along the critical path will directly impact the project’s completion time.

Tokens: Tokens can have various meanings depending on the context. In computer programming, tokens are the smallest units of source code recognized by a compiler or interpreter. In cryptography, tokens can represent digital certificates or authentication data used for secure communication.

Nodes: In the context of network theory or graph theory, nodes are individual points or entities that form a network or graph. In a computer network, nodes can be devices like computers or routers, and in a social network, nodes can represent individuals or entities.

Peers: Peers are entities within a network that have the same status or capabilities. In peer-to-peer networks, each node can act as both a client and a server, enabling direct communication between nodes without relying on a central server.

Gossipers, Gossip Protocol: In distributed systems, gossipers are nodes that share information with each other using the gossip protocol. The gossip protocol involves randomly selecting peers and exchanging information in a decentralized manner, allowing information to spread quickly across the network.
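To get a feel for how quickly gossip spreads, here is a toy simulation: in every round, each node that already has the information sends it to randomly chosen peers. This is a simplified model with hypothetical parameters (a node may even pick itself), not the protocol implemented by Cassandra or any other system.

```python
import random


def gossip_rounds(num_nodes, fanout=1, seed=0):
    """Count the rounds until every node has received the information."""
    rng = random.Random(seed)
    informed = {0}                      # node 0 starts with the information
    rounds = 0
    while len(informed) < num_nodes:
        # Nodes informed at the start of the round each contact `fanout` peers.
        for _ in list(informed):
            informed.update(rng.sample(range(num_nodes), fanout))
        rounds += 1
    return rounds


print(gossip_rounds(300))
```

With a fanout of 1 the informed set can at most double per round, so a 300-node cluster needs at least 9 rounds; the random choice of peers usually adds a few more.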

Threads: Threads are the smallest units of execution within a process in computer programming. Multiple threads can run concurrently within a single process, enabling multitasking and parallel processing. Threads can share the same resources within the process, making them more lightweight than separate processes. However, proper synchronization is essential to prevent data corruption or conflicts when multiple threads access shared resources.

Flush and Writes Contention: This refers to a situation where simultaneous operations involving data flushing (saving data to a storage medium) and data writing (updating or adding data) are causing conflicts or delays. This contention can arise when multiple processes or threads attempt to perform these operations concurrently, leading to performance bottlenecks or potential data integrity issues.
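To make the gossip protocol entry above more concrete, here is a minimal Python sketch of push-style gossip. It assumes each informed node contacts a fixed number of randomly chosen peers per round. The function name, the fanout parameter, and the fixed seed are illustrative assumptions and are not taken from Cassandra, Ignite, or any other system discussed in this post.

```python
import random


def gossip_rounds(n_nodes, fanout=1, seed=0):
    """Count the rounds until every node has heard a rumor started by node 0."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    informed = {0}
    rounds = 0
    while len(informed) < n_nodes:
        # Every informed node pushes the rumor to `fanout` random peers.
        for _node in list(informed):
            informed.update(rng.sample(range(n_nodes), fanout))
        rounds += 1
    return rounds


print(gossip_rounds(64))
```

Because the informed set can at most double per round with a fanout of 1, the rumor needs at least log2(n) rounds to reach n nodes, which is why gossip is usually described as spreading in a logarithmic number of rounds.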

Accomplishments

We have been able to build Docker containers for the following scalability bugs:

IGNITE 12087

This bug stems from the fix for IGNITE-5227 (another bug), which caused a significant performance regression in one operation. Before IGNITE-5227 was addressed, inserting 30,000 entries completed in roughly 1 second. After the fix, the same insertion of 30,000 entries took approximately 130 seconds, a slowdown of roughly 130 times.

CASSANDRA 14660

This bug is related to how clusters work together and how a lock is causing conflicts with the critical path. The issue arises from a method call that uses O(Peers * Tokens) resources while contending for a lock, which is causing problems in the write path. The lock is used to protect cached tokens that are essential for determining the correct replicas. The lock is implemented as a synchronized block in the TokenMetadata class.

How was this fixed?

It was fixed by reducing the complexity of the operation to O(Peers) taking advantage of some properties of the token list and the data structure.

CASSANDRA 12281

This bug is also related to how clusters work together and a lock conflict. The issue arises when a specific method performs a large amount of work (O(Tokens^2)) while contending for a read lock. As reported, a cluster with around 300 nodes has around 300 * 256 tokens (assuming the default number of tokens per node), and joining a new member reportedly takes more than 30 minutes. Because the method runs for so long while holding the lock, every gossip message is delayed, so the node never becomes active.
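The scale of the problem is easier to see with the numbers worked out. The short calculation below uses only the figures from the report (300 nodes, 256 tokens per node); the variable names are illustrative.

```python
nodes = 300
tokens_per_node = 256  # default token count assumed in the bug report

tokens = nodes * tokens_per_node
quadratic_steps = tokens ** 2  # order of magnitude of an O(Tokens^2) method

print(f"tokens in the cluster: {tokens:,}")
print(f"O(Tokens^2) steps:     {quadratic_steps:,}")
```

With 76,800 tokens, a quadratic method is in the range of billions of elementary steps per call, all of it executed while other threads wait for the lock.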

How was this fixed?

The granularity of the lock is decreased, meaning that the expensive function calls now do not take the problematic read lock and simply use a synchronized block, synchronizing on a specific field, that does the job much better.
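The general idea of narrowing a lock's scope can be sketched in Python as follows. This is not the Cassandra patch: the class, its two locks, and the use of sorting as the stand-in for the expensive call are all assumptions made for illustration. The point is only that the slow rebuild is serialized by its own lock, while the lock used by the hot path is held just long enough to copy or swap data.

```python
import threading


class TokenCache:
    """Hypothetical cache whose slow rebuild does not block fast lookups."""

    def __init__(self, tokens):
        self._tokens = list(tokens)
        self._sorted = sorted(self._tokens)
        self._hot_lock = threading.Lock()      # guards the fast read path
        self._rebuild_lock = threading.Lock()  # serializes slow rebuilds only

    def lookup(self, index):
        # Fast path: the lock is held only for a single list access.
        with self._hot_lock:
            return self._sorted[index]

    def add_tokens(self, new_tokens):
        with self._rebuild_lock:
            with self._hot_lock:
                snapshot = list(self._tokens)
            # Expensive work happens without holding the hot-path lock.
            merged = snapshot + list(new_tokens)
            rebuilt = sorted(merged)
            with self._hot_lock:
                self._tokens = merged
                self._sorted = rebuilt


cache = TokenCache([30, 10])
cache.add_tokens([20])
print(cache.lookup(1))
```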

HA16850

This is a bug related to obtaining thread information in the JvmMetrics package. The original buggy version used MXBeans to obtain thread information. The call uses an underlying native implementation that holds a lock on threads, preventing thread termination or creation. This means that the more threads we have to obtain information for, the longer the function call holds the lock. As a result, the execution time scales with the number of active threads, O(threads).

How was this fixed?

Developers utilized a ThreadGroup to keep track of obtaining metrics for threads. The result is that there is no lock held for every thread.

CA13923

This issue revolves around conflicts between the “flush” and “writes” processes. The main problem is that during the “flush” process, a resource-intensive function called “getAddressRanges” is invoked. This function has a high computational cost and its complexity is O(Tokens^2). In other words, the time it takes to complete this function grows quickly as the number of “tokens” increases. This situation is causing challenges and delays in the overall process.

How was this fixed?

This function call affected many code paths, and the fix made sure that getAddressRanges is no longer called in critical paths.

Challenges

Demanding Memory Requirements: Running certain builds consumes a significant amount of memory. This places a strain on system resources and can impact the overall performance and stability of the process.

Little Issues Impacting Execution: Often, seemingly minor details can obstruct the successful execution of a build. Resolving such issues requires thorough investigation and extensive research into similar problems faced by others in the past.

Complexities of Scalability Bugs: Identifying the underlying causes of scalability-related bugs is intricate. These bugs exhibit unique characteristics that can complicate the process of pinpointing and comprehending their root origins.

What is Docker? (For those who don’t know about it)

Docker is a platform that facilitates the containerization of applications, leading to consistent and efficient deployment across diverse environments. Its benefits include portability, resource efficiency, isolation, and rapid development cycles. DockerHub complements Docker by providing a centralized hub for sharing and accessing container images, fostering collaboration and ease of use within the Docker ecosystem.

More about Docker: https://docs.docker.com/get-started/overview/

Mid-term blog post for Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery

Tue, 01 Aug 2023 00:00:00 +0000

Hello!

I am Srishti Jaiswal and this is my second blog post for the 2023 Summer of Reproducibility Fellowship.

Introduction

As I reach the halfway mark of my internship journey, I have had the incredible opportunity to work on a project that revolves around reproducing an adaptive video research result using cloud-based experimentation. This blog post delves into my exciting work so far, the significant milestones achieved, specific accomplishments to celebrate, and the challenges overcome. Utilizing CloudLab and FABRIC, I embarked on a journey to reproduce essential figures from the research paper Downton Abbey Without the Hiccups: Buffer-Based Rate Adaptation for HTTP Video Streaming, ensure Python2 and Python3 compatibility and incorporate an Estimated Download Rate column in the log file produced by the video client. Let’s explore the details of this captivating internship experience.

Major Milestones Reached

Here are the milestones we have reached so far:

Familiarity with CloudLab and FABRIC Testbeds: I learned how to run an adaptive video experiment, which is the jumping-off point for my project, on the CloudLab and FABRIC platforms.
Python2 and Python3 Compatibility: My first task was to port an existing open-source code base developed for Python2 (which is no longer supported) so that it can run in Python3. The code now runs successfully in both versions for all the policies in the existing open-source code base, i.e. Basic, Netflix and Sara. Fixed issue #1.
Estimated Download Rate for Basic Policy: To make it easier for users to understand and visualize how the adaptive video policy works, I added an additional metric, “Estimated Download Rate”, to the output file produced by the adaptive video client.
Graphing Buffer Occupancy and Estimated Download Rate: I extended the existing experiment to show two additional visualizations that are important for understanding how the adaptive video client works: buffer occupancy vs time and estimated download rate vs time.
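As a rough illustration of the "Estimated Download Rate" metric, the sketch below computes throughput from the size of a downloaded segment and the time it took. This is an assumption about how such a value could be derived, not the formula used in the actual video client, and the function and parameter names are hypothetical.

```python
def estimated_download_rate(segment_bytes, download_seconds):
    """Return the throughput of one segment download in bits per second."""
    if download_seconds <= 0:
        raise ValueError("download_seconds must be positive")
    return segment_bytes * 8 / download_seconds


# Example: a 1 MB segment fetched in 2 seconds.
print(estimated_download_rate(1_000_000, 2.0))
```

A value like this, logged once per segment, is what would then be plotted against time alongside buffer occupancy.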

Overcoming Challenges

I encountered several challenges throughout this project, especially as it was my first time working independently on a research paper as a third-year engineering student. However, with my mentor’s guidance and support, I persevered and learned to tackle each obstacle with determination.

One significant challenge was porting the entire code from Python2 to Python3. This transition resulted in numerous errors, and I often found it challenging to pinpoint where the mistakes occurred. To overcome this, I adopted a step-by-step approach, fixing errors one by one and verifying them using Python2 for comparison.
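For readers facing a similar port, the sketch below shows two patterns that commonly differ between Python2 and Python3: division semantics and bytes versus text. The functions are hypothetical examples written to behave the same on both versions; they are not taken from the project's code base.

```python
def average_bitrate(bitrates):
    # In Python2, int / int truncates; converting the divisor to float
    # gives true division on both versions.
    return sum(bitrates) / float(len(bitrates))


def to_text(payload):
    # Data read from a socket is bytes in Python3; decode it before
    # applying string operations.
    if isinstance(payload, bytes):
        return payload.decode("utf-8")
    return payload


# print() with a single argument behaves the same on both versions.
print(average_bitrate([1, 2]))
print(to_text(b"segment-1"))
```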

Understanding the complex codebase was another hurdle that led to moments of feeling stuck in an infinite loop. But every time I faced such situations, I sought my mentor’s advice, and together, we made strategic changes to overcome these challenges.

I am immensely grateful for my mentor’s expertise and support throughout this internship. Her guidance played a crucial role in helping me navigate through the challenges and grow both professionally and personally. I eagerly look forward to the rest of the journey, knowing that I can continue making meaningful contributions to this research project with her inspiring mentorship.

Future Prospects

As the second half of my internship approaches, I am eager to further refine and expand our experimentation. Our main aim is to reproduce the existing work and provide a clear guide for other students to do the same. For this, I have to create a framework that helps them improve and build upon this work.

I hope you enjoyed reading this blog post. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and stay tuned for more updates!

Midterm: Open Source Autonomous Vehicle Controller

Tue, 01 Aug 2023 00:00:00 +0000

As part of the Open Source Autonomous Vehicle Controller Project, my proposal, under the mentorship of Aaron Hunter and Carlos Espinosa, aimed to create comprehensive technical documentation to help onboard new users of the OSAVC controller.

I have accomplished the following:

From the KiCad Schematic Editor, created pinouts of the I/O connectors on the OSAVC.
Detailed a hardware overview of the OSAVC by labeling and describing each electrical component.
Documented the setup for loading code on the OSAVC, including software such as Git, MPLAB X, XC32 Compiler, and serial terminal and hardware by showing how to connect the PICKit3 and OSAVC to a PC.
Tested the OSAVC by receiving and transmitting characters in the serial port into a buffer.
Fixed bugs/errors in the NEO_M8N GPS module library and PWM motors library.
Created a new library for the uni and bidirectional ESC brushless motors.
Created a user-interfaced test harness for all peripherals: serial, IMU, GPS, encoder, PWM actuators, radio telemetry, Mavlink heartbeat, radio controller, and LIDAR.
Incorporated new user interface element and fixed video streaming errors in the Flask app running on the Raspberry Pi 4 communicating with the OSAVC.
Documented both software and hardware steps to run the OSAVC with a companion computer such as a Raspberry Pi 4.
Highlighted common problems encountered with the OSAVC.
Created a contributor’s guide for others to create new libraries or contribute to the OSAVC project.
Designed a switching voltage regulator in SOLIDWORKS.
Designed a self-balancing bot that employs the OSAVC in SOLIDWORKS.

Future Work

Currently, the laser cutter at UCSC is under maintenance, so we couldn’t assemble the self-balancing bot yet. Once we assemble it, I will finish and document the control algorithms. We can also try incorporating ML models on the Raspberry Pi with the Coral USB accelerator on the self-balancing bot.

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)

Tue, 01 Aug 2023 00:00:00 +0000

We are currently midway into the OSRE 2023 program and the following post lists the progress that I have made on the project so far. As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, our overall goal is to quantify the effect of sequence data quality on execution times. Towards that end, we decided to first identify suitable datasets from the two commonly available -omics data modalities: transcriptomics and genomics. Albrecht et al. [1] developed seqQscorer to automate the quality control step of NGS data analysis through predictive modeling. They have also published the list of ENCODE datasets used for training the models. The quality label is 0 for released files and 1 for revoked files. Based on the guidelines set forth by ENCODE’s Data Coordination Centre (DCC), scientists carried out a comprehensive manual annotation of the data, and the resulting quality variable “status” was published to serve as an indication of the quality of the data. The following steps outline the process of generating the data table for building the machine learning models.

Step 1: Programmatically accessed 86 (34 released; 34 revoked) RNA-seq files from the ENCODE database. All the fastq files were single-ended.
Step 2: Programmatically accessed 288 (144 released; 144 revoked) DNA-seq files from the ENCODE database. All the fastq files were paired-ended.
Step 3: Implemented the STAR aligner for RNA-seq and the BWA aligner for DNA-seq. The resulting outputs contained the alignment times for both the “revoked” and “released” files.
Step 4: Ran statistical tests to determine whether there are any significant differences in the runtimes of the two types of files.
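The post does not say which statistical test was used in Step 4. As one possible approach, the sketch below runs a two-sided permutation test on the difference in mean runtime between two groups, using only the Python standard library. The function name, the number of resamples, and the runtimes in the example are illustrative placeholders, not measurements from this project.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Placeholder alignment times in seconds, for illustration only.
released_runtimes = [118, 121, 125, 119, 130]
revoked_runtimes = [122, 127, 133, 120, 129]
print(permutation_test(released_runtimes, revoked_runtimes))
```

A permutation test makes no assumption about the shape of the runtime distribution, which is convenient when the number of files per group is small.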

Currently I am running the FASTQC tool to extract data quality metrics for the same set of files as discussed above. Once I have collected those metrics, I can start building regression models to determine whether there is any significant impact of data quality on execution time. The first step toward the execution of a typical genomic analysis workflow is quality control of the raw data, a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Through our analysis we aim to develop a reproducible ML model that will give the user an estimate of the runtime based on the raw FASTQ file as input.

References

[1] Albrecht, S., Sprang, M., Andrade-Navarro, M.A. et al. seqQscorer: automated quality control of next-generation sequencing data using machine learning. Genome Biol 22, 75 (2021). https://doi.org/10.1186/s13059-021-02294-2

[Mid-term] Capturing provenance into Data Science/Machine Learning workflows

Mon, 31 Jul 2023 00:00:00 +0000

This post describes our midterm work status and what we have achieved so far in the project for the noWorkflow package.

The initial weeks

I started doing a bibliographical review on reproducibility in the Data Science (DS) and Machine Learning (ML) realms. It was a new subject to me, and I aimed to build a more robust theoretical background in the field. Meanwhile, I took notes in this series of posts.

Then, as planned, I connected with the current noWorkflow supporters in order to get a broader view of the project and their contributions. Additionally, Juliana Freire, João Felipe Pimentel, and I set up a weekly one-hour meeting to keep track of my activities.

Brainstormed opportunities

At the beginning of June, we also met with other project supporters to brainstorm about our initial proposal. From this meeting, we came up with a plan for how to technically approach a new noWorkflow feature for experiment management in Data Science and Machine Learning.

In this brainstorm, we agreed that Jupyter Notebooks are, by far, the most frequent setup in DS/ML computational experiments. They established themselves as the fundamental artifact by embedding code and text and by enabling execution and visualization. Entire experiments are created and kept in Jupyter notebooks until they are sent to production. The opportunity at hand is to integrate noWorkflow with Jupyter Notebooks. Our mid-term goal was therefore adapted from the original plan of only selecting and executing a prototypical ML experiment: we added the goal of paving the way for a tagging feature for Notebook cells.

More specifically, DS/ML experimental workflows usually have well-defined stages composed of data reading, feature engineering, model scoring, and metrics evaluation. In our dream space, the user would tag a cell in their experiment, enabling the capture of the tagged metadata into a database. This step integrates the ultimate goal of facilitating comparisons, management, and even causal inference across different trials of a DS/ML experiment.

Current deliverables

So, based on our plans, we created a separate table to store the metadata from cell tagging. This table stores the cell hash codes and the information needed to match the code executed within a cell. As a result, we can store tags and the activation ids of the cells, enabling us to identify a cell containing a given stage in a DS/ML experiment.

The second feature implemented was tagging a specific variable. As with cells, it is now possible to stamp a given variable with a tag, keeping its name, id, and received value in this separate table.
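To give a feel for what such a tag table could look like, here is a minimal sketch using SQLite from the Python standard library. The schema, column names, and helper functions are assumptions made for illustration; they are not noWorkflow's actual database layout or API.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per tag, for either a cell or a variable.
conn.execute(
    """
    CREATE TABLE tag (
        id INTEGER PRIMARY KEY,
        trial_id TEXT NOT NULL,
        kind TEXT NOT NULL CHECK (kind IN ('cell', 'variable')),
        tag_name TEXT NOT NULL,
        code_hash TEXT,
        variable_name TEXT,
        activation_id INTEGER,
        value TEXT
    )
    """
)


def tag_cell(trial_id, tag_name, source, activation_id):
    # The hash lets us match the tag to the exact code that was executed.
    code_hash = hashlib.sha1(source.encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT INTO tag (trial_id, kind, tag_name, code_hash, activation_id) "
        "VALUES (?, 'cell', ?, ?, ?)",
        (trial_id, tag_name, code_hash, activation_id),
    )


def tag_variable(trial_id, tag_name, variable_name, activation_id, value):
    conn.execute(
        "INSERT INTO tag (trial_id, kind, tag_name, variable_name, "
        "activation_id, value) VALUES (?, 'variable', ?, ?, ?, ?)",
        (trial_id, tag_name, variable_name, activation_id, repr(value)),
    )


def find_tag(trial_id, tag_name):
    return conn.execute(
        "SELECT kind, activation_id, value FROM tag "
        "WHERE trial_id = ? AND tag_name = ?",
        (trial_id, tag_name),
    ).fetchall()


tag_cell("trial-1", "model_scoring", "score = clf.score(X, y)", 7)
tag_variable("trial-1", "accuracy", "score", 9, 0.91)
print(find_tag("trial-1", "accuracy"))
```

Keeping tags in their own table, keyed by trial, is what makes it possible to look up the same tag across several trials and compare the stored values.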

Finally, we worked on displaying the dependencies of a given variable. In this case, by tagging a given variable, we can display the other variables, values, and cells activated in its construction. Then, we can visualize the dependencies that contributed to its final value.
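Conceptually, displaying the dependencies of a tagged variable is a graph traversal. The sketch below walks a hypothetical dependency mapping (variable name to the names it was computed from) and collects everything that contributed to a given variable. The mapping and the function name are illustrative assumptions; noWorkflow derives this information from captured provenance rather than from a hand-written dictionary.

```python
def collect_dependencies(variable, depends_on):
    """Return every variable that directly or indirectly fed into `variable`."""
    seen = []
    stack = [variable]
    while stack:
        current = stack.pop()
        for parent in depends_on.get(current, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


# Hypothetical provenance of a small ML experiment.
depends_on = {
    "score": ["model", "X_test"],
    "model": ["X_train"],
    "X_train": ["raw_data"],
    "X_test": ["raw_data"],
}
print(collect_dependencies("score", depends_on))
```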

For an overview of current developments, please refer to my fork of the main project.

Challenges

During this period, we had to make choices along the way. For instance, capturing the provenance of cells through tags is a different solution than tagging code chunks in scripts. In this case, we decided to stick with tagging Notebook cells at this moment. We also opted to start storing the metadata to enable comparisons between trials rather than focus on a sophisticated graphic and user-friendly cell tagging system. We also opted to keep this metadata info stored in a separate table in the database.

Next steps

In the second half of the summer, our goal is to integrate these features in order to proceed with comparisons among experiments. Such comparisons would use the tagged variables as the hyperparameters of DS/ML experiments or as key variables to assess the experiments, such as errors or scores. As a result, we will be able to compare the results of two trials more accurately and in an easily reproducible way.

Implemented IO uring for Key-Value Drives

Mon, 31 Jul 2023 00:00:00 +0000

Hi everyone!

I’m Manank Patel (see my introduction post), and I am currently working on Efficient Communication with Key/Value Storage Devices. The goal of the project is to leverage the capabilities of io_uring and implement a new backend driver.

In the existing sockets backend, we use non-blocking sockets with looping to ensure all the data is written. Here is a simplified flow diagram for the same. The reasoning behind using non blocking sockets and TCP_NODELAY is to get proper network utilization. This snippet from the code explains it further.


NODELAY means that segments are always sent as soon as possible,
even if there is only a small amount of data. When not set,
data is buffered until there is a sufficient amount to send out,
thereby avoiding the frequent sending of small packets, which
results in poor utilization of the network. This option is
overridden by TCP_CORK; however, setting this option forces
an explicit flush of pending output, even if TCP_CORK is
currently set.

In the above figure, we have a loop with a writev call, and we check the return value and if all the data has not been written, then we modify the offsets and then loop again, otherwise, if all the data has been written, we exit the loop and return from the function. Now this works well with traditional sockets, as we get the return value from the writev call as soon as it returns. In case of io_uring, if we try to follow the same design, we get the following flow diagram.

Here, as you can see, there are many additional steps and extra overhead if we want to check the return value before sending the next writev, as we need to know how many bytes have been written so far to change the offsets and issue the next request accordingly. Thus, in every iteration of the loop we need to get an SQE, prep it for writev, submit it, and then wait for the CQE to get the return value of the writev call.
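For reference, the partial-write loop of the sockets backend described earlier can be sketched in Python with os.writev (available on Unix). This is an illustration of the pattern, not the project's C code: the function name and the injectable writev parameter are assumptions, and the sketch leaves out the handling of BlockingIOError that a real non-blocking socket would need.

```python
import os


def write_all(fd, buffers, writev=os.writev):
    """Write every buffer in order, looping until nothing is left."""
    pending = [memoryview(b) for b in buffers if len(b) > 0]
    total = 0
    while pending:
        written = writev(fd, pending)  # may write fewer bytes than requested
        total += written
        # Drop buffers that were written completely.
        while pending and written >= len(pending[0]):
            written -= len(pending[0])
            pending.pop(0)
        # Advance the offset inside a partially written buffer.
        if written:
            pending[0] = pending[0][written:]
    return total


read_end, write_end = os.pipe()
print(write_all(write_end, [b"header", b"payload"]))
os.close(write_end)
os.close(read_end)
```

The key point is that each iteration depends on the return value of the previous writev, which is cheap with a direct system call but requires a full submit-and-wait cycle per iteration with io_uring.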

The alternate approach would be to write the full message/iovec atomically in one call, as shown in following diagram.

However, when we tried this method and ran fio tests, we noticed that it worked well with smaller block sizes, like 16k, 32k and 64k, but failed consistently with larger block sizes like 512k or 1m, because it was not able to write all the data to the socket in one go. For small block sizes, this method showed good results compared to the sockets backend. We tried increasing the send/recv buffers to 1MiB-10MiB, but it still struggled with larger block sizes.

Going forward, we discussed a few ideas to understand the performance trade-offs. One is to use a static variable and increment it on every loop iteration, so that we can find out whether that is really the contributing factor to our problem. Another idea is to break the message down into small chunks, say 256k, set up io_uring with sqe polling, and then link and submit those requests in a loop, without calling io_uring_submit and waiting for a CQE. The plan is to try these ideas, discuss them, and come up with new ideas on how we can leverage io_uring for the ktli backend.
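The chunking step of the second idea can be sketched independently of io_uring. The helper below splits a message into 256 KiB slices without copying the data; each slice would then become one linked write request. The function name is hypothetical, and the sketch deliberately stops before the submission logic, which depends on the liburing calls used in the backend.

```python
def split_into_chunks(message, chunk_size=256 * 1024):
    """Split a bytes-like message into zero-copy slices of at most chunk_size."""
    view = memoryview(message)
    return [view[i:i + chunk_size] for i in range(0, len(view), chunk_size)]


# A 1 MiB block, the size that failed when written in a single call.
chunks = split_into_chunks(bytes(1024 * 1024))
print(len(chunks), [len(c) for c in chunks])
```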

Improving Video Applications' Accuracy by Enabling The Use of Concierge

Mon, 31 Jul 2023 00:00:00 +0000

Introduction

Hello, it’s me again, Faishal, a SoR project contributor for the edgebench project. For the past two months, my mentors and I have been working on improving the performance of our system. In this report, I would like to share with you what we have been working on.

Motivation

Edgebench is a project that focuses on how to efficiently distribute resources (bandwidth and CPU usage) across several video applications. Today’s video applications process their video on a server, an approach known as edge computing. Bandwidth and compute are therefore the greatest concerns for edge computing over a WAN, because they are strictly limited.

Consider the following case: suppose we have 3 video applications running in several areas across a city, and the total bandwidth allocated to those 3 video applications is fixed. Naively, we may divide the bandwidth evenly among the cameras in the system. We may then have the following graph of the allocated bandwidth over time.

The allocations are fixed and won’t change. However, every video application has its own characteristics that determine how much bandwidth it needs to deliver a good result, measured as f1-score. Our task is to maintain a high average f1-score, so we need a new, accuracy-oriented solution. This is where the accuracy gradient [1] comes in.

System Design

In our current design, we need a resource allocator, namely the concierge. The concierge determines how much bandwidth is needed for every video application (vap) in the system, and it performs the allocation at a predetermined time interval. This process is called profiling. During profiling, the concierge first asks every vap to calculate its f1-score for a certain video segment when its bandwidth is increased by profile_delta. The default f1-score is then subtracted from this f1-score, and the difference is called f1_diff_high. After that, the concierge asks the vap to reduce its bandwidth by profile_delta and repeat the same process; this result is called f1_diff_low. Those two results are sent to the concierge for the next step. On the concierge, there is a sensitivity calculation, where sensitivity is

This equation tells us which video application will give us the best f1-score improvement if we add more bandwidth to one vap while reducing another’s bandwidth. Based on this, the concierge gives bandwidth to the app with the highest sensitivity and takes bandwidth from the app with the lowest sensitivity.
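One reallocation step of this kind can be sketched as follows. The sensitivities are taken as given inputs, since the sketch does not attempt to reproduce the sensitivity equation. The function name, the choice of moving exactly one fixed step per round, and the rule that a donor must keep some bandwidth are assumptions made for illustration and may differ from the concierge's real logic.

```python
def reallocate(bandwidth, sensitivity, step):
    """Move `step` kbps from the least to the most sensitive application."""
    donor = min(sensitivity, key=sensitivity.get)
    receiver = max(sensitivity, key=sensitivity.get)
    updated = dict(bandwidth)
    # Skip the move if all apps are equally sensitive or the donor would
    # be left with no bandwidth at all.
    if donor != receiver and updated[donor] > step:
        updated[donor] -= step
        updated[receiver] += step
    return updated


# Hypothetical sensitivities for three apps that start with an even split.
bandwidth = {"app_a": 400, "app_b": 400, "app_c": 400}
sensitivity = {"app_a": 0.5, "app_b": 0.1, "app_c": 0.0}
print(reallocate(bandwidth, sensitivity, step=80))
```

The total bandwidth stays constant across the step, which matches the fixed-budget setting described in the motivation.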

Results

As aforementioned, our main objective is to improve the accuracy. However, there are two parameters that will be taken into account which are improvement and the overhead of its improvement. We first choose 3 dds apps[2] that we think will be our ideal case. The following graphs show the profile of our ideal case

We can see that two of them have high sensitivity especially on lower bandwidth and one of them has low sensitivity. This is a perfect scenario since we may sacrifice one’s bandwidth and give it to the app that has the highest sensitivity at that iteration. We will do the experiment under the following setup

DATASETS=("" "uav-1" "coldwater" "roppongi")
MAX_BW=1200
PROFILING_DELTA=80
MI=5

This setup means we use a total bandwidth of 1200 kbps, which is initially distributed evenly (400 kbps per app). The profiling_delta is 80 kbps and the profiling interval (MI) is 5 seconds.

Mode	DDS (uav-1)	DDS (coldwater)	DDS (roppongi)	Average
Baseline	0.042	0.913	0.551	0.502
Concierge	0.542	0.854	0.495	0.63 (+25.5%)

From the result, we managed to improve the average f1-score by about 0.13, or 25.5%. This is a very good result. There are a total of 10 videos in our dataset. For the next experiment, we first generate 6 combinations of dds apps. Note that in each combination, one video is uav-1, since we know that it has the highest sensitivity. We run the experiment with 4 bandwidth scenarios (1200, 1500, 1800, 2100) in kbps.
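The averages in the table can be re-derived from the per-video f1-scores with a few lines of Python. The numbers below are copied from the table; only the variable names are new.

```python
baseline = {"uav-1": 0.042, "coldwater": 0.913, "roppongi": 0.551}
concierge = {"uav-1": 0.542, "coldwater": 0.854, "roppongi": 0.495}

baseline_avg = sum(baseline.values()) / len(baseline)
concierge_avg = sum(concierge.values()) / len(concierge)
gain = concierge_avg - baseline_avg
relative_gain = gain / baseline_avg

print(f"baseline average:  {baseline_avg:.3f}")
print(f"concierge average: {concierge_avg:.3f}")
print(f"absolute gain:     {gain:.3f}")
print(f"relative gain:     {relative_gain:.1%}")
```

The unrounded relative gain comes out slightly above the +25.5% in the table, which is what one gets when the Concierge average is first rounded to 0.63.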

The left figure depicts the average improvement of the concierge. Here we can see that the improvement decreases when the total bandwidth increases. The reason is that at a higher bandwidth, the sensitivity tends to be closer to 0 and the concierge won’t do any allocation. Overall, this confirms our previous result that with the help of uav-1, the concierge can improve the f1-score by up to 0.1. In the next experiment, we randomly pick 3 dds videos out of the 10 videos, and repeat this 10 times. We would like to see how the concierge performs without any help from uav-1.

From the result, we still managed to get an improvement. However, the average improvement is lower than in the previous experiment. The reason for this will be discussed later.

Overhead Measurement

Each graph above corresponds to one total-bandwidth setting. In this experiment, a lower MI clearly leads to higher overhead, since profiling runs more often than with a higher MI. The 4 graphs show that lowering the MI involves a significant trade-off, since the improvement itself is not large. The highest improvement is at 1200 kbps. Hence, for higher bandwidth, there is no need to do the profiling too often.

Discussion

There are some limitations of our current design. If we look at the box plot in figure 5 above, we can see that there are some combinations where the improvement is negative.

The figure above depicts the profiling process at segment 6, which determines the bandwidth used at segment 7. Here we can see that the f1-score at that bandwidth for (jakarta) drops significantly. Our current design cannot address this issue yet, since we only consider the current video segment. We need to look not only at the current segment but also at previous and future segments.

Regarding the overhead, we are aware that 50% overhead is still considered bad. We might try a dynamic MI or skip the profiling for certain videos when it is not necessary.

Conclusion

Regardless of the aforementioned limitations, this report shows that the concierge is generally capable of improving the f1-score. The next update will be given in the final report.

References

[1] https://drive.google.com/file/d/1U_o0IwYcBNF98cb5K_h56Nl-bQJSAtMj/view?usp=sharing
[2] Kuntai Du, Ahsan Pervaiz, Xin Yuan, Aakanksha Chowdhery, Qizheng Zhang, Henry Hoffmann, and Junchen Jiang. 2020. Server-driven video streaming for deep learning inference. In Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication. 557–570.

Mid-term blog post for Public Artifact Data and Visualization

Mon, 31 Jul 2023 00:00:00 +0000

Over the past few weeks, our platform development has been progressing steadily, and we are excited to share the milestones we have achieved so far. As planned in our introductory blog, we have successfully laid the groundwork for the platform with the guidance and support of our mentor.

Milestones and Accomplishments

Here are some of the key functionalities we have implemented so far:

Modular Architecture: We successfully designed the platform with a modular architecture, separating the Graphical User Interface (GUI) and Command-Line Interface (CLI) functionalities. This modularity allows users to interact with the platform in their preferred way.
Experiment and Bucket Creation: Users can now create experiments, buckets (for storing different implementations of experiments), and iterations using either the GUI or CLI.
Real-time Backend Environment Monitoring: Through the command line interface, users have the capability to control the monitoring of backend environment data, allowing for real-time tracking and analysis of important metrics.
Visualizing Environment Variables: Users can now visualize detected environment variables on the platform. Moreover, they can compare iterations within different buckets and gain more insights by observing the timeseries data, such as CPU usage, in a graphical format.

Challenges

In the early stages of designing our platform, we encountered significant challenges at the system design level. One of the most daunting obstacles we faced was devising an effective method to monitor backend environment variables. To tackle this obstacle, we engaged in extensive discussions and sought guidance from our mentor. After careful consideration, we decided to adopt a multi-process approach to monitor the backend environment variables effectively. Specifically, we devised a meticulous strategy of creating a separate process in the background for each specific metric we needed to monitor. By allocating a dedicated process to each metric, we ensured a streamlined and efficient monitoring process.

Currently, we are facing a challenge related to monitoring metrics. Since different users have varying monitoring requirements, it is impractical for us to manually write monitoring solutions for each user. To address this issue, we are actively working on implementing a pluggable design that allows users to configure their own monitoring preferences.

Our approach involves providing users with the flexibility to define their custom configuration files or write monitoring programs following our documented guidelines. This way, users can specify the specific metrics they wish to monitor and tailor the monitoring process to their individual needs.
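A pluggable design of this kind is often built around a registry of collector functions that a configuration selects from. The sketch below shows that pattern in a minimal form. The registry, the decorator, the metric names, and the configuration layout are all assumptions for illustration and do not reflect the platform's actual interfaces.

```python
import os
import time

# Hypothetical registry mapping a metric name to the function that samples it.
COLLECTORS = {}


def collector(name):
    """Decorator that registers a function as the collector for `name`."""
    def register(func):
        COLLECTORS[name] = func
        return func
    return register


@collector("cpu_count")
def cpu_count():
    return os.cpu_count()


@collector("timestamp")
def timestamp():
    return time.time()


def sample(config):
    """Collect only the metrics that the user's configuration asks for."""
    unknown = [m for m in config["metrics"] if m not in COLLECTORS]
    if unknown:
        raise KeyError(f"no collector registered for: {unknown}")
    return {m: COLLECTORS[m]() for m in config["metrics"]}


# A user-supplied configuration, for example loaded from a file.
print(sample({"metrics": ["cpu_count", "timestamp"]}))
```

With this structure, supporting a new metric means registering one more function; the sampling code and existing configurations stay unchanged.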

Try it Out!

GUI Repository and CLI Repository
- In the README.md file of GUI repo, you will find detailed installation instructions to set up the Graphical User Interface (GUI). Follow the steps provided to get started with our platform.
Sample Repository
- In this repository, we have included scripts that allow you to run our program. Additionally, you can use these scripts as templates to monitor your own programs according to your specific requirements.

We welcome you to take the platform for a test drive and feel free to raise any issues you encounter during the installation process. Your feedback is invaluable to us, as it helps us identify and address any potential installation challenges and improve the user experience.

Enhancing Drift Detection through Fine-Tuning Llama2

Sun, 30 Jul 2023 00:00:00 +0000

Greetings everyone, I’m Kangrui. Over the past few weeks, we’ve dedicated our efforts and have consequently made significant progress in our drift detection methods. Now, I’m excited to present to you a detailed elaboration on how we prompted and fine-tuned Llama2 to efficiently carry out the drift detection task.

Motivation

Why LLM in drift detection method?

The use of large language models (LLMs) in drift detection methods presents numerous benefits that place it as a prominent solution in this domain.

Rapid Development: LLMs are in the vanguard of technological advancement. This field is evolving rapidly with continuous enhancements in model architecture, training techniques, and data handling. With every new version, these models are showing an increasing capacity to understand and generate human-like text, pushing the limits of what is achievable in Natural Language Processing (NLP) and Artificial Intelligence (AI) as a whole.
Superior Performance: Traditional drift detection methodologies such as Page-Hinkley, EDDM, and HDDM have their merits and have found success in numerous scenarios. Even Deep Learning (DL) techniques, like training a predictive model based on error rates, have made significant strides in the field. However, when handling complex, high-dimensional, and real-time data, LLMs have demonstrated exceptional results. They are not only able to effectively predict and respond to drifts but also adapt to new trends more swiftly. Our experiments using LLMs like GPT-3.5-turbo have yielded impressive results, notably outperforming other methods.

Fig. 1: Concept drifts detected by GPT-3.5-turbo in the Cori dataset

Flexibility: One of the major advantages of using LLMs is their flexibility in dealing with different types of input and output. In contrast to traditional methods, which are confined to single feature concept drift detection and can only process numerical values, LLMs can handle a range of input types including text, numbers, and more complex data structures. This capability allows them to detect multi-feature concept drifts, thereby broadening the scope and complexity of problems they can tackle. Moreover, the generation capability of LLMs can provide rich and detailed output, facilitating more comprehensive insights into the detected drifts.

Why Llama2 in drift detection method?

Llama2 presents a series of advantages that make it an excellent choice for applying LLMs to drift detection. Here’s a breakdown of the key reasons:

Performance Guarantee: As a newly released model, Llama2 has undergone extensive development and testing, providing a reliable guarantee of performance. It represents the cutting edge in AI technology, having benefited from the latest research and advancements in language model design.
Accessibility Guarantee: One significant advantage of Llama2 is that it is open-source. It is readily accessible on HuggingFace, which also provides a range of mature tools to fine-tune and deploy the model.
Flexibility for Fine-Tuning: Llama2 comes in different sizes, such as 7B, 13B, and 70B parameters, which allows for flexibility in model selection based on the task’s requirements and computational resources.

Data

Dataset

In our study, we employed Synthetic data streams for the fine-tuning of Llama2. Synthetic data streams serve as an invaluable resource for controlled experiments in the domain of drift detection. These curated datasets encompass varied types of drifts, providing us with the capability to assess the efficacy of our detection algorithms under diverse scenarios.

Here is a brief introduction to the synthetic datasets we used:

Sine1 & Sine2: These datasets induce abrupt concept drift within a two-dimensional feature space. The classification rule, a sine function, dictates the instance labels, which are flipped at every drift point.
Mixed: This dataset, characterized by its combination of numeric and boolean features, uses a composite classification rule. The abrupt concept drift is simulated via a periodic reversal of class labels.
Stagger: This categorical dataset incorporates abrupt concept drift by periodically altering the classification rules tied to the features.
Circles & LED: These datasets are designed to simulate gradual concept drift. In Circles, the classification of instances is determined by their spatial relation to specific circles. LED imitates a seven-segment digit display, introducing drift by interchanging the pertinent attributes.

Typically, the synthetic datasets contain 100,000 or 1,000,000 instances. A concept drift occurs every 25,000 or 33,333 instances, and each drift is either abrupt (with a drifting period of 50 instances) or gradual (with a drifting period of 500 instances).

Data Preprocessing and Metrics

Given the token limit of Llama2 and the specific requirements of our project, we needed to transform the data into an appropriate format.

As such, we processed each data stream into three sections: the ‘undrifted’ period, the ‘drifting’ period, and the ‘drifted’ period. All instances in each section were randomly and independently drawn from the original data stream, summing up to a maximum of 100 instances. The number of instances for the undrifted and drifted periods ranged from 20 to 50, and for the drifting period, it ranged from 10 to 20.

For instance, let’s consider a dataset containing 100,000 instances where the concept drift occurs every 25,000 instances, causing abrupt concept drift. To format a data point, we could draw 20 to 50 instances from the first 25,000 as the undrifted period. Then, we could draw 10 to 20 instances from the 25,001st to 25,050th instance as the drifting period. Finally, we would draw 10 to min(100 - num(undrifted period) - num(drifting period), 50) from the 25,051st to 50,050th instance as the drifted period. This newly formatted data stream would then be fed into Llama2.
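The sampling in this example can be sketched roughly as follows. This is only an illustration of the description above, not our actual preprocessing script: the function name, its parameters, the 1-based indexing, and the choice to sort the drawn indices are all assumptions.

```python
import random


def sample_window(drift_point, drift_width=50, concept_len=25_000,
                  max_total=100, rng=None):
    """Draw 1-based instance indices for one formatted data point.

    Hypothetical helper: the period boundaries follow the worked example
    in the text (abrupt drift every 25,000 instances, drifting period of
    50 instances).
    """
    rng = rng or random.Random()

    # Sizes of the three sections, at most max_total instances in total.
    n_before = rng.randint(20, 50)
    n_transition = rng.randint(10, 20)
    n_after = rng.randint(10, min(max_total - n_before - n_transition, 50))

    # Index ranges of the undrifted, drifting, and drifted periods.
    before_range = range(drift_point - concept_len + 1, drift_point + 1)
    transition_range = range(drift_point + 1, drift_point + drift_width + 1)
    after_range = range(drift_point + drift_width + 1,
                        drift_point + drift_width + concept_len + 1)

    # Draw without replacement and keep stream order (sorting is an assumption).
    before = sorted(rng.sample(before_range, n_before))
    transition = sorted(rng.sample(transition_range, n_transition))
    after = sorted(rng.sample(after_range, n_after))
    return before, transition, after
```

Calling `sample_window(25_000)` draws the undrifted indices from 1 to 25,000, the drifting indices from 25,001 to 25,050, and the drifted indices from 25,051 to 50,050, matching the ranges in the example.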

We also included some additional information to assist Llama2’s inference process. A typical data point in our processed dataset includes:

{
 "before_period": [0, 31],
 "transition_period": [32, 38],
 "after_period": [39, 59],
 "before_index": [196, 19963],
 "transition_index": [20002, 20030],
 "after_index": [20310, 39984],
 "meta": "Dataset: MIXED\n\tv's type is nominal, range is ('False', 'True')\n\tw's type is nominal, range is ('False', 'True')\n\tx's type is numeric\n\ty's type is numeric\n\tclass's type is nominal, range is ('p', 'n')\n",
 "data_stream": ...
}

From this dictionary, the “meta” and “data_stream” entries are fed into Llama2. The “transition_period” serves as the criterion: if Llama2’s answer lies within the “transition_period”, we deem it correct.
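As a minimal sketch of this criterion, the check could look like the code below. The helper names are hypothetical, and two details are assumptions rather than something stated above: the bounds of "transition_period" are treated as inclusive, and the first integer found in the model's reply is taken as its answer.

```python
import re


def parse_index(response):
    """Return the first integer in the model's reply, or None if there is none."""
    match = re.search(r"-?\d+", response)
    return int(match.group()) if match else None


def is_correct(response, data_point):
    """True if the answered index falls inside the transition period (inclusive)."""
    index = parse_index(response)
    if index is None:
        return False
    start, end = data_point["transition_period"]
    return start <= index <= end


def accuracy(responses, data_points):
    """Fraction of data points whose answer lies in the transition period."""
    hits = sum(is_correct(r, d) for r, d in zip(responses, data_points))
    return hits / len(data_points)
```

For the data point shown above, any answer from 32 to 38 would count as correct under these assumptions.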

Llama2

Inference

We experimented with three variations of prompts during the inference phase.

Prompt Version 1:

[INST] <<SYS>>
 You are a helpful, respectful, and honest assistant. Always provide the most helpful responses possible while ensuring safety. Ensure that your responses are socially unbiased, positive, and free from harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. If a question lacks coherence or sense, explain why instead of providing incorrect information. If you are uncertain about an answer, refrain from sharing false information.
 <</SYS>>
 Your task is to identify the index in a given data stream where the relationship between the features and labels begins to change. The data stream is formatted as a list, with each element being a two-element list: the first represents the features (also a list), and the second is the label. If your answer is 'x', it indicates that the data pattern starts shifting at the xth data point in the stream.
 Here's an example of the data's metadata: Dataset: SINE1
 x's type is numeric
 y's type is numeric
 class's type is nominal, range is ('p', 'n')

 The given data stream is: [[[0.7, 0.07], 'p'], [[0.45, 0.78], 'n'], ..., [[0.64, 0.45], 'n']]
 Your task is to respond with a single index. No additional information is required.
[/INST]

Prompt Version 2:

The same as Prompt 1, but with a specific range for the index response:

Please provide an index ranging from 0 to 96. No additional information is required.

Prompt Version 3:

This prompt uses an instruction-input-output design, which we adopted for fine-tuning:

Below is an instruction paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Identify the index in a given data stream where the relationship between features and labels begins to change. The data stream is formatted as a list, each element being a two-element list: the first represents the features (also a list), and the second is the label. For instance, if the response is 'x', it means that the data pattern starts shifting at the xth data point in the stream. Only respond with an index, no further information is necessary.

### Input:
Meta Data:
Dataset: SINE1
 x's type is numeric
 y's type is numeric
 class's type is nominal, range is ('p', 'n')

Data stream:
[[[0.7, 0.07], 'p'], [[0.45, 0.78], 'n'], ..., [[0.64, 0.45], 'n']]

### Response:

Despite minor differences between Prompt Version 1 and Version 2, both suggested by Meta, the results varied significantly, a topic we will delve into in the following section. Prompt Version 3, employing the instruction-input-output structure, was used during our fine-tuning process.
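To show how a processed data point maps onto Prompt Version 3, here is a small sketch that fills the template from the "meta" and "data_stream" entries. The function and constant names are hypothetical, and the exact whitespace of the real prompts may differ.

```python
INSTRUCTION = (
    "Identify the index in a given data stream where the relationship between "
    "features and labels begins to change. The data stream is formatted as a "
    "list, each element being a two-element list: the first represents the "
    "features (also a list), and the second is the label. For instance, if the "
    "response is 'x', it means that the data pattern starts shifting at the xth "
    "data point in the stream. Only respond with an index, no further "
    "information is necessary."
)

PROMPT_TEMPLATE = (
    "Below is an instruction paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\nMeta Data:\n{meta}\n"
    "Data stream:\n{data_stream}\n\n"
    "### Response:\n"
)


def build_prompt(data_point, instruction=INSTRUCTION):
    """Format one data point in the instruction-input-output layout."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction,
        meta=data_point["meta"],
        # str() of a nested list gives the [[features, label], ...] form.
        data_stream=str(data_point["data_stream"]),
    )
```

During fine-tuning, the expected index would be appended after "### Response:" as the target output.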

Fine-Tuning

We utilized the tools provided by llama-recipes to fine-tune Llama2. The key command used to initiate the fine-tuning process is illustrated below:

python llama_finetuning.py --use_peft \
 --peft_method lora \
 --quantization \
 --model_name meta-llama/Llama-2-13b-chat-hf \
 --output_dir ./fine_tuned_model/Llama-2-13b-chat-hf-test_finetune \
 --dataset alpaca_dataset \
 --batch_size_training 40 \
 --num_epochs 1

Some explanation of the parameters:

--use_peft: This flag indicates the use of the Parameter-Efficient Fine-Tuning (PEFT) method. PEFT allows us to fine-tune the model more efficiently.
--peft_method lora: Here, we specify that LoRA (Low-Rank Adaptation) should be used for PEFT. LoRA freezes the pretrained weights and trains small low-rank update matrices instead.
--quantization: The quantization flag reduces the memory footprint of the model by loading its weights at reduced precision.
--dataset alpaca_dataset: Specifies the dataset configuration used for fine-tuning. Here, 'alpaca_dataset' selects the instruction-input-output structure.
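To give a feel for why LoRA (Low-Rank Adaptation) makes fine-tuning cheaper, here is a tiny pure-Python sketch of the idea: the frozen weight matrix W is left untouched and only a low-rank product B·A is trained, so the layer computes W x + (alpha / r) · B (A x). This is a toy illustration, not how llama-recipes or the PEFT library implements it, and the sizes used below are illustrative rather than our actual settings.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the LoRA rank.

    W has shape d_out x d_in (frozen), A has shape r x d_in and
    B has shape d_out x r (both trainable).
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable values in A and B for one adapted matrix."""
    return rank * (d_in + d_out)


# Illustrative sizes only: a 4096 x 4096 projection with rank 8.
full = 4096 * 4096
adapted = lora_trainable_params(4096, 4096, 8)
print(f"full: {full}, LoRA: {adapted}, ratio: {adapted / full:.4%}")
```

In standard LoRA, B starts at zero, so the adapted layer initially behaves exactly like the pretrained one and only drifts away from it as training proceeds.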

Results

The performance of various models and prompt versions is depicted in Fig. 2.

Fig. 2: Performance comparison of different models and prompt versions.

It is evident from the results that the design of the prompt has a significant impact on Llama2’s performance. Furthermore, due to computational resource constraints, we have only managed to fine-tune Llama2 on a portion of our dataset (approximately 1,000 instances). The entire training set consists of 19,000 instances, and the test set includes 5,000 instances. Despite these limitations, a performance increase is noticeable after fine-tuning.

GPU Emulator for Easy Reproducibility of DNN Training -- Interim Blog Post

Sun, 30 Jul 2023 00:00:00 +0000

Introduction

Motivation

The growing popularity of Deep Neural Networks (DNNs) has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects. Building an emulator can therefore ease the burden of reserving GPUs, and the emulator can be modified to gather the profiles needed for optimization much more quickly.

Overture

The following sections introduce the completed tasks and the details of each. The contents are summarized briefly and present only the necessary information. We finished the following tasks:

Literature Review
Emulator implementation:
- Time Profiling
- Pinned Memory
- Inter-GPUs Computation
Reproducing Figures

I will introduce them and the importance of each one.

Tasks + Reason

Literature Review

While waiting for the measurements, I started reading other GPU-related papers, especially the ones about GPU schedulers. We found that besides emulating computation and transfer time, we should also emulate the GPU memory profile in order to reproduce some other papers. Fortunately, it’s doable. In fact, without actually using a GPU, we can emulate many aspects of the GPU, not just its timing. I found several papers that are reproducible in theory, but they use TensorFlow while my current work targets PyTorch. Therefore I need to keep looking for the ones that use PyTorch.

Afterwards, we reviewed more papers and looked over the papers about GPU scheduling from 2018-2023 to see if we could reproduce figures from them. We went through 150 papers to search for the ones that have a PyTorch implementation and an accompanying GitHub repository. We managed to find about 15 papers built in PyTorch, 6 of which were published on GitHub.

We found the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs” and its GitHub page. The paper has three badges of “Artifacts Available, Evaluated, and Reproduced.” The paper’s content is implemented in PyTorch which means we can probably emulate this paper’s result with the emulator we already have by adding more features. We have started testing out to see if we can set up a similar environment and reproduce the experiments in the paper. After checking out the reproducibility of the paper, we will try to reproduce it using our emulator, and we might add new features to our emulator during this process.

Firstly, I tried to reproduce the figures in the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs”, but stopped after a considerable number of attempts because the README was incomplete and too hard to follow. I first headed to the paper’s GitHub repository. I read the paper and understood that GNN training is not the same as regular deep learning training because of its input irregularity, and that CoGNN’s algorithm helps schedule the jobs to the machines better. However, when I tried to install the software following the requirements in their environment README in order to reproduce the figures, I ran into many dependency issues, and barely any of the required packages installed successfully. The README in the software module was also unclear on how to run the experiments. Following the experiment setup did not give me the expected results. After struggling to complete even one suggested experiment, we eventually decided to abandon this paper and move on to other papers, which reminded me again of the importance of reproducibility.

Secondly, we found another paper, “Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent”. After reading the paper, we figured that the main focus was on distributing the resources (CPU, GPU) of the nodes to the jobs dispatched by the Kubernetes Scheduler. In this way, there would be less GPU fragmentation and a higher utilization rate of the resources. The paper used a simulator to simulate a large number of nodes and run the jobs in simulation. I successfully ran the experiments demonstrated in the repo and even created a smaller sample so that we could get results faster, because their original experiment consists of 1020 runs, which would take about a month. However, when we dug deeper into the paper, we soon realized that their emulator is not a “real” one. Although their emulator is built on Kubernetes, the part they used to create the figures is a mere simulator and therefore doesn’t fit our goal of emulating only the GPU-related parts while running the other parts on a real system.

Reason:

The purpose is to figure out which papers can be reproduced using the emulator, and what other features are needed for the emulator to work.

Emulator implementation

Time Profiling

I did the performance profiling of different GPUs, which included CPU-to-GPU data transfer time and GPU computation time. These two elements will always be rather constant on GPUs so they can be easily emulated by profiling first and then utilized in the emulation. We did it for 6 different GPUs including k80, rtx6000, m40, a100pcie, v100, and p100.

After collecting the performance profiles of a few types of GPU nodes, I implemented the first naive version of the emulator. I used the recorded profile and the sleep() function to represent the amount of time each step takes. The time also varies with the command given, so some simple arithmetic was implemented too. Although the emulator runs on a CPU node, we can still obtain the time profile of a GPU just as we would on a real GPU node.
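The idea behind this naive version can be sketched as below. This is not the emulator's actual code: the profile numbers are made-up placeholders rather than measurements, the names are hypothetical, and scaling the time linearly with batch size is just one simple assumption for the "simple arithmetic" step.

```python
import time

# Hypothetical profile: seconds per batch measured at a reference batch size.
# The numbers are placeholders, not real measurements.
PROFILE = {
    "p100": {"transfer_s": 0.004, "compute_s": 0.020, "ref_batch_size": 128},
}


def emulated_step_time(gpu, batch_size, profile=PROFILE):
    """Return (transfer, compute) seconds for one batch on the given GPU.

    Assumes time scales linearly with batch size relative to the profile.
    """
    entry = profile[gpu]
    scale = batch_size / entry["ref_batch_size"]
    return entry["transfer_s"] * scale, entry["compute_s"] * scale


def emulate_step(gpu, batch_size, sleep=time.sleep):
    """Stand in for one training step on a CPU node by sleeping."""
    transfer, compute = emulated_step_time(gpu, batch_size)
    sleep(transfer)  # emulated CPU-to-GPU data transfer
    sleep(compute)   # emulated GPU computation
    return transfer + compute
```

Passing a different `sleep` callable (for example one that only records the durations) makes it possible to inspect the emulated timings without actually waiting.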

Reason:

The time profile collected can be compared with Data Wait Time to conduct research on minimizing pipeline stall across different GPUs and models.

Pinned Memory

Pin memory threads: GPU-based PyTorch uses such threads to copy data from SHM to pinned memory, but CPU-based PyTorch doesn’t. Therefore, I need to implement an emulation of the pin memory threads. Fortunately, the data copy time is predictable. I have already found that pin memory time has little to do with the number of workers or the model type and depends only on the batch size. I still need to find out whether it depends on the GPU node, which I assume it does not at this point.

While implementing the features, we first emulated the CPU-to-GPU transfer time and GPU computation time for the p100 GPU based on the profiled information. Another CUDA behavior that requires emulation is copying data from shared memory to pinned memory. To emulate it, we measured and emulated the time for copying such data. However, the emulator did not behave exactly like the real GPU. This was because we only emulated the time cost of using pinned_memory, but not its memory cost. To resolve this, we wrote a CPython module to manually allocate page-locked memory (which behaves the same as CUDA’s pinned_memory). With this mechanism in place, the emulator had its fundamental functions and properly mimicked CUDA’s behaviors.

Reason:

After collecting the GPU profile, I compared the emulator with the actual GPU and noticed some differences in their IO time, meaning there was a difference between the emulation-based PyTorch and the actual GPU-based PyTorch.

Inter-GPUs Computation

We worked on emulating inter-GPU computation time in order to reproduce Figure 9 in the DNN stall paper. This is one of the influential factors in multi-GPU training, so we decided to first figure out how to implement this feature. As claimed in the paper, the larger the batch size, the less time it takes to update the model, and the smaller the batch size, the larger the overhead. However, our current emulator gives the same computation time in both cases, since we have not yet added features to emulate inter-GPU behaviors. The first step was to reserve a lease with 2 GPUs and observe the effect of using multiple GPUs on computation time. We found that there was a small amount of overhead when running two GPUs instead of one on the p100 node. My job was to find out where and how these overheads arise and to find ways to emulate them in order to reproduce Figure 9. We used resnet18 with 4 workers and 10 batches to separately run a batch size of 128 on 1 GPU (Group A) and a batch size of 256 on 2 GPUs (Group B). With our current emulator, both experiments would take the same computation time to finish 1 batch. However, we saw that the computation time of Group B was longer than that of Group A, meaning there was some overhead in computation time. I then dug into the PyTorch source code and identified one of the factors contributing to the overhead.

Reason:

To make the emulator more complete so that it can provide accurate emulation even when using more than 1 GPU on a machine.

Reproducing Figures

After implementing the emulator, we managed to use it to reproduce Figures 3, 4, 5, and 6 in the paper “Analyzing and Mitigating Data Stalls in DNN Training” after a series of experiments and testing. Some environments in the paper were not the same as the ones we ran in the past week, but the general patterns matched the expected hypotheses and measurements. We double-checked all the data and figures produced and found that our prototype met our expectations, so it was time to look for other papers to reproduce to make the emulator more interesting. The original figures are shown alongside the reproduced ones below; you can see that the patterns reflect our expected results:

Original Figure 3:

Reproduced Figure 3:

Original Figure 4:

Reproduced Figure 4:

Original Figure 5:

Reproduced Figure 5:

Original Figure 6:

Reproduced Figure 6:

Reason:

Our original goal was to reproduce papers. Therefore, reproducing figures is a really good step towards achieving that.

Summary + Coming Future

We will keep working to complete the emulator and figure out the exact mechanisms needed for the implementation. We will also look for more features and see whether better ones can be added to the emulator.

PDC Midterm Evaluation

Sun, 30 Jul 2023 00:00:00 +0000

Mid-Term Evaluation Update

Hello! I’m Nick, a GSoC contributor for the Proactive Data Containers (PDC) Project. Over the past few weeks I’ve worked on verifying the functionality of the Python API for the PDC project and ensuring the smooth onboarding for new users of the data containers.

I began by documenting the installation of the Ubuntu virtual machine needed to run the PDC repository, since the project wasn’t initially supported on Apple silicon hardware. The installation notes I recorded contributed to a more refined and precise installation process, which is now reflected on the GitHub page.

After installing the project’s dependencies on the VM, I began maintaining the existing Python API and making changes that would allow the tests to compile and run successfully. The manual setup had a few problems with directory paths that prevented the installation of a few files on new devices, which I initially fixed by manually linking the path and removing a few header files. However, this proved to be only a temporary fix, as the issues turned out to stem from a hardcoded path, which I resolved after some digging and alteration in the source code.

Now the PDC and PDCpy installations should go smoothly regardless of which OS is being used, and the installation instructions can be found on the GitHub page, which should allow any user to access the data containers.

Building extensions between Python libraries for Biotechnology laboratories

Fri, 28 Jul 2023 00:00:00 +0000

Hello again! This is Luiza, a GSoC contributor for the LabOp Project. My task is to build bridges between programming languages for Biotechnology Laboratory automation.

When talking about life sciences, reproducibility is an issue amongst most research centers. Biotechnology-focused laboratories usually have their own protocols developed in house for their own applications. Researchers rely on such protocols to perform their experiments and collect data, but when it comes to sharing those protocols and performing them in different laboratories, many difficulties arise. Whether because of missing equipment or reagents, or even a different order of execution, replicating a protocol in another laboratory is a challenge. To address this issue, LabOp was developed to represent a protocol and convert it in as many ways as possible, so it can be executed by humans and by machines.

PylabRobot and PyHamilton also come into the picture, as these libraries make it possible to write protocols for Hamilton robots (and Tecan machines as well for PylabRobot). However, they share the limitation of only being able to represent laboratory protocols at their lowest level, with the user having to write every single command in Python for the protocol to be executed. Thus I’m currently developing an extension for LabOp protocols to be converted into PylabRobot/PyHamilton scripts. This way the researcher writing the protocol can do it in a friendlier fashion, using human-friendly terms to write protocols for robot execution.

BehaviourSpecialization for Liquid Handling class

The first step is building a correspondence spreadsheet with a hello world protocol written in both languages (LabOp | PylabRobot ). This way we can make an equivalence between the functions, parameters and default commands of both Libraries, as well as their structure. This spreadsheet will serve as guidance for the conversion of the Liquid handling steps from their representation in LabOp to their representation in Pylabrobot.

The second step is to create a file that will execute the conversion. In this file I will define a Labware map, which is basically a dictionary translating LabOp resource names into Labware IDs recognizable by PylabRobot’s “resource” classes, and a BehaviourSpecialization class that should convert LabOp actions into operations of PylabRobot’s Liquid Handler class, as they will coordinate the commands sent from the script to the machines. (See featured images.)

Dictionary for LabOp to Pylabrobot container correspondence
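To make the two pieces concrete, here is a rough sketch of their shape. Everything in it is hypothetical: the labware names and IDs are placeholders, the class and method names are invented for illustration, and it only records plain command dictionaries instead of calling any real LabOp or PylabRobot function.

```python
# Hypothetical mapping from LabOp resource names to labware identifiers.
LABWARE_MAP = {
    "96 well plate": "example_96_well_plate",
    "tip rack": "example_tip_rack",
}


class LiquidHandlingSpecialization:
    """Collects robot-level commands for LabOp liquid handling actions."""

    def __init__(self, labware_map):
        self.labware_map = labware_map
        self.commands = []

    def resolve(self, labop_name):
        """Translate a LabOp resource name into a labware ID."""
        try:
            return self.labware_map[labop_name]
        except KeyError:
            raise ValueError(
                f"No labware mapping for LabOp resource {labop_name!r}"
            ) from None

    def transfer(self, source, destination, volume_ul):
        """Expand one high-level transfer into aspirate and dispense commands."""
        src, dst = self.resolve(source), self.resolve(destination)
        self.commands.append({"op": "aspirate", "labware": src, "volume_ul": volume_ul})
        self.commands.append({"op": "dispense", "labware": dst, "volume_ul": volume_ul})
        return self.commands[-2:]
```

The point of the design is that the protocol author only writes the high-level step, while the specialization expands it into the low-level commands the robot library expects.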

Then we move to the protocol that will be tested on the Hamilton machines. This is a plasmid purification protocol that is usually performed by a human at a very small scale, one sample at a time. This limitation is not present on Hamilton robots, as they can handle many samples at the same time with only one protocol execution. The robot that will be running this protocol has two modules that are not yet present in PylabRobot’s extensions: a pressure pump module and an on-deck heater shaker. I’ll be implementing these modules in PylabRobot based on their default commands in PyHamilton and running the protocol on a Hamilton Starlet unit.

The steps of the protocol have been decoupled to facilitate the pilot testing, they are as follows:

Liquid handling - GOOD TO GO
Pressure pump module - requires adjustments
Plate grippers (necessary to move the plasmid plate from one module to another) - requires adjustments
On-deck heater shaker - GOOD TO GO

The first pilot tests of the protocol will be run with water instead of plasmid to verify that all the steps run smoothly. Once that’s out of the way, we will perform the protocol with dirty plasmids that require purification (which is what the protocol is for). Success will be measured by sequencing the plasmid (if possible), performing a gel electrophoresis, and measuring the absorbance of the DNA.

The goal of these tests is to gather data on the effectiveness of the protocol and its execution on the machine, thus confirming that it is in fact a useful mechanism for DNA purification.

PolyPhy Infrastructure Enhancement

Thu, 27 Jul 2023 00:00:00 +0000

As part of the Polyphy Project, my proposal was aimed at improving various aspects of the project, including CI/CD workflows, encapsulation, and security. Under the mentorship of Oskar Elek, I have made significant progress in the following areas:

Fixed GitHub CI Workflows and Release to PyPI: During the first phase, I focused on refining the GitHub CI workflows by implementing new flows that facilitate seamless releases to PyPI. This ensures that the project can be easily distributed and installed by users, making it more accessible and user-friendly.
Encapsulation from Jupyter into Module: I successfully encapsulated the code from Jupyter notebooks into a module. This step is crucial as it prepares the codebase to be released as a standalone module, making it easier for developers to use and integrate into their own projects.
SonarCloud Integration for Better Code Analysis: To ensure the codebase’s quality, I set up SonarCloud to perform comprehensive code analysis. This helps in identifying potential issues, bugs, and areas of improvement, leading to a more robust and reliable project.
Migration to Docker from Tox: In order to improve the containerization process, I replaced the existing solution, Tox, with Docker. Docker provides better container management and ensures a consistent development and deployment environment across different platforms.
Research on Community Platforms for Self-Hosting: I conducted extensive research on various community platforms suitable for self-hosting. This will enable the project to establish a thriving community and foster active collaboration among users and contributors.
Enhanced Security Measures: I implemented several security improvements to safeguard the project and its users. These include setting up a comprehensive security policy, implementing secret scanning to prevent unintentional exposure of sensitive information, code scanning to identify potential vulnerabilities, private vulnerability reporting to handle security issues responsibly, and Dependabot integration for monitoring and managing dependencies.
Upgraded Taichi to Utilize Class-Based Features: As part of the project’s development, I successfully upgraded Taichi and adopted its class-based features, thereby enhancing the codebase’s organization and maintainability.

Moving forward, I plan to continue working diligently to achieve the goals outlined in my proposal. The improvements made during the first half of the GSoC program have laid a strong foundation for the project’s growth and success.

Stay tuned for further updates and exciting developments as the project progresses!

Uncovering Actionable Insights using ReadTheDocs Analytics

Thu, 27 Jul 2023 00:00:00 +0000

Introduction

Hello again! This is Jack, a GSoC contributor for the OpenROAD Project. My task is to update and optimise the documentation to encourage user adoption and engagement.

For open-source repo maintainers, ReadTheDocs is a godsend. One of its more underrated features is the search and traffic analytics it provides, covering up to 90 days for Community tier users. This is awesome, because ReadTheDocs is “always free for open source and community projects”.

Motivation

Why are analytics important?

Analytics are great as a proxy indicator for documentation engagement. For instance, high traffic to a page could indicate how popular the tool is, or it could mean that the tool is unclear and people need more visits to the page to understand its usage. Either way, it indicates that the page deserves attention because of the increased visits.

In what follows we aim to provide a quick tutorial as well as list out some of the actionable insights we uncovered in the OpenROAD/OpenROAD-flow-scripts documentation project.

Preamble

To download the analytics raw csv files, refer to this website.

You should also have the following packages installed: pandas, numpy, matplotlib, scipy.

Traffic Analytics

Traffic analytics are easy to understand. The data comes in the format Date, Version, Path, DailyViews as follows:

import pandas as pd

# Reverse the row order of the export and keep only the date part of each timestamp.
df = pd.read_csv('ta_or.csv')[::-1].reset_index(drop=True)
df.Date = df.Date.apply(lambda x: x.split()[0])
df.head()

Figure 1: Loading traffic analytics dataframe

The raw data is not all that informative. Let us aggregate the data to obtain the weekly views.

weeklydf = df.copy()
weeklydf.Date = pd.to_datetime(weeklydf.Date) - pd.to_timedelta(7, unit='d')
weeklydf = weeklydf.groupby(['Path', pd.Grouper(key='Date', freq='W')])['Views']\
 .sum()\
 .reset_index()\
 .sort_values('Date')
weeklydf[weeklydf.Path == '/index.html']

Figure 2: Aggregated weekly traffic

Note that we can replace the page path with any interesting page path we desire. A useful command to obtain all possible page paths in this dataset is to use:

weeklydf.Path.unique()

Figure 3: Unique paths in dataset

With these neat data in our arsenal, let us do some plotting! For the visualisation, we have chosen to use the traffic aggregated on a daily scale. On top of this, we also plot a linear best-fit line of all the points to track the trendline over time.

The code below shows how to plot the top 20 pages.

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

def plot_views(df, numPages=20):
    # Group by Path and sum the views to rank pages by total traffic
    pathResults = df.groupby('Path').Views.sum().sort_values(ascending=False)
    fig, ax = plt.subplots(numPages, figsize=(15, 30))

    for i in range(numPages):
        key = pathResults.index[i]
        temp = df[df.Path == key]
        ax[i].scatter(temp.Date, temp.Views)
        ax[i].set_xticks(np.arange(0, 90, 7))  # this line is to not clutter the x-axis too much.
        ax[i].set_ylabel('Views')
        ax[i].set_title(key)

        # linear regression of views against the day index
        y = temp.Views
        days = np.arange(len(y))
        bestfit = stats.linregress(days, y)
        print(bestfit)
        equation = str(round(bestfit.slope, 2)) + "x + " + str(round(bestfit.intercept, 2))
        ax[i].plot(days, bestfit.slope * days + bestfit.intercept, '--', label=equation)
        ax[i].legend(loc='upper right')

    # Adjust the spacing once all subplots are drawn
    fig.tight_layout()

Figure 4: Top 20 pages by daily view counts (in descending order)

Also, we can aggregate the total views by day to plot daily traffic:

def plot_daily_traffic(df):
    # Group by Date and sum the views across all pages
    fig = plt.figure(figsize=(15, 10))
    dateResults = df.groupby('Date').Views.sum()
    x, y = dateResults.index, dateResults.values
    plt.scatter(x, y)
    plt.xticks(np.arange(0, 90, 7))
    plt.ylabel('Views')
    plt.title('Traffic by Day')

    # Linear regression of daily views against the day index
    days = np.arange(len(y))
    bestfit = stats.linregress(days, y)
    print(bestfit)
    equation = str(round(bestfit[0], 2)) + "x + " + str(round(bestfit[1], 2))
    plt.plot(days, bestfit[0] * days + bestfit[1], '--', label=equation)
    plt.legend(loc='upper right')

Figure 5: Daily aggregated traffic

Key Trends:

Notice the weekly cycle: average view counts rise from Monday to Friday, then fall off on weekends. This is most evident on the pages /index.html and /main/README.html, and could be attributed to the standard Monday-to-Friday work or study week.
According to the gradient of the best-fit line for the daily aggregated traffic (Figure 5), traffic to the OpenROAD docs is slowly declining. A gradient of -0.77 views per day translates to a decline of roughly 22 views per month. The small decline could be attributed to the higher traffic from 19-29 March 2023, the dates of the OpenROAD 7nm design contest. Contests are always good for driving traffic.
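One way to check the weekly cycle described above is to average the daily totals by weekday. The sketch below uses only the standard library and made-up numbers; the `daily_views` values and the `mean_views_by_weekday` helper are illustrative assumptions, not part of the original analysis.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def mean_views_by_weekday(daily_views):
    """Average the view counts per weekday (0 = Monday ... 6 = Sunday)."""
    buckets = defaultdict(list)
    for day_string, views in daily_views.items():
        weekday = date.fromisoformat(day_string).weekday()
        buckets[weekday].append(views)
    return {weekday: mean(values) for weekday, values in sorted(buckets.items())}

# Hypothetical daily totals for two weeks (not real OpenROAD data).
daily_views = {
    "2023-03-06": 120, "2023-03-07": 130, "2023-03-08": 125,
    "2023-03-09": 118, "2023-03-10": 110, "2023-03-11": 60,
    "2023-03-12": 55, "2023-03-13": 124, "2023-03-14": 126,
    "2023-03-15": 129, "2023-03-16": 122, "2023-03-17": 114,
    "2023-03-18": 64, "2023-03-19": 51,
}

averages = mean_views_by_weekday(daily_views)
weekday_mean = mean(averages[d] for d in range(5))   # Monday to Friday
weekend_mean = mean(averages[d] for d in (5, 6))     # Saturday and Sunday
print("weekday mean:", weekday_mean, "weekend mean:", weekend_mean)
```

If the weekday mean sits clearly above the weekend mean, the cycle is real rather than an artefact of how the scatter plot is read.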

Actionable insights:

Top pages are usually landing pages: index.html, main/README.html, main/src/README.html. We thus prioritised making these pages more readable and concise.
These are followed by the tutorials page /tutorials/index.html and /search.html. The prominence of the tutorials page made us move the tutorials link to a higher position on the left navigation sidebar. Search tips were also included to help users obtain better search results. More about search in the next section.
Next, as OpenROAD consists of 20 tools, traffic analytics helps us come up with an order in which to update their pages: ifp, gui, odb, ppl, sta, grt, mpl, gpl, rsz, rcx, pdn, cts, psm.
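An update order like the one above can be derived by summing views per tool. The following is a minimal sketch that assumes tool pages live under a path of the form /main/src/<tool>/; that layout, the sample rows, and the `rank_tools_by_views` helper are assumptions for illustration only.

```python
import re
from collections import Counter

# Assumed path layout for tool pages: /main/src/<tool>/... (hypothetical).
TOOL_PATH = re.compile(r"^/main/src/(?P<tool>[a-z0-9_]+)/")

def rank_tools_by_views(rows):
    """Sum views per tool and return the tool names, busiest first."""
    totals = Counter()
    for path, views in rows:
        match = TOOL_PATH.match(path)
        if match:  # skip landing pages and anything outside the tool tree
            totals[match.group("tool")] += views
    return [tool for tool, _ in totals.most_common()]

# Made-up (Path, Views) rows, not real traffic data.
rows = [
    ("/index.html", 900),
    ("/main/src/ifp/README.html", 300),
    ("/main/src/gui/README.html", 250),
    ("/main/src/ifp/README.html", 120),
    ("/main/src/odb/README.html", 200),
]

update_order = rank_tools_by_views(rows)
print(update_order)
```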

Search Analytics

Search analytics come in the form Date, Query, TotalResults. Unlike the view counts in traffic analytics, TotalResults does not refer to the number of times the query was searched that day; it is the total number of results returned by that query on that day. A separate aggregation is still needed to obtain the search counts.

Firstly, let us load the dataset and perform a groupby on the column Date to obtain the daily count aggregates.

df = pd.read_csv('sa_or.csv')[::-1].reset_index(drop=True)
df = df.rename(columns ={'Created Date': 'Date', 'Total Results': 'TotalResults'})
df.Date = df.Date.apply(lambda x: x.split()[0])

dateResults = df.groupby('Date').TotalResults.count()
dateResults

Figure 6: Code output for daily aggregated search counts.

Now we are ready to plot the daily aggregated searches. This represents the number of times a search was performed on the documentation website.

def plot_daily_searches(df):
    # Count the number of searches recorded on each day
    dateResults = df.groupby('Date').TotalResults.count()
    x, y = dateResults.index, dateResults.values
    plt.scatter(x, y)
    plt.xticks(np.arange(0, 90, 7))
    plt.ylabel('# Times Searched')
    plt.title('Search count by day')

    # Linear regression of daily search counts against the day index
    days = np.arange(len(y))
    bestfit = stats.linregress(days, y)
    print(bestfit)
    equation = str(round(bestfit[0], 2)) + "x + " + str(round(bestfit[1], 2))
    plt.plot(days, bestfit[0] * days + bestfit[1], '--', label=equation)
    plt.legend(loc='upper right')

Figure 7: Daily aggregated search counts

We can also take an additional look at queries that return zero results. In other words, we are interested in the terms people are curious about but that are not currently covered by our documentation. Think of it as on-site search engine optimisation.

zeroResults = df[df.TotalResults == 0]
zeroResults = zeroResults.groupby('Query').Date.count().sort_values(ascending=False)
print('\nAll 0 results queries (desc)\n')
print(zeroResults.index.tolist())

Example output as follows:

['autotuner', 'tdms', '*macro*', 'rtlmp_max_inst', 'get_property',
'check_setup', 'centos', 'initialize_padring', 'core_utilization',
'pin_access', 'read_libraries', 'config', 'eco', 'rpt',
'improve_placement', 'define_process_corner', 'global_place',
'report_worst_slack', 'max_phi_cof', 'report_power', 'get_pins',
'registerfile', 'set_global_routing', 'prebuilt', 'env',
'repair_clock_inverters', 'set_thread_count', 'report_',
'partition_design', 'place_cell', 'blockage', 'partitionmgr',
'nmos', 'tuner', 'write_sdf', 'place_density', 'place_pins_args',
'size_cell', '*macor*', 'repair_clock_inverter', 'misk',
'readhaty', 'readhat', 'obstruct', 'odbpy', 'openpdn', 'openram',
'placement_cfg', 'read_macro_placement', 'output_drc', 'positon',
'pct', 'qrctechtable', 'qrctechfile', 'qrctech', 'qrc',
'properly covered', 'precision innovations', 'repeater', '"rcx-0487"',
'report_worst', 'report_area', 'report_clock_properties', 'skywater',
'study', 'sv', 'synth', 'synth_hierarchical', 'systemverilog',
'tdm', 'tdms_place', 'triton', 'ungroup', 'verilog_files',
'wrc', 'write_lef', 'write_partition_verilog', 'שואם',
'si2', 'sever', 'setrc', 'rtl_macro', 'report_dcalc', 'report_design',
'report_design_info', 'report_instance', 'report_slews', 'resize',
'rtlmp', 'set_power_activity', 'rtree', 'run_all', 'run_all.tcl',
'sc', 'set_all_input_output_delays', 'set_io_pin_constraints', 'metis',
'lefdef', 'make_result_file', 'macro_placement_cfg', 'clock__details',
'clocks__details', 'combinational', 'config.mk', 'coord',
'core_margin', 'db_process_node', 'dbblocjs', 'dbdatabase',
'dbr', 'dbrt', 'dbrttree', 'debian', 'define_pin_shape',
'densiy', 'desgin', 'diff_file', 'clk_period', 'clk_io_ptc',
'cdl', 'analog', './env.sh', '178', '6_final',
'6_final.odb', '_placement', 'abat', 'add_stripe', 'arch',
'ccs', 'binaries', 'bookshelf', 'buff_cell', 'buildwithdocker',
'busbitchars', 'buschar', 'captable', 'directoryobject',
'disallow_one_site_gaps', 'distribute', 'is_port', 'hierarch',
'hop', 'hyper', 'initialie_flooorplan', 'initialize_flooorplan',
'instance_count', 'is_chip', 'lean', 'gui_final', 'lec',
'*def*', 'limitation', 'lyp', 'maco', 'macro_pin',
'macro_place', 'harness', 'gui.py', 'dont', 'fill_cell',
'dreamplace', 'em', 'enable_dpo', 'energy', 'env.sh', 'erc',
'export', 'findmaste', 'grt_layer_adjustments', 'findmaster',
'freepdk45', 'gdt', 'global_', 'global_place_db',
'global_placementy', 'graph', '갲']

In our case, these zero-result queries fall roughly into one of two categories:

Missing documentation: Either the parameter or the functionality is not documented yet.
Typo: The user has the right keyword, but did not type it correctly. We will therefore provide them with search tips, such as using the fuzziness operator ~N for better matches.
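A rough first pass at sorting zero-result queries into these two categories can be automated with fuzzy string matching. The sketch below uses `difflib.get_close_matches` from the standard library; the `documented_terms` list, the `classify_query` helper, and the 0.8 cutoff are assumptions chosen for illustration, and the output would still need a manual review.

```python
from difflib import get_close_matches

def classify_query(query, documented_terms, cutoff=0.8):
    """Label a zero-result query as a likely typo or as missing documentation."""
    matches = get_close_matches(query, documented_terms, n=1, cutoff=cutoff)
    if matches:
        # A documented term is very similar, so the query is probably a typo.
        return ("typo", matches[0])
    return ("missing", None)

# Hypothetical list of terms that already appear in the docs.
documented_terms = ["initialize_floorplan", "global_placement", "density", "design"]

for query in ["initialie_flooorplan", "densiy", "desgin", "autotuner"]:
    print(query, classify_query(query, documented_terms))
```

Queries labelled "missing" are candidates for new documentation, while those labelled "typo" suggest where search tips would help most.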

Future Work

ReadTheDocs could also be linked with Google Analytics, but this remains for more advanced users.

Another rich source of information for open-source maintainers is GitHub issues, the platform where users discuss their problems directly. Metrics such as installation issues per week, or the user-issue retention rate, which tracks the number of users that continue to file issues after their first, are another great way to track documentation engagement.
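As a sketch of how these two issue metrics might be computed, assume each issue has been reduced to a (user, opened-date) pair, for example after filtering for installation-related labels. The record format, the sample data, and both helper names below are hypothetical.

```python
from collections import Counter
from datetime import date

def issues_per_week(issues):
    """Count issues per ISO (year, week) pair."""
    counts = Counter()
    for _, opened in issues:
        iso = date.fromisoformat(opened).isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)

def retention_rate(issues):
    """Fraction of users who filed at least one more issue after their first."""
    per_user = Counter(user for user, _ in issues)
    if not per_user:
        return 0.0
    returning = sum(1 for count in per_user.values() if count > 1)
    return returning / len(per_user)

# Made-up issue records: (user, date opened).
issues = [
    ("alice", "2023-03-06"),
    ("bob", "2023-03-07"),
    ("alice", "2023-03-15"),
    ("carol", "2023-03-16"),
]

print(issues_per_week(issues))
print(retention_rate(issues))
```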

Conclusion

This post showcases the amount of insight one can gather from parsing traffic and search analytics. It also provides useful Python functions that can be applied to the analytics dataset for fast prototyping and experimentation. If you are a contributor to open-source projects, try uncovering some insights for your doc pages today!

Halfway Through GSOC: My Experience and Learnings

Mon, 17 Jul 2023 00:00:00 +0000

Hello there! I’m Jonathan Edwin, all the way from the beautiful archipelago of Indonesia. This year, I got the exciting chance to jump on board the 2023 Summer of Reproducibility initiative. It’s been quite the adventure! Right now, I’m pouring my energy into a fascinating project titled Using Reproducibility in Machine Learning Education. I’m thrilled to be able to make my own little mark on it.

For those of you who are not familiar with what I’m working on, let me shed some light. My project, as part of the “Using Reproducibility in Machine Learning Education” initiative under guidance of Fraida Fund, focuses on creating educational resources that center around reproducing some key machine learning techniques. These include Cutout data augmentation, U-Net, and Siamese networks, to name a few. The end product will be a series of interactive Jupyter notebooks that provide step-by-step guidance for students, helping them not only understand these complex models but also gain hands-on experience in achieving research reproducibility.

Progress and Challenges

Embarking on this project, I dove headfirst into the world of Cutout data augmentation, immersing myself in the many experiments outlined in the foundational paper. This initial study proved to be an intricate blend of multiple datasets, two network architectures, and a performance evaluation of models with and without Cutout data augmentation. Additionally, it included the exploration of these models in combination with other data augmentation techniques.
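For readers new to the technique, Cutout masks out a square patch of each training image so the model cannot rely on any single region. Here is a minimal pure-Python sketch on a 2D list of pixel values; it is an illustration rather than the notebooks' implementation, and the `cutout` function name and its parameters are assumptions.

```python
import random

def cutout(image, size, rng=random):
    """Return a copy of a 2D image with one square patch set to zero.

    The patch is centred on a random pixel and clipped at the image
    borders, so it may be smaller than size x size near the edges.
    """
    height, width = len(image), len(image[0])
    center_y, center_x = rng.randrange(height), rng.randrange(width)
    half = size // 2
    top, bottom = max(0, center_y - half), min(height, center_y + half)
    left, right = max(0, center_x - half), min(width, center_x + half)

    result = [row[:] for row in image]  # copy so the input stays untouched
    for y in range(top, bottom):
        for x in range(left, right):
            result[y][x] = 0
    return result

# An 8x8 "image" of ones, masked with a patch of up to 4x4 pixels.
image = [[1] * 8 for _ in range(8)]
augmented = cutout(image, size=4, rng=random.Random(0))
for row in augmented:
    print(row)
```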

One of our main objectives has been to help students visualize how the model interacts with the data, and for this, we’ve been leveraging a tool called Grad-CAM. The initial paper provided a rich landscape for exploration and learning, leading us to segment our journey into a set of interactive Jupyter notebooks: Introduction, Cutout, ResNet, WideResNet, Regularization, and Grad-CAM.

I’m excited to share that, as we’ve hit the mid-term milestone, I’ve managed to make significant strides and completed the notebooks up to the WideResNet section. It’s been a journey full of learning and growth, overcoming various challenges along the way - understanding the intricacies of the experiments, deconstructing complex architectures, and distilling all this into digestible, interactive notebooks for students. Despite the challenges, the process has been incredibly rewarding. As we gear up for the next half of the project, I’m eager to tackle the remaining sections and share my work with the community.

Learnings and Skills Gained

Embracing the Iterative Process of Open Source Development: My initial foray into open source development had me writing and running code in one environment, then copying parts of it to another environment and pushing it from there to GitHub. This occasionally led to mistakes during the code migration. However, I’ve since learned to write or change a little bit of code, run the new version directly from GitHub, catch errors, and improve. In open source development, the end goal is to ensure everything works flawlessly, even if it involves several iterations. This is especially true considering the code from GitHub might directly run on platforms like Chameleon or Google Colab.

Understanding the Distinction between Reproducing Experiments and Crafting Educational Content: There’s a stark difference between merely reproducing an experiment from a research paper and creating an educational resource around that experiment. The former generally involves cloning and running the code, verifying it against the claims in the paper with minimal modifications. The latter, however, necessitates adapting and simplifying the code, regardless of the learner’s skill level, to ensure their comprehension. It’s about carefully guiding learners through each step for a more profound understanding.

The Power of ‘Show, Don’t Tell’: This priceless lesson was imparted by my mentor, Ms. Fraida Fund. Rather than telling me what to do when I erred or needed to learn something new, she demonstrated the correct way first-hand. This hands-on approach made understanding far easier. This principle is also reflected in the creation of our notebooks. For instance, we chose to include the Grad-CAM notebook. Although not directly referenced in the paper, it offers students a clear visual understanding of the impact of the Cutout technique, embodying the “show, don’t tell” philosophy.

Next Steps

As we step into the second half of this thrilling journey, our primary goal is to complete the remaining sections of our Cutout project. We’re setting our sights on the final notebook - Grad-CAM. The Grad-CAM notebook will offer a visual exploration of how our models interpret and interact with data, thereby solidifying the students’ understanding of Cutout data augmentation. So, stay tuned for more as we plunge into these fascinating topics!

Conclusion

Looking back, my time with the Summer of Reproducibility initiative has been nothing short of a profound learning experience. Working on the “Using Reproducibility in Machine Learning Education” project has been both challenging and rewarding, and I am incredibly grateful for this opportunity.

I’ve gained valuable insights into open-source development, delved deeper into the intricacies of machine learning techniques, and experienced firsthand the transformative power of a ‘show, don’t tell’ teaching approach. Moreover, I’ve learned that the creation of educational resources requires a delicate balance between preserving the essence of original research and adapting it to foster easy understanding.

As we press forward, I’m excited about the prospects of the coming weeks. The completion of the Grad-CAM notebook lies ahead, marking the final pieces of our Cutout project. Beyond this project, the skills and lessons I’ve acquired during this initiative will undoubtedly guide me in future endeavours.

I can confidently say that my GSOC journey has been a remarkable chapter in my growth as a developer and researcher. Here’s to more learning, more coding, and more breakthroughs in the future!

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Wed, 12 Jul 2023 00:00:00 +0000

As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time project, my proposal, under the mentorship of In Kee Kim and Martin Putra and in collaboration with Charis Christopher Hulu (another OSRE fellow), aims to analyze large-scale sequencing datasets in order to gain insights into how ‘input quality’ affects genomic workflows’ execution times.
Recent advancements in Next-Generation Sequencing (NGS) technologies have resulted in massive amounts of nucleotide sequence data and automated genomic workflows to streamline analysis and data interpretation. The success of NGS-driven research has also led to a sudden increase in data of varying size and complexity, making it more time-consuming for researchers to test hypotheses. Analyzing high-throughput genomic data requires a step-by-step execution of dedicated tools - also known as workflows. The first step toward the execution of a typical genomic analysis workflow is quality control of the raw data - a crucial step in removing low-quality data instances that may significantly impact the downstream analysis. Prior work in this area has suggested that the runtimes of genomic workflows get affected due to qualitative differences in the data. Additionally, there is very little consensus on what constitutes “input quality” regarding data from large genomic experiments. In this proposal, we hypothesize that genomic data quality significantly impacts the genomic workflows’ execution time. We aim to leverage machine learning techniques to extract predictive features from quality control tools that robustly predict workflow execution time.

Highlighting and Formatting Pyrope HDL

Thu, 22 Jun 2023 00:00:00 +0000

As part of Micro Architecture Santa Cruz (MASC) my proposal under the mentorship of Jose Renau aims to develop syntax highlighting and a vertical alignment tool for Pyrope. Pyrope is a modern hardware description language under development by MASC. Code is parsed with the tree-sitter grammar for Pyrope. I am working on developing a query file for the nvim-treesitter plugin. This gives neovim users Pyrope syntax highlighting based on the parse tree. In addition to syntax highlighting, I am working on a vertical alignment tool to improve code readability. These features will improve the usability and convenience of Pyrope.

Proactive Data Containers

Tue, 20 Jun 2023 00:00:00 +0000

As part of the Proactive Data Containers (PDC) project, my proposal under the mentorship of Houjun Tang aims to contribute to a novel data abstraction for managing science data in an object-oriented manner. PDCs will provide efficient strategies for moving data in deep storage hierarchies and techniques for transforming and reorganizing data based on application requirements. The functionality of the container objects themselves is already well developed, so my goal will be to verify the functionality tests for the Python API to ensure that it can be used with ease, and to create command-line tools so that PDC is a complete data object that can be used across platforms and is simple and helpful for users.

Public Artifact Data and Visualization

Sat, 17 Jun 2023 00:00:00 +0000

Hello! As part of the Public Artifact Data and Visualization project, our proposals (proposal from Jiayuan Zhu and proposal from Krishna Madhwani) under the mentorship of Anjo Vahldiek-Oberwagner aim to design a system that allows researchers to conveniently record and compare environmental information, such as CPU utilization, across different iterations and versions of code during an experiment.

In academic experiments, there is often a need to compare results and performance between different iterations and versions. This comparative analysis helps researchers evaluate the impact of different experimental parameters and algorithms on the results and enables them to optimize experimental design and algorithm selection. However, to conduct effective comparative analysis, it is essential to record and compare environmental information, alongside the experimental data. This information provides valuable insights into the factors that may influence the observed outcomes.

Through this summer, we aim to develop a system that offers a streamlined interface, enabling users to effortlessly monitor their running programs using simple command-line commands. Moreover, our system will feature a user-friendly dashboard where researchers can access historical runtime information and visualize comparisons between different iterations. The dashboard will present comprehensive graphs and charts, facilitating the analysis of trends and patterns in the environmental data.

Interactive Exploration of High-dimensional Datasets with PolyPhy and Polyglot

Fri, 16 Jun 2023 00:00:00 +0000

Hello! My name is Kiran and this summer I’ll be working with Polyphy and Polyglot under the mentorship of Oskar Elek. The full proposal is available online.

For a brief overview, the Polyglot app allows users to interact with a 3D network of high-dimensional language embeddings, specifically the Gensim Continuous Skipgram result of Wikipedia Dump of February 2017 (296630 words) dataset. The high-dimensional embeddings are reduced to 3 dimensions using UMAP. The novel MCPM slime mold metric is then used to compute the similarity levels between points (much like how you might compute the Euclidean distance between two points). These similarity levels are used to filter the network and enable users to find interesting patterns in their data they might not find using quantitative methods alone. For example, the network has a distinct branch in which only years are nearby! Users might find other clusters, such as ones with sports words or even software engineering words. Although such exploration may not lead to quantitatively significant conclusions, the ability to explore and test mini hypotheses about the data can lead to important insights that go on to incite quantitatively significant conclusions.

In our project, we aim to expand Polyglot such that any user can upload their own data, once they have computed the MCPM metric using PolyPhy. This will have important applications in building trust in our data and embeddings. This could also help with research on the MCPM metric, which presents a new, more naturalistic way of computing similarity by relying on the principle of least effort. Overall, there is an exciting summer ahead and if you’re interested in keeping up please feel free to check out the Polyglot app on Github!

Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time

Fri, 16 Jun 2023 00:00:00 +0000

Hi! I’m Charis, an undergraduate student in the IT and Big Data Analytics program at the Calvin Institute of Technology. As part of the Reproducible Analysis & Models for Predicting Genomics Workflow Execution Time my proposal under the mentorship of In Kee Kim and Martin Putra aims to gain insight into features that are highly correlated with execution times of genomics workflows and build machine learning models for predicting workflow execution time.

Genomics workflows exhibit a long-tail pattern in their execution times. According to the previous project team’s findings, approximately 2% of genomics workflows had a median execution time of up to 15%, resulting in weeks of execution. Interestingly, it was observed that input quality plays a role in these execution time differences. Therefore, we will analyze features such as the quality of input data as well as the amount of resources allocated in the execution of genomics workflows to find features that correlate with execution time. Based on these features we will build a machine learning model that can predict the execution time of genomics workflows.

By collaborating with Shayantan Banerjee (another contributor) who will study data quality, I will study the system metrics of genomics workflows both at workflow-level and tool-level. Metrics will be collected by running genomics workflows using the Slurm workload manager under various resource allocation conditions. Genomics workflows will be executed on Chameleon clusters of different sizes.

GPU Emulator for Easy Reproducibility of DNN Training

Tue, 13 Jun 2023 00:00:00 +0000

Hi! I’m Haoran Wu, a third year at the University of Chicago majoring in Economics and Computer Science. With my proposal, I’m working on the GPU Emulator for Easy Reproducibility of DNN Training project with Professor Vijay Chidambaram. A Deep Neural Network (DNN) is an advanced artificial neural network that employs multiple layers to process intricate patterns and relationships within data. It finds applications in various fields such as image and speech recognition, natural language processing, and predictive modeling. The layers in a DNN progressively extract higher-level features from raw input data, enabling the network to learn and generalize patterns effectively.

Nevertheless, not all DNN research experiments require the use of a GPU. System researchers, for instance, may be primarily interested in performance profiles and not necessarily in the accuracy of training or inference. These researchers might focus on optimizing the storage layer and data loading of DNN training. In such cases, a GPU emulator that accurately replicates GPU behavior without needing a physical GPU can fulfill their requirements. By utilizing a GPU emulator, system researchers can evaluate their system optimizations’ performance without competing for limited GPU resources in the cloud, thereby avoiding unnecessary delays in their research progress. Our work will eventually be open source and benefit the community.

Optimizing FasTensor: Enabling Efficient Tensor Execution on GPUs

Mon, 05 Jun 2023 00:00:00 +0000

Greetings,

I am Rishabh Singh, and I am excited to be part of the 2023 Google Summer of Code program. My proposal under the mentorship of John Wu and Bin Dong focuses on optimizing the FasTensor tensor computing library for efficient usage on GPUs, specifically targeting tensor contraction while preserving structure-locality. This optimization is crucial for scientific applications and advanced AI model training. Throughout the project, I will develop custom computational operations for GPUs, implement FasTensor on GPUs, assess its performance, and provide comprehensive documentation. By the end, I aim to deliver a working implementation, a performance report, and a detailed execution mechanism guide. Leveraging my background in software engineering and machine learning, I will use tools such as C++ and OpenMP to ensure efficient memory management and data movement. Stay tuned for regular updates and informative blogs as I progress through the summer.

Using Reproducibility in Machine Learning Education

Mon, 05 Jun 2023 00:00:00 +0000

I am Jonathan Edwin, coming from Indonesia, and I am extremely thrilled to be involved in the 2023 Summer of Reproducibility initiative, where I am contributing to the Using Reproducibility in Machine Learning Education project.

As part of the Using Reproducibility in Machine Learning Education my proposal under the mentorship of Fraida Fund aims to develop educational resources focusing on reproducing and replicating fundamental machine-learning techniques, such as Cutout data augmentation, U-Net, and Siamese networks. The project aims to provide students with a hands-on learning experience that enhances their understanding of the models and their underlying principles while imparting valuable skills in ensuring research reproducibility. The project will involve the creation of a series of interactive Jupyter notebooks covering the selected papers, guiding students through reproducing results, and focusing on best practices for ensuring reproducibility. Upon completion, the notebooks will provide a comprehensive and accessible learning experience for students while emphasizing the importance of reproducibility in machine learning education. The proposal also identifies potential challenges associated with the project and proposed solutions to address them. Challenges include incompatibility issues with the original code and current frameworks or environments, difficulty in reproducing the exact results due to factors such as randomness or lack of specific details in the paper, and ensuring that the interactive elements in the Jupyter Notebooks are engaging and effective in teaching reproducibility concepts.

FlashNet: Towards Reproducible Continual Learning for Storage System

Sun, 04 Jun 2023 00:00:00 +0000

Hello! I’m Rani, a third year undergraduate student at Institut Teknologi Bandung majoring in Informatics. As part of the FlashNet project, my proposal under the mentorship of Haryadi S. Gunawi and Daniar Kurniawan aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques.

In real-world workloads, the I/O stream is known to change and vary. Hence, the performance of I/O reads and writes can vary and introduce tail latency. We would like to predict the latency of I/O reads to cut the tail and improve the system’s performance. This project focuses on improving the FlashNet pipeline and introducing adaptability to the machine learning models built.

During the summer, we plan to implement the continual learning pipeline using the machine learning models we have built previously in the project. Of course, continual learning is not continual learning without the ability of self-motivated retraining. Thus, we will implement, evaluate, and test several drift detection algorithms. We will also build a visualization platform to evaluate and monitor the performance of the models built. Lastly, we plan to create Chameleon Trovi artifacts to demonstrate our experiments and make these implementations available and reproducible to the public.

Introducing Levels of Reproduction and Replication in Machine Learning

Thu, 01 Jun 2023 00:00:00 +0000

Greetings everyone,

I am Mohamed Saeed and I am delighted to be part of the 2023 Summer of Reproducibility program, where I am contributing to the Using Reproducibility in Machine Learning Education project.

My proposal was accepted, and I am fortunate to have Fraida Fund as my mentor. The objective of my project is to develop highly interactive open educational resources that can be utilized by instructors teaching graduate or undergraduate machine learning courses. These resources will focus on integrating instruction on reproducibility and reproducible research principles.

Understanding and practicing reproducibility in machine learning (ML) research is of utmost importance in today’s scientific and technological landscape. Reproducibility ensures the reliability, transparency, and credibility of ML findings and discoveries. By learning the principles of reproducibility, students at different levels can validate research results, test introduced methodologies, and understand the level of reproducibility of research.

My contribution will involve developing interactive educational resources that encompass code examples, writing exercises, and comprehensive explanations of key concepts of reproducing ML research. These resources will be carefully crafted to assist students at various levels of expertise. Our aim is for these resources to be widely adopted by instructors teaching graduate or undergraduate machine learning courses, as they seek to enhance the understanding of reproducibility and reproducible research principles.

I think this is a great opportunity to learn more about ML research reproducibility. I’ll be posting regular updates and informative blogs throughout the summer, so stay tuned!

ScaleBugs: Reproducible Scalability Bugs

Thu, 01 Jun 2023 00:00:00 +0000

Hello! As part of the ScaleBugs project, our proposals (proposal from Goodness Ayinmode and proposal from Zahra Nabila Maharani) under the mentorship of Cindy Rubio González, Haryadi S. Gunawi and Hao-Nan Zhu aim to build a dataset of reproducible scalability bugs by analyzing bug reports from popular distributed systems like Cassandra, HDFS, Ignite, and Kafka. For each bug report, we will analyze whether the reported bug is influenced by the scale of the operation, such as the number of nodes being used or the number of requests. The resulting dataset will consist of bug artifacts containing the buggy and fixed versions of the scalability system, a reproducible runtime environment, and workload shell scripts designed to demonstrate bug symptoms under different scales. These resources will help support research and development efforts in addressing scalability issues and optimizing system performance.

Reproducible Evaluation of Multi-level Erasure Coding

Wed, 31 May 2023 00:00:00 +0000

Hi! My name is Alex, an undergraduate student at the University of Chicago. As part of the Reproducible Evaluation of Multi-level Erasure Coding, my proposal under the mentorship of John Bent and Anjus George aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations.

To provide some context, Erasure Coding (EC) is a common approach to protect data from disk failures. Data centers nowadays increasingly use Multi-Level Erasure Coding (MLEC), a newly developed erasure coding method that aims to deal with the drawbacks of Single-Level Erasure Coding (SLEC). Despite its increasing popularity, there have not been many systematic studies to analyze and evaluate MLEC, which is the focus of this project.
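To make the idea of erasure coding concrete, here is a minimal sketch of the simplest erasure code: a single XOR parity chunk that can rebuild any one lost chunk. It is an illustration only, not the MLEC or SLEC schemes this project evaluates, and the helper names `encode` and `recover` are hypothetical.

```python
from functools import reduce

def xor_chunks(chunks):
    """XOR equal-length byte chunks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def encode(data_chunks):
    """Return the data chunks followed by one XOR parity chunk."""
    return list(data_chunks) + [xor_chunks(data_chunks)]

def recover(stripe, lost_index):
    """Rebuild the chunk at lost_index by XOR-ing all surviving chunks."""
    survivors = [chunk for i, chunk in enumerate(stripe) if i != lost_index]
    return xor_chunks(survivors)

# Three data chunks, each imagined to live on a different disk.
data = [b"disk", b"fail", b"safe"]
stripe = encode(data)

# Pretend the disk holding chunk 1 failed and rebuild its contents.
rebuilt = recover(stripe, lost_index=1)
print(rebuilt)
```

A single parity chunk tolerates only one failure per stripe; practical codes add more parity chunks to survive several simultaneous failures.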

The evaluation will primarily be conducted through simulations, since modifying configurations in a real large-scale system is costly and impractical. The expected deliverables of this project will be:

An MLEC simulator that can reproducibly simulate different configurations of the MLEC system, e.g. coding parameter selection, chunk placement scheme, repair method choice, etc.
An analysis of the performance and durability tradeoffs between different MLEC design choices based on the evaluation results from the simulation
Reproduced SLEC evaluation results using existing SLEC simulators
A comparison between MLEC and SLEC on performance and durability tradeoffs
Well-written documents and detailed guides on how to reproduce the evaluation results

Our plan is to build the simulator throughout the summer. We hope our simulator and evaluation results can provide designers of large-scale storage systems with valuable insights on choosing the most appropriate erasure coding configuration per their needs.
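To make the SLEC versus MLEC tradeoff concrete, here is a minimal back-of-the-envelope sketch. It is not the planned simulator: it assumes disks fail independently with a fixed probability and ignores repair entirely, and it models MLEC as a network-level code applied over local groups that each use their own local code. All function and parameter names are hypothetical.

```python
from math import comb


def prob_more_than(n, t, p):
    """P[X > t] for X ~ Binomial(n, p): more than t of n units fail."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(t + 1, n + 1))


def slec_loss(k, m, p_disk):
    """A (k+m) single-level stripe loses data when more than m chunks fail."""
    return prob_more_than(k + m, m, p_disk)


def mlec_loss(k_net, m_net, k_loc, m_loc, p_disk):
    """Two-level model: a local group is lost when more than m_loc of its
    disks fail; data is lost when more than m_net local groups are lost."""
    p_local = prob_more_than(k_loc + m_loc, m_loc, p_disk)
    return prob_more_than(k_net + m_net, m_net, p_local)


def overhead(k, m):
    """Raw storage consumed per unit of user data."""
    return (k + m) / k


p = 0.01  # assumed per-disk failure probability within the window of interest
print("SLEC (10+2) loss:", slec_loss(10, 2, p), "overhead:", overhead(10, 2))
print("MLEC (10+2)/(10+2) loss:", mlec_loss(10, 2, 10, 2, p),
      "overhead:", overhead(10, 2) * overhead(10, 2))
```

The extra level lowers the loss probability in this model but multiplies the storage overhead, which is the kind of tradeoff the simulator is meant to quantify under more realistic assumptions such as repair time and chunk placement.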

[FLASHNET]: Leveraging ML-augmented I/O in Linux

Tue, 30 May 2023 00:00:00 +0000

Hi! I’m Justin, an undergraduate at the University of Chicago. As part of the Flashnet project my proposal under the mentorship of Daniar Kurniawan and Haryadi S. Gunawi aims to port the Flashnet model into the Linux kernel.

In this attempt, I will borrow architecture/design choices from LAKE (to take advantage of its integration of ML-focused hardware acceleration in the kernel) and evaluation criteria from LinnOS to test for model inference accuracy. I also plan to support latency “bucket” inference output to improve accuracy. Ultimately, my goal is to gain further insight into best practices for integrating ML models into real-life operating systems like Linux and to inform general design choices for the Flashnet pipeline.
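As a rough illustration of what latency "bucket" output could look like, the sketch below maps a latency to one of several ordered buckets instead of a single fast/slow label. The bucket edges and function names are hypothetical and are not taken from Flashnet, LAKE, or LinnOS.

```python
from bisect import bisect_right

# Hypothetical bucket edges in microseconds: [0,100), [100,500), [500,2000), [2000,inf)
BUCKET_EDGES_US = [100, 500, 2000]


def latency_bucket(latency_us, edges=BUCKET_EDGES_US):
    """Return the index of the bucket that contains latency_us."""
    return bisect_right(edges, latency_us)


def bucket_accuracy(predicted_buckets, actual_latencies_us, edges=BUCKET_EDGES_US):
    """Fraction of I/Os whose predicted bucket matches the measured one."""
    hits = sum(
        1
        for pred, actual in zip(predicted_buckets, actual_latencies_us)
        if pred == latency_bucket(actual, edges)
    )
    return hits / len(actual_latencies_us)


print(latency_bucket(50), latency_bucket(499), latency_bucket(5000))
```

A model trained against such labels can be scored per bucket, which gives a finer-grained view than binary accuracy.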

Intro: Open Source Autonomous Vehicle Controller

Tue, 30 May 2023 00:00:00 +0000

As part of the Open Source Autonomous Vehicle Controller Project my proposal under the mentorship of Aaron Hunter and Carlos Espinosa aims to create comprehensive technical documentation to help onboard new users of the OSAVC controller. I will be writing tutorials and examples to demonstrate how to start with an OSAVC, programming it with the robotic equivalent of HelloWorld and later moving onto more sophisticated explanations. Hence, this will encourage more applications and wider adoption in the field of autonomous vehicles and expand the community of OSAVC users.

Reproduce and benchmark self-adaptive edge applications under dynamic resource management

Tue, 30 May 2023 00:00:00 +0000

Hello there!

I am Faishal Zharfan, a senior-year student studying Telecommunication Engineering at Bandung Institute of Technology (ITB) in Bandung, Indonesia. I’m currently working on Edgebench, as described in my proposal, under the mentorship of Yuyang Huang. The main goal of this project is to reproduce and benchmark self-adaptive video applications using the proposed solution.

The topic that I’m currently working on, “Reproduce and benchmark self-adaptive edge applications under dynamic resource management” (also known as Edgebench), is led by Prof. Junchen Jiang and Yuyang Huang. Edgebench focuses on how to efficiently distribute resources (bandwidth and CPU usage) across several video applications. Today’s video applications process their video on a server, an approach known as edge computing, so bandwidth and compute are the main concerns over a WAN, where both are strictly limited. We could distribute the bandwidth evenly across the cameras, but each camera’s bandwidth and compute needs are different. The recently proposed solution to this problem is called “accuracy gradient”: it tells us how much bandwidth an application needs at a given time to achieve higher accuracy. The goal is to allocate more bandwidth to the applications with the larger F1-score improvement and to take bandwidth away from those whose F1-score would not drop significantly. In the end, this yields a higher total F1-score.

Throughout this summer, we plan to implement the “accuracy gradient” and test several baselines to compare against it. As for the implementation, we are currently working on the latency measurement. We are aware that this solution incurs overhead, so the latency must be taken into account.
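The reallocation idea can be sketched as a simple greedy loop. This is an illustrative reading of "accuracy gradient", not the Edgebench implementation: the F1 profiles, the step size, and every name below are assumptions. Each round estimates how much F1 each application would gain from one more unit of bandwidth and how much it would lose from one less, then moves a unit from the cheapest donor to the most profitable receiver while the net change is positive.

```python
def reallocate(profiles, allocation, step=1.0, max_rounds=100):
    """Greedy bandwidth reallocation driven by finite-difference F1 gradients.

    profiles:   app name -> function mapping bandwidth to F1 score (assumed known)
    allocation: app name -> current bandwidth; the total is preserved
    """
    alloc = dict(allocation)
    for _ in range(max_rounds):
        gains = {a: profiles[a](alloc[a] + step) - profiles[a](alloc[a]) for a in alloc}
        losses = {
            a: profiles[a](alloc[a]) - profiles[a](alloc[a] - step)
            for a in alloc
            if alloc[a] - step >= 0
        }
        best = None  # (net F1 change, receiver, donor)
        for receiver in gains:
            for donor in losses:
                if receiver != donor:
                    net = gains[receiver] - losses[donor]
                    if best is None or net > best[0]:
                        best = (net, receiver, donor)
        if best is None or best[0] <= 1e-12:
            break  # no transfer improves the total F1 score
        _, receiver, donor = best
        alloc[receiver] += step
        alloc[donor] -= step
    return alloc


# Hypothetical profiles: cam_a keeps improving up to 10 units of bandwidth,
# cam_b saturates at 2 units.
profiles = {
    "cam_a": lambda bw: min(1.0, bw / 10.0),
    "cam_b": lambda bw: 0.9 * min(1.0, bw / 2.0),
}
even_split = {"cam_a": 5.0, "cam_b": 5.0}
tuned = reallocate(profiles, even_split)
print(tuned)
```

With these made-up profiles the loop shifts bandwidth from the saturated camera to the one that still benefits, so the total F1 score ends up higher than under the even split.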

Enhancing and Validating LiveHD's Power Modeling Flow

Mon, 29 May 2023 00:00:00 +0000

As part of the Enhancing and Validating LiveHD’s Power Modeling Flow my proposal under the mentorship of Jose Renau and Sakshi Garg aims to enhance and validate LiveHD’s power modeling flow, a critical feature for estimating power consumption in modern hardware designs. The existing flow requires further refinement to ensure its stability, accuracy, compatibility with a wider range of netlists and VCD files, and overall performance. To address these challenges, the project will focus on methodically debugging the current implementation, establishing a comprehensive validation methodology for verifying the accuracy of power estimates, and optimizing the flow to handle larger netlists and VCD files efficiently. Additionally, the project aims to improve existing documentation by providing detailed explanations, examples, and tutorials to facilitate user adoption and understanding. Upon successful completion, the project will deliver a more reliable, accurate, and efficient power modeling flow within LiveHD, contributing to the development of energy-efficient hardware designs. This refined flow will not only enhance the capabilities of LiveHD but also encourage wider adoption and utilization by the hardware design community, fostering innovation in the field of energy-efficient devices and systems.

High Fidelity UAV Simulation Using Unreal Engine with specular reflections

Mon, 29 May 2023 00:00:00 +0000

As part of the Open Source Autonomous Vehicle Controller my proposal under the mentorship of Aaron Hunter and Carlos Espinosa aims to develop an Unreal Engine-based simulator for testing. The simulator will use Unreal Engine for physics and visualization.

The existing framework uses the Gazebo simulator with ROS, which limits development to the Python and C++ programming languages. I intend to develop this simulator so that it connects with Python and C++, and additionally to expand support to MATLAB so that control algorithm design and validation become easier in the future. To smooth future development, I intend to add detailed documentation consisting of weekly reports from the development period, examples, and tutorials. Upon successful completion, the project will deliver a powerful simulator with realistic simulation using Unreal Engine and additional support for other programming languages such as MATLAB.

For more information about the Open Source Autonomous Vehicle Controller and the UC OSPO organization, you can visit the OSAVC project repository and the UC OSPO website.

OpenRAM Layout versus Schematic (LVS) visualization

Mon, 29 May 2023 00:00:00 +0000

As part of the OpenRAM Layout versus Schematic (LVS) visualization my proposal under the mentorship of Jesse Cirimelli-Low and Matthew Guthaus aims to develop a comprehensive Python-based graphical user interface (GUI) with a robust backend system to effectively analyze, visualize, and debug layout versus schematic (LVS) mismatches in the OpenRAM framework. The proposed solution focuses on efficiently processing LVS report files in JSON format, identifying mismatched nets in the layout, and visually representing extra nets in the schematic graph using advanced backend algorithms. By implementing a powerful backend system, the GUI will streamline the debugging process and improve overall productivity, while maintaining high performance and reliability. The deliverables for this project include a fully functional GUI with a performant backend, features for visualizing and navigating through LVS mismatches, comprehensive documentation, and user guides.

Automatic Cluster Performance Shifts Detection Toolkit

Sat, 27 May 2023 00:00:00 +0000

Hi! I am Kangrui, a Pre-doc student at the University of Chicago. As part of the Automatic Cluster Performance Shifts Detection Toolkit my proposal under the mentorship of Sandeep Madireddy and Ray Andrew aims to design a real-time performance shift detection algorithm for high-performance computing clusters, ensuring minimal overheads.

This project focuses on developing a real-time performance shift detection algorithm tailored to heterogeneous workloads, aiming to promptly inform administrators about performance changes. The primary goal is to design an algorithm that efficiently detects shifts in real-time, with minimal system overheads.
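To give a feel for what a low-overhead streaming detector can look like, here is a sketch of a two-sided CUSUM detector. CUSUM is a standard change-detection technique chosen here only for illustration; it is not the algorithm the project will design, and the target mean, slack, and threshold values are assumptions.

```python
class CusumDetector:
    """Two-sided CUSUM: constant memory and constant work per sample."""

    def __init__(self, target_mean, slack, threshold):
        self.target_mean = target_mean  # expected value under normal behaviour
        self.slack = slack              # deviations smaller than this are ignored
        self.threshold = threshold      # accumulated drift that raises an alarm
        self.pos = 0.0                  # evidence for an upward shift
        self.neg = 0.0                  # evidence for a downward shift

    def update(self, x):
        """Feed one sample; return True when a shift is detected."""
        self.pos = max(0.0, self.pos + (x - self.target_mean - self.slack))
        self.neg = max(0.0, self.neg + (self.target_mean - x - self.slack))
        if self.pos > self.threshold or self.neg > self.threshold:
            self.pos = self.neg = 0.0  # reset after raising the alarm
            return True
        return False


# Hypothetical metric stream: 20 samples around 100, then a jump to 110.
stream = [100.0] * 20 + [110.0] * 10
detector = CusumDetector(target_mean=100.0, slack=2.0, threshold=20.0)
alarms = [i for i, value in enumerate(stream) if detector.update(value)]
print(alarms)
```

The slack and threshold trade detection delay against false alarms, which is where noise filtering and workload heterogeneity make the real problem harder than this toy stream.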

In addition to algorithm development, we plan to enhance the Darshan toolkit’s functionality by integrating our algorithm, offering users early performance shift detection. This integration will aid administrators in making informed system utilization and scheduling decisions.

To promote transparency and reproducibility, we’ll encapsulate our findings, scripts, and profiling data within a Jupyter notebook, shared in particular through Chameleon Trovi, enabling other researchers to reproduce our experiments easily.

Looking ahead, we plan to expand the algorithm’s applicability to cater to diverse HPC workloads and infrastructures. Other areas of interest include its use in detecting shifts in financial markets or monitoring IoT data streams. Further refinement of our algorithm, to reduce overheads and improve real-time detection capabilities, is also a part of our future endeavours. This task may involve evaluating various shift detection methods and noise filtering techniques.

Using Reproducibility in Machine Learning Education: Reproducibility with Incomplete Methodology Descriptions

Sat, 27 May 2023 00:00:00 +0000

Hey,

I am Shekhar and I am one of several students who are working on the project Using Reproducibility in Machine Learning Education under the mentorship of Fraida Fund. My proposal aims to develop interactive educational materials about reproducibility in machine learning, for use in graduate and undergraduate classes. My project is inspired by my experience in the Machine Learning Reproducibility Challenge, where I found that a major challenge for reproducibility was that some details were left ambiguous in the paper I was trying to reproduce. For my project, I will develop an interactive tutorial to demonstrate that when the methodology details are not fully specified in a publication, someone trying to reproduce the result has to make choices that may not match the authors’, and that these choices affect whether or not the final result is validated.

Efficient Communication with Key/Value Storage Devices

Fri, 26 May 2023 00:00:00 +0000

Hi everyone!

I’m Manank Patel, and I am currently an undergraduate student at Birla Institute of Technology and Science - Pilani, KK Birla Goa Campus. As part of the Efficient Communication with Key/Value Storage Devices my proposal under the mentorship of Aldrin Montana and Philip Kufeldt aims to implement an io_uring-based communication backend for a network-based key-value store.

KV store clients currently use traditional network sockets and POSIX APIs to communicate with the KV store. io_uring, a kernel interface introduced in recent years, can be used instead of the POSIX API. It employs shared memory queues for communication between the kernel and user space, enabling data transfer without the need for system calls and promoting zero-copy transfer of data. By circumventing the overhead associated with system calls, this approach has the potential to enhance performance significantly.
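The queue-based idea can be illustrated with a toy model. The code below is a pure-Python simulation of a submission queue and a completion queue; it does not call io_uring or any kernel interface, and the class and method names are invented for this example. It only shows why batching requests in a shared queue reduces the number of kernel entries compared with one call per request.

```python
from collections import deque


class ToyRing:
    """Toy model of a submission/completion queue pair (not real io_uring)."""

    def __init__(self):
        self.submission_queue = deque()
        self.completion_queue = deque()
        self.kernel_entries = 0  # simulated count of user-to-kernel transitions

    def prepare_put(self, key, value):
        """Queue a request; in the model this is a shared-memory write only."""
        self.submission_queue.append((key, value))

    def submit(self):
        """One simulated kernel entry drains every queued request."""
        self.kernel_entries += 1
        processed = 0
        while self.submission_queue:
            key, value = self.submission_queue.popleft()
            self.completion_queue.append((key, len(value)))
            processed += 1
        return processed

    def reap(self):
        """Collect completions; again modelled as a shared-memory read."""
        done = list(self.completion_queue)
        self.completion_queue.clear()
        return done


requests = [(f"key{i}", b"value") for i in range(100)]

ring = ToyRing()
for key, value in requests:
    ring.prepare_put(key, value)
ring.submit()
completions = ring.reap()

per_request_calls = len(requests)  # a blocking send() per request would enter the kernel each time
print(per_request_calls, "entries with one call per request vs", ring.kernel_entries, "with the ring")
```

A real backend would use the kernel interface itself (typically through liburing in C), where buffer registration and polling modes determine how close it gets to this idealized picture.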

Update OpenROAD Documentation and Tutorials

Fri, 26 May 2023 00:00:00 +0000

Hi! I am Jack, a Masters student at the National University of Singapore. In GSoC 2023, I will be undertaking the project entitled Update OpenROAD Documentation and Tutorials to improve the user experience and documentation of this exciting open-source RTL-to-GDSII framework, jointly mentored by Indira Iyer Almeida and Vitor Bandeira. Check out my proposal here!

This project aims to review and update missing documentation and tutorials in OpenROAD-flow-scripts. A key focus will be on increasing ease-of-setup by updating documentation, setup scripts and docker-based commands. Next, we will also update documentation for the following OpenROAD components: Makefile flow variable, distributed detailed routing, Hier-RTLMP, Autotuner. If time permits, cloud enablement will be implemented, alongside notebook-based packaging to further increase ease of adoption.

Advancing Reproducible Science through Open Source Laboratory Protocols as Software

Thu, 25 May 2023 00:00:00 +0000

Hello everyone!

My name is Luiza, and I am an eighth-semester BSc Biological Sciences student from São Paulo, Brazil. As part of the LabOP working group, my proposal under the mentorship of Dan Bryce and Tim Fallon aims to build a converter that takes ordinary laboratory protocols and translates them into machine-executable protocols. This is possible thanks to LabOP’s versatility in representing what a laboratory protocol should look like. I’ll be testing this specialization on Hamilton machines, which are great for scaling up experiments.

A very common issue among biotechnology laboratories today is that protocols are difficult to share and to adapt for machine execution. Laboratory protocols are critical to biological research and development, yet complicated to communicate and reproduce across projects, investigators, and organizations. While many attempts have been made to address this challenge, there is currently no available protocol representation that is unambiguous enough for precise interpretation and automation, yet simultaneously abstract enough to enable reuse and adaptation.

With LabOP we can take a protocol and convert it in multiple ways, depending on whether the researcher needs automation or human execution, while keeping flexibility for execution and experimentation. I’ll be building a specialization that translates protocols so that they can be executed by Hamilton machines.

Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings

Thu, 25 May 2023 00:00:00 +0000

The project plans to measure the impact of missing settings for open-source database systems, such as MySQL and PostgreSQL, particularly under the TPC-C benchmark. The objective is to run experiments on popular settings that are not reported and to fix any problems that arise during the experiments for the target systems. The project will compare the performance characteristics and analyze the impact of the missing settings on the performance of the target systems.
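A simple way to organize such experiments is to enumerate every combination of the settings a paper leaves unreported and run the benchmark once per combination. The sketch below builds that experiment matrix; the two MySQL variables are real configuration options, but choosing them and these particular values is an assumption for illustration, not the project's actual list.

```python
from itertools import product

# Illustrative choice of settings that benchmark reports often omit.
UNREPORTED_SETTINGS = {
    "innodb_buffer_pool_size": ["128M", "4G"],
    "innodb_flush_log_at_trx_commit": [0, 1, 2],
}


def experiment_matrix(settings):
    """Return one configuration dict per combination of setting values."""
    names = sorted(settings)
    return [
        dict(zip(names, values))
        for values in product(*(settings[name] for name in names))
    ]


matrix = experiment_matrix(UNREPORTED_SETTINGS)
for run_id, config in enumerate(matrix):
    print(run_id, config)
```

Each configuration would then be applied to the database before a TPC-C run, so that differences in throughput can be attributed to the setting rather than to the system.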

PolyPhy Infrastructure Enhancement

Thu, 25 May 2023 00:00:00 +0000

Hey!

I’m Prashant Jha, from Pune, a recent undergraduate from BITS Pilani. As part of PolyPhy, my proposal under the mentorship of Oskar Elek aims to develop and improve the current infrastructure.

Polyphorm / PolyPhy is led by Oskar Elek. PolyPhy is an organization that focuses on developing a GPU-oriented agent-based system for reconstructing and visualizing optimal transport networks defined over sparse data. With its roots in astronomy and inspiration drawn from nature, PolyPhy has been instrumental in discovering network-like patterns in natural language data and reconstructing the Cosmic web structure using its early prototype called Polyphorm. The organization aims to provide a richer 2D / 3D scalar field representation of the reconstructed network, making it a toolkit for a range of specialists across different disciplines, including astronomers, neuroscientists, data scientists, and artists. PolyPhy’s ultimate purpose is to create quantitatively comparable structural analytics and discover connections between different disciplines. To achieve its goals, PolyPhy requires a robust infrastructure that is engineered using DevOps, Code Refactoring, and Continuous Integration/Continuous Deployment (CI/CD) practices. You can see an instructive overview of PolyPhy in our workshop and more details about our research here.

Strengthening Underserved Segments of the Open Source Pipeline

Thu, 25 May 2023 00:00:00 +0000

Namaste everyone🙏🏻!

I’m Nandini Saagar, from Mumbai, an undergraduate student at the Indian Institute of Technology, Banaras Hindu University, IIT (BHU), Varanasi. As part of the Strengthening Underserved Segments of the Open Source Pipeline my proposal under the mentorship of Emily Lovell aims to strengthen the underserved segment of the open source pipeline.

My interest in Open Source was first piqued as a freshman when I was introduced to Open Source as a place where people from all communities and backgrounds come together to create software that can have real-world impact, that too in a completely autonomous and self-governed manner! I am so glad that I could transition from just a person who imagined Open Source to be a fair-eyed dream to being a part of multiple such communities. This journey has been life-defining for me, and that’s why I want to help deliver the message of Open Source to all teenagers!

This project seeks to invite and support broader, more diverse participation in open source by supporting early contributors, especially those who have been historically minoritized within tech. It will aim to create content that anyone with some Open Source experience can use to guide new students into Open Source, GitHub, and the relevant technologies; provide a medium and platform for contributors to share their Open Source experiences and testimonials; conduct an Open Source themed hackathon/scavenger hunt; and leverage social media engagement to get young and brilliant minds acquainted with the technical and open-source world at an early age.

Stay tuned to explore the enormous world of Open Source with me!

Open Source Autonomous Vehicle Controller

Wed, 24 May 2023 00:00:00 +0000

As part of the Open Source Autonomous Vehicle Controller Project my proposal under the mentorship of Aaron Hunter and Carlos Espinosa aims to develop a tutorial that serves as a comprehensive guide for new users of the OSAVC controller. The tutorial will start from scratch, demonstrating how to initialize and program the controller using the equivalent of a “Hello, World!” program. Subsequently, it will progress to more advanced applications.

Throughout the project, I will work closely with my mentors to ensure the accuracy, clarity, and usability of the documentation. Their guidance and expertise will be instrumental in achieving the project’s objectives effectively.

By creating comprehensive technical documentation, this project aims to empower new users to harness the capabilities of the OSAVC controller. It will facilitate their understanding of the controller’s functionalities and enable them to leverage its potential in the field of autonomous vehicle applications.

I am excited to embark on this journey, contribute to the open-source community, and make a valuable impact in the field of autonomous vehicles. Stay tuned for regular updates and progress reports as I work towards achieving the goals set forth in this project.

For more information about the Open Source Autonomous Vehicle Controller and the UC OSPO organization, you can visit the OSAVC project repository and the UC OSPO website.

Stay connected and join me in this exciting endeavor!

Verify the reproducibility of an experiment

Wed, 24 May 2023 00:00:00 +0000

Hello everyone, my name is Jesse and I’m proud to be a fellow in this 2023 Summer of Reproducibility program, contributing to the noWorkflow project.

My proposal was accepted under the mentorship of João Felipe Pimentel and Juliana Freire and aims to map and test the capture of provenance in typical Data Science and Machine Learning experiments.

What…

Although much can be said about what reproducibility means, the ability to replicate results in day-to-day Data Science and Machine Learning experiments can pose a significant challenge for individuals, companies, and research centers. This challenge becomes even more pronounced with the emergence of analytics and AI, where scientific methodologies are extensively applied on an industrial scale. Reproducibility then assumes a key role in the productivity and accountability expected from Data Scientists, Machine Learning Engineers, and other roles engaged in ML/AI projects.

How…

In the day-to-day, the pitfalls of non-reproducibility appear at different points of the experiment lifecycle. These challenges arise when multiple experiments need to be managed for an individual or a team of scientists. In a typical experiment workflow, reproducibility appears in different steps of the process:

The need to track the provenance of datasets.
The need to manage changes in hypothesis tests.
Addressing the management of system hardware and OS setups.
Dealing with outputs from multiple experiments, including the results of various model trials.

In academic environments, these issues can result in mistakes and inaccuracies. In companies, they can lead to inefficiencies and technical debts that are difficult to address in the future.
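The items above can be made concrete with a small, hand-rolled provenance record. This sketch uses only the Python standard library and is not noWorkflow's API; the field names, parameters, and the result value are made up for illustration. It captures a dataset fingerprint, the experiment parameters, the environment, and the outputs in one place.

```python
import hashlib
import json
import platform
import sys


def dataset_digest(data: bytes) -> str:
    """Content hash that changes whenever the dataset bytes change."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(dataset_bytes, params, results):
    """Bundle dataset identity, parameters, environment, and outputs."""
    return {
        "dataset_sha256": dataset_digest(dataset_bytes),
        "params": params,
        "results": results,
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
    }


# Placeholder inputs: a tiny inline CSV and invented parameter/result values.
record = provenance_record(
    b"x,y\n1,2\n",
    params={"learning_rate": 0.01, "seed": 42},
    results={"f1": 0.9},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing one such record per trial makes it possible to tell later which dataset version, settings, and machine produced a given result.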

Finally…

I believe this is a great opportunity to explore the convergence of two hot topics, AI and reproducibility! I will share more updates here throughout this summer and hope we can learn a lot together!

Teaching Computer Networks with Reproducible Research: Developing a 'classroom competition' for adaptive video delivery

Tue, 23 May 2023 00:00:00 +0000

As part of the Teaching Computer Networks with Reproducible Research project my proposal under the mentorship of Fraida Fund aims to develop a classroom competition for adaptive video delivery policies, leveraging an existing open-source reproducible result. The competition will challenge students to extend the original work and design their own adaptive policies for head-to-head competition against their classmates. The project will involve packaging the existing result for easy reproducibility and building on it by implementing other adaptive video policies from the literature, developing different network settings for evaluating student submissions, and creating an evaluation framework for scoring submissions based on various criteria (so that competition remains fair and unbiased). The deliverables include a functional submission and evaluation process, an evaluation framework, and documentation and materials for course instructors to use in the classroom.
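For readers new to the topic, the kind of policy students might submit can be sketched as a simple rate-based rule: estimate throughput from recent segment downloads and pick the highest bitrate that fits under it. This is a generic textbook-style baseline written for illustration, not the policy from the reproduced result; the bitrate ladder, safety factor, and window size are assumptions.

```python
# Hypothetical bitrate ladder in kbit/s.
BITRATES_KBPS = [300, 750, 1200, 2850, 4300]


def harmonic_mean(samples):
    """Harmonic mean, which damps the effect of short throughput spikes."""
    return len(samples) / sum(1.0 / s for s in samples)


def choose_bitrate(throughput_history_kbps, ladder=BITRATES_KBPS, safety=0.9, window=5):
    """Pick the highest bitrate below a conservative throughput estimate."""
    if not throughput_history_kbps:
        return ladder[0]  # no measurements yet: start at the lowest quality
    estimate = safety * harmonic_mean(throughput_history_kbps[-window:])
    feasible = [b for b in ladder if b <= estimate]
    return max(feasible) if feasible else ladder[0]


print(choose_bitrate([2000, 2000, 2000]))
print(choose_bitrate([100]))
```

An evaluation framework could then score such a policy across different network traces on criteria like average bitrate, rebuffering time, and the number of quality switches.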

Writing a blog about your OSRE 2023 project

Sun, 06 Nov 2022 11:15:56 -0700

Starting in 2023 the Organization Admins will be asking students and contributors to provide regular status updates which will help us better highlight the work you are doing and track activities within our OSRE projects. These progress reports will also form the basis of blog reports prepared by students in the course of their summer. Blog reports should include links to proposals, presentations, reports, and an overview of the student’s experience.

Your experience is invaluable for future OSRE candidates and for improving the program every year.

Size and content

Keep it short and crisp. Include a short description of your project, a link to your project proposal, and, later in the program, links to the GSoC reports you provided.

Making a pull request for your blog

Fork the git repository
If you haven’t already done so, add your profile using these instructions
- IMPORTANT: Under user_groups: add - 2023 Contributors (as opposed to any of the two mentor groups)
- The short bio and any other information goes below the frontmatter
Post your blog
- Add /content/report/osre23/ORGANIZATION/PROJECTNAME/DATE-USERNAME/index.md
- Add a frontmatter to index.md, using the labels below
- Blog text goes below the frontmatter
- In that same directory include a picture and call it featured.png (also supports .jpg, .jpeg)
Commit to your fork and make a pull request and email OSRE Admins (currently: Stephanie Lieggi, Carlos Maltzahn).

Example frontmatter and text body

---
title: "YOUR TITLE"
subtitle: "YOUR SUBTITLE (OPTIONAL)"
summary:
authors:
 - USERNAME1
 - USERNAME2
tags: ["osre23"]
categories: []
date: YYYY-MM-DD
lastmod: YYYY-MM-DD
featured: false
draft: false

# Featured image
# To use, add an image named `featured.jpg/png` to your page's folder.
# Focal points: Smart, Center, TopLeft, Top, TopRight, Left, Right, BottomLeft, Bottom, BottomRight.
image:
 caption: ""
 focal_point: ""
 preview_only: false
---

As part of the [PROJECTNAME](/project/osre23/ORGANIZATION/PROJECTNAME) my [proposal](https://...) under the mentorship of MENTOR aims to ...
