<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Projects | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/index.xml" rel="self" type="application/rss+xml"/><description>Projects</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>Projects</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/</link></image><item><title>NETAI: AI-Powered Network Anomaly Detection and Diagnostics Platform</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsd/netai/</link><pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsd/netai/</guid><description>&lt;p>NETAI (Network AI) is an AI-powered network anomaly detection and diagnostics platform for the National Research Platform (NRP). This project combines Kubernetes-native LLM integration, network performance monitoring, and predictive analytics to create an intelligent assistant for network operators. Students will work with cutting-edge technologies including Large Language Models (LLMs), Kubernetes, perfSONAR network measurements, time-series analysis, and containerized AI/ML workloads, while contributing to real-world applications in network operations and diagnostics.&lt;/p>
&lt;p>The project involves developing a &lt;strong>Kubernetes chatbot&lt;/strong> that leverages NRP&amp;rsquo;s managed LLM service (providing access to models like Qwen3-VL, GLM-4.7, and GPT-OSS) to help network operators understand complex network behaviors, diagnose anomalies, and receive natural language explanations of network issues. Students will integrate perfSONAR measurement data with traceroute path analysis to create an interactive network topology visualization, and develop &lt;strong>AI/ML models&lt;/strong> for predictive network performance analysis using NRP&amp;rsquo;s GPU resources.&lt;/p>
&lt;p>In addition, students will gain hands-on experience with &lt;strong>fine-tuning LLMs&lt;/strong> on historical network diagnostics data, developing &lt;strong>time-series forecasting models&lt;/strong> for network metrics, and implementing &lt;strong>anomaly detection&lt;/strong> using deep learning techniques. The entire AI/ML pipeline will be containerized and deployed as Kubernetes workloads, utilizing GPU-enabled pods for model training and inference, ensuring scalability and seamless integration with existing NRP infrastructure.&lt;/p>
&lt;p>The platform builds upon existing network diagnostics capabilities, combining end-to-end throughput measurements with detailed traceroute data to enable operators to visualize network paths, identify performance bottlenecks, and understand relationships between metrics and underlying infrastructure. The AI enhancement will provide predictive capabilities, automated incident reporting, and intelligent recommendations for network remediation strategies.&lt;/p>
&lt;h3 id="netai--llm-integration--kubernetes-chatbot">NETAI / LLM Integration &amp;amp; Kubernetes Chatbot&lt;/h3>
&lt;p>The proposed work includes developing a &lt;strong>Kubernetes-native chatbot&lt;/strong> that integrates with NRP&amp;rsquo;s managed LLM service to provide intelligent network diagnostics assistance. Students will create a conversational interface that can answer questions about network performance, explain anomalies in natural language, and suggest remediation strategies. They will fine-tune LLMs on historical network diagnostics data, test results, and traceroute information to create domain-specific assistants. Students will implement &lt;strong>RESTful APIs&lt;/strong> for chatbot interactions, develop &lt;strong>prompt engineering&lt;/strong> strategies for network diagnostics, and create &lt;strong>context-aware responses&lt;/strong> that incorporate real-time network telemetry. The chatbot will be deployed as Kubernetes services, utilizing GPU pods for inference and integrating with the existing diagnostics platform.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Large Language Models, Kubernetes, Chatbots, Natural Language Processing, Network Diagnostics, API Development&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Kubernetes, LLM APIs (Qwen3-VL, GLM-4.7, GPT-OSS), Prompt Engineering, REST APIs, Docker, GPU Computing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
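&lt;p>As a sketch of what the &lt;strong>context-aware responses&lt;/strong> above might look like in code, the snippet below folds recent telemetry into a diagnostics prompt before it is sent to the LLM. The field names, site names, and metric values are hypothetical examples, not a fixed NRP schema:&lt;/p>

```python
import json

def build_diagnostic_prompt(telemetry: dict) -> str:
    """Fold recent telemetry into a prompt so the model can ground its answer.

    The field names below (throughput_gbps, loss_pct, path) are hypothetical
    examples of perfSONAR-derived metrics, not a fixed schema.
    """
    context = json.dumps(telemetry, indent=2)
    return (
        "You are a network diagnostics assistant for the NRP.\n"
        "Given the telemetry below, explain any anomaly in plain language\n"
        "and suggest one remediation step.\n\n"
        f"Telemetry:\n{context}"
    )

# Example: a degraded link between two sites
prompt = build_diagnostic_prompt({
    "src": "ucsd",
    "dst": "unl",
    "throughput_gbps": 0.4,  # well below a 10 Gbps baseline
    "loss_pct": 2.1,
    "path": ["ucsd", "cenic", "internet2", "unl"],
})
```

&lt;p>The resulting string could then be sent to the chat endpoint of the managed LLM service; swapping in live perfSONAR data only changes the dictionary, not the prompt logic.&lt;/p>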
&lt;h3 id="netai--network-anomaly-detection-models">NETAI / Network Anomaly Detection Models&lt;/h3>
&lt;p>The proposed work includes developing &lt;strong>deep learning models&lt;/strong> for network anomaly detection using historical perfSONAR and traceroute data. Students will create models that can identify slow links, high packet loss, excessive retransmits, and failed network tests automatically. They will implement &lt;strong>anomaly detection algorithms&lt;/strong> using techniques such as autoencoders, LSTM networks, and transformer architectures. Students will train models on NRP&amp;rsquo;s GPU clusters using historical network telemetry stored in SQLite databases, develop &lt;strong>feature engineering&lt;/strong> pipelines for network metrics, and create &lt;strong>real-time inference services&lt;/strong> deployed as Kubernetes workloads. The models will be integrated into the diagnostics platform to provide automated anomaly detection alongside the interactive visualization.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Deep Learning, Anomaly Detection, Time-Series Analysis, Network Monitoring, Model Training, GPU Computing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch/TensorFlow, scikit-learn, Pandas, NumPy, SQLite, Kubernetes, GPU Pods, MLOps&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
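&lt;p>Before training autoencoders or LSTMs, a simple statistical baseline is a useful point of comparison. The sketch below flags throughput samples that deviate sharply from a trailing window; the series and thresholds are illustrative, not NRP measurements:&lt;/p>

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose deviation from the trailing window exceeds threshold.

    A deliberately simple baseline: the deep models (autoencoders, LSTMs)
    would be evaluated against something like this.
    """
    anomalies = []
    for i in range(window, len(series)):
        w = series[i - window:i]
        mu, sigma = mean(w), stdev(w)
        if sigma == 0:
            continue
        z = abs(series[i] - mu) / sigma
        if z > threshold:
            anomalies.append(i)
    return anomalies

# A steady 9.4-9.6 Gbps link with one sudden collapse at index 15
throughput = [9.5, 9.4, 9.6, 9.5, 9.4, 9.5, 9.6, 9.5, 9.4, 9.5,
              9.6, 9.5, 9.4, 9.5, 9.6, 0.8, 9.5, 9.4]
flagged = rolling_zscore_anomalies(throughput)  # the collapse is flagged
```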
&lt;h3 id="netai--predictive-analytics--forecasting">NETAI / Predictive Analytics &amp;amp; Forecasting&lt;/h3>
&lt;p>The proposed work includes developing &lt;strong>predictive models&lt;/strong> that can forecast network performance degradation and identify patterns in network anomalies before they impact users. Students will create &lt;strong>time-series forecasting models&lt;/strong> for network metrics such as throughput, latency, and packet loss, using techniques like ARIMA, Prophet, and deep learning-based forecasting. They will implement &lt;strong>few-shot learning approaches&lt;/strong> to adapt models to new network topologies and measurement patterns, develop &lt;strong>early warning systems&lt;/strong> for potential network issues, and create &lt;strong>automated incident report generation&lt;/strong> using LLMs. Students will leverage NRP&amp;rsquo;s GPU resources for training forecasting models and deploy them as Kubernetes services for real-time predictions integrated with the diagnostics dashboard.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Time-Series Forecasting, Predictive Analytics, Machine Learning, Network Performance, Early Warning Systems, LLM Integration&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch/TensorFlow, Prophet, ARIMA, Pandas, NumPy, Time-Series Analysis, Kubernetes, GPU Computing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
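&lt;p>A common reference point for the forecasting models above is the seasonal-naive baseline, which any ARIMA, Prophet, or deep learning forecaster should outperform. A minimal sketch, assuming a strongly periodic metric such as diurnal latency:&lt;/p>

```python
def seasonal_naive_forecast(series, season_length, horizon):
    """Forecast each future step with the value observed one season earlier.

    The standard baseline that trained forecasting models must beat; useful
    for strongly periodic network metrics.
    """
    forecast = []
    history = list(series)
    for _ in range(horizon):
        forecast.append(history[-season_length])
        history.append(forecast[-1])
    return forecast

# Hourly latency (ms) with a 4-step repeating pattern, two cycles observed
latency = [20, 22, 35, 28, 21, 23, 34, 27]
prediction = seasonal_naive_forecast(latency, season_length=4, horizon=4)
# prediction repeats the most recent cycle: [21, 23, 34, 27]
```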
&lt;h3 id="netai--kubernetes-deployment--infrastructure">NETAI / Kubernetes Deployment &amp;amp; Infrastructure&lt;/h3>
&lt;p>The proposed work includes setting up &lt;strong>Kubernetes-based infrastructure&lt;/strong> for deploying the entire NETAI platform, including LLM services, ML models, and the diagnostics dashboard. Students will create &lt;strong>Helm charts&lt;/strong> for deploying containerized AI/ML workloads, configure &lt;strong>GPU-enabled pods&lt;/strong> for model training and inference, and implement &lt;strong>persistent storage&lt;/strong> solutions for maintaining historical network telemetry. They will develop &lt;strong>GitLab CI/CD pipelines&lt;/strong> for automated testing and deployment, set up &lt;strong>monitoring and observability&lt;/strong> using Prometheus and Grafana for tracking model performance and resource usage, and create &lt;strong>scalable deployment strategies&lt;/strong> that leverage NRP&amp;rsquo;s distributed computing resources. Students will also integrate the platform with existing perfSONAR infrastructure and ensure seamless operation within the NRP cluster.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Kubernetes, DevOps, CI/CD, GPU Computing, Container Orchestration, Infrastructure as Code, Monitoring&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Kubernetes, Helm, GitLab CI/CD, Prometheus, Grafana, Docker, GPU Pods, Persistent Storage, Infrastructure Automation&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
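&lt;p>As a rough illustration of the &lt;strong>GPU-enabled pods&lt;/strong> the Helm charts would template, the manifest below requests a single GPU for a training job. The image, namespace, and resource numbers are placeholders, not NRP defaults:&lt;/p>

```yaml
# Sketch of a GPU-enabled training pod; image, namespace, and resource
# numbers are placeholders, not NRP defaults.
apiVersion: v1
kind: Pod
metadata:
  name: netai-train
  namespace: netai                 # placeholder namespace
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: registry.example.org/netai/trainer:latest   # placeholder image
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1        # request one GPU from the scheduler
          memory: 16Gi
          cpu: "4"
      volumeMounts:
        - name: telemetry
          mountPath: /data
  volumes:
    - name: telemetry
      persistentVolumeClaim:
        claimName: netai-telemetry # placeholder PVC for historical telemetry
```

&lt;p>In practice these fields would be Helm template values, so the same chart can deploy training, inference, and dashboard workloads with different resource profiles.&lt;/p>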
&lt;h2 id="project-resources">Project Resources&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>National Research Platform&lt;/strong>: &lt;a href="https://nrp.ai/" target="_blank" rel="noopener">https://nrp.ai/&lt;/a>&lt;/li>
&lt;li>&lt;strong>NRP LLM Service&lt;/strong>: &lt;a href="https://nrp.ai/documentation/userdocs/ai/llm-managed/" target="_blank" rel="noopener">https://nrp.ai/documentation/userdocs/ai/llm-managed/&lt;/a>&lt;/li>
&lt;li>&lt;strong>perfSONAR&lt;/strong>: &lt;a href="https://www.perfsonar.net/" target="_blank" rel="noopener">https://www.perfsonar.net/&lt;/a>&lt;/li>
&lt;li>&lt;strong>MaDDash&lt;/strong>: &lt;a href="https://github.com/esnet/maddash" target="_blank" rel="noopener">https://github.com/esnet/maddash&lt;/a>&lt;/li>
&lt;li>&lt;strong>Network Monitoring Documentation&lt;/strong>: &lt;a href="https://nrp.ai/documentation/" target="_blank" rel="noopener">https://nrp.ai/documentation/&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>This project addresses critical gaps in network performance monitoring for the National Research Platform by integrating AI/ML capabilities with existing perfSONAR-based diagnostics. The platform combines end-to-end network measurements with detailed path-level analysis, enhanced by intelligent AI assistants that can help operators understand complex network behaviors and predict potential issues. By leveraging NRP&amp;rsquo;s managed LLM service and GPU resources, students will create a Kubernetes-native system that scales across the distributed research network infrastructure, providing both real-time diagnostics and predictive analytics to improve network reliability and performance for researchers nationwide.&lt;/p></description></item><item><title>VINE: Precision Agriculture Data Platform &amp; Digital Twin</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsd/vine/</link><pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsd/vine/</guid><description>&lt;p>VINE (Vineyard Intelligence Network &amp;amp; Environment) is an AI/ML research project focused on precision agriculture using the &lt;strong>National Research Platform (NRP)&lt;/strong>. This project leverages the innovative demonstration at Iron Horse Vineyards to study how AI and machine learning can optimize agricultural practices through data-driven insights. Students will work with cutting-edge AI/ML technologies, distributed computing on NRP, and large-scale data analysis, while contributing to real-world applications in sustainable agriculture and climate adaptation.&lt;/p>
&lt;p>The project involves &lt;strong>AI/ML research&lt;/strong> using agricultural data from Iron Horse Vineyards, leveraging the computational resources of the &lt;strong>National Research Platform&lt;/strong> for training and deploying machine learning models. Students will work with agricultural datasets including sensor data, multi-spectral drone imagery, and historical records, developing models for predictive analytics, computer vision, and time-series forecasting. The integration of &lt;strong>NRP&amp;rsquo;s distributed infrastructure&lt;/strong> enables scalable AI research across these large data volumes.&lt;/p>
&lt;p>Students will gain hands-on experience with &lt;strong>AI/ML model development&lt;/strong> for agricultural applications, learning how to analyze multi-spectral drone imagery, process time-series sensor data, and build predictive models for irrigation scheduling, pest detection, and harvest timing. They will deploy and train models on &lt;strong>NRP&amp;rsquo;s Kubernetes clusters&lt;/strong>, utilize &lt;strong>GPU resources&lt;/strong> for deep learning workloads, and work with agricultural datasets for comprehensive research. The project emphasizes using &lt;strong>distributed computing&lt;/strong> on NRP to scale AI/ML experiments and create open, shareable datasets for collaborative research.&lt;/p>
&lt;p>The platform builds upon the success demonstrated at Iron Horse Vineyards, where AI-driven analytics have shown potential for &lt;strong>10% water use reduction&lt;/strong> and improved yield optimization. This project aims to advance AI/ML research in precision agriculture by utilizing NRP&amp;rsquo;s computational capabilities, creating reproducible research that can benefit the broader agricultural and research communities.&lt;/p>
&lt;h3 id="vine--data-pipeline--integration">VINE / Data Pipeline &amp;amp; Integration&lt;/h3>
&lt;p>The proposed work includes building &lt;strong>data pipelines&lt;/strong> to ingest, process, and prepare agricultural data from Iron Horse Vineyards and other sources for AI/ML research. Students will develop pipelines to collect sensor data (soil moisture, temperature, CO2, weather), multi-spectral drone imagery, and historical agricultural records. They will create &lt;strong>data validation and quality assurance&lt;/strong> processes, implement &lt;strong>data preprocessing&lt;/strong> for ML model training, and develop &lt;strong>data integration&lt;/strong> workflows that connect agricultural datasets with NRP computational resources. Students will also work on &lt;strong>data sharing&lt;/strong> mechanisms to make processed datasets available for the research community.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Engineering, Time-Series Data, Data Preprocessing, Data Sharing, ML Data Pipelines&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Pandas, NumPy, Data Validation, REST APIs, Docker, Kubernetes, Data Processing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>&lt;/li>
&lt;/ul>
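&lt;p>A minimal sketch of the &lt;strong>data validation&lt;/strong> step, assuming a hypothetical sensor schema with plausible physical bounds (the actual Iron Horse sensor schema may differ):&lt;/p>

```python
SCHEMA = {
    # field: (min, max) plausible bounds -- illustrative, not the real
    # Iron Horse sensor schema
    "soil_moisture_pct": (0, 100),
    "temperature_c": (-10, 55),
    "co2_ppm": (300, 2000),
}

def validate_reading(reading, schema=SCHEMA):
    """Return a list of problems found in one sensor reading."""
    problems = []
    for field, (lo, hi) in schema.items():
        value = reading.get(field)
        if value is None:
            problems.append(field + ": missing")
        elif not (value >= lo and hi >= value):
            problems.append(field + ": out of range")
    return problems

# A reading with an implausible moisture value and a missing CO2 field
issues = validate_reading({"soil_moisture_pct": 131, "temperature_c": 18.2})
```

&lt;p>Checks like these would run inside the ingestion pipeline, quarantining suspect readings before they reach model training or the shared datasets.&lt;/p>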
&lt;h3 id="vine--aiml-models-for-agricultural-analytics-on-nrp">VINE / AI/ML Models for Agricultural Analytics on NRP&lt;/h3>
&lt;p>The proposed work includes developing and training &lt;strong>machine learning models&lt;/strong> for agricultural applications using the &lt;strong>National Research Platform (NRP)&lt;/strong>. Students will create models for &lt;strong>predictive irrigation scheduling&lt;/strong> based on soil moisture, weather forecasts, and historical data. They will develop &lt;strong>computer vision models&lt;/strong> for analyzing multi-spectral drone imagery to detect plant health, identify pests, and estimate yield. Students will also work on &lt;strong>time-series forecasting&lt;/strong> models for predicting harvest timing and optimizing resource allocation. The project will involve training models on &lt;strong>NRP&amp;rsquo;s GPU clusters&lt;/strong>, utilizing distributed training capabilities, and deploying models for real-time inference. Students will leverage agricultural datasets for training and validation, and contribute model outputs and insights for the research community.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Machine Learning, Computer Vision, Time-Series Analysis, Predictive Analytics, Agricultural AI, Distributed Training&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch/TensorFlow, scikit-learn, OpenCV, Pandas, NumPy, MLOps, NRP Kubernetes, GPU Computing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>&lt;/li>
&lt;/ul>
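&lt;p>One concrete building block for the plant-health models above is the Normalized Difference Vegetation Index (NDVI), computed from the red and near-infrared bands of multi-spectral imagery. A per-pixel sketch; a real pipeline would vectorize this over full rasters with NumPy or OpenCV:&lt;/p>

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    NDVI = (NIR - Red) / (NIR + Red); values near 1 suggest dense healthy
    vegetation, values near 0 bare soil or stressed plants.
    """
    if nir + red == 0:
        return 0.0
    return (nir - red) / (nir + red)

# Reflectance values per pixel (0-1 scale); healthy vines reflect strongly
# in near-infrared and absorb red light
healthy = ndvi(nir=0.60, red=0.08)   # about 0.76
stressed = ndvi(nir=0.30, red=0.20)  # about 0.20
```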
&lt;h3 id="vine--digital-twin--ai-driven-visualization">VINE / Digital Twin &amp;amp; AI-Driven Visualization&lt;/h3>
&lt;p>The proposed work includes creating &lt;strong>AI-enhanced digital twin&lt;/strong> systems for agricultural sites using computational resources on NRP. Students will develop &lt;strong>3D visualization&lt;/strong> systems (potentially using Omniverse or similar platforms) to represent vineyards and farms, integrate &lt;strong>AI model predictions&lt;/strong> into the digital twin for real-time insights, and create &lt;strong>interactive dashboards&lt;/strong> for monitoring and analysis. They will implement &lt;strong>spatial data processing&lt;/strong> using ML models to map sensor locations and readings to geographic coordinates, and develop &lt;strong>AI-driven simulation capabilities&lt;/strong> for testing different agricultural strategies (irrigation patterns, planting layouts, etc.) before implementation. Students will deploy visualization services on &lt;strong>NRP infrastructure&lt;/strong> and integrate with agricultural data sources for real-time updates.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Digital Twin, AI-Enhanced Visualization, GIS, Spatial Data, ML-Driven Simulation, Real-Time Systems&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, 3D Graphics (Omniverse/Unity/Blender), GIS tools, WebGL, React/Three.js, ML Integration, NRP Deployment&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>&lt;/li>
&lt;/ul>
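&lt;p>As one sketch of the &lt;strong>spatial data processing&lt;/strong> step, inverse-distance weighting can interpolate scattered sensor readings onto arbitrary digital-twin coordinates. The coordinates and values below are made up, and a real deployment would first project GPS fixes with GIS tooling:&lt;/p>

```python
import math

def idw_estimate(query, sensors, power=2):
    """Inverse-distance-weighted estimate at a query point.

    sensors is a list of ((x, y), value) pairs in any planar coordinate
    system; a real deployment would project GPS fixes with GIS tools first.
    """
    num = den = 0.0
    for (x, y), value in sensors:
        d = math.hypot(query[0] - x, query[1] - y)
        if d == 0:
            return value            # query sits exactly on a sensor
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den

# Soil-moisture readings (%) at three corners of a vineyard block
sensors = [((0, 0), 30.0), ((10, 0), 40.0), ((0, 10), 50.0)]
estimate = idw_estimate((1, 1), sensors)  # dominated by the nearby 30% sensor
```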
&lt;h3 id="vine--web-dashboard--nrp-integration-platform">VINE / Web Dashboard &amp;amp; NRP Integration Platform&lt;/h3>
&lt;p>The proposed work includes building a &lt;strong>comprehensive web dashboard&lt;/strong> for visualizing agricultural data, AI model predictions, and research insights. Students will develop a &lt;strong>full-stack web application&lt;/strong> using modern frameworks (React, Flask/FastAPI) deployed on the &lt;strong>National Research Platform (NRP)&lt;/strong>. The dashboard will display real-time sensor readings, historical trends from agricultural datasets, AI model predictions, and digital twin visualizations. Students will create &lt;strong>API endpoints&lt;/strong> that integrate with &lt;strong>NRP computational resources&lt;/strong> and agricultural data sources, implement &lt;strong>role-based access control&lt;/strong> for researchers, and enable &lt;strong>data export/sharing&lt;/strong> with the broader research community. The platform will support &lt;strong>interactive data exploration&lt;/strong> tools and provide programmatic access to AI/ML models running on NRP.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Full-Stack Web Development, Data Visualization, API Development, NRP Deployment, ML Model Serving&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> React, Flask/FastAPI, PostgreSQL, D3.js/Plotly, Bootstrap/Tailwind CSS, REST APIs, Kubernetes, NRP APIs&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="project-resources">Project Resources&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>National Research Platform&lt;/strong>: &lt;a href="https://nrp.ai/" target="_blank" rel="noopener">https://nrp.ai/&lt;/a>&lt;/li>
&lt;li>&lt;strong>Iron Horse Vineyards Project&lt;/strong>: &lt;a href="https://gitlab.nrp-nautilus.io/ihv" target="_blank" rel="noopener">https://gitlab.nrp-nautilus.io/ihv&lt;/a>&lt;/li>
&lt;li>&lt;strong>Omniverse Integration&lt;/strong>: &lt;a href="https://gitlab.nrp-nautilus.io/omniverse" target="_blank" rel="noopener">https://gitlab.nrp-nautilus.io/omniverse&lt;/a>&lt;/li>
&lt;li>&lt;strong>CENIC Network&lt;/strong>: &lt;a href="https://cenic.org/" target="_blank" rel="noopener">https://cenic.org/&lt;/a>&lt;/li>
&lt;li>&lt;strong>CENIC Precision Agriculture Blog&lt;/strong>: &lt;a href="https://nrp.ai/cenic-precision-agriculture-2025" target="_blank" rel="noopener">https://nrp.ai/cenic-precision-agriculture-2025&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>This project builds upon the successful demonstration at Iron Horse Vineyards, where CENIC, UC San Diego, and partners have created a living laboratory for precision agriculture. The VINE project focuses on &lt;strong>AI/ML research&lt;/strong> using the &lt;strong>National Research Platform (NRP)&lt;/strong> for computational resources. By leveraging NRP&amp;rsquo;s distributed infrastructure and GPU clusters, students can train and deploy sophisticated ML models for agricultural applications. The project works with agricultural datasets from Iron Horse Vineyards and aims to create open, shareable datasets for the research community. This approach creates a scalable, reproducible framework for AI/ML research in precision agriculture that can benefit researchers, educators, and practitioners nationwide.&lt;/p></description></item><item><title>Reconfigurable and Placement-Aware Replication for Edge Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/umass/edge-replication/</link><pubDate>Sat, 31 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/umass/edge-replication/</guid><description>&lt;h2 id="project-description">Project Description&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Distributed systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Rust, Java, Go, Python, Bash scripting, Linux, Docker.&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="mailto:fikurnia@cs.umass.edu">Fadhil I. Kurnia&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Modern replicated systems are typically evaluated under static configurations with fixed replica placement. However, real-world edge deployments are highly dynamic: workloads shift geographically, edge nodes join or fail, and latency conditions change over time. Our existing testbed provides reproducible evaluation for replicated systems but lacks support for dynamic reconfiguration and adaptive edge placement policies.&lt;/p>
&lt;p>This project extends the existing open testbed to support:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Dynamic Replica Reconfiguration&lt;/p>
&lt;ul>
&lt;li>Membership changes (add/remove replicas)&lt;/li>
&lt;li>Leader migration and shard movement&lt;/li>
&lt;li>Online reconfiguration cost measurement (latency spikes, recovery overhead, state transfer cost)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Edge-Aware Placement Policies&lt;/p>
&lt;ul>
&lt;li>Demand-aware placement based on geographic workload skew&lt;/li>
&lt;li>Latency-aware and bandwidth-aware replica selection&lt;/li>
&lt;li>Comparison of static vs. adaptive placement strategies&lt;/li>
&lt;li>Evaluation under real-world latency matrices (e.g., US metro-level or cloud region traces)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>What-if Simulation Framework&lt;/p>
&lt;ul>
&lt;li>Replay workload traces with time-varying demand&lt;/li>
&lt;li>Simulate hundreds of edge sites with realistic network conditions&lt;/li>
&lt;li>Quantify trade-offs between consistency, availability, reconfiguration overhead, and cost&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
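&lt;p>As a sketch of the placement policies in item 2, the greedy heuristic below picks k replica sites that minimize mean client-to-nearest-replica latency, a simple member of the facility-location family. The latency matrix is illustrative; real experiments would use measured metro-level or cloud-region traces:&lt;/p>

```python
def greedy_placement(latency, k):
    """Greedily pick k replica sites minimizing total client-to-nearest-replica
    latency.

    latency[c][s] is the latency from client region c to candidate site s.
    The matrix used below is illustrative, not a real trace.
    """
    n_sites = len(latency[0])
    chosen = []
    for _ in range(k):
        best_site, best_cost = None, float("inf")
        for s in range(n_sites):
            if s in chosen:
                continue
            cost = sum(min(row[t] for t in chosen + [s]) for row in latency)
            if best_cost > cost:
                best_site, best_cost = s, cost
        chosen.append(best_site)
    return chosen

# Rows: client regions, columns: candidate sites (latencies in ms)
latency = [
    [10, 80, 90],
    [15, 70, 95],
    [90, 12, 30],
    [85, 15, 25],
]
placement = greedy_placement(latency, k=2)  # sites covering both client clusters
```

&lt;p>In the plugin framework this would be one policy among several, so that k-means, latency-minimizing, and cost-aware strategies can be swapped and compared under the same traces.&lt;/p>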
&lt;p>The outcome will be an &lt;a href="https://distrobench.org" target="_blank" rel="noopener">open-source framework&lt;/a> that enables researchers to evaluate not only steady-state replication performance, but also how systems behave under churn, scaling events, and demand shifts, all of which are central challenges in real edge environments.&lt;/p>
&lt;h3 id="expected-deliverables">Expected Deliverables&lt;/h3>
&lt;ul>
&lt;li>Reconfiguration abstraction layer (API for membership &amp;amp; placement changes)&lt;/li>
&lt;li>Placement policy plugin framework (k-means, facility-location heuristics, latency-minimizing, cost-aware)&lt;/li>
&lt;li>Trace-driven dynamic workload engine&lt;/li>
&lt;li>Public benchmark scenarios and reproducible experiment scripts&lt;/li>
&lt;li>Artifact-ready documentation and evaluation report&lt;/li>
&lt;/ul></description></item><item><title>AI Data Readiness Inspector (AIDRIN)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/aidrin/</link><pubDate>Fri, 30 Jan 2026 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/aidrin/</guid><description>&lt;p>Garbage In, Garbage Out (GIGO) is a widely accepted adage across computer science, including Artificial Intelligence (AI). Because data is the fuel for AI, models trained on low-quality or biased data are often ineffective, and computer scientists who use AI invest considerable time and effort in preparing their data.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/pdf/2406.19256" target="_blank" rel="noopener">AIDRIN&lt;/a> (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of data readiness for AI processes, covering a broad range of dimensions from the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, to evaluate data. Furthermore, AIDRIN uses metrics specific to assessing AI data, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating data readiness.&lt;/p>
&lt;h3 id="aidrin-multiple-file-formats">AIDRIN Multiple File Formats&lt;/h3>
&lt;p>The proposed work will include improvements to the AIDRIN framework to (1) add support for new file formats such as Zarr, ROOT, and HDF5; and (2) allow users to provide custom data ingestion mechanisms.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>data readiness&lt;/code>, &lt;code>AI&lt;/code>, &lt;code>data analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, data analysis, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/drishti/</link><pubDate>Fri, 30 Jan 2026 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/lbl/drishti/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help users understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component that lets end-users visually inspect their applications&amp;rsquo; I/O behavior, focus on areas of interest, and get a clear picture of common root causes of I/O performance bottlenecks. Building on the automatic detection of I/O performance bottlenecks, the framework maps common, well-known bottlenecks to solution recommendations that users can implement.&lt;/p>
&lt;h3 id="drishti-comparisons-and-heatmaps">Drishti Comparisons and Heatmaps&lt;/h3>
&lt;p>The proposed work will include investigating and building a solution to allow comparing and finding differences between two I/O trace files (similar to a &lt;code>diff&lt;/code>), covering the analysis and visualization components. It will also explore additional metrics and counters such as Darshan heatmaps in the analysis and visualization components of the framework.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code>, &lt;code>HPC&lt;/code>, &lt;code>data analysis&lt;/code>, &lt;code>visualization&lt;/code>, &lt;code>profiling&lt;/code>, &lt;code>tracing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, data analysis, performance profiling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>EnergyAPI: An End-to-End API for Energy-Aware Forecasting and Scheduling</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/energy-api/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/energy-api/</guid><description>&lt;p>Over the past decades, electricity demand has increased steadily, driven by structural shifts such as the electrification of transportation and, more recently, the rapid expansion of artificial intelligence (AI). Power grids have responded by expanding generation capacity, integrating renewable energy sources such as solar and wind, and deploying demand-response mechanisms. However, the current pace of demand growth is increasingly outstripping grid expansion, leading to integration delays, greater reliance on behind-the-meter consumption, and rising operational complexity.&lt;/p>
&lt;p>To mitigate the environmental and socioeconomic impacts of electricity consumption, large consumers such as cloud data centers and electric vehicle (EV) charging infrastructures are increasingly participating in demand-response programs. These programs provide consumers with grid signals indicating favorable periods for electricity usage, such as when energy is cheapest or has the lowest carbon intensity. Consumers can then shift workloads across time and location to better align with grid conditions and their own operational constraints. A key challenge, however, is the online nature of this problem: operators must make real-time decisions without full knowledge of future grid conditions. While forecasting and optimization techniques exist, their effectiveness depends heavily on workload characteristics, such as whether tasks are delay-tolerant cloud jobs or EV charging sessions with route and deadline constraints.&lt;/p>
&lt;p>This project proposes the design and implementation of a modular, extensible API for energy-aware workload scheduling. The API will ingest grid signals alongside workload Service Level Objectives (SLOs) and operational requirements, and produce execution plans that adapt to changing grid conditions. It will support multiple pluggable scheduling strategies and heuristics, enabling developers to compare real-time and forecast-based approaches across different workload classes. By providing a reusable, open-source interface for demand-response-aware scheduling, this project aims to lower the barrier for developers to integrate energy-aware decision-making into distributed systems and applications.&lt;/p>
&lt;h3 id="building-an-end-to-end-service-for-energy-forecasting-and-scheduling">Building an End-to-End Service for Energy Forecasting and Scheduling&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Databases&lt;/code> &lt;code>Machine Learning&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, command line tools (bash), SQL (MySQL or SQLite), FastAPI, time-series analysis, basic machine learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/abel-souza/">Abel Souza&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a containerized, end-to-end platform consisting of a backend, API, and web-based frontend for collecting, estimating, and visualizing real-time and forecasted electrical grid signals. These signals include electricity demand, prices, energy production, grid saturation, and carbon intensity. The system will support scalable data ingestion, region-specific forecasting models, and interactive visualizations to enable energy-aware application development and analysis.&lt;/p>
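&lt;p>A minimal sketch of the data shapes involved, assuming an illustrative record layout and a naive persistence baseline (field names, the region tag, and the baseline choice are not a prescribed schema; a real deployment would back this with MySQL/SQLite and expose it via FastAPI):&lt;/p>

```python
# Sketch: a grid-signal record and a naive baseline forecast.
# Field names, the region value, and the persistence baseline are illustrative
# assumptions, not a fixed schema; a real system would store these rows in
# MySQL/SQLite and serve them through a FastAPI endpoint.
from dataclasses import dataclass

@dataclass
class GridSignal:
    region: str               # e.g., an ENTSO-E or EIA balancing area
    timestamp: str            # ISO-8601, time-aligned across regions
    carbon_intensity: float   # gCO2eq/kWh
    price: float              # currency per MWh

def persistence_forecast(history, horizon):
    """Naive baseline: repeat the most recent observation over the horizon."""
    last = history[-1]
    return [last] * horizon

sig = GridSignal(region="CAISO", timestamp="2026-01-30T00:00:00Z",
                 carbon_intensity=365.0, price=42.5)
history = [420.0, 390.0, 365.0]   # recent carbon-intensity readings
print(persistence_forecast(history, 2))  # [365.0, 365.0]
```

&lt;p>The persistence baseline is deliberately trivial; its value is as a yardstick that region-specific forecasting models must beat.&lt;/p>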
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Study electrical grid signals and demand-response data sources (e.g., demand, price, carbon intensity, grid saturation) and identify their requirements for real-time and forecast-based consumption planning.&lt;/li>
&lt;li>Design and implement a relational data model for storing historical, real-time, and forecasted grid signals.&lt;/li>
&lt;li>Ingest and validate grid signal data into a MySQL or SQLite database, ensuring data quality and time alignment across regions.&lt;/li>
&lt;li>Implement baseline time-series forecasting models for grid signals (e.g., demand, price, or carbon intensity), with support for region-specific configurations.&lt;/li>
&lt;li>Query the European Network of Transmission System Operators for Electricity (ENTSO-E) and U.S. Energy Information Administration (EIA) APIs to collect grid data.&lt;/li>
&lt;li>Develop a RESTful API that exposes both raw and forecasted grid signals for use by energy-aware applications and schedulers.&lt;/li>
&lt;li>Build a web-based user interface to visualize historical trends, forecasts, and regional differences in grid conditions.&lt;/li>
&lt;li>Implement an interactive choropleth map to display spatial variations in grid signals such as carbon intensity and electricity prices.&lt;/li>
&lt;li>Design an extensible architecture that allows different regions to plug in custom forecasting models or heuristics.&lt;/li>
&lt;li>Containerize the backend, API, and frontend components using Docker to enable reproducible deployment and easy integration by external users.&lt;/li>
&lt;/ul></description></item><item><title>Environmental NeTworked Sensor (ENTS)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/ents/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/ents/</guid><description>&lt;h3 id="ents-i-usability-improvements-for-visualization-dashboard">ENTS I: Usability improvements for visualization dashboard&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Data Visualization Dashboard" srcset="
/project/osre26/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp 400w,
/project/osre26/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_baae6484e015277af7b09e866b6869f5.webp 760w,
/project/osre26/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp"
width="760"
height="759"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Backend, Frontend, UI/UX, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> React, Javascript, Python, SQL, Git&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> Flask, Docker, CI/CD, AWS, Authentication&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:alevy1@ucsc.edu">Alec Levy&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Environmental NeTworked Sensor (ENTS) platform, formerly the Open Sensing Platform (OSP), implements a data visualization website for monitoring microbial fuel cell sensors (see &lt;a href="https://github.com/jlab-sensing/ENTS-backend" target="_blank" rel="noopener">GitHub&lt;/a>). The mission is to scale up the current platform to support other researchers and citizen scientists in integrating their novel sensing hardware or microbial fuel cell sensors for monitoring and data analysis. Examples of the types of sensors currently deployed are sensors measuring soil moisture, temperature, current, and voltage in outdoor settings. The software half of the project focuses on building upon our existing visualization web platform and adding features to support the mission. A live version of the website is available &lt;a href="https://dirtviz.jlab.ucsc.edu/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Below is a list of project ideas that would be beneficial to the ENTS project. You are not limited to these ideas, and we encourage new ones that enhance the platform:&lt;/p>
&lt;ul>
&lt;li>Drag and drop charts functionality&lt;/li>
&lt;li>Creation of unique charts by users (with unique equations)&lt;/li>
&lt;li>Customizable options of charts (color, line width, datapoint/line style, axis labels)&lt;/li>
&lt;li>Exportable charts (with customizable options)&lt;/li>
&lt;li>Saving layouts via url&lt;/li>
&lt;/ul>
&lt;h3 id="ents-ii-migration-to-tockos">ENTS II: Migration to TockOS&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ENTS in the wild" srcset="
/project/osre26/ucsc/ents/flower_bed_hua65f08ca6bedf0f2d60c653056e1b3a7_800588_c34f23edec4789d86dcf04482fa38282.webp 400w,
/project/osre26/ucsc/ents/flower_bed_hua65f08ca6bedf0f2d60c653056e1b3a7_800588_8a4ed9b7cf50d0c7493779c714094459.webp 760w,
/project/osre26/ucsc/ents/flower_bed_hua65f08ca6bedf0f2d60c653056e1b3a7_800588_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/ents/flower_bed_hua65f08ca6bedf0f2d60c653056e1b3a7_800588_c34f23edec4789d86dcf04482fa38282.webp"
width="760"
height="369"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Embedded system, operating system&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> Rust, C/C++, Git, GitHub&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> STM32 HAL, Python&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The current version of the hardware firmware is implemented as a bare-metal application using STM hardware abstraction layer (HAL) drivers. We are interested in porting the firmware to an operating system (OS) to enable additional functionality for environmental data logging. &lt;a href="https://tockos.org/" target="_blank" rel="noopener">TockOS&lt;/a>, the OS that will be used, is an embedded operating system designed for running multiple concurrent, mutually distrustful applications on low-memory and low-power microcontrollers. TockOS allows for over-the-air (OTA) updates, dynamic app loading, hardware multiplexing, and more. We envision multiple users utilizing shared ENTS hardware that provides communication and measurement capabilities, reducing the initial cost of deploying wireless sensor networks.&lt;/p>
&lt;p>The TockOS kernel is written in &lt;a href="https://rust-lang.org/" target="_blank" rel="noopener">Rust&lt;/a> to enhance
security. Userspace apps can be written in either C, C++, or Rust. Development
will be done through a remote development server to access the hardware. See
the following repos for the current status of the project:&lt;/p>
&lt;ul>
&lt;li>Userspace library: &lt;a href="https://github.com/jlab-sensing/libtock-c" target="_blank" rel="noopener">libtock-c&lt;/a>&lt;/li>
&lt;li>Kernel: &lt;a href="https://github.com/jlab-sensing/tock" target="_blank" rel="noopener">tock&lt;/a>&lt;/li>
&lt;li>Baremetal: &lt;a href="https://github.com/jlab-sensing/ENTS-node-firmware" target="_blank" rel="noopener">ENTS-node-firmware&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scope of work:&lt;/p>
&lt;ul>
&lt;li>Writing kernel peripheral drivers.
&lt;ul>
&lt;li>Done entirely in Rust.&lt;/li>
&lt;li>Requires a low-level understanding of the microcontroller.&lt;/li>
&lt;li>Requires basic knowledge of kernel functionality.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Porting bare-metal components to userland apps.
&lt;ul>
&lt;li>Involves porting STM HAL calls to TockOS syscalls.&lt;/li>
&lt;li>Primarily done in C.&lt;/li>
&lt;li>Requires an understanding of syscalls.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Reproducible CXL Emulation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucmerced/cxl_emu/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucmerced/cxl_emu/</guid><description>&lt;p>Compute Express Link (CXL) is an emerging memory interconnect standard that enables shared, coherent memory across CPUs, accelerators, and multiple hosts, unlocking new possibilities in hyperscale, HPC, and disaggregated systems. However, because access to real multi-host CXL hardware is limited, it is difficult for researchers and students to experiment with, evaluate, and reproduce results on advanced CXL topologies.
&lt;a href="https://github.com/cxl-emu/OCEAN" target="_blank" rel="noopener">OCEAN&lt;/a> (Open-source CXL Emulation At Hyperscale) is a full-stack CXL emulation platform built on QEMU that enables detailed emulation of CXL 3.0 memory systems, including multi-host shared memory pools, coherent fabric topologies, and latency modeling. This project will create reproducible experiment pipelines, automated deployment workflows, and user-friendly tutorials so that others can reliably run and extend CXL emulation experiments without requiring specialized hardware.&lt;/p>
&lt;h3 id="reproducible-cxl-emulation-for-multi-host-memory-systems">Reproducible CXL Emulation for Multi-Host Memory Systems&lt;/h3>
&lt;p>Streamline multi-host CXL emulation without specialized hardware.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>CXL emulation&lt;/code> &lt;code>Memory Systems&lt;/code> &lt;code>Reproducibility&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Virtualization (QEMU), Scripting, Performance Modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrafi@ucmerced.edu">Mujahid Al Rafi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Create automated deployment scripts and configuration templates for OCEAN-based CXL emulation topologies (single-host and multi-host).&lt;/li>
&lt;li>Develop a standardized experiment harness for running memory performance benchmarks (e.g., OSU micro-benchmarks, STREAM-style tests) in emulated CXL environments.&lt;/li>
&lt;li>Build reproducible experiment pipelines that others can run to evaluate latency, bandwidth, and scaling properties of CXL memory systems.&lt;/li>
&lt;li>Produce tutorials, documentation, and reproducibility artifacts to guide new users through setup, execution, and analysis.&lt;/li>
&lt;li>Package and contribute all scripts, configurations, and documentation back to the OCEAN open-source repository.&lt;/li>
&lt;/ul>
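&lt;p>One way to picture the experiment harness: every benchmark run is recorded together with a fingerprint of the exact topology configuration that produced it, so results are reproducible and comparable. The topology fields and the command below are placeholders, not OCEAN's actual launch interface.&lt;/p>

```python
# Sketch: a reproducible experiment harness that runs one benchmark and bundles
# its output with a fingerprint of the emulation configuration.
# The topology fields and the command are placeholders, not OCEAN's real API.
import hashlib
import json
import subprocess

def run_experiment(topology, command):
    """Run one benchmark command and record output plus a config fingerprint."""
    config = json.dumps(topology, sort_keys=True)  # canonical form for hashing
    result = subprocess.run(command, capture_output=True, text=True)
    return {
        "config_sha": hashlib.sha256(config.encode()).hexdigest()[:12],
        "topology": topology,
        "stdout": result.stdout,
        "returncode": result.returncode,
    }

# Placeholder command standing in for a STREAM-style benchmark inside QEMU.
record = run_experiment({"hosts": 2, "latency_ns": 350},
                        ["echo", "STREAM: Copy 12000 MB/s"])
print(record["stdout"].strip())
```

&lt;p>Hashing the canonicalized configuration means two result files can be trusted to come from identical topologies whenever their fingerprints match.&lt;/p>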
&lt;h3 id="exploring-security-and-isolation-in-cxl-based-memory-systems">Exploring Security and Isolation in CXL-Based Memory Systems&lt;/h3>
&lt;p>Investigate security and isolation properties of CXL-based memory systems using software emulation.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>CXL Systems&lt;/code> &lt;code>Security&lt;/code> &lt;code>Memory Isolation&lt;/code> &lt;code>Side Channel&lt;/code> &lt;code>Emulation&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Virtualization (QEMU), Scripting, Computer Architecture, Security&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrafi@ucmerced.edu">Mujahid Al Rafi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Study the CXL memory model and fabric architecture to identify potential security and isolation risks in multi-host shared memory environments (e.g., contention, timing variation, and resource interference).&lt;/li>
&lt;li>Set up multi-host or multi-VM CXL emulation environments using OCEAN that mimic realistic multi-tenant deployments.&lt;/li>
&lt;li>Design and implement reproducible micro-benchmarks to measure timing, bandwidth contention, or observable interference through shared CXL memory pools.&lt;/li>
&lt;li>Analyze how fabric configuration choices (e.g., topology, latency injection, memory partitioning, or allocation policies) affect isolation and leakage behavior.&lt;/li>
&lt;li>Explore and prototype mitigation strategies—such as memory partitioning, throttling, or policy-driven allocation—and evaluate their effectiveness using the emulation platform.&lt;/li>
&lt;/ul></description></item><item><title>Omni-ST: Instruction-Driven Any-to-Any Multimodal Modeling for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci-ics/omni-st/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci-ics/omni-st/</guid><description>&lt;h2 id="project-description">Project description&lt;/h2>
&lt;p>Spatial transcriptomics (ST) integrates spatially resolved gene expression with tissue morphology, enabling the study of cellular organization, tissue architecture, and disease microenvironments. Modern ST datasets are inherently multimodal, combining histology images (H&amp;amp;E / IF), gene expression vectors, spatial graphs, cell annotations, and free-text pathology descriptions.&lt;/p>
&lt;p>However, most existing ST methods are task-specific and modality-siloed: separate models are trained for image-to-gene prediction, spatial domain identification, cell type classification, or text-based interpretation. This fragmentation limits cross-task generalization and scalability.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Omni-ST overview" srcset="
/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_4023f7915e2a557bcacee3aecd015061.webp 400w,
/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_8d4e33b30dc811f95fb70a843df58532.webp 760w,
/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_4023f7915e2a557bcacee3aecd015061.webp"
width="760"
height="664"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Omni-ST&lt;/strong> proposes a single &lt;strong>instruction-driven any-to-any multimodal backbone&lt;/strong> that treats each spatial transcriptomics modality as a “language” and formulates all tasks as:&lt;/p>
&lt;p>&lt;strong>Instruction + Input Modality → Output Modality&lt;/strong>&lt;/p>
&lt;p>Natural language is elevated from auxiliary metadata to a &lt;strong>unifying interface&lt;/strong> that specifies task intent, target modality, and biological context. This paradigm enables flexible, interpretable, and extensible spatial reasoning within a single model.&lt;/p>
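&lt;p>Concretely, each training sample can be thought of as an instruction-conditioned triplet. The modality tags and instruction strings below are illustrative only; the actual adapters and tokenization would be implemented in PyTorch.&lt;/p>

```python
# Sketch: every task reduced to "instruction + input modality -> output modality".
# Modality tags, sample IDs, and instructions here are illustrative assumptions;
# the real model would tokenize each modality through a lightweight adapter.
samples = [
    {"instruction": "Predict gene expression for this spot.",
     "input": ("image", "HnE_patch_0042"),
     "output": ("expression", "gene_vector_0042")},
    {"instruction": "Name the spatial domain of this expression profile.",
     "input": ("expression", "gene_vector_0042"),
     "output": ("text", "tumor boundary region")},
]

def route(sample):
    """A single backbone dispatches on the instruction plus modality tags."""
    in_mod, _ = sample["input"]
    out_mod, _ = sample["output"]
    return f"{in_mod} -> {out_mod}"

print([route(s) for s in samples])  # ['image -> expression', 'expression -> text']
```

&lt;p>The point of the uniform triplet is that adding a new task means adding new instructions and data, not a new task-specific head.&lt;/p>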
&lt;hr>
&lt;h3 id="project-idea-instruction-driven-any-to-any-modeling-for-spatial-transcriptomics">Project Idea: Instruction-Driven Any-to-Any Modeling for Spatial Transcriptomics&lt;/h3>
&lt;p>&lt;strong>Topics:&lt;/strong> spatial transcriptomics, multimodal learning, instruction tuning, computational pathology&lt;br>
&lt;strong>Skills:&lt;/strong> PyTorch, deep learning, Transformers, multimodal representation learning&lt;br>
&lt;strong>Difficulty:&lt;/strong> Hard&lt;br>
&lt;strong>Size:&lt;/strong> 350 hours&lt;/p>
&lt;p>&lt;strong>Mentor:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Xi Li&lt;/strong> (&lt;a href="mailto:xil43@uci.edu">xil43@uci.edu&lt;/a>)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Essential information:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Design a unified multimodal backbone with lightweight modality adapters for histology images, gene expression vectors, spatial graphs, and text.&lt;/li>
&lt;li>Use natural language instructions to condition model behavior, enabling any-to-any translation without task-specific heads.&lt;/li>
&lt;li>Support core tasks including image → gene expression prediction, gene expression → cell type / spatial domain identification, region → text-based biological explanation, and text-based spatial retrieval.&lt;/li>
&lt;li>Evaluate the model across multiple spatial transcriptomics tasks within a single framework, emphasizing generalization and interpretability.&lt;/li>
&lt;li>Develop visualization and interpretation tools such as spatial maps and language-grounded explanations.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Expected deliverables:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>An open-source PyTorch implementation of the Omni-ST framework.&lt;/li>
&lt;li>Unified multitask benchmarks for spatial transcriptomics.&lt;/li>
&lt;li>Visualization and interpretation tools for spatial predictions.&lt;/li>
&lt;li>Documentation and tutorials demonstrating how to add new tasks via instructions.&lt;/li>
&lt;/ul></description></item><item><title>StatWrap</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/northwestern/statwrap/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/northwestern/statwrap/</guid><description>&lt;p>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap&lt;/a> is a free and open-source assistive, non-invasive discovery and inventory tool to document research projects. It inventories project assets (e.g., code files, data files, manuscripts, documentation) and organizes information without additional input from the user. It also provides structure for users to add searchable and filterable notes connected to files to help communicate metadata about intent and analysis steps.&lt;/p>
&lt;p>At its core, StatWrap helps investigators identify and track changes in a research project as it evolves - which may affect reproducibility. For example: (1) people on the project can change over time, so processes may not be consistently executed due to transitions in employment; (2) data changes over time, due to accruing additional cases, adding new variables, or correcting mistakes in existing data; (3) software (e.g. used for data preparation and statistical analysis) evolves as it is edited, improved, and optimized; and (4) software can break or produce different results due to changes &amp;lsquo;under the hood&amp;rsquo; such as updates to statistical packages, compilers, or interpreters. StatWrap passively and actively documents these changes to support reproducibility.&lt;/p>
&lt;p>Additional information:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap home&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/stattag/statwrap" target="_blank" rel="noopener">StatWrap code (GitHub)&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="group-and-individual-customizations">Group and Individual Customizations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>configuration&lt;/code>, &lt;code>user interface&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: JavaScript, React&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, &lt;a href="mailto:ewhitley@northwestern.edu">Eric Whitley&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to expand the existing capabilities of StatWrap to provide more flexibility to individual users and groups. Currently, features within StatWrap such as the directory template for creating new projects and the reproducibility checklist are static, meaning everyone who downloads StatWrap has the same configuration. However, each user and team work differently and should be able to configure StatWrap to support their needs.&lt;/p>
&lt;p>When a user creates a new project, StatWrap provides a collection of project templates. These create a directory hierarchy, along with some seed files (e.g., a README.md file in the project root). Different groups have their own conventions for creating project directories. While StatWrap can be released with additional project templates defined, there are many situations in which users would want to keep their project template local. StatWrap should allow a user to create a project template configuration, either from scratch or seeded by the contents of an existing project. A user should then be able to export this configuration and share it with others, and other users should be able to import it into their instance of StatWrap.&lt;/p>
&lt;p>Similarly, StatWrap provides a reproducibility checklist that includes six existing checklist items. However, individual users and groups may have their own checklists, including institution-specific steps. Similar to the project template, a user should be able to configure additional items for the checklist. A user should be able to create a &amp;ldquo;checklist template&amp;rdquo; that can be used and applied in multiple projects. A specific project&amp;rsquo;s template should also be modifiable once the checklist has been created.&lt;/p>
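&lt;p>To make the idea concrete, one possible shape for a shareable checklist template is sketched below. This structure is entirely hypothetical; designing the actual configuration scheme (and its serialization for import/export) is part of the project.&lt;/p>

```python
# Hypothetical checklist-template structure; the real configuration scheme and
# its import/export format are what this project will design.
checklist_template = {
    "name": "Lab-default reproducibility checklist",
    "version": 1,
    "items": [
        {"id": "data-dictionary", "text": "Data dictionary documents all variables."},
        {"id": "seed-recorded", "text": "Random seeds are recorded for analyses."},
    ],
}

def merge_items(template, project_items):
    """Apply a shared template, then append project-specific checklist items."""
    return template["items"] + project_items

merged = merge_items(checklist_template,
                     [{"id": "irb", "text": "IRB approval archived."}])
print(len(merged))  # 3
```

&lt;p>Keeping templates as data rather than code would let them be exported, shared, and re-imported across StatWrap instances, matching the import/export tasks below.&lt;/p>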
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Developing a configuration scheme for New Project templates&lt;/li>
&lt;li>Provide a way for a user to import/export a template for New Projects&lt;/li>
&lt;li>Develop a configuration scheme for Reproducibility Checklist questions&lt;/li>
&lt;li>Provide a way for a user to import/export a template for the Reproducibility Checklist&lt;/li>
&lt;li>Develop a configuration scheme for asset (file) attributes&lt;/li>
&lt;li>Develop unit tests and conduct system testing&lt;/li>
&lt;/ul></description></item><item><title>Network Simulation Bridge • Enabling Interactive Network Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/nsb-network-models/</link><pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/nsb-network-models/</guid><description>&lt;p>The Network Simulation Bridge &amp;ndash; &lt;a href="https://github.com/nsb-ucsc/nsb" target="_blank" rel="noopener">NSB&lt;/a> &amp;ndash; is a network co-simulation framework that bridges together applications and network simulators. It enables students, researchers, and developers to prototype their applications and systems on simulated networks. It consists of a message server and client endpoint interfaces which together form a bridge, routing application message payloads through the network simulator. NSB is designed to be extensible through modular interfaces that serve to allow users to contribute new features and modules that suit evolving and emerging use cases. NSB is developed to be application-, network simulator-, and platform-agnostic so that users and developers are empowered to integrate any application front-end with any network simulator back-end, providing versatility and flexibility when used alongside other tools in larger systems and applications.&lt;/p>
&lt;p>NSB was created in-house by the &lt;a href="https://inrg.engineering.ucsc.edu/" target="_blank" rel="noopener">Inter-Networking Research Group&lt;/a> and is now being developed into a more full-featured open-source tool and ecosystem in partnership with the &lt;a href="https://ucsc-ospo.github.io/" target="_blank" rel="noopener">UCSC OSPO&lt;/a> and as part of the &lt;a href="https://www.nsf.gov/funding/opportunities/pose-pathways-enable-open-source-ecosystems" target="_blank" rel="noopener">NSF Pathways to Enable Open-Source Ecosystems&lt;/a> program. In this transition to a more polished and feature-rich product, the next phase of NSB development will involve the engineering of new quality-of-life features, testing and iteration of the core tool itself, and user-centric refinement via implementation in interdisciplinary system models.&lt;/p>
&lt;h3 id="develop-a-user-centric-website-for-nsb">Develop a User-Centric Website for NSB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>Dynamic Updates&lt;/code> &lt;code>UX&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> web development experience, good communicator, (HTML/CSS), (Javascript)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:hkuttive@ucsc.edu">Harikrishna Kuttivelil&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a clean and welcoming landing page and website for the project. The organization of the site needs to reflect the needs of both users and potential project contributors. This website will be the first impression for people new to the project and should be designed accordingly.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project and the expected needs of the users.&lt;/li>
&lt;li>Port relevant documentation and tutorials from the &lt;a href="https://github.com/nsb-ucsc/nsb" target="_blank" rel="noopener">repository page&lt;/a>, ensuring updates in the repository are reflected in the website.&lt;/li>
&lt;li>Study existing open source product websites and draw insights to include in our own design.&lt;/li>
&lt;li>Design the structure of the website following open-source, visual design, and accessibility best practices.&lt;/li>
&lt;li>Include visual content that showcases NSB integration and testimonials (if applicable).&lt;/li>
&lt;/ul>
&lt;h3 id="improve-the-user-experience-of-nsb">Improve the User Experience of NSB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Software Engineering&lt;/code> &lt;code>User-Centric Development&lt;/code> &lt;code>Visualization&lt;/code> &lt;code>UI/UX&lt;/code> &lt;code>Documentation&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> package management, toolchain implementation, process automation, technical writing, (visualization), (bash), (Python), (C++)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:hkuttive@ucsc.edu">Harikrishna Kuttivelil&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Our goal has always been to keep NSB streamlined and out of the way of the users and developers. In line with that, we want our tool to be easily available and installable, and we want the experience of using it to feel minimal and non-intrusive while providing sufficient observability of NSB&amp;rsquo;s internals for those who want it.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors and potential users to identify aspects of the user experience that can be refined for a better quality-of-life experience.&lt;/li>
&lt;li>Verify and iterate on existing software packaging methods for NSB to ensure that tool setup is stress-free.&lt;/li>
&lt;li>Refine and update existing documentation and tutorials to reflect improvements in the setup, installation, and usage processes.&lt;/li>
&lt;li>Work with mentors and other contributors to design the user interface by working backwards from what the user wants to see.&lt;/li>
&lt;li>Work with other contributors (see below) to develop a &lt;em>Network-in-a-Box&lt;/em> experience with NSB.&lt;/li>
&lt;/ul>
&lt;h3 id="create-a-network-in-a-box-experience-with-nsb">Create a &lt;em>Network-in-a-Box&lt;/em> Experience with NSB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Software Engineering&lt;/code>, &lt;code>Simulation&lt;/code>, &lt;code>System Modeling&lt;/code>, &lt;code>System Design&lt;/code>, &lt;code>Visualization&lt;/code>, &lt;code>UI/UX&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software integration and interfacing, toolchain implementation, process automation, C++, (visualization), (LLM-enabled code generation), (technical writing)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:hkuttive@ucsc.edu">Harikrishna Kuttivelil&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>NSB was originally designed for networking graduate students to interface with application-layer programs. But since then, there&amp;rsquo;s been more of an appetite for a simpler &lt;em>network-in-a-box&lt;/em> approach that would allow users to quickly deploy baseline or generated network simulations that are ready for use with NSB.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Learn how to use one of the major open-source network simulators (&lt;a href="https://www.nsnam.org/" target="_blank" rel="noopener">ns3&lt;/a> or &lt;a href="https://omnetpp.org/" target="_blank" rel="noopener">OMNeT++&lt;/a>).&lt;/li>
&lt;li>Work with mentors in designing a simpler, minimal user experience of operating NSB.&lt;/li>
&lt;li>Develop tools to automatically create network simulations given input parameters (type of network, number of nodes, description of infrastructure).&lt;/li>
&lt;li>Create documentation aimed at new users.&lt;/li>
&lt;li>Implement or embed network visualizations to enrich the user experience.&lt;/li>
&lt;/ul>
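&lt;p>The automatic simulation creation task above could start from a sketch like the following, which maps high-level parameters (network type, number of nodes) to a generic topology description. The schema and function name are hypothetical; a real generator would emit whatever the ns3 or OMNeT++ side of NSB expects:&lt;/p>

```python
import json
import random

def generate_topology(network_type, num_nodes, seed=0):
    """Map high-level parameters to a generic node/link description.
    The schema here is hypothetical; a real version would emit whatever
    the ns3 or OMNeT++ harness behind NSB expects."""
    rng = random.Random(seed)
    nodes = [f"n{i}" for i in range(num_nodes)]
    links = []
    if network_type == "star":
        # Hub-and-spoke: node 0 connects to everyone else.
        links = [(nodes[0], n) for n in nodes[1:]]
    elif network_type == "mesh":
        # Random partial mesh, reproducible via the seed.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if rng.random() > 0.5:
                    links.append((a, b))
    return {"type": network_type, "nodes": nodes, "links": links}

config = generate_topology("star", 5)
print(json.dumps(config, indent=2))
```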
&lt;h3 id="implement-networked-system-models-to-evaluate-quality-of-nsb">Implement Networked System Models to Evaluate Quality of NSB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>System Modeling&lt;/code> &lt;code>Simulation&lt;/code> &lt;code>System Design&lt;/code> &lt;code>Software Development&lt;/code> &lt;code>Product Testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software integration, good communication, qualitative research, (proficiency in Python and/or C++), (processing scientific and technical literature)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:hkuttive@ucsc.edu">Harikrishna Kuttivelil&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>NSB is a relatively new tool and has not been extensively tested outside of the core contributors, who know a bit too much about the tool. We need to better understand what the experience of external users and contributors will be like, and the best way to do that is to start developing with NSB to build models of connected systems, e.g., sensor networks, smart homes, and smart farms.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Research academic literature and related works to identify representative distributed applications to model.&lt;/li>
&lt;li>Work with mentors and collaborators to plan implementation of selected system models.&lt;/li>
&lt;li>Track and report issues and concerns in quality-of-life experiences, critical errors, or difficulties.&lt;/li>
&lt;li>Work with mentors and contributors to address issues and concerns.&lt;/li>
&lt;li>Refine and update existing documentation and tutorials to reflect improvements in the setup, installation, and usage processes.&lt;/li>
&lt;li>Work with other contributors (see below) in reviewing and cross-referencing model implementations.&lt;/li>
&lt;/ul>
&lt;h3 id="model-autonomous-vehicle-networks-to-drive-new-feature-development-in-nsb">Model Autonomous Vehicle Networks to Drive New Feature Development in NSB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>System Modeling&lt;/code> &lt;code>Simulation&lt;/code> &lt;code>System Design&lt;/code> &lt;code>Software Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> requirement-based software design, message parsing interfaces, server-client communication, (proficiency in Python and/or C++), (processing scientific and technical literature)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:hkuttive@ucsc.edu">Harikrishna Kuttivelil&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>NSB today serves its named purpose &amp;ndash; message relaying. However, modeling complex systems can sometimes involve synchronizing other simulation features, like &lt;em>mobility&lt;/em> when dealing with vehicle networks. Implementing a generic layer for synchronizing user-defined features across endpoints would be a powerful, enabling addition to NSB. In the process, we may also uncover opportunities for improving the NSB developer experience.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Research academic literature and relevant works to identify and design potential autonomous vehicle network models.&lt;/li>
&lt;li>Work with mentors and collaborators to iterate on system designs, ensuring they serve the purpose of furthering NSB development.&lt;/li>
&lt;li>Help mentors design and develop the &lt;em>new&lt;/em> feature-synchronization capability in NSB, driven by the autonomous vehicle system model.&lt;/li>
&lt;li>Develop and iterate on feature synchronization, using mobility as the synchronized feature.&lt;/li>
&lt;li>Create documentation and tutorials to serve as resources for future users, contributors, and developers.&lt;/li>
&lt;li>Work with other contributors (see above) in reviewing and cross-referencing model implementations.&lt;/li>
&lt;/ul></description></item><item><title>Peersky Browser</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/peersky/</link><pubDate>Mon, 26 Jan 2026 12:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/peersky/</guid><description>&lt;p>&lt;a href="https://peersky.p2plabs.xyz/" target="_blank" rel="noopener">Peersky Browser&lt;/a> is an experimental personal gatekeeper to a new way of accessing web content. In a world where a handful of big companies control most of the internet, Peersky leverages distributed web technologies—&lt;a href="https://ipfs.tech/" target="_blank" rel="noopener">IPFS&lt;/a>, &lt;a href="https://holepunch.to/" target="_blank" rel="noopener">Hypercore&lt;/a>, and &lt;a href="https://www.bittorrent.com/" target="_blank" rel="noopener">BitTorrent&lt;/a>—to return control to users. With integrated local P2P applications, Peersky offers a fresh, community-driven approach to browsing.&lt;/p>
&lt;h3 id="implement-p2p-extension-store">Implement P2P Extension Store&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Browser Extensions&lt;/code>, &lt;code>P2P&lt;/code>, &lt;code>Electron&lt;/code>, &lt;code>IPFS&lt;/code>, &lt;code>Hypercore&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> JavaScript, Electron.js, HTML/CSS, P2P&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/akhilesh-thite/">Akhilesh Thite&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build a decentralized extension distribution flow that archives WebExtensions into a predictable P2P-friendly layout and installs directly from P2P URLs.&lt;/p>
&lt;p>&lt;strong>Tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Define the P2P extension layout:&lt;/strong>
&lt;ul>
&lt;li>Standardize &lt;code>/extensions/{name}/{version}/extension.zip&lt;/code> and &lt;code>/extensions/{name}/index.json&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Design install compatibility for P2P URLs:&lt;/strong>
&lt;ul>
&lt;li>Support &lt;code>peersky://extensions/...&lt;/code> and P2P links from IPFS or Hypercore.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Archive Chrome Web Store extensions to P2P:&lt;/strong>
&lt;ul>
&lt;li>Use &lt;a href="https://github.com/akhileshthite/chrome-extension-fetch" target="_blank" rel="noopener">chrome-extension-fetch&lt;/a> to fetch CRX, convert to ZIP, and store it in the layout.&lt;/li>
&lt;li>Update &lt;code>index.json&lt;/code> with metadata like version, &lt;code>P2P_URL&lt;/code>, and &lt;code>fetchedAt&lt;/code>.&lt;/li>
&lt;li>Publish the folder to IPFS or Hypercore and feed the link into the install flow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Add settings and trust model:&lt;/strong>
&lt;ul>
&lt;li>Add a “Load from P2P” settings toggle.&lt;/li>
&lt;li>Support curated extension hoards (&lt;code>index.json&lt;/code>) and automated updates.&lt;/li>
&lt;li>Clarify integrity assumptions and sandboxing expectations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>More details in the issue:&lt;/strong> &lt;a href="https://github.com/p2plabsxyz/peersky-browser/issues/42" target="_blank" rel="noopener">https://github.com/p2plabsxyz/peersky-browser/issues/42&lt;/a>&lt;/p>
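&lt;p>As a rough sketch of the archiving step (in Python for illustration; the real implementation would live in Peersky&amp;rsquo;s Electron/JavaScript codebase, and only the directory layout and the &lt;code>version&lt;/code>, &lt;code>P2P_URL&lt;/code>, and &lt;code>fetchedAt&lt;/code> fields come from the tasks above):&lt;/p>

```python
import json
import tempfile
import zipfile
from datetime import datetime, timezone
from pathlib import Path

def archive_extension(root, name, version, files, p2p_url):
    """Write /extensions/{name}/{version}/extension.zip and update the
    per-extension index.json. Helper names are hypothetical; only the
    layout and the version/P2P_URL/fetchedAt fields come from the spec."""
    ver_dir = Path(root) / "extensions" / name / version
    ver_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(ver_dir / "extension.zip", "w") as zf:
        for arcname, data in files.items():
            zf.writestr(arcname, data)
    index_path = ver_dir.parent / "index.json"
    index = json.loads(index_path.read_text()) if index_path.exists() else {"name": name, "versions": {}}
    index["versions"][version] = {
        "version": version,
        "P2P_URL": p2p_url,
        "fetchedAt": datetime.now(timezone.utc).isoformat(),
    }
    index_path.write_text(json.dumps(index, indent=2))
    return index

# Demo in a throwaway directory with a placeholder P2P URL.
root = tempfile.mkdtemp()
index = archive_extension(root, "sample-ext", "1.0.0",
                          {"manifest.json": "{}"},
                          "ipfs://.../extensions/sample-ext/1.0.0/extension.zip")
print(json.dumps(index, indent=2))
```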
&lt;h3 id="backup--restore-system-p2p-json--tabs-restore">Backup &amp;amp; Restore System (P2P JSON + Tabs Restore)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>P2P&lt;/code>, &lt;code>Backup&lt;/code>, &lt;code>Session Restore&lt;/code>, &lt;code>Electron&lt;/code>, &lt;code>Onboarding&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> JavaScript, Electron.js, HTML/CSS, P2P&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/akhilesh-thite/">Akhilesh Thite&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement a backup and restore pipeline for Peersky’s P2P app data and session state, including an onboarding import flow for tabs from other browsers.&lt;/p>
&lt;p>&lt;strong>Tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Generate a P2P backup bundle:&lt;/strong>
&lt;ul>
&lt;li>Create a single &lt;code>.zip&lt;/code> that contains &lt;code>lastOpened.json&lt;/code>, &lt;code>tabs.json&lt;/code>, &lt;code>ensCache.json&lt;/code>, and the &lt;code>ipfs/&lt;/code> and &lt;code>hyper/&lt;/code> directories.&lt;/li>
&lt;li>Add an option to generate a CID for the backup zip for instant sharing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Restore from settings:&lt;/strong>
&lt;ul>
&lt;li>Upload a P2P backup zip file.&lt;/li>
&lt;li>Load a backup from an IPFS or Hyper CID.&lt;/li>
&lt;li>Import Chrome/Firefox tab exports produced by a helper extension.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Define the helper extension export format:&lt;/strong>
&lt;ul>
&lt;li>Create a small extension under &lt;a href="https://github.com/p2plabsxyz/" target="_blank" rel="noopener">p2plabsxyz&lt;/a> to export windows and tabs (URLs, titles, window grouping, active tab indexes).&lt;/li>
&lt;li>Ensure the export format is compatible with Peersky’s import pipeline.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Add onboarding import flow:&lt;/strong>
&lt;ul>
&lt;li>Show &lt;code>onboarding.html&lt;/code> on first launch and prompt “Import tabs from another browser?”.&lt;/li>
&lt;li>Guide users to install the helper extension and import the generated file.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Align with existing persistence:&lt;/strong>
&lt;ul>
&lt;li>Reuse &lt;code>lastOpened.json&lt;/code>, &lt;code>tabs.json&lt;/code>, and &lt;code>peersky-browser-tabs&lt;/code> localStorage for restores.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>More details in the issue:&lt;/strong> &lt;a href="https://github.com/p2plabsxyz/peersky-browser/issues/60" target="_blank" rel="noopener">https://github.com/p2plabsxyz/peersky-browser/issues/60&lt;/a>&lt;/p></description></item><item><title>Scenic: A Language for Design and Verification of Autonomous Cyber-Physical Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/scenic/</link><pubDate>Sat, 24 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/scenic/</guid><description>&lt;p>&lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a> is a probabilistic programming language for the design and verification of autonomous cyber-physical systems like self-driving cars.
Scenic allows users to define &lt;em>scenarios&lt;/em> for testing or training their system by putting a probability distribution on the system&amp;rsquo;s environment: the positions, orientations, and other properties of objects and agents, as well as their behaviors over time.
Sampling these scenarios and running them in a simulator yields synthetic data which can be used to train or test a system.
Since Scenic was released open-source in 2019, our group and many others in academia have used Scenic to find, diagnose, and fix bugs in autonomous cars, aircraft, robots, and other kinds of systems.
In industry, it is being used by companies including Boeing, Meta, Deutsche Bahn, and Toyota in domains spanning autonomous driving, aviation, household robotics, railways, maritime, and virtual reality.&lt;/p>
&lt;p>Our long-term goal is for Scenic to become a widely-used common representation and toolkit supporting the entire design lifecycle of AI-based cyber-physical systems.
Towards this end, we have many summer projects available, ranging from adding new application domains to working on the Scenic compiler and sampler:&lt;/p>
&lt;ol>
&lt;li>Extensions to the Scenic driving domain&lt;/li>
&lt;li>Interfacing Scenic to new simulators&lt;/li>
&lt;li>Scenic distribution visualizer&lt;/li>
&lt;/ol>
&lt;p>See the sections below for details.&lt;/p>
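&lt;p>To make the idea of a scenario as a distribution concrete, here is a toy stand-in for Scenic&amp;rsquo;s sampler in plain Python (this is not Scenic syntax; in the real toolkit a scenario is compiled from a &lt;code>.scenic&lt;/code> file and sampled with &lt;code>scenario.generate()&lt;/code>):&lt;/p>

```python
import random

def sample_scene(rng):
    """Toy stand-in for Scenic's sampler: draw one concrete scene from a
    distribution over object positions and headings. The properties and
    ranges here are invented for illustration."""
    ego_x = rng.uniform(-5.0, 5.0)          # lateral ego position
    lead_gap = rng.uniform(10.0, 30.0)      # lead car 10-30 m ahead
    lead_heading = rng.gauss(0.0, 5.0)      # small heading perturbation, degrees
    return {"ego_x": ego_x, "lead_gap": lead_gap, "lead_heading": lead_heading}

rng = random.Random(42)
scenes = [sample_scene(rng) for _ in range(3)]
for s in scenes:
    print(s)
```

Each draw is one concrete scene; running many draws through a simulator is what yields the synthetic training/testing data described above.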
&lt;h3 id="extensions-to-the-scenic-driving-domain">Extensions to the Scenic Driving Domain&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Autonomous Driving&lt;/code> &lt;code>3D modeling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python; basic vector geometry&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scenic scenarios written to test autonomous vehicles use the &lt;a href="https://docs.scenic-lang.org/en/latest/modules/scenic.domains.driving.html" target="_blank" rel="noopener">driving domain&lt;/a>, a Scenic library defining driving-specific concepts including cars, pedestrians, roads, lanes, and intersections.
The library extracts information about road networks, such as the shapes of lanes, from files in the standard &lt;a href="https://www.asam.net/standards/detail/opendrive/" target="_blank" rel="noopener">OpenDRIVE&lt;/a> format.&lt;/p>
&lt;p>There are several potential goals of this project, including:&lt;/p>
&lt;ul>
&lt;li>Supporting importing complex object information from simulators like CARLA.&lt;/li>
&lt;li>Extending the domain to incorporate additional metadata, such as highway entrances and exits.&lt;/li>
&lt;li>Fixing various bugs and limitations that exist in the driving domain (e.g. &lt;a href="https://github.com/BerkeleyLearnVerify/Scenic/issues/274" target="_blank" rel="noopener">Issue #274&lt;/a> and &lt;a href="https://github.com/BerkeleyLearnVerify/Scenic/issues/295" target="_blank" rel="noopener">Issue #295&lt;/a>).&lt;/li>
&lt;/ul>
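&lt;p>Much of the driving domain&amp;rsquo;s road-network code reduces to vector geometry of this kind. The following hypothetical sketch offsets lane centerlines from a road reference line, loosely mirroring what the library computes from OpenDRIVE lane-width records:&lt;/p>

```python
import math

def lane_centers(ref_line, lane_widths):
    """Offset each lane's centerline from the road reference line.
    ref_line: (x, y, heading-in-radians) samples; lane_widths: metres,
    ordered from the reference line outward. Illustrative only; real
    OpenDRIVE lanes carry polynomial width records and lane offsets."""
    centers = {i: [] for i in range(len(lane_widths))}
    for x, y, h in ref_line:
        nx, ny = math.sin(h), -math.cos(h)  # unit normal to the right
        offset = 0.0
        for i, w in enumerate(lane_widths):
            c = offset + w / 2.0            # centerline sits mid-lane
            centers[i].append((x + c * nx, y + c * ny))
            offset += w
    return centers

centers = lane_centers([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [3.5, 3.5])
print(centers)
```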
&lt;h3 id="interfacing-scenic-to-new-simulators">Interfacing Scenic to New Simulators&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Simulation&lt;/code> &lt;code>Autonomous Driving&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scenic is designed to be &lt;a href="https://docs.scenic-lang.org/en/latest/new_simulator.html" target="_blank" rel="noopener">easily interfaced to new simulators&lt;/a>.
Depending on student interest, we could pick a simulator which would open up new kinds of applications for Scenic and write an interface for it.
Some possibilities include:&lt;/p>
&lt;ul>
&lt;li>The &lt;a href="https://github.com/tier4/AWSIM" target="_blank" rel="noopener">AWSIM&lt;/a> driving simulator (to allow testing the &lt;a href="https://autoware.org/" target="_blank" rel="noopener">Autoware&lt;/a> open-source autonomous driving software stack)&lt;/li>
&lt;li>The &lt;a href="https://www.ipg-automotive.com/solutions/product-portfolio/carmaker/" target="_blank" rel="noopener">CarMaker&lt;/a> driving simulator&lt;/li>
&lt;/ul>
&lt;p>The goal of the project would be to create an interface between Scenic and the new simulator and write scenarios demonstrating it.
If time allows, we could do a case study on a realistic system for publication at an academic conference.&lt;/p>
&lt;h3 id="tool-to-visualize-scenario-distributions">Tool to Visualize Scenario Distributions&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Visualization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python; basic visualization and graphics&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>A Scenic scenario represents a distribution over scenes, but it can be difficult to interpret what exactly this distribution represents. Being able to visualize it would be helpful for understanding and reasoning about scenarios.&lt;/p>
&lt;p>The goal of this project would be to build on an existing prototype for visualizing these distributions, and to create a tool that can be used by the wider Scenic community.&lt;/p></description></item><item><title>CauST: Causal Gene Intervention for Robust Spatial Domain Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/caust/</link><pubDate>Wed, 21 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/caust/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> spatial transcriptomics, spatial domain identification, causal inference, gene intervention&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python (PyTorch preferred)&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> causal inference, representation learning, clustering&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> spatial transcriptomics preprocessing and evaluation (ARI, cross-slice generalization)&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> spatial transcriptomics, scRNA-seq, gene perturbation analysis&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lijinghua-zhang/">Lijinghua Zhang&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial domain identification is a core task in spatial transcriptomics (ST), aiming to segment tissue sections into biologically meaningful regions based on spatially resolved gene expression profiles. These spatial domains often correspond to anatomical layers, functional niches, or microenvironmental states, and are widely used as the basis for downstream biological interpretation.&lt;/p>
&lt;p>Despite strong empirical performance, most existing spatial domain identification methods rely on &lt;strong>purely correlational gene signals&lt;/strong>. Genes are selected or weighted based on association with spatial patterns, without distinguishing whether they &lt;em>causally drive&lt;/em> domain formation or merely reflect downstream or confounded effects. As a result, current models often suffer from limited robustness and poor generalization across tissue sections or donors.&lt;/p>
&lt;h3 id="problem-correlation-driven-gene-usage-and-limited-generalization">&lt;strong>Problem: Correlation-Driven Gene Usage and Limited Generalization&lt;/strong>&lt;/h3>
&lt;p>In standard pipelines, gene expression features are typically used wholesale or filtered using heuristic criteria (e.g., highly variable genes). However, many genes that are strongly correlated with spatial domains are not causally responsible for domain structure. Including such non-causal or confounded genes can:&lt;/p>
&lt;ul>
&lt;li>Reduce robustness across slices and donors&lt;/li>
&lt;li>Obscure true domain-driving biological signals&lt;/li>
&lt;li>Limit interpretability of spatial domain assignments&lt;/li>
&lt;/ul>
&lt;p>Empirically, domain identification performance often degrades substantially in cross-slice or cross-donor evaluation settings, underscoring the need for causally informed feature selection.&lt;/p>
&lt;h3 id="proposed-solution-caust">&lt;strong>Proposed Solution: CauST&lt;/strong>&lt;/h3>
&lt;p>This project proposes &lt;strong>CauST&lt;/strong>, a &lt;strong>Causal Gene Intervention framework&lt;/strong> for robust spatial domain identification.&lt;/p>
&lt;p>CauST aims to identify &lt;strong>domain-driving genes&lt;/strong> by estimating their causal influence on spatial domain assignments via &lt;strong>in-silico gene interventions&lt;/strong>. Instead of relying on observational correlations, CauST approximates counterfactual gene knockouts by perturbing individual gene expressions while controlling for confounding factors.&lt;/p>
&lt;p>In addition, CauST leverages &lt;strong>cross-slice invariance&lt;/strong> as a practical criterion for causal gene discovery, prioritizing genes whose effects on spatial domain identification remain stable across tissue sections and donors.&lt;/p>
&lt;p>By filtering or reweighting genes based on estimated causal influence, CauST improves the robustness, generalizability, and interpretability of spatial domain identification models.&lt;/p>
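&lt;p>The intervention idea can be sketched with a toy example: zero out one gene, re-run a (deliberately trivial) domain assigner, score the gene by the fraction of assignments that change, and take the minimum across slices as the invariance criterion. Every function below is an illustrative stand-in for the real model:&lt;/p>

```python
def assign_domains(expr):
    """Toy domain assigner: label each spot by its strongest gene.
    The real pipeline would use a trained clustering/representation model."""
    return [max(range(len(spot)), key=lambda g: spot[g]) for spot in expr]

def intervention_effect(expr, gene):
    """In-silico knockout: zero one gene, count changed assignments."""
    before = assign_domains(expr)
    knocked = [[0.0 if g == gene else v for g, v in enumerate(spot)] for spot in expr]
    after = assign_domains(knocked)
    changed = sum(1 for b, a in zip(before, after) if b != a)
    return changed / len(expr)

def invariant_scores(slices, num_genes):
    """Cross-slice criterion: a gene counts as domain-driving only if its
    effect is stable, here via the minimum effect across slices."""
    return {g: min(intervention_effect(s, g) for s in slices) for g in range(num_genes)}

# Two tiny slices, 3 genes per spot; gene 0 drives the first two spots.
slice_a = [[5.0, 1.0, 0.5], [4.0, 0.5, 1.0], [0.2, 3.0, 0.1]]
slice_b = [[6.0, 0.8, 0.3], [3.0, 1.0, 0.9], [0.1, 0.2, 4.0]]
scores = invariant_scores([slice_a, slice_b], 3)
print(scores)
```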
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Causal Gene Effect Estimation&lt;/strong>
&lt;ul>
&lt;li>Design in-silico intervention strategies to estimate gene-level causal effects on spatial domain assignments.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Invariant Effect Analysis&lt;/strong>
&lt;ul>
&lt;li>Identify genes with stable effects across tissue sections or donors.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Causal Gene Filtering&lt;/strong>
&lt;ul>
&lt;li>Develop filtering or reweighting schemes based on estimated causal influence.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Integration with Existing Methods&lt;/strong>
&lt;ul>
&lt;li>Integrate CauST into state-of-the-art spatial domain identification pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Validation&lt;/strong>
&lt;ul>
&lt;li>Benchmark robustness, cross-slice generalization, and interpretability on public spatial transcriptomics datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>CauST Framework Implementation&lt;/strong>
&lt;ul>
&lt;li>Open-source Python implementation compatible with common spatial transcriptomics toolchains.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Causal Gene Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Quantitative evaluation of causal gene filtering and its impact on domain identification.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing gene interventions, causal scores, and spatial effects.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear examples enabling adoption of CauST by the broader community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>CauST introduces a causally grounded perspective to spatial domain identification by explicitly modeling gene-level interventions. By shifting from correlation-driven gene usage to causal gene selection, this project improves robustness, generalizability, and biological interpretability in spatial transcriptomics analysis. CauST has the potential to serve as a foundational framework for integrating causal reasoning into spatial omics representation learning.&lt;/p></description></item><item><title>Agent4Target: An Agent-based Evidence Aggregation Toolkit for Therapeutic Target Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/agent4target/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/agent4target/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> therapeutic target identification, drug discovery, evidence aggregation, AI agents, biomedical knowledge integration&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; experience with modern ML tooling preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning / AI:&lt;/strong> agent-based systems, workflow orchestration, weak supervision (basic), representation learning&lt;/li>
&lt;li>&lt;strong>Software Engineering:&lt;/strong> modular system design, APIs, CLI tools, documentation&lt;/li>
&lt;li>&lt;strong>Biomedical Knowledge (preferred):&lt;/strong> familiarity with drug–target databases (e.g., PHAROS, DepMap, Open Targets)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Identifying and prioritizing high-quality therapeutic targets is a foundational yet challenging task in drug discovery. Modern target identification relies on aggregating heterogeneous evidence from multiple sources, including genetic perturbation screens, disease associations, chemical biology, and biomedical literature. These evidence sources are highly fragmented, noisy, and heterogeneous in both format and reliability.&lt;/p>
&lt;p>While large language models and AI agents have recently shown promise in automating scientific workflows, many existing approaches focus on end-to-end prediction or conversational interfaces. Such systems are often difficult to reproduce, extend, or integrate into existing research pipelines, limiting their practical adoption by the biomedical community.&lt;/p>
&lt;p>This project proposes &lt;strong>Agent4Target&lt;/strong>, an &lt;strong>agent-based evidence aggregation toolkit&lt;/strong> that reframes therapeutic target identification as a &lt;strong>structured, modular workflow&lt;/strong>. Instead of using agents for free-form reasoning, Agent4Target employs agents as &lt;strong>orchestrated components&lt;/strong> that systematically collect, normalize, score, and explain evidence supporting candidate therapeutic targets.&lt;/p>
&lt;p>The goal is to deliver a &lt;strong>reusable, open-source toolchain&lt;/strong> that can be integrated into diverse drug discovery workflows, independent of any single downstream prediction model or publication.&lt;/p>
&lt;hr>
&lt;h3 id="key-idea-and-technical-approach">&lt;strong>Key Idea and Technical Approach&lt;/strong>&lt;/h3>
&lt;p>Agent4Target models target identification as a multi-stage, agent-driven pipeline, coordinated by a central orchestrator:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Evidence Collector Agents&lt;/strong>&lt;br>
Specialized agents retrieve target-level evidence from heterogeneous sources, such as:&lt;/p>
&lt;ul>
&lt;li>Genetic perturbation and dependency data (e.g., DepMap)&lt;/li>
&lt;li>Target annotation and development status (e.g., PHAROS)&lt;/li>
&lt;li>Disease association scores (e.g., Open Targets)&lt;/li>
&lt;li>Automatically summarized literature evidence&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Normalization &amp;amp; Scoring Agent&lt;/strong>&lt;br>
Collected evidence is converted into a unified, structured schema using typed data models (e.g., JSON / Pydantic).&lt;br>
This agent performs:&lt;/p>
&lt;ul>
&lt;li>Evidence normalization across sources&lt;/li>
&lt;li>Confidence-aware scoring and aggregation&lt;/li>
&lt;li>Optional weighting or calibration strategies&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Explanation Agent&lt;/strong>&lt;br>
Rather than free-text generation, this agent produces &lt;strong>structured explanations&lt;/strong> that explicitly link scores to supporting evidence, enabling transparency and interpretability for downstream users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Workflow Orchestrator&lt;/strong>&lt;br>
A lightweight orchestration layer (e.g., LangGraph or a state-machine-based controller) manages agent execution, dependencies, and failure handling, ensuring reproducibility and extensibility.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>This modular design allows individual agents to be replaced, extended, or reused without altering the overall system.&lt;/p>
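&lt;p>A minimal sketch of the normalization and scoring stage, assuming a hypothetical unified schema (the real toolkit would use typed Pydantic models and live data-source agents rather than hard-coded records):&lt;/p>

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    """Unified record emitted by a collector agent. Field names are
    illustrative stand-ins for a typed (e.g. Pydantic) schema."""
    target: str
    source: str        # e.g. "depmap", "opentargets"
    raw_score: float   # source-native scale
    confidence: float  # 0-1 reliability weight for the source

def normalize(ev, source_ranges):
    """Map a source-native score onto [0, 1]. Ranges are given as
    (neutral, strongest), so reversed scales flip automatically."""
    lo, hi = source_ranges[ev.source]
    return (ev.raw_score - lo) / (hi - lo)

def aggregate(target, evidence, source_ranges):
    """Confidence-weighted mean of normalized scores, plus a structured
    explanation linking the score back to each piece of evidence."""
    rows = [e for e in evidence if e.target == target]
    total = sum(e.confidence for e in rows)
    score = sum(normalize(e, source_ranges) * e.confidence for e in rows) / total
    explanation = [{"source": e.source,
                    "normalized": round(normalize(e, source_ranges), 3),
                    "confidence": e.confidence} for e in rows]
    return {"target": target, "score": round(score, 3), "evidence": explanation}

ranges = {"depmap": (0.0, -2.0), "opentargets": (0.0, 1.0)}
ev = [Evidence("EGFR", "depmap", -1.5, 0.9),
      Evidence("EGFR", "opentargets", 0.8, 0.7)]
out = aggregate("EGFR", ev, ranges)
print(out)
```

Because each stage consumes and produces this one schema, any individual agent can be swapped out without touching the rest of the pipeline.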
&lt;hr>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Design a Modular Agent-based Architecture&lt;/strong>
&lt;ul>
&lt;li>Define clear interfaces for evidence collection, normalization, scoring, and explanation agents.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Implement a Standardized Evidence Schema&lt;/strong>
&lt;ul>
&lt;li>Develop a unified data model for heterogeneous target-level evidence.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Build a Reproducible Orchestration Framework&lt;/strong>
&lt;ul>
&lt;li>Implement a deterministic, inspectable workflow for agent coordination.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Deliver a Community-Ready Toolkit&lt;/strong>
&lt;ul>
&lt;li>Provide CLI tools, example notebooks, and clear documentation to support adoption.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark and Case Studies&lt;/strong>
&lt;ul>
&lt;li>Demonstrate the toolkit on representative target identification scenarios using public datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Open-Source Agent4Target Codebase&lt;/strong>
&lt;ul>
&lt;li>A well-documented Python package with modular agent components.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Command-Line Interface (CLI)&lt;/strong>
&lt;ul>
&lt;li>Tools for running end-to-end evidence aggregation pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Standardized Output Schema&lt;/strong>
&lt;ul>
&lt;li>Machine-readable evidence summaries suitable for downstream modeling.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Example Notebooks and Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Demonstrations of usage and performance on real-world target identification tasks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation&lt;/strong>
&lt;ul>
&lt;li>Installation guides, extension tutorials, and developer documentation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>Agent4Target provides a practical bridge between AI agents and real-world drug discovery workflows. By emphasizing structured evidence aggregation, reproducibility, and interpretability, this project enables researchers to systematically reason about therapeutic targets rather than relying on opaque, end-to-end models. The resulting toolkit can serve as a foundation for future work in AI-assisted drug discovery, weak supervision, and biomedical knowledge integration.&lt;/p></description></item><item><title>HistoMoE: A Histology-Guided Mixture-of-Experts Framework for Gene Expression Prediction</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/histomoe/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/histomoe/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computational pathology, spatial transcriptomics, gene expression prediction, mixture-of-experts, multimodal learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; experience with PyTorch preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> CNNs / vision encoders, mixture-of-experts, multimodal representation learning&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> handling large-scale histology image patches and gene expression matrices&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> familiarity with spatial transcriptomics or scRNA-seq data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Histology imaging is one of the most widely available data modalities in biomedical research and clinical practice, capturing rich morphological information about tissues and disease states. In parallel, spatial transcriptomics (ST) technologies provide spatially resolved gene expression measurements, enabling unprecedented insights into tissue organization and cellular heterogeneity. However, the high cost and limited accessibility of ST experiments remain a major barrier to their widespread adoption.&lt;/p>
&lt;p>Predicting gene expression directly from histology images offers a promising alternative, enabling molecular-level inference from routinely collected pathology data. Existing approaches typically rely on a single global model that maps image embeddings to gene expression profiles. While effective to some extent, these models struggle to capture the strong organ-, tissue-, and cancer-specific heterogeneity that underlies gene expression patterns.&lt;/p>
&lt;p>This project proposes &lt;strong>HistoMoE&lt;/strong>, a &lt;strong>histology-guided mixture-of-experts (MoE) framework&lt;/strong> that explicitly models biological heterogeneity by learning &lt;strong>specialized expert models&lt;/strong> for different cancer types or organs, and dynamically routing histology image patches to the most relevant experts.&lt;/p>
&lt;h3 id="key-idea-and-technical-approach">&lt;strong>Key Idea and Technical Approach&lt;/strong>&lt;/h3>
&lt;p>As illustrated in the figure above, HistoMoE integrates multiple data modalities and learning components:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Vision Encoder&lt;/strong>&lt;br>
Histology image patches are encoded into high-dimensional visual representations using a convolutional or transformer-based vision backbone.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Text / Metadata Encoder&lt;/strong>&lt;br>
Sample-level metadata (e.g., tissue type, organ, disease context) is encoded using a lightweight text or embedding model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Gating Network&lt;/strong>&lt;br>
A gating network jointly considers image and metadata embeddings to infer routing weights over multiple &lt;strong>cancer- or organ-specific expert models&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expert Models&lt;/strong>&lt;br>
Each expert specializes in modeling gene expression patterns for a specific biological context (e.g., CCRCC, COAD, LUAD), producing patch-level gene expression predictions.&lt;/p>
&lt;/li>
&lt;/ol>
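&lt;p>The four components above can be sketched in miniature. The toy routing below is a plain-Python illustration of mixture-of-experts prediction: a linear gate scores each expert from concatenated image and metadata features, and patch-level predictions are a gate-weighted sum of expert outputs. All feature sizes, weights, and expert functions are invented for demonstration; a real implementation would use learned PyTorch modules.&lt;/p>

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gate(image_feat, meta_feat, gate_weights):
    """Score each expert from concatenated image + metadata features
    (one illustrative linear weight vector per expert)."""
    feat = image_feat + meta_feat  # concatenation
    logits = [sum(w * f for w, f in zip(wv, feat)) for wv in gate_weights]
    return softmax(logits)

def moe_predict(image_feat, meta_feat, experts, gate_weights):
    """Route a patch: gate-weighted sum of expert gene-expression outputs."""
    weights = gate(image_feat, meta_feat, gate_weights)
    preds = [expert(image_feat) for expert in experts]
    n_genes = len(preds[0])
    return [sum(w * p[g] for w, p in zip(weights, preds))
            for g in range(n_genes)]

# Toy experts for three contexts (e.g. CCRCC, COAD, LUAD): each maps a
# 2-d patch feature to a 2-gene prediction via a fixed linear map.
experts = [
    lambda f: [f[0] + f[1], 0.0],
    lambda f: [0.0, f[0] - f[1]],
    lambda f: [0.5 * f[0], 0.5 * f[1]],
]
gate_w = [[1.0, 0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0, 1.0],
          [0.5, 0.5, 0.5, 0.5]]

pred = moe_predict([1.0, 0.0], [1.0, 0.0], experts, gate_w)
print(pred)
```

&lt;p>The gate weights double as an interpretability signal: inspecting which expert dominates for a given patch is exactly the routing analysis proposed in the objectives below.&lt;/p>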
&lt;p>By explicitly modeling biological structure through expert specialization, HistoMoE aims to improve both &lt;strong>prediction accuracy&lt;/strong> and &lt;strong>interpretability&lt;/strong>, allowing researchers to understand which biological experts drive each prediction.&lt;/p>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Design and Implement the HistoMoE Framework&lt;/strong>
&lt;ul>
&lt;li>Build a modular MoE architecture with pluggable vision encoders, gating networks, and expert models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Multimodal Routing and Expert Specialization&lt;/strong>
&lt;ul>
&lt;li>Explore how image features and metadata jointly inform expert selection.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmarking and Evaluation&lt;/strong>
&lt;ul>
&lt;li>Compare HistoMoE against single-model baselines on multiple cancer and organ-specific spatial transcriptomics datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Interpretability Analysis&lt;/strong>
&lt;ul>
&lt;li>Analyze expert routing behavior to reveal biologically meaningful patterns.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Open-Source HistoMoE Codebase&lt;/strong>
&lt;ul>
&lt;li>Well-documented Python implementation with training, evaluation, and visualization tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark Results&lt;/strong>
&lt;ul>
&lt;li>Quantitative comparisons demonstrating improvements over non-expert baselines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization and Analysis Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for inspecting expert usage, routing weights, and gene-level predictions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear instructions and examples to enable adoption by the research community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>HistoMoE introduces an expert-system perspective to histology-based gene expression prediction, bridging morphological and molecular representations through biologically informed specialization. By combining multimodal learning with mixture-of-experts modeling, this project advances the interpretability and accuracy of computational pathology methods and contributes toward scalable, cost-effective alternatives to spatial transcriptomics experiments.&lt;/p></description></item><item><title>StaR: A Stability-Aware Representation Learning Framework for Spatial Domain Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/star/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/star/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> spatial transcriptomics, spatial domain identification, representation learning, model robustness&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; PyTorch experience preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> representation learning, clustering, robustness and stability analysis&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> spatial transcriptomics preprocessing and evaluation (ARI, clustering metrics)&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> familiarity with spatial transcriptomics or scRNA-seq data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial domain identification is a fundamental task in spatial transcriptomics (ST), aiming to partition tissue sections into biologically meaningful regions based on spatially resolved gene expression profiles. These spatial domains often correspond to distinct anatomical structures, cellular compositions, or functional microenvironments, and serve as a critical foundation for downstream biological analysis.&lt;/p>
&lt;p>Despite rapid methodological progress, &lt;strong>most existing spatial domain identification methods are highly sensitive to random initialization&lt;/strong>. In practice, simply changing the random seed can lead to substantially different clustering results and large performance fluctuations, even when using identical hyperparameters and datasets. This instability severely undermines the reliability, reproducibility, and interpretability of spatial transcriptomics analyses.&lt;/p>
&lt;h3 id="problem-seed-sensitivity-and-unstable-representations">&lt;strong>Problem: Seed Sensitivity and Unstable Representations&lt;/strong>&lt;/h3>
&lt;p>Empirical evidence shows that state-of-the-art spatial domain identification models can exhibit substantial performance variance across random seeds. For example, the Adjusted Rand Index (ARI) may vary from relatively strong performance (e.g., ARI ≈ 0.65) to noticeably degraded yet still reasonable outcomes (e.g., ARI ≈ 0.50) solely due to different random initializations.&lt;/p>
&lt;p>By systematically evaluating models across &lt;strong>hundreds to thousands of random seeds&lt;/strong>, we observe that:&lt;/p>
&lt;ul>
&lt;li>Model performance landscapes are highly &lt;strong>rugged&lt;/strong>, with sharp cliffs and isolated high-performing regions.&lt;/li>
&lt;li>Standard training objectives implicitly favor brittle representations that are not robust to small perturbations in initialization or optimization trajectories.&lt;/li>
&lt;/ul>
&lt;p>These observations suggest that instability is not a peripheral issue, but rather a &lt;strong>structural limitation of current representation learning approaches&lt;/strong> for spatial transcriptomics.&lt;/p>
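&lt;p>Seed sensitivity of this kind can be quantified by re-running a model under many seeds and comparing the resulting labelings pairwise. The sketch below uses a simple unadjusted pair-agreement score as a stand-in for ARI and a random toy &amp;ldquo;model&amp;rdquo;; a real analysis would train the actual spatial-domain model once per seed and score with &lt;code>sklearn.metrics.adjusted_rand_score&lt;/code>.&lt;/p>

```python
import itertools
import random
import statistics

def pair_agreement(a, b):
    """Fraction of spot pairs on which two clusterings agree about
    same-cluster vs different-cluster membership. A simplified,
    unadjusted stand-in for the Rand index."""
    pairs = list(itertools.combinations(range(len(a)), 2))
    same = sum(1 for i, j in pairs
               if (a[i] == a[j]) == (b[i] == b[j]))
    return same / len(pairs)

def toy_cluster(n_spots, n_domains, seed):
    # Placeholder for a seed-sensitive model: real code would train a
    # spatial-domain model with this seed and return its labels.
    rng = random.Random(seed)
    return [rng.randrange(n_domains) for _ in range(n_spots)]

labels = [toy_cluster(50, 4, seed) for seed in range(20)]
scores = [pair_agreement(a, b)
          for a, b in itertools.combinations(labels, 2)]
print(f"mean pairwise agreement {statistics.mean(scores):.3f} "
      f"+/- {statistics.stdev(scores):.3f}")
```

&lt;p>High variance in such pairwise scores across seeds is precisely the instability StaR targets.&lt;/p>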
&lt;h3 id="proposed-solution-star">&lt;strong>Proposed Solution: StaR&lt;/strong>&lt;/h3>
&lt;p>This project proposes &lt;strong>StaR&lt;/strong>, a &lt;strong>Stability-Aware Representation Learning framework&lt;/strong> designed to explicitly address seed sensitivity in spatial domain identification.&lt;/p>
&lt;p>The core idea of StaR is to &lt;strong>learn representations that are robust to perturbations in model parameters and training dynamics&lt;/strong>, rather than optimizing solely for peak performance under a single random seed. Concretely, StaR introduces controlled noise or perturbations into the training process and encourages consistency across multiple perturbed model instances, guiding the model toward flatter and more stable regions of the parameter space.&lt;/p>
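&lt;p>The perturbation-consistency idea can be illustrated with a toy encoder: perturb the parameters a few times, embed the same input under each perturbed copy, and penalize disagreement between the resulting embeddings. Everything below (the linear &amp;ldquo;encoder&amp;rdquo;, the noise scale, the number of views) is an illustrative assumption, not StaR&amp;rsquo;s actual objective.&lt;/p>

```python
import random

def embed(params, x):
    """Toy one-layer 'encoder': embedding = params . x (illustration only)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in params]

def perturb(params, scale, rng):
    """Add small Gaussian noise to every weight: the controlled perturbation."""
    return [[w + rng.gauss(0.0, scale) for w in row] for row in params]

def consistency_loss(params, x, n_views=4, scale=0.05, seed=0):
    """Mean squared disagreement between embeddings from perturbed model
    copies; minimizing it steers training toward flatter, more stable
    regions of parameter space."""
    rng = random.Random(seed)
    views = [embed(perturb(params, scale, rng), x) for _ in range(n_views)]
    mean = [sum(v[d] for v in views) / n_views for d in range(len(views[0]))]
    return sum((v[d] - mean[d]) ** 2
               for v in views for d in range(len(mean))) / (n_views * len(mean))

params = [[0.5, -0.2], [0.1, 0.3]]
x = [1.0, 2.0]
print(consistency_loss(params, x))             # nonzero: views disagree under noise
print(consistency_loss(params, x, scale=0.0))  # exactly 0.0: no perturbation
```

&lt;p>In a full pipeline this term would be added to the usual reconstruction or contrastive objective, trading a little single-seed performance for cross-seed stability.&lt;/p>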
&lt;p>By prioritizing stability during representation learning, StaR aims to produce embeddings that:&lt;/p>
&lt;ul>
&lt;li>Yield consistent spatial domain assignments across random seeds&lt;/li>
&lt;li>Maintain competitive or improved clustering accuracy&lt;/li>
&lt;li>Better reflect underlying biological structure&lt;/li>
&lt;/ul>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Characterize Instability in Existing Methods&lt;/strong>
&lt;ul>
&lt;li>Systematically quantify seed sensitivity across popular spatial domain identification models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Develop Stability-Aware Training Objectives&lt;/strong>
&lt;ul>
&lt;li>Design perturbation-based or consistency-driven losses that encourage robust representations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Integrate StaR into Existing Pipelines&lt;/strong>
&lt;ul>
&lt;li>Apply StaR to widely used spatial transcriptomics workflows with minimal architectural changes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Benchmarking&lt;/strong>
&lt;ul>
&lt;li>Evaluate StaR using clustering metrics (e.g., ARI) and stability metrics across multiple datasets and random seeds.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Biological Validation&lt;/strong>
&lt;ul>
&lt;li>Assess whether stability-aware representations preserve biologically meaningful spatial patterns.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>StaR Framework Implementation&lt;/strong>
&lt;ul>
&lt;li>An open-source Python implementation compatible with common spatial transcriptomics toolchains.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Stability Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Comprehensive evaluations demonstrating reduced performance variance across seeds.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing performance landscapes, stability surfaces, and spatial domain consistency.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear examples enabling researchers to adopt StaR in their own analyses.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>StaR addresses a critical yet underexplored challenge in spatial transcriptomics: &lt;strong>model instability and poor reproducibility&lt;/strong>. By shifting the focus from single-run performance to stability-aware representation learning, this project improves the reliability and trustworthiness of spatial domain identification methods. StaR has the potential to become a foundational component in robust spatial transcriptomics pipelines and to inspire broader adoption of stability-aware principles in biological representation learning.&lt;/p></description></item><item><title>MedJEPA: Self-Supervised Medical Image Representation Learning with JEPA</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/medjepa/</link><pubDate>Mon, 19 Jan 2026 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/medjepa/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Medical image analysis is fundamental to modern healthcare, enabling disease diagnosis, treatment planning, and patient monitoring across diverse clinical applications. In radiology and pathology, deep learning models support automated detection of abnormalities, tumor segmentation, and diagnostic assistance. Medical imaging modalities including X-rays, CT scans, MRI, ultrasound, and histopathology slides generate vast amounts of unlabeled data that could benefit from self-supervised representation learning. Clinical applications include cancer detection and staging, cardiovascular disease assessment, neurological disorder diagnosis, and infectious disease screening. In drug discovery and clinical research, analyzing medical images helps evaluate treatment efficacy, predict patient outcomes, and identify biomarkers for disease progression. Telemedicine and point-of-care diagnostics benefit from AI-powered image analysis that extends expert-level interpretation to underserved regions. However, medical imaging faces unique challenges: limited labeled datasets due to expensive expert annotation, patient privacy concerns restricting data sharing, domain shift across different imaging equipment and protocols, and the need for models that generalize across hospitals and populations.&lt;/p>
&lt;p>Traditional medical image analysis relies heavily on supervised learning with manually annotated labels, creating bottlenecks due to the scarcity and cost of expert annotations. Existing self-supervised methods applied to medical imaging often employ complex training procedures with numerous heuristics (momentum encoders, stop-gradients, teacher-student architectures, and carefully tuned augmentation strategies) that may not translate well across different medical imaging modalities and clinical contexts. These approaches struggle with domain-specific challenges such as subtle pathological features, high-resolution images, 3D volumetric data, and the need for interpretable representations that clinicians can trust. To address these challenges, we propose MedJEPA: Self-Supervised Medical Image Representation Learning with Joint-Embedding Predictive Architecture, which leverages the theoretically grounded LeJEPA framework for 2D medical images and V-JEPA principles for medical video and volumetric data, creating a unified, scalable, and heuristics-free approach specifically tailored for medical imaging applications.&lt;/p>
&lt;p>By utilizing the principled JEPA frameworks with objectives like Sketched Isotropic Gaussian Regularization (SIGReg), MedJEPA eliminates complex training heuristics while learning clinically meaningful representations from unlabeled medical images. Unlike conventional self-supervised methods that require extensive hyperparameter tuning and may not generalize across medical imaging modalities, MedJEPA provides a clean, theoretically motivated framework with minimal hyperparameters that adapts to diverse medical imaging contexts, from chest X-rays to histopathology slides to cardiac MRI sequences. The learned representations can support downstream tasks including disease classification, lesion detection, organ segmentation, and survival prediction, while requiring significantly fewer labeled examples for fine-tuning. This approach democratizes access to state-of-the-art medical AI by enabling effective learning from the vast amounts of unlabeled medical imaging data available in hospital archives, addressing the annotation bottleneck that has limited progress in medical AI.&lt;/p>
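&lt;p>To give a flavor of the isotropy idea behind SIGReg, the sketch below projects embeddings onto random unit directions and penalizes projected means and variances that deviate from a standard Gaussian. This is a drastically simplified illustration of the concept, not the sketched statistical test used in the LeJEPA paper; all sizes and seeds are arbitrary.&lt;/p>

```python
import math
import random

def projection_penalty(embeddings, n_proj=8, seed=0):
    """Toy isotropy regularizer: project embeddings onto random unit
    directions and penalize projected mean/variance deviating from
    N(0, 1). A simplified illustration of SIGReg's idea only."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    penalty = 0.0
    for _ in range(n_proj):
        d = [rng.gauss(0, 1) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]  # random unit direction
        proj = [sum(c * e for c, e in zip(d, emb)) for emb in embeddings]
        mean = sum(proj) / len(proj)
        var = sum((p - mean) ** 2 for p in proj) / len(proj)
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / n_proj

# Embeddings close to an isotropic Gaussian should score lower than a
# collapsed (all-identical) embedding set.
rng = random.Random(1)
gaussian = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(256)]
collapsed = [[1.0, 1.0, 1.0, 1.0] for _ in range(256)]
print(projection_penalty(gaussian), projection_penalty(collapsed))
```

&lt;p>Collapsed embeddings, the classic failure mode of joint-embedding training, are heavily penalized because every projection has zero variance.&lt;/p>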
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to apply Joint-Embedding Predictive Architecture (JEPA) frameworks to medical image representation learning, addressing the critical challenge of learning from limited labeled medical data. Medical imaging generates enormous amounts of unlabeled data, but supervised learning approaches are bottlenecked by the scarcity and cost of expert annotations. Existing self-supervised methods often rely on complex heuristics that don&amp;rsquo;t generalize well across diverse medical imaging modalities, equipment vendors, and clinical protocols.&lt;/p>
&lt;p>This project will leverage the theoretically grounded LeJEPA framework for 2D medical images (X-rays, histopathology slides, fundus images) and V-JEPA principles for temporal and volumetric medical data (cardiac MRI sequences, CT scans, surgical videos). The core challenge lies in adapting these heuristics-free, stable frameworks to medical imaging&amp;rsquo;s unique characteristics: subtle pathological features requiring fine-grained representations, high-resolution images demanding efficient processing, domain shift across hospitals and equipment, and the need for interpretable features that support clinical decision-making. The learned representations will be evaluated on diverse downstream clinical tasks including disease classification, lesion detection, organ segmentation, and prognosis prediction, with emphasis on few-shot learning scenarios that reflect real-world annotation constraints. Below is an outline of the methodologies and models that will be developed in this project.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Medical Data Preparation&lt;/strong>:
Develop data processing pipelines for diverse medical imaging modalities, implementing DICOM/NIfTI parsing, standardized preprocessing, and efficient data loading for self-supervised pre-training.&lt;/p>
&lt;ul>
&lt;li>Prepare 2D medical image datasets:
&lt;ul>
&lt;li>Chest X-rays: ChestX-ray14, MIMIC-CXR, CheXpert for lung disease detection&lt;/li>
&lt;li>Histopathology: Camelyon16/17 (breast cancer), PCam (patch-level classification)&lt;/li>
&lt;li>Retinal imaging: EyePACS, APTOS (diabetic retinopathy), Messidor&lt;/li>
&lt;li>Dermatology: HAM10000, ISIC (skin lesion classification)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Prepare 3D volumetric and temporal medical data:
&lt;ul>
&lt;li>CT scans: LIDC-IDRI (lung nodules), Medical Segmentation Decathlon datasets&lt;/li>
&lt;li>MRI sequences: BraTS (brain tumors), ACDC (cardiac MRI), UK Biobank cardiac videos&lt;/li>
&lt;li>Medical video: surgical procedure videos, endoscopy recordings, ultrasound sequences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Implement medical imaging-specific preprocessing: intensity normalization, resolution standardization, handling of multi-channel medical images (different MRI sequences, RGB histopathology), and privacy-preserving anonymization.&lt;/li>
&lt;li>Design masking strategies appropriate for medical imaging: spatial masking for 2D images, volumetric masking for 3D scans, temporal masking for sequences, and anatomy-aware masking that respects organ boundaries.&lt;/li>
&lt;li>Create data loaders supporting high-resolution medical images, 3D volumes, and multi-modal inputs (e.g., multiple MRI sequences).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: JEPA Model Implementation for Medical Imaging&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Implement LeJEPA for 2D medical images:
&lt;ul>
&lt;li>Adapt the joint-embedding predictive architecture to medical image characteristics (high resolution, subtle features, domain-specific patterns)&lt;/li>
&lt;li>Apply Sketched Isotropic Gaussian Regularization (SIGReg) to learn clinically meaningful embedding distributions&lt;/li>
&lt;li>Maintain a single trade-off hyperparameter and heuristics-free training for reproducibility across medical imaging centers&lt;/li>
&lt;li>Support various encoder architectures: Vision Transformers for global context, ConvNets for local features, hybrid approaches&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Extend to V-JEPA for medical video and volumetric data:
&lt;ul>
&lt;li>Spatiotemporal encoding for cardiac MRI sequences, surgical videos, and time-series medical imaging&lt;/li>
&lt;li>Temporal prediction objectives for understanding disease progression and treatment response&lt;/li>
&lt;li>3D volume processing for CT and MRI scans with efficient memory management&lt;/li>
&lt;li>Multi-slice and multi-sequence learning for comprehensive medical imaging contexts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Develop medical domain-specific enhancements:
&lt;ul>
&lt;li>Multi-scale representation learning to capture both fine-grained pathological details and global anatomical context&lt;/li>
&lt;li>Interpretability mechanisms: attention visualization, feature attribution, and embedding-space analysis for clinical validation&lt;/li>
&lt;li>Robustness to domain shift: training strategies that generalize across different scanners, protocols, and institutions&lt;/li>
&lt;li>Privacy-preserving training considerations compatible with medical data regulations (HIPAA, GDPR)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Implement efficient training infrastructure:
&lt;ul>
&lt;li>Support for distributed training across multiple GPUs for large medical imaging datasets&lt;/li>
&lt;li>Memory-efficient processing of high-resolution images and 3D volumes&lt;/li>
&lt;li>Checkpoint management and model versioning for clinical deployment pipelines&lt;/li>
&lt;li>A minimal-code implementation (≈50-100 lines) demonstrating the framework&amp;rsquo;s simplicity&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation &amp;amp; Safety Validation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Disease Classification Tasks:
&lt;ul>
&lt;li>Multi-label chest X-ray classification: 14 pathology classes on ChestX-ray14 and MIMIC-CXR&lt;/li>
&lt;li>Diabetic retinopathy grading: 5-class classification on EyePACS and APTOS&lt;/li>
&lt;li>Skin lesion classification: 7-class classification on HAM10000&lt;/li>
&lt;li>Brain tumor classification: glioma grading on the BraTS dataset&lt;/li>
&lt;li>Evaluation with linear probing, few-shot learning (5-shot, 10-shot), and full fine-tuning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Lesion Detection and Segmentation:
&lt;ul>
&lt;li>Lung nodule detection on the LIDC-IDRI dataset&lt;/li>
&lt;li>Tumor segmentation on Medical Segmentation Decathlon tasks&lt;/li>
&lt;li>Polyp detection in colonoscopy videos&lt;/li>
&lt;li>Cardiac structure segmentation in MRI sequences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Clinical Prediction Tasks:
&lt;ul>
&lt;li>Survival prediction from histopathology slides&lt;/li>
&lt;li>Disease progression prediction from longitudinal imaging&lt;/li>
&lt;li>Treatment response assessment from pre/post imaging pairs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Few-Shot and Low-Data Regime Evaluation:
&lt;ul>
&lt;li>Systematic evaluation with 1%, 5%, 10%, 25%, and 50% of labeled training data&lt;/li>
&lt;li>Comparison against supervised baselines and ImageNet pre-training&lt;/li>
&lt;li>Analysis of annotation efficiency: performance vs. number of labeled examples required&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
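&lt;p>One of the most concrete pieces of Step 1, the masking strategy, can be sketched directly. The snippet below performs random block masking over a 2D patch grid: masked patches become JEPA-style prediction targets and the rest form the context. The block sizes and counts are illustrative choices, not tuned values from the JEPA papers.&lt;/p>

```python
import random

def block_mask(grid_h, grid_w, n_blocks=2, block=2, seed=0):
    """Random block masking over a 2D patch grid. Masked patches are the
    prediction targets; unmasked patches are the visible context."""
    rng = random.Random(seed)
    masked = set()
    for _ in range(n_blocks):
        top = rng.randrange(grid_h - block + 1)
        left = rng.randrange(grid_w - block + 1)
        for r in range(top, top + block):
            for c in range(left, left + block):
                masked.add((r, c))
    context = [(r, c) for r in range(grid_h) for c in range(grid_w)
               if (r, c) not in masked]
    return sorted(masked), context

targets, context = block_mask(8, 8)
print(len(targets), "target patches,", len(context), "context patches")
```

&lt;p>Volumetric and temporal variants would extend the same idea to a third grid axis (depth or time), and anatomy-aware masking would constrain blocks to respect organ boundaries.&lt;/p>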
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: software implementation, clinical evaluation, and practical deployment resources. The software implementing MedJEPA will be hosted on GitHub as an open-access repository with modular code supporting multiple medical imaging modalities (2D images, 3D volumes, videos), pre-trained model checkpoints on major medical imaging datasets (chest X-rays, histopathology, MRI), training and evaluation scripts with medical imaging-specific preprocessing pipelines, privacy-preserving training implementations compatible with clinical data regulations, and comprehensive documentation including tutorials for medical AI researchers and clinicians. The evaluation results will include benchmarks on 10+ medical imaging datasets across diverse modalities and clinical tasks, few-shot learning analysis demonstrating annotation efficiency gains, cross-institutional validation studies showing robustness to domain shift, interpretability visualizations enabling clinical validation of learned representations, and detailed comparisons against supervised baselines and existing medical self-supervised methods.&lt;/p>
&lt;h3 id="neurohealth">NeuroHealth&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Self-Supervised Medical Image Representation Learning with JEPA&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, PyTorch, GitHub, JEPA&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Advanced&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics - Randall Balestriero and Yann LeCun, arXiv 2025&lt;/li>
&lt;li>Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) - Adrien Bardes et al., arXiv 2024&lt;/li>
&lt;li>Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture - Mahmoud Assran et al., CVPR 2023 (I-JEPA)&lt;/li>
&lt;li>ChestX-ray14: Hospital-Scale Chest X-Ray Database - &lt;a href="https://nihcc.app.box.com/v/ChestXray-NIHCC" target="_blank" rel="noopener">https://nihcc.app.box.com/v/ChestXray-NIHCC&lt;/a>&lt;/li>
&lt;li>Medical Segmentation Decathlon - &lt;a href="http://medicaldecathlon.com/" target="_blank" rel="noopener">http://medicaldecathlon.com/&lt;/a>&lt;/li>
&lt;li>MIMIC-CXR Database - &lt;a href="https://physionet.org/content/mimic-cxr/" target="_blank" rel="noopener">https://physionet.org/content/mimic-cxr/&lt;/a>&lt;/li>
&lt;li>The Cancer Imaging Archive (TCIA) - &lt;a href="https://www.cancerimagingarchive.net/" target="_blank" rel="noopener">https://www.cancerimagingarchive.net/&lt;/a>&lt;/li>
&lt;li>UK Biobank Imaging Study - &lt;a href="https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/imaging-data" target="_blank" rel="noopener">https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/imaging-data&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>NeuroHealth: AI-Powered Health Assistant</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/neurohealth/</link><pubDate>Mon, 19 Jan 2026 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/neurohealth/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Intelligent health assistance systems are increasingly essential for improving healthcare accessibility, patient engagement, and clinical decision support. In primary care and preventive medicine, AI assistants help users understand symptoms, schedule appropriate appointments, and receive preliminary health guidance. Telemedicine applications include triage support, appointment scheduling optimization, and patient education based on health inquiries. In chronic disease management, these systems provide medication reminders, lifestyle recommendations, and timely alerts for medical follow-ups. Healthcare navigation applications include finding appropriate specialists, understanding treatment options, and coordinating care across multiple providers. In wellness and preventive care, intelligent assistants enhance health literacy by delivering personalized health information, screening recommendations, and proactive health management strategies. By leveraging natural language understanding and medical knowledge integration, these systems enhance healthcare access, reduce unnecessary emergency visits, and empower users to make informed health decisions across diverse populations.&lt;/p>
&lt;p>Traditional health information systems often provide generic responses that fail to account for individual health contexts, medical history, and personal circumstances. Existing symptom checkers and health chatbots primarily rely on rule-based logic or simple decision trees, limiting their ability to understand nuanced health inquiries, reason about complex symptom patterns, or provide contextually appropriate guidance. These systems struggle with interpreting ambiguous descriptions, adapting to users&amp;rsquo; health literacy levels, and generating personalized recommendations that account for individual medical constraints and preferences. To address these challenges, we propose NeuroHealth: AI-Powered Health Assistant, which leverages Large Language Models (LLMs) to create an intelligent conversational agent that synthesizes user health inquiries, symptom descriptions, and contextual information into actionable, personalized health guidance and appointment recommendations.&lt;/p>
&lt;p>By integrating LLM-based medical reasoning with structured clinical knowledge bases, NeuroHealth enhances symptom interpretation, appointment routing, and health education delivery. Unlike conventional systems that provide static responses from predetermined templates, NeuroHealth dynamically understands user intent, asks clarifying questions, assesses urgency levels, and generates appropriate recommendations, whether scheduling a doctor appointment, suggesting self-care measures, or directing users to emergency services. This fusion of LLM intelligence with validated medical knowledge enables a more accessible, adaptive, and helpful health assistance platform, bridging the gap between users seeking health information and appropriate medical care.&lt;/p>
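&lt;p>The grounding step described above can be sketched as a tiny retrieval-augmented prompt builder: rank knowledge-base snippets by word overlap with the user query, then assemble a prompt that instructs the LLM to answer only from retrieved content. The overlap scorer is a stand-in for the embedding-based retriever a real RAG system would use, and the snippet texts and prompt wording are purely illustrative, not real clinical guidance.&lt;/p>

```python
def retrieve(query, knowledge_base, k=2):
    """Rank knowledge-base snippets by word overlap with the user query.
    A toy stand-in for an embedding-based retriever."""
    q = set(query.lower().split())
    scored = sorted(
        knowledge_base,
        key=lambda doc: len(q.intersection(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, snippets):
    """Ground the LLM response in retrieved, validated health information."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Answer using only the validated information below. "
        "If urgent symptoms are described, advise contacting emergency services.\n"
        f"Context:\n{context}\nUser question: {query}"
    )

kb = [  # illustrative snippets, not real clinical guidance
    "Persistent chest pain can signal a cardiac emergency and needs urgent care.",
    "Mild seasonal allergies often respond to antihistamines and rest.",
    "Annual checkups are recommended for adults managing chronic conditions.",
]
snips = retrieve("I have persistent chest pain", kb)
print(build_prompt("I have persistent chest pain", snips))
```

&lt;p>Constraining generation to retrieved, validated snippets is what reduces hallucination risk; the safety-guardrail instruction in the prompt mirrors the urgency-escalation behavior described above.&lt;/p>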
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to develop an AI-Powered Health Assistant (NeuroHealth) to improve healthcare accessibility and patient engagement through intelligent conversational guidance. Healthcare systems face significant challenges in providing timely, personalized health information and connecting patients with appropriate care resources. Traditional symptom checkers and health information systems often deliver generic, rule-based responses that fail to account for individual contexts and struggle with natural language understanding.&lt;/p>
&lt;p>To address these limitations, this project will leverage Large Language Models (LLMs) to create an intelligent health assistant that understands user health inquiries, interprets symptom descriptions, assesses urgency, and provides personalized recommendations including doctor appointment suggestions, self-care guidance, and healthcare navigation support. The core challenge lies in designing NeuroHealth as a safe, accurate, and user-friendly system capable of natural conversation, medical knowledge retrieval, and appropriate response generation while maintaining clinical safety guardrails. Unlike conventional health chatbots that follow rigid conversation flows, NeuroHealth will reason over user inputs, ask clarifying questions, and dynamically adapt responses based on context, resulting in more helpful, accurate, and appropriate health assistance. Below is an outline of the methodologies and models that will be developed in this project.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Data Collection &amp;amp; Knowledge Base Construction&lt;/strong>:
Develop a comprehensive medical knowledge base integrating validated health information sources, symptom databases, condition descriptions, and appointment routing guidelines.
Collect and curate conversational health inquiry datasets from public medical Q&amp;amp;A forums, symptom checker logs, and healthcare chatbot interactions to create training and evaluation data.
Design structured representations for symptoms, conditions, urgency levels, and appointment recommendations to enable effective retrieval and reasoning.
Extract common health inquiry patterns, symptom descriptions, and user intent categories to inform conversation flow design.
Data sources can include public medical knowledge bases such as MedlinePlus, Mayo Clinic health information, clinical practice guidelines, and synthetic patient inquiry scenarios based on common healthcare use cases.
Implement data validation mechanisms to ensure medical accuracy and clinical safety compliance.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Model Development&lt;/strong>:
Design and implement an LLM-based conversational health assistant that integrates medical knowledge retrieval with natural language understanding and generation.
Develop a Retrieval-Augmented Generation (RAG) architecture that grounds LLM responses in validated medical information sources, reducing hallucination risks and ensuring factual accuracy.
Create prompt engineering strategies and reasoning frameworks that enable the system to: interpret symptom descriptions, assess urgency levels, ask appropriate clarifying questions, and generate personalized health guidance.
Implement a multi-component architecture including: intent recognition, symptom extraction, urgency assessment, appointment recommendation generation, and response formatting modules.
Develop clinical safety guardrails that detect high-risk scenarios requiring immediate medical attention and provide appropriate emergency guidance.
Design conversation management strategies that maintain context across multi-turn dialogues and adapt to users&amp;rsquo; health literacy levels.
The baseline architecture can leverage state-of-the-art models such as GPT-4, Claude, or open-source alternatives like Llama, Qwen, combined with medical knowledge retrieval systems.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation &amp;amp; Safety Validation&lt;/strong>:
Benchmark NeuroHealth against existing symptom checkers and health chatbots, evaluating on metrics including response accuracy, appropriateness of appointment recommendations, urgency assessment precision, and user satisfaction.
Conduct human evaluation studies with healthcare professionals to assess clinical safety, response quality, and appropriateness of medical guidance.
Perform adversarial testing to identify potential failure modes, unsafe responses, or inappropriate recommendations under edge cases.
Conduct ablation studies to analyze the impact of retrieval-augmented generation, safety guardrails, and conversation management strategies on system performance.
Evaluate system performance across diverse health inquiry types including acute symptoms, chronic condition management, preventive care questions, and healthcare navigation requests.
Assess response quality across different user demographics and health literacy levels to ensure equitable access.
Optimize inference efficiency and response latency for real-time conversational interaction across web and mobile platforms.&lt;/p>
&lt;/li>
&lt;/ul>
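&lt;p>The retrieve-assess-respond flow described in Step 2 can be sketched in miniature. The snippet below is a toy illustration only: the knowledge snippets, urgency keywords, and function names are invented for the example and are not validated medical content. A real implementation would pass the retrieved context into an LLM prompt rather than returning it verbatim.&lt;/p>

```python
# Toy sketch of a RAG-style health-assistant pipeline: keyword retrieval,
# urgency assessment, and response assembly. All content is placeholder.

KNOWLEDGE_BASE = {
    "headache": "Most headaches are tension-type and resolve with rest.",
    "chest pain": "Chest pain can signal a cardiac emergency.",
}

EMERGENCY_TERMS = {"chest pain", "shortness of breath"}

def retrieve(query: str) -> str:
    """Return the knowledge snippet whose key appears in the query."""
    for symptom, snippet in KNOWLEDGE_BASE.items():
        if symptom in query.lower():
            return snippet
    return "No matching entry; ask a clarifying question."

def assess_urgency(query: str) -> str:
    """Flag queries containing emergency terms as high urgency."""
    q = query.lower()
    return "emergency" if any(t in q for t in EMERGENCY_TERMS) else "routine"

def respond(query: str) -> dict:
    """Assemble a grounded response; a real system would hand the
    retrieved snippet to an LLM with safety guardrails applied."""
    return {"context": retrieve(query), "urgency": assess_urgency(query)}
```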
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: model development, evaluation and validation, and interactive demonstration. The software implementing the NeuroHealth system will be hosted on GitHub as an open-access repository with comprehensive documentation, deployment guides, and API specifications. The evaluation results, including benchmark comparisons against existing systems, clinical safety assessments, and user study findings, will be published alongside the GitHub repository. An interactive demo showcasing the conversational interface, symptom interpretation capabilities, and appointment recommendation generation will be provided to illustrate real-world application scenarios.&lt;/p>
&lt;h3 id="neurohealth">NeuroHealth&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: AI-Powered Health Assistant&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, GitHub, and LLMs&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>Large Language Models in Healthcare - Singhal et al., Nature 2023&lt;/li>
&lt;li>Med-PaLM: Large Language Models for Medical Question Answering - Singhal et al., arXiv 2022&lt;/li>
&lt;li>Capabilities of GPT-4 on Medical Challenge Problems - Nori et al., arXiv 2023&lt;/li>
&lt;li>MedlinePlus Medical Encyclopedia - &lt;a href="https://medlineplus.gov/" target="_blank" rel="noopener">https://medlineplus.gov/&lt;/a>&lt;/li>
&lt;li>Clinical Practice Guidelines Database - &lt;a href="https://www.guidelines.gov/" target="_blank" rel="noopener">https://www.guidelines.gov/&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LMS Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/lms-toolkit/</link><pubDate>Tue, 13 Jan 2026 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/lms-toolkit/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/lms-toolkit" target="_blank" rel="noopener">EduLinq LMS Toolkit&lt;/a> is a suite of tools used by several courses at UCSC
to interact with LMSs (e.g., Canvas) from the command line or Python.
A &lt;a href="https://en.wikipedia.org/wiki/Learning_management_system" target="_blank" rel="noopener">Learning Management System&lt;/a> (LMS) is a system that institutions use to manage courses, assignments, students, and grades.
The most popular LMSs are
&lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>,
&lt;a href="https://en.wikipedia.org/wiki/Blackboard_Learn" target="_blank" rel="noopener">Blackboard&lt;/a>,
&lt;a href="https://en.wikipedia.org/wiki/Moodle" target="_blank" rel="noopener">Moodle&lt;/a>,
and &lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">Brightspace&lt;/a>.
These tools can be very helpful, especially from an administrative standpoint, but can be hard to interact with.
They can be especially difficult when instructors and TAs want to do something that is not explicitly supported by their built-in GUIs
(e.g., when an instructor wants to use a special grading policy).
The LMS Toolkit project is an effort to create a single suite of command-line tools (along with a Python interface)
to connect to all the above mentioned LMSs in a simple and uniform way.
So, not only can instructors and TAs easily access and modify the data held in an LMS (like a student&amp;rsquo;s grades),
but they can also do it the same way on any LMS.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Quiz Composer.&lt;/p>
&lt;p>Currently, the LMS Toolkit supports Canvas, Moodle, and Blackboard.
But, the degree of support for each LMS varies.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2026 should fill out &lt;a href="https://forms.gle/Mr4YR3N35pWDb4uz7" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2026).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="advanced-lms-support">Advanced LMS Support&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The LMS Toolkit already has basic read-write support for many core pieces of LMS functionality (e.g., working with grades and assignments).
However, there are still many more features that could be supported, such as
&lt;a href="https://github.com/edulinq/lms-toolkit/issues/17" target="_blank" rel="noopener">group management&lt;/a>,
&lt;a href="https://github.com/edulinq/lms-toolkit/issues/7" target="_blank" rel="noopener">quiz management&lt;/a>,
&lt;a href="https://github.com/edulinq/lms-toolkit/issues/10" target="_blank" rel="noopener">quiz statistics&lt;/a>,
and &lt;a href="https://github.com/edulinq/lms-toolkit/issues/19" target="_blank" rel="noopener">assignment statuses&lt;/a>.&lt;/p>
&lt;p>The task for this project is to choose a set of advanced features
(not limited to those features mentioned above),
design an LMS-agnostic way to support those features,
and implement those features.
The flexibility in the features chosen to implement accounts for the variable size of this project.&lt;/p>
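&lt;p>One possible shape for an LMS-agnostic feature such as group management is a neutral data model plus per-LMS translation functions. The field names below are guesses for illustration, not the toolkit&amp;rsquo;s real schema.&lt;/p>

```python
# Sketch of an LMS-agnostic feature: one neutral representation,
# translated into each platform's (hypothetical) payload format.
from dataclasses import dataclass

@dataclass
class Group:
    name: str
    members: list

def to_canvas_payload(group: Group) -> dict:
    # Canvas-flavored key names are hypothetical.
    return {"name": group.name, "user_emails": group.members}

def to_moodle_payload(group: Group) -> dict:
    # Moodle-flavored key names are hypothetical.
    return {"groupname": group.name, "userids": group.members}
```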
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/lms-toolkit" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>GitHub Issues
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/lms-toolkit/issues/17" target="_blank" rel="noopener">Group Management&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/lms-toolkit/issues/7" target="_blank" rel="noopener">Quiz Management&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/lms-toolkit/issues/10" target="_blank" rel="noopener">Quiz Statistics&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/lms-toolkit/issues/19" target="_blank" rel="noopener">Assignment Statuses&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="new-lms-support-brightspace">New LMS Support: Brightspace&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
&lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">D2L Brightspace&lt;/a> is one of the more popular LMSs.
Naturally, the LMS Toolkit wants to support Brightspace as well.
However, a challenge in supporting Brightspace is that it is not open source (unlike Canvas and Moodle).
Therefore, support and testing on Brightspace may be very challenging.&lt;/p>
&lt;p>The task for this project is to add basic support for the Brightspace LMS.
It is not necessary to support all the same features that are supported for other LMSs,
but at least the core features of score and assignment management should be implemented.
The closed-source nature of Brightspace makes this a challenging and uncertain project.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/lms-toolkit" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">Brightspace Wiki Page&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/lms-toolkit/issues/23" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Lynx Grader</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/autograder/</link><pubDate>Tue, 13 Jan 2026 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/autograder/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">EduLinq Lynx Grader&lt;/a> (also referred to as &amp;ldquo;autograder&amp;rdquo;) is an open source tool used by several courses at UCSC
to safely and quickly grade programming assignments.
Grading student code is something that may seem simple at first (you just need to run their code!),
but quickly becomes exceedingly complex as you get deeper into the details.
Specifically, grading a student&amp;rsquo;s code securely while providing the &amp;ldquo;last mile&amp;rdquo; service of getting code from students
and sending results to instructors/TAs and the course&amp;rsquo;s LMS (e.g., Canvas) can be very difficult.
The Lynx Grader provides all of this in a free and open source project.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Lynx Grader.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2026 should fill out &lt;a href="https://forms.gle/Mr4YR3N35pWDb4uz7" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2026).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="llm-detection">LLM Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>AI/ML&lt;/code> &lt;code>LLM&lt;/code> &lt;code>Research&lt;/code> &lt;code>Backend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, systems, data munging, go, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener">Large Language Model (LLM)&lt;/a> tools like ChatGPT become more common and powerful,
instructors need tools to help determine if students are the actual authors of the code they submit.
More classical instances of plagiarism are often discovered by code similarity tools like &lt;a href="https://theory.stanford.edu/~aiken/moss/" target="_blank" rel="noopener">MOSS&lt;/a>.
However, these tools are not sufficient for detecting code written not by a student,
but by an AI model like &lt;a href="https://en.wikipedia.org/wiki/ChatGPT" target="_blank" rel="noopener">ChatGPT&lt;/a> or &lt;a href="https://en.wikipedia.org/wiki/GitHub_Copilot" target="_blank" rel="noopener">GitHub Copilot&lt;/a>.&lt;/p>
&lt;p>The task for this project is to create a system that provides a score indicating the system&amp;rsquo;s confidence that a given piece of code was written by an AI tool and not a student.
This will supplement the existing code analysis tools in the Lynx Grader.
There are many approaches to completing this task that will be considered.
A more software-development-oriented approach can consist of leveraging existing systems to create a production-ready system,
whereas a more research-oriented approach can consist of creating a novel method complete with a paper and experiments.&lt;/p>
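&lt;p>As a toy illustration of the &amp;ldquo;confidence score&amp;rdquo; idea, the snippet below combines a few stylometric features into a 0&amp;ndash;1 score. The features and weights are placeholders invented for the example; a real system would learn them from labeled human-written and AI-written code.&lt;/p>

```python
# Toy stylometric scorer: a stand-in for a learned AI-code detector.
def ai_likelihood(code: str) -> float:
    lines = [l for l in code.splitlines() if l.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for l in lines if l.lstrip().startswith("#"))
    comment_density = comments / len(lines)
    avg_len = sum(len(l) for l in lines) / len(lines)
    # Heavily commented, uniformly formatted code scores higher here,
    # purely as a stand-in for learned features.
    score = 0.6 * comment_density + 0.4 * min(avg_len / 80.0, 1.0)
    return min(score, 1.0)
```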
&lt;p>There has been &lt;a href="https://github.com/anvichip/AI-code-detection-ML/blob/main/experiment/report.md" target="_blank" rel="noopener">previous work on this issue&lt;/a>,
where a student did a survey of existing solutions, collection of initial datasets, and exploratory experiments on possible directions.
This project would build off of this previous work.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">Repository for Lynx Grader Server&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/140" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="code-analysis-gui">Code Analysis GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, data munging, js, css, go&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Lynx Grader has existing functionality to analyze the code in a student&amp;rsquo;s submission for malicious content.
Relevant to this project is that the Lynx Grader can run a pairwise similarity analysis against all submitted code.
This is how most existing software plagiarism systems detect offending code.
The existing infrastructure provides detailed statistics on code similarity,
but does not currently have a visual way to display this data.&lt;/p>
&lt;p>The task for this project is to create a web GUI using the Lynx Grader REST API
to display the results of a code analysis.
The size of this project depends on how many of the existing features are going to be supported by the web GUI.&lt;/p>
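&lt;p>The pairwise similarity data such a GUI would visualize can be sketched with the standard library&amp;rsquo;s &lt;code>difflib&lt;/code>, used here as a stand-in for the Lynx Grader&amp;rsquo;s real analysis engine and output schema.&lt;/p>

```python
# Compute a similarity score for every unordered pair of submissions,
# the kind of matrix a code-analysis GUI would render as a heatmap.
import difflib
from itertools import combinations

def pairwise_similarity(submissions: dict) -> dict:
    """Map each (student_a, student_b) pair to a similarity in [0, 1]."""
    scores = {}
    for (a, code_a), (b, code_b) in combinations(sorted(submissions.items()), 2):
        ratio = difflib.SequenceMatcher(None, code_a, code_b).ratio()
        scores[(a, b)] = round(ratio, 3)
    return scores
```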
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Lynx Grader Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/142" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/internal/model/analysis.go#L78" target="_blank" rel="noopener">Pairwise Code Analysis Type&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py/blob/v0.6.16/tests/api/testdata/courses/assignments/analysis/courses_assignments_submissions_analysis_pairwise_wait.json" target="_blank" rel="noopener">Sample API Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="web-gui">Web GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, js, css&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Lynx Grader contains dozens of &lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">API endpoints&lt;/a>,
most directly representing a piece of functionality exposed to the user.
All of these features are exposed in the &lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Lynx Grader&amp;rsquo;s Python Interface&lt;/a>.
However, the Python interface is a purely command-line interface.
And although command-line interfaces are objectively (read: subjectively) the best,
a web GUI would be more accessible to a wider audience.
The autograder already has a &lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">web GUI&lt;/a>,
but it does not cover all the features available in the Lynx Grader.&lt;/p>
&lt;p>The task for this project is to augment the Lynx Grader&amp;rsquo;s web GUI with more features.
Specifically, add support for more tools used to create and administer courses.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Lynx Grader Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/61" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">Lynx Grader API Endpoints&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Lynx Grader&amp;rsquo;s Python Interface&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Quiz Composer</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/quiz-composer/</link><pubDate>Tue, 13 Jan 2026 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/quiz-composer/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/quiz-composer" target="_blank" rel="noopener">EduLinq Quiz Composer&lt;/a> (also called the &amp;ldquo;Quiz Generator&amp;rdquo;) is a tool used by several courses at UCSC
to create and maintain platform-agnostic quizzes (including exams and worksheets).
Knowledge assessments like quizzes, exams, and tests are a core part of the learning process for many courses.
However, maintaining banks of questions, collaborating on new questions, and converting quizzes to new formats can consume a lot of time,
taking attention away from actually improving course materials.
The Quiz Composer helps by providing a single text-based format that can be stored in a repository and &amp;ldquo;compiled&amp;rdquo; into many different formats including:
HTML, LaTeX, PDF, Canvas, GradeScope, and QTI.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Quiz Composer.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2026 should fill out &lt;a href="https://forms.gle/Mr4YR3N35pWDb4uz7" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2026).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="canvas-import">Canvas Import&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Quiz Composer houses quizzes and quiz questions in a simple and unambiguous format based
on &lt;a href="https://en.wikipedia.org/wiki/JSON" target="_blank" rel="noopener">JSON&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Markdown" target="_blank" rel="noopener">Markdown&lt;/a> (specifically, the &lt;a href="https://commonmark.org" target="_blank" rel="noopener">CommonMark specification&lt;/a>).
This allows the Quiz Composer to unambiguously create versions of the same quiz in many different formats.
However, creating a quiz in the Quiz Composer format can be a daunting task for those not familiar with JSON or Markdown.
Instead, it would be easier for people to import quizzes from another format into the Quiz Composer format,
and then edit them as they see fit.
Unfortunately, not all other quiz formats (namely Canvas, in this case) are unambiguous.&lt;/p>
&lt;p>The task for this project is to implement the functionality of importing quizzes from Canvas to the standard Quiz Composer format.
The ambiguous nature of Canvas quizzes makes this task non-trivial
and adds an element of design decision-making to it.
It will be impossible to import quizzes 100% correctly,
but we want to be able to get close enough that most people can import their quizzes without issue.&lt;/p>
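&lt;p>The import step can be pictured as normalizing an ambiguous source question into an explicit target record. Both schemas in the snippet below are invented for the example; the real Canvas and Quiz Composer formats differ.&lt;/p>

```python
# Sketch of an importer: resolve or flag ambiguity while converting a
# source-format question into an explicit, unambiguous record.
def import_question(canvas_q: dict) -> dict:
    answers = canvas_q.get("answers", [])
    correct = [a["text"] for a in answers if a.get("weight", 0) > 0]
    if len(correct) != 1:
        # Ambiguity the importer must resolve or flag for human review.
        raise ValueError("expected exactly one correct answer")
    return {
        "prompt": canvas_q["question_text"],
        "choices": [a["text"] for a in answers],
        "answer": correct[0],
    }
```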
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quiz-composer" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quiz-composer/issues/27" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="google-forms-export">Google Forms Export&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Quiz Composer can export quizzes to many different formats,
each with a varying level of interactivity and feature support.
For example, quizzes can be exported to PDFs that are printed out, with students writing down their answers to be checked later.
Quizzes can also be exported to interactive platforms like Canvas where students can enter answers that may be automatically checked with feedback immediately provided to the student.
One potential platform with functionality somewhere between the above two examples is &lt;a href="https://workspace.google.com/products/forms/" target="_blank" rel="noopener">Google Forms&lt;/a>.
&amp;ldquo;Forms&amp;rdquo; (an entity on Google Forms) can be something like a survey or (more recently) a quiz.&lt;/p>
&lt;p>The task for this project is to add support for exporting quizzes from the Quiz Composer to Google Forms.
There is a large overlap in the quiz features supported in Canvas (which the Quiz Composer already supports) and Google Forms,
so most settings should be fairly straightforward.
There may be some design work around deciding what features are specific to one quiz platform
and what features can be abstracted to work across several platforms.&lt;/p>
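&lt;p>The design question raised above, which quiz settings are shared across platforms and which are platform-specific, can be sketched as a simple partition. The feature sets below are invented for illustration.&lt;/p>

```python
# Partition quiz settings into portable, platform-specific, and unknown
# buckets; the feature names here are hypothetical examples.
SHARED = {"shuffle_answers", "time_limit"}
CANVAS_ONLY = {"one_question_at_a_time"}
FORMS_ONLY = {"collect_email"}

def split_features(settings: dict) -> tuple:
    """Return (portable, platform_specific, unknown) setting dicts."""
    portable, specific, unknown = {}, {}, {}
    for key, value in settings.items():
        if key in SHARED:
            portable[key] = value
        elif key in CANVAS_ONLY or key in FORMS_ONLY:
            specific[key] = value
        else:
            unknown[key] = value
    return portable, specific, unknown
```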
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quiz-composer" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quiz-composer/issues/19" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="template-questions">Template Questions&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, data munging, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Questions in the Quiz Composer are described using &lt;a href="https://en.wikipedia.org/wiki/JSON" target="_blank" rel="noopener">JSON&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Markdown" target="_blank" rel="noopener">Markdown&lt;/a>
files which contain the question prompt, possible answers, and the correct answer.
(Of course, there are many different &lt;a href="https://github.com/edulinq/quiz-composer/blob/main/docs/question-types.md" target="_blank" rel="noopener">question types&lt;/a>,
each with different semantics and requirements.)
However, a limitation of this is that each question is always the same.
You can have multiple copies of a question with slightly different prompts, numbers, and answers;
but you are still limited to each question being static and unchanging.
It would be useful to have &amp;ldquo;template questions&amp;rdquo; that can dynamically create static questions from a template
and collection of replacement data.&lt;/p>
&lt;p>The task for this project is to add support for the &amp;ldquo;template questions&amp;rdquo; discussed above.
Much of the high-level design work for this issue has &lt;a href="https://github.com/edulinq/quiz-composer/issues/26" target="_blank" rel="noopener">already been completed&lt;/a>.
But the implementation and low-level design decisions are still left to do.&lt;/p>
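&lt;p>The core of &amp;ldquo;template questions&amp;rdquo; is one template plus a collection of replacement data yielding several static questions. The snippet below sketches this with the standard library&amp;rsquo;s &lt;code>string.Template&lt;/code>; the field names are illustrative, and the actual design lives in the linked issue.&lt;/p>

```python
# Expand one template question into several static questions, one per
# replacement dict. Field names are hypothetical.
from string import Template

def expand(template_q: dict, variants: list) -> list:
    """Instantiate one static question per replacement dict."""
    prompt = Template(template_q["prompt"])
    answer = Template(template_q["answer"])
    return [
        {"prompt": prompt.substitute(v), "answer": answer.substitute(v)}
        for v in variants
    ]
```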
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quiz-composer" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quiz-composer/issues/26" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Understanding Skin-Tone based Bias in Text-to-Image Models Using Stable Diffusion</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fairface/</link><pubDate>Tue, 27 May 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fairface/</guid><description>&lt;p>This project investigates &lt;strong>skin tone bias in text-to-image generation&lt;/strong> by analyzing the output of &lt;strong>Stable Diffusion&lt;/strong> models when prompted with socially and occupationally descriptive text. Despite the growing popularity of generative models like Stable Diffusion, little has been done to evaluate how these models reproduce or amplify visual bias—especially related to &lt;strong>skin tone, perceived race, and social class&lt;/strong>—based solely on textual prompts.&lt;/p>
&lt;p>This work builds on prior studies of bias in large language models (LLMs) and vision-language models (VLMs), and aims to explore how biases manifest visually, without explicitly specifying race or ethnicity in the input prompt. Our approach combines &lt;strong>systematic prompt generation&lt;/strong>, &lt;strong>model-based image creation&lt;/strong>, and &lt;strong>skin tone quantification&lt;/strong> to assess disparities across generated samples.&lt;/p>
&lt;p>The ultimate goal is to develop a &lt;strong>reproducible evaluation pipeline&lt;/strong>, visualize disparities across demographic and occupational prompts, and explore strategies to mitigate representational harms in generative models.&lt;/p>
&lt;p>Concretely, this pipeline will support:&lt;/p>
&lt;ul>
&lt;li>Generating images from prompts&lt;/li>
&lt;li>Annotating or analyzing them using computer vision tools&lt;/li>
&lt;li>Measuring bias across categories like skin tone, gender presentation, or status markers&lt;/li>
&lt;/ul>
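&lt;p>The systematic prompt-generation step can be sketched by crossing descriptor and occupation terms with a fixed template; the specific terms and template wording below are illustrative, not the project&amp;rsquo;s final prompt set:&lt;/p>

```python
import itertools

# Sketch of systematic prompt generation: cross descriptor/occupation terms
# with a fixed template. Terms and template are illustrative assumptions.
TEMPLATE = "A portrait of a {descriptor}{occupation}"
occupations = ["doctor", "nurse", "CEO"]
descriptors = ["", "young ", "wealthy "]

prompts = [
    TEMPLATE.format(descriptor=d, occupation=o)
    for d, o in itertools.product(descriptors, occupations)
]
# Each prompt would then be sent to Stable Diffusion multiple times,
# and the generated images analyzed for skin-tone statistics.
```

&lt;p>Holding the template fixed while varying one term at a time keeps prompt wording from becoming a confound in the bias measurements.&lt;/p>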
&lt;p>Project webpage: &lt;a href="https://github.com/marzianizam/ucsc-ospo.github.io/tree/main/content/project/osre25/UCSC/FairFace" target="_blank" rel="noopener">https://github.com/marzianizam/ucsc-ospo.github.io/tree/main/content/project/osre25/UCSC/FairFace&lt;/a>&lt;/p>
&lt;h3 id="project-idea-measuring-bias-in-ai-generated-portraits">Project Idea: Measuring Bias in AI-Generated Portraits&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Responsible AI, Generative Models, Ethics in AI&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, PyTorch, Stable Diffusion, Prompt Engineering, Data Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>:
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/marzia-binta-nizam/">Marzia Binta Nizam&lt;/a> (mailto:manizam@ucsc.edu)&lt;/li>
&lt;li>Professor James Davis (mailto:davisje@ucsc.edu)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="background">Background&lt;/h3>
&lt;p>Recent research has shown that text-to-image models can perpetuate racial and gender stereotypes through visual output. For instance, prompts like “CEO” or “nurse” often produce racially skewed results even when no explicit race or demographic cues are provided. This project examines whether similar disparities exist &lt;strong>along skin tone dimensions&lt;/strong>, focusing on &lt;strong>subtle biases&lt;/strong> rather than overt stereotypes.&lt;/p>
&lt;p>The key challenge is that visual bias is not always easy to measure. This project addresses this issue by utilizing &lt;strong>melanin-level quantification&lt;/strong>, a continuous and interpretable proxy for skin tone, in conjunction with consistent prompt templating and multi-sample averaging to ensure statistical rigor.&lt;/p>
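&lt;p>One commonly used, interpretable skin-tone statistic is the Individual Typology Angle (ITA), computed from CIE-LAB values; the project may instead adopt a BioSkin-style melanin model, so treat the sketch below as an assumed stand-in, with made-up sample values:&lt;/p>

```python
import math

def individual_typology_angle(L_star: float, b_star: float) -> float:
    """Individual Typology Angle (ITA, degrees) from CIE-LAB lightness L*
    and yellow-blue component b*; higher ITA indicates lighter skin."""
    return math.degrees(math.atan2(L_star - 50.0, b_star))

# Multi-sample averaging: the mean ITA over several generated images of the
# same prompt gives a per-prompt skin-tone estimate; comparing means across
# prompts quantifies the disparity. These (L*, b*) pairs are illustrative.
samples_doctor = [(72.0, 14.0), (68.0, 16.0), (70.0, 15.0)]
mean_ita = sum(individual_typology_angle(L, b) for L, b in samples_doctor) / len(samples_doctor)
```

&lt;p>Because ITA is continuous, disparities can be reported as distribution shifts rather than discrete race labels, which fits the &amp;ldquo;subtle bias&amp;rdquo; focus of the project.&lt;/p>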
&lt;hr>
&lt;h3 id="objectives">Objectives&lt;/h3>
&lt;ul>
&lt;li>Generate datasets using consistent prompts (e.g., &amp;ldquo;A portrait of a doctor&amp;rdquo;, &amp;ldquo;A homeless person&amp;rdquo;, etc.)&lt;/li>
&lt;li>Use Stable Diffusion (and optionally, other models like DALL·E or Midjourney) to generate diverse image sets&lt;/li>
&lt;li>Measure bias across demographic and occupational categories using image processing tools&lt;/li>
&lt;li>Visualize the distribution of melanin values and facial features across samples&lt;/li>
&lt;li>Explore prompt-level mitigation strategies to improve fairness in output&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="deliverables">Deliverables&lt;/h3>
&lt;ul>
&lt;li>Open-source codebase for prompt generation and image evaluation&lt;/li>
&lt;li>Statistical analysis of visual bias trends&lt;/li>
&lt;li>Blog post or visual explainer on findings&lt;/li>
&lt;li>Final report and recommendations on prompt engineering or model constraints&lt;/li>
&lt;/ul>
&lt;hr></description></item><item><title>UC Open Source Repository Browser</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/orb/</link><pubDate>Mon, 03 Mar 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/orb/</guid><description>&lt;p>The University of California Open Source Repository Browser (UC ORB) is a discovery tool designed to map and classify open source projects across the UC system. This project is a collaboration with the &lt;a href="https://ucospo.net" target="_blank" rel="noopener">UC Network of Open Source Program Offices (OSPOs)&lt;/a>, which brings together six UC campuses (Santa Cruz, Berkeley, Davis, Los Angeles, Santa Barbara, and San Diego) to support open source research, promote sustainability, and establish best practices within academic environments.&lt;/p>
&lt;p>By providing a centralized platform, UC ORB enhances the visibility of UC’s open source contributions, fosters collaboration among researchers and developers, and serves as a model for other institutions aiming to improve open source discovery and sustainability.&lt;/p>
&lt;p>This project focuses on building the web application for UC ORB, which will serve as the primary interface for users to explore and interact with UC’s open source projects. The student will work on developing a clean, user-friendly, and scalable web application.&lt;/p>
&lt;h3 id="develop-the-uc-orb-application">Develop the UC ORB Application&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Experience in Python and at least one Python-based web framework (e.g., Flask, Django, FastAPI), experience with front-end technologies (React, HTML, CSS, JavaScript), familiarity with Git and collaborative development workflows, familiarity with database interaction (SQL).&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:jgomez91@ucsc.edu">Juanita Gomez&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a web application that serves as the front-end interface for the UC ORB. The application will allow users to browse, search, and explore open source projects across the UC system. The project will involve integrating with the repository database to fetch and display repository data, designing an intuitive user interface, and ensuring the application is scalable and maintainable.&lt;/p>
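&lt;p>The search and filtering core can be sketched framework-agnostically; the record fields (name, campus, language, tags) are assumed metadata for illustration, not the actual UC ORB database schema:&lt;/p>

```python
# Framework-agnostic sketch of the search/filter logic the web app needs.
# Record fields are assumptions, not the real UC ORB schema.
def search_repos(repos, keyword="", campus=None, language=None):
    keyword = keyword.lower()
    results = []
    for repo in repos:
        # Match the keyword against name, description, and tags.
        haystack = " ".join(
            [repo["name"], repo.get("description", ""), *repo.get("tags", [])]
        ).lower()
        if keyword and keyword not in haystack:
            continue
        if campus and repo.get("campus") != campus:
            continue
        if language and repo.get("language") != language:
            continue
        results.append(repo)
    return results

repos = [
    {"name": "libCacheSim", "campus": "UCSC", "language": "C", "tags": ["cache"]},
    {"name": "orb-web", "campus": "UCSC", "language": "Python", "tags": ["web"]},
]
```

&lt;p>A Flask, Django, or FastAPI route would then parse query parameters and delegate to a function like this backed by the repository database.&lt;/p>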
&lt;p>Specific Tasks:&lt;/p>
&lt;ul>
&lt;li>Choose an appropriate Python-based web framework (e.g., Flask, Django, or FastAPI) for the backend and set up the basic structure of the application.&lt;/li>
&lt;li>Develop a responsive and user-friendly front-end interface ensuring that it is accessible and works well on both desktop and mobile devices.&lt;/li>
&lt;li>Add search functionality to allow users to find projects by keywords, tags, or other metadata.&lt;/li>
&lt;li>Implement filtering options to narrow down search results (e.g., by campus, topic, or programming language).&lt;/li>
&lt;li>Deploy the application to a cloud platform (e.g., AWS, or Google Cloud) or GitHub Pages (GitHub.io) for public access.&lt;/li>
&lt;li>Create developer documentation that explains the application’s architecture, setup instructions, and contribution guidelines.&lt;/li>
&lt;li>Write a short user manual to help end-users browse and use the web application effectively.&lt;/li>
&lt;/ul></description></item><item><title>Applying MLOps to overcome reproducibility barriers in machine learning research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</link><pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/nyu/mlops/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> machine learning, MLOps, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, machine learning, GitOps, systems, Linux, data, Docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Reproducibility remains a significant problem in machine learning research, both in core ML and in the application of ML to other areas of science. In many cases, due to inadequate experiment tracking, dependency capturing, source code versioning, data versioning, and artifact sharing, even the authors of a paper may find it challenging to reproduce their own study several years later. This makes it difficult to validate and build on previous work, and raises concerns about its trustworthiness.&lt;/p>
&lt;p>In contrast, outside of academic research, MLOps tools and frameworks have been identified as a key enabler of reliable, reproducible, and trustworthy machine learning systems in production. A good reference on this topic is:&lt;/p>
&lt;blockquote>
&lt;p>Firas Bayram and Bestoun S. Ahmed. 2025. Towards Trustworthy Machine Learning in Production: An Overview of the Robustness in MLOps Approach. ACM Comput. Surv. 57, 5, Article 121 (May 2025), 35 pages. &lt;a href="https://doi.org/10.1145/3708497" target="_blank" rel="noopener">https://doi.org/10.1145/3708497&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;p>This project seeks to bridge the gap between widely adopted practices in industry and academic research:&lt;/p>
&lt;ul>
&lt;li>by making it easier for researchers and scientists to use MLOps tools to support reproducibility. To achieve this, we will develop starter templates and recipes for research in computer vision, NLP, and ML for science, that have reproducibility &amp;ldquo;baked in&amp;rdquo; thanks to the integration of MLOps tools and frameworks. Researchers will launch these templates on open access research facilities like &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>.&lt;/li>
&lt;li>and, by developing complementary education and training materials to emphasize the importance of reproducibility in ML, and how the tools and frameworks used in the starter templates can support this goal.&lt;/li>
&lt;/ul>
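&lt;p>As a minimal illustration of the experiment metadata a starter template would capture automatically (real templates would integrate dedicated MLOps tools; this stdlib-only sketch just shows the principle, and the field names are assumptions):&lt;/p>

```python
import hashlib
import json
import platform
import sys
import time
from pathlib import Path

# Minimal stdlib sketch of experiment-metadata capture: record environment,
# parameters, and metrics, and derive a content-based run id so identical
# runs map to identical records.
def log_run(params: dict, metrics: dict, out_dir: str = "runs") -> Path:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "params": params,
        "metrics": metrics,
    }
    payload = json.dumps(record, sort_keys=True).encode()
    run_id = hashlib.sha256(payload).hexdigest()[:12]
    path = Path(out_dir) / f"run-{run_id}.json"
    path.parent.mkdir(exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return path

run_file = log_run({"lr": 0.01, "seed": 42}, {"accuracy": 0.93})
```

&lt;p>A starter template would wire this kind of capture into the training entry point so researchers get it without extra effort.&lt;/p>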
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should:&lt;/p>
&lt;ul>
&lt;li>demonstrate a good understanding of the current barriers to reproducibility in machine learning research (specific examples are welcome),&lt;/li>
&lt;li>describe a &amp;ldquo;base&amp;rdquo; starter template, including the platforms and tools that will be integrated, as well as specific adaptations of this template for computer vision, NLP, and ML for science,&lt;/li>
&lt;li>explain the &amp;ldquo;user flow&amp;rdquo; - how a researcher would use the template to conduct an experiment or series of experiments, what the lifecycle of that experiment would look like, and how it would be made reproducible,&lt;/li>
&lt;li>include the contributor&amp;rsquo;s own ideas about how to make the starter templates more usable, and how to make the education and training materials relatable and useful,&lt;/li>
&lt;li>and show that the contributor has the necessary technical background and soft skills to contribute to this project. In particular, the contributor will need to create education and training materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p></description></item><item><title>CacheBench: Building a Benchmarking Suite for Cache Performance Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/cachebench/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/cachebench/</guid><description>&lt;h3 id="overview">Overview&lt;/h3>
&lt;p>In this project, we aim to develop a comprehensive benchmarking suite, CacheBench, for evaluating the performance of cache systems in modern computing environments. Caches play a crucial role in enhancing system performance by reducing latency and improving data access speeds. However, evaluating cache performance is a complex task that requires a diverse set of workloads and metrics to capture the cache&amp;rsquo;s behavior accurately. The current focus is on eviction algorithms; if time permits, we will extend to other components of cache design.&lt;/p>
&lt;p>This project will have three main components:&lt;/p>
&lt;ol>
&lt;li>Implementing and benchmarking existing cache eviction algorithms in &lt;a href="https://libcachesim.com/" target="_blank" rel="noopener">libCacheSim&lt;/a> using large-scale simulation. This part will mainly focus on reproducing existing works.&lt;/li>
&lt;li>Developing a set of microbenchmarks and a platform for researchers to evaluate new designs with little effort in the future. This part will focus on building the open-source infrastructure for future research.&lt;/li>
&lt;li>Developing a leaderboard for the community to submit new algorithms and workloads. This part will focus on building the community and fostering adoption and collaboration.&lt;/li>
&lt;/ol>
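&lt;p>As a toy illustration of the first component, the sketch below replays one request trace through two eviction algorithms and compares hit ratios. This only shows the shape of the experiment; libCacheSim&amp;rsquo;s actual API and scale differ:&lt;/p>

```python
from collections import OrderedDict, deque

# Toy cache simulator comparing the hit ratios of two eviction algorithms
# on the same request trace (the kind of experiment libCacheSim runs at scale).
def simulate_lru(trace, capacity):
    cache, hits = OrderedDict(), 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)           # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)    # evict least recently used
            cache[key] = True
    return hits / len(trace)

def simulate_fifo(trace, capacity):
    cache, order, hits = set(), deque(), 0
    for key in trace:
        if key in cache:
            hits += 1                        # FIFO does not refresh on hits
        else:
            if len(cache) >= capacity:
                cache.remove(order.popleft())  # evict oldest insertion
            cache.add(key)
            order.append(key)
    return hits / len(trace)

trace = [1, 2, 3, 1, 4, 1, 2, 5, 1, 2]
lru_hit_ratio = simulate_lru(trace, capacity=3)
fifo_hit_ratio = simulate_fifo(trace, capacity=3)
```

&lt;p>On this small trace LRU beats FIFO because it keeps the re-referenced keys resident; real evaluations replay traces with millions of requests.&lt;/p>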
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage systems, benchmarking, performance evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C programming, web programming (e.g., node.js, React), database management&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours).&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juncheng-yang/">Juncheng Yang&lt;/a>, Yazhuo Zhang (&lt;a href="mailto:yazhuo@inf.ethz.ch">yazhuo@inf.ethz.ch&lt;/a>)&lt;/li>
&lt;/ul></description></item><item><title>FairFace</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fair-face/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/fair-face/</guid><description>&lt;h3 id="fairface-reproducible-bias-evaluation-in-facial-ai-models-via-controlled-skin-tone-manipulation">FairFace: Reproducible Bias Evaluation in Facial AI Models via Controlled Skin Tone Manipulation&lt;/h3>
&lt;p>Bias in facial AI models remains a persistent issue, particularly concerning skin tone disparities. Many studies report that AI models perform differently on lighter vs. darker skin tones, but these findings are often difficult to reproduce due to variations in datasets, model architectures, and evaluation settings.
The goal of this project is to investigate bias in facial AI models by manipulating skin tone and related properties in a controlled, reproducible manner. By leveraging BioSkin, we will adjust melanin levels and other skin properties on existing human datasets to assess whether face-based AI models (e.g., classification and vision-language models) exhibit biased behavior toward specific skin tones.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Fairness &amp;amp; Bias in AI&lt;/code>, &lt;code>Face Recognition &amp;amp; Vision-Language Models&lt;/code>, &lt;code>Dataset Augmentation for Reproducibility&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Machine Learning &amp;amp; Computer Vision, Deep Learning (PyTorch/TensorFlow), Data Augmentation &amp;amp; Image Processing, Reproducibility &amp;amp; Documentation (GitHub, Jupyter Notebooks).&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (can be completed in either 175 or 350 hours, depending on the depth of analysis and number of models tested)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:davisje@ucsc.edu">James Davis&lt;/a>, &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="key-research-questions">Key Research Questions&lt;/h3>
&lt;ol>
&lt;li>Do AI models perform differently based on skin tone?
&lt;ul>
&lt;li>How do classification accuracy, confidence scores, and error rates change when skin tone is altered systematically?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>What are the underlying causes of bias?
&lt;ul>
&lt;li>Is bias solely dependent on skin tone, or do other skin-related properties (e.g., texture, reflectance) contribute to model predictions?&lt;/li>
&lt;li>Is bias driven by dataset imbalances (e.g., underrepresentation of certain skin tones)?&lt;/li>
&lt;li>Do facial features beyond skin tone (e.g., structure, expression, pose) contribute to biased predictions?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Are bias trends reproducible?
&lt;ul>
&lt;li>Can we replicate bias patterns across different datasets, model architectures, and experimental setups?&lt;/li>
&lt;li>How consistent are the findings when varying image sources and preprocessing methods?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
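&lt;p>As an illustration of how question 1 could be quantified, the sketch below computes two standard group-fairness statistics over binary predictions. The skin-tone groups and records are illustrative stand-ins, not project data:&lt;/p>

```python
# Sketch of two fairness metrics over model predictions grouped by skin-tone
# bucket. Each record is (group, y_true, y_pred) with binary labels; the
# "light"/"dark" buckets and values below are illustrative.
def demographic_parity(records, group):
    preds = [y_pred for g, _, y_pred in records if g == group]
    return sum(preds) / len(preds)           # P(pred = 1 | group)

def true_positive_rate(records, group):
    pos = [(y_true, y_pred) for g, y_true, y_pred in records
           if g == group and y_true == 1]
    return sum(y_pred for _, y_pred in pos) / len(pos)  # P(pred = 1 | y = 1, group)

records = [
    ("light", 1, 1), ("light", 0, 1), ("light", 1, 1), ("light", 0, 0),
    ("dark", 1, 0), ("dark", 0, 0), ("dark", 1, 1), ("dark", 0, 0),
]
# Gaps near 0 indicate parity; large gaps indicate disparate treatment.
dp_gap = demographic_parity(records, "light") - demographic_parity(records, "dark")
eo_gap = true_positive_rate(records, "light") - true_positive_rate(records, "dark")
```

&lt;p>Running the same computation before and after BioSkin-based melanin manipulation isolates the effect of skin tone from other facial features.&lt;/p>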
&lt;h3 id="specific-tasks">Specific Tasks:&lt;/h3>
&lt;ol>
&lt;li>Dataset Selection &amp;amp; Preprocessing
&lt;ul>
&lt;li>Choose appropriate face/human datasets (e.g., FairFace, CelebA, COCO-Human).&lt;/li>
&lt;li>Preprocess images to ensure consistent lighting, pose, and resolution before applying transformations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Skin Tone Manipulation with BioSkin
&lt;ul>
&lt;li>Systematically modify melanin levels while keeping facial features unchanged.&lt;/li>
&lt;li>Generate multiple variations per image (lighter to darker skin tones).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Model Evaluation &amp;amp; Bias Analysis
&lt;ul>
&lt;li>Test face classification models (e.g., ResNet, FaceNet) and vision-language models (e.g., BLIP, LLaVA) on the modified images.&lt;/li>
&lt;li>Compute fairness metrics (e.g., demographic parity, equalized odds).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Investigate Underlying Causes of Bias
&lt;ul>
&lt;li>Compare model behavior across different feature sets.&lt;/li>
&lt;li>Test whether bias persists across multiple datasets and model architectures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Ensure Reproducibility
&lt;ul>
&lt;li>Develop an open-source pipeline for others to replicate bias evaluations.&lt;/li>
&lt;li>Provide codebase and detailed documentation for reproducibility.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol></description></item><item><title>IO logger: IO tracing in the modern computing era</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/iologger/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/harvard/iologger/</guid><description>&lt;h3 id="overview">Overview&lt;/h3>
&lt;p>Storage systems are critical components of modern computing infrastructures, and understanding their performance characteristics is essential for optimizing system efficiency. Many IO tracing studies were conducted twenty to thirty years ago, but the landscape has changed significantly with the advent of&lt;/p>
&lt;ul>
&lt;li>cloud computing, virtualization, and storage disaggregation on the &lt;strong>server side&lt;/strong>&lt;/li>
&lt;li>ubiquitous fast wireless networking for &lt;strong>end users&lt;/strong> that make remote storage feasible&lt;/li>
&lt;li>AI and ML workloads that generate and move massive data both in the cloud and on the edge.&lt;/li>
&lt;/ul>
&lt;p>In this project, we aim to develop an IO logger, a tool for tracing, logging and analyzing IO operations in various computing environments. The IO logger will capture detailed information about read and write operations, latency, throughput, and other metrics to help researchers and practitioners understand the behavior of storage systems under different workloads and configurations. By providing a comprehensive view of IO performance, the IO logger will enable users to identify bottlenecks, optimize resource utilization, and improve system efficiency.&lt;/p>
&lt;p>This project will have two phases:&lt;/p>
&lt;ol>
&lt;li>IO logger for *NIX systems: Develop a tool leveraging eBPF and other tools for tracing IO operations on Linux and other Unix-like systems. The tool will capture detailed information about disk reads and writes, network transfers, and other IO activities, providing insights into system performance. The tool will be open-sourced, and we will work with industry partners and testbeds to integrate it into existing monitoring and analysis tools. Moreover, we will collect and open source the IO traces to benefit the community.&lt;/li>
&lt;li>IO logger for personal computing environments: Develop a tool for end-users to trace IO operations on their personal devices, such as laptops, desktops, and mobile phones. We will design and implement tools for three different platforms: Windows, macOS, and Android. We will use the tools to collect IO traces from volunteers and real-world applications, providing insights into storage usage, network activity, and application performance. The tool will be user-friendly, lightweight, and privacy-preserving, ensuring that users can monitor their IO activities without compromising their data security.&lt;/li>
&lt;/ol>
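&lt;p>As a sketch of the analysis layer that would sit on top of the captured traces, the following aggregates per-process read/write counts, throughput, and latency. The event schema (process, op, bytes, latency_us) is an assumed placeholder, not a finalized trace format:&lt;/p>

```python
from collections import defaultdict

# Sketch of trace analysis over captured IO events; the record fields are an
# assumed schema for what the eBPF front end might emit.
def summarize(events):
    stats = defaultdict(lambda: {"reads": 0, "writes": 0, "bytes": 0, "lat": []})
    for e in events:
        s = stats[e["process"]]
        s["reads" if e["op"] == "read" else "writes"] += 1
        s["bytes"] += e["bytes"]
        s["lat"].append(e["latency_us"])
    return {
        proc: {
            "reads": s["reads"],
            "writes": s["writes"],
            "throughput_bytes": s["bytes"],
            "mean_latency_us": sum(s["lat"]) / len(s["lat"]),
        }
        for proc, s in stats.items()
    }

events = [
    {"process": "firefox", "op": "read", "bytes": 4096, "latency_us": 120},
    {"process": "firefox", "op": "write", "bytes": 8192, "latency_us": 300},
    {"process": "python", "op": "read", "bytes": 4096, "latency_us": 80},
]
report = summarize(events)
```

&lt;p>Attaching the process name to each event is exactly the &amp;ldquo;rich features&amp;rdquo; gap this project aims to close relative to block-layer-only traces.&lt;/p>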
&lt;p>Notable differences and challenges compared to existing works are:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>more IO requests with rich features&lt;/strong>: open-source traces from previous works were all collected below the page cache; they are often write-heavy, miss most IO requests, and do not provide enough features, e.g., process name. To address this, we will build a tool that also records requests served by the page cache, which requires the tool to be efficient and to impose no significant overhead on the running system.&lt;/li>
&lt;li>&lt;strong>focus on new applications and workloads&lt;/strong>: most existing works date from the 1990s, when the Internet was not widely used and applications mostly processed local data without communicating with the outside world. A few works examined mobile storage a decade ago, but the landscape has changed significantly since then, especially with the advent of AI and ML workloads that generate and move massive data both in the cloud and on the edge. This project will investigate the differences and challenges brought by these new applications and workloads.&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> tracing tool, operating system, eBPF, performance evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C programming, system programming, eBPF, Linux kernel, mobile application development&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours).&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juncheng-yang/">Juncheng Yang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h4 id="related-works">Related works&lt;/h4>
&lt;ul>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/3460095" target="_blank" rel="noopener">COSMOS&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/2987443.2987465" target="_blank" rel="noopener">mobile storage usage&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/2043106.2043112" target="_blank" rel="noopener">mobile storage performance measurement&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://ieeexplore.ieee.org/abstract/document/7897092" target="_blank" rel="noopener">mobile application IO performance&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/1140103.1140280" target="_blank" rel="noopener">stardust&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://dl.acm.org/doi/abs/10.1145/3149376" target="_blank" rel="noopener">GPFS&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>ReasonWorld</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</guid><description>&lt;h3 id="reasonworld-real-world-reasoning-with-a-long-term-world-model">ReasonWorld: Real-World Reasoning with a Long-Term World Model&lt;/h3>
&lt;p>A world model is essentially an internal representation of an environment that an AI system would construct based on external information to plan, reason, and interpret its surroundings. It stores the system’s understanding of relevant objects, spatial relationships, and/or states in the environment. Recent augmented reality (AR) and wearable technologies like Meta Aria glasses provide an opportunity to gather rich information from the real world in the form of vision, audio, and spatial data. Along with this, large language models (LLMs), vision-language models (VLMs), and general machine learning algorithms have enabled nuanced understanding and processing of multimodal inputs that can label, summarize, and analyze experiences.&lt;/p>
&lt;p>With &lt;strong>ReasonWorld&lt;/strong>, we aim to utilize these technologies to enable advanced reasoning about important objects/events/spaces in real-world environments in a structured manner. With the help of wearable AR technology, the system would be able to capture real-world multimodal data. We aim to utilize this information to create a long-memory modeling toolkit that would support features like:&lt;/p>
&lt;ul>
&lt;li>Longitudinal and structured data logging: Capturing and storing multimodal data (image, video, audio, location coordinates, etc.)&lt;/li>
&lt;li>Semantic summarization: Automatic scene labeling via LLMs/VLMs to identify key elements in the surroundings&lt;/li>
&lt;li>Efficient retrieval: For querying and revisiting past experiences and answering questions like “Where have I seen this painting before?”&lt;/li>
&lt;li>Adaptability: Continuously refining its understanding of the environment and/or the relationships between objects/locations.&lt;/li>
&lt;li>Adaptive memory prioritization: The pipeline assesses the contextual significance of the captured data and retrieves the most significant items. The model retains meaningful, structured representations rather than raw, unfiltered data.&lt;/li>
&lt;/ul>
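&lt;p>The &amp;ldquo;efficient retrieval&amp;rdquo; feature can be sketched as nearest-neighbor search over embedded event summaries. The 3-d vectors and summaries below are stand-ins for learned embeddings and a real vector database:&lt;/p>

```python
import math

# Tiny sketch of memory retrieval: rank stored event summaries by cosine
# similarity to a query embedding. Vectors here are illustrative stand-ins
# for learned embeddings; a deployment would use a vector database.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

memory = [
    ("saw a blue painting in the museum lobby", [0.9, 0.1, 0.0]),
    ("parked the car on level 2", [0.0, 0.8, 0.6]),
    ("met Alice at the cafe", [0.2, 0.1, 0.9]),
]

def retrieve(query_vec, k=1):
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [summary for summary, _ in ranked[:k]]

# A query like "Where have I seen this painting before?" embeds near the
# first memory entry.
best = retrieve([0.85, 0.15, 0.05])
```

&lt;p>Swapping the list for a vector database and the stand-in vectors for VLM embeddings gives the structured search engine described above.&lt;/p>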
&lt;p>This real-world reasoning framework with a long-term world model can function as a structured search engine for important objects and spaces, enabling:&lt;/p>
&lt;ul>
&lt;li>Recognizing and tracking significant objects, locations, and events&lt;/li>
&lt;li>Supporting spatial understanding and contextual analysis&lt;/li>
&lt;li>Facilitating structured documentation of environments and changes over time&lt;/li>
&lt;/ul>
&lt;h3 id="alignment-with-summer-of-reproducibility">Alignment with Summer of Reproducibility:&lt;/h3>
&lt;ul>
&lt;li>Core pipeline for AR data ingestion, event segmentation, summarization, and indexing (knowledge graph or vector database) would be made open-source.&lt;/li>
&lt;li>Clear documentation of each module and how they collaborate with one another&lt;/li>
&lt;li>The project could be tested with standardized datasets, simulated environments as well as controlled real-world scenarios, promoting reproducibility&lt;/li>
&lt;li>Opportunities for Innovation - A transparent, modular approach invites a broad community to propose novel expansions&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks:&lt;/h3>
&lt;ul>
&lt;li>A pipeline for real-time/batch ingestion and cleaning of data from the wearable AR device&lt;/li>
&lt;li>An event segmentation module to classify whether the current object/event is contextually significant, filtering out less relevant observations.&lt;/li>
&lt;li>VLM/LLM-based summarization of events from the vision/audio/location data, stored for later retrieval in structured data structures like knowledge graphs, vector databases, etc.&lt;/li>
&lt;li>Storage optimization that prioritizes important objects and spaces based on contextual significance and frequency of access.&lt;/li>
&lt;li>Implement key information retrieval mechanisms&lt;/li>
&lt;li>Ensure reproducibility by providing datasets and scripts&lt;/li>
&lt;/ul>
&lt;h3 id="reasonworld">ReasonWorld&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Augmented reality&lt;/code> &lt;code>Multimodal learning&lt;/code> &lt;code>Computer vision for AR&lt;/code> &lt;code>LLM/VLM&lt;/code> &lt;code>Efficient data indexing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Machine Learning and AI, Augmented Reality and Hardware integration, Data Engineering &amp;amp; Storage Optimization&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:davisje@ucsc.edu">James Davis&lt;/a>, &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>AI for Science: Automating Domain Specific Tasks with Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</link><pubDate>Sun, 23 Feb 2025 21:30:56 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</guid><description>&lt;p>Recent advancements in Large Language Models (LLMs) have transformed various fields by demonstrating remarkable capabilities in processing and generating human-like text. This project aims to explore the development of an open-source framework that leverages LLMs to enhance discovery across specialized domains.&lt;/p>
&lt;p>The proposed framework will enable LLMs to analyze and interpret complex datasets, automate routine tasks, and uncover novel insights. A key focus will be on equipping LLMs with domain-specific expertise, particularly in areas where specialized tools &amp;ndash; such as ANDES &amp;ndash; are not widely integrated with LLM-based solutions. By bridging this gap, the framework will empower researchers and professionals to harness LLMs as intelligent assistants capable of navigating and utilizing niche computational tools effectively.&lt;/p>
&lt;h3 id="ai-for-science-automating-domain-specific-tasks-with-large-language-models">AI for Science: Automating Domain Specific Tasks with Large Language Models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>AI for Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Experience with LLMs, Prompt Engineering, Fine-Tuning, LLM Frameworks&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium-Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Designing an extensible framework that facilitates the integration of LLMs with specialized software and datasets.&lt;/li>
&lt;li>Developing methodologies for fine-tuning LLMs to act as domain experts.&lt;/li>
&lt;li>Implementing strategies for improving tool interoperability, allowing LLMs to interact seamlessly with less commonly used but critical analytical platforms.&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in Distributed AI Training: Leveraging Checkpointing and Metadata Analytics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</link><pubDate>Fri, 21 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</guid><description>&lt;p>Reproducibility in distributed AI training is a crucial challenge due to several sources of uncertainty, including stragglers, data variability, and inherent randomness. Stragglers—slower processing nodes in a distributed system—can introduce timing discrepancies that affect the synchronization of model updates, leading to inconsistent states across training runs. Data variability, stemming from non-deterministic data shuffling and differing data partitions across nodes, can also lead to variations in model performance. Additionally, inherent randomness in algorithm initialization, such as random weight initialization and stochastic processes like dropout, further compounds these challenges. Reproducibility in AI is pivotal for ensuring the credibility of AI-driven scientific findings, akin to how reproducibility underpins traditional scientific research.&lt;/p>
&lt;p>To enhance AI reproducibility, leveraging metadata analytics and visualization along with saved checkpoints offers a promising solution. Checkpointing in AI training is a pivotal technique that involves saving snapshots of a model and its parameters at regular intervals throughout the training process. This practice is essential for maintaining progress in the face of potential interruptions, such as hardware failures, and enables the resumption of training without having to restart from scratch. In the context of distributed AI training, checkpointing also provides a framework for analyzing and ensuring reproducibility, offering a means to systematically capture and review the training trajectory of models. Analyzing checkpoints can specifically help identify issues like stragglers, which are slower computing nodes in a distributed system that can impede synchronized progress. For example, by examining the time stamps and resource utilization data associated with each checkpoint, anomalies in processing time can be detected, revealing nodes that consistently lag behind others. This analysis enables teams to diagnose performance bottlenecks and optimize resource allocation across the distributed system, ensuring smoother and more consistent training runs. By combining checkpointing with metadata analytics, it becomes possible to pinpoint the exact training iterations where delays occur, thereby facilitating targeted investigations and solutions to improve overall system reproducibility and efficiency.&lt;/p>
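&lt;p>The timestamp analysis described above can be sketched in a few lines: given per-checkpoint wall-clock times recorded as metadata, flag nodes whose mean time exceeds a multiple of the typical node. The numbers and the 1.5x threshold below are illustrative only, not from the project.&lt;/p>

```python
from statistics import mean, median

# Illustrative sketch: flag straggler nodes from per-checkpoint wall-clock
# times recorded as checkpoint metadata. The numbers and the 1.5x
# threshold are made up for illustration.

checkpoint_times = {                 # node -> seconds to reach each checkpoint
    "node0": [10.1, 10.3, 10.2],
    "node1": [10.0, 10.4, 10.1],
    "node2": [15.8, 16.2, 15.9],     # consistently slower: a straggler
}

def stragglers(times, factor=1.5):
    """Nodes whose mean checkpoint time exceeds factor * the typical node."""
    typical = median(mean(v) for v in times.values())
    return [n for n, v in times.items() if mean(v) > factor * typical]

print(stragglers(checkpoint_times))  # ['node2']
```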
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: 1) Setting up a checkpointing system within the distributed AI training framework to periodically save model states and metadata; 2) Designing a metadata analysis schema for populating model and system statistics from the saved checkpoints; 3) Conducting exploratory data analysis to identify patterns, anomalies, and sources of variability in the training process; 4) Creating visualization tools to represent metadata insights with collected statistics and patterns; 5) Using insights from metadata analytics and visualization to optimize resource distribution across the distributed system and mitigate straggler effects; and 6) Disseminating results and methodologies through academic papers, workshops, and open-source contributions.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>AI&lt;/code> &lt;code>distributed AI&lt;/code> &lt;code>checkpoint&lt;/code> &lt;code>metadata analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in RAG Frameworks for Scientific Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</link><pubDate>Thu, 20 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</guid><description>&lt;p>Retrieval-Augmented Generation (RAG) frameworks, which merge the capabilities of retrieval systems and generative models, significantly enhance the relevance and accuracy of responses produced by large language models (LLMs). These frameworks retrieve relevant documents from a large corpus and use these documents to inform the generative process, thereby improving the contextuality and precision of the generated content. Ensuring reproducibility in data queries using similarity search within these RAG frameworks is critical for maintaining the reliability and consistency of scientific workflows. Reproducibility ensures that the same input query consistently yields the same output, which is vital for scientific tasks that rely on precise and repeatable results. Inconsistencies can arise from various sources, affecting the trustworthiness of scientific outcomes. Differences in retrieval algorithms can lead to variable sets of documents being retrieved for the same query. Variations in data indexing methods can cause inconsistencies in how documents are ranked and accessed. The stochastic nature of LLM operations introduces an element of randomness in the generative process. Updates in datasets can also alter the baseline against which queries are processed and interpreted, leading to different results over time.&lt;/p>
&lt;p>This proposal aims to address these reproducibility challenges in similarity searches within RAG frameworks. This work involves analyzing the root causes of non-determinism, benchmarking and validating the consistency of query results, implementing enhancements to minimize variability, and developing tools and best practices to ensure reproducibility. Reproducibility in data queries can be influenced by several factors, including updates in datasets, differences in retrieval algorithms, varying data indexing methods, and the stochastic nature of LLM operations. Each of these factors can cause variability in the documents retrieved and in the generated responses. Ensuring consistency in query results across different runs is crucial for maintaining the integrity of LLM-driven scientific research, allowing researchers to confidently build upon prior work and achieve reliable, trustworthy outcomes.&lt;/p>
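&lt;p>Two of the practices implied above can be sketched directly: content-hashing the corpus so dataset updates are detectable, and breaking similarity-score ties deterministically so top-k retrieval is stable across runs. The scores and document IDs below are illustrative.&lt;/p>

```python
import hashlib

# Sketch of two reproducibility practices suggested above: fingerprinting
# the corpus so dataset updates are detectable, and breaking similarity
# ties deterministically so top-k retrieval is stable across runs.
# Scores and document IDs are illustrative.

def corpus_fingerprint(docs):
    """Any edit to any document changes this digest, flagging a baseline shift."""
    h = hashlib.sha256()
    for d in docs:
        h.update(d.encode())
        h.update(b"\x00")            # separator so ["ab"] differs from ["a", "b"]
    return h.hexdigest()

def top_k(scores, k=2):
    """scores: {doc_id: similarity}. Ties broken by doc_id, not dict order."""
    return sorted(scores, key=lambda d: (-scores[d], d))[:k]

print(top_k({"d2": 0.9, "d1": 0.9, "d3": 0.5}))  # ['d1', 'd2'] on every run
```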
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: (1) Identifying sources of non-determinism and variability, such as algorithmic differences and indexing methods, in RAG; (2) Utilizing standardized scientific datasets to benchmark the reproducibility of similarity search results across different RAG frameworks; (3) Establishing protocols for handling dataset updates to ensure that such changes do not impact the reproducibility of similarity search results; and (4) Implementing mechanisms to track and document updates to datasets, ensuring that changes are reflected consistently across all instances of the RAG framework. By addressing these areas, the proposed work aims to mitigate challenges related to reproducibility in similarity search queries within RAG frameworks, ultimately enhancing the reliability and trustworthiness of scientific research outcomes.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>LLM&lt;/code> &lt;code>RAG&lt;/code> &lt;code>Scientific Workflows&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Exploration of I/O Reproducibility with HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</link><pubDate>Wed, 19 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</guid><description>&lt;p>Parallel I/O is a critical component in high-performance computing (HPC), allowing multiple processes to read and write data concurrently from a shared storage system. &lt;a href="https://github.com/HDFGroup/hdf5" target="_blank" rel="noopener">HDF5&lt;/a>—a widely adopted data model and library for managing complex scientific data—supports parallel I/O but introduces challenges in I/O reproducibility, where repeated executions do not always produce identical results. This lack of reproducibility can stem from non-deterministic execution orders, variations in collective buffering strategies, and race conditions in metadata and dataset chunking operations within HDF5’s parallel I/O hierarchy. Moreover, many HDF5 operations that leverage &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">MPI I/O&lt;/a> require collective communication; that is, all processes within a communicator must participate in operations such as metadata creation, chunk allocation, and data aggregation. These collective calls ensure that the file structure and data layout remain consistent across processes, but they also introduce additional synchronization complexity that can impact reproducibility if not properly managed. In HPC scientific workflows, consistent I/O reproducibility is essential for accurate debugging, validation, and benchmarking, ensuring that scientific results are both verifiable and trustworthy.
Tools such as &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a>—a suite of I/O kernels designed to exercise HDF5 I/O on parallel file systems—play an important role in identifying these reproducibility challenges, tuning performance, and ultimately supporting the overall robustness of large-scale scientific applications.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing parallel I/O operations in &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">HDF5&lt;/a> with &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> miniapps, (2) exploring and validating potential reproducibility challenges within the parallel I/O hierarchy (e.g., MPI I/O), and (3) implementing solutions to address parallel I/O reproducibility.&lt;/p>
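&lt;p>A pure-Python illustration (deliberately avoiding any HDF5 dependency) of one pitfall the workplan targets: if chunk offsets are handed out in whatever order ranks happen to arrive, the file layout drifts between runs, whereas deriving each offset from the rank id, as a collective allocation step does, keeps the layout deterministic.&lt;/p>

```python
# Pure-Python illustration (no HDF5 required) of one pitfall the workplan
# targets: if chunk offsets are handed out in whatever order ranks happen
# to arrive, the file layout drifts between runs; deriving each offset
# from the rank id, as a collective allocation does, is deterministic.

CHUNK = 4  # elements written per rank

def layout_by_arrival(arrival_order):
    """Offsets assigned first-come-first-served: depends on timing."""
    return {rank: i * CHUNK for i, rank in enumerate(arrival_order)}

def layout_by_rank(ranks):
    """Offsets derived from rank ids, as a collective allocation would do."""
    return {rank: rank * CHUNK for rank in ranks}

run_a = layout_by_arrival([0, 1, 2, 3])   # ranks arrive in order
run_b = layout_by_arrival([2, 0, 3, 1])   # same ranks, different timing
print(run_a == run_b)                     # False: same data, drifted layout
print(layout_by_rank([2, 0, 3, 1]) == layout_by_rank([0, 1, 2, 3]))  # True
```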
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Parallel I/O&lt;/code> &lt;code>MPI-I/O&lt;/code> &lt;code>Reproducibility&lt;/code> &lt;code>HPC&lt;/code> &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/wei-zhang/">Wei Zhang&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Peersky Browser</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/peersky/</link><pubDate>Tue, 18 Feb 2025 12:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/peersky/</guid><description>&lt;p>&lt;a href="https://peersky.p2plabs.xyz/" target="_blank" rel="noopener">Peersky Browser&lt;/a> is an experimental personal gatekeeper to a new way of accessing web content. In a world where a handful of big companies control most of the internet, Peersky leverages distributed web technologies—&lt;a href="https://ipfs.tech/" target="_blank" rel="noopener">IPFS&lt;/a>, &lt;a href="https://holepunch.to/" target="_blank" rel="noopener">Hypercore&lt;/a>, and Web3—to return control to the users. With integrated local P2P applications, Peersky offers a fresh, community-driven approach to browsing.&lt;/p>
&lt;h3 id="implement-web-extensions-integration">Implement Web Extensions Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Browser Extensions&lt;/code>, &lt;code>UI/UX&lt;/code>, &lt;code>Electron&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> JavaScript, Electron.js, HTML/CSS&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/akhilesh-thite/">Akhilesh Thite&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement web extension support in Electron by leveraging its web extension node modules, pre-installing extensions, and providing a user interface for adding, updating, and securely managing them.&lt;/p>
&lt;p>&lt;strong>Tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Loading Extensions via Electron Modules:&lt;/strong>
&lt;ul>
&lt;li>Utilize Electron’s web extension node modules to load extensions, as Electron.js doesn&amp;rsquo;t support marketplace integration.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Default Pre-installed Extensions:&lt;/strong>
&lt;ul>
&lt;li>Configure a set of pre-installed extensions like uBlock to offer immediate value for privacy and security.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>User-Installed Extensions UI:&lt;/strong>
&lt;ul>
&lt;li>Create an interface where users can add extension &lt;code>.zip&lt;/code> files in &lt;code>peersky://settings&lt;/code>.&lt;/li>
&lt;li>Add an option for users to manually update all installed extensions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Validate and Sandbox Extensions:&lt;/strong>
&lt;ul>
&lt;li>Check the integrity and manifest structure of the uploaded extensions to ensure they meet Chrome Manifest V3 requirements.&lt;/li>
&lt;li>Apply sandboxing techniques and enforce strict content security policies to mitigate potential risks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Extension Management UI:&lt;/strong>
&lt;ul>
&lt;li>Design a dedicated UI at the top right of the navigation bar to manage extensions, including stack order and pinning functionality for quick access and organization.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="implement-chat-history-synchronization-for-hyper-chat-rooms">Implement Chat History Synchronization for Hyper Chat Rooms&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>P2P Communication&lt;/code>, &lt;code>Hypercore Protocol&lt;/code>, &lt;code>Real-time Synchronization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> JavaScript, Distributed Systems, P2P&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/akhilesh-thite/">Akhilesh Thite&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement chat history synchronization for Hyper chat rooms, ensuring that new devices retrieve all past messages—including those sent while offline—for a seamless user experience. Additionally, research and experiment with mDNS to enable true offline, peer-to-peer messaging on local networks.&lt;/p>
&lt;p>&lt;strong>Tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>History Retrieval Mechanism:&lt;/strong>
&lt;ul>
&lt;li>Implement chat history synchronization so that when a new device joins a Hyper chat room, it retrieves the entire chat history from the Hypercore feed.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Offline Message Inclusion:&lt;/strong>
&lt;ul>
&lt;li>Ensure that devices that were offline during prior messages can still access the full chat history upon joining the room, even after messages were sent in their absence.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>UI Integration:&lt;/strong>
&lt;ul>
&lt;li>Create a seamless experience for users across devices by ensuring that no messages are lost and that users can access the full chat history regardless of their online status.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Research mDNS (Multicast DNS):&lt;/strong>
&lt;ul>
&lt;li>mDNS is a protocol that allows devices on the same local network to communicate with each other without the need for a central DNS server. This enables peer-to-peer communication, especially in offline environments, making it ideal for offline messaging.&lt;/li>
&lt;li>Experiment with the &lt;code>mDNS()&lt;/code> function to enable peer-to-peer communication for offline chat rooms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Create Hyper Chat Web App Version:&lt;/strong>
&lt;ul>
&lt;li>Currently, Hyper chat is accessed via &lt;code>peersky://p2p/chat&lt;/code>. Develop a web app version of Hyper chat that can be hosted on the &lt;code>hyper://&lt;/code> protocol (&lt;code>hyper://chat.p2plabs.xyz&lt;/code>). This way, other P2P browsers (like &lt;a href="https://agregore.mauve.moe/" target="_blank" rel="noopener">Agregore&lt;/a>) can use it to communicate.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>AR4VIP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ar4vip/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ar4vip/</guid><description>&lt;p>We are interested in developing navigation aids for visually impaired people (VIP) using AR/VR technologies.
Our intended use is primarily indoors, or outdoors within private confines, e.g. a person&amp;rsquo;s backyard.
Using AR/VR headsets or smart glasses allows navigation without using a cane and frees
the users&amp;rsquo; hands for other tasks.&lt;/p>
&lt;h3 id="continue-development-on-meta-quest-3-headset">Continue Development on Meta Quest 3 Headset&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Dynamic scenes&lt;/code> &lt;code>Spatial audio&lt;/code> &lt;code>Proximity detection&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> AR/VR familiarity, WebXR, Unity, SLAM, good communicator, good documentation skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>, &lt;a href="mailto:davis@cs.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Continue development and field testing with the Meta Quest 3 headset.
See this &lt;a href="https://github.com/sail360/UCSC-VIP-Research" target="_blank" rel="noopener">repository page&lt;/a> for current status.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Improve spatial audio mapping&lt;/li>
&lt;li>Improve obstacle detection, at different heights, with pre-scanned geometry as well as dynamic objects
e.g. other people, pets, doors&lt;/li>
&lt;li>Special handling of hazards e.g. stairs, uneven floors, etc.&lt;/li>
&lt;li>Explore/incorporate AI to help identify objects in the scene when requested by user&lt;/li>
&lt;/ul>
&lt;h3 id="new-development-on-smart-glasses">New Development on Smart Glasses&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Dynamic scenes&lt;/code> &lt;code>Spatial audio&lt;/code> &lt;code>Proximity detection&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> AR/VR familiarity, WebXR, Unity, SLAM, good communicator, good documentation skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>, &lt;a href="mailto:davis@cs.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>VR headsets are bulky and awkward, but they are currently more advanced than AR glasses in terms of programmability.
Ultimately, the form factor of smart glasses is more practical for extended use by our target users.
Many vendors are pushing out their own versions of smart glasses targeting various applications,
e.g. as an alternative for watching TV. We are interested in those that provide capabilities to support
spatial computing. Most of these will likely have their own brand-specific APIs. This project has two goals:
(a) develop a generic, brand-independent API, perhaps as extensions to WebXR, to support the overarching goal of a navigation
aid for VIP, and
(b) port the functionality of the VR version to smart glasses while taking advantage of smart-glass features and sensors.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Explore current and soon-to-be-available smart glass options, e.g. Snap Spectacles, Xreal Air 2 Ultra, etc., and select a platform to work on (subject to cost and availability of an SDK). At a minimum, the glasses should have microphones, speakers, and cameras. Infrared cameras or other low-light capability is a plus, as is sufficient battery life or an option for quick battery exchange.&lt;/li>
&lt;li>Identify support provided by the SDK, e.g. does it do realtime scene reconstruction? Does it support spatial audio? If it supports features outside of WebXR, provide generic hooks to improve portability of the code to other smart glasses.&lt;/li>
&lt;li>Port and extend functionalities from the Meta Quest 3 VR headset to the smart glass platform.&lt;/li>
&lt;li>Add AI support if the glasses support it.&lt;/li>
&lt;li>Provide documentation of work.&lt;/li>
&lt;/ul></description></item><item><title>Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/cc-snapshot/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/cc-snapshot/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>A critical challenge in computer systems research reproducibility is establishing and sharing experimental environments. While open testbeds like Chameleon provide access to hardware resources, researchers still face significant barriers when attempting to recreate the precise software configurations, dependencies, and system states needed for reproducible experiments. Environment snapshotting tools offer a solution, but face technical challenges in consistently capturing running systems without introducing distortions or requiring disruptive system modifications. This project addresses these fundamental reproducibility barriers by enhancing CC-Snapshot, a tool that captures the experimental environment configured by the user as bare-metal images, to create more reliable and consistent system captures that can be shared and redeployed without loss of fidelity.&lt;/p>
&lt;p>&lt;a href="https://chameleoncloud.readthedocs.io/en/latest/technical/images.html#the-cc-snapshot-utility" target="_blank" rel="noopener">CC-Snapshot&lt;/a> is a tool on the &lt;a href="chameleoncloud.org">Chameleon&lt;/a> testbed that enables users to package their customized environments as complex images or appliances. By allowing researchers to share these environments easily, CC-Snapshot offers a powerful mechanism for reproducibility, ensuring that experiments can be replicated and extended by others.&lt;/p>
&lt;p>In this project, you will review existing CC-Snapshot workflows, research the latest snapshotting technologies, and develop enhancements that improve the tool’s usability and reliability. This includes ensuring snapshots are created consistently (even when the OS is actively running), preserving the integrity of user systems, and exploring advanced features such as out-of-band snapshotting and API-based triggers.&lt;/p>
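&lt;p>One simple validation this implies: two snapshots of an unchanged environment should be byte-identical, so comparing content digests is a cheap consistency check. A hedged sketch follows; the function names are ours, not part of CC-Snapshot.&lt;/p>

```python
import hashlib

# Hedged sketch of a consistency check implied above: two snapshots of an
# unchanged environment should be byte-identical, so comparing content
# digests is a cheap validation step. Function names are ours, not part
# of CC-Snapshot.

def image_digest(path, chunk_size=1048576):
    """Stream a (potentially large) image file through SHA-256."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

def snapshots_consistent(path_a, path_b):
    """True if two snapshot images captured exactly the same bytes."""
    return image_digest(path_a) == image_digest(path_b)
```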
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Improved Snapshot Consistency: New methods to capture the full state of a disk without risking corruption or data inconsistency.&lt;/li>
&lt;li>Enhanced Reproducibility: A refined workflow that allows researchers to reliably share custom environments, facilitating collaborative and repeatable experiments.&lt;/li>
&lt;li>User-Friendly Tooling: Streamlined processes that reduce disruption to running systems—so installing dependencies or rebooting into special environments is less burdensome.&lt;/li>
&lt;li>Exploratory Features (Stretch Goals): Advanced mechanisms to stream disk data in real time during snapshotting and to initiate snapshots via an API call (for parity with VM snapshots).&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Cloud Computing, Systems &amp;amp; Infrastructure, Reproducibility, Operating System Internals&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>: Linux / OS Concepts, Cloud Tools, Systems Programming / Scripting, DevOps / CI&lt;/p>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Ensure Snapshot Consistency
&lt;ul>
&lt;li>Reboot into a ramdisk and copy the offline disk.&lt;/li>
&lt;li>Use kexec to switch to/from a ramdisk environment without a full reboot.&lt;/li>
&lt;li>Change images to use a snapshot-capable filesystem (e.g., LVM) for safer live snapshots.&lt;/li>
&lt;li>Investigate additional methods (e.g., blog.benjojo.co.uk) for safely imaging live disks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Prevent System Modifications During Snapshot
&lt;ul>
&lt;li>Currently, CC-Snapshot installs dependencies (e.g., qemu-img) on the running system, affecting its state.&lt;/li>
&lt;li>In-Band Fix: Download and run tools in a temp directory with static linking, avoiding system-level changes.&lt;/li>
&lt;li>Out-of-Band Approach: Snapshots done via ramdisk or kexec do not require altering the running system.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>API-Triggered Snapshots
&lt;ul>
&lt;li>Extend or integrate with the Nova “snapshot instance” API to support the same workflow for bare metal.&lt;/li>
&lt;li>Leverage Ironic’s new “service steps” feature for an automated snapshot pipeline.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>(Stretch Goal) Streaming Snapshots
&lt;ul>
&lt;li>Modify the workflow to stream data directly to storage, rather than making a full local copy first.&lt;/li>
&lt;li>Explore incremental or differential snapshot techniques to reduce bandwidth usage and storage overhead.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>CarbonCast: Building an end-to-end consumption-based Carbon Intensity Forecasting service</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/carboncast/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/carboncast/</guid><description>&lt;p>&lt;a href="https://github.com/carbonfirst/carboncast" target="_blank" rel="noopener">CarbonCast&lt;/a> is a machine-learning-based approach to provide multi-day forecasts of the electrical grid&amp;rsquo;s carbon intensity. Developed in Python, the current version of CarbonCast delivers accurate forecasts in numerous regions by using historical source production data of a particular geographical region, time of day/year, and weather forecasts as features. However, there is no easy way to access and visualize the data through a standard interface. In addition, much important information is left out and unavailable to users: electricity grids often import electricity from neighboring regions, so electricity consumption depends on both local generation and imports, and each energy source is best served by a tailored predictive mechanism. Consequently, any carbon-optimization solution trying to reduce the carbon emissions of its electricity consumption will benefit more from following a consumption-based CI signal.&lt;/p>
&lt;p>The plan for this project is to develop both the frontend and the backend API services for CarbonCast. We also intend to enhance CarbonCast by implementing an architecture wherein each region can employ a distinct interface for their predictive modeling. In scenarios where these new models do not yield superior outcomes within a region, the current architecture will serve as a fallback solution.&lt;/p>
&lt;h3 id="building-an-end-to-end-consumption-based-carbon-intensity-forecasting-service">Building an end-to-end consumption-based Carbon Intensity Forecasting service&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Databases&lt;/code> &lt;code>Machine Learning&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, command line (bash), MySQL, Django, machine learning, cronjob&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/abel-souza/">Abel Souza&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a containerized end-to-end backend, API, and frontend for collecting, estimating, and visualizing real-time and forecast electrical grid&amp;rsquo;s carbon intensity data in a scalable manner.&lt;/p>
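&lt;p>The CSV-to-database step of this pipeline can be sketched with the standard library alone. The column names below are illustrative; CarbonCast&amp;rsquo;s real CSV schema may differ.&lt;/p>

```python
import csv, io, sqlite3

# Sketch of the CSV-to-database ingestion task using only the standard
# library. Column names are illustrative; CarbonCast's real CSV schema
# may differ.

sample_csv = """region,timestamp,carbon_intensity
CISO,2025-01-01T00:00Z,212.4
CISO,2025-01-01T01:00Z,198.7
"""

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE carbon_intensity (
    region TEXT, timestamp TEXT, carbon_intensity REAL)""")

rows = [(r["region"], r["timestamp"], float(r["carbon_intensity"]))
        for r in csv.DictReader(io.StringIO(sample_csv))]
db.executemany("INSERT INTO carbon_intensity VALUES (?, ?, ?)", rows)

# The API layer would then serve queries such as "latest CI for a region":
latest, = db.execute(
    "SELECT carbon_intensity FROM carbon_intensity "
    "WHERE region = 'CISO' ORDER BY timestamp DESC LIMIT 1")
print(latest[0])  # 198.7
```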
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Research web technologies and frameworks relevant to CarbonCast development.&lt;/li>
&lt;li>Run and collect CarbonCast&amp;rsquo;s data (CSV)&lt;/li>
&lt;li>Ingest CSV into a MySQL or SQLite database&lt;/li>
&lt;li>Develop an Application Programming Interface (API) and a Web User Interface (UI) to provide real-time data access and visualization.&lt;/li>
&lt;li>Deploy the CarbonCast API as a service and dockerize it so that other users and applications can locally deploy and use it easily.&lt;/li>
&lt;li>Implement a choropleth web map to visualize the carbon intensity data across the different geographical regions supported by CarbonCast.&lt;/li>
&lt;li>Enhance CarbonCast by implementing an extensible architecture wherein every region can employ distinct models for their predictive modeling.&lt;/li>
&lt;/ul></description></item><item><title>Chameleon Trovi Support for Complex Experiment Appliances</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/trovi/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/trovi/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The discoverability and accessibility of research artifacts remains a significant barrier to reproducibility in computer science research. While digital libraries index research papers, they rarely provide direct access to the artifacts needed to reproduce experiments, especially complex multi-node systems. Additionally, when artifacts are available, they often lack standardized metadata, versioning, and deployment mechanisms that would enable researchers to easily find and reuse them. This project addresses these challenges by extending Trovi, a repository of experimental artifacts executable on open platforms, to support complex, multi-node appliances, making sophisticated experimental environments discoverable, shareable, and deployable through a standardized interface - ultimately lowering the barriers to reproducing complex systems experiments.&lt;/p>
&lt;p>&lt;a href="chameleoncloud.org/">Chameleon&lt;/a> has historically enabled researchers to orchestrate complex appliances—large, multi-node clusters configured via OpenStack Heat—to conduct advanced experiments. Meanwhile, Chameleon team introduced &lt;a href="chameleoncloud.org/experiment/share">Trovi&lt;/a> as repository for open platforms (beyond Chameleon) that pioneers mechanisms for artifact and platform integration leading to immediate execution for pratical reproducibility. This project aims to bridge the two by adding support in Trovi for importing, discovering, and launching complex appliances. By integrating these capabilities, researchers will be able to one-click deploy complex appliances directly from the Trovi dashboard, archive them for future reference, and reproduce experiments on demand.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Extended Trovi API: Enable the import and management of complex appliances as artifacts.&lt;/li>
&lt;li>Streamlined One-Click Launch: Integrate with Chameleon’s existing provisioning workflows so users can launch multi-node clusters directly from Trovi.&lt;/li>
&lt;li>Enhanced Dashboard Experience: Provide UI assistance for discovering, reviewing, and customizing complex appliance artifacts.&lt;/li>
&lt;li>Improved Artifact Reproducibility: Automate the process of exporting CC-snapshot images and other resources to ensure everything is preserved across sites (UC, TACC), highlighting any parameters that need user attention for cross-site portability.&lt;/li>
&lt;/ul>
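&lt;p>To make the first outcome concrete, here is a minimal sketch of how a complex-appliance artifact with tags and immutable version labels might be modeled. This is plain illustrative Python; the field names are assumptions for this sketch, not the actual Trovi API schema.&lt;/p>

```python
from dataclasses import dataclass, field

@dataclass
class ApplianceArtifact:
    """Hypothetical record for a complex appliance in Trovi.

    Field names are illustrative only, not the real Trovi schema.
    """
    title: str
    heat_template: str   # path or URL of the Heat template
    disk_images: list    # image identifiers the template references
    tags: list = field(default_factory=list)
    versions: list = field(default_factory=list)  # ordered, immutable labels

    def publish_version(self, label: str) -> str:
        """Append a new immutable version label (e.g. 'v2') and return it."""
        if label in self.versions:
            raise ValueError(f"version {label!r} already exists")
        self.versions.append(label)
        return label

art = ApplianceArtifact(
    title="3-node Kafka cluster",
    heat_template="templates/kafka.yaml",
    disk_images=["CC-Ubuntu22.04-kafka"],
    tags=["multi-node", "streaming"],
)
art.publish_version("v1")
print(art.versions)  # ['v1']
```

&lt;p>Keeping versions append-only is one way to ensure that a launched experiment can always cite the exact appliance revision it used.&lt;/p>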
&lt;p>&lt;strong>Topics&lt;/strong>: &lt;code>Reproducible Research&lt;/code>, &lt;code>Cloud Computing &amp;amp; Orchestration&lt;/code>, &lt;code>OpenStack Heat&lt;/code>, &lt;code>UI/UX &amp;amp; Web Development&lt;/code>&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>: Python, APIs, Cloud (OpenStack), DevOps &amp;amp; Automation, Frontend&lt;/p>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large&lt;/p>
&lt;p>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Extensions to the Trovi API
&lt;ul>
&lt;li>Add support for importing complex appliances as artifacts (including Heat templates, metadata, and associated disk images).&lt;/li>
&lt;li>Develop methods for tagging, versioning, and categorizing these appliances, making them easier to discover.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>One-Click Launch of Complex Appliances
&lt;ul>
&lt;li>Integrate with Chameleon’s orchestration engine, enabling single-click cluster deployments from the Trovi UI.&lt;/li>
&lt;li>Validate correct configuration and resource availability through automated checks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Trovi Dashboard Enhancements
&lt;ul>
&lt;li>Update the front-end to provide intuitive controls for customizing or parameterizing complex appliances before launching.&lt;/li>
&lt;li>Offer a clear workflow for reviewing dependencies, resource requirements, and usage instructions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Automated Export &amp;amp; Multi-Site Testing
&lt;ul>
&lt;li>Streamline the export of snapshots or images into Trovi as part of the appliance import process.&lt;/li>
&lt;li>Optionally re-run the imported appliances at multiple sites (UC, TACC), detecting any unparameterized settings or missing dependencies.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Contextualization – Extending Chameleon’s Orchestration for One-Click Experiment Deployment</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/contextualization/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/contextualization/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Reproducibility in computer systems research is often hindered by the quality and completeness of artifact descriptions and the complexity of establishing experimental environments. When experiments involve multiple interconnected components, researchers struggle with hardcoded configurations, inadequate documentation of setup processes, and missing validation steps that would verify correct environment establishment. This project addresses these challenges by extending orchestration capabilities beyond basic hardware provisioning to include comprehensive contextualization—making complex, multi-component experimental environments deployable via parameterized templates with clear validation points, standardized metadata, and minimal user intervention—thus significantly reducing the barriers to reproducing complex distributed systems experiments.&lt;/p>
&lt;p>&lt;a href="chameleoncloud.org">Chameleon&lt;/a> already provides powerful capabilities to orchestrate and configure resources through Heat templates (similar to Terraform) and the &lt;a href="https://python-chi.readthedocs.io/" target="_blank" rel="noopener">python-chi&lt;/a> library. However, these focus primarily on provisioning (i.e., allocating and configuring hardware resources). This project goes a step further by addressing contextualization—the process of creating complete, ready-to-use experimental environments that incorporate everything from network layout to instance-level configuration and discovery—with additional features such as parameterized templates, experiment-level metadata, and output reporting.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Template-Based One-Click Launch: Users can deploy multi-resource experiments (VMs, networks, storage, etc.) via a single click or a minimal set of input parameters.&lt;/li>
&lt;li>Enhanced Experiment Contextualization: Each launched resource can gain access to global “experiment-level” metadata (e.g., IP-to-hostname mappings for cluster authentication) and outputs that summarize important details.&lt;/li>
&lt;li>Streamlined User Experience: An asynchronous deployment workflow that provides notifications and uses “outputs” to highlight critical connection information (e.g., bastion host IP, final results).&lt;/li>
&lt;li>Optional Advanced Features: Partial reconfiguration to avoid full rebuilds when changes are minor, an “export” function to capture existing deployments into a new template, and potential publishing to Trovi for reproducibility and archiving.&lt;/li>
&lt;/ul>
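&lt;p>As a rough illustration of the experiment-level metadata and “outputs” ideas above, the sketch below renders an IP-to-hostname mapping into /etc/hosts lines and extracts template-defined outputs. It is plain Python; the stack dict layout is an assumption modeled on Heat’s output_key/output_value convention, not a verbatim API response.&lt;/p>

```python
def render_hosts(mapping):
    """Render an IP-to-hostname mapping into /etc/hosts lines so every
    instance in the experiment can resolve its peers by name."""
    return "\n".join(f"{ip}  {host}" for host, ip in sorted(mapping.items()))

def collect_outputs(stack):
    """Pull the template-defined outputs (e.g. bastion IP) out of a deployed
    stack description. The dict layout is illustrative, loosely following
    Heat's output_key/output_value convention."""
    return {o["output_key"]: o["output_value"] for o in stack.get("outputs", [])}

experiment = {"node0": "10.0.0.11", "node1": "10.0.0.12", "bastion": "10.0.0.1"}
print(render_hosts(experiment))

stack = {"outputs": [{"output_key": "bastion_ip", "output_value": "129.114.0.5"}]}
print(collect_outputs(stack))  # {'bastion_ip': '129.114.0.5'}
```

&lt;p>In the real system, the hosts rendering would happen during contextualization on each instance, and the outputs would surface in the notification shown to the user once deployment completes.&lt;/p>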
&lt;p>&lt;strong>Topics&lt;/strong>: Cloud Computing &amp;amp; Orchestration, Infrastructure as Code, DevOps &amp;amp; Automation, Reproducible Research Environments&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>OpenStack &amp;amp; Heat Templates: Familiarity with provisioning resources on Chameleon using Heat or Terraform-like workflows.&lt;/li>
&lt;li>Python &amp;amp; Scripting: For enhancing or extending the python-chi library.&lt;/li>
&lt;li>Systems / Network Knowledge: Understanding multi-VM topologies, cluster configurations, and network-level interactions.&lt;/li>
&lt;li>CI/CD &amp;amp; DevOps: Experience building or integrating asynchronous deployment and notifications.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large (suitable for a semester-long project or a summer internship)&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/paul-marshall/">Paul Marshall&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>One-Click Template Launch
&lt;ul>
&lt;li>Design a template (in Heat or similar) specifying multiple cloud resources (images, networks, disk images, SSH keys, etc.).&lt;/li>
&lt;li>Ensure the template author can define input parameters with defaults.&lt;/li>
&lt;li>Allow the user to launch the template quickly with default values or adjust parameters before deployment.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Asynchronous Provisioning &amp;amp; Notifications
&lt;ul>
&lt;li>Implement a long-running process that deploys resources step-by-step.&lt;/li>
&lt;li>Provide status updates to the user (e.g., via UI notifications, email, or logs) when deployments complete or fail.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Experiment-Level Metadata
&lt;ul>
&lt;li>Inject metadata such as IP-to-hostname mappings into each instance for easy cluster authentication.&lt;/li>
&lt;li>Allow the template to define “outputs” (like a public IP of a bastion or location of final results).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Partial Reconfiguration (Optional)
&lt;ul>
&lt;li>Enable partial updates if only one of several servers changes, saving time and resources.&lt;/li>
&lt;li>Improve fault tolerance by avoiding full redeploys in the event of partial failures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Export Running Configurations into a New Template (Optional)
&lt;ul>
&lt;li>Build a web-interface or script to detect existing user-owned resources (servers, networks, etc.).&lt;/li>
&lt;li>Generate a proposed template from those resources, suggesting parameters (e.g., flavor, disk image, or SSH key).&lt;/li>
&lt;li>Extend or modify existing templates by adding discovered resources.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Integration with Trovi / Multi-Site Testing (Optional)
&lt;ul>
&lt;li>Provide a method to archive or publish the final template (and associated disk images, data sets) in Trovi.&lt;/li>
&lt;li>Attempt to re-run the template at multiple Chameleon sites (e.g., UC, TACC) to identify parameters or modifications needed for cross-site reproducibility.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>MPI Appliance for HPC Research on Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/mpi/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/mpi/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Message Passing Interface (MPI) is the dominant programming model for high-performance computing (HPC), enabling applications to scale efficiently across thousands of processing cores. In reproducibility initiatives for HPC research, MPI implementations are critical as they manage the complex communications that underpin parallel scientific applications. However, reproducing MPI-based experiments remains challenging due to the need for specific library versions, network configurations, and multi-node setups that must be precisely orchestrated.&lt;/p>
&lt;p>Because an “MPI cluster” is the base layer for many HPC results, the SC24 reproducibility chair specifically requested an MPI template and appliance to support the conference&amp;rsquo;s reproducibility initiative, providing researchers with standardized environments for validating results. By extending the work begun for SC24, this project aims to create higher-quality, ready-to-use, and maintainable MPI environments for the Chameleon testbed that abstract away complex configuration details while ensuring consistent performance across experiments—thus making HPC experiments more accessible and reproducible for the broader research community.&lt;/p>
&lt;p>You will lead efforts to configure disk images with the necessary MPI dependencies and provide orchestration templates that set up networking and instances automatically. The resulting appliance will allow researchers to quickly and consistently deploy distributed computing environments with MPI. The goal is to facilitate reproducible and scalable computational experiments for a wide range of scientific and engineering applications.&lt;/p>
&lt;h1 id="key-outcomes">Key Outcomes&lt;/h1>
&lt;ul>
&lt;li>Ready-to-Use MPI Disk Images: Create one or more images pre-configured with the correct versions of MPI and dependencies, ensuring a consistent environment.&lt;/li>
&lt;li>Simple Cluster Configuration Scripts: Provide scripts or playbooks that efficiently bring up a fully functional MPI cluster on Chameleon, abstracting away manual setup steps.&lt;/li>
&lt;li>Orchestration Template: An automated workflow that sets up networks, instances, and additional resources needed to run large-scale MPI workloads.&lt;/li>
&lt;/ul>
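&lt;p>As a small sketch of what the cluster setup scripts might generate once instances are up, the plain-Python snippet below builds a hostfile body and a launch command. The flag syntax follows Open MPI's hostfile convention (“host slots=N”, launched with mpirun); MPICH uses a slightly different “host:N” machinefile format, as noted in the comments.&lt;/p>

```python
import shlex

def make_hostfile(ips, slots_per_node):
    """Emit an Open MPI hostfile body: one 'ip slots=N' line per instance.
    (MPICH instead expects an 'ip:N' machinefile.)"""
    return "\n".join(f"{ip} slots={slots_per_node}" for ip in ips)

def mpirun_command(n_procs, hostfile, binary):
    """Build the launch command an orchestration template could run on the
    head node once all instances have joined the cluster."""
    return f"mpirun -np {n_procs} --hostfile {shlex.quote(hostfile)} {shlex.quote(binary)}"

ips = ["10.0.0.11", "10.0.0.12"]
print(make_hostfile(ips, 4))
print(mpirun_command(8, "hosts.txt", "./ring_bench"))
```

&lt;p>A validation step would then run a distributed “Hello World” with this command before handing the cluster to the user.&lt;/p>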
&lt;p>&lt;strong>Topics&lt;/strong>: High-Performance Computing (HPC), Cloud Computing, MPI &amp;amp; Distributed Systems, DevOps &amp;amp; Automation&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>MPI &amp;amp; Parallel Programming: Understanding of MPI libraries, cluster configuration, and typical HPC workflows.&lt;/li>
&lt;li>Cloud Orchestration: Familiarity with OpenStack Heat or other Infrastructure-as-Code (IaC) tools for provisioning resources.&lt;/li>
&lt;li>Linux System Administration: Experience configuring and troubleshooting packages, network settings, and performance optimizations.&lt;/li>
&lt;li>Scripting &amp;amp; Automation: Ability to write scripts (e.g., Bash, Python) to automate setup and deployment steps.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate to Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ken-raffenetti/">Ken Raffenetti&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Disk Images with MPI Dependencies
&lt;ul>
&lt;li>Build base images with the correct versions of MPI (e.g., MPICH, OpenMPI) and any required libraries (e.g., GCC, network libraries).&lt;/li>
&lt;li>Ensure all packages are up to date and tested for compatibility with Chameleon’s bare metal and/or VM environments.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Cluster Setup Scripts
&lt;ul>
&lt;li>Develop lightweight scripts or Ansible playbooks that join new instances into an MPI cluster, configuring hostnames, SSH keys, and MPI runtime settings.&lt;/li>
&lt;li>Validate cluster functionality by running simple distributed “Hello World” tests and more advanced benchmarks (e.g., Intel MPI Benchmarks).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Orchestration Template
&lt;ul>
&lt;li>Provide a Heat template (or similar) specifying the network configuration, instance counts, and environment variables for MPI.&lt;/li>
&lt;li>Enable easy parameterization of cluster size, disk images, and other variables so users can customize their setups on the fly.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Integration &amp;amp; Testing
&lt;ul>
&lt;li>Document best practices for launching and using the MPI images in Chameleon.&lt;/li>
&lt;li>Demonstrate reproducibility with multiple cluster sizes and workloads to ensure reliability.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Smart Environments – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/envgym/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/envgym/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The complexity of environment setup and the expertise required to configure specialized software stacks can often hinder efforts to reproduce important scientific achievements in HPC and systems studies. Researchers often struggle with incomplete or ambiguous artifact descriptions that make assumptions about &amp;ldquo;common knowledge&amp;rdquo; that is actually specific domain expertise. When trying to reproduce experiments, reviewers may spend excessive time debugging environment inconsistencies rather than evaluating the actual research. These challenges are compounded when experiments need to run on different hardware configurations.&lt;/p>
&lt;p>This project seeks to address these fundamental reproducibility barriers by using AI to translate natural language environment requirements often used in papers or artifact descriptions into actionable, reproducible configurations—bridging the knowledge gap between experiment authors and reviewers while standardizing environment creation across different hardware platforms. We will develop an AI-driven system that automatically generates and configures reproducible computing environments based on artifact descriptions from conferences, Trovi artifacts on the &lt;a href="https://chameleoncloud.org">Chameleon&lt;/a> testbed, and other reliable sources for scientific experiment code and associated documentation. Leveraging Natural Language Processing (NLP), the system will allow researchers to describe desired environments in plain English, then map those descriptions onto predefined configuration templates. By simplifying environment creation and ensuring reproducibility, the system promises to eliminate duplicate setup efforts, accelerate research workflows, and promote consistent experimentation practices across diverse hardware.&lt;/p>
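&lt;p>To illustrate the direction (not the planned implementation, which would use a real NLP model rather than keywords), even a toy parser can map a plain-English request onto an environment “recipe”; everything below is an illustrative sketch:&lt;/p>

```python
import re

# Toy keyword/regex parser; the real system would use an NLP pipeline.
KNOWN = {
    "python": r"python\s*([0-9.]+)",
    "cuda": r"cuda\s*([0-9.]+)",
}

def parse_requirements(text):
    """Turn a plain-English request into an environment 'recipe' dict.
    Versioned tools are captured with their versions; a few known package
    names are collected as-is."""
    text_l = text.lower()
    recipe = {}
    for name, pat in KNOWN.items():
        m = re.search(pat, text_l)
        if m:
            recipe[name] = m.group(1).rstrip(".")
    for pkg in ("scikit-learn", "tensorflow", "pytorch", "mpich"):
        if pkg in text_l:
            recipe.setdefault("packages", []).append(pkg)
    return recipe

print(parse_requirements("I need Python 3.9, CUDA 11, and scikit-learn"))
# {'python': '3.9', 'cuda': '11', 'packages': ['scikit-learn']}
```

&lt;p>The recipe dict is the hand-off point: the backend environment builder would translate it into a machine-image definition and validate the result on provisioned hardware.&lt;/p>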
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Working Prototype: A system that automatically generates machine images deployable on bare metal and VM instances, based on user-provided requirements.&lt;/li>
&lt;li>Comprehensive Documentation: Detailed user manuals, guides, and best practices tailored to researchers, ensuring a smooth adoption process.&lt;/li>
&lt;li>Live Demo: A demonstration environment (e.g., a web app or Jupyter notebook) that shows how to request, configure, and launch reproducible cloud environments on both hardware profiles.&lt;/li>
&lt;li>Long-Term Impact: Building blocks for future AI-driven automation of cloud infrastructure, reducing human error and enabling fast, repeatable research pipelines.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Reproducibility, AI &amp;amp; NLP, Cloud Computing, DevOps and Automation&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Machine Learning / AI: Familiarity with NLP methods to interpret user requirements.&lt;/li>
&lt;li>Python: Primary language for backend services and cloud interactions.&lt;/li>
&lt;li>Cloud API Integration: Experience with OpenStack or similar APIs to provision and configure images on both bare metal and virtual machines.&lt;/li>
&lt;li>DevOps: Automated environment configuration, CI/CD workflows, and containerization.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/paul-marshall/">Paul Marshall&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Requirement Gathering &amp;amp; NLP Design
&lt;ul>
&lt;li>Research the specific needs of researchers building experimental setups.&lt;/li>
&lt;li>Design an NLP pipeline to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into environment “recipes.”&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Backend Environment Builder
&lt;ul>
&lt;li>Implement logic that converts parsed user requirements into machine-image definitions for bare metal and VM instances.&lt;/li>
&lt;li>Integrate with Chameleon’s APIs to provision servers, install software, and run configuration validation automatically.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Front-End &amp;amp; User Experience
&lt;ul>
&lt;li>Develop an intuitive web or CLI interface that researchers can use to capture experiment environment requirements.&lt;/li>
&lt;li>Provide real-time status updates during environment setup, along with meaningful error messages and quick-start templates.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Testing &amp;amp; Validation
&lt;ul>
&lt;li>Conduct end-to-end tests using diverse software stacks (e.g., HPC libraries, machine learning frameworks) on bare metal and VM instances.&lt;/li>
&lt;li>Ensure reproducibility by re-creating the same environment multiple times and comparing configurations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Documentation &amp;amp; Demonstration
&lt;ul>
&lt;li>Produce user-facing documentation, including tutorials and best practices for researchers who frequently run experiments on Chameleon Cloud.&lt;/li>
&lt;li>Create a short live demo or screencast showcasing how to configure an environment for a specific research workflow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Smart Environments – An AI System for Reproducible Custom Computing Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/smart-environments/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/smart-environments/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The complexity of environment setup and the expertise required to configure specialized software stacks can often hinder efforts to reproduce important scientific achievements in HPC and systems studies. Researchers often struggle with incomplete or ambiguous artifact descriptions that make assumptions about &amp;ldquo;common knowledge&amp;rdquo; that is actually specific domain expertise. When trying to reproduce experiments, reviewers may spend excessive time debugging environment inconsistencies rather than evaluating the actual research. These challenges are compounded when experiments need to run on different hardware configurations.&lt;/p>
&lt;p>This project seeks to address these fundamental reproducibility barriers by using AI to translate natural language environment requirements often used in papers or artifact descriptions into actionable, reproducible configurations—bridging the knowledge gap between experiment authors and reviewers while standardizing environment creation across different hardware platforms. We will develop an AI-driven system that automatically generates and configures reproducible computing environments based on artifact descriptions from conferences, Trovi artifacts on the &lt;a href="https://chameleoncloud.org">Chameleon&lt;/a> testbed, and other reliable sources for scientific experiment code and associated documentation. Leveraging Natural Language Processing (NLP), the system will allow researchers to describe desired environments in plain English, then map those descriptions onto predefined configuration templates. By simplifying environment creation and ensuring reproducibility, the system promises to eliminate duplicate setup efforts, accelerate research workflows, and promote consistent experimentation practices across diverse hardware.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>Working Prototype: A system that automatically generates machine images deployable on bare metal and VM instances, based on user-provided requirements.&lt;/li>
&lt;li>Comprehensive Documentation: Detailed user manuals, guides, and best practices tailored to researchers, ensuring a smooth adoption process.&lt;/li>
&lt;li>Live Demo: A demonstration environment (e.g., a web app or Jupyter notebook) that shows how to request, configure, and launch reproducible cloud environments on both hardware profiles.&lt;/li>
&lt;li>Long-Term Impact: Building blocks for future AI-driven automation of cloud infrastructure, reducing human error and enabling fast, repeatable research pipelines.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Topics&lt;/strong>: Reproducibility, AI &amp;amp; NLP, Cloud Computing, DevOps and Automation&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Machine Learning / AI: Familiarity with NLP methods to interpret user requirements.&lt;/li>
&lt;li>Python: Primary language for backend services and cloud interactions.&lt;/li>
&lt;li>Cloud API Integration: Experience with OpenStack or similar APIs to provision and configure images on both bare metal and virtual machines.&lt;/li>
&lt;li>DevOps: Automated environment configuration, CI/CD workflows, and containerization.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Large&lt;/p>
&lt;p>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/paul-marshall/">Paul Marshall&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Requirement Gathering &amp;amp; NLP Design
&lt;ul>
&lt;li>Research the specific needs of researchers building experimental setups.&lt;/li>
&lt;li>Design an NLP pipeline to parse plain-English descriptions (e.g., “I need Python 3.9, CUDA 11, and scikit-learn”) into environment “recipes.”&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Backend Environment Builder
&lt;ul>
&lt;li>Implement logic that converts parsed user requirements into machine-image definitions for bare metal and VM instances.&lt;/li>
&lt;li>Integrate with Chameleon’s APIs to provision servers, install software, and run configuration validation automatically.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Front-End &amp;amp; User Experience
&lt;ul>
&lt;li>Develop an intuitive web or CLI interface that researchers can use to capture experiment environment requirements.&lt;/li>
&lt;li>Provide real-time status updates during environment setup, along with meaningful error messages and quick-start templates.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Testing &amp;amp; Validation
&lt;ul>
&lt;li>Conduct end-to-end tests using diverse software stacks (e.g., HPC libraries, machine learning frameworks) on bare metal and VM instances.&lt;/li>
&lt;li>Ensure reproducibility by re-creating the same environment multiple times and comparing configurations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Documentation &amp;amp; Demonstration
&lt;ul>
&lt;li>Produce user-facing documentation, including tutorials and best practices for researchers who frequently run experiments on Chameleon Cloud.&lt;/li>
&lt;li>Create a short live demo or screencast showcasing how to configure an environment for a specific research workflow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Widgets for Python-chi in Jupyter</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/jupyter-widgets/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uchicago/jupyter-widgets/</guid><description>&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>Reproducibility challenges in research extend beyond code and environments to the experimental workflow itself. When experiments involve dynamic resource allocation, monitoring, and reconfiguration, researchers often struggle to document these interactive steps in a way that others can precisely follow. The lack of structured workflow documentation and real-time feedback creates barriers for reviewers attempting to reproduce experiments, as they cannot easily verify whether their resource configurations match the original experiment&amp;rsquo;s state. This project addresses these challenges by developing interactive Jupyter widgets that make experiment resource management more visual, intuitive, and self-documenting—transforming ad-hoc command sequences into reproducible workflows that automatically log interactions and configuration changes while providing immediate visual feedback on experiment topology and resource states.&lt;/p>
&lt;p>As cloud researchers often work with Jupyter Notebooks for interactive data analysis and experimentation, the &lt;a href="https://python-chi.readthedocs.io/" target="_blank" rel="noopener">python-chi&lt;/a> library offers a powerful way to automate and control resources on &lt;a href="https://chameleoncloud.org">Chameleon Cloud&lt;/a>. This project will extend python-chi by adding interactive widgets specifically designed for use in Jupyter, empowering users to launch, monitor, and manage their experiments without leaving the notebook environment. By bringing visual and intuitive controls directly into the user’s workflow, we aim to improve both reproducibility and usability for complex resource management tasks.&lt;/p>
&lt;h2 id="key-outcomes">Key Outcomes&lt;/h2>
&lt;ul>
&lt;li>User-Friendly Jupyter Widgets: Develop a suite of widgets to visualize reserved resources, hardware availability, and experiment topologies in real time.&lt;/li>
&lt;li>Integrated Experiment Management: Enable researchers to orchestrate experiments (launch, configure, monitor) within a single, notebook-centric workflow.&lt;/li>
&lt;li>Enhanced Feedback &amp;amp; Usability: Provide clear, asynchronous status updates and resource reconfiguration progress, reducing confusion and user error.&lt;/li>
&lt;li>Improved Reproducibility: By automating and logging widget interactions, experiments become more traceable and easier to replicate.&lt;/li>
&lt;/ul>
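&lt;p>As a sketch of the reproducibility outcome above, the minimal logger below records each interaction as a structured, replayable step. It is plain illustrative Python; a real widget front-end (e.g. ipywidgets event callbacks) would call it from its handlers rather than being driven directly.&lt;/p>

```python
import json
import time

class ActionLog:
    """Record each widget interaction as a structured, replayable step so a
    notebook session can be audited or re-run later. Hypothetical design
    sketch, not part of the current python-chi API."""
    def __init__(self):
        self.steps = []

    def record(self, action, **params):
        """Append one interaction, e.g. a button click that reserved a node."""
        self.steps.append({"action": action, "params": params, "ts": time.time()})

    def export(self):
        """Serialize the session, e.g. for attachment to a Trovi artifact."""
        return json.dumps(self.steps, indent=2)

log = ActionLog()
log.record("reserve_node", node_type="gpu_p100", count=2)
log.record("launch_server", image="CC-Ubuntu22.04")
print(len(log.steps))  # 2
```

&lt;p>Replaying the exported steps against python-chi would let a reviewer reconstruct the same resource configuration the original experiment used.&lt;/p>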
&lt;p>&lt;strong>Topics&lt;/strong>: Interactive Data Tools, Cloud Resource Management, DevOps &amp;amp; Automation, User Experience (UX)&lt;/p>
&lt;p>&lt;strong>Skills&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Python &amp;amp; Jupyter: Experience creating custom Jupyter widgets, using ipywidgets or similar frameworks.&lt;/li>
&lt;li>Cloud Automation: Familiarity with how resources are provisioned, monitored, and deprovisioned on Chameleon.&lt;/li>
&lt;li>Frontend / GUI Development: Basic understanding of web technologies (HTML/CSS/JavaScript) can be helpful for widget design.&lt;/li>
&lt;li>Software Engineering &amp;amp; CI: Ability to version-control, test, and deploy Python packages.&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/p>
&lt;p>&lt;strong>Size&lt;/strong>: Medium&lt;/p>
&lt;p>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/p>
&lt;p>&lt;strong>Tasks&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Resource Visualization Widgets
&lt;ul>
&lt;li>Build custom widgets that show reserved resources (nodes, networks, storage) in Jupyter.&lt;/li>
&lt;li>Provide an interactive topology view for experiments, indicating node statuses and connections.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Experiment Setup &amp;amp; Execution
&lt;ul>
&lt;li>Add controls for launching and managing experiments directly from notebooks.&lt;/li>
&lt;li>Show feedback (e.g., progress bars, status messages) as resources are being allocated or reconfigured.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Hardware Availability &amp;amp; Status Tracking
&lt;ul>
&lt;li>Implement a widget that provides real-time data on Chameleon’s hardware availability (bare metal, VMs, GPU nodes, etc.).&lt;/li>
&lt;li>Allow users to filter or select specific resources based on current hardware states.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Usability &amp;amp; Feedback Loop
&lt;ul>
&lt;li>Gather user feedback on the widget designs and workflows.&lt;/li>
&lt;li>Refine the interface to minimize clicks, improve clarity, and reduce friction for common tasks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/</link><pubDate>Sat, 15 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/</guid><description>&lt;h2 id="project-description">Project Description&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Distributed systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Java, Go, Python, Bash scripting, Linux, Docker.&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="mailto:fikurnia@cs.umass.edu">Fadhil I. Kurnia&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Replication is commonly employed to improve system availability and reduce latency. By maintaining multiple copies, the system can continue operating even if some replicas fail, thereby ensuring consistent availability. Placing replicas closer to users further decreases latency by minimizing the distance data must travel. A typical illustration of these advantages is a Content Delivery Network (CDN), where distributing content to edge servers can yield latencies of under 10 milliseconds when users and content are in the same city.&lt;/p>
&lt;p>In recent times, numerous edge datastores have emerged, allowing dynamic data to be served directly from network-edge replicas. Each of these replicated systems may employ different coordination protocols to synchronize replicas, leading to varied performance and consistency characteristics. For instance, Workers KV relies on a push-based coordination mechanism that provides eventual consistency, whereas Cloudflare Durable Objects and Turso deliver stronger consistency guarantees. Additionally, researchers have introduced various coordination protocols—such as SwiftPaxos, EPaxos, OPaxos, WPaxos, Raft, PANDO, and QuePaxa—each exhibiting its own performance profile, especially when being used in geo-distributed deployment.&lt;/p>
&lt;p>This project aims to develop an open testbed for evaluating replicated systems and their coordination protocols under edge deployment. Currently, researchers face challenges in fairly comparing different replicated systems, as they often lack control over replica placement. Many previous studies on coordination protocols and replicated systems relied on mock implementations, particularly for well-known systems like Dynamo and Spanner, which are not open source. An open testbed would provide a standardized environment where researchers can compare various replicated systems, classes of coordination protocols, and specific protocol implementations using common benchmarks. Since the performance of replicated systems and coordination protocols varies depending on the application, workload, and replica placement, this testbed would offer a more systematic and fair evaluation framework. Furthermore, by enabling easier testing and validation, the testbed could accelerate the adoption of research prototypes in the industry.&lt;/p>
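&lt;p>As a toy illustration of why replica placement matters for coordination protocols, the sketch below models the commit latency of a majority-quorum protocol as the round-trip time to the fastest majority of peers. The replica counts and RTT values are hypothetical, not drawn from any of the systems above.&lt;/p>

```python
# Toy model: commit latency of a majority-quorum protocol is roughly the
# round-trip time to the fastest majority of peers. RTTs are hypothetical.

def quorum_latency(rtts_ms, quorum_size):
    """Latency for the leader to hear from a quorum (leader counts itself)."""
    acks_needed = quorum_size - 1           # the leader's own "ack" is free
    return sorted(rtts_ms)[acks_needed - 1]

# RTTs (ms) from a leader to its four peer replicas, in two placements.
edge_rtts = [3, 5, 8, 40]                   # mostly same-city peers
wan_rtts = [70, 80, 90, 120]                # geo-distributed peers

n = 5                                       # total replicas including leader
majority = n // 2 + 1                       # 3 of 5

print(quorum_latency(edge_rtts, majority))  # 5 ms: nearby quorum suffices
print(quorum_latency(wan_rtts, majority))   # 80 ms: WAN quorum dominates
```

&lt;p>With mostly same-city peers the quorum completes in a few milliseconds, while a geo-distributed quorum is dominated by WAN round trips; a testbed with controlled replica placement makes such differences measurable rather than assumed.&lt;/p>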
&lt;h2 id="project-deliverables">Project Deliverables&lt;/h2>
&lt;ul>
&lt;li>Compilation of traces and applications from various open traces and open benchmarks.&lt;/li>
&lt;li>Distributed workload generator to run the traces and applications.&lt;/li>
&lt;li>Test framework to simulate the latency of hundreds of edge servers for measurement.&lt;/li>
&lt;li>Open artifact of the traces, applications, workload generator, and test framework, published on GitHub.&lt;/li>
&lt;/ul></description></item><item><title>Vector Embeddings Dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings/</link><pubDate>Tue, 11 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings/</guid><description>&lt;h3 id="vector-embeddings-dataset">Vector Embeddings Dataset&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Vector Embeddings&lt;/code> &lt;code>LLMs&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, apis, scripting, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:jayjeetc@ucsc.edu">Jayjeet Chakraborty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>To benchmark vector search (a.k.a. ANN) algorithms, several datasets are available, but none of
them represent actual real-world workloads, because they usually contain small vectors of only a few hundred
dimensions. For vector search experiments to represent real-world workloads, we want datasets with
several thousand dimensions, like those generated by OpenAI&amp;rsquo;s text-embedding models. This project aims to create a
dataset of 1B embeddings from a Wikipedia dataset using open-source models. Ideally, we will have three versions of this dataset, with 1024-, 4096-, and 8192-dimensional embeddings to start with.&lt;/p></description></item><item><title>Brahma</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/brahma/</link><pubDate>Tue, 11 Feb 2025 12:34:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/brahma/</guid><description>&lt;p>Brahma is a lightweight framework for building collaborative, cross-platform WebXR-based experiences using Three.js for the front-end and a simple Node.js/WebSocket script on the backend. It was created at the Social Emotional Technology Lab to facilitate the development of novel collaborative interfaces and virtual environments capable of loading scientific datasets. For example, in the featured image, multiple avatars are exploring a &lt;a href="https://www.science.org/doi/10.1126/science.adf0566" target="_blank" rel="noopener">marine science dataset related to seal migration paths&lt;/a> overlaid on NOAA bathymetry and telemetry data.&lt;/p>
&lt;p>It addresses a gap left by prior open-source collaborative VR tools that are no longer available, such as the defunct &lt;a href="https://support.mozilla.org/en-US/kb/end-support-mozilla-hubs" target="_blank" rel="noopener">Mozilla Hubs&lt;/a>, and by proprietary engine-based frameworks such as &lt;a href="https://ubiq.online/" target="_blank" rel="noopener">Ubiq&lt;/a>. Furthermore, it requires very few computational resources to run and develop, enabling creators who lack a computer powerful enough to run a game engine to develop a networked VR application.&lt;/p>
&lt;p>This project involves the first public release of Brahma&amp;ndash; creating a lightweight open source framework that facilitates multi-user games, scientific visualizations and other applications. In order to do so, we need to formalize the framework, provide documentation, and implement key examples so that the open source tool can be extensible and serve a wider community.&lt;/p>
&lt;p>Mentees can expect to learn best practices for VR development and testing and gain familiarity with full stack development practices. Mentees should have access and experience using a VR headset.&lt;/p>
&lt;h1 id="brahma--protoocol-release-and-validation">Brahma / Protoocol Release and Validation&lt;/h1>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>Software Architecture&lt;/code> &lt;code>VR Development&lt;/code> &lt;code>Computer Graphics&lt;/code> &lt;code>Cloud Platforms&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Node.js, Three.js&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:sghosh17@ucsc.edu">Samir Ghosh&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The proposed work includes three phases, primarily working on backend code and API design. In the first phase, to gain familiarity, the mentee will run and test the Brahma backend on a variety of cloud platforms such as AWS, Google Cloud, and Azure, learning best methods for documentation in the process. Then, in the second phase, the mentee will work on formalizing the protocol for avatar embodiment and other multi-user interfaces, testing the application with a simple pong game. In the third phase, the mentee will address telemetry, logging, and analysis considerations.&lt;/p>
&lt;p>This project is well suited for someone who has an interest in virtual reality, especially social VR, multi-user, or collaborative applications.&lt;/p>
&lt;h1 id="brahma--allocentric-webxr-interfaces">Brahma / Allocentric WebXR Interfaces&lt;/h1>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>VR Development&lt;/code> &lt;code>Computer Graphics&lt;/code> &lt;code>UX/UI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Three.js, GLSL, WebSocket&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:sghosh17@ucsc.edu">Samir Ghosh&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The proposed work primarily involves front-end code and VR interface design. In the first phase, the mentee will gain familiarity with best practices for WebXR development through the implementation and documentation of simple interaction patterns. Then, the mentee will implement a simple multi-user pong game to learn about allocentric interfaces. In the final phase of the project, the mentee will design and implement one or more allocentric interfaces of their choosing.&lt;/p>
&lt;p>This project is well suited for someone who has an interest in virtual reality, especially aspects of graphics and interaction design.&lt;/p></description></item><item><title>WildBerryEye</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/</link><pubDate>Tue, 11 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wildberryeye/</guid><description>&lt;p>WildBerryEye leverages Raspberry Pi and YOLO object detection models to monitor pollinators such as bees and hummingbirds visiting flowers. This initiative aims to enhance environmental research by automating data collection and analysis of pollinator activities, which are crucial for ecological assessments and conservation efforts. The project utilizes video data provided by &lt;a href="https://www.researchgate.net/profile/Rossana-Maguina-Conde" target="_blank" rel="noopener">Dr. Rossana Maguiña&lt;/a>, processed through advanced machine learning techniques to accurately identify and track pollinator interactions in natural habitats.&lt;/p>
&lt;h3 id="develop-web-based-user-interface">Develop web-based user interface&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Full Stack Development&lt;/code> &lt;code>React&lt;/code> &lt;code>Flask&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Experience with full stack development and real time processing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate to Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hrs)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:caiespin@ucsc.edu">Carlos Isaac Espinosa Ramirez&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a clean and intuitive web-based interface for WildBerryEye, ensuring ease of use for researchers and contributors. The platform should present real-time pollinator detection results, facilitate data visualization, and allow users to interact with system settings efficiently. The website must be accessible, visually appealing, and optimized for both desktop and mobile users, avoiding unnecessary complexity or intrusive elements.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Frontend Development: Continue development to enhance the user interface using React and CSS, ensuring a responsive and user-friendly design.&lt;/li>
&lt;li>Backend Development: Expand functionality using Flask, focusing on efficient API endpoints and seamless interaction with the frontend (excluding database implementation).&lt;/li>
&lt;li>Real-Time Communication: Implement and refine real-time updates between the frontend and backend to enhance system responsiveness.&lt;/li>
&lt;li>Usability &amp;amp; Design Optimization: Research and propose improvements to the system’s usability, design, and overall user experience.&lt;/li>
&lt;/ul></description></item><item><title>AI Data Readiness Inspector (AIDRIN)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/aidrin/</guid><description>&lt;p>Garbage In Garbage Out (GIGO) is a principle universally accepted by computer scientists across domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing data for AI.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/pdf/2406.19256" target="_blank" rel="noopener">AIDRIN&lt;/a> (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of the readiness of data for AI processes, covering a broad range of readiness dimensions available in the literature. AIDRIN uses metrics in traditional data quality assessment, such as completeness, outliers, and duplicates, for data evaluation. Furthermore, AIDRIN uses metrics specific to assess data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and FAIR (Findability, Accessibility, Interoperability, and Reusability) principle compliance. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data.&lt;/p>
&lt;h3 id="aidrin-visualizations-and-science-gateway">AIDRIN Visualizations and Science Gateway&lt;/h3>
&lt;p>The proposed work will include improvements in the AIDRIN framework to (1) enhance, extend, and optimize the visualizations of metrics related to all six pillars of AI data readiness and (2) set up a science gateway on NERSC or AWS cloud service.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>data readiness&lt;/code> &lt;code>AI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench with AI workloads</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
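&lt;p>The data-locality dimension mentioned above (arrays of basic data types versus arrays of structures) can be illustrated without HDF5 at all. The sketch below packs the same hypothetical particle records in both layouts using only the Python standard library; h5bench itself implements such kernels in C against parallel HDF5.&lt;/p>

```python
# Contrast the two data layouts h5bench's kernels exercise: an array of
# structures (interleaved records) vs. a structure of arrays (one
# contiguous array per field). Pure-stdlib illustration.
import struct

particles = [(1.0, 2.0, 0.5), (3.0, 4.0, 0.25)]  # (x, y, energy)

# Array of structures: fields interleaved record by record.
aos = b"".join(struct.pack("<3d", *p) for p in particles)

# Structure of arrays: each field stored contiguously.
xs, ys, es = zip(*particles)
soa = b"".join(struct.pack(f"<{len(particles)}d", *field)
               for field in (xs, ys, es))

assert len(aos) == len(soa) == 48        # same bytes, different order

# Reading just the x field touches 16 contiguous bytes in SoA...
print(struct.unpack("<2d", soa[:16]))    # (1.0, 3.0)
# ...but strided offsets in AoS (one small read per record).
print(struct.unpack("<d", aos[0:8]) + struct.unpack("<d", aos[24:32]))  # (1.0, 3.0)
```

&lt;p>Reading one field is a single contiguous run in the structure-of-arrays layout but strided offsets in the array-of-structures layout; this access-pattern difference is exactly what the benchmark&amp;rsquo;s data-locality tests measure.&lt;/p>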
&lt;h3 id="h5bench-with-ai-workloads">h5bench with AI workloads&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>HAgent</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/hagent/</link><pubDate>Tue, 11 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/hagent/</guid><description>&lt;p>&lt;a href="https://github.com/masc-ucsc/hagent" target="_blank" rel="noopener">HAgent&lt;/a> is a platform to build AI hardware agent engine to support multiple components in chip design, such as code generation, verification, debugging, and tapeout.&lt;/p>
&lt;p>HAgent is built as a compiler for Hardware Agents; it interfaces with
typical EDA tools such as compilers, synthesis, and verification. There are
several projects around enhancing HAgent.&lt;/p>
&lt;h3 id="bugfarm-hagent-step">BugFarm hagent step&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: Develop a HAgent step (pass) to create bugs in a given design.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: Using LLMs (HAgent APIs), the goal is to add &amp;ldquo;bugs&amp;rdquo; to an input Verilog design,
so that other tool passes that need to fix bugs can use this
infrastructure as a bug generator. MCY
(&lt;a href="https://github.com/YosysHQ/mcy" target="_blank" rel="noopener">https://github.com/YosysHQ/mcy&lt;/a>) does something similar, but it does not
edit the Verilog directly and creates a very different Verilog output. BugFarm is supposed
to have somewhat similar functionality but edit the Verilog directly, which
results in code with just a few edits. Like MCY, there has to be a step to confirm that
the change affects results. The project should benchmark and compare against MCY.&lt;/p>
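&lt;p>As a minimal illustration of the &amp;ldquo;few edits&amp;rdquo; style of mutation, the toy injector below flips a single operator in Verilog source text. The hard-coded rule is for illustration only; the actual step would choose edits with LLMs through HAgent&amp;rsquo;s APIs and confirm, as in MCY, that each edit changes observable results.&lt;/p>

```python
# Toy bug injector: flip a single operator in Verilog source, the kind of
# small textual edit BugFarm would produce. The real step would pick the
# edit with an LLM via HAgent's APIs rather than a hard-coded rule.

verilog = """module adder(input [7:0] a, b, output [8:0] sum);
  assign sum = a + b;
endmodule
"""

def inject_bug(src, old_op="+", new_op="-"):
    """Replace the first occurrence of old_op; return (buggy_src, changed)."""
    buggy = src.replace(old_op, new_op, 1)
    return buggy, buggy != src

buggy, changed = inject_bug(verilog)
assert changed          # a confirmation step, as in MCY, would then
print(buggy)            # re-run tests to check the bug is actually visible
```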
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Python, Verilog, and an understanding of agents&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/farzaneh-rabiei-kashanaki/">Farzaneh Rabiei Kashanaki&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="hdeval-competition-repository">HDEval Competition Repository&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: Create a platform for HDL programming challenges and community engagement.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: Develop a repository where users can solve HDL problems in Verilog, Chisel, PyRTL, etc. Implement a points system for successful solutions. Allow users to submit new problems (code, specifications, verification, and tests) that are not easily solvable by LLMs. Automate solution testing and provide feedback on submissions.&lt;/p>
&lt;p>The submissions consist of four components: code, specification, verification, and tests. It should also be possible to submit examples of bugs in the code/specification/verification/tests found during the design.&lt;/p>
&lt;p>If the code is written in something other than Verilog, the submission should include both the HDL (Chisel, PyRTL, &amp;hellip;) and the Verilog.&lt;/p>
&lt;p>The specification is free-form. From any given specification, an expert in the area should be able to generate the code, verification, and tests. Similarly, from any pair of components, an expert should be able to generate the rest; for example, from the verification and tests, it should be possible to generate the code and specification.&lt;/p>
&lt;p>Typical specifications consist of a plan, API, and a sample usage.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Web design, some hardware understanding&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/farzaneh-rabiei-kashanaki/">Farzaneh Rabiei Kashanaki&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="integrate-silicon-compiler">Integrate Silicon Compiler&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: &lt;a href="https://github.com/siliconcompiler/siliconcompiler" target="_blank" rel="noopener">Silicon Compiler&lt;/a> is an open-source Python library that allows to interface with many EDA tools. The idea is to integrate it with HAgent to allow prompts/queries to
interface with it.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: The agentic component must check with Silicon Compiler
that the generated Python compiles and also that it has reasonable parameters.
This will require a ReAct loop for compiler errors, and likely a judge loop that
tests for reasonable options/flow with feedback from execution. Since there
are not many training examples, it will require a few-shot approach with a database to
populate the context accordingly.&lt;/p>
&lt;p>The end result should allow selecting different tools and options through Silicon Compiler.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Backend chip design&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> High&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="comodore-64-or-msx-or-gameboy">Comodore 64 or MSX or Gameboy&lt;/h3>
&lt;p>&lt;strong>Objective&lt;/strong>: Create a prompt-only specification to build a hardware
accelerated for the target platform (Comodore 64, MSX or Gameboy). The
generated code should focus on Verilog, but it is fine to also target some
other HDL. In all the cases, the project should include a generated Verilog
integrated with some emulator for verification.&lt;/p>
&lt;p>&lt;strong>Description&lt;/strong>: Using &lt;a href="https://github.com/masc-ucsc/hagent" target="_blank" rel="noopener">Hagent&lt;/a>, create an
&lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDLEval&lt;/a> benchmark (set of prompts) that
provide the necessary information to create the Verilog implementation. HDLEval
prompts usually consists of a high-level PLAN or specification, an API to
implement, and a few examples of usage for the given API.&lt;/p>
&lt;p>As the result of running the benchmark, a program runs both in the
emulator and in the generated Verilog so that correctness can be compared. The platform should have an
already existing emulator, such as &lt;a href="https://vice-emu.sourceforge.io/" target="_blank" rel="noopener">vice-emu&lt;/a> or
&lt;a href="https://mgba.io/" target="_blank" rel="noopener">mGBA&lt;/a>, to perform cosimulation against the generated
specification.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Verilog for front-end design&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> High&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Scenic: A Language for Design and Verification of Autonomous Cyber-Physical Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/scenic/</link><pubDate>Tue, 11 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/scenic/</guid><description>&lt;p>&lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a> is a probabilistic programming language for the design and verification of autonomous cyber-physical systems like self-driving cars.
Scenic allows users to define &lt;em>scenarios&lt;/em> for testing or training their system by putting a probability distribution on the system&amp;rsquo;s environment: the positions, orientations, and other properties of objects and agents, as well as their behaviors over time.
Sampling these scenarios and running them in a simulator yields synthetic data which can be used to train or test a system.
Since Scenic was released open-source in 2019, our group and many others in academia have used Scenic to find, diagnose, and fix bugs in autonomous cars, aircraft, robots, and other kinds of systems.
In industry, it is being used by companies including Boeing, Meta, Deutsche Bahn, and Toyota in domains spanning autonomous driving, aviation, household robotics, railways, maritime, and virtual reality.&lt;/p>
&lt;p>Our long-term goal is for Scenic to become a widely-used common representation and toolkit supporting the entire design lifecycle of AI-based cyber-physical systems.
Towards this end, we have many summer projects available, ranging from adding new application domains to working on the Scenic compiler and sampler:&lt;/p>
&lt;ol>
&lt;li>3D Driving Scenarios&lt;/li>
&lt;li>A Library for Aviation Scenarios&lt;/li>
&lt;li>Interfacing Scenic to new simulators&lt;/li>
&lt;li>Optimizing and parallelizing Scenic&lt;/li>
&lt;li>Improvements and infrastructure for the VerifAI toolkit&lt;/li>
&lt;/ol>
&lt;p>See the sections below for details.&lt;/p>
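&lt;p>At its core, sampling a Scenic scenario amounts to drawing object properties from the declared distributions and rejecting draws that violate the scenario&amp;rsquo;s constraints. The sketch below shows that semantics in plain Python with made-up geometry and ranges; it is not Scenic syntax.&lt;/p>

```python
# Rejection-sampling sketch of Scenic's core semantics: sample object
# properties from declared distributions, reject samples that violate
# the scenario's constraints. Not Scenic syntax; just the underlying idea.
import random

def sample_scene(rng, max_tries=10_000):
    for _ in range(max_tries):
        ego = (rng.uniform(0, 50), rng.uniform(-2, 2))    # car on a road
        ped = (rng.uniform(0, 50), rng.uniform(-5, 5))    # pedestrian nearby
        dist = ((ego[0] - ped[0]) ** 2 + (ego[1] - ped[1]) ** 2) ** 0.5
        if 5 <= dist <= 15:   # constraint: pedestrian ahead but not too close
            return {"ego": ego, "pedestrian": ped, "distance": dist}
    raise RuntimeError("constraints too tight to satisfy by rejection")

rng = random.Random(0)
scene = sample_scene(rng)
print(scene["distance"])      # always within [5, 15] by construction
```

&lt;p>Running each accepted sample in a simulator then yields the synthetic training or test data described above.&lt;/p>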
&lt;h3 id="3d-driving-scenarios">3D Driving Scenarios&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Autonomous Driving&lt;/code> &lt;code>3D modeling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python; basic vector geometry&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scenic scenarios written to test autonomous vehicles use the &lt;a href="https://docs.scenic-lang.org/en/latest/modules/scenic.domains.driving.html" target="_blank" rel="noopener">driving domain&lt;/a>, a Scenic library defining driving-specific concepts including cars, pedestrians, roads, lanes, and intersections.
The library extracts information about road networks, such as the shapes of lanes, from files in the standard &lt;a href="https://www.asam.net/standards/detail/opendrive/" target="_blank" rel="noopener">OpenDRIVE&lt;/a> format.
Currently, we only generate 2D polygons for lanes, throwing away 3D information.
While this suffices for many driving scenarios, it means we cannot properly model overpasses (the roads appear to overlap) or test driving scenarios where 3D geometry is important, such as hilly terrain.&lt;/p>
&lt;p>The goals of this project are to extend our road network library to generate 3D meshes (instead of 2D polygons) for roads, write new Scenic scenarios which use this new capability, and (if time allows) test autonomous driving software using them.&lt;/p>
&lt;h3 id="a-library-for-aviation-scenarios">A Library for Aviation Scenarios&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Autonomous Aircraft&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python; ideally some aviation experience&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>We have used Scenic to find, diagnose, and fix bugs in software for autonomous aircraft: in particular, &lt;a href="https://arxiv.org/abs/2005.07173" target="_blank" rel="noopener">this paper&lt;/a> studied a neural network-based automated taxiing system using the &lt;a href="https://www.x-plane.com/" target="_blank" rel="noopener">X-Plane&lt;/a> flight simulator.
We also have prototype interfaces to &lt;a href="https://microsoft.github.io/AirSim/" target="_blank" rel="noopener">AirSim&lt;/a> and &lt;a href="https://www.flightsimulator.com/" target="_blank" rel="noopener">Microsoft Flight Simulator&lt;/a>.
However, our experiments so far have mainly focused on simple scenarios involving a single aircraft.&lt;/p>
&lt;p>The goal of this project is to develop an &lt;em>aviation library&lt;/em> for Scenic (like the driving domain mentioned in the previous project) which will allow users to create complex aviation scenarios in a simulator-agnostic way.
The library would define concepts for aircraft, flight paths, weather, etc. and allow importing real-world data about these.
The student would demonstrate the library&amp;rsquo;s functionality by writing some example scenarios and testing either simple aircraft controllers or (if time allows) ML-based flight software.&lt;/p>
&lt;h3 id="interfacing-scenic-to-new-simulators">Interfacing Scenic to New Simulators&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Simulation&lt;/code> &lt;code>Autonomous Driving&lt;/code> &lt;code>Robotics&lt;/code> &lt;code>LLMs&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scenic is designed to be &lt;a href="https://docs.scenic-lang.org/en/latest/new_simulator.html" target="_blank" rel="noopener">easily-interfaced to new simulators&lt;/a>.
Depending on student interest, we could pick a simulator which would open up new kinds of applications for Scenic and write an interface for it.
Some possibilities include:&lt;/p>
&lt;ul>
&lt;li>The &lt;a href="https://github.com/tier4/AWSIM" target="_blank" rel="noopener">AWSIM&lt;/a> driving simulator (to allow testing the &lt;a href="https://autoware.org/" target="_blank" rel="noopener">Autoware&lt;/a> open-source autonomous driving software stack)&lt;/li>
&lt;li>The &lt;a href="https://www.coppeliarobotics.com/" target="_blank" rel="noopener">CoppeliaSim&lt;/a> robotics simulator&lt;/li>
&lt;li>NVIDIA&amp;rsquo;s &lt;a href="https://github.com/NVIDIA/Cosmos" target="_blank" rel="noopener">Cosmos&lt;/a>, an LLM which generates videos from text prompts&lt;/li>
&lt;li>NVIDIA&amp;rsquo;s &lt;a href="https://www.nvidia.com/en-us/omniverse/" target="_blank" rel="noopener">Omniverse&lt;/a> (various applications, e.g. simulating virtual factories)&lt;/li>
&lt;li>Various simulators for which we have prototype interfaces that could be generalized and made more usable, including &lt;a href="https://mujoco.org/" target="_blank" rel="noopener">MuJoCo&lt;/a> and &lt;a href="https://developer.nvidia.com/isaac/sim" target="_blank" rel="noopener">Isaac Sim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the project would be to create an interface between Scenic and the new simulator and write scenarios demonstrating it.
If time allows, we could do a case study on a realistic system for publication at an academic conference.&lt;/p>
&lt;h3 id="optimizing-and-parallelizing-scenic">Optimizing and Parallelizing Scenic&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Optimization&lt;/code> &lt;code>Parallelization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Large-scale testing with Scenic, when one wants to generate thousands of simulations, can be very computationally expensive.
In some cases, the bottleneck is the simulator, and being able to easily run multiple simulations in parallel would greatly increase scalability.
In others, Scenic itself spends substantial time trying to sample scenarios satisfying all the given constraints.&lt;/p>
&lt;p>This project would explore a variety of approaches to speeding up scene and simulation generation in Scenic.
Some possibilities include:&lt;/p>
&lt;ul>
&lt;li>Parallelizing scene generation and simulation (e.g. using &lt;a href="https://github.com/ray-project/ray" target="_blank" rel="noopener">Ray&lt;/a>)&lt;/li>
&lt;li>Systematically profiling real-world Scenic programs to characterize the main bottlenecks and propose optimizations&lt;/li>
&lt;li>JIT compiling Scenic&amp;rsquo;s internal sampling code (e.g. using &lt;a href="https://numba.pydata.org/" target="_blank" rel="noopener">Numba&lt;/a>)&lt;/li>
&lt;/ul>
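&lt;p>To illustrate the first bullet, here is a minimal fan-out/fan-in sketch of parallel scene sampling. With Ray one would decorate the sampling function with &lt;code>@ray.remote&lt;/code> and gather futures; the stdlib version below uses a thread pool for simplicity (real CPU-bound sampling would need processes or Ray workers to sidestep the GIL), and &lt;code>sample_scene&lt;/code> is a hypothetical stand-in for Scenic&amp;rsquo;s scenario sampling.&lt;/p>

```python
# Fan-out/fan-in shape of parallel scene generation. sample_scene is a
# hypothetical stand-in for rejection-sampling one scene from a Scenic
# scenario; threads are shown for simplicity only.
from concurrent.futures import ThreadPoolExecutor
import random

def sample_scene(seed):
    """Rejection-sample one toy scene: retry until the constraint holds."""
    rng = random.Random(seed)      # per-task RNG keeps results deterministic
    while True:
        x, y = rng.uniform(-10, 10), rng.uniform(-10, 10)
        if x * x + y * y > 100:    # toy constraint: ego must lie in a disk
            continue               # constraint violated; resample
        return {"seed": seed, "ego_position": (x, y)}

# With Ray this would be: futures = [sample_scene.remote(s) for s in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    scenes = list(pool.map(sample_scene, range(8)))

print(len(scenes))  # prints 8
```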
&lt;h3 id="improvements-and-infrastructure-for-the-verifai-toolkit">Improvements and Infrastructure for the VerifAI Toolkit&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>DevOps&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>APIs&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-fremont/">Daniel Fremont&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vin/">Eric Vin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://github.com/BerkeleyLearnVerify/VerifAI" target="_blank" rel="noopener">VerifAI&lt;/a> is a toolkit for design and analysis of AI-based systems that builds on top of Scenic.
It adds, among other features, the ability to perform &lt;em>falsification&lt;/em>: intelligently searching for scenarios that will cause a system to behave in an undesirable way.&lt;/p>
&lt;p>The goal of this project is to improve VerifAI&amp;rsquo;s development infrastructure, documentation, and ease of use, which currently lag behind Scenic&amp;rsquo;s.
Specific tasks could include:&lt;/p>
&lt;ul>
&lt;li>Setting up continuous integration (CI) on GitHub&lt;/li>
&lt;li>Creating processes to help users/developers submit issues and PRs and deal with them in a timely manner&lt;/li>
&lt;li>Writing more documentation, including tutorials and examples (not only for end users of VerifAI but those wanting to develop custom falsification components, for example)&lt;/li>
&lt;li>Refactoring VerifAI&amp;rsquo;s API to make it easier to use and extend&lt;/li>
&lt;/ul></description></item><item><title>Architecting the Future of Scientific Data: Multi-Site Streaming Without Compromise</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</link><pubDate>Mon, 10 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</guid><description>&lt;p>Data is generated at ever-increasing rates, yet it’s often processed more slowly than it’s collected. Scientific instruments frequently operate below their full capacity or discard
valuable data due to network bottlenecks, security domain mismatches, and insufficient real-time processing capabilities.&lt;/p>
&lt;p>&lt;a href="https://github.com/scistream/scistream-proto" target="_blank" rel="noopener">SciStream&lt;/a> reimagines how scientific data moves across modern research infrastructure by providing a framework for high-speed (+100Gbps)
memory-to-memory streaming that doesn’t compromise on security. Whether connecting scientific instruments to analysis clusters or bridging across institutional boundaries, SciStream provides the foundation for next-generation scientific
workflows.&lt;/p>
&lt;p>Building on our &lt;a href="https://dl.acm.org/doi/abs/10.1145/3502181.3531475" target="_blank" rel="noopener">published research&lt;/a>, we’re now expanding the framework’s capabilities through open-source development and community
collaboration. These projects offer an opportunity for
students to gain hands-on experience with cutting-edge networking and security technologies used in high-performance computing (HPC), cloud infrastructure, and large-scale scientific
experiments.&lt;/p>
&lt;h3 id="scistream-securebench-a-framework-for-benchmarking-security-protocols-in-scientific-data-streaming">SciStream-SecureBench: A Framework for Benchmarking Security Protocols in Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Security Protocols, Network Performance, Data Streaming, Reproducibility, High-throughput Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Scripting, Linux, Network Protocol Analysis, Containers, Benchmarking tools&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered why large scientific experiments need to move massive amounts of data securely and quickly? While TLS and SSH are standard for secure data transfer,
there’s a surprising lack of benchmarks that evaluate their performance in high-speed scientific workflows. This project aims to fill this gap by developing a benchmarking suite that
measures how different security configurations impact real-time scientific data streaming.&lt;/p>
&lt;h3 id="specific-tasks-of-the-project-include">&lt;strong>Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing benchmarking tools that measure key security performance metrics like handshake latency, throughput stability, and computational overhead.&lt;/li>
&lt;li>Running &lt;strong>real-world experiments&lt;/strong> on research testbeds (Chameleon, FABRIC) to simulate scientific data patterns.&lt;/li>
&lt;li>Automating comparative analysis between TLS and SSH, with focus on streaming-specific metrics like &lt;strong>time-to-first-byte and sustained throughput&lt;/strong>.&lt;/li>
&lt;li>Documenting best practices for security protocol selection in high-performance streaming.&lt;/li>
&lt;/ul>
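&lt;p>As a flavor of the first task, the sketch below is a minimal harness that streams data over a localhost TCP socket and reports time-to-first-byte and sustained throughput; a TLS or SSH comparison would wrap this same measurement loop around the secured channel. The payload sizes are arbitrary choices for illustration.&lt;/p>

```python
# Minimal streaming micro-benchmark: a background thread sends a fixed
# payload over localhost TCP while the client records time-to-first-byte
# (TTFB) and sustained throughput. TLS/SSH layers are deliberately omitted.
import socket
import threading
import time

PAYLOAD = b"x" * 65536
CHUNKS = 64                      # 4 MiB total
TOTAL = len(PAYLOAD) * CHUNKS

def serve(listener):
    conn, _ = listener.accept()
    with conn:
        for _ in range(CHUNKS):
            conn.sendall(PAYLOAD)

listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # OS picks a free port
listener.listen(1)
threading.Thread(target=serve, args=(listener,), daemon=True).start()

start = time.perf_counter()
client = socket.create_connection(listener.getsockname())
first_byte_at = None
received = 0
while received != TOTAL:
    data = client.recv(1048576)  # 1 MiB reads
    if not data:
        break
    if first_byte_at is None:
        first_byte_at = time.perf_counter()
    received += len(data)
client.close()

ttfb_ms = (first_byte_at - start) * 1e3
rate = received / max(time.perf_counter() - first_byte_at, 1e-9)
print(f"TTFB {ttfb_ms:.2f} ms, sustained {rate / 1e6:.0f} MB/s")
```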
&lt;h3 id="why-this-matters-for-your-career">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain expertise in &lt;strong>network security and performance analysis&lt;/strong>, highly valued in cybersecurity, cloud computing, and HPC.&lt;/li>
&lt;li>Work on a &lt;strong>real research challenge&lt;/strong> with potential for publication.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-streambench-comparative-analysis-of-scientific-streaming-frameworks">SciStream-StreamBench: Comparative Analysis of Scientific Streaming Frameworks&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Data Streaming Protocols, Network Performance, Benchmarking, Distributed Systems, Real-time Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, ZeroMQ, EPICS/PVAccess, Linux, Performance Analysis, Visualization&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scientific experiments generate enormous amounts of streaming data, but how do we choose the best framework for handling it efficiently? Despite the widespread use of ZeroMQ and
&lt;a href="https://dl.acm.org/doi/10.1145/3624062.3624610" target="_blank" rel="noopener">PVApy&lt;/a>,
there’s little systematic benchmarking comparing their performance. This project will develop &lt;strong>real-world benchmarks&lt;/strong> to evaluate how different frameworks handle scientific data in
&lt;strong>high-speed environments&lt;/strong>.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Designing benchmarking methodologies to assess key performance metrics like &lt;strong>synchronization overhead, time-to-first-data, and throughput stability&lt;/strong>.&lt;/li>
&lt;li>Developing a test harness that simulates real-world streaming conditions (network variability, concurrent streams, dynamic data rates).&lt;/li>
&lt;li>Running experiments on &lt;strong>Chameleon and FABRIC testbeds&lt;/strong>.&lt;/li>
&lt;li>Automating data collection and visualization to highlight performance trends.&lt;/li>
&lt;li>Documenting best practices and framework-specific optimizations.&lt;/li>
&lt;/ul>
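&lt;p>One possible shape for such a methodology: a framework-agnostic driver that measures time-to-first-data and throughput stability (coefficient of variation across intervals) for any transport exposing &lt;code>send()&lt;/code>/&lt;code>recv()&lt;/code>. The in-memory queue transport below is a stand-in where ZeroMQ or PVAccess bindings would plug in; all names here are illustrative.&lt;/p>

```python
# Framework-agnostic benchmark driver: any transport with send()/recv()
# can be measured for time-to-first-data and throughput stability.
import queue
import statistics
import time

class QueueTransport:
    """In-memory stand-in for a real streaming transport (ZeroMQ, PVAccess)."""
    def __init__(self):
        self.q = queue.Queue()
    def send(self, msg):
        self.q.put(msg)
    def recv(self):
        return self.q.get()

def benchmark(transport, n_msgs=1000, msg=b"x" * 1024, interval=100):
    for _ in range(n_msgs):            # producer (in-process for the sketch)
        transport.send(msg)
    t0 = time.perf_counter()
    transport.recv()                   # time-to-first-data
    ttfd = time.perf_counter() - t0
    rates = []
    for _ in range((n_msgs - 1) // interval):
        t = time.perf_counter()
        for _ in range(interval):
            transport.recv()
        rates.append(interval * len(msg) / (time.perf_counter() - t))
    # Throughput stability: coefficient of variation across intervals.
    cv = statistics.stdev(rates) / statistics.mean(rates)
    return {"ttfd_s": ttfd, "mean_MBps": statistics.mean(rates) / 1e6, "cv": cv}

print(benchmark(QueueTransport()))
```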
&lt;h3 id="why-this-matters-for-your-career-1">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Get hands-on experience with &lt;strong>real-time data processing&lt;/strong> and &lt;strong>network performance analysis&lt;/strong>.&lt;/li>
&lt;li>Learn benchmarking techniques useful for &lt;strong>distributed systems, cloud computing, and high-performance networking&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-quic-next-generation-proxy-architecture-for-scientific-data-streaming">SciStream-QUIC: Next-Generation Proxy Architecture for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: QUIC Protocol, Network Proxies, Performance Analysis, Protocol Design, Hardware Acceleration&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python/C++, Network Programming, QUIC (quiche/aioquic), Linux, Performance Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered how YouTube loads videos faster than traditional web pages? That’s because of &lt;strong>QUIC&lt;/strong>, a next-generation protocol designed for speed and security. Initial evaluations
of federated streaming architectures (&lt;a href="https://par.nsf.gov/servlets/purl/10380551" target="_blank" rel="noopener">INDIS'22
paper&lt;/a>) suggest potential benefits of QUIC, but comprehensive benchmarking is
needed. This project explores whether &lt;strong>QUIC-based proxies&lt;/strong> can outperform traditional &lt;strong>TCP+TLS&lt;/strong> proxies for scientific data streaming, potentially revolutionizing how researchers move
large datasets.&lt;/p>
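&lt;p>A back-of-the-envelope model shows why the handshake matters: TCP+TLS 1.3 typically spends about two round trips before application data flows (one for TCP, one for the TLS handshake), while QUIC combines transport and crypto setup into roughly one round trip, or about zero with session resumption. For short, frequent streams over long paths, this setup time can dominate. The round-trip counts below are the usual textbook figures, not measurements.&lt;/p>

```python
# Toy handshake-cost model: time before first application byte is roughly
# the path round-trip time multiplied by the number of setup round trips.
def setup_ms(rtt_ms, round_trips):
    return rtt_ms * round_trips

rtt = 60.0  # e.g. a cross-country research network path, in milliseconds
for name, rtts in [("TCP+TLS 1.3", 2), ("QUIC 1-RTT", 1), ("QUIC 0-RTT", 0)]:
    print(f"{name}: {setup_ms(rtt, rtts):.0f} ms before first byte")
```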
&lt;h3 id="the-specific-tasks-of-the-project-include-1">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing a &lt;strong>QUIC-based proxy&lt;/strong> optimized for scientific workflows.&lt;/li>
&lt;li>Running benchmarks to compare &lt;strong>QUIC vs. traditional TLS proxies&lt;/strong>.&lt;/li>
&lt;li>Investigating &lt;strong>hardware encryption offloading&lt;/strong> for QUIC and TLS.&lt;/li>
&lt;li>Designing &lt;strong>reproducible experiments&lt;/strong> using Chameleon and FABRIC testbeds.&lt;/li>
&lt;li>Documenting best practices for deploying &lt;strong>QUIC proxies in HPC environments&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="why-this-matters-for-your-career-2">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain experience in &lt;strong>cutting-edge networking protocols&lt;/strong> used in cloud computing (Google, Cloudflare, etc.).&lt;/li>
&lt;li>Learn about &lt;strong>hardware acceleration&lt;/strong> and its role in high-speed networking.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-auth-modern-authentication-and-user-interface-for-scientific-data-streaming">SciStream-Auth: Modern Authentication and User Interface for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Authentication Systems, UI/UX Design, Security Integration, Scientific Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Web Development (React/Vue), OAuth 2.0/SAML, Security Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Not a security expert? You can still contribute by designing an interactive front-end!&lt;/p>
&lt;p>In today&amp;rsquo;s scientific computing landscape, authentication and user experience often act as barriers to adoption rather than enabling seamless collaboration. While SciStream excels at
high-speed data transfer, its reliance on a single authentication provider and command-line interface limits its accessibility. This project aims to transform SciStream into a more
versatile platform by implementing a modular authentication system and developing an intuitive graphical interface.&lt;/p>
&lt;p>By expanding beyond Globus Auth to support multiple authentication frameworks, we can enable broader adoption across different scientific communities while maintaining robust security.
Coupled with a modern GUI that visualizes real-time streaming activity, this enhancement will make SciStream more accessible to researchers—allowing them to focus on their science rather
than wrestling with complex configurations.&lt;/p>
&lt;p>This project will design a user-friendly interface that makes secure scientific data streaming as intuitive as using a cloud storage service. You&amp;rsquo;ll also gain hands-on experience with
authentication methods used by industry leaders like Google and Facebook, while directly improving access to scientific data.&lt;/p>
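&lt;p>As a sketch of what &amp;ldquo;pluggable&amp;rdquo; could mean here, the snippet below registers authentication providers behind one small interface, so that Globus Auth, OAuth 2.0, SAML, or certificate-based backends become interchangeable. The provider name and token check are illustrative, not SciStream&amp;rsquo;s actual API.&lt;/p>

```python
# Pluggable-auth sketch: providers implement one small interface and
# register under a name; callers pick a provider at runtime.
from abc import ABC, abstractmethod

class AuthProvider(ABC):
    @abstractmethod
    def authenticate(self, credentials: dict) -> bool: ...

PROVIDERS: dict = {}

def register(name):
    """Class decorator: instantiate the provider and file it by name."""
    def wrap(cls):
        PROVIDERS[name] = cls()
        return cls
    return wrap

@register("static-token")
class TokenAuth(AuthProvider):
    """Toy provider checking a shared token (stand-in for OAuth 2.0 etc.)."""
    TOKEN = "s3cret"
    def authenticate(self, credentials):
        return credentials.get("token") == self.TOKEN

def login(provider_name, credentials):
    return PROVIDERS[provider_name].authenticate(credentials)

print(login("static-token", {"token": "s3cret"}))   # prints True
print(login("static-token", {"token": "wrong"}))    # prints False
```

&lt;p>A real OAuth 2.0 or SAML backend would plug in behind the same two-method interface, keeping the GUI and session-management code provider-agnostic.&lt;/p>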
&lt;h3 id="the-specific-tasks-of-the-project-include-2">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Design and implementation of a pluggable authentication system supporting multiple providers (OAuth 2.0, SAML, OpenID Connect, certificate-based auth)&lt;/li>
&lt;li>Development of a modern, responsive GUI using web technologies that provides real-time visualization of system status&lt;/li>
&lt;li>Creation of comprehensive security testing protocols to validate the authentication implementations&lt;/li>
&lt;li>Implementation of session management and secure credential handling within the GUI&lt;/li>
&lt;li>Design of an intuitive interface for managing streaming configurations and monitoring data flows&lt;/li>
&lt;li>Creation of documentation and examples to help facilities integrate their preferred authentication mechanisms&lt;/li>
&lt;/ul></description></item><item><title>Kolmogorov-Arnold-based Transformer for LLMs: Implementation, Evaluation and Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/unl/kallm/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/unl/kallm/</guid><description>&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;!--- KALLM introduction --->
&lt;!--- importance --->
&lt;p>&lt;a href="https://github.com/saisumanv/KALLM" target="_blank" rel="noopener">KALLM&lt;/a> project proposes a new Kolmogorov-Arnold Network (KAN)-based Transformer implementation of an open-source LLM called SmolLM2. Transformers have found increasing success in many open-source LLMs across language reasoning tasks like text generation, summarization and some tasks imitating advanced critical thinking. However, Kolmogorov-Arnold Networks (KANs) are an attractive alternative to the Multi-Layer Perceptrons (MLPs) which are used in these Transformer architectures by default. KAN-based Transformers (KATs) offer several advantages over MLPs. (i) They follow the universal approximation property and hence can theoretically approximate any function i.e. they can learn from any complex input patterns. (ii) They are more interpretable as they decompose the entire input into multiple manageable components with each layer processing one component. This is unlike MLPs, where each layer processes the input sequences holistically. (iii) KANs can lead to faster convergence on certain reasoning tasks due to their ability to break down the input sequences into simple univariate functions.&lt;/p>
&lt;p>However, few if any open-source implementations of KAN-based Transformers exist in open-source LLMs: efficient implementations of KANs, and of KATs in particular, have only recently become available. With these in hand, integrating KATs into open-source LLM engines becomes feasible. This project will implement a KAT as the core Transformer architecture in SmolLM2, an open-source LLM, and will evaluate and benchmark it on language reasoning tasks.&lt;/p>
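&lt;p>The structural difference described above can be illustrated in a few lines of plain Python: an MLP layer applies a fixed nonlinearity to learned weighted sums, whereas a KAN layer places a learnable univariate function on every edge and only sums the results. The toy edge function below is an illustrative three-coefficient stand-in for the efficient spline or rational bases used in real KATs.&lt;/p>

```python
# Toy contrast between an MLP layer and a KAN layer (forward pass only).
import math

def mlp_layer(x, W, b):
    # MLP node j: fixed nonlinearity applied to a learned linear combination.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def edge_fn(xi, coeffs):
    # Toy learnable univariate function phi(x) = c0 + c1*x + c2*|x|;
    # real KANs use spline (or rational) bases here.
    c0, c1, c2 = coeffs
    return c0 + c1 * xi + c2 * abs(xi)

def kan_layer(x, edge_coeffs):
    # KAN node j: plain sum of per-edge univariate functions, no shared
    # weight matrix and no fixed node nonlinearity.
    return [sum(edge_fn(xi, edge_coeffs[j][i]) for i, xi in enumerate(x))
            for j in range(len(edge_coeffs))]

x = [0.5, -1.0]
print(mlp_layer(x, W=[[1.0, 0.5]], b=[0.0]))
print(kan_layer(x, edge_coeffs=[[(0.0, 1.0, 0.5), (0.0, 1.0, 0.5)]]))
```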
&lt;p align="center">
&lt;img width="760" height="500" src="https://github.com/user-attachments/assets/aa4e2fa1-1a33-4511-a76c-2e95e24feee1">
&lt;/p>
&lt;h3 id="project-methodology-and-milestones">Project Methodology and Milestones&lt;/h3>
&lt;p>The project methodology is a mix of implementation and evaluation. The mentors have extensive experience working with large codebases and will be available to guide students through the technical and non-technical portions of the project. The step-by-step project methodology is outlined as follows.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Installation of SmolLM2 from the official Git repo:&lt;/strong> The open-source implementation of the SmolLM2 engine (hereafter referred to simply as smollm) is available on &lt;a href="https://github.com/huggingface/smollm/tree/main/tools/smol_tools" target="_blank" rel="noopener">GitHub&lt;/a>. The project primarily focuses on language reasoning, and hence we limit ourselves to the SmolLM2 implementation and forego other forks of SmolLM such as the SmolVLM family.&lt;/p>
&lt;ul>
&lt;li>The project needs to be sanity checked by installing the engine on local computers.&lt;/li>
&lt;li>Following that, the students are to familiarize themselves with the basic workflow, such as running a sample program using the pretrained model. The instructions for installing SmolLM are located in the smollm/tools/smol-tools subfolder.&lt;/li>
&lt;li>The next step is to train SmolLM using the prepackaged Transformer model called &amp;ldquo;HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC&amp;rdquo;. The instructions are provided &lt;a href="https://github.com/huggingface/smollm/tree/main/tools/smol_tools" target="_blank" rel="noopener">here&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Implementation—KAT in SmolLM:&lt;/strong> The smollm pretrained model is at smollm/tools/smollm_local_inference/mlc.py. The pretrained model, called &amp;ldquo;HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC&amp;rdquo;, is an MLP-based Transformer. However, we will train smollm ourselves, both with this default model and with KAT.&lt;/p>
&lt;ul>
&lt;li>A KAT implementation is available on GitHub at &lt;a href="https://github.com/Adamdad/kat" target="_blank" rel="noopener">ICLR2025&lt;/a>. To implement KAT in smollm, we will replace the default Transformer (HF://mlc-ai/SmolLM2-1.7B-Instruct-q0f16-MLC) with the open-source KAT mentioned above.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Training SmolLM with default Transformer and with KAT:&lt;/strong> This step requires compute resources and deployment of the implementation on Chameleon Cloud and/or the National Research Platform (NRP). The mentors have access to these two testbeds and will provide the students access to those resources.&lt;/p>
&lt;ul>
&lt;li>The first task of this step is to port the implementation to Chameleon Cloud before the model can be trained. This task may require around a week’s turnaround time and can be performed in parallel with steps 1 &amp;amp; 2 if needed.&lt;/li>
&lt;li>&lt;strong>Training:&lt;/strong> The full dataset for training smollm, called smoltalk, is located at &lt;a href="https://huggingface.co/datasets/HuggingFaceTB/smoltalk" target="_blank" rel="noopener">HuggingFaceTB/smoltalk&lt;/a>. The training code and instructions are at &lt;a href="https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2" target="_blank" rel="noopener">huggingface/alignment-handbook&lt;/a>. Although the baseline uses the pretrained SmolLM2-1.7B-Instruct model, we will instead train smollm for SmolLM2-135M-Instruct and SmolLM2-360M-Instruct, as noted at the bottom of the &lt;a href="https://huggingface.co/datasets/HuggingFaceTB/smoltalk" target="_blank" rel="noopener">HuggingFaceTB/smoltalk&lt;/a> dataset page. Accordingly, for SmolLM2-135M-Instruct and SmolLM2-360M-Instruct we will ONLY use the &lt;a href="https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk" target="_blank" rel="noopener">smol-smoltalk dataset&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Benchmarking:&lt;/strong> Finally, the benchmarks used throughout this project to evaluate our implementations will be the same as those used for the release (pretrained) versions of SmolLM2-135M-Instruct and SmolLM2-360M-Instruct. The language reasoning benchmarks will be &lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/" target="_blank" rel="noopener">chosen from the Open LLM Leaderboard&lt;/a>.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="project-timeline">Project Timeline&lt;/h3>
&lt;p>The following project timeline is anticipated. Some tasks may take more or less time than expected, so the timeline is not set in stone; rather, it serves as a baseline drawn from the mentors’ prior experience with similar research projects. Each cell in the timeline chart is one week.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img src="https://github.com/user-attachments/assets/5a91f8fe-fb4c-4844-bff2-75c933e6f73a" alt="image" loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="project-testbeds">Project Testbeds&lt;/h3>
&lt;p>Sai Lamba Karanam has administrator-level access to the &lt;a href="https://nationalresearchplatform.org/" target="_blank" rel="noopener">National Research Platform (NRP)&lt;/a> and will provide the students working on the project with access to cloud compute resources. Both mentors also have access to the &lt;a href="https://www.chameleoncloud.org/" target="_blank" rel="noopener">Chameleon Cloud platform&lt;/a> and will grant access to compute resources for training, evaluation and benchmarking purposes.&lt;/p>
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>KALLM will be hosted on GitHub at this &lt;a href="https://github.com/saisumanv/KALLM" target="_blank" rel="noopener">repo&lt;/a>. The mentors have extensive experience working with Machine Learning (ML) and Artificial Intelligence (AI) workflows in academic and industry settings. We seek mentees who are willing to learn to implement AI models and to work with semi-large codebases. Mentees will need to become comfortable working with remote cloud testbeds (Chameleon and/or NRP) during the latter half of the project. Some milestones described in the Project Methodology can be done in parallel.&lt;/p>
&lt;p>KALLM is part of a larger collaborative effort between the mentors and involves milestones and outcomes that fall outside the scope of this project but are related to it. The mentors plan to publish the outcomes of the larger project at ML venue(s) towards the end of Fall 2025, and the mentees will be added as coauthors.&lt;/p>
&lt;h3 id="kallm">KALLM&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Large Language Models, Kolmogorov-Arnold Networks, Transformers, Benchmarking&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python proficiency, scikit-learn, Experience with Linux, Introductory Experience with Cloud Computing Platforms&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Easy-Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sai-suman-lamba-karanam/">Sai Suman Lamba Karanam&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahmeeth-sakkaff/">Zahmeeth Sakkaff&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Smart Batching for Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</guid><description>&lt;p>Sequence tokenization is a crucial step during Large Language Model training, fine-tuning, and inference. User prompts and training data are tokenized and zero-padded before being fed to the model in batches. This process allows models to interpret human language by breaking down complex sentences into simple token units that are numerically represented in a token set. However, the process of sequence padding for maintaining batch dimensions can introduce unnecessary overhead if batching is not properly done.&lt;/p>
&lt;p>In this project, we introduce Smart Batching, where we dynamically batch sequences in a fine-tuning dataset by their respective lengths. With this method, we aim to minimize the amount of zero padding required during sequence batching, which can result in improved and efficient fine-tuning and inference speeds. We also analyze this method with other commonly used batching practices (Longest Sequence, Random Shuffling) on valuable metrics such as runtime and model accuracy.&lt;/p>
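&lt;p>The core idea can be sketched in a few lines: because each batch pads every sequence to its longest member, grouping sequences of similar lengths shrinks the total padding. The toy comparison below (with made-up sequence lengths) contrasts length-sorted batching with batching in arrival order; a real implementation would also shuffle at the bucket level to preserve training stochasticity.&lt;/p>

```python
# Smart-batching sketch: sort toy token sequences by length before
# batching and compare padding waste against unsorted batching.
import random

def padding_waste(batches):
    # Each batch pads to its longest sequence; waste = pad tokens added.
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

def make_batches(seqs, batch_size):
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]

rng = random.Random(0)
seqs = [[1] * rng.randint(5, 512) for _ in range(256)]  # toy token sequences

random_batches = make_batches(seqs, 32)                  # arrival order
smart_batches = make_batches(sorted(seqs, key=len), 32)  # length-sorted

print(padding_waste(random_batches), "vs", padding_waste(smart_batches))
```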
&lt;h3 id="project-title">Project Title&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>Fine-Tuning&lt;/code> &lt;code>AI&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Pytorch, Large Language Models&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;#34;Lenny&amp;#34; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Implement an open source smart batching framework based on HuggingFace to allow for dynamically grouping sequences of similar token lengths into batches&lt;/li>
&lt;li>Analyze runtime, padding, and model accuracy with smart batching and other commonly used batching practices&lt;/li>
&lt;li>Apply smart batching with distributed fine-tuning and observe large language model outputs&lt;/li>
&lt;/ul></description></item><item><title>Towards High Performance NCCL-enabled 2D partitioned PyLops-MPI library</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/pylops-mpi/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/pylops-mpi/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;!--- PyLops and PyLops-MPI introduction --->
&lt;!--- importance --->
&lt;p>&lt;a href="https://github.com/PyLops/pyLops" target="_blank" rel="noopener">PyLops&lt;/a> ecosystem designed to enable large-scale, distributed-memory computations for matrix-free inverse problems. PyLops has achieved more than 400 stars till now. &lt;a href="https://github.com/PyLops/pyLops-mpi" target="_blank" rel="noopener">PyLops-MPI&lt;/a> is an extension of the PyLops. It can be widely used in scientific computation problems. Developed as part of the &lt;a href="https://summerofcode.withgoogle.com/archive/2023/projects/eNJTJO25" target="_blank" rel="noopener">2023 Google Summer of Code&lt;/a>, PyLops-MPI builds on the core PyLops framework by integrating MPI-based parallelism through the mpi4py library. This allows users to efficiently scale PyLops-based computations beyond a single node, leveraging high-performance computing (HPC) clusters for tackling increasingly large problem sizes.&lt;/p>
&lt;p>By extending PyLops&amp;rsquo; modular and user-friendly interface to distributed environments, PyLops-MPI provides researchers and engineers with a powerful tool for developing scalable scientific applications across disciplines, from geophysics to machine learning. As part of the broader PyLops ecosystem, it represents a significant step toward high-performance, parallel inverse problem solving, catering to both academia and industry.&lt;/p>
&lt;p>PyLops-MPI aims to provide an efficient and user-friendly solution to distributed inverse problems. The software is designed to handle three distinct use cases: (1) both model and data are fully distributed across nodes; (2) data are distributed across nodes but the model is available on all nodes; (3) both model and data are available on all nodes (or just on the master). There are multiple use cases for PyLops-MPI, e.g. &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/lsm.html#sphx-glr-tutorials-lsm-py" target="_blank" rel="noopener">Least-squares Migration&lt;/a>, &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/mdd.html#sphx-glr-tutorials-mdd-py" target="_blank" rel="noopener">Multi-Dimensional Deconvolution&lt;/a> and &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/poststack.html#sphx-glr-tutorials-poststack-py" target="_blank" rel="noopener">Post Stack Inversion - 3D&lt;/a>. We&amp;rsquo;ve already provided a solution based on mpi4py. As PyLops-MPI develops, we plan to upgrade its communication infrastructure to better support GPU-based clusters and reduce overall data movement, further boosting performance.&lt;/p>
&lt;p>This project is designed around the roadmap of the PyLops-MPI library, and we plan to deliver three major pieces of functionality. First, PyLops-MPI supports NVIDIA GPUs, operating directly on GPU-resident data; this currently relies on the &lt;a href="https://mpi4py.readthedocs.io/en/stable/tutorial.html#gpu-aware-mpi-python-gpu-arrays" target="_blank" rel="noopener">CUDA-aware MPI implementation&lt;/a> in mpi4py, which requires a CUDA-aware MPI software stack and thus places strict demands on the system software. Alternatively, we can use NCCL instead of mpi4py when calling GPU routines; NCCL also has better support for NVLink and related interconnects. Second, parallelism in PyLops-MPI is achieved by splitting the data and models across MPI processes, which may not scale well for inverse problems with multiple right-hand sides. We have a &lt;a href="https://github.com/PyLops/pylops-mpi/issues/113" target="_blank" rel="noopener">2D partitioned implementation&lt;/a>, but its performance has not yet been characterized, so we propose to benchmark its scalability. Other distributed matrix-matrix multiplication algorithms, such as the 2D SUMMA algorithm, could also be incorporated into the library, and we would like to implement SUMMA if time permits. Finally, we would like to benchmark the MPI one-sided API in PyLops-MPI, since the asynchronous execution enabled by one-sided communication should improve the scalability of the algorithms.&lt;/p>
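&lt;p>To make the 2D partitioning concrete, the toy sketch below distributes a small matrix over a 2x2 grid of &amp;ldquo;ranks&amp;rdquo; as blocks and accumulates block products the way a SUMMA step would, with plain Python lists standing in for the mpi4py/NCCL communication that real PyLops-MPI operators would perform.&lt;/p>

```python
# Toy 2D-partitioned matrix multiply: each (i, j) "rank" owns one block of
# the result and accumulates block products over k, the role SUMMA fills
# by broadcasting row/column panels. No MPI: lists stand in for ranks.

def block(A, i, j, bs):
    """Extract block (i, j) of size bs x bs from square matrix A."""
    return [row[j * bs:(j + 1) * bs] for row in A[i * bs:(i + 1) * bs]]

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def add(C, D):
    return [[c + d for c, d in zip(rc, rd)] for rc, rd in zip(C, D)]

n, grid, bs = 4, 2, 2          # 4x4 matrix on a 2x2 rank grid, 2x2 blocks
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity

# C[i][j] = sum over k of A-block(i,k) @ B-block(k,j); at step k, SUMMA
# broadcasts A's panel k along grid rows and B's panel k along columns.
C = {}
for i in range(grid):
    for j in range(grid):
        acc = [[0] * bs for _ in range(bs)]
        for k in range(grid):
            acc = add(acc, matmul(block(A, i, k, bs), block(B, k, j, bs)))
        C[i, j] = acc

print(C[0, 0])  # top-left block of A @ I, i.e. the top-left block of A
```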
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), the project aims to benchmark and extend the capabilities of the PyLops-MPI library in distributed environments. Below is an outline of the algorithms to be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Goal 1: Enabling the NCCL API in the PyLops-MPI Library&lt;/strong>: Understanding the design of PyLops-MPI. Using the NCCL API for collective communication in PyLops-MPI when data resides on the GPU. Benchmarking the performance of NCCL against mpi4py in different scenarios.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Goal 2: Benchmarking 2D Parallelism in the PyLops-MPI Library&lt;/strong>: Understanding the design of PyLops-MPI. Understanding the 2D partition design in the current PyLops-MPI library. Benchmarking the performance of the 2D partition design against the original 1D partition design. If possible, implementing the 2D SUMMA algorithm.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Goal 3: Enabling the MPI One-sided API in the PyLops-MPI Library&lt;/strong>: Understanding the design of PyLops-MPI. Understanding the message roofline model and the MPI one-sided API. Implementing one-sided communication strategies in the PyLops-MPI library.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="project-benchmark-suites">Project Benchmark Suites&lt;/h3>
&lt;p>We plan to use at least three use-cases from the &lt;a href="https://pylops.github.io/pylops-mpi/tutorials/" target="_blank" rel="noopener">tutorial section&lt;/a> of PyLops-MPI. We will measure the communication volume and time-to-solution of these use-cases.&lt;/p>
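&lt;p>A minimal harness for these measurements might look as follows. Here &lt;code>run_case&lt;/code> stands in for a tutorial driver, and the communication-volume formula is a rough ring-allreduce model stated as an assumption, not a measurement of PyLops-MPI itself.&lt;/p>

```python
# Hypothetical micro-benchmark harness: record wall-clock time-to-solution and
# an estimated communication volume per run. run_case is a placeholder for a
# real tutorial driver (e.g. least-squares migration).
import time

def benchmark(run_case, n_trials=3):
    """Run the case several times and report best and mean wall-clock time."""
    times = []
    for _ in range(n_trials):
        t0 = time.perf_counter()
        run_case()
        times.append(time.perf_counter() - t0)
    return {"best_s": min(times), "mean_s": sum(times) / len(times)}

def comm_bytes(n_elems, dtype_bytes=8, n_ranks=4):
    # Rough model: one allreduce moves ~2 * n * (p-1)/p bytes per rank under a
    # ring algorithm. An analytical assumption to sanity-check measured volumes.
    return int(2 * n_elems * dtype_bytes * (n_ranks - 1) / n_ranks)

# Toy "case" so the harness is demonstrable without an MPI cluster.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats, comm_bytes(1_000_000))
```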
&lt;h3 id="project-benchmark-testbeds">Project Benchmark Testbeds&lt;/h3>
&lt;p>Yuxi Hong has been granted access to world-leading supercomputers such as Delta and DeltaAI through an &lt;a href="https://www.xras.org/public/requests/193551-ACCESS-CIS250038" target="_blank" rel="noopener">ACCESS-funded grant&lt;/a>. &lt;a href="https://docs.ncsa.illinois.edu/systems/delta/en/latest/" target="_blank" rel="noopener">Delta&lt;/a> is a dedicated, ACCESS-allocated resource designed by HPE and NCSA, delivering a highly capable GPU-focused compute environment for GPU and CPU workloads. &lt;a href="https://docs.ncsa.illinois.edu/systems/deltaai/en/latest/" target="_blank" rel="noopener">DeltaAI&lt;/a> is a companion system to Delta. Powered by the NVIDIA GH200 Grace Hopper Superchip, DeltaAI provides powerful compute capabilities for simulation and data science. The team also has access to the &lt;a href="https://docs.nersc.gov/systems/perlmutter/architecture/" target="_blank" rel="noopener">Perlmutter&lt;/a> supercomputer at LBNL.&lt;/p>
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>PyLops-MPI is hosted on GitHub at &lt;a href="https://github.com/PyLops/pyLops-mpi" target="_blank" rel="noopener">Repo&lt;/a>. &lt;a href="https://joss.theoj.org/papers/10.21105/joss.07512" target="_blank" rel="noopener">The paper&lt;/a> mainly describes the design of PyLops-MPI. Our mentors are the main developers and designers of PyLops-MPI, and are also experts in HPC and MPI libraries. We will select suitable mentees for the project and deliver the benchmark results and new functionalities of PyLops-MPI by the end of the project.
We plan to have two or three mentees, since each goal is a milestone for PyLops-MPI. The three goals can be achieved separately: they are orthogonal to each other and can be executed in parallel.&lt;/p>
&lt;h3 id="pylops-mpi">PyLops-MPI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Towards High Performance NCCL-enabled 2D partitioned PyLops-MPI library&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficient in Python, Experience with MPI, Experience with GPU&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuxi-hong/">Yuxi Hong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matteo-ravasi/">Matteo Ravasi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/nan-ding/">Nan Ding&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>GeFARe: Discovering Reproducible Failure Scenarios and Developing Failure-Aware Scheduling for Genomic Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uga/gefare/</link><pubDate>Sun, 09 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uga/gefare/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: genomic processing (e.g., DNA and RNA alignment), workflow scheduling, resource/cluster management, container orchestration&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, cloud computing (e.g., OpenStack), cluster manager (e.g., Kubernetes), systems automation (e.g., Bash/Python/Puppet), genomic workflows and applications (e.g., BWA, FastQC, Picard, GATK, STAR)&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea description&lt;/strong>&lt;/h3>
&lt;p>Large-scale genomic workflow executions require large-scale computing infrastructure, as well as high utilization of that infrastructure, to maximize throughput. Systems researchers have developed various techniques to achieve this goal, including scheduling, resource harvesting, tail mitigation, and failure recovery. However, many of these large-scale efforts have been carried out by separate groups/institutions that operate such large-scale infrastructure (e.g., major tech companies and national research labs). Reproducing and building upon these works at a similar scale in an academic environment is challenging – even labs with strong ties to these institutions often have to rely on trace-based research, which does not fully capture the complexities of real-world deployments.&lt;/p>
&lt;p>We observe two fundamental reasons for this difficulty: 1) a lack of computational infrastructure at a comparable scale and 2) a lack of representative workloads and software stacks. Although the academic community has sought to broaden access to large-scale infrastructure through testbeds like ChameleonCloud and CloudLab, the representative workloads and software stacks to reproduce aforementioned works remain limited.&lt;/p>
&lt;p>We aim to address this challenge by providing a robust, easy-to-use, and open-source environment for large-scale genomics workflow scheduling. Specifically, this environment will include:
a) a suite of tools to set up infrastructure on academic cloud testbeds,
b) a scheduling research platform for genomic workflows, and
c) software stacks to reproduce large-scale failure scenarios.&lt;/p>
&lt;p>We limit the scope of this project to only one or two major failure scenarios. For example, out-of-memory (OOM) failures occur when genomics applications run with insufficient available memory. However, we aim to make the software stack extendable for other scenarios whenever possible.&lt;/p>
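&lt;p>As a hedged illustration of how a reproducible failure scenario could be described declaratively, the sketch below compares each task&amp;rsquo;s peak memory demand against its container limit, the way a failure-injection harness might decide which pods to expect to be OOM-killed. The field names are hypothetical, not a real Kubernetes or workflow schema.&lt;/p>

```python
# Toy declarative description of an OOM failure scenario for a genomic
# workflow: a task whose peak memory exceeds its container limit is expected
# to be OOM-killed. Field names are illustrative placeholders.

def expect_oom(task):
    """True when the task's peak memory demand exceeds its container limit."""
    return task["peak_mem_mb"] > task["mem_limit_mb"]

workflow = [
    {"name": "fastqc",    "mem_limit_mb": 2048, "peak_mem_mb": 900},
    {"name": "bwa-align", "mem_limit_mb": 4096, "peak_mem_mb": 6200},  # under-provisioned
]
failures = [t["name"] for t in workflow if expect_oom(t)]
print(failures)  # ['bwa-align']
```

Keeping the scenario in data rather than code is one way to make the software stack extendable to other failure types later.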
&lt;p>Throughout this project, students will learn to use cloud testbeds (e.g., ChameleonCloud) for workflow scheduling research. They will gain hands-on experience with open-source cluster management and container orchestration tools (e.g., Kubernetes) and will also learn about various aspects of high-performance computing when executing genomic workflows.&lt;/p>
&lt;p>Finally, we will open-source all the code, software stacks, and datasets created during this project. Using these artifacts, we will also ensure the reproducibility of failure scenarios.&lt;/p>
&lt;h3 id="project-deliverable">&lt;strong>Project Deliverable&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Acquire a basic understanding of genomic data processing (with mentor guidance)&lt;/li>
&lt;li>Build tools to set up a multi-node cluster on ChameleonCloud&lt;/li>
&lt;li>Create automation code/tools to set up genomics workflows’ input and containerized applications&lt;/li>
&lt;li>Discover failure scenarios for genomics workflow execution (with mentor guidance)&lt;/li>
&lt;li>Develop a Kubernetes-based platform to implement scheduling policies (Students may use or build upon existing open-source works)&lt;/li>
&lt;li>Document the steps needed to reproduce the proposed failure scenarios&lt;/li>
&lt;/ul></description></item><item><title>StatWrap</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/</link><pubDate>Sun, 09 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/northwestern/statwrap/</guid><description>&lt;p>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap&lt;/a> is a free and open-source assistive, non-invasive discovery and inventory tool to document research projects. It inventories project assets (e.g., code files, data files, manuscripts, documentation) and organizes information without additional input from the user. It also provides structure for users to add searchable and filterable notes connected to files to help communicate metadata about intent and analysis steps.&lt;/p>
&lt;p>At its core, StatWrap helps investigators identify and track changes in a research project as it evolves - which may affect reproducibility. For example: (1) people on the project can change over time, so processes may not be consistently executed due to transitions in employment; (2) data changes over time, due to accruing additional cases, adding new variables, or correcting mistakes in existing data; (3) software (e.g. used for data preparation and statistical analysis) evolves as it is edited, improved, and optimized; and (4) software can break or produce different results due to changes &amp;lsquo;under the hood&amp;rsquo; such as updates to statistical packages, compilers, or interpreters. StatWrap passively and actively documents these changes to support reproducibility.&lt;/p>
&lt;p>Additional information:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap home&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/stattag/statwrap" target="_blank" rel="noopener">StatWrap code (GitHub)&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-search">Project Search&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>search&lt;/code>, &lt;code>user interface&lt;/code>, &lt;code>indexing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: JavaScript, React&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>, &lt;a href="mailto:ewhitley@northwestern.edu">Eric Whitley&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to leverage the information entered by users and passively discovered by StatWrap to facilitate cross-project searching. This functionality will allow investigators to search across projects (current and past) to find relevant projects, assets, and notes. Given the potentially sensitive nature of data included in projects, the indexing of content for searching must be done locally.&lt;/p>
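&lt;p>The local-only indexing requirement above can be illustrated with a tiny in-memory inverted index, built and queried entirely on the user&amp;rsquo;s machine so no project data leaves the host. This is a concept sketch only; StatWrap itself is a JavaScript/React application and would use a JS indexing library.&lt;/p>

```python
# Concept sketch: a local inverted index over project notes. Everything stays
# in memory on the user's machine, matching the project's privacy constraint.
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query token."""
    token_sets = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()

notes = {
    "proj-a/readme": "logistic regression analysis of trial data",
    "proj-b/notes":  "survival analysis pipeline",
}
idx = build_index(notes)
print(sorted(search(idx, "analysis")))        # ['proj-a/readme', 'proj-b/notes']
print(sorted(search(idx, "trial analysis")))  # ['proj-a/readme']
```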
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Identify and evaluate open-source projects to index content for searching&lt;/li>
&lt;li>Add a new classification for projects of “Active” and “Past” in the user interface&lt;/li>
&lt;li>Implement the search capability within the user interface&lt;/li>
&lt;li>Develop unit tests and conduct system testing&lt;/li>
&lt;/ul></description></item><item><title>Disentangled Generation and Editing of Pathology Images</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/pathology_image_disentanglement/</link><pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/pathology_image_disentanglement/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computational pathology, image generation, disentangled representations, latent space manipulation, deep learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python, with experience in machine learning libraries such as PyTorch or TensorFlow.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Generative Models:&lt;/strong>
&lt;ul>
&lt;li>Familiarity with Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and contrastive learning methods.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Image processing techniques, statistical analysis, and working with histopathology datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Biomedical Knowledge (preferred):&lt;/strong>
&lt;ul>
&lt;li>Basic understanding of histology, cancer pathology, and biological image annotation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). The project involves substantial computational work, model development, and evaluation of generated pathology images.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xi-li/">Xi Li&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>The project aims to advance the &lt;strong>generation and disentanglement of pathology images&lt;/strong>, focusing on precise control over key histological features. By leveraging generative models, we seek to create synthetic histological images where specific pathological characteristics can be independently controlled.&lt;/p>
&lt;h3 id="challenges-in-current-approaches">&lt;strong>Challenges in Current Approaches&lt;/strong>&lt;/h3>
&lt;p>Current methods in histopathology image generation often struggle with:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Feature Entanglement:&lt;/strong> Difficulty in isolating individual factors such as cancer presence, severity, or staining variations.&lt;/li>
&lt;li>&lt;strong>Lack of Control:&lt;/strong> Limited capability to manipulate specific pathological attributes without affecting unrelated features.&lt;/li>
&lt;li>&lt;strong>Consistency Issues:&lt;/strong> Generated images often fail to maintain realistic cellular distributions, affecting biological validity.&lt;/li>
&lt;/ol>
&lt;h3 id="project-motivation">&lt;strong>Project Motivation&lt;/strong>&lt;/h3>
&lt;p>This project proposes a &lt;strong>disentangled representation framework&lt;/strong> to address these limitations. By separating key features within the latent space, we aim to:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Control Histological Features:&lt;/strong> Adjust factors such as cancer presence, tumor grade, number of malignant cells, and staining methods.&lt;/li>
&lt;li>&lt;strong>Ensure Spatial Consistency:&lt;/strong> Maintain the natural distribution of cells during image reconstruction and editing.&lt;/li>
&lt;li>&lt;strong>Enable Latent Space Manipulation:&lt;/strong> Provide interpretable controls for editing and generating realistic histopathology images.&lt;/li>
&lt;/ul>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Disentangled Representation Learning:&lt;/strong>
&lt;ul>
&lt;li>Develop generative models (e.g., VAEs, GANs) to separate and control histological features.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Latent Space Manipulation:&lt;/strong>
&lt;ul>
&lt;li>Design mechanisms for intuitive editing of pathology images through latent space adjustments.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Spatial Consistency Validation:&lt;/strong>
&lt;ul>
&lt;li>Implement evaluation metrics to ensure that cell distribution remains biologically consistent during image generation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Generative Model Framework:&lt;/strong>
&lt;ul>
&lt;li>An open-source Python implementation for pathology image generation and editing.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Disentangled Latent Space Tools:&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing and manipulating latent spaces to control specific pathological features.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation Metrics:&lt;/strong>
&lt;ul>
&lt;li>Comprehensive benchmarks assessing image quality, feature disentanglement, and biological realism.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials:&lt;/strong>
&lt;ul>
&lt;li>Clear guidelines and code examples for the research community to adopt and build upon this work.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>By enabling precise control over generated histology images, this project will contribute to &lt;strong>data augmentation&lt;/strong>, &lt;strong>model interpretability&lt;/strong>, and &lt;strong>biological insight&lt;/strong> in computational pathology. The disentangled approach offers new opportunities for researchers to explore disease mechanisms, develop robust diagnostic models, and improve our understanding of cancer progression and tissue morphology.&lt;/p>
&lt;hr></description></item><item><title>Autograder</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">EduLinq Autograder&lt;/a> is an open source tool used by several courses at UCSC
to safely and quickly grade programming assignments.
Grading student code is something that may seem simple at first (you just need to run their code!),
but quickly becomes exceedingly complex as you get more into the details.
Specifically, grading a student&amp;rsquo;s code securely while providing the &amp;ldquo;last mile&amp;rdquo; service of getting code from students
and sending results to instructors/TAs and the course&amp;rsquo;s LMS (e.g., Canvas) can be very difficult.
The Autograder provides all of this in a free and open source project.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Autograder.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="llm-detection">LLM Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>AI/ML&lt;/code> &lt;code>LLM&lt;/code> &lt;code>Research&lt;/code> &lt;code>Backend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, systems, data munging, go, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener">Large Language Model (LLM)&lt;/a> tools like ChatGPT become more common and powerful,
instructors need tools to help determine if students are the actual authors of the code they submit.
More classical instances of plagiarism are often discovered by code similarity tools like &lt;a href="https://theory.stanford.edu/~aiken/moss/" target="_blank" rel="noopener">MOSS&lt;/a>.
However these tools are not sufficient for detecting code written not by a student,
but by an AI model like &lt;a href="https://en.wikipedia.org/wiki/ChatGPT" target="_blank" rel="noopener">ChatGPT&lt;/a> or &lt;a href="https://en.wikipedia.org/wiki/GitHub_Copilot" target="_blank" rel="noopener">GitHub Copilot&lt;/a>.&lt;/p>
&lt;p>The task for this project is to create a system that provides a score indicating the system&amp;rsquo;s confidence that a given piece of code was written by an AI tool and not a student.
This will supplement the existing code analysis tools in the Autograder.
There are many approaches to completing this task that will be considered.
A more software development approach can consist of leveraging existing systems to create a production-ready system,
whereas a more research approach can consist of creating a novel approach complete with a paper and experiments.&lt;/p>
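&lt;p>One possible shape for such a scoring interface is sketched below. The stylometric features used here (comment density, identifier length) are toy signals chosen purely for illustration; a production system would rely on stronger evidence such as perplexity under a code LLM, and would be calibrated on labelled data.&lt;/p>

```python
# Toy sketch of the scoring interface the project asks for: given source code,
# return a confidence in [0, 1] that it was AI-written. The features are
# illustrative placeholders, not a validated detection method.

def ai_confidence(source: str) -> float:
    lines = [ln for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    # Fraction of lines that are comments.
    comment_ratio = sum(ln.lstrip().startswith("#") for ln in lines) / len(lines)
    # Average length of purely alphabetic words (a crude verbosity proxy).
    idents = [w for ln in lines for w in ln.replace("_", " ").split() if w.isalpha()]
    avg_len = sum(map(len, idents)) / len(idents) if idents else 0.0
    # Heavily commented, verbosely named code scores higher in this toy model.
    score = 0.6 * comment_ratio + 0.4 * min(avg_len / 10.0, 1.0)
    return round(min(score, 1.0), 3)

print(ai_confidence("x=1\ny=2\n"))
print(ai_confidence("# compute the running total\n"
                    "# of all observed values\n"
                    "total = accumulate(values)\n"))
```

Whatever model replaces these toy features, keeping this simple score-in, score-out signature would let the Autograder plug the detector in alongside its existing code analysis tools.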
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">Repository for Autograder Server&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/140" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="code-analysis-gui">Code Analysis GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, data munging, js, css, go&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder has existing functionality to analyze the code in a student&amp;rsquo;s submission for malicious content.
Relevant to this project is that the Autograder can run a pairwise similarity analysis against all submitted code.
This is how most existing software plagiarism systems detect offending code.
The existing infrastructure provides detailed statistics on code similarity,
but does not currently have a visual way to display this data.&lt;/p>
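&lt;p>To make the shape of that data concrete, the sketch below computes Jaccard similarity over character 5-grams for every pair of submissions. This mirrors the kind of pairwise scores such a GUI would visualize (e.g., as a heatmap), not the Autograder&amp;rsquo;s actual analysis algorithm.&lt;/p>

```python
# Toy pairwise similarity analysis: Jaccard similarity over character 5-grams,
# computed for every pair of submissions. Illustrative of the data shape only.
from itertools import combinations

def shingles(code, k=5):
    """Character k-grams of the code with whitespace removed."""
    s = "".join(code.split())
    return {s[i:i + k] for i in range(max(len(s) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

subs = {
    "alice": "def add(a, b): return a + b",
    "bob":   "def add(x, y): return x + y",
    "carol": "print('hello world')",
}
pairs = {(u, v): round(jaccard(shingles(subs[u]), shingles(subs[v])), 2)
         for u, v in combinations(sorted(subs), 2)}
print(pairs)
```

A heatmap or sortable table over a matrix like `pairs` is essentially what the web GUI would render from the REST API's analysis results.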
&lt;p>The task for this project is to create a web GUI using the Autograder REST API
to display the results of a code analysis.
The size of this project depends on how many of the existing features are going to be supported by the web GUI.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/142" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/internal/model/analysis.go#L78" target="_blank" rel="noopener">Pairwise Code Analysis Type&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py/blob/main/tests/api/testdata/courses/assignments/submit/analysis/course_assignments_submissions_analysis_pairwise_wait.json" target="_blank" rel="noopener">Sample API Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="web-gui">Web GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, js, css&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder contains dozens of &lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">API endpoints&lt;/a>,
most directly representing a piece of functionality exposed to the user.
All of these features are exposed in the &lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>.
However, the Python interface is a purely command-line interface.
And although command-line interfaces are objectively (read: subjectively) the best,
a web GUI would be more accessible to a wider audience.
The autograder already has a web GUI,
but it does not cover all the features available in the Autograder.&lt;/p>
&lt;p>The task for this project is to augment the Autograder&amp;rsquo;s web GUI with more features.
Specifically, add support for more tools used to create and administer courses.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/61" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">Autograder API Endpoints&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LMS Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/lms-toolkit/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/lms-toolkit/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">EduLinq LMS Toolkit&lt;/a> (also called the &amp;ldquo;Canvas Tool&amp;rdquo; or &amp;ldquo;py-canvas&amp;rdquo;) is a suite of tools used by several courses at UCSC
to interact with Canvas from the command line or Python.
A &lt;a href="https://en.wikipedia.org/wiki/Learning_management_system" target="_blank" rel="noopener">Learning Management System&lt;/a> (LMS) is a system that institutions use to manage courses, assignments, students, and grades.
The most popular LMSs are
&lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>,
&lt;a href="https://en.wikipedia.org/wiki/Blackboard_Learn" target="_blank" rel="noopener">Blackboard&lt;/a>,
&lt;a href="https://en.wikipedia.org/wiki/Moodle" target="_blank" rel="noopener">Moodle&lt;/a>,
and &lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">Brightspace&lt;/a>.
These tools can be very helpful, especially from an administrative standpoint, but can be hard to interact with.
They can be especially difficult when instructors and TAs want to do something that is not explicitly supported by their built-in GUIs
(e.g., when an instructor wants to use a special grading policy).
The LMS Toolkit project is an effort to create a single suite of command-line tools (along with a Python interface)
to connect to all the above mentioned LMSs in a simple and uniform way.
So, not only can instructors and TAs easily access and modify the data held in an LMS (like a student&amp;rsquo;s grades),
but they can also do it the same way on any LMS.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Quiz Composer.&lt;/p>
&lt;p>Currently, the LMS Toolkit only supports Canvas, but this suite of projects hopes to not only expand existing support,
but add support for more LMSs.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="advanced-canvas-support">Advanced Canvas Support&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The LMS Toolkit already has basic read-write support for core Canvas functionality (working with grades and assignments).
However, there are still many more features that can be supported such as
&lt;a href="https://github.com/edulinq/py-canvas/issues/17" target="_blank" rel="noopener">group management&lt;/a>,
&lt;a href="https://github.com/edulinq/py-canvas/issues/7" target="_blank" rel="noopener">quiz management&lt;/a>,
&lt;a href="https://github.com/edulinq/py-canvas/issues/10" target="_blank" rel="noopener">quiz statistics&lt;/a>,
and &lt;a href="https://github.com/edulinq/py-canvas/issues/19" target="_blank" rel="noopener">assignment statuses&lt;/a>.&lt;/p>
&lt;p>The task for this project is to choose a set of advanced Canvas features to support
(not limited to those mentioned above),
design an LMS-agnostic way to support those features,
and implement them.
The flexibility in the features chosen accounts for the variable size of this project.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>GitHub Issues
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/17" target="_blank" rel="noopener">Group Management&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/7" target="_blank" rel="noopener">Quiz Management&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/10" target="_blank" rel="noopener">Quiz Statistics&lt;/a>,&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/19" target="_blank" rel="noopener">Assignment Statuses&lt;/a>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="new-lms-support-moodle">New LMS Support: Moodle&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
It is a lofty goal; however, there is currently only support for &lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>.
&lt;a href="https://en.wikipedia.org/wiki/Moodle" target="_blank" rel="noopener">Moodle&lt;/a> is one of the more popular LMSs.
Naturally, the LMS Toolkit wants to support Moodle as well.
Moodle is open source, so adding support in the LMS Toolkit should not be too challenging.&lt;/p>
&lt;p>The task for this project is to add basic support for the Moodle LMS.
It is not necessary to support all the same features that are supported for Canvas,
but at least the core features of score and assignment management should be implemented.&lt;/p>
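&lt;p>For a sense of what a Moodle backend would involve: Moodle exposes its functionality through a token-based REST web-service protocol. The sketch below builds such a request; the endpoint and function name follow Moodle&amp;rsquo;s documented protocol, but the helper itself is a hypothetical illustration.&lt;/p>

```python
# Minimal sketch of how a Moodle backend might shape a call to
# Moodle's web-service API. Function availability depends on the
# site's web-service configuration.
import urllib.parse

MOODLE_ENDPOINT = "/webservice/rest/server.php"

def build_moodle_request(base_url, token, function, **params):
    """Return (url, payload) for a Moodle REST web-service call."""
    payload = {
        "wstoken": token,
        "wsfunction": function,
        "moodlewsrestformat": "json",
    }
    payload.update(params)
    return base_url.rstrip("/") + MOODLE_ENDPOINT, payload

url, payload = build_moodle_request(
    "https://moodle.example.edu", "SECRET",
    "gradereport_user_get_grade_items", courseid=42)
print(url)
print(urllib.parse.urlencode(payload))
```

&lt;p>An actual backend would POST this payload and map the JSON response onto the toolkit&amp;rsquo;s existing score and assignment abstractions.&lt;/p>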
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://en.wikipedia.org/wiki/Moodle" target="_blank" rel="noopener">Moodle Wiki Page&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/22" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="new-lms-support-blackboard">New LMS Support: Blackboard&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
It is a lofty goal; however, there is currently only support for &lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>.
&lt;a href="https://en.wikipedia.org/wiki/Blackboard_Learn" target="_blank" rel="noopener">Blackboard&lt;/a> (also called &amp;ldquo;Blackboard Learn&amp;rdquo;) is one of the more popular LMSs.
Naturally, the LMS Toolkit wants to support Blackboard as well.
However, a challenge in supporting Blackboard is that it is not open source (unlike Canvas).
Therefore, support and testing on Blackboard may be very challenging.&lt;/p>
&lt;p>The task for this project is to add basic support for the Blackboard LMS.
It is not necessary to support all the same features that are supported for Canvas,
but at least the core features of score and assignment management should be implemented.
The closed nature of Blackboard makes this a challenging and uncertain project.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://en.wikipedia.org/wiki/Blackboard_Learn" target="_blank" rel="noopener">Blackboard Wiki Page&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/21" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="new-lms-support-brightspace">New LMS Support: Brightspace&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
It is a lofty goal; however, there is currently only support for &lt;a href="https://en.wikipedia.org/wiki/Instructure#Canvas" target="_blank" rel="noopener">Canvas&lt;/a>.
&lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">D2L Brightspace&lt;/a> is one of the more popular LMSs.
Naturally, the LMS Toolkit wants to support Brightspace as well.
However, a challenge in supporting Brightspace is that it is not open source (unlike Canvas).
Therefore, support and testing on Brightspace may be very challenging.&lt;/p>
&lt;p>The task for this project is to add basic support for the Brightspace LMS.
It is not necessary to support all the same features that are supported for Canvas,
but at least the core features of score and assignment management should be implemented.
The closed nature of Brightspace makes this a challenging and uncertain project.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://en.wikipedia.org/wiki/D2L#Brightspace" target="_blank" rel="noopener">Brightspace Wiki Page&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/23" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="testing--ci-infrastructure">Testing / CI Infrastructure&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>Testing&lt;/code> &lt;code>CI&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, testing, ci, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Batuhan Salih&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of the LMS toolkit is to provide a single interface for all LMSs.
This means that our system must communicate with several different systems (the LMSs),
each with their own APIs, data patterns, versions, and quirks.
Testing will be essential to ensure that our tools keep working as the different LMSs evolve and update.
The LMS Toolkit currently tests with Canvas by
&lt;a href="https://github.com/edulinq/py-canvas/tree/main/tests/api/test_cases" target="_blank" rel="noopener">mocking API responses&lt;/a>.
However, this tactic does not scale well with multiple LMSs (and multiple versions of each system).
A more scalable approach would be to have test instances of the different LMSs that our testing infrastructure can interact with
both interactively and in &lt;a href="https://en.wikipedia.org/wiki/Continuous_integration" target="_blank" rel="noopener">continuous integration&lt;/a> (CI).&lt;/p>
&lt;p>The task for this project is to create testing infrastructure that
connects to test instances of different LMS systems (e.g., Canvas).
This task does not require that all the LMSs in this document are used,
but the testing infrastructure should be robust enough to support them all.
The open source LMSs (Canvas and Moodle) will likely be much easier to set up than the others,
and should be targeted first.
We should be able to run tests locally as well as in CI,
and will likely heavily use &lt;a href="https://en.wikipedia.org/wiki/Docker_%28software%29" target="_blank" rel="noopener">Docker&lt;/a> containers.&lt;/p>
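&lt;p>One possible shape for this infrastructure, sketched under the assumption that a test LMS instance (started locally or by CI from a Docker image) is exposed to the test runner via an environment variable. The variable name and container image here are examples, not settled choices.&lt;/p>

```python
# Sketch of a tiny harness for running the same API tests against a
# local LMS container and in CI.
import os
import subprocess

MOODLE_IMAGE = "moodlehq/moodle-php-apache:8.2"  # example image/tag

def lms_url_from_env(env):
    """CI and local runs expose the test instance via LMS_TEST_URL."""
    return env.get("LMS_TEST_URL", "http://localhost:8080")

def start_moodle_container():
    """Start a throwaway Moodle container (requires Docker)."""
    return subprocess.Popen(
        ["docker", "run", "--rm", "-p", "8080:80", MOODLE_IMAGE])

print(lms_url_from_env({}))  # http://localhost:8080
```

&lt;p>The same pattern extends to Canvas, and (where licensing allows) to hosted test tenants for the closed-source LMSs.&lt;/p>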
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas" target="_blank" rel="noopener">Repository for LMS Toolkit&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/issues/24" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/py-canvas/tree/main/tests/api/test_cases" target="_blank" rel="noopener">Mocked API Responses&lt;/a>.&lt;/li>
&lt;/ul></description></item><item><title>Quiz Composer</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/quiz-composer/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/quiz-composer/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">EduLinq Quiz Composer&lt;/a> (also called the &amp;ldquo;Quiz Generator&amp;rdquo;) is a tool used by several courses at UCSC
to create and maintain platform-agnostic quizzes (including exams and worksheets).
Knowledge assessments like quizzes, exams, and tests are a core part of the learning process for many courses.
However, maintaining banks of questions, collaborating on new questions, and converting quizzes to new formats can consume a lot of time,
taking time away from improving course materials.
The Quiz Composer helps by providing a single text-based format that can be stored in a repository and &amp;ldquo;compiled&amp;rdquo; into many different formats including:
HTML, LaTeX, PDF, Canvas, GradeScope, and QTI.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Quiz Composer.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="canvas-import">Canvas Import&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, http request inspection, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Quiz Composer houses quizzes and quiz questions in a simple and unambiguous format based
on &lt;a href="https://en.wikipedia.org/wiki/JSON" target="_blank" rel="noopener">JSON&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Markdown" target="_blank" rel="noopener">Markdown&lt;/a> (specifically, the &lt;a href="https://commonmark.org" target="_blank" rel="noopener">CommonMark specification&lt;/a>).
This allows the Quiz Composer to unambiguously create versions of the same quiz in many different formats.
However, creating a quiz in the Quiz Composer format can be a daunting task for those not familiar with JSON or Markdown.
Instead, it would be easier for people to import quizzes from another format into the Quiz Composer format,
and then edit them as they see fit.
Unfortunately, not all other quiz formats, namely Canvas in this case, are unambiguous.&lt;/p>
&lt;p>The task for this project is to implement the functionality of importing quizzes from Canvas to the standard Quiz Composer format.
The ambiguous nature of Canvas quizzes makes this task non-trivial,
and adds an element of design decision-making to the task.
It will be impossible to import quizzes 100% correctly,
but we want to be able to get close enough that most people can import their quizzes without issue.&lt;/p>
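&lt;p>A minimal sketch of one piece of such an importer, assuming the question objects returned by Canvas&amp;rsquo;s classic quiz API. The output schema below is a simplified stand-in for the real Quiz Composer format, not its actual specification.&lt;/p>

```python
# Convert one Canvas "classic quiz" question (as returned by
# GET /api/v1/courses/:id/quizzes/:id/questions) into a simplified
# stand-in for the Quiz Composer's JSON format.
def convert_canvas_question(canvas_q):
    if canvas_q["question_type"] != "multiple_choice_question":
        raise NotImplementedError(canvas_q["question_type"])
    return {
        "name": canvas_q["question_name"],
        # Canvas prompts are HTML; a real importer would convert this
        # to Markdown (e.g., with an HTML-to-CommonMark library).
        "prompt": canvas_q["question_text"],
        "question_type": "multiple_choice",
        "answers": [
            {"text": a["text"], "correct": a["weight"] > 0}
            for a in canvas_q["answers"]
        ],
    }

example = {
    "question_name": "Q1",
    "question_text": "2 + 2 = ?",
    "question_type": "multiple_choice_question",
    "answers": [{"text": "4", "weight": 100}, {"text": "5", "weight": 0}],
}
print(convert_canvas_question(example)["answers"][0])
```

&lt;p>The hard design work lies in the cases this sketch dodges: HTML prompts with embedded media, question types with no clean Quiz Composer equivalent, and partial-credit answer weights.&lt;/p>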
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen/issues/27" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="google-forms-export">Google Forms Export&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, rest api, data munging, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Quiz Composer can export quizzes to many different formats,
each with a varying level of interactivity and feature support.
For example, quizzes can be exported to PDFs that are printed, with students writing down their answers to be checked later.
Quizzes can also be exported to interactive platforms like Canvas, where students can enter answers that may be automatically checked, with feedback immediately provided to the student.
One potential platform with functionality somewhere between the above two examples is &lt;a href="https://workspace.google.com/products/forms/" target="_blank" rel="noopener">Google Forms&lt;/a>.
&amp;ldquo;Forms&amp;rdquo; (an entity on Google Forms) can be something like a survey or (more recently) a quiz.&lt;/p>
&lt;p>The task for this project is to add support for exporting quizzes from the Quiz Composer to Google Forms.
There is a large overlap in the quiz features supported in Canvas (which the Quiz Composer already supports) and Google Forms,
so most settings should be fairly straightforward.
There may be some design work around deciding what features are specific to one quiz platform
and what features can be abstracted to work across several platforms.&lt;/p>
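&lt;p>To make the export concrete, the sketch below builds the kind of &lt;code>batchUpdate&lt;/code> request the Google Forms API (v1) accepts for a multiple-choice question. The request shape follows the public Forms API; the input dict is a simplified stand-in for a Quiz Composer question.&lt;/p>

```python
# Build a Forms API v1 createItem request for one multiple-choice
# question. A real exporter would emit one of these per question.
def to_forms_request(question, index):
    return {
        "createItem": {
            "item": {
                "title": question["prompt"],
                "questionItem": {
                    "question": {
                        "required": True,
                        "choiceQuestion": {
                            "type": "RADIO",
                            "options": [
                                {"value": a["text"]}
                                for a in question["answers"]
                            ],
                        },
                    }
                },
            },
            "location": {"index": index},
        }
    }

req = to_forms_request(
    {"prompt": "2 + 2 = ?", "answers": [{"text": "4"}, {"text": "5"}]}, 0)
print(req["createItem"]["item"]["title"])
```

&lt;p>A real exporter would collect one such request per question and submit them through the official Google API client; when the form is configured as a quiz, answer keys and point values can additionally be attached to each question.&lt;/p>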
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen/issues/19" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="template-questions">Template Questions&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Backend&lt;/code> &lt;code>Teaching Tools&lt;/code> &lt;code>API&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, data munging, python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lucas Ellenberger&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Questions in the Quiz Composer are described using &lt;a href="https://en.wikipedia.org/wiki/JSON" target="_blank" rel="noopener">JSON&lt;/a> and &lt;a href="https://en.wikipedia.org/wiki/Markdown" target="_blank" rel="noopener">Markdown&lt;/a>
files which contain the question prompt, possible answers, and the correct answer.
(Of course, there are many different &lt;a href="https://github.com/edulinq/quizgen/blob/main/docs/question-types.md" target="_blank" rel="noopener">question types&lt;/a>,
each with different semantics and requirements.)
However, a limitation of this is that each question is always the same.
You can have multiple copies of a question with slightly different prompts, numbers, and answers;
but you are still limited to each question being static and unchanging.
It would be useful to have &amp;ldquo;template questions&amp;rdquo; that can dynamically create static questions from a template
and collection of replacement data.&lt;/p>
&lt;p>The task for this project is to add support for the &amp;ldquo;template questions&amp;rdquo; discussed above.
Much of the high-level design work for this issue has &lt;a href="https://github.com/edulinq/quizgen/issues/26" target="_blank" rel="noopener">already been completed&lt;/a>.
But the implementation and low-level design decisions remain.&lt;/p>
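&lt;p>As a toy illustration of the idea (the placeholder syntax below is an assumption, not the design settled on in the issue): a template question plus a dictionary of replacement values yields an ordinary static question.&lt;/p>

```python
# Instantiate a "template question" by substituting replacement data
# into {{key}} placeholders. The placeholder syntax is illustrative.
import re

def instantiate(template, values):
    """Replace every {{key}} in the template's strings with values[key]."""
    def sub(text):
        return re.sub(r"\{\{(\w+)\}\}", lambda m: str(values[m.group(1)]), text)
    return {k: sub(v) if isinstance(v, str) else v for k, v in template.items()}

template = {
    "prompt": "What is {{a}} + {{b}}?",
    "answer": "{{total}}",
}
q = instantiate(template, {"a": 3, "b": 4, "total": 7})
print(q["prompt"])  # What is 3 + 4?
```

&lt;p>The interesting design questions start where this sketch stops: where the replacement data lives, how variants are sampled or enumerated, and how derived values (like &lt;code>total&lt;/code> above) are computed rather than listed by hand.&lt;/p>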
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen" target="_blank" rel="noopener">Repository for Quiz Composer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/quizgen/issues/26" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LLMSeqRec: LLM Enhanced Contextual Sequential Recommender</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/llmseqrec/</link><pubDate>Thu, 06 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/llmseqrec/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains.&lt;/p>
&lt;p>Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning.
By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on vanilla sequential ID orders.
To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Data Preprocessing &amp;amp; Feature Creation&lt;/strong>:
Develop a data processing pipeline to parse users’ sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling; extract user behavior sequences, items’ metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing. The data source can be the Amazon open public data or the MovieLens dataset, and data point creation can follow SASRec (reference 1).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Model Development&lt;/strong>:
Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context; develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions and reviews, into the sequential recommendation process. The baseline model can be the SASRec PyTorch implementation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation&lt;/strong>:
Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating accuracy, NDCG, and cold-start performance; conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality; optimize model inference speed and efficiency for real-time recommendation scenarios.&lt;/p>
&lt;/li>
&lt;/ul>
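&lt;p>One simple way to realize the fusion in Step 2, sketched in PyTorch: keep a learned item-ID embedding (as in SASRec) and add a projection of a frozen, precomputed LLM text embedding of each item&amp;rsquo;s metadata. The dimensions and the sum-based fusion below are illustrative choices, not the project&amp;rsquo;s final design.&lt;/p>

```python
# Fuse a learned item-ID embedding with a frozen LLM text embedding
# (e.g., from an encoder run over product titles/descriptions).
import torch
import torch.nn as nn

class LLMFusedItemEmbedding(nn.Module):
    def __init__(self, num_items, id_dim, llm_embeddings):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, id_dim)
        # Frozen precomputed LLM embeddings, shape (num_items, llm_dim).
        self.register_buffer("llm_emb", llm_embeddings)
        self.proj = nn.Linear(llm_embeddings.shape[1], id_dim)

    def forward(self, item_ids):
        # Sum-fusion of the ID embedding and projected text embedding.
        return self.id_emb(item_ids) + self.proj(self.llm_emb[item_ids])

llm_vectors = torch.randn(100, 384)  # pretend: 100 items, 384-d text embeddings
emb = LLMFusedItemEmbedding(num_items=100, id_dim=64, llm_embeddings=llm_vectors)
out = emb(torch.tensor([[1, 2, 3]]))  # a batch of one length-3 sequence
print(out.shape)  # torch.Size([1, 3, 64])
```

&lt;p>The fused embeddings would then feed a SASRec-style self-attention stack unchanged, which is what makes this fusion point convenient for ablation studies.&lt;/p>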
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: software; model training, validation, and performance evaluation; and a demo. The software implementing the LLMSeqRec model will be hosted in an open-access GitHub repository. The evaluation results and demo will be published alongside the repository.&lt;/p>
&lt;h3 id="llmseqrec">LLMSeqRec&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: LLM Enhanced Contextual Sequential Recommender&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, PyTorch, GitHub, self-attention, Transformers&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>Self-Attentive Sequential Recommendation (SASRec)&lt;/li>
&lt;li>BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer&lt;/li>
&lt;li>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/li>
&lt;li>Amazon Dataset: &lt;a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews" target="_blank" rel="noopener">https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews&lt;/a>&lt;/li>
&lt;li>MovieLens Data: &lt;a href="https://grouplens.org/datasets/movielens/" target="_blank" rel="noopener">https://grouplens.org/datasets/movielens/&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>ReIDMM: Re-identifying Multiple Objects across Multiple Streams</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/reidmm/</link><pubDate>Thu, 06 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/reidmm/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Re-identifying multiple objects across multiple streams (ReIDMM) is essential in scientific research and various industries. It involves tracking and analyzing entities across different viewpoints or time frames. In astronomy, ReIDMM helps track celestial objects like asteroids and space debris using multiple observatories. In biology and ecology, it enables the identification of animals across different camera traps and aids in tracking microscopic organisms in laboratory studies. In physics and engineering, it is used for tracking particles in high-energy physics experiments, monitoring structural changes in materials, and identifying robots or drones in lab automation.&lt;/p>
&lt;p>Beyond scientific applications, ReIDMM plays a critical role in industries such as retail, where it tracks customer behavior across multiple stores and improves sales and prevents theft. In smart cities, it supports traffic monitoring by identifying vehicles across intersections for improved traffic flow management. In manufacturing, it enables supply chain tracking by locating packages across conveyor belts and warehouse cameras. In autonomous systems, ReIDMM enhances multi-camera sensor fusion and warehouse robotics by identifying pedestrians, obstacles, and objects across different camera views.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an open-source algorithm for multiple-object re-identification across diverse open-source data streams. As highlighted earlier, this method is expected to have wide-ranging applications in both scientific research and industry. Utilizing an open-source dataset, our focus will be on re-identifying common objects such as vehicles and pedestrians. The primary challenge lies in designing a unified algorithm, ReIDMM, capable of performing robust multi-object re-identification across multiple streams. Users will be able to tag any object as a target in a video or image for tracking across streams. Below is an outline of the algorithms to be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Target Object Identification&lt;/strong>: Randomly select a target object from an image or video using object detection models such as YOLOv7. These models detect objects by generating bounding boxes around them. Target objects could include vehicles, pedestrians, animals, or other recognizable entities. This step ensures an initial object of interest is chosen for re-identification.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Feature Extraction and Embedding&lt;/strong>: Once the target object is identified, extract relevant features such as bounding box coordinates, timestamp, location metadata (if available), and visual characteristics. A multimodal embedding approach is used, where these features are transformed into a numerical representation (embedding vector) that captures the object&amp;rsquo;s unique identity. This allows for efficient comparison across different images or videos.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Searching and Matching&lt;/strong>: To find the target object in other images or videos: (1) Extract embeddings of all objects detected in the other images/videos; (2) Compute similarity between the target object’s embedding and those of all detected objects using metrics like cosine similarity or Euclidean distance. (3) Rank objects by similarity, returning the most probable matches. The highest-ranked results are likely to be the same object observed from different angles, lighting conditions, or time frames.&lt;/p>
&lt;/li>
&lt;/ul>
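&lt;p>Step 3 can be sketched in a few lines. The embeddings below are random stand-ins for the multimodal vectors of Step 2, with one candidate deliberately made a near-duplicate of the target.&lt;/p>

```python
# Rank detected objects in another stream by cosine similarity to the
# target object's embedding (Step 3 of the outline above).
import numpy as np

def rank_matches(target, candidates):
    """Return candidate indices sorted by cosine similarity (best first)."""
    t = target / np.linalg.norm(target)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = c @ t
    order = np.argsort(-sims)
    return order, sims[order]

rng = np.random.default_rng(0)
target = rng.normal(size=128)
candidates = rng.normal(size=(5, 128))
candidates[3] = target + 0.01 * rng.normal(size=128)  # near-duplicate of target
order, sims = rank_matches(target, candidates)
print(order[0])  # 3 -- the near-duplicate ranks first
```

&lt;p>In the real system, a similarity threshold (or a learned re-ranking step) would separate true re-identifications from merely similar-looking objects.&lt;/p>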
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three things: software, evaluation results, and a demo. The software implementing the ReIDMM algorithm will be hosted in an open-access GitHub repository. The evaluation results and demo will be published alongside the repository.&lt;/p>
&lt;h3 id="reidmm">ReIDMM&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: ReIDMM: Re-identifying Multiple Objects across Multiple Streams&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficient in Python; experience with image processing and machine learning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="reference">Reference:&lt;/h3>
&lt;ul>
&lt;li>&lt;a href="https://medium.datadriveninvestor.com/multiple-object-tracking-using-person-re-identification-f9b7360cda1a" target="_blank" rel="noopener">multiple-object-tracking-using-person&lt;/a>&lt;/li>
&lt;li>Dataset: &lt;a href="https://paperswithcode.com/task/vehicle-re-identification" target="_blank" rel="noopener">Vehicle re-identification dataset and paper&lt;/a> and &lt;a href="https://paperswithcode.com/task/person-re-identification" target="_blank" rel="noopener">Person re-identification data and paper&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Seam: Kubernetes-Aware Programmable Networking &amp; Cloud Provisioning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsd/seam/</link><pubDate>Wed, 05 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsd/seam/</guid><description>&lt;p>Seam is a project focused on building a Kubernetes-aware programmable networking and cloud provisioning system. It combines Python, Kubernetes, P4 programming, and SmartNICs to create a robust framework for managing cloud resources, optimizing networking, and provisioning virtual machines. Students will learn about cutting-edge technologies such as Kubernetes, Docker, P4 programming, SmartNICs, KubeVirt, Prometheus, Grafana, and Flask, while working on real-world applications in high-performance computing environments. This project will help students understand the intricacies of cloud resource management and programmable networking, providing them with valuable skills for future careers in software engineering, networking, and DevOps.&lt;/p>
&lt;p>The project involves creating a &lt;strong>Python library&lt;/strong> for provisioning Kubernetes resources, including virtual machines and networking, using tools such as &lt;strong>KubeVirt&lt;/strong> for VM provisioning and &lt;strong>ESnet SENSE&lt;/strong> for network configuration. The library will also integrate monitoring solutions with &lt;strong>Prometheus&lt;/strong> and &lt;strong>Grafana&lt;/strong> for real-time metrics collection and visualization. Students will develop &lt;strong>Flask-based dashboards&lt;/strong> for managing these resources, implement automated pipelines using &lt;strong>GitLab CI/CD&lt;/strong>, and explore full-stack web development, database management with &lt;strong>PostgreSQL&lt;/strong>, and API design.&lt;/p>
&lt;p>In addition, students will gain hands-on experience with &lt;strong>programmable networking&lt;/strong> using &lt;strong>P4&lt;/strong> and &lt;strong>SmartNICs&lt;/strong>, learning how to write P4 programs for dynamic routing, security, and network policy enforcement at the hardware level. The integration of &lt;strong>Kubernetes&lt;/strong>, &lt;strong>SmartNICs&lt;/strong>, and &lt;strong>P4 programming&lt;/strong> will allow for advanced optimizations and efficient management of high-performance cloud environments.&lt;/p>
&lt;p>Thus far, the framework supports provisioning resources within Kubernetes, integrates Prometheus and Grafana for monitoring, and provides an interface for users to manage cloud resources. We aim to extend this by incorporating advanced network policies and improving the web interface.&lt;/p>
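To give a feel for the provisioning flow described above, here is a minimal sketch using the official `kubernetes` Python client to build and submit a KubeVirt VirtualMachine custom resource. The namespace and VM names are illustrative assumptions, not part of the Seam codebase; manifest construction is kept separate from the API call so it can be exercised without a cluster.

```python
def make_vm_manifest(name, namespace, memory="1Gi"):
    """Return a minimal KubeVirt VirtualMachine custom resource (hypothetical defaults)."""
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "running": True,
            "template": {
                "spec": {
                    "domain": {
                        "resources": {"requests": {"memory": memory}},
                        "devices": {"disks": []},
                    }
                }
            },
        },
    }

def submit_vm(manifest):
    """Create the VM via the Kubernetes custom-objects API (requires cluster access)."""
    from kubernetes import client, config  # imported here: needs a configured cluster
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    api = client.CustomObjectsApi()
    return api.create_namespaced_custom_object(
        group="kubevirt.io",
        version="v1",
        namespace=manifest["metadata"]["namespace"],
        plural="virtualmachines",
        body=manifest,
    )
```

Submitting requires cluster credentials, so only `make_vm_manifest` runs offline; a real Seam library would layer quota and network-policy fields on top of this skeleton.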
&lt;h3 id="seam--kubernetes-resource-provisioning-and-management">Seam / Kubernetes Resource Provisioning and Management&lt;/h3>
&lt;p>The proposed work includes expanding the Python library to support comprehensive &lt;strong>Kubernetes resource provisioning&lt;/strong>, &lt;strong>network management&lt;/strong>, and &lt;strong>virtual machine provisioning&lt;/strong> using &lt;strong>KubeVirt&lt;/strong>. Students will enhance the current implementation to allow users to define &lt;strong>resource limits, CPU/GPU quotas, and network policies&lt;/strong>. They will also integrate with &lt;strong>ESnet SENSE&lt;/strong> to facilitate &lt;strong>L2 networking&lt;/strong>, and explore the use of &lt;strong>Prometheus&lt;/strong> and &lt;strong>Grafana&lt;/strong> for real-time performance monitoring and metrics collection.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Kubernetes, Python, Cloud Computing, Networking, Programmable Networking, Monitoring, CI/CD&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Kubernetes, P4 programming, KubeVirt, ESnet SENSE, Docker, GitLab CI/CD, Prometheus, Grafana, PostgreSQL, Flask&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="seam--full-stack-web-development-and-dashboard">Seam / Full-Stack Web Development and Dashboard&lt;/h3>
&lt;p>The proposed work includes building a &lt;strong>Flask-based web dashboard&lt;/strong> using &lt;strong>Bootstrap&lt;/strong> for UI, integrating it with the &lt;strong>Python library&lt;/strong> to enable users to easily provision resources, monitor network performance, and track resource usage in real-time. The dashboard will support &lt;strong>role-based access control (RBAC)&lt;/strong>, allowing for secure multi-user management. Students will also integrate &lt;strong>PostgreSQL&lt;/strong> for managing and storing configurations, logs, and performance metrics.&lt;/p>
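The role-based access control mentioned above can be sketched framework-agnostically; the role names and permission map below are illustrative assumptions, not Seam's actual schema, and the same decorator pattern would wrap Flask view functions in the real dashboard.

```python
from functools import wraps

# Hypothetical role hierarchy for the dashboard; adjust to the real schema.
ROLE_PERMISSIONS = {
    "viewer": {"read_metrics"},
    "operator": {"read_metrics", "provision_vm"},
    "admin": {"read_metrics", "provision_vm", "manage_users"},
}

class Forbidden(Exception):
    """Raised when a user's role lacks the required permission."""

def requires_permission(permission):
    """Decorator: reject the call unless the user's role grants the permission."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(user, *args, **kwargs):
            allowed = ROLE_PERMISSIONS.get(user.get("role"), set())
            if permission not in allowed:
                raise Forbidden(f"role {user.get('role')!r} may not {permission}")
            return fn(user, *args, **kwargs)
        return wrapper
    return decorator

@requires_permission("provision_vm")
def provision_vm(user, name):
    """Placeholder for a provisioning endpoint; only operators and admins reach here."""
    return f"provisioned {name}"
```

In a Flask app the `user` argument would come from the session or an auth token rather than being passed explicitly.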
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Full-Stack Web Development, Flask, Bootstrap, PostgreSQL, Kubernetes, Monitoring, DevOps&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web Development, Flask, Bootstrap, PostgreSQL, API Development, Kubernetes&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="seam--cicd-and-gitlab-integration">Seam / CI/CD and GitLab Integration&lt;/h3>
&lt;p>The proposed work includes setting up &lt;strong>GitLab CI/CD pipelines&lt;/strong> for automated &lt;strong>testing, deployment&lt;/strong>, and &lt;strong>maintenance&lt;/strong> of the Python library, Kubernetes resources, and web dashboard. Students will automate the deployment of &lt;strong>P4 programs&lt;/strong>, &lt;strong>Kubernetes deployments&lt;/strong>, and &lt;strong>networking configurations&lt;/strong>. They will also focus on &lt;strong>unit testing, integration testing&lt;/strong>, and the &lt;strong>automation of benchmarking experiments&lt;/strong> to ensure reproducibility of results.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> CI/CD, GitLab, Python, Kubernetes, DevOps, Testing, Automation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> GitLab CI/CD, Python, Kubernetes, Docker, Automation, Testing, Benchmarking&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="seam--networking--smartnic-programming">Seam / Networking &amp;amp; SmartNIC Programming&lt;/h3>
&lt;p>The proposed work includes writing &lt;strong>P4 programs&lt;/strong> to control network traffic flow, enforce network security policies, and optimize data transfer across the Kubernetes cluster. Students will gain experience with &lt;strong>SmartNICs&lt;/strong> (Xilinx Alveo U55C, SN1000, NVIDIA Bluefield 2) and &lt;strong>Tofino switches&lt;/strong>, using P4 to write &lt;strong>network policies&lt;/strong> and integrate with the &lt;strong>Kubernetes network layer&lt;/strong> (Multus, Calico). Students will also explore &lt;strong>gRPC APIs&lt;/strong> for dynamically adjusting network policies and provisioning virtual network interfaces in real time.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Networking, P4 Programming, SmartNICs, Kubernetes Networking, Cloud Computing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> P4, Networking, SmartNICs, Kubernetes Networking, Multus, Calico, gRPC&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohammad-firas-sada/">Mohammad Firas Sada&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/thomas-a.-defanti/">Thomas A. DeFanti&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jeffrey-weekley/">Jeffrey Weekley&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>WaDAR</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wadar/</link><pubDate>Wed, 05 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/wadar/</guid><description>&lt;p>&lt;a href="https://github.com/jlab-sensing/wadar" target="_blank" rel="noopener">WaDAR&lt;/a> (Water Radar) is an innovative, low-cost, hybrid approach to soil moisture sensing that combines the benefits of in-ground (in situ) and remote sensing technologies. Traditional soil moisture measurement methods suffer from drawbacks: in situ sensors are expensive and difficult to maintain, while remote sensing offers lower accuracy and resolution. WaDAR bridges this gap by using inexpensive underground backscatter tags paired with above-ground radars, enabling completely wireless, high-resolution soil moisture monitoring.&lt;/p>
&lt;h2 id="key-features-of-wadar">Key Features of WaDAR&lt;/h2>
&lt;ul>
&lt;li>Uses &lt;strong>RF backscatter tags&lt;/strong> buried underground to provide high-accuracy soil moisture readings.&lt;/li>
&lt;li>Uses &lt;strong>ultra-wideband radar&lt;/strong> for above-ground sensing.&lt;/li>
&lt;li>Offers an average error of just 1.4%, comparable to state-of-the-art commercial sensors.&lt;/li>
&lt;li>Reduces deployment costs significantly, making it accessible for widespread agricultural use.&lt;/li>
&lt;li>Supports real-time, scalable, and maintenance-free soil moisture monitoring for farmers.&lt;/li>
&lt;/ul>
&lt;h3 id="improving-and-optimizing-data-processing-pipeline-for-more-accurate-soil-moisture-measurements">Improving and Optimizing Data Processing Pipeline for More Accurate Soil Moisture Measurements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Digital Signal Processing&lt;/code> &lt;code>Machine Learning&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/embedded, signal processing, machine learning, MATLAB (optional)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vetha/">Eric Vetha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Enhance the accuracy of soil moisture measurements by refining the data processing pipeline.&lt;/p>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Develop and test algorithms for noise reduction and signal improvement.&lt;/li>
&lt;li>Implement advanced filtering and statistical techniques to improve measurement precision.&lt;/li>
&lt;li>Validate improvements using real-world field data.&lt;/li>
&lt;li>Translate the algorithms into embedded code for real-time implementation on embedded hardware.&lt;/li>
&lt;/ul>
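As a starting point for the noise-reduction task above, a simple sketch (plain NumPy, not WaDAR's actual pipeline) might combine a median filter for impulse spikes with a moving average for broadband smoothing:

```python
import numpy as np

def median_filter(signal, k=3):
    """Suppress impulse noise: replace each sample with the median of its k-sample window."""
    pad = k // 2
    padded = np.pad(signal, pad, mode="edge")
    return np.array([np.median(padded[i:i + k]) for i in range(len(signal))])

def moving_average(signal, k=5):
    """Smooth residual noise with a k-sample moving average."""
    kernel = np.ones(k) / k
    return np.convolve(signal, kernel, mode="same")

def denoise(signal):
    """Median filter first (spikes), then moving average (broadband noise)."""
    return moving_average(median_filter(signal))
```

A real pipeline would likely replace these with matched or adaptive filters tuned to the radar waveform, but the two-stage structure is a common baseline.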
&lt;h3 id="improving-backscatter-tag-pcb">Improving Backscatter Tag PCB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Hardware Design&lt;/code> &lt;code>Signal Processing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> PCB design, RF knowledge&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eric-vetha/">Eric Vetha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Enhance the performance of WaDAR&amp;rsquo;s backscatter tags by optimizing PCB design for improved signal-to-noise ratio (SNR) and implementing a communication protocol for tag identification.&lt;/p>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Redesign PCB for improved readings.&lt;/li>
&lt;li>Implement and test a communication protocol to distinguish between multiple tags.&lt;/li>
&lt;li>Evaluate hardware changes in real-world field conditions.&lt;/li>
&lt;li>Optimize power consumption and scalability for practical deployment.&lt;/li>
&lt;/ul></description></item><item><title>Mediglot</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/polyphy/</link><pubDate>Tue, 04 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>. Recent projects, such as &lt;a href="https://github.com/PolyPhyHub/PolyGlot" target="_blank" rel="noopener">Polyglot&lt;/a> and &lt;a href="https://github.com/Ayush-Sharma410/MediGlot" target="_blank" rel="noopener">Mediglot&lt;/a> have focused on using PolyPhy to better visualize language embeddings.&lt;/p>
&lt;h3 id="medicinal-language-embeddings">Medicinal Language Embeddings&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Large Language Models&lt;/code> &lt;code>NLP&lt;/code> &lt;code>Embeddings&lt;/code> &lt;code>Medicine&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, JavaScript, Data Science, Technical Communication&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:kdeol@ualberta.ca">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to refine and enhance Mediglot, a web application for visualizing 3D medicinal embeddings, which extends the Polyglot app and leverages the PolyPhy toolkit for network-inspired data science. Mediglot currently enables users to explore high-dimensional vector representations of medicines (derived from their salt compositions) in a 3D space using UMAP, as well as analyze similarity through the innovative Monte-Carlo Physarum Machine (MCPM) metric. Unlike traditional language data, medicinal embeddings do not have an inherent sequential structure. Instead, we must work with the salt compositions of each medicine to create embeddings that are faithful to the intended purpose of each medicine.&lt;/p>
&lt;p>This year, we would like to focus on exploring and integrating state-of-the-art AI techniques and algorithms to improve Mediglot&amp;rsquo;s clustering capabilities and its representation of medicinal data in 3D. The contributor will experiment with advanced large language models (LLMs) and cutting-edge AI methods to develop innovative approaches for refining clustering and extracting deeper insights from medicinal embeddings. Beyond LLMs, we would like to experiment with more traditional language processing methods to design novel embedding procedures. Additionally, we would like to experiment with other similarity metrics. While the similarity of two medicines depends on the initial embedding, we would like to examine the effects of different metrics on the kinds of insights a user can extract. Finally, the contributor is expected to evaluate and compare different algorithms for dimensionality reduction to enhance the faithfulness of the visualization and its interpretability.&lt;/p>
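To illustrate the metric-comparison idea with a toy sketch (made-up 3-D vectors, not Mediglot's MCPM metric or real medicinal embeddings), cosine and Euclidean measures can rank the same embeddings differently, which is exactly the kind of effect on user insights worth examining:

```python
import numpy as np

def cosine_similarity(a, b):
    """Angle-based similarity: ignores vector magnitude."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Magnitude-sensitive distance between embeddings."""
    return float(np.linalg.norm(a - b))

# Hypothetical 3-D embeddings of two medicines and a query.
query = np.array([1.0, 0.0, 0.0])
med_a = np.array([2.0, 0.0, 0.0])  # same direction as the query, larger magnitude
med_b = np.array([0.9, 0.4, 0.0])  # slightly rotated, similar magnitude

# Cosine ranks med_a as identical in direction to the query,
# while Euclidean distance rates med_b as the closer neighbor.
```

Which ranking is "right" depends on whether embedding magnitude carries meaning, so the choice of metric should be evaluated against the embedding procedure.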
&lt;p>The ideal contributor for this project has experience with Python (and common scientific toolkits such as NumPy, Pandas, SciPy). They will also need some experience with JavaScript and web development (MediGlot is distributed as a vanilla JS web app). Knowledge of embedding techniques for language processing is highly recommended.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Closely work with the mentors to understand the context of the project and its detailed requirements in preparation for the proposal.&lt;/li>
&lt;li>Become acquainted with the tooling (PolyPhy, PolyGlot, Mediglot) prior to the start of the project period.&lt;/li>
&lt;li>Explore different embedding techniques for medicinal data (including implementing novel embedding procedures).&lt;/li>
&lt;li>Explore different dimensionality reduction techniques, with a focus on faithful visualizations.&lt;/li>
&lt;li>Document the process and resulting findings in a publicly available report.&lt;/li>
&lt;/ul>
&lt;h3 id="enhancing-polyphy-web-application">Enhancing PolyPhy Web Application&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>UI/UX Design&lt;/code> &lt;code>Full Stack Development&lt;/code> &lt;code>JavaScript&lt;/code> &lt;code>Next.js&lt;/code> &lt;code>Node.js&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Full Stack Web Development, UI/UX Design, JavaScript, Next.js, Node.js, Technical Communication&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:kdeol@ualberta.ca">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to revamp and enhance the PolyPhy web platform to better support contributors, users, and researchers. The goal is to optimize the website’s UI/UX, improve its performance, and integrate Mediglot to provide users with a seamless experience in visualizing both general network structures and 3D medicinal embeddings.&lt;/p>
&lt;p>The contributor will be responsible for improving the website’s overall look, feel, and functionality, ensuring a smooth and engaging experience for both contributors and end-users. This includes addressing front-end and back-end challenges, optimizing the platform for better accessibility, and ensuring seamless integration with Mediglot.&lt;/p>
&lt;p>The ideal candidate should have experience in full-stack web development, particularly with &lt;strong>Next.js&lt;/strong>, &lt;strong>JavaScript&lt;/strong>, and &lt;strong>Node.js&lt;/strong>, and should be familiar with UI/UX design principles. A strong ability to communicate effectively, both in writing and through code, is essential for this role.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Collaborate with mentors&lt;/strong> to understand the project&amp;rsquo;s goals and the specific requirements for the website improvements.&lt;/li>
&lt;li>&lt;strong>UI/UX Redesign&lt;/strong>:
&lt;ul>
&lt;li>Redesign and enhance the website’s navigation, layout, and visual elements to create an intuitive and visually engaging experience.&lt;/li>
&lt;li>Improve mobile responsiveness for broader accessibility across devices.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Website Performance &amp;amp; Stability&lt;/strong>:
&lt;ul>
&lt;li>Identify and resolve performance bottlenecks, bugs, or issues affecting speed, stability, and usability.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Mediglot Integration&lt;/strong>:
&lt;ul>
&lt;li>Integrate the Mediglot web application with PolyPhy, ensuring seamless functionality and a unified user experience for visualizing medicinal data alongside general network reconstructions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation&lt;/strong>:
&lt;ul>
&lt;li>Document the development process, challenges, and solutions in a clear and organized manner, ensuring transparent collaboration with mentors and the community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol></description></item><item><title>Type Narrowing: A Language Design Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uutah/type-narrowing/</link><pubDate>Sat, 01 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uutah/type-narrowing/</guid><description>&lt;p>Untyped languages such as JavaScript and Python provide a flexible starting
point for software projects, but eventually, the lack of reliable types
makes code hard to debug and maintain.
Gradually typed languages such
as
&lt;a href="https://www.typescriptlang.org/" target="_blank" rel="noopener">TypeScript&lt;/a>,
&lt;a href="https://flow.org/" target="_blank" rel="noopener">Flow&lt;/a>,
&lt;a href="https://www.mypy-lang.org/" target="_blank" rel="noopener">Mypy&lt;/a>,
and
&lt;a href="https://microsoft.github.io/pyright/#/" target="_blank" rel="noopener">Pyright&lt;/a>
address the problem with type checkers that can reason about an
ever-growing subset of untyped code.
Widening the subset with precise types is an ongoing challenge.&lt;/p>
&lt;p>Furthermore, designs for precise gradual types need to be reproducible
across languages.
Ideas that work well in one language need to be validated
in other contexts in a principled, scientific way to separate
deep insights from language-specific hacks.&lt;/p>
&lt;p>Type narrowing is a key feature of gradual languages.
Narrowing uses type tests in code to refine types and push
information forward along the paths that the program may follow.
For example, when a type test checks an object field, later
code can trust the type of the field:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">// item :: JSON Object
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">if typeof(item[&amp;#34;price&amp;#34;] == &amp;#34;number&amp;#34;):
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // item :: JSON Object,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // where field &amp;#34;price&amp;#34; :: Number
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> return item[&amp;#34;price&amp;#34;] + (item[&amp;#34;price&amp;#34;] * 0.30) // add tax
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Nearly every gradual language agrees that &lt;em>some form&lt;/em> of type narrowing is needed,
but there is widespread disagreement about how much support is enough.
TypeScript lets users define custom type tests, but it does not analyze
those tests to see whether they are reliable.
Flow does analyze tests.
TypeScript does not allow asymmetric type tests (example: &lt;code>is_even_number&lt;/code>),
but Flow, Mypy and Pyright all do!
None of the above track information compositionally through program
execution, but another gradual language called Typed Racket does.
Is the extra machinery in Typed Racket really worth the effort?&lt;/p>
&lt;p>Over the past several months, we have curated a language design
benchmark for type narrowing, &lt;strong>If-T&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/utahplt/ift-benchmark" target="_blank" rel="noopener">https://github.com/utahplt/ift-benchmark&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The benchmark presents type system challenges in a language-agnostic way
to facilitate reproducibility across languages.
It also includes a &lt;a href="https://github.com/utahplt/ifT-benchmark/blob/main/DATASHEET.md" target="_blank" rel="noopener">&lt;em>datasheet&lt;/em>&lt;/a>
to encourage cross-language comparisons
that focus on fundamental typing features rather than incidental differences
between languages.
So far, we have implemented the benchmark for five gradual languages.
There are many others to explore, and much more to learn.&lt;/p>
&lt;p>The goal of this project is to replicate and extend the If-T type narrowing
benchmark.
Outcomes include a deep understanding of principled type narrowing,
and of how to construct a benchmark that enables reproducible
cross-language comparisons.&lt;/p>
&lt;p>Related Work:&lt;/p>
&lt;ul>
&lt;li>Type Narrowing in TypeScript
&lt;a href="https://www.typescriptlang.org/docs/handbook/2/narrowing.html" target="_blank" rel="noopener">https://www.typescriptlang.org/docs/handbook/2/narrowing.html&lt;/a>&lt;/li>
&lt;li>Type Narrowing in Python
&lt;a href="https://typing.readthedocs.io/en/latest/spec/narrowing.html#typeguard" target="_blank" rel="noopener">https://typing.readthedocs.io/en/latest/spec/narrowing.html#typeguard&lt;/a>&lt;/li>
&lt;li>Logical Types for Untyped Languages
&lt;a href="https://doi.org/10.1145/1863543.1863561" target="_blank" rel="noopener">https://doi.org/10.1145/1863543.1863561&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="evaluate-new-gradual-languages">Evaluate New Gradual Languages&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>benchmark implementation&lt;/code>, &lt;code>programming languages&lt;/code>, &lt;code>types&lt;/code>&lt;/li>
&lt;li>Skills: Ruby, Lua, Python, Clojure, or PHP&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Small&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Bring the If-T Benchmark to new typecheckers.
Examples include
&lt;a href="https://sorbet.org/" target="_blank" rel="noopener">Sorbet&lt;/a>,
&lt;a href="https://hacklang.org/" target="_blank" rel="noopener">Hack&lt;/a>,
&lt;a href="https://luau.org/" target="_blank" rel="noopener">Luau&lt;/a>,
&lt;a href="https://pyre-check.org/" target="_blank" rel="noopener">Pyre&lt;/a>,
&lt;a href="https://github.com/facebookincubator/cinder" target="_blank" rel="noopener">Cinder / Static Python&lt;/a>,
&lt;a href="https://typedclojure.org/" target="_blank" rel="noopener">Typed Clojure&lt;/a>,
and
(potentially) &lt;a href="https://elixir-lang.org/blog/2024/06/12/elixir-v1-17-0-released/" target="_blank" rel="noopener">Elixir&lt;/a>.
Conduct a scientific, cross-language analysis to discuss the implications
of benchmark results.&lt;/p>
&lt;h3 id="do-unsound-narrowings-lead-to-exploits">Do Unsound Narrowings Lead to Exploits?&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>corpus study&lt;/code>, &lt;code>types&lt;/code>, &lt;code>counterexamples&lt;/code>&lt;/li>
&lt;li>Skills: TypeScript or Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Small&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Investigate type narrowing in practice through a corpus study of software projects.
Use the GitHub or Software Heritage APIs to search code for user-defined predicates
and other instances of narrowing. Search for vulnerabilities due to the unsound
typing of user-defined predicates.&lt;/p></description></item><item><title>Environmental NeTworked Sensor (ENTS)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/</link><pubDate>Fri, 31 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/</guid><description>&lt;h3 id="ents-i-web-portal-for-large-scale-sensor-networks">ENTS I: Web portal for large-scale sensor networks&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Data Visualization Dashboard" srcset="
/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp 400w,
/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_baae6484e015277af7b09e866b6869f5.webp 760w,
/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp"
width="760"
height="759"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Backend, Frontend, UI/UX, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> React, Javascript, Python, SQL, Git&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> Flask, Docker, CI/CD, AWS, Authentication&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:alevy1@ucsc.edu">Alec Levy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Environmental NeTworked Sensor (ENTS) platform, formerly the Open Sensing Platform (OSP), implements a data visualization website for monitoring microbial fuel cell sensors (see &lt;a href="https://github.com/jlab-sensing/DirtViz" target="_blank" rel="noopener">GitHub&lt;/a>). The mission is to scale up the current platform to support other researchers and citizen scientists in integrating their novel sensing hardware or microbial fuel cell sensors for monitoring and data analysis. Examples of sensors currently deployed include those measuring soil moisture, temperature, current, and voltage in outdoor settings. The software half of the project focuses on building upon our existing visualization web platform and adding features to support this mission. A live version of the website is available &lt;a href="https://dirtviz.jlab.ucsc.edu/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Below is a list of project ideas that would be beneficial to the ENTS project. You are not limited to the following projects, and we encourage new ideas that enhance the platform:&lt;/p>
&lt;ul>
&lt;li>Improve streaming functionality&lt;/li>
&lt;li>Generic interface for sensor measurements&lt;/li>
&lt;li>Logger registration&lt;/li>
&lt;li>Over the air (OTA) configuration updates&lt;/li>
&lt;li>Implement unit tests and API documentation&lt;/li>
&lt;/ul>
&lt;h3 id="ents-ii-hardware-to-for-large-scale-field-sensor-networks">ENTS II: Hardware to for large-scale field sensor networks&lt;/h3>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Hardware" srcset="
/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_b1317e5e84a756a1081cbeec0e17af86.webp 400w,
/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_2fc59e21c5096f7f08aea36f5769242e.webp 760w,
/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ents/featured_huecd1356655ddd10d106d2d602a359510_6281233_b1317e5e84a756a1081cbeec0e17af86.webp"
width="760"
height="460"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Embedded system, wireless communication, low-power remote sensing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> C/C++, Git, GitHub, PlatformIO&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> STM32 HAL, ESP32 Arduino, protobuf, Python, knowledge of standard communication protocols (I2C, SPI, and UART)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:jlin143@ucsc.edu">Jack Lin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Environmental NeTworked Sensor (ENTS) node aims to be a general purpose hardware platform for outdoor sensing (e.g. agriculture, ecological monitoring, etc.). The typical use case involves a sensor deployment in an agricultural field, remotely uploading measurements without interfering with farming operations. The current hardware revision (&lt;a href="https://github.com/jlab-sensing/soil_power_sensor" target="_blank" rel="noopener">Soil Power Sensor&lt;/a>) was originally designed for monitoring the power output of microbial fuel cells using high-fidelity voltage and current measurement channels, as well as auxiliary sensors such as the SDI-12 &lt;a href="https://metergroup.com/products/teros-21/" target="_blank" rel="noopener">TEROS-21 soil moisture sensor&lt;/a>. The primary activities of this project will involve low-level firmware design and implementation, but may also incorporate hardware design revisions if necessary. We are looking to expand functionality to other external sensors, as well as optimize for power consumption, via significant firmware design activities.&lt;/p>
&lt;p>Long-range, low-power wireless communication is achieved through a LoRa-capable STM32 microcontroller, with in-lab experiments using an ESP32 microcontroller to enable the simpler WiFi interface. Both wireless interfaces upload measurements to our data visualization dashboard, &lt;strong>ENTS I&lt;/strong>. The combined goal across both of these projects is to create a system that enables researchers to test and evaluate novel sensing solutions. We want to make the device usable by a wide range of researchers who may not have a background in electronics, so we are interested in design activities that enhance user friendliness.&lt;/p>
&lt;p>In total there will be 2-4 people working on the hardware with progress being tracked on GitHub. Broader project planning is tracked through a Jira board. We intend to have weekly meetings to provide updates on current issue progress along with assigning tasks. Please reach out to &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a> if there are any questions or specific ideas for the project.&lt;/p>
&lt;p>Below is a list of project ideas that would benefit the ENTS project. You are not limited to these ideas, and we encourage new ones that enhance the platform:&lt;/p>
&lt;ul>
&lt;li>Backup logging via SD card&lt;/li>
&lt;li>I2C multiplexing for multiple sensors of the same type&lt;/li>
&lt;li>Batch sensor measurement uploading&lt;/li>
&lt;/ul></description></item><item><title>Causeway: Scaling Experiential Learning Through Micro-Roles</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/causeway/</link><pubDate>Thu, 30 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/causeway/</guid><description>&lt;p>&lt;a href="https://causeway.web.app" target="_blank" rel="noopener">Causeway&lt;/a> is a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack. Most online coding tutorials focus on covering the technical syntax or features of a language or framework, which means that new developers don’t have great resources for building a holistic picture of how everything they learn connects to actually developing a complex web application. Causeway breaks down the process of developing a web application into a hierarchy of micro-roles which provides learners with a clear pathway for learning that also translates to a clear process for developing an application. In the longer future, this would also enable learners to easily contribute to projects as they learn through taking on micro-roles for yet-to-be-developed projects. The platform uses the &lt;a href="https://developer.stackblitz.com/platform/api/webcontainer-api" target="_blank" rel="noopener">Stackblitz WebContainer API&lt;/a> to run full applications in the browser for interactive learning.&lt;/p>
&lt;p>Thus far, we have developed a version of the platform that walks learners through the process of developing UI components of a web application as well as containers that contain multiple UI components and are responsible for fetching data from the backend and handling events and updates to the database. We&amp;rsquo;d like to extend the content to cover defining the database schema and entire applications, and to other topics beyond web development like AI/ML. We&amp;rsquo;d like to add quizzes to the experience and explore ways to use Generative AI to augment the learning experience, e.g. to support planning, reflection, and assessment. Finally, we&amp;rsquo;d like to instrument the application with logs and analytics so we can better measure impact and learning outcomes, and develop a stronger CI/CD pipeline.&lt;/p>
&lt;h3 id="causeway--improving-the-core-infrastructure">Causeway / Improving the Core Infrastructure&lt;/h3>
&lt;p>The proposed work includes adding logging, analytics, and a production-level CI/CD pipeline, adding a robust testing framework, and refactoring some of our code into separate modules. Both roles will also contribute to running usability studies and documenting the platform.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="causeway--quizzes-and-generative-ai">Causeway / Quizzes and Generative AI&lt;/h3>
&lt;p>The proposed work includes extending the application to support quizzes, adding quizzes for the existing tasks, and exploring the use of generative AI to support the quizzes feature. Both roles will also contribute to running usability studies and documenting the platform.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase, Generative AI&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for Chip Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/openroad/openroad/</link><pubDate>Sun, 19 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/openroad/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit project, originally funded by DARPA with the aim of creating open-source EDA tools; an Autonomous flow from RTL-GDSII that completes &amp;lt; 24 hrs, to lower cost and boost innovation in IC design. This project is now supported by &lt;a href="precisioninno.com">Precision Innovations&lt;/a>.&lt;/p>
&lt;p>OpenROAD scales massively, supports education and workforce development (EWD), and sustains a broad ecosystem, making it a vital tool for the rapidly growing Semiconductor Industry.&lt;/p>
&lt;p>OpenROAD is the fastest onramp for gaining knowledge and skills and creating pathways to great career opportunities in chip design. You will develop important software and hardware design skills by contributing to these interesting projects. You will also have the opportunity to work with mentors from the OpenROAD project and other industry experts.&lt;/p>
&lt;p>We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact in the rapidly growing, global Semiconductor Industry.&lt;/p>
&lt;h3 id="improving-code-quality-in-openroad">Improving Code Quality in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Coding Best Practices in C++&lt;/code>, &lt;code>Code Quality Tooling&lt;/code>, &lt;code>Continuous Integration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arthur-koucher/">Arthur Koucher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>OpenROAD is a large and complex program. This project is to improve the code quality through resolving issues flagged by tools like Coverity and clang-tidy. New tools like the clang sanitizers ASAN/TSAN/UBSAN should also be set up and integrated with the Jenkins CI.&lt;/p>
&lt;h3 id="gui-testing-in-openroad">GUI Testing in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Testing&lt;/code>, &lt;code>Continuous Integration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Qt&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peter-gadfort/">Peter Gadfort&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The OpenROAD GUI is a crucial piece of functionality that lets users see and investigate their designs. GUI testing is specialized and rather different from standard unit testing, so the GUI needs improved test coverage of both interaction and rendering. The GUI uses the Qt framework. An open-source testing tool like &lt;a href="https://github.com/faaxm/spix" target="_blank" rel="noopener">Spix&lt;/a> will be set up and key tests developed, providing the framework for all future GUI testing.&lt;/p>
&lt;h3 id="rectilinear-floorplans-in-openroad">Rectilinear Floorplans in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Electronic Design Automation&lt;/code>, &lt;code>Algorithms&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, data structures and algorithms&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eder-monteiro/">Eder Monteiro&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/augusto-berndt/">Augusto Berndt&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>OpenROAD supports block floorplans that are rectangular in shape. Some designs may require more complex shapes to fit. This project extends the tool to support rectilinear polygon shapes as floorplans. This will require upgrading data structures and algorithms in various parts of OpenROAD including floor plan generation, pin placement, and global placement.&lt;/p>
&lt;h3 id="lef-reader-and-database-enhancements-in-openroad">LEF Reader and Database Enhancements in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Electronic Design Automation&lt;/code>, &lt;code>Database&lt;/code>, &lt;code>Parsing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Boost Spirit parsers, Database, C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/osama-hammad/">Osama Hammad&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ethan-mahintorabi/">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>LEF (Library Exchange Format) is a standard format for describing physical design rules for integrated circuits. OpenROAD supports many LEF constructs, but some newer ones for advanced process nodes are not yet supported. This project is to parse such information and store it in OpenDB for use by the rest of the tool.&lt;/p>
&lt;h3 id="orassistant---llm-data-engineering-and-testing">ORAssistant - LLM Data Engineering and Testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Data Engineering&lt;/code>, &lt;code>Model Deployment&lt;/code>, &lt;code>Testing&lt;/code>, &lt;code>Full-Stack Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, database, evaluation, CI/CD, open-source or related software development, full-stack&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to enhance the robustness and accuracy of &lt;a href="https://woset-workshop.github.io/PDFs/2024/11_ORAssistant_A_Custom_RAG_ba.pdf" target="_blank" rel="noopener">OR Assistant&lt;/a>, the &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">conversational assistant for OpenROAD&lt;/a>, through comprehensive testing and evaluation. You will work with members of the OpenROAD team and other researchers to expand the existing dataset to cover a wide range of use cases and deliver accurate responses more efficiently. This project focuses on data engineering and benchmarking, and you will collaborate with the tandem LLM model engineering project. Tasks include (non-exhaustively): creating evaluation pipelines, building databases to gather feedback, improving CI/CD, writing documentation, and improving the backend and frontend services as needed. You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.&lt;/p>
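&lt;p>As a hedged sketch of one component such an evaluation pipeline might contain, the snippet below scores generated answers against reference answers by token-level F1, a metric commonly used for question-answering benchmarks; the dataset fields are illustrative assumptions, not ORAssistant&amp;rsquo;s actual schema:&lt;/p>

```python
# Token-level F1 scoring of generated answers against references.
# The example dataset structure ("question"/"reference" keys) is an
# assumption for illustration, not ORAssistant's real evaluation format.
from collections import Counter


def token_f1(prediction, reference):
    """F1 over whitespace tokens, as in common QA evaluation scripts."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    pred_counts, ref_counts = Counter(pred), Counter(ref)
    # number of tokens shared between prediction and reference
    overlap = sum(min(pred_counts[t], ref_counts[t]) for t in pred_counts)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(examples, answer_fn):
    """Mean F1 of answer_fn over a list of {"question", "reference"} dicts."""
    scores = [token_f1(answer_fn(ex["question"]), ex["reference"])
              for ex in examples]
    return sum(scores) / len(scores)
```

&lt;p>In a fuller pipeline, scores like these would be logged per dataset category and tracked in CI so that regressions in answer quality are caught automatically.&lt;/p>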
&lt;h3 id="orassistant---llm-model-engineering">ORAssistant - LLM Model Engineering&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Model Architecture&lt;/code>, &lt;code>Model Deployment&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, prompt engineering, fine-tuning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project is aimed at enhancing robustness and accuracy for &lt;a href="https://woset-workshop.github.io/PDFs/2024/11_ORAssistant_A_Custom_RAG_ba.pdf" target="_blank" rel="noopener">OR Assistant&lt;/a>, the &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">conversational assistant for OpenROAD&lt;/a> through enhanced model architectures. You will work with members of the OpenROAD team and other researchers to explore alternate architectures beyond the existing RAG-based implementation. This project will focus on improving reliability and accuracy of the existing model architecture. You will collaborate on a tandem project on data engineering for OR assistant. Tasks include: reviewing and understanding the state-of-the-art in retrieval augmented generation, implementing best practices, caching prompts, improving relevance and accuracy metrics, writing documentation and improving the backend and frontend services as needed (non-exhaustive). You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.&lt;/p></description></item><item><title>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</link><pubDate>Wed, 15 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene expression generation, retrieval-augmented generation, large models&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python, and familiarity with machine learning libraries such as PyTorch.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with spatial transcriptomics datasets and statistical modeling.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong>
&lt;ul>
&lt;li>Understanding of vision models, retrieval-based systems, and MLP architectures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong>
&lt;ul>
&lt;li>Familiarity with scRNA-seq data integration and computational biology tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating RAG models, building a robust database, and ensuring interpretable predictions, this project involves substantial computational and data preparation work.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial transcriptomics (ST) is a revolutionary technology that provides spatially resolved gene expression measurements, enabling researchers to study cellular behaviour within tissues with unprecedented detail. This technology has transformed our understanding of complex biological systems, such as disease progression, tissue development, and cellular heterogeneity. However, the widespread adoption of ST is limited by its high cost and technical requirements.&lt;/p>
&lt;p>Histology imaging, on the other hand, is far more accessible and cost-effective. If gene expression could be accurately predicted from histology images, it would enable researchers to leverage these abundant images for high-resolution biological insights without the need for expensive spatial transcriptomics experiments. This task has immense potential to democratize spatial transcriptomics research and significantly reduce costs.&lt;/p>
&lt;h3 id="challenges-in-current-approaches">&lt;strong>Challenges in Current Approaches&lt;/strong>&lt;/h3>
&lt;p>Current methods for predicting gene expression from histology images typically involve:&lt;/p>
&lt;ol>
&lt;li>Using large vision models to encode histology image patches into embeddings.&lt;/li>
&lt;li>Employing Multi-Layer Perceptrons (MLPs) to map these embeddings to gene expression profiles.&lt;/li>
&lt;/ol>
&lt;p>While these approaches have shown promise, they suffer from two critical limitations:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Accuracy&lt;/strong>: The MLP-based mappings often fail to fully capture the biological complexity encoded in the histology images, leading to suboptimal predictions.&lt;/li>
&lt;li>&lt;strong>Interpretability&lt;/strong>: These models act as black boxes, providing no insight into the underlying biological rationale for the predictions. Researchers cannot determine why a specific gene expression profile was generated, limiting trust and utility in biological contexts.&lt;/li>
&lt;/ul>
&lt;h3 id="project-motivation">&lt;strong>Project Motivation&lt;/strong>&lt;/h3>
&lt;p>To overcome these limitations, this project proposes a novel &lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> framework for spatial transcriptomics. Instead of relying solely on black-box MLPs, RAG-ST will:&lt;/p>
&lt;ul>
&lt;li>Retrieve relevant examples from a curated database of paired histology images, scRNA-seq data, and gene expression profiles.&lt;/li>
&lt;li>Use these retrieved examples to inform and enhance the generation process, resulting in predictions that are both more accurate and biologically interpretable.&lt;/li>
&lt;/ul>
&lt;p>This approach not only grounds predictions in biologically meaningful data but also provides transparency by revealing which database entries influenced the results.&lt;/p>
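&lt;p>The retrieve-then-generate idea can be sketched in a few lines of Python. This is a toy illustration under assumed array shapes, using a similarity-weighted blend of retrieved profiles where the real RAG-ST framework would use a learned generator:&lt;/p>

```python
# Toy sketch of retrieval-augmented prediction for spatial transcriptomics.
# Array shapes and the softmax-weighted blend are illustrative assumptions;
# RAG-ST itself would condition a learned generator on the retrieved examples.
import numpy as np


def retrieve_and_blend(query_emb, db_embs, db_expr, k=3):
    """query_emb: (d,) patch embedding; db_embs: (n, d) database embeddings;
    db_expr: (n, g) paired gene-expression profiles."""
    # cosine similarity between the query and every database embedding
    q = query_emb / np.linalg.norm(query_emb)
    d = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(sims)[-k:]          # indices of the k most similar entries
    weights = np.exp(sims[top])
    weights = weights / weights.sum()    # softmax over retrieved neighbors
    prediction = weights @ db_expr[top]  # similarity-weighted expression profile
    return prediction, top
```

&lt;p>Returning the retrieved indices alongside the prediction is what makes the output interpretable: a researcher can inspect exactly which database entries influenced a given expression profile.&lt;/p>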
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Database Construction&lt;/strong>:
&lt;ul>
&lt;li>Curate a large and diverse database of histology images paired with scRNA-seq and gene expression data.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Model Development&lt;/strong>:
&lt;ul>
&lt;li>Develop a RAG framework combining vision-based encoders and retrieval-enhanced generation techniques.&lt;/li>
&lt;li>Incorporate interpretability mechanisms to link predicted gene expressions to retrieved examples.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Benchmarking&lt;/strong>:
&lt;ul>
&lt;li>Assess RAG-ST against state-of-the-art methods, focusing on accuracy, interpretability, and biological validity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Curated Database&lt;/strong>:
&lt;ul>
&lt;li>A publicly available, well-documented database of histology images and gene expression profiles.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>RAG-ST Framework&lt;/strong>:
&lt;ul>
&lt;li>An open-source Python implementation of the RAG-ST model, with retrieval, generation, and visualization tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark Results&lt;/strong>:
&lt;ul>
&lt;li>Comprehensive evaluations demonstrating the benefits of RAG-ST over conventional pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>:
&lt;ul>
&lt;li>User-friendly guides to facilitate adoption by the spatial transcriptomics research community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>By integrating retrieval-augmented generation with large models, RAG-ST represents a paradigm shift in spatial transcriptomics. It offers a cost-effective, accurate, and interpretable solution for gene expression prediction, democratizing access to high-quality spatial transcriptomic insights and fostering advancements in biological research.&lt;/p>
&lt;hr></description></item><item><title>FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</link><pubDate>Mon, 03 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fep_bench/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/lihaowen-jayce-zhu/">Lihaowen (Jayce) Zhu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet often underappreciated phase, consuming approximately 80% of the time in common ML tasks. This extensive time consumption can be attributed to various challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data from data lakes, which are storage repositories that hold a vast amount of raw data in its native format. However, the process of extracting this data can be slow, causing computation cycles to wait for data arrival and leading to delays in the entire preprocessing phase. Furthermore, the size of the data often exceeds the memory capacity of standard computing systems. This is a frequent occurrence in ML, as datasets are typically large and complex. Handling such large datasets requires sophisticated memory management techniques to ensure efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially aggregation, often leads to inefficiencies. These operations may require grouping a large chunk of data as a prerequisite before performing any actual computation. This grouping, without careful configuration and management, can trigger serious data shuffling, leading to extensive remote data movement when the data is distributed across various storage systems. Such data movement is not only time-consuming but also resource-intensive.&lt;/p>
&lt;p>To mitigate these challenges, there is a pressing need to design better caching, prefetching, and heuristic strategies for data preprocessing. The team aims to significantly reduce the time and resources required for preprocessing by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic understanding of the preprocessing workflow is essential. Hence, throughout the program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance of these operations in the existing frameworks under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or evaluation.&lt;/li>
&lt;/ul>
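&lt;p>As a minimal, hedged example of the kind of measurement the benchmarking step involves, the snippet below times a group-by aggregation in pandas, the single-machine baseline mentioned in the deliverables; the column names and data sizes are made up for illustration:&lt;/p>

```python
# Micro-benchmark of one common preprocessing operation (group-by mean)
# in pandas. Synthetic data; real benchmarks would use the collected
# datasets and operations, and vary sizes and frameworks systematically.
import time

import numpy as np
import pandas as pd


def benchmark_groupby(n_rows=1_000_000, n_groups=1_000, seed=0):
    """Time a mean aggregation grouped on a synthetic key column."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "key": rng.integers(0, n_groups, size=n_rows),
        "value": rng.random(size=n_rows),
    })
    start = time.perf_counter()
    agg = df.groupby("key")["value"].mean()
    elapsed = time.perf_counter() - start
    return agg, elapsed
```

&lt;p>Repeating such measurements while varying data size, group cardinality, and framework (e.g. pandas versus Spark) yields the comparison matrix the packaged benchmark should capture.&lt;/p>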
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A rolodex of commonly used datasets with their corresponding preprocessing operations and expected output formats/types&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets with a single-machine preprocessing framework like pandas&lt;/li>
&lt;li>A Chameleon Trovi package that preprocesses the datasets in an existing distributed computation framework like Hadoop or Spark&lt;/li>
&lt;/ul></description></item><item><title>(Re)Evaluating Artifacts for Understanding Resource Artifacts</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/reevaluating/</link><pubDate>Wed, 20 Mar 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/reevaluating/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Virtualization, Containerization, Profiling, Reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C and Python and DevOps experience.&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large; 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tanu-malik/">Tanu Malik&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project aims to characterize computer-science related artifacts that are either submitted to conferences or deposited in reproducibility hubs such as Chameleon. We aim to characterize experiments into different types and understand reproducibility requirements of this rich data set, possibly leading to a benchmark.
We will then study packaging requirements, especially of distributed experiments, and instrument a package archiver to reproduce a distributed experiment. Finally, we will use the learned experiment characteristics to develop a classifier that determines alternative resources where an experiment can be easily reproduced.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;p>Specific tasks include:&lt;/p>
&lt;ul>
&lt;li>A pipeline consisting of a set of scripts to characterize artifacts.&lt;/li>
&lt;li>Packaged artifacts and an analysis report, with open-sourced data, on the best guidelines for packaging using Chameleon.&lt;/li>
&lt;li>A classifier system based on artifact and resource characteristics.&lt;/li>
&lt;/ul></description></item><item><title>Auto Appendix</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tuwien/autoappendix/</link><pubDate>Mon, 11 Mar 2024 14:48:10 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tuwien/autoappendix/</guid><description>&lt;p>The SC Conference Series, a leading forum on High Performance Computing (HPC), supports scientific rigor through enhanced reproducibility of accepted papers.
To that end, all manuscripts submitted to the SC Technical Papers program must contain an Artifact Description.
Authors of accepted papers may request reproducibility badges, for which an Appendix describing the
Artifact Evaluation is required.&lt;/p>
&lt;p>In recent years, &lt;a href="https://www.chameleoncloud.org" target="_blank" rel="noopener">Chameleon&lt;/a> has facilitated SC&amp;rsquo;s reproducibility initiative by enabling authors to develop and share computational, reproducible artifacts through the Chameleon cloud.
The Chameleon platform helps authors and reviewers to easily share computational artifacts,
which are included in the papers&amp;rsquo; artifact appendices.&lt;/p>
&lt;p>The proposed project aims to assess all AD/AE appendices submitted for reproducibility badge requests. This evaluation will focus on AD/AE appendices that utilized the Chameleon cloud as the execution platform, examining their potential for automation.
Our aim is to evaluate the feasibility of fully automating various components of the appendices.
Students will engage directly with the chairs of the SC24 Reproducibility Initiative in this effort.&lt;/p>
&lt;h3 id="advancing-sc-conference-artifact-reproducibility-via-automation">&lt;strong>Advancing SC Conference Artifact Reproducibility via Automation&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Reproducibility&lt;/code> &lt;code>Reproducible Research&lt;/code> &lt;code>Artifact Evaluation&lt;/code> &lt;code>Open Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: HPC, Cloud computing, Chameleon, MPI, OpenMP, CUDA&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sascha-hunold/">Sascha Hunold&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Perform an analysis of the current limitations of AD/AE appendices submitted for Artifact Evaluation.&lt;/li>
&lt;li>Re-run the computational artifacts to identify areas for enhancement, with a primary objective of achieving full automation of Artifact Evaluation using the Chameleon cloud.&lt;/li>
&lt;li>Evaluate the existing automation capabilities of the Chameleon cloud.&lt;/li>
&lt;li>Develop a set of recommendations for structuring Computational Artifacts, aimed at benefiting future SC conferences.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>ML-Powered Problem Detection in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</link><pubDate>Wed, 06 Mar 2024 16:33:57 -0600</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/ml_detect_chameleon/</guid><description>&lt;p>Today’s Continuous Integration/Continuous Deployment (CI/CD) trends encourage
rapid design of software using a wide range of software components, followed by
frequent updates that are immediately deployed on the cloud. The complexity of
cloud systems, component diversity, and the breakneck pace of
development amplify the difficulty of identifying and fixing problems related to
performance, resilience, and security. Furthermore, existing approaches that
rely on human experts—e.g., methods involving manually-written
rules/scripts—have limited applicability to modern CI/CD processes, as they are
fragile, costly, and often not scalable. Consequently, there is growing
interest in applying machine learning (ML) based methods for identifying
vulnerabilities in code, non-compliant or otherwise problematic software, and
resilience problems in systems and networks. However, despite some success
stories in applying AI for cloud operations (e.g., in resource management),
much of cloud operations still rely on human-centric methods, which require
updates as the cloud undergoes CI/CD cycles. The goal of this summer project is
to explore methods of automation for the Chameleon Cloud to enable faster
detection and diagnosis of problems. Overall, the project will contribute to an
overarching vision of building an infrastructure that collects and synthesizes
cross-layer data from large-scale cloud systems, applying ML-powered methods to
automate cloud ops, and, further, making this data available to researchers
through coherent APIs and analytics engines.&lt;/p>
&lt;p>Currently, Chameleon uses runbooks as manual guides for operational tasks,
including routine maintenance and troubleshooting. However, these traditional
runbooks often fall short in dynamic and fast-paced CI/CD environments, as they
lack the flexibility to adapt to changes in software versions, deployment
configurations, and the unique challenges of emerging issues. To overcome these
challenges, the project will leverage ML to automate anomaly detection based on
telemetry data collected from Chameleon Cloud&amp;rsquo;s monitoring frameworks. This
method will not only facilitate rapid identification of performance anomalies
but also enable automated generation of runbooks. These runbooks can then offer
operators actionable steps to resolve issues, making the anomaly
mitigation process more efficient. Furthermore, this approach supports
the automatic creation of targeted runbooks for newly generated support
tickets, enhancing response times and system reliability.&lt;/p>
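&lt;p>As a minimal illustration of the kind of detection the project would automate, the sketch below flags outliers in a synthetic telemetry trace with a rolling z-score. The metric, window, and threshold are illustrative assumptions, not Chameleon&amp;rsquo;s actual telemetry schema or the ML models the project would ultimately build.&lt;/p>

```python
# Minimal sketch: flag anomalies in a telemetry series with a rolling z-score.
# The trace, window, and threshold are illustrative, not real Chameleon telemetry.
from statistics import mean, stdev

def rolling_zscore_anomalies(samples, window=10, threshold=3.0):
    """Return indices whose value deviates > threshold sigmas from the trailing window."""
    anomalies = []
    for i in range(window, len(samples)):
        ref = samples[i - window:i]
        mu, sigma = mean(ref), stdev(ref)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Synthetic CPU-utilization trace with an injected spike at index 30
trace = [50.0 + (i % 5) for i in range(40)]
trace[30] = 95.0
print(rolling_zscore_anomalies(trace))  # prints [30]
```

A real pipeline would replace this statistical baseline with a learned model, but the interface (samples in, anomalous indices out) stays the same.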
&lt;p>Time-permitting, using a collection of automated runbooks (each targeting a
specific problem), we will analyze support tickets, common problems, and their
frequency to offer insights and suggestions that help the Chameleon
Cloud roadmap deliver the best return on investment when fixing problems.&lt;/p>
&lt;p>A key aspect of this summer project is enhancing the reproducibility of
experiments in the cloud and improving data accessibility. We plan to design
infrastructures and APIs so that the telemetry data that is essential for
anomaly detection and automated runbooks is systematically documented and made
available. We also aim to collect and share insights and modules on applying ML
for cloud operations, including ML pipelines, data labeling strategies, data
preprocessing techniques, and feature engineering. By sharing these insights,
we aim to promote best practices and support reproducible experiments on public
clouds, thus fostering future ML-based practices within the Chameleon Cloud
community and beyond. Time permitting, we will explore applying lightweight
privacy-preserving approaches on telemetry data as well.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Machine Learning&lt;/code>, &lt;code>Anomaly Detection&lt;/code>, &lt;code>Automated Runbooks&lt;/code>, &lt;code>Telemetry Data&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>:
&lt;ul>
&lt;li>Proficiency in Machine Learning: Understanding of ML algorithms for anomaly detection and automation.&lt;/li>
&lt;li>Cloud Computing Knowledge: Familiarity with CI/CD environments and cloud architectures.&lt;/li>
&lt;li>Programming Skills: Proficiency in languages such as Python, especially in cloud and ML contexts.&lt;/li>
&lt;li>Data Analysis: Ability to analyze telemetry data using data analytics tools and libraries.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/michael-sherman/">Michael Sherman&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>ReproNB: Reproducibility of Interactive Notebook Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/</link><pubDate>Mon, 26 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> HPC, MPI, distributed systems&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Difficult&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large; 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tanu-malik/">Tanu Malik&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Notebooks have gained wide popularity in scientific computing. A notebook is both a web-based interactive front end to program workflows and a lightweight container for sharing code and its output. Reproducing notebooks in different target environments, however, is a challenge. Notebooks do not share the computational environment in which they are executed. Consequently, despite being shareable, they are often not reproducible. We have developed &lt;a href="https://github.com/depaul-dice/Flinc" target="_blank" rel="noopener">FLINC&lt;/a> (see also &lt;a href="https://dice.cs.depaul.edu/pdfs/pubs/C31.pdf" target="_blank" rel="noopener">eScience'22 paper&lt;/a>) to address this problem. However, it currently does not support all forms of experiments, especially HPC experiments. In this project we will extend FLINC to HPC experiments. This will involve using recording and replaying mechanisms such as &lt;a href="https://kento.github.io/code/" target="_blank" rel="noopener">ReMPI&lt;/a> and &lt;a href="https://rr-project.org/" target="_blank" rel="noopener">rr&lt;/a> within FLINC.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;p>The project deliverable will be a set of HPC experiments that are packaged with FLINC and available on Chameleon.&lt;/p></description></item><item><title>SciStream-Rep: An Artifact for Reproducible Benchmarks of Scientific Streaming Applications</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/scistream/</link><pubDate>Mon, 26 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/scistream/</guid><description>&lt;p>&lt;a href="https://github.com/scistream/scistream-proto" target="_blank" rel="noopener">SciStream&lt;/a> is a framework and toolkit that attempts to tackle the problem of enabling high-speed (100+ Gbps), memory-to-memory data streaming in scientific environments. This task is particularly challenging because data producers (e.g., data acquisition applications on scientific instruments, simulations on supercomputers) and consumers (e.g., data analysis applications) may be in different security domains and thus require bridging of those domains. Furthermore, either producers, consumers, or both may lack external network connectivity and thus require traffic forwarding proxies. If you want to learn more, please take a look at our &lt;a href="https://dl.acm.org/doi/abs/10.1145/3502181.3531475" target="_blank" rel="noopener">HPDC'22 paper&lt;/a>.&lt;/p>
&lt;h3 id="scistream-rep-an-artifact-for-reproducible-benchmarks-of-scientific-streaming-applications">SciStream-Rep: An Artifact for Reproducible Benchmarks of Scientific Streaming Applications&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Network Performance Testing, Benchmarking, Data Streaming, Reproducibility&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Scripting, Linux, Containers, Networking, benchmark tools&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project focuses on expanding the scope of testing SciStream’s architecture by incorporating a variety of traffic patterns based on real scientific applications. The goal is to understand how different traffic patterns influence the performance of memory-to-memory data streaming in scientific scenarios by creating artifacts for reproducible experiments. Additionally, the project will explore the use of different forwarding elements, such as Nginx and HAProxy, to assess their impact on data streaming efficiency and security.&lt;/p>
&lt;p>Reproducibility is especially difficult in shared network environments such as the Chameleon and FABRIC testbeds. We can expect similar results from two identical experiments only when the network conditions (external to our traffic) are similar for both runs. By creating reproducible artifacts for Chameleon and FABRIC, we can build statistical confidence in the measured results through multiple repetitions by other researchers.&lt;/p>
&lt;p>The Specific Tasks of the Project Include:&lt;/p>
&lt;ul>
&lt;li>Developing a set of benchmarks to measure the performance of scientific streaming applications across a broader range of traffic patterns.&lt;/li>
&lt;li>Creating a set of artifacts for generating traffic patterns typical of data streaming applications.&lt;/li>
&lt;li>Deploying various forwarding elements within the SciStream architecture for the Chameleon and FABRIC testbeds.&lt;/li>
&lt;li>Compiling a best practices document detailing the optimal configurations for SciStream.&lt;/li>
&lt;/ul>
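&lt;p>Because run-to-run variation on shared testbeds is unavoidable, benchmark artifacts typically reduce repeated runs to a mean and confidence interval. The sketch below shows one simple way to do this; the throughput numbers are made-up illustrative values, not SciStream measurements.&lt;/p>

```python
# Sketch: summarize repeated throughput runs (Gbps) with a 95% confidence interval,
# as one way to build statistical confidence across repetitions on shared testbeds.
# The measurements below are made-up illustrative numbers.
from statistics import mean, stdev
from math import sqrt

def summarize(runs, z=1.96):  # normal approximation; prefer a t-quantile for few runs
    m = mean(runs)
    half = z * stdev(runs) / sqrt(len(runs))
    return m, (m - half, m + half)

throughput_gbps = [92.1, 95.4, 93.8, 90.7, 94.2, 93.0, 91.9, 94.8]
m, (lo, hi) = summarize(throughput_gbps)
print(f"mean {m:.1f} Gbps, 95% CI [{lo:.1f}, {hi:.1f}]")
```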
&lt;h3 id="scistream-lb-a-dynamic-load-balancing-solution-using-programmable-network-devices">Scistream-LB: A Dynamic Load Balancing Solution Using Programmable network devices&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Network Performance Testing, Data Streaming, Reproducibility, Programmable Data Planes&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python/Scripting, Linux, Docker/Containers, Networking fundamentals, Experience with OpenFlow/P4 programming&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The aim of this project is to create a specialized forwarding element using the OpenFlow (OF) or P4 programming languages, tailored to enhance the SciStream data plane. This development seeks to provide a more flexible, hardware-based (and therefore more efficient) alternative to conventional software-based forwarding mechanisms like NGINX or HAProxy, specifically designed to support the needs of high-performance data streaming environments for scientific applications. The OF/P4 forwarding elements will be packaged as artifacts for reproducibility experiments on the Chameleon and FABRIC testbeds. Reproducibility is especially difficult in shared network environments such as the Chameleon and FABRIC testbeds. We can expect similar results from two identical experiments only when the network conditions (external to our traffic) are similar for both runs. By creating reproducible artifacts for Chameleon and FABRIC, we can build statistical confidence in the measured results through multiple repetitions by other researchers.&lt;/p>
&lt;p>Specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Design and implementation of an OF/P4-based forwarding element that can be seamlessly integrated with the data plane of SciStream’s architecture.&lt;/li>
&lt;li>Forwarding logic that supports efficient and secure memory-to-memory data streaming.&lt;/li>
&lt;li>A set of benchmarks for evaluating the new forwarding element against traditional options, focusing on improvements in throughput, latency, and security.&lt;/li>
&lt;li>An investigation on the potential advantages of programmable network elements for detailed control over data streaming paths and security configurations.&lt;/li>
&lt;li>A package of the newly developed forwarding elements as artifacts for reproducibility experiments in Chameleon and FABRIC testbeds.&lt;/li>
&lt;/ul></description></item><item><title>Chameleon Trovi Redesign</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/trovi/</link><pubDate>Wed, 21 Feb 2024 13:43:55 -0600</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/trovi/</guid><description>&lt;p>&lt;a href="https://www.chameleoncloud.org/experiment/share" target="_blank" rel="noopener">Trovi&lt;/a> on
&lt;a href="https://www.chameleoncloud.org" target="_blank" rel="noopener">Chameleon&lt;/a> is an open-source service designed
to significantly enhance the &lt;a href="https://wordpress.cels.anl.gov/nimbusproject/wp-content/uploads/sites/116/2023/08/Reproducibility_On_Chameleon-3.pdf" target="_blank" rel="noopener">practical
reproducibility&lt;/a>
of computer science research. By allowing Chameleon users to upload, share, and
access packaged experiments and other research artifacts, Trovi aims to
streamline the process of replicating and building upon existing studies. This
capability is crucial in the scientific community, where the ability to
accurately reproduce research results is as fundamental to validating,
critiquing, and extending scientific findings as reading papers. The importance
of Trovi lies in its potential to serve as a centralized hub that facilitates
the exchange of valuable research outputs, promotes transparency, and fosters
collaboration among researchers. By improving the ease with which experiments
can be replicated and data can be shared, Trovi supports the advancement of
knowledge and innovation in the field of computer science, making it an
essential tool for researchers seeking to contribute to the development of
reproducible and robust scientific research.&lt;/p>
&lt;p>This project will focus on the evolution of Trovi. It will aim to enhance Trovi
as a tool to advance practical reproducibility in CS research. Students will
evaluate the most important use cases and enabling features necessary to
enhance Trovi&amp;rsquo;s functionality and user experience. With these design insights,
students will then create a robust interface that allows researchers to
integrate experiment code and data easily as packaged artifacts, similar to the
user-friendly design of Google Colab, and build off other users&amp;rsquo; artifacts to
create novel experiments, similar to the design of GitHub. Furthermore,
students will create comprehensive documentation with valuable insights into
what works well and what requires improvement, creating a dynamic feedback loop
to guide the ongoing redesign process. Lastly, students will actively
participate in designing webinars, creating and posting video tutorials, and
organizing academic events at the University of Chicago to showcase the work on
Trovi. This multifaceted project ensures a well-rounded experience and fosters
a collaborative learning environment.&lt;/p>
&lt;p>Each of the project ideas below focuses on a different aspect of the overall
goal to enhance Trovi as a tool for advancing practical reproducibility in
CS research. They are designed to offer a comprehensive approach,
from technical development to community engagement, ensuring a well-rounded
enhancement of the service.&lt;/p>
&lt;h3 id="user-interface-redesign-for-experiment-artifacts-sharing">&lt;strong>User Interface Redesign for Experiment Artifacts Sharing&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>User Interface Design&lt;/code> &lt;code>User Experience&lt;/code> &lt;code>Web Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: HTML/CSS, JavaScript, UX design principles&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate to Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium to Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Conduct user research to understand the needs and pain points of current
and potential Trovi users.&lt;/li>
&lt;li>Design wireframes and prototypes that incorporate user feedback and aim to
simplify the process of uploading, sharing, and reusing research artifacts.&lt;/li>
&lt;li>Implement the frontend redesign using a modern web framework to ensure
responsiveness and ease of use.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="packaged-artifacts-integration-system">&lt;strong>Packaged Artifacts Integration System&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Cloud Computing&lt;/code> &lt;code>Data Management&lt;/code> &lt;code>Web APIs&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, RESTful APIs, Docker, Git&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Develop a system that allows users to easily package and upload their
experimental code and data to Trovi.&lt;/li>
&lt;li>Create a standardized format or set of guidelines for packaging experiments
to ensure consistency and ease of use.&lt;/li>
&lt;li>Implement API endpoints that enable automated uploads, downloads, and
integration with other tools like GitHub or Zenodo.&lt;/li>
&lt;li>Test the system with real-world experiments to ensure reliability and ease
of integration.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
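&lt;p>As a sketch of what a standardized packaging format might look like, the manifest below uses hypothetical field names chosen for illustration; they are guesses, not Trovi&amp;rsquo;s actual schema.&lt;/p>

```python
# Hypothetical artifact manifest sketch. Every field name here is an illustrative
# assumption, not Trovi's real format; a design task of the project would be to
# standardize something like this.
import json

manifest = {
    "title": "My Reproducible Experiment",
    "version": "1.0.0",
    "authors": ["Jane Researcher"],
    "environment": {"image": "CC-Ubuntu22.04", "node_type": "compute_cascadelake"},
    "entrypoint": "notebook/experiment.ipynb",
    "data": [{"path": "data/input.csv", "source": "https://example.org/dataset.csv"}],
}

serialized = json.dumps(manifest, indent=2)
assert json.loads(serialized) == manifest  # round-trips cleanly
print(serialized)
```

A machine-readable manifest like this is what would let API endpoints validate uploads and integrate with tools like GitHub or Zenodo automatically.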
&lt;h3 id="community-engagement-and-educational-materials">&lt;strong>Community Engagement and Educational Materials&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Educational Technology&lt;/code> &lt;code>Community Building&lt;/code> &lt;code>Content Creation&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Video Editing, Public Speaking, Event Planning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Design and organize webinars that introduce Trovi and its new features to
the research community.&lt;/li>
&lt;li>Create engaging video tutorials that guide users through the process of
using Trovi for their research needs.&lt;/li>
&lt;li>Develop comprehensive documentation that covers both basic and advanced use
cases, troubleshooting, and tips for effective collaboration using Trovi.&lt;/li>
&lt;li>Organize academic events, such as workshops or hackathons, that encourage
the use of Trovi for collaborative research projects.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="feedback-loop-and-continuous-improvement-system">&lt;strong>Feedback Loop and Continuous Improvement System&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Software Engineering&lt;/code> &lt;code>Data Analysis&lt;/code> &lt;code>User Feedback&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, SQL, Data Visualization, Web Development&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mark-powers/">Mark Powers&lt;/a>&lt;/li>
&lt;li>&lt;strong>Tasks&lt;/strong>:
&lt;ul>
&lt;li>Implement a system within Trovi for collecting, storing, and analyzing user
feedback and usage data.&lt;/li>
&lt;li>Develop dashboards that visualize feedback trends and identify areas for
improvement.&lt;/li>
&lt;li>Create mechanisms for users to easily report bugs, request features, and
offer suggestions for the platform.&lt;/li>
&lt;li>Use the collected data to prioritize development efforts and continuously
update the platform based on user needs and feedback.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul></description></item><item><title>Data leakage in applied ML: reproducing examples of irreproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> applied machine learning, data leakage, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Data leakage &lt;a href="https://www.cell.com/patterns/pdfExtended/S2666-3899%2823%2900159-9" target="_blank" rel="noopener">has been identified&lt;/a> as a major cause of irreproducibility of a paper&amp;rsquo;s findings when machine learning techniques are applied to problems in science. Data leakage includes errors such as:&lt;/p>
&lt;ul>
&lt;li>pre-processing before splitting into training/test sets&lt;/li>
&lt;li>feature selection before splitting into training/test sets&lt;/li>
&lt;li>duplicated data points in both training and test sets&lt;/li>
&lt;li>temporal leakage (e.g. shuffled K-fold cross validation with temporal data)&lt;/li>
&lt;li>group leakage (e.g. shuffled K-fold cross validation with data that has group structure)&lt;/li>
&lt;/ul>
&lt;p>and leads to an overly optimistic evaluation of model performance, such that the finding may no longer be the same when the error is corrected.&lt;/p>
&lt;p>Despite the seriousness of this problem, data leakage is often not covered in introductory machine learning courses, and many users of machine learning across varied science domains are unaware of it. Even those who have learned &amp;ldquo;rules&amp;rdquo; for avoiding data leakage (e.g. &amp;ldquo;never do feature selection on the test set&amp;rdquo;) may not understand the reasons for these &amp;ldquo;rules&amp;rdquo;, and how important they are for ensuring that the final result is valid and reproducible.&lt;/p>
&lt;p>The goal of this project is to create &lt;em>learning materials&lt;/em> demonstrating how instances of data leakage invalidate a result. These materials should be easily adoptable by instructors teaching machine learning in a wide variety of contexts, including those teaching a non-CS audience. To achieve this, the project proposes to re-implement published results that have been affected by data leakage, and package these implementations along with supporting material in a format suitable for use in classrooms and by independent learners. For each &amp;ldquo;irreproducible result&amp;rdquo;, the &amp;ldquo;package&amp;rdquo; should include -&lt;/p>
&lt;ul>
&lt;li>a re-implementation of the original result&lt;/li>
&lt;li>an explanation of the data leakage problem affecting the result, with an implementation of a &amp;ldquo;toy example&amp;rdquo; on synthetic data&lt;/li>
&lt;li>a re-implementation of the result without the data analysis error, to show how the finding is affected&lt;/li>
&lt;li>and examples of exam or homework questions that an instructor adopting this package may use to assess understanding.&lt;/li>
&lt;/ul>
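&lt;p>To make the &amp;ldquo;toy example&amp;rdquo; idea concrete, here is a hedged sketch on synthetic data only (not any published result): with purely random labels, choosing the &amp;ldquo;best&amp;rdquo; feature before splitting into train/test sets tends to inflate test accuracy, while selecting on the training set alone does not.&lt;/p>

```python
# Toy sketch of feature-selection leakage on pure noise (synthetic data):
# labels are random, so no feature is truly predictive, yet selecting the feature
# most correlated with the labels *before* splitting yields optimistic test accuracy.
import random

random.seed(0)
n, n_feat = 120, 300
X = [[random.randint(0, 1) for _ in range(n_feat)] for _ in range(n)]
y = [random.randint(0, 1) for _ in range(n)]

def best_feature(rows, labels):
    # pick the feature that agrees most often with the label
    return max(range(n_feat),
               key=lambda j: sum(r[j] == l for r, l in zip(rows, labels)))

def accuracy(j, rows, labels):
    return sum(r[j] == l for r, l in zip(rows, labels)) / len(labels)

split = 80
# LEAKY: feature chosen using ALL data (train + test)
leaky = accuracy(best_feature(X, y), X[split:], y[split:])
# CORRECT: feature chosen using the training data only
proper = accuracy(best_feature(X[:split], y[:split]), X[split:], y[split:])
print(f"leaky test accuracy:  {leaky:.2f}")
print(f"proper test accuracy: {proper:.2f}")
```

The same skeleton extends to the other leakage types listed above (e.g., fitting a scaler before splitting, or shuffled K-fold on temporal data).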
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>A good proposal for this project should include, for at least a few &amp;ldquo;types&amp;rdquo; of data leakage mentioned above -&lt;/p>
&lt;ul>
&lt;li>a specific published result that could be used as an exemplar (you may find ideas among the review papers listed &lt;a href="https://reproducible.cs.princeton.edu/#rep-failures" target="_blank" rel="noopener">here&lt;/a>)&lt;/li>
&lt;li>a brief description of the details of the experiment that will reproduce that result (e.g. what data is used, what machine learning technique is used, what are the hyperparameters used for training)&lt;/li>
&lt;li>and an explanation of why this result is suitable for this use (it uses a publicly available dataset, a machine learning technique that is familiar and accessible to students in an introductory course, the paper has sufficient detail to reproduce the result, etc.)&lt;/li>
&lt;/ul>
&lt;p>The contributor will need to create learning materials that are written in a clear, straightforward, and concise manner, without unnecessary jargon. The proposal should show evidence of the contributor&amp;rsquo;s writing abilities.&lt;/p>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository in the &lt;a href="https://github.com/teaching-on-testbeds/" target="_blank" rel="noopener">Teaching on Testbeds&lt;/a> organization, and the project materials will &amp;ldquo;live&amp;rdquo; there.&lt;/p>
&lt;p>To get a sense of the type of code you would be writing, here is an example of a learning module related to data leakage (however, it is not in the format described above): &lt;a href="https://colab.research.google.com/github/ffund/ml-notebooks/blob/master/notebooks/4-linear-regression-case-study-part-2.ipynb" target="_blank" rel="noopener">Beauty in the Classroom&lt;/a>&lt;/p>
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Packages&amp;rdquo; of learning materials for teaching about common types of data leakage&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifacts for &amp;ldquo;playing back&amp;rdquo; each of the &amp;ldquo;packages&amp;rdquo;&lt;/li>
&lt;/ul></description></item><item><title>Evaluating congestion controls past and future</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/congestion-control/</link><pubDate>Wed, 21 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/congestion-control/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computer networks, congestion control, reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Bash scripting, Linux, computer network performance evaluation&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ashutosh-srivastava/">Ashutosh Srivastava&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In computer networks, congestion control protocols play an outsize role in determining our experience with networked applications. New congestion control algorithms are regularly proposed by researchers to improve throughput and latency performance, adapt to new types of networks, and align more closely with the needs of new applications.&lt;/p>
&lt;p>However, our understanding of the benefits of a new congestion control protocol depends to a large extent on the evaluation - the network topology, the network delay and throughput, the type of flow, the type of competing traffic - and there is no single standard way to evaluate a congestion control protocol. The &lt;a href="https://pantheon.stanford.edu/static/pantheon/documents/pantheon-paper.pdf" target="_blank" rel="noopener">Pantheon&lt;/a> project (which is no longer supported) sought to fill this gap somewhat and address the problem of reproducibility of congestion control results, but their approach is not easily adapted to evaluation scenarios representative of new types of applications or networks. Nor is it capable of representing the evaluation scenarios in most published results related to congestion control.&lt;/p>
&lt;p>The goal of this project, therefore, is to create an evaluation suite for congestion control protocols that can be used to reproduce existing congestion control results in the academic literature, &lt;em>and&lt;/em> to evaluate new protocols under similar evaluation conditions, &lt;em>and&lt;/em> to be easily extended to new scenarios. An &amp;ldquo;evaluation scenario&amp;rdquo; includes:&lt;/p>
&lt;ul>
&lt;li>a Python notebook to realize the network topology on the FABRIC and/or Chameleon testbed, and configure the network characteristics,&lt;/li>
&lt;li>scripts to generate the data flow(s) needed for the evaluation,&lt;/li>
&lt;li>and scripts to capture data from the experiment and visualize the results.&lt;/li>
&lt;/ul>
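&lt;p>As a hedged sketch of what a data-capture script in such a scenario might contain (the helper below is hypothetical, but the &lt;code>rtt&lt;/code> and &lt;code>cwnd&lt;/code> fields follow the format of Linux &lt;code>ss -ti&lt;/code> output), a script could periodically sample per-flow TCP state on the sender:&lt;/p>

```python
import re

def parse_ss_line(line):
    """Extract the congestion window and smoothed RTT from one line of
    Linux `ss -ti` output (hypothetical helper for a capture script)."""
    cwnd = re.search(r"cwnd:(\d+)", line)
    rtt = re.search(r"rtt:([\d.]+)/[\d.]+", line)
    return {
        "cwnd": int(cwnd.group(1)) if cwnd else None,
        "rtt_ms": float(rtt.group(1)) if rtt else None,
    }

sample = "cubic wscale:7,7 rto:204 rtt:1.61/0.8 cwnd:10 bytes_acked:36"
print(parse_ss_line(sample))
```

&lt;p>Sampling these values over time, across competing flows, is what makes throughput/latency comparisons between algorithms reproducible.&lt;/p>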
&lt;p>&lt;strong>Writing a successful proposal for this project&lt;/strong>&lt;/p>
&lt;p>To write a good proposal for this project, you should review the most influential papers on TCP congestion control, especially those on congestion control algorithms that are available in the Linux kernel.&lt;/p>
&lt;p>Use your findings to explain what your proposed evaluation suite will include (what network topologies, what flow generators), and justify this with reference to the academic literature. Also indicate which &lt;em>specific results&lt;/em> you expect to be able to reproduce using this suite (e.g. include figures from influential papers showing evaluation results, with citations, of course).&lt;/p>
&lt;p>You can also take advantage of existing open source code that reproduces a congestion control result, e.g. &lt;a href="https://github.com/sdatta97/imcbbrrepro" target="_blank" rel="noopener">Replication: When to Use and When Not to Use BBR&lt;/a>, or &lt;a href="https://github.com/ashutoshs25/bbr-dominance-experiments" target="_blank" rel="noopener">Some of the Internet may be heading towards BBR dominance: an experimental study&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Github link&lt;/strong>&lt;/p>
&lt;p>There is no pre-existing Git repository for this project - at the beginning of the summer, the contributor will create a new repository for this project.&lt;/p>
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&amp;ldquo;Packages&amp;rdquo; of evaluation scenarios that can be used to evaluate a congestion control algorithm implemented in the Linux kernel&lt;/li>
&lt;li>&lt;a href="https://chameleoncloud.org/experiment/share/" target="_blank" rel="noopener">Trovi&lt;/a> artifacts for realizing each evaluation scenario on Chameleon&lt;/li>
&lt;/ul></description></item><item><title>Automatic reproducibility of COMPSs experiments through the integration of RO-Crate in Chameleon</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/bsc/ro-crate-compss/</link><pubDate>Mon, 19 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/bsc/ro-crate-compss/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Provenance, reproducibility, standards, image creation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, JSON, Bash scripting, Linux, image creation and deployment&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/raul-sirvent/">Raül Sirvent&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>The &lt;a href="https://compss.bsc.es/" target="_blank" rel="noopener">COMPSs programming model&lt;/a> provides an interface for programming sequential applications that are transformed into workflows, which the COMPSs runtime then schedules on the available computing resources. Several languages are supported through bindings: Java, C/C++ and Python (named PyCOMPSs).
COMPSs is able to generate &lt;a href="https://compss-doc.readthedocs.io/en/stable/Sections/05_Tools/04_Workflow_Provenance.html" target="_blank" rel="noopener">Workflow Provenance information&lt;/a>
after the execution of an experiment. The generated artifact (code + data + recorded metadata)
enables the sharing of results through tools such as the &lt;a href="https://workflowhub.eu/" target="_blank" rel="noopener">WorkflowHub portal&lt;/a>,
which can generate a DOI for the results so that they can be included as permanent references
in scientific papers.&lt;/p>
&lt;p>The format of the metadata generated in COMPSs experiments follows the &lt;a href="https://www.researchobject.org/ro-crate/" target="_blank" rel="noopener">RO-Crate specification&lt;/a>,
and, more specifically, two &lt;a href="https://www.researchobject.org/ro-crate/profiles.html" target="_blank" rel="noopener">profiles&lt;/a>:
the Workflow and Workflow Run Crate profiles. This metadata enables not only the sharing of results, but also their
reproducibility.&lt;/p>
&lt;p>This project proposes the creation of a service that enables the automatic reproducibility of COMPSs experiments
on the Chameleon infrastructure. The service will take a COMPSs crate (an artifact that follows the RO-Crate
specification) and, by parsing the available metadata, build a Chameleon-compatible image for reproducing the
experiment on the testbed. Small modifications to the COMPSs RO-Crate are foreseen (e.g. the inclusion of third-party
software required by the application).&lt;/p>
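&lt;p>To illustrate the parsing step, the sketch below walks the &lt;code>@graph&lt;/code> of a minimal, hypothetical &lt;code>ro-crate-metadata.json&lt;/code> to locate the main workflow entity and its language. A real COMPSs crate contains many more entities, and exactly which fields the service needs is a design question for this project:&lt;/p>

```python
import json

# Minimal, hypothetical excerpt of an ro-crate-metadata.json file;
# "mainEntity" and "programmingLanguage" are terms from the
# Workflow RO-Crate profile, but real crates hold far more metadata.
crate = json.loads("""
{
  "@graph": [
    {"@id": "./", "@type": "Dataset", "mainEntity": {"@id": "workflow.py"}},
    {"@id": "workflow.py",
     "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
     "programmingLanguage": {"@id": "#compss"}},
    {"@id": "#compss", "@type": "ComputerLanguage",
     "name": "COMPSs", "version": "3.3"}
  ]
}
""")

# Index entities by @id, then follow references from the root dataset.
entities = {e["@id"]: e for e in crate["@graph"]}
main_id = entities["./"]["mainEntity"]["@id"]
language = entities[entities[main_id]["programmingLanguage"]["@id"]]
print(main_id, language["name"], language["version"])
```

&lt;p>From information like this, the service could select a base image and install the matching COMPSs runtime version before replaying the experiment.&lt;/p>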
&lt;p>&lt;strong>Project Deliverables&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Study the different environments and specifications (COMPSs, RO-Crate, Chameleon, Trovi, &amp;hellip;).&lt;/li>
&lt;li>Design the most appropriate integration, considering all the elements involved.&lt;/li>
&lt;li>Integrate reproducibility of basic PyCOMPSs experiments in Chameleon.&lt;/li>
&lt;li>Integrate reproducibility of complex PyCOMPSs experiments in Chameleon (e.g. with third-party software dependencies).&lt;/li>
&lt;/ul></description></item><item><title>BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</link><pubDate>Sat, 17 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene imputation, benchmarking, cross-platform/species analysis&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python and/or R, commonly used in bioinformatics.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with statistical data analysis and machine learning models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (not required but preferred):&lt;/strong>
&lt;ul>
&lt;li>Proficiency in bioinformatics and computational biology.&lt;/li>
&lt;li>Familiarity with spatial transcriptomics datasets and platforms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating multi-platform, multi-species datasets and the complexity of benchmarking gene imputation methods, this project is substantial. It requires extensive data preparation, analysis, and validation phases, making it suitable for a larger time investment.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>The orchestration of cellular life is profoundly influenced by the precise control of gene activation and silencing across different spatial and temporal contexts. Understanding these complex spatiotemporal gene expression patterns is vital for advancing our knowledge of biological processes, from development and disease progression to adaptation. While single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to profile gene expression across thousands of cells simultaneously, its requirement for cell dissociation strips away the critical spatial context, limiting our comprehension of cellular interactions within their native environments. Recent strides in spatial transcriptomics have started to bridge this gap by enabling spatially resolved gene expression measurements at single-cell or even sub-cellular resolutions. These advancements offer unparalleled opportunities to delineate the intricate tapestry of gene expression within tissues, shedding light on the dynamic interactions between cells and their surroundings.&lt;/p>
&lt;p>Despite these technological advances, a significant challenge remains: the datasets generated by spatial transcriptomic technologies are often incomplete, marred by missing gene expression values due to various technical and biological constraints. This limitation severely impedes our ability to fully interpret these rich datasets and extract meaningful insights from them. Gene imputation emerges as a pivotal solution to this problem, aiming to fill in these missing data points, thereby enhancing the resolution, quality, and interpretability of spatial transcriptomic datasets.&lt;/p>
&lt;p>Recognizing the critical importance of this task, there is a pressing need for a unified benchmarking platform that can facilitate the evaluation and comparison of gene imputation methods across a diverse array of samples, spanning multiple sampling platforms, species, and organs. Currently, the bioinformatics and spatial transcriptomics fields lack such a standardized framework, hindering progress and innovation. To address this gap, our project aims to establish a comprehensive gene imputation dataset that encompasses a wide range of conditions and parameters. We intend to reproduce known methods and assess their efficacy, providing a solid and reproducible foundation for future advancements in this domain.&lt;/p>
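&lt;p>One common way to benchmark imputation (a sketch of the general idea, not a prescription for this project) is to hold out known expression values, impute them, and score how well they are recovered, e.g. with a Pearson correlation between the held-out truth and the imputed values:&lt;/p>

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hold out known expression values for a gene, "impute" them, then
# score how well the method recovered them (values are made up).
truth = [5.0, 0.0, 3.2, 7.1, 1.4]
imputed = [4.8, 0.3, 3.0, 6.9, 1.1]
print(round(pearson(truth, imputed), 3))
```

&lt;p>A standardized benchmark would apply such metrics uniformly across methods, platforms, species, and organs.&lt;/p>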
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A comprehensive, preprocessed benchmark dataset that spans multiple sampling platforms, species, and organs, aimed at standardizing gene imputation tasks in spatial transcriptomics.&lt;/li>
&lt;li>An objective comparison of state-of-the-art gene imputation methodologies, enhancing the understanding of their performance and applicability across diverse biological contexts.&lt;/li>
&lt;li>A user-friendly Python package offering a suite of gene imputation tools, designed to fulfill the research needs of the spatial transcriptomics community by improving data completeness and reproducibility.&lt;/li>
&lt;/ul></description></item><item><title>ScaleRep: Reproducing and benchmarking scalability bugs hiding in cloud systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/</link><pubDate>Sat, 10 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/osu/scalerep/</guid><description>&lt;p>&lt;strong>Topics:&lt;/strong> Distributed systems, Scalability, Bug analysis, Bug reproducibility&lt;br>
&lt;strong>Skills:&lt;/strong> Java, Python, bash scripting, perf, Linux internals&lt;br>
&lt;strong>Difficulty:&lt;/strong> Hard&lt;br>
&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;br>
&lt;strong>Mentors:&lt;/strong> &lt;strong>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> (contact person)&lt;/strong>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>&lt;/p>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Large-scale distributed systems are integral to the infrastructure of a wide range of applications and services.
The continuous evolution of these systems requires ongoing efforts to address inherent faults which span a variety of issues including availability, consistency, concurrency, configuration, durability, error-handling, integrity, performance, and security.
Recent developments in the field and the rise of cloud computing have been marked by a notable increase in the scale at which such systems operate.&lt;/p>
&lt;p>This increase in scale introduces specific challenges, particularly in terms of system reliability and performance.
As distributed systems expand beyond single machines, addressing the growing demands for computation, memory and storage becomes more difficult.
This underlying complexity leads to the emergence of scalability bugs — defects that surface in large-scale deployments, yet do not reveal themselves in a small-scale setting.&lt;/p>
&lt;p>To better understand scalability bugs, we set out to investigate a set of scalability issues documented over the last 5 years from 10 popular open-source large-scale systems.
These bugs have led to significant operational challenges, such as system downtime, reduced responsiveness, data loss, and data corruption.
Moreover, addressing them required extensive collaboration and problem-solving efforts among engineers and bug reporters, with discussions often spanning a month or more.&lt;/p>
&lt;p>We observed that traditional bug finding techniques are insufficient for detecting scalability bugs since these defects are triggered by a mixture of scale-related aspects not properly investigated by previous approaches.
These characteristics include the number of components involved, the system load and workload size, the reliability of recovery protocols, and the magnitude of intermediate failures.
Although previous research examined some of these aspects, it has typically done so either in isolation or without providing a comprehensive understanding of the fundamental bug patterns, symptoms, root causes, fixes, and, more importantly, how easily these bugs can be reproduced in-house.&lt;/p>
&lt;p>Therefore, the main goal of this project is to systematically understand, characterize, and document the challenges associated with scalability bugs, at-large.
Our approach is twofold: first, to analyze scalability bugs in terms of reproducibility, and second, to develop methodologies for triggering them and measuring their impact.
Specifically, we aim to:&lt;/p>
&lt;ol>
&lt;li>Provide detailed accounts of bug reproduction experiences for a diverse set of recently reported scalability bugs from our benchmark applications;&lt;/li>
&lt;li>Identify specific challenges that prevent engineers from reproducing certain scalability bugs and investigate how prevalent these obstacles are;&lt;/li>
&lt;li>Create a suite of protocols to effectively trigger and quantify the impact of scalability bugs, facilitating their investigation in smaller-scale environments.&lt;/li>
&lt;/ol>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A set of Trovi replayable artifacts enabling other researchers to easily reproduce scalability bugs for our benchmark applications;&lt;/li>
&lt;li>A set of Jupyter notebook scripts allowing to conveniently replay each step in our investigation;&lt;/li>
&lt;li>A detailed breakdown of the challenges faced when reproducing scalability bugs and how these obstacles differ from those related to more “traditional” types of bugs.&lt;/li>
&lt;/ul></description></item><item><title>GPEC: An Open Emulation Platform to Evaluate GPU/ML Workloads on Erasure Coding Storage</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lanl/gpec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Machine Learning, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Erasure Coding, Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Large-scale data centers store immense amounts of user data across a multitude of disks, necessitating redundancy strategies like erasure coding (EC) to safeguard against disk failures. Numerous research efforts have sought to assess the performance and durability of various erasure coding approaches, including single-level erasure coding, locally recoverable coding, and multi-level erasure coding.&lt;/p>
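&lt;p>As a toy illustration of the redundancy idea (single-parity coding, far simpler than the multi-level schemes studied here), any one lost chunk can be rebuilt by XOR-ing the surviving chunks with the parity chunk:&lt;/p>

```python
def encode(data_chunks):
    """Single-parity erasure code: the parity chunk is the XOR of all
    data chunks (a toy stand-in for real EC schemes)."""
    parity = bytes(len(data_chunks[0]))
    for chunk in data_chunks:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return parity

def recover(surviving_chunks, parity):
    """Rebuild the single lost chunk by XOR-ing survivors with parity."""
    lost = parity
    for chunk in surviving_chunks:
        lost = bytes(a ^ b for a, b in zip(lost, chunk))
    return lost

chunks = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(chunks)
print(recover([chunks[0], chunks[2]], parity))  # rebuilds the lost middle chunk
```

&lt;p>Production codes (e.g. Reed-Solomon) tolerate multiple failures, but the recovery-traffic cost this example hints at is exactly what makes repair behavior under ML workloads worth emulating.&lt;/p>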
&lt;p>Despite its widespread adoption, a significant research gap exists regarding the performance of large-scale erasure-coded storage systems when exposed to machine learning (ML) workloads. While conventional practice often leans towards replication for enhanced performance, this project seeks to explore whether cost-effective erasure coding can deliver comparable performance. In this context, several fundamental questions remain unanswered, including:&lt;/p>
&lt;ul>
&lt;li>Can a typical erasure-coded storage system deliver sufficient throughput for ML training tasks?&lt;/li>
&lt;li>Can an erasure-coded storage system maintain low-latency performance for ML training and inference workloads?&lt;/li>
&lt;li>How do disk failures and subsequent repairs impact the throughput and latency of ML workloads?&lt;/li>
&lt;li>What influence do various erasure coding design choices, such as chunk placement strategies and repair methods, have on the aforementioned performance metrics?&lt;/li>
&lt;/ul>
&lt;p>To address these questions, the most straightforward approach would involve running ML workloads on large-scale erasure coded storage systems within HPC data centers. However, this presents challenges for researchers and students due to limited access to expensive GPUs and distributed storage systems, especially when dealing with large-scale evaluations. Consequently, there is a need for a cost-effective evaluation platform.&lt;/p>
&lt;p>The objective of this project is to develop an open-source platform that facilitates cheap and reproducible evaluations of erasure-coded storage systems under ML workloads. This platform consists of two key components:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>GPU Emulator:&lt;/strong> simulates GPU performance for ML workloads. Development of the GPU emulator is near completion.&lt;/li>
&lt;li>&lt;strong>EC Emulator:&lt;/strong> simulates the performance characteristics of erasure-coded storage systems. It is still in the exploratory phase and requires further development.&lt;/li>
&lt;/ul>
&lt;p>The student&amp;rsquo;s responsibilities will include documenting the GPU emulator, progressing the development of the EC emulator, and packaging the experiments to ensure easy reproducibility. It is anticipated that this platform will empower researchers and students to conduct cost-effective and reproducible evaluations of large-scale erasure-coded storage systems in the context of ML workloads.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Build an EC emulator to emulate the performance characteristics of large-scale erasure-coded storage systems&lt;/li>
&lt;li>Incorporate the EC emulator into ML workloads and GPU emulator&lt;/li>
&lt;li>Conduct reproducible experiments to evaluate the performance of erasure-coded storage systems in the context of ML workloads&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository with open-source code&lt;/li>
&lt;/ul></description></item><item><title>Turn on, Tune in, Listen up: Maximizing Side-Channel Recovery in Cross-Platform Time-to-Digital Converters</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/turnontunein/</link><pubDate>Thu, 08 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/turnontunein/</guid><description>&lt;p>&lt;a href="https://github.com/KastnerRG/PL-Sensors" target="_blank" rel="noopener">Turn on, Tune in, Listen Up&lt;/a> Is an open-source framework for implementing voltage flucturation sensors in FPGA devices for use in side-channel security research. Side-channels are an ever present hardware security threat. The reconfigurability of FPGAs significantly broadens the side-channel attack surface in many cloud heterogeneous systems. We have developed a highly tunable side-channel sensor, which significantly improves side-channel attack time and resolution in multiple contexts. Concurrent users sharing the same device may attack one another through the power side-channel (&lt;a href="https://dl.acm.org/doi/abs/10.1145/3543622.3573193" target="_blank" rel="noopener">check out our paper&lt;/a>), while consecutive users may attack one another through measurement of the physical wear-out state of the FPGA device (&lt;a href="https://arxiv.org/abs/2303.17881" target="_blank" rel="noopener">check out our paper&lt;/a>). We have demonstrated these attack surfaces on both Intel (Altera) and AMD (Xilinx) platforms. Currently, our open-sourced sensor design and side-channel analysis flow is limited to AMD devices. We are seeking CSE/CS/CE/ECE researchers interested in FPGA design, heterogeneous computing and/or hardware security to combine our Intel and AMD side-channel sensors into a unified attack framework and comparing capabilities between vendors.&lt;/p>
&lt;h3 id="open-source-sensor-repository-updates">Open-source sensor repository updates&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Hardware security&lt;/code>, &lt;code>cloud security&lt;/code>, &lt;code>heterogeneous computing&lt;/code>, &lt;code>temporal and spatial side-channels&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Experience with GitHub, FPGA development (AMD or Intel), and Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:drichmond@ucsc.edu">Dustin Richmond&lt;/a>, &lt;a href="mailto:tsheaves@ucdavis.edu">Tyler Sheaves&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Update existing open-source voltage fluctuation sensor to support both AMD and Intel devices. Currently our repository exclusively supports AMD FPGAs. We have added new features to our sensor and have demonstrated an implementation on Intel. We would like to consolidate this work into a unified repository containing side-channel analysis demonstrations using open-source target benchmark designs.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Adapt existing tooling scripts to support multiple vendor tool flows.&lt;/li>
&lt;li>Adapt existing test infrastructure to target multiple SoC-type FPGA platforms (i.e. DE10-Nano, Pynq Z2, etc.).&lt;/li>
&lt;li>Evaluate cross-platform sensor architecture on a collection of benchmark designs. Demonstrate each benchmark using a cross-platform unified side-channel analysis framework.&lt;/li>
&lt;li>Draw a comparison between sensor implementations on different architectures.&lt;/li>
&lt;/ul></description></item><item><title>Artificial Intelligence Explainability Accountability</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/aiealab/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/aiealab/</guid><description>&lt;h2 id="trustworthy-logical-reasoning-large-language-models--llms">Trustworthy Logical Reasoning Large Language Models (LLMs)&lt;/h2>
&lt;p>Logical LLMs is a project to translate the output from large language models (LLMs) into a logic-based programming language (Prolog) to detect inconsistencies and hallucinations automatically. One goal of this project is to build a user interface through which users can give feedback that can be incorporated into the system. The overall project goal is to create a trustworthy hybrid open-source LLM tool that can learn from user feedback and explain its mistakes.&lt;/p>
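&lt;p>For intuition only (the project itself targets Prolog, and the triple representation below is a hypothetical simplification), a consistency check over facts extracted from LLM output can be as simple as flagging two different values asserted for the same subject and predicate:&lt;/p>

```python
def find_contradictions(facts):
    """Facts are (subject, predicate, value) triples; a contradiction is
    the same subject and predicate asserted with two different values."""
    seen = {}
    conflicts = []
    for subj, pred, val in facts:
        key = (subj, pred)
        if key in seen and seen[key] != val:
            conflicts.append((subj, pred, seen[key], val))
        seen.setdefault(key, val)
    return conflicts

facts = [("paris", "capital_of", "france"),
         ("paris", "capital_of", "spain")]  # the second fact is hallucinated
print(find_contradictions(facts))
```

&lt;p>A logic engine generalizes this idea with rules and inference, which is why Prolog is the project's target representation.&lt;/p>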
&lt;h3 id="collect-hallucinations-and-facts">Collect Hallucinations and Facts&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: AI/ML, data collection, logic, user interfaces&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: javascript, html, python, bash, git&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Easy/Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/leilani-h.-gilpin/">Leilani H. Gilpin&lt;/a> (and a PhD student TBD).&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks&lt;/h3>
&lt;ul>
&lt;li>Run queries in an LLM API with various prompts.&lt;/li>
&lt;li>Create a user interface system that collects user feedback in a web
browser.&lt;/li>
&lt;li>Create a pipeline for storing the user data in a common format that
can be shared in our database.&lt;/li>
&lt;li>Document the tool for future maintenance.&lt;/li>
&lt;/ul>
&lt;h2 id="explaining-failures-in-autograding">Explaining failures in autograding&lt;/h2>
&lt;p>The eXplainable autograder (XAutograder) is a tool for autograding student coding assignments while providing personalized explanations or feedback. The goal of this project is to create an introductory set of coding assignments with explanations of wrong answers. This benchmark suite will be used for testing our system. The project goal is to create a dynamic autograding system that can learn from students&amp;rsquo; code and explain their mistakes.&lt;/p>
&lt;h3 id="design-introductory-questions-and-explanations">Design introductory questions and explanations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: AI/ML, AI for education, XAI (Explainable AI)&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: python, git&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/leilani-h.-gilpin/">Leilani H. Gilpin&lt;/a> (and a PhD student TBD).&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks&lt;/h3>
&lt;ul>
&lt;li>Design 5-10 basic programming questions (aggregated from online sources, other courses, etc.).&lt;/li>
&lt;li>Create tests of correctness (unit tests) and a testing framework that can input a set of answers and provide a final assessment.&lt;/li>
&lt;li>Create a set of baseline explanations for various error cases, e.g.,
out of bounds error, syntax error, etc.&lt;/li>
&lt;li>Create a pipeline for iterating on the test cases and/or explanation
feedback.&lt;/li>
&lt;li>Document the tool for future maintenance.&lt;/li>
&lt;/ul></description></item><item><title>Causeway: Learning Web Development Through Micro-Roles</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/causeway/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/causeway/</guid><description>&lt;p>&lt;a href="https://tech4good-causeway.web.app/#/tutorial/quarter-goals?c=01-quarter-goals-component&amp;amp;r=01-component-elements&amp;amp;s=01-intro-to-causeway" target="_blank" rel="noopener">Causeway&lt;/a> is a platform for learning to develop web applications using an Angular, RxJS, NgRx, and Firebase stack. Most online coding tutorials focus on covering the technical syntax or features of a language or framework, which means that new developers don’t have great resources for building a holistic picture of how everything they learn connects to actually developing a complex web application. Causeway breaks down the process of developing a web application into a hierarchy of micro-roles which provides learners with a clear pathway for learning that also translates to a clear process for developing an application. In the longer future, this would also enable learners to easily contribute to projects as they learn through taking on micro-roles for yet-to-be-developed projects. The platform uses the &lt;a href="https://developer.stackblitz.com/platform/api/webcontainer-api" target="_blank" rel="noopener">Stackblitz WebContainer API&lt;/a> to run full applications in the browser for interactive learning.&lt;/p>
&lt;p>Thus far, we have developed a version of the platform that walks learners through the process of developing presentational components of a web application as well as smart components / containers that contain multiple presentational components and are responsible for fetching data from the backend and handling events and updates to the database. This content is still using Angular 13 and needs to be updated to Angular 17, as well as to make some improvements in our use of RxJS, NgRx, and Firebase. We’d also like to extend the content in multiple ways including: 1) extending the walkthrough to more components and containers besides the single example we have, ideally in a way that covers a complete application, and 2) extending beyond components and containers to cover defining database entities and relationships. We’d also like to develop a learning dashboard where users can see the different micro-roles and lessons that they’ve completed or that are upcoming for the project they are working on.&lt;/p>
&lt;h3 id="causeway--improving-the-core-infrastructure-and-experience">Causeway / Improving the Core Infrastructure and Experience&lt;/h3>
&lt;p>The proposed work includes updating the platform and the example infrastructure within the platform to the latest version of Angular and other associated libraries, implementing and testing logging and analytics, implementing a learning dashboard for users, and time permitting, creating new modules to cover defining database entities and relationships. Both roles will also contribute to running usability studies and documenting the platform so that it can be open-sourced.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="causeway--extend-the-learning-scope-and-experience">Causeway / Extend the Learning Scope and Experience&lt;/h3>
&lt;p>The proposed work includes extending the component and container walkthroughs to cover a complete interactive application. This means writing a separate simple application, and organizing the code required to do so into units of work organized by our micro-role structure. Both roles will also contribute to running usability studies and documenting the platform so that it can be open-sourced.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Web Development, Educational Technologies, Angular&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Web development experience, HTML, CSS, Javascript, Angular, RxJS, NgRx, Firebase&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-lee/">David Lee&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LAST: Let’s Adapt to System Drift</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</link><pubDate>Wed, 07 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/anl/last/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Computer systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Data Science and Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ray-andrew-sinurat/">Ray Andrew Sinurat&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sandeep-madireddy/">Sandeep Madireddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The performance of computer systems is constantly evolving, a natural outcome of updating hardware, improving software, and encountering hardware quirks over time. At the same time, machine learning (ML) models are becoming increasingly popular. They are being used widely to address various challenges in computer systems, notably in speeding up decision-making. This speed is vital for a quick and flexible response, essential for meeting service-level agreements (SLAs). Yet, an interesting twist has emerged: like the computer systems they aid, ML models also experience a kind of &amp;ldquo;aging.&amp;rdquo; This results in a gradual decline in their effectiveness, a consequence of changes in their operating environment.&lt;/p>
&lt;p>The phenomenon of model &amp;ldquo;aging&amp;rdquo; is a ubiquitous occurrence across various domains, not limited merely to computer systems. This process of aging can significantly impact the performance of a model, emphasizing the critical importance of early detection mechanisms to maintain optimal functionality. In light of this, numerous strategies have been formulated to mitigate the aging of models. However, the generalizability and effectiveness of these strategies across diverse domains, particularly in computer systems, remain largely unexplored. This research aims to bridge this gap by designing and implementing a comprehensive data analysis pipeline. The primary objective is to evaluate the efficacy of various strategies through a comparative analysis, focusing on their performance in detecting and addressing model aging. To achieve a better understanding of this issue, the research will address the following pivotal questions:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Data-Induced Model Aging&lt;/strong>: What specific variations within the data can precipitate the aging of a model? Understanding the nature and characteristics of data changes that lead to model deterioration is crucial for developing effective prevention and mitigation strategies.&lt;/li>
&lt;li>&lt;strong>Efficacy of Aging Detection Algorithms&lt;/strong>: How proficient are the current algorithms in identifying the signs of model aging? Assessing the accuracy and reliability of these algorithms will provide insights into their practical utility in real-world scenarios.&lt;/li>
&lt;li>&lt;strong>Failure Points in Detection&lt;/strong>: In what scenarios or under what data conditions do the aging detection mechanisms fail? Identifying the limitations and vulnerabilities of these algorithms is vital for refining their robustness and ensuring comprehensive coverage.&lt;/li>
&lt;li>&lt;strong>Scalability and Responsiveness&lt;/strong>: How do these algorithms perform in terms of robustness and speed, particularly when subjected to larger datasets? Evaluating the scalability and responsiveness of the algorithms will determine their feasibility and effectiveness in handling extensive and complex datasets, a common characteristic in computer systems.&lt;/li>
&lt;/ul>
&lt;p>To better understand and prevent issues related to model performance, our approach involves analyzing various datasets, both system and non-system, that have shown notable changes over time. We aim to apply machine learning (ML) models to these datasets to assess the effects of these changes on model performance. Our goal is to leverage more advanced ML techniques to create new algorithms that address these challenges effectively. This effort is expected to contribute significantly to the community, enhancing the detection of model aging and improving model performance in computer systems.&lt;/p>
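&lt;p>As a toy illustration of the kind of check such a pipeline might run (the names below are hypothetical, not project code), the two-sample Kolmogorov-Smirnov statistic flags when a recent window of values no longer matches a reference distribution:&lt;/p>

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def drift_detected(reference, window, threshold=0.2):
    """Flag aging when the recent window drifts away from the reference."""
    return ks_statistic(reference, window) > threshold
```

&lt;p>A real pipeline would compare distance-based detectors like this against sequential change detectors such as ADWIN or Page-Hinkley, and the threshold would be calibrated per dataset rather than fixed.&lt;/p>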
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Run pipeline on several computer systems and non-computer systems dataset&lt;/li>
&lt;li>A Trovi artifact for data preprocessing and model training shared on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the pipeline source code&lt;/li>
&lt;/ul></description></item><item><title>Reproducibility in Data Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/</link><pubDate>Tue, 06 Feb 2024 15:00:00 -0500</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/niu/repro-vis/</guid><description>&lt;p>At the heart of evaluating reproducibility is a judgment about whether
two results are indeed
the same. This can be complicated in the context of data visualization due to
rapidly evolving technology and differences in how users perceive the results.
First, due to the rapid evolution of libraries including web technologies,
visualizations created in the past may look different when rendered in the future.
Second, as the goal of data visualization is communicating data to people,
different people may perceive visualizations in a different way.
Thus, when a reproduced visualization does not exactly match the original, judging
whether they are &amp;ldquo;similar enough&amp;rdquo; is complicated by these factors. For example,
changes in a colormap may be deemed minor by a computer but could lead people to different
understandings of the data. The goals of this research are to capture visualizations in a way that
allows their reproducibility to be evaluated and to develop methods to categorize the differences
when a reproduced visualization differs from the original.&lt;/p>
&lt;h3 id="investigate-solutions-for-capturing-visualizations">Investigate Solutions for Capturing Visualizations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Reproducibility, Data Visualization&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python and/or JavaScript, Data Visualization Tools&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to investigate, augment, and/or develop solutions to capture
visualizations that appear in formats including websites and Jupyter notebooks.
In &lt;a href="https://github.com/simprov/simprov" target="_blank" rel="noopener">past work&lt;/a>, we implemented methods
to capture thumbnails as users interacted with visualizations. Other solutions
can be used to capture interactive visualizations. We wish to understand
the feasibility of recording such visualizations and their utility in
evaluating reproducibility in the future.&lt;/p>
&lt;h5 id="specific-tasks">Specific tasks:&lt;/h5>
&lt;ul>
&lt;li>Evaluate tools for capturing static visualizations on the web&lt;/li>
&lt;li>Investigate tools for capturing dynamic visualizations on the web&lt;/li>
&lt;li>Investigate how data including code or metadata can be captured with visualizations&lt;/li>
&lt;li>Augment or develop tools to aid in capturing reproducible visualizations&lt;/li>
&lt;/ul>
&lt;h3 id="categorize-differences-in-reproduced-visualizations">Categorize Differences in Reproduced Visualizations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Reproducibility, Data Visualization&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python and/or JavaScript, Data Visualization Tools&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate/Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/david-koop/">David Koop&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to organize types of differences in reproduced visualizations and create tools to detect them. Publications and computational notebooks record renderings of visualizations.
When they also include the code to reproduce the visualization, we can
regenerate them in order to compare them. Often, the reproduced visualization does
not match the original (see examples in this &lt;a href="https://arxiv.org/abs/2308.06894" target="_blank" rel="noopener">manuscript&lt;/a>).
This project seeks to categorize the types of differences
that can occur and to begin understanding how they impact judgments of reproducibility.&lt;/p>
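&lt;p>As a minimal starting point for automated comparison (a hypothetical helper, not an existing project tool), a pixel-level difference ratio treats two same-size renderings as grids of RGB tuples; as noted above, a small pixel difference can still carry a large perceptual difference, which is exactly what the categorization work must capture:&lt;/p>

```python
def pixel_diff_ratio(img_a, img_b, tolerance=0):
    """Fraction of pixels whose color differs by more than `tolerance`
    in any channel. Images are equal-size nested lists of (r, g, b) tuples."""
    total = 0
    differing = 0
    for row_a, row_b in zip(img_a, img_b):
        for px_a, px_b in zip(row_a, row_b):
            total += 1
            if any(abs(c_a - c_b) > tolerance for c_a, c_b in zip(px_a, px_b)):
                differing += 1
    return differing / total
```

&lt;p>A categorization tool would go beyond this single scalar, for example separating colormap changes from layout shifts or missing marks.&lt;/p>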
&lt;h5 id="specific-tasks-1">Specific tasks:&lt;/h5>
&lt;ul>
&lt;li>Evaluate and/or develop tools to compare two visualizations&lt;/li>
&lt;li>Evaluate the utility of artificial intelligence solutions&lt;/li>
&lt;li>Organize and categorize the detected differences&lt;/li>
&lt;li>Develop tools to determine the types or categories of differences present in two visualizations&lt;/li>
&lt;/ul></description></item><item><title>FSA: Benchmarking Fail-Slow Algorithms</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</link><pubDate>Tue, 06 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/failslowalgorithms/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ruidan-li/">Ruidan Li&lt;/a> (primary contact), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kexin-pei/">Kexin Pei&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In the realm of modern applications, achieving not only low but also predictable response times is a critical requirement. Performance instability, even when it amounts to just a few milliseconds of delay, can result in violations of Service Level Objectives (SLOs). Redundancy at the RAID group level provides a layer of protection; however, the early identification of potential slowdowns or failures is paramount in minimizing their impact on overall system latency.&lt;/p>
&lt;p>Fail-Slow represents a unique type of fault within storage systems, characterized by the system&amp;rsquo;s ability to continue functioning while progressively deteriorating – its performance significantly drops below expected levels. Notably, fail-slow conditions are responsible for a considerable share of latency tails. Detecting fail-slow faults is particularly challenging, as they can be easily masked by the normal fluctuations in performance. Consequently, the identification of fail-slow faults is a critical area of research, demanding meticulous attention.&lt;/p>
&lt;p>Several strategies have been developed to address the fail-slow issue, yet the question of their broad applicability remains. We plan to implement and assess various existing fail-slow detection algorithms, examining their strengths and weaknesses. Our analysis will concentrate on key questions:&lt;/p>
&lt;ul>
&lt;li>How promptly can the algorithm identify a fail-slow symptom?&lt;/li>
&lt;li>What methods does the algorithm employ to accurately distinguish fail-slow incidents, thereby minimizing false negatives?&lt;/li>
&lt;li>Through what approach does the algorithm achieve the right sensitivity level to keep false positives in check?&lt;/li>
&lt;/ul>
&lt;p>This evaluation aims to shed light on the effectiveness of current methodologies in detecting fail-slow faults, crucial for enhancing system reliability and performance.&lt;/p>
&lt;p>Building upon our evaluation of several fail-slow detection algorithms, our objective is to harness advanced machine learning (ML) models to develop a novel algorithm. This initiative seeks to address and potentially compensate for the identified weaknesses in existing methodologies. By focusing on the critical aspects of early detection, accurate differentiation, and optimal sensitivity, we aim to create a solution that reduces both false negatives and false positives, thereby enhancing overall system reliability. This approach represents a strategic effort to not only advance the current state of fail-slow detection but also to contribute significantly to the resilience and performance of storage systems.&lt;/p>
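&lt;p>To make these evaluation criteria concrete, here is a deliberately simple baseline of the kind the studied algorithms would be measured against (an illustrative sketch, not one of the algorithms under evaluation): it flags a device only after its latency stays well above its own rolling median for several consecutive samples:&lt;/p>

```python
from collections import deque
from statistics import median

class FailSlowDetector:
    """Flag a device as fail-slow when its latency stays above k times the
    rolling median of its recent history for `patience` consecutive samples."""

    def __init__(self, window=32, k=3.0, patience=5):
        self.history = deque(maxlen=window)
        self.k = k
        self.patience = patience
        self.streak = 0

    def observe(self, latency_ms):
        """Feed one latency sample; return True once fail-slow is declared."""
        slow = False
        if len(self.history) == self.history.maxlen:
            baseline = median(self.history)
            if latency_ms > self.k * baseline:
                self.streak += 1
            else:
                self.streak = 0
            slow = self.streak >= self.patience
        self.history.append(latency_ms)
        return slow
```

&lt;p>Raising the patience parameter suppresses false positives from normal performance fluctuations but delays detection; the benchmark would sweep exactly such knobs across the algorithms under study.&lt;/p>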
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A Trovi artifact for the existing Fail-Slow detection algorithms on Chameleon Cloud&lt;/li>
&lt;li>A GitHub repository containing the full evaluation result&lt;/li>
&lt;li>A Google Colab notebook for quick replay&lt;/li>
&lt;/ul></description></item><item><title>Open Sensing Platform (OSP)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/</link><pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/</guid><description>&lt;h2 id="open-sensing-platform-i-software-to-enable-large-scale-outdoor-sensor-networks">Open Sensing Platform I: Software to enable large scale outdoor sensor networks&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Data Visualization Dashboard" srcset="
/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp 400w,
/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_baae6484e015277af7b09e866b6869f5.webp 760w,
/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/osp1_huda3c1d46887767e16b865c47973b8288_360491_2d797937cbe25a879de96b44cb5c65b3.webp"
width="760"
height="759"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Backend, Web Development, UI/UX, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> React, Javascript, Python, SQL, Git&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> Flask, Docker, CI/CD, AWS, Authentication&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:awu70@ucsc.edu">Aaron Wu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Open Sensing Platform (OSP) is a new initiative expanding from our prior project DirtViz, a data visualization web platform for monitoring microbial fuel cell sensors (see &lt;a href="https://github.com/jlab-sensing/DirtViz" target="_blank" rel="noopener">GitHub&lt;/a>). The mission is to scale up the current platform to support other researchers or citizen scientists in integrating their novel sensing hardware or microbial fuel cell sensors for monitoring and data analysis. Examples of the types of sensors currently deployed are sensors measuring soil moisture, temperature, current, and voltage in outdoor settings. The focus of the software half of the project involves building upon our existing visualization web platform, and adding additional features to support the mission. A live version of the website is available &lt;a href="https://dirtviz.jlab.ucsc.edu/" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deliverables:&lt;/strong>
&lt;ul>
&lt;li>Create a system for remote collaborators/citizen scientists to set up their sensors and upload data securely, e.g., designing the user flow to create sensors&lt;/li>
&lt;li>Craft an intuitive navigation system so that data from deployment sites around the world can be easily viewed, e.g., designing the experience/system to locate deployment sites&lt;/li>
&lt;li>Refine our web-based visualization tools with additional features for users to analyze collected data, e.g., lazy loading out-of-range data or caching queried data&lt;/li>
&lt;li>Document the tool thoroughly for future maintenance&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="open-sensing-platform-ii-hardware-to-enable-large-scale-outdoor-sensor-networks">Open Sensing Platform II: Hardware to enable large scale outdoor sensor networks&lt;/h2>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Hardware" srcset="
/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_b844f987d1fd7b63009c6d2a89b9dcf2.webp 400w,
/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_3199ed5510eaff77a8cf1f93ae26f10d.webp 760w,
/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/osp/featured_hu6708254effb609c97dc781c926e4aea5_3805876_b844f987d1fd7b63009c6d2a89b9dcf2.webp"
width="760"
height="521"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Embedded system, wireless communication, low-power remote sensing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;em>Required:&lt;/em> C/C++, Git, Github, Platformio&lt;/li>
&lt;li>&lt;em>Nice to have:&lt;/em> PCB design and debugging experience, STM32 HAL, ESP32 Arduino, protobuf, python, knowledge of standard communication protocols (I2C, SPI, and UART)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>, &lt;a href="mailto:sgtaylor@ucsc.edu">Stephen Taylor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Open Sensing Platform hardware aims to be a general purpose hardware platform for outdoor sensing (e.g. agriculture, ecological monitoring, etc.). The typical use case involves a sensor deployment in an agricultural field, remotely uploading measurements without interfering with farming operations. The current hardware revision (&lt;a href="https://github.com/jlab-sensing/soil_power_sensor" target="_blank" rel="noopener">Soil Power Sensor&lt;/a>) was originally designed for monitoring power output of microbial fuel cells using high fidelity voltage and current measurement channels, as well as auxiliary sensors such as the SDI-12 &lt;a href="https://metergroup.com/products/teros-12/" target="_blank" rel="noopener">TEROS-12 soil moisture sensor&lt;/a>. The primary activities of this project will involve low-level firmware design and implementation, but may also incorporate hardware design revisions if necessary. We are looking to expand functionality to other external sensors, as well as optimize for power consumption, via significant firmware design activities.&lt;/p>
&lt;p>Long-range, low-power wireless communication is achieved through a LoRa-capable STM32 microcontroller, with in-lab experiments using an ESP32 microcontroller to enable the simpler WiFi interface. Both wireless interfaces upload measurements to our data visualization dashboard, &lt;strong>Open Sensing Platform I&lt;/strong>. The combined goal across both of these projects is to create a system that enables researchers to test and evaluate novel sensing solutions. We want the device to be usable by a wide range of researchers who may not have a background in electronics, so we are also interested in design activities that enhance user friendliness.&lt;/p>
&lt;p>In total there will be 2-4 people working on the hardware with progress being tracked on GitHub. Broader project planning is tracked through a Jira board. We intend to have weekly meetings to provide updates on current issue progress along with assigning tasks. Please reach out to &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a> if there are any questions or specific ideas for the project.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Deliverables:&lt;/strong> Contribution via commits to the GitHub repository with documentation on completed work. A changelog of contributions to the firmware.&lt;/li>
&lt;/ul></description></item><item><title>OpenMLEC: Open-source MLEC implementation with HDFS on top of ZFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</link><pubDate>Mon, 05 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ornl/openmlec/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage Systems, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Java, Bash scripting, Linux, HDFS, ZFS, Erasure Coding&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/meng-wang/">Meng Wang&lt;/a> (&lt;a href="mailto:wangm12@uchicago.edu">Main contact person&lt;/a>) and Anjus George&lt;/li>
&lt;/ul>
&lt;p>Multi-Level Erasure Coding (MLEC), which performs erasure coding at both network and local levels, has seen large deployments in practice. Our recent research work has shown that MLEC can provide high durability with higher encoding throughput and less repair network traffic compared to other erasure coding methods. This makes MLEC particularly appealing for large-scale data centers, especially high-performance computing (HPC) systems.&lt;/p>
&lt;p>However, current MLEC systems often rely on straightforward design choices, such as Clustered/Clustered (C/C) chunk placement and the Repair-All (RALL) method for catastrophic local failures. Our recent simulations [1] have revealed the potential benefits of more complex chunk placement strategies like Clustered/Declustered (C/D), Declustered/Clustered (D/C), and Declustered/Declustered (D/D). Additionally, advanced repair methods such as Repair Failed Chunks Only (RFCO), Repair Hybrid (RHYB), and Repair Minimum (RMIN) have shown promise for improving durability and performance according to our simulations. Despite promising simulation results, these optimized design choices have not been implemented in real systems.&lt;/p>
&lt;p>In this project, we propose to develop open-source MLEC implementations in real systems, offering a range of design choices from simple to complex. Our approach leverages ZFS for local-level erasure coding and HDFS for network-level erasure coding, supporting both clustered and declustered chunk placement at each level. The student&amp;rsquo;s responsibilities include setting up HDFS on top of ZFS, configuring various MLEC chunk placements (e.g., C/D, D/C, D/D), and implementing advanced repair methods within HDFS and ZFS. The project will culminate in reproducible experiments to evaluate the performance of MLEC systems under different design choices.&lt;/p>
&lt;p>We will open-source our code and aim to provide valuable insights to the community on optimizing erasure-coded systems. Additionally, we will provide comprehensive documentation of our work and share Trovi artifacts on Chameleon Cloud to facilitate easy reproducibility of our experiments.&lt;/p>
&lt;p>[1] Meng Wang, Jiajun Mao, Rajdeep Rana, John Bent, Serkay Olmez, Anjus George, Garrett Wilson Ransom, Jun Li, and Haryadi S. Gunawi. Design Considerations and Analysis of Multi-Level Erasure Coding in Large- Scale Data Centers. In The International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’23), 2023.&lt;/p>
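&lt;p>One quantity the reproducible experiments will need to report is storage overhead, which composes multiplicatively across the two coding levels. A small helper (illustrative only, not project code) makes the trade-off explicit:&lt;/p>

```python
def mlec_overhead(k_net, m_net, k_loc, m_loc):
    """Storage overhead multiplier of a two-level erasure-coded layout:
    a network-level code with k_net data and m_net parity chunks, where
    each chunk is stored under a local code with k_loc data and m_loc
    parity units."""
    network = (k_net + m_net) / k_net
    local = (k_loc + m_loc) / k_loc
    return network * local
```

&lt;p>For example, an 8+2 network-level code over 4+1 local stripes stores about 1.56 bytes per user byte, compared with 2.0 for two-way replication; the experiments would weigh such overhead against the durability and repair-traffic differences of each placement and repair method.&lt;/p>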
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Open-source MLEC implementations with a diverse range of design choices.&lt;/li>
&lt;li>Configuration setup for HDFS on top of ZFS, supporting various MLEC chunk placements.&lt;/li>
&lt;li>Implementation of advanced repair methods within HDFS and ZFS.&lt;/li>
&lt;li>Reproducible experiments to assess the performance of MLEC systems across distinct design choices.&lt;/li>
&lt;li>Comprehensive documentation of the project and the provision of shared Trovi artifacts on Chameleon Cloud for ease of reproducibility.&lt;/li>
&lt;/ul></description></item><item><title>EdgeRep: Reproducing and benchmarking edge analytic systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/edgerep/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> video analytics, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/junchen-jiang/">Junchen Jiang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>With the flourishing of ideas like smart cities and smart manufacturing, a
massive number of edge devices (e.g., traffic or security cameras,
thermometers, flood sensors, etc.) are deployed and connected to the network.
These devices collect and analyze data across space and time, aiding
stakeholders like city governments and manufacturers in optimizing their plans
and operations. However, the sheer number of edge devices and the large amount
of communication among the devices and central servers raises significant
challenges in how to manage and schedule resources. This includes network
bandwidth between the devices and computing power on both edge devices and bare
metal servers, all to maintain the reliable service capability of running
applications.&lt;/p>
&lt;p>Moreover, given the limited resources available to edge devices, there&amp;rsquo;s an
emerging trend to reduce average compute and/or bandwidth usage. This is
achieved by leveraging the uneven distribution of interesting events with
respect to both time and space in the input data. This, in turn, introduces
further challenges in provisioning and managing the amount of resources
available to edge devices. The resource demands of running applications can
greatly depend on the input data, which is both dynamic and unpredictable.&lt;/p>
&lt;p>Keeping these challenges in mind, the team previously designed and implemented
a dynamic resource manager capable of understanding the applications and making
decisions based on this understanding at runtime. However, such a resource
manager has only been tested with a limited number and variety of video analytics
applications. Thus, through the OSRE24 project, we aim to:&lt;/p>
&lt;ul>
&lt;li>Collect a wide range of videos to form a comprehensive video dataset&lt;/li>
&lt;li>Reproduce other state-of-the-art self-adaptive video analytics applications&lt;/li>
&lt;li>Package the dataset as well as the application to publish them on Chameleon
Trovi site&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A comprehensive video dataset assembled from a wide range of sources&lt;/li>
&lt;li>Reproductions of other state-of-the-art self-adaptive video analytics applications&lt;/li>
&lt;li>The dataset and applications packaged and published on the Chameleon Trovi site&lt;/li>
&lt;/ul></description></item><item><title>FEP-Bench: Benchmarks for understanding feature engineering and preprocessing bottlenecks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ibm/fep-bench/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> storage system, scheduling, distributed system, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yuyang-roy-huang/">Yuyang (Roy) Huang&lt;/a> (contact person), &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/swami-sundararaman/">Swami Sundararaman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>In the realm of machine learning (ML), preprocessing of data is a critical yet
often underappreciated phase, consuming approximately 80% of the time in common
ML tasks. This extensive time consumption can be attributed to various
challenges encountered from both data and computation perspectives.&lt;/p>
&lt;p>From the data side, one significant challenge is the slow retrieval of data
from data lakes, which are storage repositories that hold a vast amount of raw
data in its native format. However, the process of extracting this data can be
slow, causing computation cycles to wait for data arrival and leading to delays
in the entire preprocessing phase. Furthermore, the size of the data often
exceeds the memory capacity of standard computing systems. This is a frequent
occurrence in ML, as datasets are typically large and complex. Handling such
large datasets requires sophisticated memory management techniques to ensure
efficient preprocessing without overwhelming the system&amp;rsquo;s memory.&lt;/p>
&lt;p>On the computation side, a naive solution to data operations, especially
aggregation, often leads to inefficiencies. These operations may require
grouping large chunks of data before any actual computation can begin. This
grouping, without careful configuration and management, can
trigger serious data shuffling, leading to extensive remote data movement when
the data is distributed across various storage systems. Such data movement is
not only time-consuming but also resource-intensive.&lt;/p>
&lt;p>To mitigate these challenges, there is a pressing need to design better
caching, prefetching, and heuristic strategies for data preprocessing. The team
aims to significantly reduce the time and resources required for preprocessing
by optimizing data retrieval and computational processes.&lt;/p>
&lt;p>However, prior to the design and implementation of such a system, a systematic
understanding of the preprocessing workflow is essential. Hence, throughout the
program, the students will need to:&lt;/p>
&lt;ul>
&lt;li>Understand the current system used to preprocess data for ML training, for
example, Hadoop or Spark.&lt;/li>
&lt;li>Collect the common datasets used for different types of ML models.&lt;/li>
&lt;li>Collect the typical operations used for preprocessing these datasets.&lt;/li>
&lt;li>Benchmark the performance in these operations under the existing frameworks
under various experimental settings.&lt;/li>
&lt;li>Package the benchmark such that the team can later use it for reproduction or
evaluation.&lt;/li>
&lt;/ul>
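&lt;p>The benchmarking step could start from a harness as simple as the following sketch (all names are illustrative; the real benchmark would target operators in frameworks such as Hadoop or Spark). The toy groupby stands in for the aggregation operations that trigger shuffles in distributed settings:&lt;/p>

```python
import time

def benchmark(fn, data, repeats=3):
    """Best wall-clock time (seconds) of fn(data) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return best

def groupby_mean(rows):
    """Toy aggregation: mean value per key over (key, value) pairs."""
    sums, counts = {}, {}
    for key, value in rows:
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}
```

&lt;p>Running the harness over different dataset sizes, operations, and framework configurations yields one data point per experimental setting, which is the raw material the packaged benchmark would collect.&lt;/p>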
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A report on the current systems used to preprocess data for ML training, for example, Hadoop or Spark&lt;/li>
&lt;li>A collection of the common datasets used for different types of ML models&lt;/li>
&lt;li>A catalog of the typical operations used to preprocess these datasets&lt;/li>
&lt;li>Performance benchmarks of these operations under the existing frameworks across various experimental settings&lt;/li>
&lt;li>A packaged benchmark that the team can later use for reproduction or evaluation&lt;/li>
&lt;/ul></description></item><item><title>FetchPipe: Data Science Pipeline for ML-based Prefetching</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uchicago/fetchpipe/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, PyTorch, Bash scripting, Linux, Machine Learning modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniar-h.-kurniawan/">Daniar H. Kurniawan&lt;/a> (primary contact), Haryadi Gunawi&lt;/li>
&lt;/ul>
&lt;p>The contemporary landscape of high-performance servers, particularly those designed for data centers and AI/ML training, prominently features solid-state drives (SSDs) and spinning disks (HDDs) as primary storage devices. These components play a crucial role in shaping overall system performance, underscoring the importance of addressing and minimizing Input/Output (I/O) latency. This is particularly crucial given the widespread adoption of hybrid storage systems, where caching and prefetching strategies are instrumental in optimizing storage performance. Caching involves using faster but less dense memory to store frequently accessed data, while prefetching aims to reduce latency by fetching data from slower memory to cache before it is needed. Although both caching and prefetching present valid challenges, our primary emphasis is on the prefetching problem due to the inherent difficulty in predicting future access.&lt;/p>
&lt;p>Traditional prefetchers, dating back 1-2 decades, heavily rely on predefined rules for prefetching based on LBA access sequences, limiting their adaptability to complex scenarios. For instance, the read-ahead prefetcher is confined to prefetching the next data item within a file for faster sequential access. Addressing this limitation, recent advancements include learning-based methods, such as Long Short-Term Memory (LSTM) techniques like DeepPrefetcher and Delta LSTM, which model the LBA delta to cover a broader range of LBAs. However, they still struggle to achieve high accuracy when the workload pattern changes drastically. Although some sophisticated prefetchers are capable of learning complex I/O access patterns using graph structures, their deployment remains challenging due to the computational cost.&lt;/p>
&lt;p>In this project, our goal is to provide an end-to-end data science pipeline to empower the research on ML-based prefetchers. We believe that this pipeline is crucial for fostering active collaboration between the ML community and storage systems researchers. This collaboration aims to optimize existing ML-based prefetching solutions. Specifically, we will provide the dataset for training/testing and some samples of ML-based models that can further be developed by the community. Furthermore, we will also provide a setup for evaluating the ML model when deployed in storage systems.&lt;/p>
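&lt;p>To give a concrete sense of the delta modeling mentioned above, the sketch below turns an LBA access trace into delta sequences and (history, next-delta) training pairs. This mirrors the general delta-based setup, not the exact pipelines of DeepPrefetcher or Delta LSTM, and the trace values are made up:&lt;/p>

```python
def lba_deltas(trace):
    """Convert an LBA access sequence into deltas, the usual model input
    for delta-based prefetchers: element i is trace[i+1] - trace[i]."""
    return [b - a for a, b in zip(trace, trace[1:])]

def sliding_windows(deltas, history=3):
    """Pair each window of past deltas with the next delta to predict."""
    return [(deltas[i:i + history], deltas[i + history])
            for i in range(len(deltas) - history)]

# A toy trace: mostly sequential 8-block strides with one jump.
trace = [100, 108, 116, 124, 200, 208]
deltas = lba_deltas(trace)         # [8, 8, 8, 76, 8]
samples = sliding_windows(deltas)  # first sample: ([8, 8, 8], 76)
print(samples[0])
```

&lt;p>Real traces from open systems would be parsed into the same (history, target) shape before being fed to an LSTM or other model.&lt;/p>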
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Compile I/O traces from various open traces and open systems.&lt;/li>
&lt;li>Develop a pipeline for building ML-based prefetching solutions.&lt;/li>
&lt;li>Build a setup to evaluate the model in a real hybrid storage system.&lt;/li>
&lt;li>Publish a Trovi artifact shared on Chameleon Cloud and a GitHub repository.&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Performance Benchmarking for Genomics Workflows on HPC Cluster</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/</link><pubDate>Fri, 02 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uga/genomicswf/</guid><description>&lt;p>&lt;strong>Project Idea description&lt;/strong>&lt;/p>
&lt;p>We aim to characterize the performance of genomic workflows on HPC clusters by conducting two research activities using a broad set of state-of-the-art genomic applications and open-source datasets.&lt;/p>
&lt;p>&lt;strong>Performance Benchmarking and Characterizing Genomic Workflows:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: High Performance Computing (HPC), Data Analysis, Scientific Workflows&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration, Genomics Applications (e.g. BWA, FastQC, Picard, GATK, STAR)&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In this activity, students will perform comprehensive performance measurements of genomic data processing on HPC clusters using state-of-the-art applications, workflows, and real-world datasets. They will collect and package datasets for I/O, memory, and compute utilization using industry-standard tools and best practices. Measurement will be done using Kubernetes container orchestration on a multi-node cluster to achieve scalability, with either a custom-made metrics collection system or integration of existing industry-standard tools (e.g., Prometheus).&lt;/p>
&lt;p>&lt;strong>Quantifying Performance Interference and Assessing Their Impact on Workflow Execution Time:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine Learning, Data Analysis, and Scientific Workflows and Computations&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, Bash Scripting, Data Science Toolkit, Kubernetes, Container Orchestration&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In this activity, students will measure the slowdown of various applications due to resource contention (e.g. CPU and I/O). Students will analyze whether an application is compute-bound, I/O-bound, or both, then analyze the correlation between resource utilization and execution time. Following that, students will assess the impact of per-application slowdown on the slowdown of the whole workflow. To the best of our knowledge, this will be the first study that systematically quantifies per-application interference when running genomics workflows on an HPC cluster.&lt;/p>
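&lt;p>The core arithmetic of this analysis is simple and can be sketched as below; the application names, runtimes, and utilization figures are entirely hypothetical:&lt;/p>

```python
def slowdown(contended_time, baseline_time):
    """Interference slowdown factor: > 1 means the app ran slower
    under contention than in isolation."""
    return contended_time / baseline_time

def pearson(xs, ys):
    """Pearson correlation between, e.g., I/O utilization and runtime."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

# Hypothetical runtimes (seconds): isolation vs. under I/O contention.
baseline = {"bwa": 120.0, "fastqc": 45.0, "gatk": 300.0}
contended = {"bwa": 150.0, "fastqc": 47.0, "gatk": 420.0}
factors = {app: slowdown(contended[app], baseline[app]) for app in baseline}
print(factors["gatk"])  # 1.4

# Correlate per-run I/O utilization (%) with observed runtime (s).
io_util = [10, 40, 70, 90]
runtime = [100, 130, 180, 240]
print(round(pearson(io_util, runtime), 2))  # 0.97
```

&lt;p>A strong positive correlation here would suggest the application is I/O-bound, guiding which per-application slowdowns matter most for whole-workflow execution time.&lt;/p>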
&lt;p>For both subprojects, all experiments will also be conducted in a reproducible manner (e.g., as a Trovi package or Chameleon VM images), and all code will be open-sourced (e.g., shared on a public Github repo).&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>:&lt;/p>
&lt;p>A GitHub repository and/or Chameleon VM image containing source code for application executions &amp;amp; metrics collection.
Jupyter notebooks and/or Trovi artifacts containing analysis and mathematical models for application resource utilization &amp;amp; the effects of data quality.&lt;/p></description></item><item><title>LiveHD</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/livehd/</link><pubDate>Thu, 01 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/livehd/</guid><description>&lt;p>The goal is to enable a more productive flow where the ASIC/FPGA designer can
work with multiple hardware description languages like CHISEL, Pyrope, or
Verilog.&lt;/p>
&lt;p>There are several projects: some involve compiler infrastructure around
&lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>, while others explore how to interface
LLMs to improve chip design productivity.&lt;/p>
&lt;p>The following projects are available:&lt;/p>
&lt;ul>
&lt;li>Slang with LiveHD&lt;/li>
&lt;li>Hardware Hierarchical Dynamic Structures (hdds)&lt;/li>
&lt;li>HDLEval for LLMs&lt;/li>
&lt;li>C++ Profiler Optimizer with LLMs&lt;/li>
&lt;li>Decompiler from Assembly to C++ with LLMs&lt;/li>
&lt;/ul>
&lt;h2 id="slang-with-livehd">Slang with LiveHD&lt;/h2>
&lt;h3 id="project-idea">Project Idea&lt;/h3>
&lt;p>&lt;a href="https://github.com/MikePopoloski/slang" target="_blank" rel="noopener">slang&lt;/a> is one of the best open source
Verilog front-ends available. &lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>
uses slang, but only a subset of Verilog is supported. The goal is to add more slang features.&lt;/p>
&lt;h3 id="project-deliverable">Project Deliverable&lt;/h3>
&lt;p>The slang/LiveHD interface creates LiveHD IR (LNAST IR). The plan is to keep
extending the translation to support more features. This is a project that
allows small steps. The goal is to support all Verilog 2001, and potentially
some System Verilog features.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> SystemVerilog, Compilers&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Knowledge of Verilog, C++17, some compiler background.&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="hardware-hierarchical-dynamic-structures-hdds">Hardware Hierarchical Dynamic Structures (hdds)&lt;/h2>
&lt;h3 id="project-idea-1">Project Idea&lt;/h3>
&lt;p>&lt;a href="https://github.com/masc-ucsc/hhds" target="_blank" rel="noopener">hdds&lt;/a> aims to build efficient tree and
graph data structures commonly used by hardware compilers. A key difference is
their hierarchical nature and access patterns.&lt;/p>
&lt;h3 id="project-deliverable-1">Project Deliverable&lt;/h3>
&lt;p>There are 2 main components: Graph and Tree.&lt;/p>
&lt;p>For each, there is a hierarchical implementation that allows connecting trees/graphs in a hierarchy.
For example, a graph can call another graph with inputs and outputs, much like a Verilog module instantiates other Verilog modules.&lt;/p>
&lt;p>Both classes should have iterators for traversal in topological order.&lt;/p>
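&lt;p>The actual iterators would be C++17, but the intended traversal order can be sketched with Kahn's algorithm over a toy module hierarchy (the module names below are made up for illustration):&lt;/p>

```python
from collections import deque

def topo_order(graph):
    """Kahn's algorithm: yield nodes so that for every edge u -> v,
    u is emitted before v. `graph` maps each node to its children."""
    indeg = {n: 0 for n in graph}
    for succs in graph.values():
        for v in succs:
            indeg[v] = indeg.get(v, 0) + 1
    ready = deque(n for n, d in indeg.items() if d == 0)
    while ready:
        n = ready.popleft()
        yield n
        for v in graph.get(n, ()):
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

# A tiny hierarchy: 'top' instantiates 'alu' and 'regfile',
# and 'alu' instantiates 'adder'.
hier = {"top": ["alu", "regfile"], "alu": ["adder"], "regfile": [], "adder": []}
order = list(topo_order(hier))
print(order)  # ['top', 'alu', 'regfile', 'adder']
```

&lt;p>An hhds-style iterator would expose the same guarantee (parents before the modules they instantiate) without materializing the whole order up front.&lt;/p>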
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data structures for compilers&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Data structures, C++17&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="hdleval-for-llms">HDLEval for LLMs&lt;/h2>
&lt;h3 id="project-idea-2">Project Idea&lt;/h3>
&lt;p>LLMs can be used to create new hardware. The goal of this project is to create multiple prompts
so that LLM/compiler designers can have examples to improve their flows.&lt;/p>
&lt;h3 id="project-deliverable-2">Project Deliverable&lt;/h3>
&lt;p>The idea is to create many sample projects where an &amp;ldquo;input&amp;rdquo; creates a Verilog artifact. The specification should not assume Verilog as output because other HDLs like Chisel could be used.&lt;/p>
&lt;p>The goal is to create many sample circuits that are realistic and practical. The description can have&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Verilog, LLMs&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> Verilog or Chisel&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Low&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Small or medium&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="c-profiler-optimizer-with-llms">C++ Profiler Optimizer with LLMs&lt;/h2>
&lt;h3 id="project-idea-3">Project Idea&lt;/h3>
&lt;p>Fine-tune and/or apply RAG to an LLM so it can leverage profiling tools and provide
code optimization recommendations for C++ and possibly Rust code.&lt;/p>
&lt;h3 id="project-deliverable-3">Project Deliverable&lt;/h3>
&lt;p>Create a Python package (poetry?) called aiprof that analyzes the execution of a C++ or Rust program and
provides code change recommendations to improve performance.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">aiprof ./binary
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>aiprof uses perf tools but also other tools like redspy, zerospy, and loadspy
to find problematic code areas and drive the GPT optimizer.&lt;/p>
&lt;p>The plan is to find several examples of transformations to build a database so
that a model like CodeLlama or Mixtral can be fine-tuned with code optimization
recommendations.&lt;/p>
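&lt;p>One small piece of such a tool is turning raw profiler output into material for an LLM prompt. The sketch below parses perf-stat-style counter lines from a canned sample; a real aiprof would obtain this text by running perf on the target binary, and the prompt wording here is purely illustrative:&lt;/p>

```python
import re

def parse_perf_stat(text):
    """Extract counter values from `perf stat`-style output lines such as
    '     1,234,567      cache-misses'."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"\s*([\d,]+)\s+([\w-]+)", line)
        if m:
            counters[m.group(2)] = int(m.group(1).replace(",", ""))
    return counters

# Canned sample of what a `perf stat ./binary` run might report.
sample = """
     12,345,678      cache-misses
        987,654      branch-misses
"""
counters = parse_perf_stat(sample)
prompt = (f"The program incurred {counters['cache-misses']} cache misses; "
          "suggest C++ data-layout optimizations.")
print(prompt)
```

&lt;p>Outputs from redspy, zerospy, and loadspy would be folded into the same prompt-building step to point the model at specific problematic code regions.&lt;/p>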
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> C++, perf tools&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> C++17, Linux performance counters&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="decompiler-from-assembly-to-c-with-llms">Decompiler from Assembly to C++ with LLMs&lt;/h2>
&lt;h3 id="project-idea-4">Project Idea&lt;/h3>
&lt;p>There are several decompilers from assembly to C, like ghidra and retdec. The idea is to enhance
both tools&amp;rsquo; outputs and feed them to an LLM to generate cleaner C++ code.&lt;/p>
&lt;h3 id="project-deliverable-4">Project Deliverable&lt;/h3>
&lt;p>ghidra and retdec generate C code out of assembly. The idea is to start with
these tools as a baseline, but feed their output to an LLM to generate C++ code instead of
plain C.&lt;/p>
&lt;p>Create a Python package (poetry?) called aidecomp that integrates both
decompilers. It allows targeting C or C++17.&lt;/p>
&lt;p>To check that the generated code is equivalent to the translated function, a
fuzzer could be used. This allows aidecomp to iterate on the generation if the
generated code is not equivalent.&lt;/p>
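&lt;p>The fuzz-based equivalence check amounts to differential testing: run the reference and the regenerated function on the same random inputs and report the first disagreement. Everything below (function names, the toy modular-arithmetic semantics) is illustrative, not part of aidecomp:&lt;/p>

```python
import random

def differential_check(reference, candidate, trials=1000, seed=0):
    """Fuzz both implementations with the same inputs; return the first
    counterexample input, or None if they agreed on every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-2**31, 2**31 - 1)
        if reference(x) != candidate(x):
            return x
    return None

def reference(x):
    """Ground-truth semantics of the original function."""
    return abs(x) % 7

def good(x):
    """A faithful regeneration."""
    return (x if x >= 0 else -x) % 7

def bad(x):
    """A buggy regeneration: diverges for negative inputs."""
    return x % 7

print(differential_check(reference, good))  # None
print(differential_check(reference, bad) is None)
```

&lt;p>On a disagreement, the counterexample input could be fed back into the LLM prompt to drive the next regeneration attempt.&lt;/p>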
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> C++, decompilers&lt;/li>
&lt;li>&lt;strong>Skills Needed:&lt;/strong> C++17&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Drishti</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/drishti/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/drishti" target="_blank" rel="noopener">Drishti&lt;/a> is a novel interactive web-based analysis framework to visualize I/O traces, highlight bottlenecks, and help understand the I/O behavior of scientific applications. Drishti aims to fill the gap between the trace collection, analysis, and tuning phases. The framework contains an interactive I/O trace analysis component for end-users to visually inspect their applications&amp;rsquo; I/O behavior, focusing on areas of interest and getting a clear picture of common root causes of I/O performance bottlenecks. Based on the automatic detection of I/O performance bottlenecks, our framework maps numerous common and well-known bottlenecks to solution recommendations that users can implement.&lt;/p>
&lt;h3 id="drishti--server-side-visualization-service">Drishti / Server-side Visualization Service&lt;/h3>
&lt;p>The proposed work will include investigating and building server-side solutions to support the visualization of larger I/O traces and logs, while integrating with the existing analysis, reports, and recommendations.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>visualization&lt;/code>, &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, HTML/CSS, JavaScript&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="drishti--visualization-and-analysis-of-ai-based-applications">Drishti / Visualization and Analysis of AI-based Applications&lt;/h3>
&lt;p>Extend Drishti to handle metrics from non-MPI applications, specifically AI/ML codes and applications. This work entails adapting the existing framework, heuristics, and recommendations to support metrics collected from AI/ML workloads.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>AI&lt;/code> &lt;code>visualization&lt;/code>, &lt;code>performance analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, AI, performance profiling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>h5bench</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including I/O overhead and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench--reporting-and-enhancing">h5bench / Reporting and Enhancing&lt;/h3>
&lt;p>The proposed work will include standardizing and enhancing the reports generated by the suite, and integrating additional I/O kernels (e.g., HACC-IO).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="h5bench--compression">h5bench / Compression&lt;/h3>
&lt;p>The proposed work will focus on including compression capabilities into the h5bench core access patterns through HDF5 filters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>, &lt;code>compression&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>StatWrap</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/statwrap/</link><pubDate>Wed, 24 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/statwrap/</guid><description>&lt;p>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap&lt;/a> is a free and open-source assistive, non-invasive discovery and inventory tool to document research projects. It inventories project assets (e.g., code files, data files, manuscripts, documentation) and organizes information without additional input from the user. It also provides structure for users to add searchable and filterable notes connected to files to help communicate metadata about intent and analysis steps.&lt;/p>
&lt;p>At its core, StatWrap helps investigators identify and track changes in a research project as it evolves - which may affect reproducibility. For example: (1) people on the project can change over time, so processes may not be consistently executed due to transitions in employment; (2) data changes over time, due to accruing additional cases, adding new variables, or correcting mistakes in existing data; (3) software (e.g. used for data preparation and statistical analysis) evolves as it is edited, improved, and optimized; and (4) software can break or produce different results due to changes &amp;lsquo;under the hood&amp;rsquo; such as updates to statistical packages, compilers, or interpreters. StatWrap passively and actively documents these changes to support reproducibility.&lt;/p>
&lt;p>Additional information:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://sites.northwestern.edu/statwrap/" target="_blank" rel="noopener">StatWrap home&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/stattag/statwrap" target="_blank" rel="noopener">StatWrap code (GitHub)&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="reproducibility-checklists">Reproducibility Checklists&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>reproducibility&lt;/code>, &lt;code>user interface&lt;/code>, &lt;code>checklists&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: JavaScript, React&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal of this project is to develop support within StatWrap to generate customizable reproducibility checklists. The developer will use the metadata and user input collected by StatWrap to automatically generate checklists. This functionality will allow investigators to automatically generate a document indicating what practices they&amp;rsquo;ve followed to support reproducibility. Part of the project will involve surveying proposed reproducibility checklists and considering what to implement in StatWrap. This work will take a systematic approach to documenting reproducibility, much like PRISMA checklists for systematic reviews or CONSORT checklists for clinical trials.&lt;/p>
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Identify candidate reproducibility checklists to use as guides&lt;/li>
&lt;li>Create the data structure for configuring reproducibility checklists&lt;/li>
&lt;li>Display the reproducibility checklist in the user interface&lt;/li>
&lt;li>Store responses and comments to the checklist as provided by the user&lt;/li>
&lt;li>Generate a reproducibility checklist report from StatWrap&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for Chip Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/openroad/openroad/</link><pubDate>Mon, 22 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/openroad/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit project, originally funded by DARPA, with the aim of creating open-source EDA tools: an autonomous RTL-to-GDSII flow that completes in &amp;lt; 24 hrs, to lower cost and boost innovation in IC design. This project is now supported by &lt;a href="precisioninno.com">Precision Innovations&lt;/a>.&lt;/p>
&lt;p>OpenROAD scales massively, supports EWD (Education and Workforce Development), and sustains a broad ecosystem, making it a vital tool for the rapidly growing Semiconductor Industry.&lt;/p>
&lt;p>OpenROAD is the fastest onramp to gain knowledge, skills and create pathways for great career opportunities in chip design. You will develop important software and hardware design skills by contributing to these interesting projects. You will also have the opportunity to work with mentors from the OpenROAD project and other industry experts.&lt;/p>
&lt;p>We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact in the rapidly growing, global Semiconductor Industry.&lt;/p>
&lt;h3 id="create-openroad-tutorials-and-videos">Create OpenROAD Tutorials and Videos&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Documentation&lt;/code>, &lt;code>Tutorials&lt;/code>, &lt;code>Videos&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Video/audio recording and editing, training and education&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create short videos for training and course curriculum highlighting key features and flows in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>.&lt;/p>
&lt;h3 id="improve-the-openroad-autotuner-flow-and-documentation">Improve the OpenROAD AutoTuner Flow and documentation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>AutoTuner&lt;/code>, &lt;code>Design Exploration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of ML for hyperparameter tuning, Cloud-based computation, Basic VLSI design and tools knowledge, python, C/C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test, analyze and enhance the &lt;a href="https://openroad-flow-scripts.readthedocs.io/en/latest/user/InstructionsForAutoTuner.html" target="_blank" rel="noopener">AutoTuner&lt;/a> to improve usability, documentation and QoR. The AutoTuner is an important tool in the OpenROAD flow, &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>, for chip design exploration that significantly reduces design time. You will use state-of-the-art ML tools to test the current tool exhaustively for good PPA (performance, power, area) results. You will also update existing documentation to reflect any changes to the tool and flow.&lt;/p>
&lt;h3 id="implement-a-memory-compiler-in-the-openroad-flow">Implement a memory compiler in the OpenROAD Flow&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>Memory Compiler&lt;/code>,&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Basic VLSI design and tools knowledge, python, tcl, C/C++, memory design a plus&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/austin-rovinski/">Austin Rovinski&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement a memory compiler as part of the OpenROAD flow to improve the placement and layout efficiency of large, memory-intensive designs. You will start with an existing code base to develop this feature: &lt;a href="https://github.com/The-OpenROAD-Project-staging/OpenROAD/tree/dffram" target="_blank" rel="noopener">https://github.com/The-OpenROAD-Project-staging/OpenROAD/tree/dffram&lt;/a>
Another option is &lt;a href="https://github.com/AUCOHL/DFFRAM" target="_blank" rel="noopener">https://github.com/AUCOHL/DFFRAM&lt;/a>.
Enhance the code to add DFFRAM support to the OpenROAD native flow, &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>.&lt;/p>
&lt;h3 id="integrate-a-tcl-and-python-linter">Integrate a tcl and python linter&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Linting&lt;/code>, &lt;code>Workflow&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: tcl, python, linting&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Easy&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Small (90 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/austin-rovinski/">Austin Rovinski&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integrate a Tcl and Python linter for the tools in OpenROAD and &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> to enforce error checking, style, and best practices.&lt;/p>
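&lt;p>The integration might be sketched as a small dispatch wrapper. The tools named below (flake8 for Python, Nagelfar for Tcl) are real linters but only plausible choices here, not decisions the project has made:&lt;/p>

```python
import subprocess
from pathlib import Path

# Hypothetical dispatch table: flake8 (Python) and Nagelfar (Tcl) exist,
# but the actual tools and flags would be chosen with the mentors;
# treat these commands as placeholders.
LINTERS = {
    ".py": ["flake8"],
    ".tcl": ["nagelfar"],
}

def lint(path: str) -> int:
    """Run the linter matching the file extension; return its exit code."""
    cmd = LINTERS.get(Path(path).suffix)
    if cmd is None:
        raise ValueError(f"no linter configured for {path}")
    return subprocess.run(cmd + [path]).returncode
```

&lt;p>In practice this would likely run as a pre-commit hook or CI job so that style violations are caught before review.&lt;/p>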
&lt;h3 id="llm-assistant-for-openroad---create-model-architecture-and-prototype">LLM assistant for OpenROAD - Create Model Architecture and Prototype&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Model Architecture&lt;/code>, &lt;code>Model Deployment&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, prompt engineering, fine-tuning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project involves the creation of a conversational assistant designed around &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a> to answer user queries. You will be working in tandem with members of the OpenROAD team and other researchers to deliver a final deployable prototype. You will focus on the design and implementation of modular LLM architectures, experimenting with different architectures and justifying which approach works best on our domain-specific data. Open to proposals from ML practitioners of all levels.&lt;/p>
&lt;h3 id="llm-assistant-for-openroad---data-engineering-and-testing">LLM assistant for OpenROAD - Data Engineering and testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Data Engineering&lt;/code>, &lt;code>Model Deployment&lt;/code>, &lt;code>Testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, prompt engineering, fine-tuning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project involves the creation of a conversational assistant designed around &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a> to answer user queries. You will be working in tandem with members of the OpenROAD team and other researchers to deliver a final deployable prototype. This project focuses on the data engineering portion, which may include training pipelines tailored for fine-tuning LLMs, data annotation, preprocessing, and augmentation. Open to proposals from ML practitioners of all levels.&lt;/p>
&lt;h3 id="create-unit-tests-for-openroad-tools">Create Unit tests for OpenROAD tools&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>unit testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Basic VLSI design and tools knowledge, python, tcl, C/C++, Github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>You will build unit tests for specific features of the OpenROAD tool, which will become part of the regression test suite. Here is an example of a test for UPF support: &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD/blob/master/test/upf/mpd_aes.upf" target="_blank" rel="noopener">https://github.com/The-OpenROAD-Project/OpenROAD/blob/master/test/upf/mpd_aes.upf&lt;/a>.
This is a great way to learn VLSI flow basics and the art of testing them for practical applications.&lt;/p></description></item><item><title>StatTag: Connecting statistical software to Microsoft Word</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/stattag/</link><pubDate>Mon, 22 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/northwestern/stattag/</guid><description>&lt;p>StatTag is a free, &lt;a href="https://github.com/stattag" target="_blank" rel="noopener">open-source&lt;/a> software plug-in for conducting reproducible research. It facilitates the creation of dynamic documents using Microsoft Word documents and statistical software, such as Stata, SAS, R, and Python. Users can use StatTag to embed statistical output (estimates, tables and figures) into a Word document and then with one click individually or collectively update output with a call to the statistical program.&lt;/p>
&lt;p>What makes StatTag different from other tools for creating dynamic documents is that it allows for statistical code to be edited directly from Microsoft Word. Using StatTag means that modifications to a dataset or analysis no longer require transcribing or re-copying results into a manuscript or table.&lt;/p>
&lt;p>StatTag works by interpreting specially formatted comments (&amp;ldquo;tags&amp;rdquo;) within a code file. StatTag then reads the code file, executes the code through the corresponding language interpreter, formats the results, and inserts them into the Word document as a field.&lt;/p>
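&lt;p>The tag mechanism can be sketched in a few lines. The tag syntax below is invented for illustration and is not StatTag's actual format:&lt;/p>

```python
import re

# Illustrative tag syntax only: StatTag's real tag format differs. The
# point is the mechanism: scan a code file for specially formatted
# comments and pair each tag with the statement it annotates.
TAG_RE = re.compile(r"#\s*>>>TAG:(\w+)")

def find_tags(code: str) -> dict:
    """Map each tag name to the line of code following its tag comment."""
    tags = {}
    lines = code.splitlines()
    for i, line in enumerate(lines):
        m = TAG_RE.search(line)
        if m and i + 1 < len(lines):
            tags[m.group(1)] = lines[i + 1].strip()
    return tags
```

&lt;p>StatTag itself then executes the tagged code through the language's interpreter and inserts the formatted result into Word as a field.&lt;/p>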
&lt;p>There are versions of StatTag for both Microsoft Windows and macOS. Proposed projects here are specific to the Microsoft Windows version, which is developed in the C# programming language.&lt;/p>
&lt;p>&lt;strong>Additional Information:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://sites.northwestern.edu/stattag/" target="_blank" rel="noopener">StatTag homepage&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/stattag" target="_blank" rel="noopener">StatTag on GitHub&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://pubmed.ncbi.nlm.nih.gov/33215069/" target="_blank" rel="noopener">Welty et al., &amp;ldquo;Facilitating reproducible research through direct connection of data analysis with manuscript preparation: StatTag for connecting statistical software to Microsoft Word&amp;rdquo;&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="support-additional-programming-languages">Support Additional Programming Languages&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>reproducibility&lt;/code>, &lt;code>statistics&lt;/code>&lt;/li>
&lt;li>Skills: C# and one of: MATLAB, Octave, SQL, Julia&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Medium or large (175 or 350 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Following the same structure used for other language support in StatTag, develop support for a new programming language (suggested languages are provided, but applicants can propose others). This will include:&lt;/p>
&lt;ul>
&lt;li>Creating a Parser class to support StatTag-specific interpretation of results (e.g., identifying a line of code that is writing to a CSV file, then loading that CSV file)&lt;/li>
&lt;li>Creating an Automation class that manages communication with the supported programming language&amp;rsquo;s interpreter. Python support uses a Jupyter kernel, and both SAS and Stata support invoke DLLs directly.&lt;/li>
&lt;li>Integrating the language into the UI (e.g., allowing it to be a valid code file, adding the icon for the code file to the UI)&lt;/li>
&lt;li>Additional setup/configuration as needed (e.g., SQL support would require secure configuration for connecting to the database server).&lt;/li>
&lt;/ul>
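&lt;p>A Parser along the lines described above might look like the following sketch (in Python rather than StatTag's C#, with an assumed MATLAB-style write call; the class and pattern are illustrative, not StatTag's API):&lt;/p>

```python
import re

class CsvWriteParser:
    """Sketch of a Parser: spot lines that write a CSV so the result
    file can be loaded back and embedded into the Word document.
    The pattern assumes a MATLAB-like writematrix call and is purely
    illustrative."""
    WRITE_RE = re.compile(r"writematrix\([^,]+,\s*'([^']+\.csv)'\)")

    def output_file(self, line: str):
        """Return the CSV path a line writes to, or None."""
        m = self.WRITE_RE.search(line)
        return m.group(1) if m else None
```

&lt;p>The matching Automation class would then hand the code to the language's interpreter and collect the CSV the Parser identified.&lt;/p>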
&lt;p>Develop unit tests to demonstrate code is functioning. Create test scripts in the implemented language to exercise and demonstrate end-to-end execution.&lt;/p>
&lt;h3 id="process-tags-in-jupyter-notebooks">Process Tags in Jupyter Notebooks&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>reproducibility&lt;/code>, &lt;code>jupyter&lt;/code>&lt;/li>
&lt;li>Skills: C#, Jupyter Notebooks, Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Medium (175 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luke-rasmussen/">Luke Rasmussen&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>StatTag currently supports Python and uses the Jupyter kernel to interact with it. However, we do not yet fully support processing StatTag &amp;lsquo;tags&amp;rsquo; in a Jupyter notebook.&lt;/p>
&lt;p>Following the same structure used for RMarkdown integration in StatTag, develop support for Jupyter Notebooks in StatTag. StatTag should be able to:&lt;/p>
&lt;ul>
&lt;li>Take as input one or more Jupyter Notebooks&lt;/li>
&lt;li>Confirm that the Jupyter Notebook uses Python&lt;/li>
&lt;li>Identify StatTag formatted tags within the notebook&lt;/li>
&lt;li>Pass relevant code to the Python processor already implemented in StatTag&lt;/li>
&lt;/ul>
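&lt;p>The notebook-handling steps above might be sketched as follows; the tag marker is a placeholder, not StatTag's real syntax:&lt;/p>

```python
import json

TAG_MARK = "# >>>STATTAG:"  # illustrative marker, not StatTag's real syntax

def tagged_cells(notebook_json: str) -> list:
    """Return the source of code cells containing a tag marker, after
    confirming the notebook uses a Python kernel."""
    nb = json.loads(notebook_json)
    lang = nb.get("metadata", {}).get("kernelspec", {}).get("language")
    if lang != "python":
        raise ValueError("notebook does not use a Python kernel")
    return [
        "".join(cell["source"])
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
        and any(TAG_MARK in line for line in cell["source"])
    ]
```

&lt;p>The extracted cells could then be passed to StatTag's existing Python processor, mirroring the RMarkdown integration.&lt;/p>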
&lt;p>In addition, develop unit tests to demonstrate code is functioning as intended. Create test Jupyter Notebooks to exercise and demonstrate end-to-end execution.&lt;/p></description></item><item><title>AIIO / Graph Neural Network</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/</link><pubDate>Wed, 17 Jan 2024 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/aiio/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/aiio" target="_blank" rel="noopener">AIIO&lt;/a> revolutionizes the way users automatically tune the I/O performance of applications on HPC systems. It currently relies on linear regression models but has opportunities to incorporate heterogeneous data, such as program information. This requires extending the linear regression model to more complex models, such as heterogeneous graph neural networks. The proposed work includes developing a graph neural network-based model to predict and interpret I/O performance.&lt;/p>
&lt;h3 id="aiio--graph-neural-network">AIIO / Graph Neural Network&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>AIIO/Graph Neural Network&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Github, Machine Learning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, Suren Byna&lt;/li>
&lt;/ul>
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Develop the data pre-processing pipeline to convert I/O logs into the formats required by the graph neural network&lt;/li>
&lt;li>Build and test the Graph Neural Network to model the I/O performance for HPC applications.&lt;/li>
&lt;li>Test and evaluate the accuracy of the Graph Neural Network with test cases from AIIO&lt;/li>
&lt;/ul></description></item><item><title>FasTensor / Stream Processing</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/fastensor/</link><pubDate>Wed, 17 Jan 2024 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://github.com/BinDong314/FasTensor" target="_blank" rel="noopener">FasTensor&lt;/a> is a generic tensor processing engine with scalability from single nodes to thousands of nodes on HPC. FasTensor supports applications ranging from traditional SQL queries to complex DFT solvers in scientific applications. It has a 1000X performance advantage over MapReduce and Spark in supporting generic data processing functions on tensor structures. In this project, we propose to expand FasTensor with streaming functionality to support online data processing. Specifically, participants in this project will develop a stream endpoint for retrieving live data output from applications, such as DAS. The stream endpoint maintains a pointer into the data, which could be an n-dimensional subset of a tensor.&lt;/p>
&lt;h3 id="fastensor--stream-processing">FasTensor / Stream Processing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>FasTensor/Streaming Processing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, GitHub&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The specific tasks of the project include:&lt;/p>
&lt;ul>
&lt;li>Building a mock workflow based on our DAS application (&lt;a href="https://github.com/BinDong314/DASSA" target="_blank" rel="noopener">https://github.com/BinDong314/DASSA&lt;/a>) to test stream processing. The mock workflow comprises a data producer, which generates DAS data, and a data consumer, which processes the data.&lt;/li>
&lt;li>Developing a stream endpoint (e.g., an I/O driver) to iteratively read dynamically growing data from a directory. The stream endpoint essentially includes open, read, and write functions, plus a pointer that tracks the current file position.&lt;/li>
&lt;li>Integrating the Stream Endpoint into the FasTensor library.&lt;/li>
&lt;li>Evaluating the performance of the mock workflow with the new Stream Endpoint.&lt;/li>
&lt;li>Documenting the execution mechanism.&lt;/li>
&lt;/ul></description></item><item><title>SLICES/pos: Reproducible Experiment Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tum/slices/</link><pubDate>Sat, 06 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/tum/slices/</guid><description>&lt;p>&lt;a href="https://www.slices-ri.eu/" target="_blank" rel="noopener">SLICES-RI&lt;/a> is a European research initiative aiming to create a digital research infrastructure providing an experimental platform for the upcoming decades.
One of the main goals of this initiative is the creation of fully reproducible experiments.
The SLICES research infrastructure will consist of different experiment sites focusing on different research domains such as AI experiments, Cloud and HPC-driven experiments, or investigations on wireless networks.&lt;/p>
&lt;p>To achieve reproducibility, the research group on network architectures and services of the Technical University of Munich develops the &lt;a href="https://dl.acm.org/doi/10.1145/3485983.3494841" target="_blank" rel="noopener">SLICES plain orchestrating service (SLICES/pos)&lt;/a>.
This framework supports a fully automated structured experiment workflow.
The structure of this workflow acts as a template for the design of experiments.
Users that adhere to this template will create inherently reproducible experiments, a feature we call reproducible-by-design.&lt;/p>
&lt;p>The SLICES/pos framework currently exists in two versions:
(1) a fully managed pos deployment, which uses the SLICES/pos framework to manage the entire testbed, and (2) a hosted SLICES/pos deployment.
The hosted SLICES/pos deployment is a temporary deployment that runs inside existing testbeds such as &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a> or &lt;a href="https://cloudlab.us/" target="_blank" rel="noopener">CloudLab&lt;/a>.&lt;/p>
&lt;p>&lt;strong>Additional Information:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://dl.acm.org/doi/10.1145/3485983.3494841" target="_blank" rel="noopener">plain orchestrating service&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="support-additional-programming-languages">Support Additional Programming Languages&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>reproducibility&lt;/code>, &lt;code>statistics&lt;/code>&lt;/li>
&lt;li>Skills: Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Large (350 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sebastian-gallenmuller/">Sebastian Gallenmüller&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/georg-carle/">Georg Carle&lt;/a>, and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kate-keahey/">Kate Keahey&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Design a set of basic examples that demonstrate the usage of pos that can be executed on the SLICES/pos testbed in Munich and the Chameleon testbed.
This set of basic examples acts as a demonstration of pos&amp;rsquo; capabilities and as a tutorial for new users.
Based on these introductory examples, a more complex experiment shall be designed and executed, demonstrating the portability of the experiments between testbeds.
This experiment involves the entire experiment workflow consisting of the setup and configuration of the testbed infrastructure, the collection of measurement results, and finally, their evaluation and publication.
Multiple results of this experiment shall be created on different testbeds and hardware configurations.
The results of the experiments will differ depending on the different hardware platforms on which the experiment was executed.
These results shall be evaluated and analyzed to find a common connection between the different result sets of the experiments.&lt;/p>
&lt;ul>
&lt;li>Create introductory examples demonstrating the usage of pos&lt;/li>
&lt;li>Design and create a portable complex network experiment based on SLICES/pos&lt;/li>
&lt;li>Execute the experiment on different testbeds (Chameleon, SLICES/pos testbed)&lt;/li>
&lt;li>Analyze the reproduced experiments&lt;/li>
&lt;li>Automate the analysis of experimental results&lt;/li>
&lt;li>Derive a model describing the fundamental connections between different experiment executions&lt;/li>
&lt;/ul></description></item><item><title>Static Python Perf: Measuring the Cost of Sound Gradual Types</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uutah/static-python-perf/</link><pubDate>Sat, 06 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uutah/static-python-perf/</guid><description>&lt;p>Gradual typing is a solution to the longstanding tension between typed and
untyped languages: let programmers write code in any flexible language (such
as Python), equip the language with a suitable type system that can describe
invariants in part of a program, and use run-time checks to ensure soundness.&lt;/p>
&lt;p>For now, though, the cost of run-time checks can be enormous.
Order-of-magnitude slowdowns are common. This high cost is a main reason why
TypeScript is unsound by design &amp;mdash; its types are not trustworthy in order
to avoid run-time costs.&lt;/p>
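&lt;p>A minimal sketch of such run-time checking (not Static Python's actual mechanism) shows where the overhead comes from: every call through a typed boundary pays for the checks.&lt;/p>

```python
import functools
import inspect

def checked(**types):
    """Decorator that re-checks the named arguments on every call: the
    kind of run-time enforcement sound gradual typing requires, and the
    source of its overhead. A sketch, not Static Python's mechanism."""
    def deco(fn):
        sig = inspect.signature(fn)
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            bound = sig.bind(*args, **kwargs)
            for name, expected in types.items():
                if name in bound.arguments and not isinstance(
                        bound.arguments[name], expected):
                    raise TypeError(f"{name} must be {expected.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return deco
```

&lt;p>Run in a hot loop, the signature binding and isinstance checks dominate, which is how order-of-magnitude slowdowns arise in naive implementations.&lt;/p>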
&lt;p>Recently, a team at Meta built a gradually-typed variant of Python called
(&lt;em>drumroll&lt;/em>) Static Python. They report an incredible 4% increase in CPU
efficiency at Instagram thanks to the sound types in Static Python. This
kind of speedup is unprecedented.&lt;/p>
&lt;p>Other languages may want to follow the Static Python approach to gradual types,
but there are big reasons to doubt the Instagram numbers:&lt;/p>
&lt;ul>
&lt;li>the experiment code is closed source, and&lt;/li>
&lt;li>the experiment itself is not easily reproducible (even for Instagram!).&lt;/li>
&lt;/ul>
&lt;p>Static Python needs a rigorous, reproducible performance evaluation to test
whether it is indeed a fundamental advance for gradual typing.&lt;/p>
&lt;p>Related Work:&lt;/p>
&lt;ul>
&lt;li>Gradual Soundness: Lessons from Static Python
&lt;a href="https://programming-journal.org/2023/7/2/" target="_blank" rel="noopener">https://programming-journal.org/2023/7/2/&lt;/a>&lt;/li>
&lt;li>Producing Wrong Data Without Doing Anything Obviously Wrong!
&lt;a href="https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf" target="_blank" rel="noopener">https://users.cs.northwestern.edu/~robby/courses/322-2013-spring/mytkowicz-wrong-data.pdf&lt;/a>&lt;/li>
&lt;li>On the Cost of Type-Tag Soundness
&lt;a href="https://users.cs.utah.edu/~blg/resources/pdf/gm-pepm-2018.pdf" target="_blank" rel="noopener">https://users.cs.utah.edu/~blg/resources/pdf/gm-pepm-2018.pdf&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="design-and-run-an-experiment">Design and Run an Experiment&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>performance&lt;/code>, &lt;code>cluster computing&lt;/code>, &lt;code>statistics&lt;/code>&lt;/li>
&lt;li>Skills: Python AST parsing, program generation, scripting, measuring performance&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Medium (175 hours)&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Design an experiment that covers the space of gradually-typed Static Python programs
in a fair way. Since every variable in a program can have up to 3 different types,
there are easily 3^20 possibilities even in small programs &amp;mdash; far too many to measure
exhaustively.&lt;/p>
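&lt;p>A small sketch of the sampling problem (with invented level names) shows why exhaustive measurement is out of reach and how random sampling could cover the space instead:&lt;/p>

```python
import random

# Three assumed typing levels per variable; the labels are illustrative.
TYPE_LEVELS = ("untyped", "tag-checked", "fully-typed")

def config_count(n_vars: int) -> int:
    """Size of the full configuration space: 3**n for n variables."""
    return len(TYPE_LEVELS) ** n_vars

def sample_configs(n_vars: int, k: int, seed: int = 0) -> list:
    """Draw k random typing configurations instead of enumerating all."""
    rng = random.Random(seed)
    return [tuple(rng.choice(TYPE_LEVELS) for _ in range(n_vars))
            for _ in range(k)]
```

&lt;p>The experimental design question is then how to draw samples so that the measured subset fairly represents the whole space.&lt;/p>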
&lt;p>Run the experiment on an existing set of benchmarks using a cluster such as CloudLab.
Manage the cluster machines across potentially dozens of reservations and combine
the results into one comprehensive view of Static Python performance.&lt;/p>
&lt;h3 id="derive-benchmarks-from-python-applications">Derive Benchmarks from Python Applications&lt;/h3>
&lt;ul>
&lt;li>Topics: &lt;code>types&lt;/code>, &lt;code>optimization&lt;/code>, &lt;code>benchmark design&lt;/code>&lt;/li>
&lt;li>Skills: Python&lt;/li>
&lt;li>Difficulty: Medium&lt;/li>
&lt;li>Size: Small to Large&lt;/li>
&lt;li>Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ben-greenman/">Ben Greenman&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build or find realistic Python applications, equip them with rich types,
and modify them to run a meaningful performance benchmark. Running a benchmark
should produce timing information, and the timing should not be significantly
influenced by random variables, I/O actions, or system events.&lt;/p></description></item><item><title>PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="polyphy-web-presence">PolyPhy Web Presence&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>UX&lt;/code> &lt;code>Social Media&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> full stack web development, Javascript, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The online presentation of a software project is without a doubt one of the core ingredients of its success. This project aims to develop a sustainable web presence for PolyPhy, catering to interested contributors, active collaborators, and users alike.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Closely work with the mentors on understanding the context of the project and its detailed requirements in preparation of the proposal.&lt;/li>
&lt;li>Port the existing &lt;a href="https://polyphy.io" target="_blank" rel="noopener">website&lt;/a> into a more modern Javascript framework (such as Next.js) that provides a user-friendly CMS and admin interface.&lt;/li>
&lt;li>Update the contents of the website with new information from the &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">repository page&lt;/a> as well as other sources as directed by the mentors.&lt;/li>
&lt;li>Develop a simple functional system for posting updates about the project to selected social media and other communication platforms (LinkedIn, Twitter/X or Mastodon, mailing list) which will also be reflected on the website.&lt;/li>
&lt;li>Optional: improve the UX of the website where needed.&lt;/li>
&lt;li>Optional: implement website analytics (visitor stats etc).&lt;/li>
&lt;/ul>
&lt;h3 id="data-visualization-and-analysis-with-polyphypolyglot">Data Visualization and Analysis with PolyPhy/Polyglot&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Data Science&lt;/code> &lt;code>Data Visualization&lt;/code> &lt;code>Point Clustering&lt;/code> &lt;code>3D&lt;/code> &lt;code>Neural Embeddings&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> data science, Python, Javascript, statistics, familiarity with AI and latent embedding spaces a big plus&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350+ hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kiran-deol/">Kiran Deol&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The aim of this project is to explore a novel data-scientific use case using PolyPhy and its associated web visualization interface &lt;a href="https://github.com/PolyPhyHub/PolyGlot" target="_blank" rel="noopener">PolyGlot&lt;/a>. The contributor is expected to identify a dataset they are already well familiar with and that fits the application scope of the PolyPhy/PolyGlot tooling: a complex point cloud arising from a 3D or higher-dimensional process that will benefit from latent pattern identification and subsequent visual as well as quantitative analysis. The contributor must have the rights to use the dataset, either by owning the copyright or because the data is openly licensed.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Closely work with the mentors on understanding the context of the project and its detailed requirements in preparation of the proposal.&lt;/li>
&lt;li>Become acquainted with the tooling (PolyPhy, PolyGlot) prior to the start of the project period.&lt;/li>
&lt;li>Document the nature of the target dataset and define the complete data pipeline with assistance of the mentors, including the specific analytic tasks and objectives.&lt;/li>
&lt;li>Implement the data pipeline in PolyPhy and PolyGlot.&lt;/li>
&lt;li>Document the process and resulting findings in a publicly available report.&lt;/li>
&lt;/ul></description></item><item><title>OSRE Catalyst</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/catalyst/</link><pubDate>Thu, 23 Mar 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/catalyst/</guid><description>&lt;p>Contributing to an open source project is a great way to build a technical portfolio, learn industry tools/practices, and have real-world impact – all while embedded in a collaborative community. The UC Santa Cruz Open Source Program Office (OSPO) wants to support more students on this path, especially those who have been minoritized in tech. We are partnering with an HBCU for a pilot summer program offering, with hopes to expand our reach in 2024.&lt;/p>
&lt;p>Through a hybrid (in-person/remote) model, participating students will spend four weeks on the UCSC campus learning about open source, followed by four weeks remotely contributing to an open source project. Participants will be well-supported by our instructional team, as well as their small peer cohort, through community-building and mentorship spanning the full eight weeks.&lt;/p>
&lt;h3 id="pilot-program-mentor--developer">Pilot Program Mentor &amp;amp; Developer&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Education&lt;/code>, &lt;code>Broadening Participation&lt;/code>, &lt;code>Mentorship and Support&lt;/code>, &lt;code>Community&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> communication, organization, GitHub/Markdown, basic web programming (HTML, CSS, JavaScript), open source contribution, version control/git workflow, mentorship, teaching&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Novice to Intermediate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a>, &lt;a href="mailto:davis@soe.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Given that this is a program pilot, your involvement and feedback will directly help shape its future!&lt;/p>
&lt;p>Possible tasks:&lt;/p>
&lt;ul>
&lt;li>Help cultivate a welcoming and supportive learning community&lt;/li>
&lt;li>Support students in completing hands-on activities related to open source contribution (e.g. evaluating potential projects/communities, using git, setting up a development environment)&lt;/li>
&lt;li>Develop technology-specific tutorials to introduce students to languages/libraries/etc. employed by their project&lt;/li>
&lt;li>Offer mentorship around how to navigate documentation, large codebases, and contributor communities&lt;/li>
&lt;li>Share your own input and perspective on what it&amp;rsquo;s like to be a newcomer to open source!&lt;/li>
&lt;/ul></description></item><item><title>eBPF Monitoring Tools</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lanl/ebpftools/</link><pubDate>Tue, 21 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lanl/ebpftools/</guid><description>&lt;p>&lt;a href="https://ebpf.io" target="_blank" rel="noopener">eBPF&lt;/a> is a technology that allows sandboxed programs to run in a privileged context such as a Linux kernel. eBPF is for operating systems what JavaScript is for web browsers: new functionality can be safely loaded and executed efficiently without restarting or continually upgrading the operating system or browser. eBPF is used to introduce new functionality into a running Linux kernel, including next-generation networking, observability, and security functionality. The following is just one idea of many possible.&lt;/p>
&lt;h3 id="implement-darshan-functionality-as-ebpf-tool">Implement Darshan functionality as eBPF tool&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> performance, I/O, workload characterization&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:treddy@lanl.gov">Tyler Reddy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://www.mcs.anl.gov/research/projects/darshan/" target="_blank" rel="noopener">Darshan&lt;/a> is an HPC I/O characterization tool that collect statistics using a lightweight design that makes it suitable for full time deployment. Darshan is an interposer library that catches and counts IO requests (open, write, read, etc.) to a file/file system and it keeps the counters in buckets in data structure that can be queried. How many reads of small size, medium size, large size) for example are the types of things that are counted.&lt;/p>
&lt;p>Being an interposer library requires users to link their application against Darshan. Implementing the same functionality in eBPF would make it transparent to users. Darshan could provide the list of functions to implement, and the programmer could build and test these functions in eBPF on a Linux machine. This could be a broadly available open tool that would be generally useful, and but one of perhaps hundreds of eBPF-based tools that could live in the open community for all to leverage.&lt;/p></description></item><item><title>Reproducible Evaluation of Multipath Network Protocols</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</link><pubDate>Thu, 16 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/farmingdale/multipath/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a>&lt;/p>
&lt;p>As mobile devices with dual WiFi and cellular interfaces become widespread, network protocols have been developed that utilize the availability of multiple paths. However, the relative effectiveness of these protocols is highly dependent on the characteristics of the network (including the relationship between the two paths, which are often not independent). Researchers typically evaluate a multipath protocol for a small set of network scenarios, which vary from one publication to the next. It is therefore difficult to get a good picture of how different protocols perform in a range of settings.&lt;/p>
&lt;h3 id="framework-for-repeatable-direct-comparison-of-multipath-transport-protocols">Framework for repeatable, direct comparison of multipath transport protocols&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, wireless systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, networking, data analysis and visualization, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Large&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="mailto:aydini@farmingdale.edu">Ilknur Aydin&lt;/a> and &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In single-path congestion control, the &lt;a href="https://pantheon.stanford.edu/" target="_blank" rel="noopener">Pantheon&lt;/a> work created a reference set of executable benchmarks that researchers could use to evaluate novel congestion control designs against existing work in a wide range of scenarios. This project seeks to achieve something similar for multipath protocols, using publicly available networking testbeds like &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>. For this project, the participant will:&lt;/p>
&lt;ul>
&lt;li>Prepare a set of network benchmarks for multipath protocols, using live network links, real link traces, and emulated scenarios&lt;/li>
&lt;li>Develop an experiment using the benchmarks to evaluate existing multipath protocol implementations&lt;/li>
&lt;li>Prepare materials that researchers can use to evaluate novel multipath protocols against the others in the benchmark&lt;/li>
&lt;/ul></description></item><item><title>Proactive Data Containers (PDC)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc/</link><pubDate>Sun, 12 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/pdc/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/pdc/about.html" target="_blank" rel="noopener">Proactive Data Containers&lt;/a> (PDC) are containers within a locus of storage (memory, NVRAM, disk, etc.) that store science data in an object-oriented manner. Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning.&lt;/p>
&lt;h3 id="command-line-and-python-interface-to-an-object-centric-data-management-system">Command line and python interface to an object-centric data management system&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Python&lt;/code>, &lt;code>object-centric data management&lt;/code>, &lt;code>PDC&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, C, Python&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/houjun-tang/">Houjun Tang&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://github.com/hpc-io/pdc" target="_blank" rel="noopener">Proactive Data Containers (PDC)&lt;/a> is an object-centric data management system for scientific data on high performance computing systems. It manages objects and their associated metadata within a locus of storage (memory, NVRAM, disk, etc.). Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning. This project includes developing and updating efficient and user friendly command line and Python interfaces for PDC.&lt;/p></description></item><item><title>Is Reproducibility Enough? Understanding the Impact of Missing Settings in Artifact Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings/</guid><description>&lt;p>While Artifact Evaluation tries to ensure that the evaluation results in a paper are reproducible, it leaves one question: How about experiment settings NOT reported by the paper? Such “missing settings” may create multiple problems: 1) sometimes the artifacts simply do not work under these missing settings, creating problems when a later work needs to compare to an earlier work under these settings; 2) sometimes the artifacts do not perform well under these missing settings, which may create a bias during the evaluation; 3) to improve the artifact to work under these missing settings, sometimes one needs to re-design the system, which may change the results of the original experiments.&lt;/p>
&lt;p>In this project, we plan to understand the impact of this problem: On the necessity side, how would these missing settings affect the conclusions of the original work? On the feasibility side, how much effort does it require to carry out extensive experiments? We plan to answer these questions by reproducing prior works, running them on popular settings that are not reported by these works, and fixing problems if any.&lt;/p>
&lt;h3 id="measuring-research-prototypes-under-unreported-settings">Measuring Research Prototypes under Unreported Settings&lt;/h3>
&lt;p>&lt;strong>Topics:&lt;/strong> reproducibility, databases, key-value stores, DNN training&lt;br>
&lt;strong>Skills:&lt;/strong> Java/Python, Linux, TPC/YCSB&lt;br>
&lt;strong>Difficulty:&lt;/strong> Medium&lt;br>
&lt;strong>Size:&lt;/strong> 350 hours&lt;br>
&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a>&lt;br>
&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/xueyuan-ren/">Xueyuan Ren&lt;/a>&lt;/p>
&lt;p>The student will first pick one or a few systems she is interested in, then try to reproduce their reported results. If successful, she will further try to measure these systems under previously unreported settings. During this procedure, she will need to diagnose and fix any problems that may show up. Finally, she will analyze whether the original conclusions still hold under these new settings and whether fixing any problems changes the performance characteristics of the target systems.&lt;/p></description></item><item><title>OpenRAM</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram/</link><pubDate>Wed, 08 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/openram/</guid><description>&lt;p>&lt;a href="https://github.com/VLSIDA/OpenRAM" target="_blank" rel="noopener">OpenRAM&lt;/a> is an award-winning open-source Python framework to create the layout, netlists, timing and power models, placement and routing models, and other views necessary to use SRAMs in ASIC design. OpenRAM supports integration in both commercial and open-source flows with both predictive and fabricable technologies. Most recently, it has created memories that are included on all of the &lt;a href="https://efabless.com/open_shuttle_program/" target="_blank" rel="noopener">eFabless/Google/Skywater MPW tape-outs&lt;/a>.&lt;/p>
&lt;h3 id="layout-verses-schematic-lvs-visualization">Layout verses Schematic (LVS) visualization&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>VLSI Design Basics&lt;/code>, &lt;code>Python&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, VLSI, JSON&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-cirimelli-low/">Jesse Cirimelli-Low&lt;/a>, &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mahnoor-ismail/">Mahnoor Ismail&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create a visualization interface to debug layout versus schematic mismatches in the &lt;a href="https://github.com/RTimothyEdwards/magic" target="_blank" rel="noopener">Magic&lt;/a> layout editor. Results will be parsed from the JSON output of &lt;a href="https://github.com/RTimothyEdwards/netgen" target="_blank" rel="noopener">Netgen&lt;/a>.&lt;/p></description></item><item><title>noWorkflow</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/noworkflow/</guid><description>&lt;p>The &lt;a href="https://github.com/gems-uff/noworkflow" target="_blank" rel="noopener">noWorkflow&lt;/a> project aims to allow scientists to benefit from provenance data analysis even when they don&amp;rsquo;t use a workflow system. It also aims to free them from relying on naming conventions to store files originating from previous executions; without such conventions, result and intermediate files are overwritten by every new execution of the pipeline.&lt;/p>
&lt;p>noWorkflow was developed in Python, and it is currently able to capture the provenance of Python scripts using software engineering techniques such as abstract syntax tree (AST) analysis, reflection, and profiling, collecting provenance without the need for a version control system or any other environment.&lt;/p>
&lt;p>At the time of this writing, the main version of noWorkflow is in the 2.0-alpha branch. We intend to release it before the summer.&lt;/p>
&lt;h3 id="verify-the-reproducibility-of-an-experiment">Verify the reproducibility of an experiment&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Reproducibility&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, SQL or SQLAlchemy ORM&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement an algorithm to compare the provenance from two (or more) trials (i.e., executions of an experiment) to check their reproducibility. The provenance stored in the relational (sqlite) database by noWorkflow 2 contains intermediate variable values from a trial. These values could be compared to check how much or where executions deviate from each other.&lt;/p>
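&lt;p>A minimal sketch of the comparison, assuming the intermediate values of each trial have already been loaded from noWorkflow&amp;rsquo;s SQLite database into plain dictionaries keyed by variable name (the loading step is omitted here):&lt;/p>

```python
# Hypothetical sketch: report where two trial executions deviate,
# given their recorded intermediate variable values.
def compare_trials(trial_a, trial_b):
    """Return {variable: (value_a, value_b)} for values that differ."""
    deviations = {}
    for name in sorted(set(trial_a).union(trial_b)):
        a, b = trial_a.get(name), trial_b.get(name)
        if a != b:
            deviations[name] = (a, b)
    return deviations

# Two executions of the same experiment script.
trial1 = {"seed": 42, "n": 1000, "mean": 0.5013}
trial2 = {"seed": 42, "n": 1000, "mean": 0.4987}
print(compare_trials(trial1, trial2))  # {'mean': (0.5013, 0.4987)}
```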
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Compare trials of the same script (Medium)&lt;/li>
&lt;li>Estimate how much one trial deviates from another (Medium)&lt;/li>
&lt;li>Consider different scripts and execution flows (Large)&lt;/li>
&lt;li>Indicate which parts of the scripts are not reproducible (Large)&lt;/li>
&lt;/ul>
&lt;h3 id="control-levels-of-provenance-collection">Control levels of provenance collection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Log experiments&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jesse-lima/">Jesse Lima&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Add support for different levels of provenance collection in noWorkflow 2. Currently, noWorkflow 2 collects Python construct evaluations and all the dependencies among the evaluations. However, this collection is inefficient, since some of the collected provenance may not be necessary for end users. This project aims to provide ways to temporarily disable provenance collection and to manually specify the provenance where collection is disabled.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Disable the collection inside specific functions (through decorators?)&lt;/li>
&lt;li>Disable the collection inside specific regions of the code (through with statements?)&lt;/li>
&lt;li>Collect only function activations in a region, instead of all variable dependencies&lt;/li>
&lt;li>Disable the collection of specific modules&lt;/li>
&lt;li>Design a DSL to express general dependencies for parts of the code where the collection is disabled&lt;/li>
&lt;/ul>
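&lt;p>The decorator and with-statement ideas above might look roughly like the following sketch; the names (&lt;code>no_provenance&lt;/code>, &lt;code>provenance_disabled&lt;/code>) are hypothetical and not part of noWorkflow today, and collection is modeled as a simple global flag that the real tracer would consult:&lt;/p>

```python
# Hypothetical user-facing API for toggling provenance collection.
import functools
from contextlib import contextmanager

_collecting = True

def is_collecting():
    """Would be consulted by the tracer before recording an evaluation."""
    return _collecting

@contextmanager
def provenance_disabled():
    """Context manager: disable collection for a region of code."""
    global _collecting
    previous = _collecting
    _collecting = False
    try:
        yield
    finally:
        _collecting = previous

def no_provenance(func):
    """Decorator: run func with provenance collection disabled."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with provenance_disabled():
            return func(*args, **kwargs)
    return wrapper

@no_provenance
def expensive_helper():
    return is_collecting()
```

Calling `expensive_helper()` returns `False` (collection is off inside the decorated function), while `is_collecting()` returns `True` again afterwards.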
&lt;h3 id="upgrade-noworkflow-collection-to-support-new-python-constructs">Upgrade noWorkflow collection to support new Python constructs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Log experiments&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joao-felipe-pimentel/">João Felipe Pimentel&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/juliana-freire/">Juliana Freire&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Implement new AST transformations for provenance collection. While noWorkflow 2 works for newer Python versions, most of its implementation was targeted at Python 3.7. Newer Python versions have introduced constructs whose provenance is currently ignored.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Identify which AST constructs are missing implementations&lt;/li>
&lt;li>Design AST transformations to execute functions before and after the evaluation of the constructs&lt;/li>
&lt;li>Create the dependencies for the new constructs&lt;/li>
&lt;/ul></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</guid><description>&lt;p>Scalable systems lay the essential foundations of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. As the name suggests, scalability bugs may manifest depending on the scale of a run, and thus their symptoms may only be observable in large-scale deployments, not in small or medium ones. For example, &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-6127" target="_blank" rel="noopener">Cassandra-6127&lt;/a> is a scalability bug detected in the popular distributed database Cassandra. The bug causes unnecessary CPU usage; however, the symptom is not observed unless ~1000 nodes are deployed. This demonstrates the main challenge of studying scalability bugs: they are extremely challenging to reproduce without deploying the system at a large scale.&lt;/p>
&lt;p>In this project, our goal is to build a dataset of &lt;strong>reproducible&lt;/strong> scalability bugs. To achieve this, we will go through the existing bug reports for popular distributed systems, which include Cassandra, HDFS, Ignite, and Kafka. For each bug report, we determine if the reported bug depends on the scale of the run, such as the number of nodes utilized. With the collected scale-dependent bugs, we then will craft the workload to reproduce those scalability bugs. Our workloads will be designed to trigger some functionalities of the system under different configurations (e.g., different numbers of nodes), for which we will observe the impact on performance. For example, a successful reproduction should be able to show the performance drop along with an increasing number of nodes.&lt;/p>
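&lt;p>The success criterion above can be made concrete with a small check over per-node throughput measured at different scales; the 20% threshold and the numbers below are made up for illustration:&lt;/p>

```python
# Illustrative check: a reproduction demonstrates a scalability bug if
# per-node throughput degrades as the cluster grows. Threshold and
# measurements are hypothetical.
def shows_scalability_bug(throughput_by_nodes, drop=0.2):
    """True if throughput at the largest scale dropped by more than
    `drop` relative to the smallest scale."""
    scales = sorted(throughput_by_nodes)
    small = throughput_by_nodes[scales[0]]
    large = throughput_by_nodes[scales[-1]]
    return small - large > drop * small

buggy_version = {3: 100.0, 100: 60.0}   # ops/sec per node at 3 and 100 nodes
fixed_version = {3: 100.0, 100: 95.0}
print(shows_scalability_bug(buggy_version), shows_scalability_bug(fixed_version))
# True False
```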
&lt;h3 id="building-a-dataset-of-reproducible-scalability-bugs">Building a Dataset of Reproducible Scalability Bugs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Scalability systems, bug patterns, reproducibility, bug dataset&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux Shell, Docker, Java, Python&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalability system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that could demonstrate the symptoms of the bug under different scales.&lt;/p>
&lt;h4 id="specific-tasks">Specific Tasks&lt;/h4>
&lt;ul>
&lt;li>Work with the mentors to understand the context of the project.&lt;/li>
&lt;li>Learn the background of scalability systems.&lt;/li>
&lt;li>Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.&lt;/li>
&lt;li>Craft shell scripts to trigger the exact scalability bug described by the bug report.&lt;/li>
&lt;li>Organize the reproducible scalability bugs and write documentation on how to build the code
and trigger the bug.&lt;/li>
&lt;/ul></description></item><item><title>Strengthening Underserved Segments of the Open Source Pipeline</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/sus/</guid><description>&lt;p>Contributing to an open source project offers novices the opportunity to join a community of practitioners, build a technical portfolio, gain experience with industry tools and technologies, and have real-world impact. This project seeks to invite and support broader, more diverse participation in open source by supporting &lt;em>early contributors&lt;/em> – especially those who have been historically minoritized within tech.&lt;/p>
&lt;p>This work builds upon a number of existing projects with similar or overlapping goals. Some examples:&lt;/p>
&lt;ul>
&lt;li>The &lt;a href="http://teachingopensource.org" target="_blank" rel="noopener">Teaching Open Source (TOS) community&lt;/a>, which brings together instructors teaching open source&lt;/li>
&lt;li>The &lt;a href="http://foss2serve.org/index.php/POSSE" target="_blank" rel="noopener">Professors&amp;rsquo; Open Source Software Experience (POSSE) workshops and wiki&lt;/a>, for faculty teaching - or wanting to teach - open source&lt;/li>
&lt;li>Internships such as &lt;a href="https://summerofcode.withgoogle.com" target="_blank" rel="noopener">Google Summer of Code (GSoC)&lt;/a>, &lt;a href="https://www.outreachy.org" target="_blank" rel="noopener">Outreachy&lt;/a>, and the &lt;a href="https://fellowship.mlh.io" target="_blank" rel="noopener">MLH Fellowship&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://campus.openhatch.org" target="_blank" rel="noopener">Open Source Comes to Campus&lt;/a>, offering student workshops on tools and culture &lt;em>[no longer active]&lt;/em>&lt;/li>
&lt;li>&lt;a href="https://codein.withgoogle.com/archive/" target="_blank" rel="noopener">Google Code-in&lt;/a>, inviting pre-university students to make open source contributions &lt;em>[no longer active]&lt;/em>&lt;/li>
&lt;/ul>
&lt;p>This project will investigate gaps in currently available resources/programs and seek to address them, beginning with the exploration of engaging high school students with open source. Depending on early findings, this project could also entail the development of resources for independent learners and/or mentors.&lt;/p>
&lt;h3 id="learning-resource-development--repository-building">Learning Resource Development + Repository-Building&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Education&lt;/code>, &lt;code>Broadening Participation&lt;/code>, &lt;code>Mentorship and Support&lt;/code>, &lt;code>Community Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> independent research, communication, organization, GitHub/Markdown, basic web programming (HTML, CSS, JavaScript)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Novice to Intermediate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/emily-lovell/">Emily Lovell&lt;/a>, &lt;a href="mailto:davis@soe.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/nandini-saagar/">Nandini Saagar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As an early contributor to this project, you will help gather information to inform the project direction – and then help bring it to life!&lt;/p>
&lt;p>Possible tasks:&lt;/p>
&lt;ul>
&lt;li>Meet with teachers and/or community members to identify new opportunities to engage with students (e.g. outside-of-school workshops, classroom visits, materials for teachers to use independently)&lt;/li>
&lt;li>Evaluate and test existing learning activities with a high school audience in mind (e.g. consider necessary pre-requisites, time required, ideal activity format)&lt;/li>
&lt;li>Evaluate and organize existing resources for newcomers (e.g. &lt;a href="https://up-for-grabs.net/#/" target="_blank" rel="noopener">Up For Grabs&lt;/a>, &lt;a href="https://hacktoberfest.com" target="_blank" rel="noopener">Hacktoberfest&lt;/a>, internship/fellowship opportunities)&lt;/li>
&lt;li>Help design and pilot new learning activities and/or workshops&lt;/li>
&lt;li>Assist in curating an open source repository of the aforementioned resources&lt;/li>
&lt;li>Conduct outreach to our target communities (e.g. brainstorm a catchy repository name, compose inviting and inclusive emails, design visual project elements)&lt;/li>
&lt;li>Share your own input and perspective on what it&amp;rsquo;s like to be a newcomer to open source!&lt;/li>
&lt;/ul></description></item><item><title>LabOP - an open specification for laboratory protocols that solves common interchange problems stemming from variations in scale, labware, instruments, and automation.</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop/</link><pubDate>Mon, 06 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/labop/</guid><description>
&lt;h3 id="project-idea-1-software-hardware-and-wetware-building-labop-with-simultaneous-language--protocol-development--test-executions">Project idea 1: Software, hardware, and wetware building LabOP with simultaneous language &amp;amp; protocol development &amp;amp; test executions&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Software standard development, Laboratory automation, Biology&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Semantic Web Technologies (RDF, OWL), interest to think about describing biological &amp;amp; chemical laboratory processes&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong>
&lt;ol>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a>&lt;/li>
&lt;/ol>
&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h4 id="about-the-laboratory-open-protocol-language-labop">About: The Laboratory Open Protocol Language (LabOP)&lt;/h4>
&lt;p>&lt;strong>See link: &lt;a href="https://bioprotocols.github.io/labop/" target="_blank" rel="noopener">https://bioprotocols.github.io/labop/&lt;/a>&lt;/strong>&lt;/p>
&lt;p>LabOP is an &lt;em>open&lt;/em> specification for laboratory protocols that solves common interchange problems stemming from variations in scale,
labware, instruments, and automation. LabOP was built from the ground up to support protocol interchange. It provides an extensible
library of protocol primitives that capture the control and data flow needed for everything from simple calibration and culturing protocols to
industrial control.&lt;/p>
&lt;h5 id="software-ecosystem">Software Ecosystem&lt;/h5>
&lt;p>LabOP&amp;rsquo;s rich representation underpins an ecosystem of several powerful software tools, including:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.github.com/bioprotocols/labop" target="_blank" rel="noopener">labop&lt;/a>: the Python LabOP library, which supports:
&lt;ul>
&lt;li>&lt;em>Programming&lt;/em> LabOP protocols in Python,&lt;/li>
&lt;li>&lt;em>Serialization&lt;/em> of LabOP protocols conforming to the LabOP RDF specification,&lt;/li>
&lt;li>&lt;em>Execution&lt;/em> in the native LabOP semantics (rooted in the UML activity model),&lt;/li>
&lt;li>&lt;em>Specialization&lt;/em> of protocols to 3rd-party protocol formats (including Autoprotocol, OpenTrons, and human-readable formats), and&lt;/li>
&lt;li>&lt;em>Integration&lt;/em> with instruments (including OpenTrons OT2, Echo, and SiLA-based automation).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://www.github.com/bioprotocols/laboped" target="_blank" rel="noopener">laboped&lt;/a>: the web-based LabOP Editor, which supports:
&lt;ul>
&lt;li>&lt;em>Programming&lt;/em> LabOP protocols quickly with low-code visual scripts,&lt;/li>
&lt;li>&lt;em>Storing&lt;/em> protocols on the cloud,&lt;/li>
&lt;li>&lt;em>Exporting&lt;/em> protocol specializations for use in other execution frameworks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h4 id="about-the-bioprotocols-working-group">About the Bioprotocols Working Group&lt;/h4>
&lt;p>The Bioprotocols Working Group is an open community organization developing a free and open standard for representation of biological
protocols.&lt;/p>
&lt;p>To join the Bioprotocols Working Group:&lt;/p>
&lt;ul>
&lt;li>Join the community mailing list at: &lt;a href="https://groups.google.com/g/bioprotocols" target="_blank" rel="noopener">https://groups.google.com/g/bioprotocols&lt;/a>&lt;/li>
&lt;li>Join the &lt;code>#collab-bioprotocols&lt;/code> channel on the &lt;a href="https://bitsinbio.org/" target="_blank" rel="noopener">Bits in Bio&lt;/a> Slack.&lt;/li>
&lt;/ul>
&lt;h5 id="leadership">Leadership&lt;/h5>
&lt;p>&lt;em>Elected Term: August 24th, 2022 - August 23rd, 2023&lt;/em>&lt;/p>
&lt;p>&lt;strong>Chair:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dan-bryce/">Dan Bryce&lt;/a> (SIFT)&lt;/p>
&lt;p>&lt;strong>Finance Committee:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="mailto:jeremy.cahill@metamerlabs.io">Jeremy Cahill (Metamer Labs)&lt;/a>&lt;/li>
&lt;li>&lt;a href="mailto:mark.doerr@uni-greifswald.de">Mark Doerr (University of Greifswald)&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tim-fallon/">Tim Fallon&lt;/a> (UCSD)&lt;/li>
&lt;/ul>
&lt;h5 id="governance">Governance&lt;/h5>
&lt;p>&lt;em>Approved by community vote on August 16th, 2022&lt;/em>&lt;/p>
&lt;p>&lt;strong>&lt;a href="https://bioprotocols.github.io/labop/about#Governance" target="_blank" rel="noopener">https://bioprotocols.github.io/labop/about#Governance&lt;/a>&lt;/strong>&lt;/p>
&lt;h5 id="mission">Mission:&lt;/h5>
&lt;p>The Bioprotocols Working Group is an open community organization developing free and open standards for representation of biological
protocols. In support of that goal, the organization also develops tools and practices and works with other organizations to
facilitate dissemination and adoption of these standards.&lt;/p>
&lt;p>As an organization, the Bioprotocols Working Group holds the following values:&lt;/p>
&lt;ul>
&lt;li>The standards developed by the community should be available under permissive free and open licenses.&lt;/li>
&lt;li>Technical decisions of the community should be made following open and inclusive processes.&lt;/li>
&lt;li>The community is strengthened by fostering a culture of diversity and inclusion, in which all constructive participants feel
comfortable making their voices heard.&lt;/li>
&lt;/ul></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/utexas/gpuemulator/</link><pubDate>Sun, 05 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/utexas/gpuemulator/</guid><description>&lt;p>Deep Neural Networks (DNN) have achieved success in many machine learning (ML) tasks including image recognition, video classification and natural language processing. Nonetheless, training DNN models is highly computation-intensive and usually requires running complex computations on GPUs, while GPUs are expensive and scarce resources. As a result, much research on DNN training is delayed by lack of access to GPUs. However, many research prototypes don&amp;rsquo;t require GPUs themselves but only their performance profiles. For example, research on DNN training storage systems doesn&amp;rsquo;t need to run real computations on GPUs; it only needs to know how much time each GPU computation will take. Meanwhile, GPU performance in DNN training is predictable and reproducible, as every batch of training performs a deterministic sequence of mathematical operations on a fixed amount of data.&lt;/p>
&lt;p>Therefore, in this project we seek to build a GPU emulator platform on PyTorch to easily reproduce DNN training without using real GPUs. We will measure the performance profiles of GPU computations for different models, GPU types, and batch sizes. Based on the measured GPU performance profiles, we will build a platform that emulates GPU behavior and reproduces DNN training using CPUs only. We will make the platform and the measurements open-source, allowing other researchers to reproduce the performance measurements and easily conduct research on DNN training systems. We will also encourage the community to enrich the database by adding GPU performance measurements for their own models and GPU types. To the best of our knowledge, this will be the first GPU emulator of its kind for DNN training, and we believe researchers and the community can benefit greatly from it, especially as more GPU performance profiles are added by the community.&lt;/p>
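&lt;p>As a minimal sketch of the idea, the emulator can replay measured per-batch GPU times on a CPU-only machine. The profile table, model names, and timings below are illustrative assumptions, not real measurements.&lt;/p>

```python
import time

# Hypothetical profile table: measured per-batch GPU step times in
# seconds, keyed by (model, gpu_type, batch_size). Real entries would
# come from actual measurements; these numbers are invented.
PROFILES = {
    ("resnet50", "V100", 32): 0.105,
    ("resnet50", "A100", 32): 0.048,
}

class EmulatedGPU:
    """Replays measured GPU step times on a CPU-only machine, so the
    rest of the training pipeline observes realistic GPU timing."""

    def __init__(self, model, gpu_type, batch_size):
        self.step_time = PROFILES[(model, gpu_type, batch_size)]

    def train_step(self):
        # Sleep instead of computing: downstream systems under study
        # (e.g. a storage pipeline) cannot tell the difference
        # timing-wise.
        time.sleep(self.step_time)

gpu = EmulatedGPU("resnet50", "V100", 32)
start = time.monotonic()
for _ in range(3):
    gpu.train_step()
elapsed = time.monotonic() - start
print(round(elapsed, 2))  # roughly 3 x 0.105 seconds
```

&lt;p>A real emulator would hook into the PyTorch training loop and cover many more operations, but the same replay principle applies.&lt;/p>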
&lt;h3 id="building-a-platform-to-emulate-gpu-performance-in-dnn-training">Building a platform to emulate GPU performance in DNN training&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> DNN training, reproducibility, GPU emulator, performance measurement&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, PyTorch, deep learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vijay-chidambaram/">Vijay Chidambaram&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yeonju-ro/">Yeonju Ro&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haoran-wu/">Haoran Wu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will measure the GPU performance profiles for different models and GPU types, based on which the student will build a platform to emulate the GPU behaviors and easily reproduce DNN training. The GPU performance measurements should be made open-source and reproducible for other researchers to reproduce results and add GPU profiles for their own needs.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Study and become familiar with PyTorch DNN training pipelines.&lt;/li>
&lt;li>Measure GPU performance profiles for different DNN models and GPU types.&lt;/li>
&lt;li>Based on the GPU performance measurements, build a platform to emulate GPU behavior and reproduce DNN training without using real GPUs.&lt;/li>
&lt;li>Organize and document the code to make it reproducible for the community.&lt;/li>
&lt;/ul></description></item><item><title>Reproduce and benchmark self-adaptive edge applications under dynamic resource management</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/</link><pubDate>Thu, 02 Feb 2023 00:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/edgebench/</guid><description>&lt;p>With the flourishing of ideas like smart cities and smart manufacturing, massive numbers of edge devices (e.g., traffic or security cameras, thermometers, flood sensors, etc.) are deployed and connected to the network to collect and analyze data across space and time, helping stakeholders like city governments or manufacturers optimize their plans and operations. Such a large number of edge devices, and the large volume of communication among the devices or to central servers, raises a major challenge: how to manage and schedule resources (i.e., network bandwidth between the devices and/or computing power on both edge devices and bare-metal servers) so that the running applications can provide a reliable service. Furthermore, given the limited resources available to edge devices, there is a rising trend of reducing average compute and/or bandwidth usage by exploiting the uneven distribution of interesting events across both time and space in the input data. This brings further challenges for provisioning and managing the resources available to edge devices, as the running applications&amp;rsquo; resource demands can depend heavily on input data that is both dynamic and unpredictable.&lt;/p>
&lt;p>With these challenges in mind, the team previously designed and implemented a dynamic resource manager that can understand applications and make decisions based on that understanding at run time. This understanding rests on a key insight: applications experience different magnitudes of performance improvement or degradation in response to a change in available resources, depending on the input data and on how many resources they currently hold; we define these as the applications&amp;rsquo; sensitivities. However, this resource manager has only been tested with a limited number and variety of video analytic applications. Hence, through the OSRE23 project, we aim to:&lt;/p>
&lt;ol>
&lt;li>reproduce other state-of-the-art self-adaptive video analytic applications,&lt;/li>
&lt;li>integrate the reproducible applications into the resource manager framework,&lt;/li>
&lt;li>compare the performance with and without the resource manager.&lt;/li>
&lt;/ol>
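&lt;p>To make the sensitivity insight concrete, here is a toy sketch (all curves and numbers are invented for illustration): estimate each application&amp;rsquo;s sensitivity as the change in performance per unit change in resource share, and favor the application with the larger value.&lt;/p>

```python
# Toy sketch of the sensitivity insight: estimate how much each
# application gains from extra resources, so a dynamic manager can
# favor the most sensitive one. All curves and numbers are invented.

def sensitivity(perf_curve, share, delta=0.1):
    """Finite-difference estimate of performance gain per unit of
    additional resource share."""
    return (perf_curve(share + delta) - perf_curve(share)) / delta

# Toy performance curves: frames/sec as a function of CPU share.
def detector_perf(share):
    return min(30.0, 60.0 * share)  # saturates at 30 fps

def tracker_perf(share):
    return 10.0 + 5.0 * share       # barely resource-sensitive

# At a 0.3 CPU share, the detector benefits far more from extra
# resources, so the manager should shift resources toward it.
print(round(sensitivity(detector_perf, 0.3), 1))  # → 60.0
print(round(sensitivity(tracker_perf, 0.3), 1))   # → 5.0
```

&lt;p>The real system must estimate these curves online, since they shift with the input data; the benchmark work above is what makes such estimates trustworthy.&lt;/p>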
&lt;h3 id="reproducebenchmark-the-self-adaptive-video-analytic-applications-performance-under-dynamic-resource-management">Reproduce/benchmark the self-adaptive video analytic applications&amp;rsquo; performance under dynamic resource management&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Benchmark, Reproducibility, Video analytics, Machine Learning, Resource Management&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch, TensorFlow&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:junchenj@uchicago.edu">Junchen Jiang&lt;/a>, &lt;a href="mailto:yuyangh@uchicago.edu">Yuyang Huang&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/faishal-zharfan/">Faishal Zharfan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integrate various types of video analytic applications into the aforementioned dynamic resource manager and reproduce/benchmark the applications&amp;rsquo; performance.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Reproduce state-of-the-art video analytic applications&lt;/li>
&lt;li>Integrate these applications into the resource manager framework&lt;/li>
&lt;li>Benchmark the video analytic applications&lt;/li>
&lt;li>Analyze the benchmarked performance results&lt;/li>
&lt;/ul></description></item><item><title>FlashNet: Towards Reproducible Data Science for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/</guid><description>&lt;p>The Data Storage Research Vision 2025, produced by an NSF workshop, calls for more &amp;ldquo;AI for storage&amp;rdquo; research. However, performing ML-for-storage research can be a daunting task for new storage researchers: one must know both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:&lt;/p>
&lt;ol>
&lt;li>How can we encourage data scientists to look into storage problems?&lt;/li>
&lt;li>How can we create a transparent platform that allows such decoupling?&lt;/li>
&lt;li>Within the storage/ML community, can we create two collaborative communities: the storage engineers and the storage data scientists?&lt;/li>
&lt;/ol>
&lt;p>In the ML/Deep Learning community, the large ImageNet benchmark has spurred research in image recognition. Similarly, we would like to provide benchmarks that foster storage research in ML-based per-IO latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. As a starting point for this large undertaking, we use I/O latency prediction as a case study, so FlashNet has been built for I/O latency prediction tasks. With FlashNet, data engineers can collect the I/O traces of various devices. Data scientists can then train ML models to predict I/O latency from those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which utilizes Chameleon Trovi for better reproducibility.&lt;/p>
&lt;p>In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary-class and multi-class classifiers and test them on the new production traces that we collected from the SNIA IOTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world&amp;rsquo;s first end-to-end data science platform for storage systems.&lt;/p>
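&lt;p>To illustrate the flavor of the per-IO latency prediction task (a toy sketch, not FlashNet&amp;rsquo;s actual pipeline), one can label the slowest tail of a trace and check how well simple per-IO features separate slow from fast I/Os; the synthetic trace, features, and thresholds below are assumptions.&lt;/p>

```python
# Toy sketch (not FlashNet's actual pipeline): label the slowest 10%
# of a synthetic I/O trace and test whether simple per-I/O features
# separate slow from fast I/Os. All features and numbers are assumed.
import random

random.seed(0)

def synth_io():
    size_kb = random.choice([4, 16, 64, 256])
    queue_depth = random.randint(1, 32)
    # Toy latency model: larger I/Os and deeper queues are slower.
    latency_us = 50 + 2 * size_kb + 10 * queue_depth + random.gauss(0, 20)
    return size_kb, queue_depth, latency_us

trace = [synth_io() for _ in range(2000)]
cut = int(0.9 * len(trace))
lat_p90 = sorted(io[2] for io in trace)[cut]
score_p90 = sorted(2 * s + 10 * q for s, q, _ in trace)[cut]

def predict_slow(size_kb, queue_depth):
    """Stand-in for a trained per-I/O latency model."""
    return 2 * size_kb + 10 * queue_depth > score_p90

correct = sum(predict_slow(s, q) == (lat > lat_p90) for s, q, lat in trace)
print(round(correct / len(trace), 2))  # high agreement on this toy trace
```

&lt;p>Real device traces are far noisier, which is exactly why trained ML models, shared traces, and reproducible pipelines are needed.&lt;/p>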
&lt;h3 id="building-flashnet-platform">Building FlashNet Platform&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, machine learning, continual learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, PyTorch, Experienced with Machine Learning pipeline&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/justin-shin/">Justin Shin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/maharani-ayu-putri-irawan/">Maharani Ayu Putri Irawan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build an open-source platform to enable collaboration between storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models/architecture, dataset patterns, data preprocessing techniques, and various feature engineering strategies.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the FlashNet evaluation results from prior works.&lt;/li>
&lt;li>Build and improve FlashNet components based on the existing blueprint.&lt;/li>
&lt;li>Collect and analyze the FlashNet evaluation results.&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/</guid><description>&lt;p>A high-throughput workflow execution system is needed to continuously gain insights from the increasingly abundant genomics data. However, genomics workflows often have long execution times (e.g., hours to days) due to their large input files. This characteristic presents many complexities when managing systems for genomics workflow execution. Furthermore, based on our observation of a large-scale genomics data processing platform, ~2% of genomics workflows exhibit a tail behavior that multiplies their execution time by up to 15x the median, resulting in weeks of execution.&lt;/p>
&lt;p>On the other hand, input files for genomic workflows often vary in quality due to differences in how they are collected. Prior works suggested that these quality differences can affect genomics workflow execution time. Yet, to the best of our knowledge, input quality has never been accounted for in the design of a high-throughput workflow execution system. Even worse, there does not appear to be a consensus on what constitutes ‘input quality,’ at least from a computer systems perspective.&lt;/p>
&lt;p>In this project, we seek to analyze a huge dataset from a large-scale genomics processing platform in order to gain insight into how &amp;lsquo;input quality&amp;rsquo; affects genomic workflows&amp;rsquo; execution times. Following that, we will build machine learning (ML) models for predicting workflow execution time, in particular for workflows that exhibit tail behavior. We believe these insights and models can become the foundation for designing a novel tail-resilient genomics workflow execution system. Along the way, we will ensure that each step of our analysis is reproducible (e.g., in the form of Jupyter notebooks) and make all our ML models open-source (e.g., in the form of pre-trained models). We sincerely hope our work can relieve some of the burdens commonly faced by operators of genomics systems and, at the same time, benefit future researchers working at the intersection of computer systems and genomics.&lt;/p>
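&lt;p>As a toy illustration of the prediction step, a single quality feature can be regressed against execution time with ordinary least squares; the feature, data points, and coefficients below are invented for illustration and are not drawn from the real platform.&lt;/p>

```python
# Hypothetical sketch: predict workflow execution time from one
# input-quality feature via ordinary least squares. The feature
# (duplication rate) and all numbers are invented for illustration.

def fit_line(xs, ys):
    """Ordinary least squares fit for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Toy training data: duplication rate vs. observed execution hours.
dup_rate = [0.05, 0.10, 0.20, 0.40, 0.60]
hours = [2.1, 2.4, 3.0, 4.2, 5.4]

a, b = fit_line(dup_rate, hours)
print(round(a, 1), round(b, 1))  # → 6.0 1.8
print(round(a * 0.30 + b, 1))    # → 3.6 predicted hours at 30% duplication
```

&lt;p>The project&amp;rsquo;s real models would combine many FastQC/Picard/GATK features and target the tail of the execution-time distribution rather than the mean.&lt;/p>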
&lt;h3 id="analyze-genomics-data-quality--build-exec-time-prediction-models">Analyze genomics data quality &amp;amp; build exec. time prediction models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> genomics, data analysis, machine learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, Python, Matplotlib, Pandas/Numpy, any ML library&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/charis-christopher-hulu/">Charis Christopher Hulu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Analyze a large-scale trace of genomics workflow execution along with metrics from various genomics alignment tools (e.g., FastQC, Picard, and GATK metrics) and find features that
correlate the most with workflow execution time and its tail behavior. Then, based on the results, we will build ML models that accurately predict genomic workflows’ execution times.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Acquire basic understanding of genomics data processing &amp;amp; workflow execution (will be guided by the mentor)&lt;/li>
&lt;li>Reproduce past analysis &amp;amp; models built by prior members of the project&lt;/li>
&lt;li>Propose features from FastQC/Picard/GATK metrics that can be used as a predictor for execution time and tail behavior&lt;/li>
&lt;li>Write a brief analysis as to why those features might work&lt;/li>
&lt;li>Build ML models for predicting execution time&lt;/li>
&lt;li>Package the analysis in the form of Jupyter notebooks&lt;/li>
&lt;li>Package the models in a reloadable format (e.g., pickle)&lt;/li>
&lt;/ul></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</link><pubDate>Thu, 02 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/multilevelerasure/</guid><description>&lt;p>Massive storage systems rely heavily on erasure coding (EC) to protect data from drive failures and provide data durability. Existing storage systems mostly adopt single-level erasure coding (SLEC) to protect data, either performing EC at the network level or performing EC at the local level. However, both SLEC approaches have limitations, as network-only SLEC introduces heavy network traffic overhead, and local-only SLEC cannot tolerate rack failures.&lt;/p>
&lt;p>Accordingly, some data centers are starting to use multi-level erasure coding (MLEC), a hybrid approach that performs EC at both the network level and the local level. However, prior EC research and evaluations mostly focused on SLEC, and it remains an open question how MLEC compares to SLEC in terms of durability, capacity overhead, encoding throughput, network traffic, and other overheads.&lt;/p>
&lt;p>Therefore, in this project we seek to build a platform to evaluate the durability and overheads of MLEC. The platform will allow us to evaluate dozens of EC strategies in many dimensions including recovery strategies, chunk placement choices, various parity schemes, etc. To the best of our knowledge, there is no other evaluation platform like what we propose here. We seek to make the platform open-source and the evaluation reproducible, allowing future researchers to benefit from it and conduct more research on MLEC.&lt;/p>
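&lt;p>One dimension the platform would quantify, capacity overhead, can be sketched with simple arithmetic; the (n, k) parameters below are illustrative choices, not a recommendation.&lt;/p>

```python
# Back-of-the-envelope sketch of one trade-off the platform would
# quantify: stacking a network-level code on a local code multiplies
# their storage overheads. The (n, k) parameters are illustrative.

def storage_overhead(n, k):
    """Raw bytes stored per byte of user data for an (n, k) code
    with k data chunks and n - k parity chunks."""
    return n / k

net = storage_overhead(10, 8)    # network-level EC across racks
local = storage_overhead(9, 8)   # local EC within each node
mlec = net * local               # combined MLEC overhead

print(round(net, 3))    # → 1.25
print(round(local, 3))  # → 1.125
print(round(mlec, 3))   # → 1.406
```

&lt;p>Durability, repair traffic, and encoding throughput interact with these parameters in far less obvious ways, which is what the evaluation platform is for.&lt;/p>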
&lt;h3 id="building-a-platform-to-evaluate-mlec">Building a platform to evaluate MLEC&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Storage systems, reproducibility, erasure coding, evaluation&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Linux, C, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zhiyan-alex-wang/">Zhiyan &amp;quot;Alex&amp;quot; Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build a platform to evaluate the durability and overheads of MLEC. The platform will be able to evaluate different EC strategies in various dimensions including repair strategies, chunk placement choices, parity schemes, etc. Analyze the evaluation results.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Reproduce the SLEC evaluation results from prior SLEC evaluation tools&lt;/li>
&lt;li>Based on prior SLEC evaluation tools, build a platform to evaluate the durability and overheads of MLEC under various EC strategies&lt;/li>
&lt;li>Collect and analyze the MLEC evaluation results&lt;/li>
&lt;/ul></description></item><item><title>Automatic Cluster Performance Shifts Detection Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</link><pubDate>Wed, 01 Feb 2023 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift/</guid><description>&lt;p>High-performance computing (HPC) clusters typically suffer from performance degradation over time. The heterogeneous nature of clusters and the inevitable defects in various infrastructure layers make performance prediction harder. On the other hand, when software upgrades or similar events happen, we might also observe performance improvement or degradation even though nothing in the hardware has changed. Due to these uncertainties, it is necessary to send administrators early notification of changes in cluster performance within a specific time window, to inform scheduling decisions and increase cluster utilization.&lt;/p>
&lt;p>We are targeting HPC clusters that cater to heterogeneous, compute- and I/O-intensive workloads, ranging from scientific simulation to AI model training, with a high degree of parallelization. In this scenario, we plan to use the open-source Darshan toolkit (&lt;a href="https://github.com/darshan-hpc/darshan" target="_blank" rel="noopener">https://github.com/darshan-hpc/darshan&lt;/a>) as our data collection and profiling tool when designing our performance drift algorithms. Furthermore, we may incorporate the distribution shift detection into Darshan itself, enabling it to notify HPC system administrators directly.&lt;/p>
&lt;p>Our goal is to show the efficacy of our algorithm by plotting the profiling data that display specific time windows where the performance shifts happened after being processed by our algorithm. Finally, we will package all our profiling data and experiment scripts inside Jupyter notebook, especially Chameleon Trovi, to help others reproduce our experiments.&lt;/p>
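&lt;p>A minimal sketch of such a drift detector (the statistic, window sizes, and alert threshold are assumptions for illustration) compares the distribution of a profiled metric in the current window against a baseline window:&lt;/p>

```python
# Illustrative windowed distribution-shift detection on a profiled
# performance metric (e.g., I/O bandwidth samples). The statistic,
# window sizes, and threshold are assumptions for demonstration.
import random
from bisect import bisect_right

random.seed(1)

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, v) / len(sa) - bisect_right(sb, v) / len(sb))
        for v in sa + sb
    )

# Baseline window of healthy bandwidth samples (MB/s), plus a current
# window that is either still healthy or degraded.
baseline = [random.gauss(900, 50) for _ in range(200)]
healthy = [random.gauss(900, 50) for _ in range(200)]
degraded = [random.gauss(700, 50) for _ in range(200)]

THRESHOLD = 0.25  # assumed alert threshold for this toy example

print(ks_stat(baseline, healthy) > THRESHOLD)   # expect False
print(ks_stat(baseline, degraded) > THRESHOLD)  # expect True
```

&lt;p>Real deployments must also bound the detector&amp;rsquo;s own overhead and handle workload heterogeneity, which is where the ML-based approaches come in.&lt;/p>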
&lt;p>Through this research, we seek to contribute the following:&lt;/p>
&lt;ul>
&lt;li>Design an algorithm to detect performance shifts in HPC clusters that can be adapted to heterogeneous workloads&lt;/li>
&lt;li>Detect performance shifts in real time without introducing significant overhead into the system&lt;/li>
&lt;li>Extend Darshan to automatically detect performance changes while profiling the clusters&lt;/li>
&lt;/ul>
&lt;h3 id="automatic-and-adaptive-performance-shifts-detection">Automatic and Adaptive Performance Shifts Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Statistical Machine Learning, Deep Learning, and High-Performance Computing (HPC)&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, Statistics, good to have: Machine Learning, Deep learning&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> Sandeep Madireddy (&lt;a href="https://www.anl.gov/profile/sandeep-r-madireddy" target="_blank" rel="noopener">https://www.anl.gov/profile/sandeep-r-madireddy&lt;/a>, &lt;a href="http://www.mcs.anl.gov/~smadireddy/" target="_blank" rel="noopener">http://www.mcs.anl.gov/~smadireddy/&lt;/a> ), Ray Andrew Sinurat (&lt;a href="https://rayandrew.me" target="_blank" rel="noopener">https://rayandrew.me&lt;/a>)&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kangrui-wang/">Kangrui Wang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>All in all, these are the specific tasks that the student should do:&lt;/p>
&lt;ul>
&lt;li>Collaborate and work with mentors to understand the goal of this project.&lt;/li>
&lt;li>Implement distribution shift detection using purely statistical or machine/deep learning methods.&lt;/li>
&lt;li>Deploy the algorithm and evaluate its efficacy on the clusters.&lt;/li>
&lt;li>Package the experiment to make it easier for others to reproduce.&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for VLSI Designs (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad/</link><pubDate>Wed, 01 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsd/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit, DARPA-funded and Google-sponsored project committed to creating low-cost and innovative Electronic Design Automation (EDA) tools and flows for IC design. Our mission is to democratize IC design, break down barriers of cost and access, and mitigate schedule risk through native and open source innovation and collaboration with ecosystem partners. &lt;a href="https://github.com/The-OpenROAD-Project" target="_blank" rel="noopener">OpenROAD&lt;/a> provides an autonomous, no-human-in-the-loop, 24-hour RTL-GDSII flow for fast ASIC design exploration, QoR estimation and physical implementation for a range of technologies above 12 nm. We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact. OpenROAD has been used in &amp;gt; 600 tapeouts across a range of ASIC applications, with a rapidly growing and diverse user community.&lt;/p>
&lt;h3 id="enhance-openroad-gui-flow-manager">Enhance OpenROAD GUI Flow Manager&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>GUI&lt;/code>, &lt;code>Visualization&lt;/code>, &lt;code>User Interfaces&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Qt&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="mailto:ethanmoon@google.com">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop custom features for analysis and visualization in the &lt;a href="https://openroad.readthedocs.io/en/latest/main/src/gui/README.html" target="_blank" rel="noopener">OpenROAD GUI&lt;/a> to support native and third-party flows, including &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>, &lt;a href="https://github.com/The-OpenROAD-Project/OpenLane" target="_blank" rel="noopener">OpenLane&lt;/a> and other third-party flows. Create documentation: commands, developer guide notes, and tutorials showing GUI usage for supported flows.&lt;/p>
&lt;h3 id="profile-and-tune-openroad-flow-for-runtime-improvements">Profile and tune OpenROAD flow for Runtime improvements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>OpenROAD-flow-scripts&lt;/code>, &lt;code>Flow Manager&lt;/code>, &lt;code>Runtime Optimization&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge about Computational resource optimization, Cloud-based computation, Basic VLSI design and tools knowledge&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="mailto:ethanmoon@google.com">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test, analyze and develop verifiable and reproducible strategies to improve run times in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a>. These include optimization of computational resources in the cloud and tuning of algorithmic and design flow parameters. Create test plans using existing or new designs to show runtime improvements.&lt;/p>
&lt;h3 id="update-openroad-documentation-and-tutorials">Update OpenROAD Documentation and Tutorials&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Documentation&lt;/code>, &lt;code>Tutorials&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design flow, tcl, shell scripts, Documentation, Markdown&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Review and update missing documentation and tutorials in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> for existing and new features. Here is an example Tutorial link: &lt;a href="https://openroad-flow-scripts.readthedocs.io/en/latest/tutorials/FlowTutorial.html" target="_blank" rel="noopener">https://openroad-flow-scripts.readthedocs.io/en/latest/tutorials/FlowTutorial.html&lt;/a> for reference.&lt;/p>
&lt;h3 id="lef-and-liberty-model-testing">LEF and Liberty Model Testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Testing&lt;/code>, &lt;code>LEF&lt;/code>, &lt;code>LIB&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design, lef and lib model abstracts, tcl, shell scripts, Verilog, Layout&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Test the accuracy of generated LIB and LEF models for signoff in &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> for flat and hierarchical design flows. Build test cases to validate and add to the regression suite.&lt;/p></description></item><item><title>Teaching Computer Networks with Reproducible Research</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/edunet/</guid><description>&lt;p>Lead Mentor: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a>&lt;/p>
&lt;p>In the field of computer networks and wireless communication systems, the availability of open access networking and cloud computing testbeds (&lt;a href="https://portal.geni.net/" target="_blank" rel="noopener">GENI&lt;/a>, &lt;a href="https://cloudlab.us/" target="_blank" rel="noopener">CloudLab&lt;/a>, &lt;a href="https://chameleoncloud.org/" target="_blank" rel="noopener">Chameleon&lt;/a>, &lt;a href="https://fabric-testbed.net/" target="_blank" rel="noopener">FABRIC&lt;/a>, and others) has been transformative in promoting reproducible research &lt;em>and&lt;/em> in making high-quality experiential learning available to students and educators at a wide range of colleges and universities. This project seeks to unite research and education use of these testbeds by developing new ways of using reproducible research to teach computer networks and related topics.&lt;/p>
&lt;h3 id="bringing-foundational-results-into-the-classroom">Bringing foundational results into the classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;/ul>
&lt;p>To make foundational results from computer networks more concrete, this project seeks to reproduce a selection of key results and package them for use as interactive classroom demonstrations. (An example of a &amp;ldquo;foundational&amp;rdquo; result might be the result from the 1980s that motivates congestion control by showing how &lt;a href="http://dx.doi.org/10.1016/0169-7552%2889%2990019-6" target="_blank" rel="noopener">congestion collapse occurs when the network is under heavy load&lt;/a>.) This involves:&lt;/p>
&lt;ul>
&lt;li>Reproducing the original results on an open-access testbed&lt;/li>
&lt;li>Packaging the materials for use as a classroom demo, with interactive elements&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; related to the materials that instructors may use in homework assignments or exams&lt;/li>
&lt;/ul>
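&lt;p>To build intuition before touching a testbed, the collapse effect can be captured in a toy steady-state model (our own illustrative sketch, not the original 1980s analysis): every dropped packet is retransmitted, and retransmissions compete with new data for link capacity.&lt;/p>

```python
def goodput(offered, capacity):
    """Toy model of congestion collapse without congestion control.

    Every dropped packet is retransmitted, so in steady state the
    retransmission traffic R satisfies R = offered - capacity; the link
    then carries 2*offered - capacity units of traffic, of which only
    the share carrying new data counts as goodput.
    """
    if capacity >= offered:
        return offered  # uncongested: all new data gets through
    return capacity * offered / (2 * offered - capacity)

# Sweep the offered load: goodput peaks at link capacity, then degrades.
capacity = 100.0
curve = [(load, round(goodput(load, capacity), 1))
         for load in (50, 100, 200, 400, 800)]
```

&lt;p>In this toy model goodput climbs until the link saturates and then falls toward half the capacity as retransmissions crowd out new data; the classroom demo would reproduce the same qualitative curve with real traffic on a testbed.&lt;/p>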
&lt;h3 id="developing-a-classroom-competition-for-adaptive-video-delivery-policies">Developing a &amp;ldquo;classroom competition&amp;rdquo; for adaptive video delivery policies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer networks, adaptive video, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux, Python, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/srishti-jaiswal/">Srishti Jaiswal&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>A carefully designed competition can be a fun and exciting way for students to challenge themselves and gain &amp;ldquo;ownership&amp;rdquo; of a new topic. This project builds on an existing open source &lt;a href="https://witestlab.poly.edu/blog/adaptive-video-reproducing/" target="_blank" rel="noopener">reproducible result&lt;/a> for adaptive video delivery, and will challenge students to extend this work and design their own adaptive video policies for head-to-head competition against their classmates. This includes:&lt;/p>
&lt;ul>
&lt;li>Packaging the result to make it easier for students to reproduce and then build on the original work&lt;/li>
&lt;li>Implementing other adaptive video policies from the literature, so that students can use them as a baseline&lt;/li>
&lt;li>Developing different network settings (using live link traces and emulated link patterns) in which student submissions may be evaluated&lt;/li>
&lt;li>Developing an evaluation framework for scoring student submissions on different criteria and in different network settings, and making the results available in a leaderboard format&lt;/li>
&lt;/ul></description></item><item><title>Using Reproducibility in Machine Learning Education</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</link><pubDate>Wed, 18 Jan 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/nyu/eduml/</guid><description>&lt;p>Lead Mentor: &lt;a href="mailto:ffund@nyu.edu">Fraida Fund&lt;/a>&lt;/p>
&lt;p>The computer science and engineering classroom is an essential part of the reproducibility &amp;ldquo;ecosystem&amp;rdquo;: because of its broad reach and potential for big impact, and because for many students, the classroom is their first exposure to research in their field. For machine learning in particular, reproducibility is an important element of the research culture, and it can be a valuable part of any introductory or advanced course in the field. These projects will develop highly interactive open educational resources that may be adopted by instructors of graduate or undergraduate machine learning courses to incorporate more instruction about reproducibility and reproducible research.&lt;/p>
&lt;h3 id="introducing-levels-of-reproduction-and-replication-in-ml">Introducing &amp;ldquo;levels&amp;rdquo; of reproduction and replication in ML&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>In machine learning, replicating a published result to confirm the validity of the experimental results and the broader conclusions of the paper can take several forms, with increasing levels of effort:&lt;/p>
&lt;ul>
&lt;li>running the model on the same benchmarks as the original paper, using the authors&amp;rsquo; code and pre-trained weights&lt;/li>
&lt;li>training a model using the authors&amp;rsquo; code and published hyperparameters&lt;/li>
&lt;li>training a model using the authors&amp;rsquo; code and a new hyperparameter search&lt;/li>
&lt;li>validating the authors&amp;rsquo; code (e.g., with unit tests) in addition to training&lt;/li>
&lt;li>re-implementing the model&lt;/li>
&lt;li>designing additional experiments to validate that the suggested mechanism is in fact responsible for the result&lt;/li>
&lt;li>and more.&lt;/li>
&lt;/ul>
&lt;p>This project will develop interactive materials (using one or more exemplar published results) to illustrate and to highlight relevant aspects and pitfalls of each of these &amp;ldquo;levels&amp;rdquo; of reproduction and replication.&lt;/p>
&lt;h3 id="packaging-existing-reproducible-results-for-the-ml-classroom">Packaging existing reproducible results for the ML classroom&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Machine learning, reproducibility, education&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, machine learning, writing&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and TBD&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shekhar/">Shekhar&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jonathan-edwin/">Jonathan Edwin&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The goal is to make it easier for instructors to expose students to state-of-the-art research in the classroom. This project will work with an existing set of recent reproducible results in machine learning, and will package them for easier consumption by students and more effective use in the classroom. This may include, but is not necessarily limited to:&lt;/p>
&lt;ul>
&lt;li>Re-validating the result and re-packaging along with computational environment on an open access testbed&lt;/li>
&lt;li>Creating tutorial material around the result, including interactive visualizations to demonstrate key elements of the work&lt;/li>
&lt;li>Creating one-click demos for applying the model/technique to a new test sample&lt;/li>
&lt;li>Curating test samples to highlight important advantages and limitations of the result&lt;/li>
&lt;li>Creating assessment questions and sample &amp;ldquo;solutions&amp;rdquo; that instructors may use to &amp;ldquo;assign&amp;rdquo; the work to students&lt;/li>
&lt;/ul></description></item><item><title>Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz/</link><pubDate>Mon, 09 Jan 2023 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz/</guid><description>&lt;p>Reproducibility and Artifact Evaluation efforts have focused on reproducing the results, but not necessarily on storing, visualizing and making the results accessible. This set of projects builds the initial building blocks to log, capture, and visualize experiments.&lt;/p>
&lt;h3 id="experiment-log">Experiment Log&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Provide tools to log experiments&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Simple&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop client- and server-side tools to start, stop, and timestamp an experiment. Document each iteration of the experiment and create a database to visualize the log of experiments.&lt;/p>
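&lt;p>On the client side, this could start as small as a few functions around SQLite; the sketch below is illustrative only, and the table layout and function names are our own assumptions rather than a specification.&lt;/p>

```python
import sqlite3
import time

def init_db(conn):
    """Schema for a minimal experiment log."""
    conn.execute("""CREATE TABLE IF NOT EXISTS experiments (
        id INTEGER PRIMARY KEY, name TEXT, started REAL, stopped REAL)""")

def start_experiment(conn, name):
    """Record the start timestamp and return the new experiment id."""
    cur = conn.execute(
        "INSERT INTO experiments (name, started) VALUES (?, ?)",
        (name, time.time()))
    conn.commit()
    return cur.lastrowid

def stop_experiment(conn, exp_id):
    """Record the stop timestamp of a running experiment."""
    conn.execute("UPDATE experiments SET stopped = ? WHERE id = ?",
                 (time.time(), exp_id))
    conn.commit()

def experiment_log(conn):
    """All recorded iterations, oldest first, ready for visualization."""
    return conn.execute(
        "SELECT id, name, started, stopped FROM experiments "
        "ORDER BY started").fetchall()
```

&lt;p>The server side would expose the same operations over the network and feed the visualization front end.&lt;/p>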
&lt;h3 id="capture-hwsw-state--continuous-monitoring">Capture HW/SW state &amp;amp; continuous monitoring&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Record initial state&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Provide simple tools to gather the initial state of each experimental machine and its connected devices, configurations, software versions, &amp;hellip; Upload this state into the experiment log database and visualize the recorded data. Ideally, provide a diff function between experimental runs.&lt;/p>
&lt;p>In a second step, monitor the machine’s state during execution. This includes network, memory, CPU, and general OS statistics.&lt;/p>
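&lt;p>Both steps share a common core: take a snapshot of the machine state and compare snapshots across runs. A minimal sketch using only the Python standard library (the recorded fields are our choice; a real tool would capture far more):&lt;/p>

```python
import platform
import sys

def capture_state():
    """Snapshot of basic software state; a real tool would also record
    connected devices, installed packages, kernel parameters, and more."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
    }

def diff_states(a, b):
    """Keys whose values differ between two recorded snapshots."""
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}
```

&lt;p>Uploading such snapshots alongside each experiment makes the diff between two runs a single dictionary comparison.&lt;/p>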
&lt;h3 id="record-and-visualize-experimental-results">Record and visualize experimental results&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Record results in various formats and visualize them&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjo-vahldiek-oberwagner/">Anjo Vahldiek-Oberwagner&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jiayuan-zhu/">Jiayuan Zhu&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/krishna-madhwani/">Krishna Madhwani&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Experiments generate results in various formats (e.g., CSV, JSON, text files, …). The goal of this project is to provide tools to extract these common formats, connect the results to the experiment log, and visualize them, ideally allowing comparison of different experimental runs. Initially, the project could dump the results into a Prometheus instance (&lt;a href="https://prometheus.io/" target="_blank" rel="noopener">https://prometheus.io/&lt;/a>), which would later become available for everyone to explore the data.&lt;/p></description></item><item><title>Polyphorm / PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</link><pubDate>Thu, 15 Dec 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/polyphy/</guid><description>&lt;p>&lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">PolyPhy&lt;/a> is a GPU-oriented agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used an early prototype called &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can see an instructive overview of PolyPhy in our &lt;a href="https://elek.pub/workshop_cross2022.html" target="_blank" rel="noopener">workshop&lt;/a> and more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Under the hood, PolyPhy uses a richer 3D scalar field representation of the reconstructed network, instead of a typical discrete representation like a graph or a mesh. The ultimate purpose of PolyPhy is to become a toolkit for a range of specialists across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. PolyPhy aspires to be a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="polyphy-infrastructure-engineering-and-practices">PolyPhy infrastructure engineering and practices&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>DevOps&lt;/code> &lt;code>Code Refactoring&lt;/code> &lt;code>CI/CD&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> fluidity in Python, experience with OOP, experience with building and packaging libraries, understanding GitHub and its tools ecosystem&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350+ hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/prashant-jha/">Prashant Jha&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility in this project will be developing new infrastructure for the PolyPhy project as well as maintaining the existing &lt;a href="https://github.com/PolyPhyHub/" target="_blank" rel="noopener">codebases&lt;/a>. This is a multifaceted role that will require coordination with the team and an active approach to understanding the technical needs of the community.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with the technical lead to develop effective interfaces for PolyPhy, providing access to its functionality on the level of both Python/Jupyter code and the command line.&lt;/li>
&lt;li>Maintain the existing &lt;a href="https://github.com/PolyPhyHub/PolyPhy" target="_blank" rel="noopener">codebase&lt;/a> and configure it according to the team&amp;rsquo;s needs.&lt;/li>
&lt;li>Develop and extend the current CI/CD functionality and related code metrics.&lt;/li>
&lt;li>Document the best practices related to the above.&lt;/li>
&lt;/ul>
&lt;h3 id="write-polyphys-technical-story-and-content">Write PolyPhy&amp;rsquo;s technical story and content&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Writing&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>Storytelling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experience writing structured text, well-read, technical or scientific education, webdev basics (preferably NodeJS)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integral to PolyPhy&amp;rsquo;s presentation is a &amp;ldquo;story&amp;rdquo; - a narrative understanding - that the users and the project contributors can relate to. Your responsibility will be to develop the written part of that understanding, as well as major portions of technical documentation that match it.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Write and edit diverse pages of the project &lt;a href="https://www.polyphy.io" target="_blank" rel="noopener">website&lt;/a>.&lt;/li>
&lt;li>Work with mentors to improve project&amp;rsquo;s written community practices (diversity, communication).&lt;/li>
&lt;li>Write and edit narrative and explanatory parts of PolyPhy&amp;rsquo;s documentation.&lt;/li>
&lt;li>Create tutorials that present core functionality of the toolkit.&lt;/li>
&lt;/ul>
&lt;h3 id="community-engagement-and-management">Community engagement and management&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Community Management&lt;/code> &lt;code>Social Media&lt;/code> &lt;code>Networking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> documented experience with the current social media landscape, sociable and well-spoken, ability to communicate technical concepts&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/oskar-elek/">Oskar Elek&lt;/a>, &lt;a href="mailto:ez@nmsu.edu">Ezra Huscher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Your responsibility will be to build and engage the community around PolyPhy. This includes its standing team and stakeholders, current expert users, potential adopters, as well as the general public. The scope (size) of the project depends on the level of commitment during and beyond the summer and is negotiable upfront.&lt;/p>
&lt;p>&lt;strong>Specific tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Manage the team&amp;rsquo;s communication channels (Slack, Zoom, email) and maintain active presence therein.&lt;/li>
&lt;li>Develop social media presence for PolyPhy on Twitter, LinkedIn and other selected social media platforms.&lt;/li>
&lt;li>Manage and extend the online presence for the project, including its &lt;a href="https://polyphy.io" target="_blank" rel="noopener">website&lt;/a>, mailing list, and other applicable outreach activities.&lt;/li>
&lt;li>Research and engage with new communities that would benefit from PolyPhy, both as its expert users and contributors.&lt;/li>
&lt;/ul></description></item><item><title>Adaptive Load Balancers for Low-latency Multi-hop Networks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/adaptiveload/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/adaptiveload/</guid><description>&lt;p>This project aims to design efficient, adaptive link-level load balancers for networks that handle different kinds of traffic, in particular networks whose flows are heterogeneous in their round-trip times. Geo-distributed data centers are one such example. With large-scale deployments of 5G in the near future, there will be even more such applications, including bulk transfers of videos and photos as well as augmented and virtual reality applications that take advantage of 5G’s low-latency service. With the development of Web 3.0 and the Metaverse, network workloads across data centers are only going to get more varied and challenging. All of this adds up to heavy, bulky data being sent to data centers and over the backbone network. These traffic classes have varying quality-of-service requirements, such as low latency, high throughput, and high-definition video streaming. Wide area network (WAN) flows are typically data-heavy tasks, such as backups taken for a particular data center. The interaction of data center and WAN traffic creates a very interesting scenario with its own challenges: the two classes differ in link utilization and round-trip times. Based on our literature review, there is very little work on load balancers that address the interaction of data center and WAN traffic. This motivates the design of load balancers that account for both WAN and data center traffic in order to deliver high performance in more realistic scenarios.
This work proposes a load balancer that adapts to the traffic it encounters by learning from network conditions and then predicting the optimal route for a given flow.&lt;/p>
&lt;p>Through this research we seek to contribute the following :&lt;/p>
&lt;ul>
&lt;li>Designing a load balancer, that is adaptive to datacenter and WAN traffic, and in general can be adapted to varied traffic conditions&lt;/li>
&lt;li>Real time learning of the network setup and predicting optimal paths&lt;/li>
&lt;li>Low latency, high throughput and increased network utilization deliverables&lt;/li>
&lt;/ul>
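&lt;p>As a first intuition for what such a load balancer must do, consider a fixed scoring rule that routes bulk WAN flows toward spare capacity and latency-sensitive data center flows toward low-RTT paths; the learning-based design proposed here would replace this hand-written rule with a model trained on observed traffic and network conditions (the path records below are illustrative):&lt;/p>

```python
def pick_path(paths, flow_is_wan):
    """Choose a path from recent per-path measurements.

    Bulk WAN flows favour spare capacity; latency-sensitive data center
    flows favour low round-trip time. A learned policy would replace
    this fixed scoring function.
    """
    def score(p):
        if flow_is_wan:
            return 1.0 - p["utilization"]  # throughput-oriented
        return 1.0 / p["rtt_ms"]           # latency-oriented
    return max(paths, key=score)

paths = [
    {"name": "intra-dc", "rtt_ms": 0.2, "utilization": 0.9},
    {"name": "wan", "rtt_ms": 40.0, "utilization": 0.2},
]
```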
&lt;h3 id="adaptive-dynamic-load-balancing-for-data-center-and-wan-traffic">Adaptive, Dynamic Load Balancing for data center and WAN traffic&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> data center networking, TCP/IP stack, congestion control, load balancing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, Linux; experience with network simulators would be helpful&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate/Challenging&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:katia@soe.ucsc.edu">Katia Obraczka&lt;/a>, &lt;a href="mailto:akabbani@gmail.com">Abdul Kabbani&lt;/a>, &lt;a href="mailto:lakrishn@ucsc.edu">Lakshmi Krishnaswamy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Understanding the OMNeT++ network simulator and creating simple networks and data center topologies to understand the simulation environment.&lt;/li>
&lt;li>Implementing existing load balancers on OMNeT++ and exploring the effect of different features of the load balancers with data center traffic and WAN traffic.&lt;/li>
&lt;li>Finding and testing out WAN-specific traffic that may exist, such as video streaming traffic, large database queries, etc.&lt;/li>
&lt;li>Working with the mentors on developing a learning-based load balancer framework that learns from past traffic samples and network conditions to adapt dynamically to current network conditions.&lt;/li>
&lt;/ul></description></item><item><title>Apache AsterixDB</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucr/asterixdb/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucr/asterixdb/</guid><description>&lt;p>&lt;a href="http://asterixdb.apache.org/" target="_blank" rel="noopener">AsterixDB&lt;/a> is an open source parallel big-data management system. AsterixDB is a well-established Apache project that has been active in research for more than 10 years. It provides a flexible data model that supports modern NoSQL applications with a powerful query processor that can scale to billions of records and terabytes of data. Users can interact with AsterixDB through a powerful and easy-to-use declarative query language, SQL++, which provides a rich set of data types including timestamps, time intervals, text, and geospatial, in addition to traditional numerical and Boolean data types.&lt;/p>
&lt;h3 id="geospatial-data-science-on-asterixdb">Geospatial Data Science on AsterixDB&lt;/h3>
&lt;ul>
&lt;li>&lt;em>Topics&lt;/em>: Data science, SQL++, documentation&lt;/li>
&lt;li>&lt;em>Skills&lt;/em>: SQL, Writing, Spreadsheets&lt;/li>
&lt;li>&lt;em>Difficulty&lt;/em>: Medium&lt;/li>
&lt;li>&lt;em>Size&lt;/em>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;em>Mentors&lt;/em>: &lt;a href="mailto:eldawy@ucr.edu">Ahmed Eldawy&lt;/a>, &lt;a href="mailto:asevi006@ucr.edu">Akil Sevim&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Build a data science project using AsterixDB that analyzes geospatial data among other dimensions. Use &lt;a href="https://star.cs.ucr.edu/?Chicago%20Crimes#center=41.8313,-87.6830&amp;amp;zoom=11" target="_blank" rel="noopener">Chicago Crimes&lt;/a> as the main dataset and combine it with other datasets including &lt;a href="https://star.cs.ucr.edu/?osm21/pois#center=41.8313,-87.6830&amp;amp;zoom=11" target="_blank" rel="noopener">points of interest&lt;/a> and &lt;a href="https://star.cs.ucr.edu/?TIGER2018/ZCTA5#center=41.8313,-87.6830&amp;amp;zoom=11" target="_blank" rel="noopener">ZIP Code boundaries&lt;/a>. During this project, we will answer interesting questions about the data and visualize the results, such as:&lt;/p>
&lt;ul>
&lt;li>What is the most common crime type on a specific date or over weekends?&lt;/li>
&lt;li>Where do most of the arrests happen?&lt;/li>
&lt;li>How do crime rates change over time across different regions?&lt;/li>
&lt;/ul>
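&lt;p>For example, the first question might translate into SQL++ roughly as follows; the dataset and field names are hypothetical, since the real schema depends on how the Chicago Crimes data is ingested:&lt;/p>

```python
def top_crime_types_query(dataset="ChicagoCrimes"):
    """Build a SQL++ query that counts crimes by type, most common first.

    The dataset name and the primary_type field are assumptions about
    the ingested schema, not part of AsterixDB itself.
    """
    return (
        f"SELECT c.primary_type AS crime_type, COUNT(*) AS n\n"
        f"FROM {dataset} c\n"
        f"GROUP BY c.primary_type\n"
        f"ORDER BY n DESC\n"
        f"LIMIT 5;"
    )
```

&lt;p>Queries like this can be submitted through AsterixDB&amp;rsquo;s web console or its HTTP query service, and the returned JSON feeds directly into a visualization.&lt;/p>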
&lt;h4 id="the-goals-of-this-project-are">The goals of this project are:&lt;/h4>
&lt;ul>
&lt;li>Understand how to build a scalable data science project using AsterixDB.&lt;/li>
&lt;li>Translate common questions to SQL queries and run them on large data.&lt;/li>
&lt;li>Learn how to visualize the results of queries and present them.&lt;/li>
&lt;li>Write detailed documentation about the process of building a data science application in AsterixDB.&lt;/li>
&lt;li>Improve the documentation of AsterixDB while working in the project to improve the experience for future users.&lt;/li>
&lt;/ul>
&lt;h4 id="machine-learning-integration">Machine Learning Integration&lt;/h4>
&lt;p>As a bonus task, and depending on the progress of the project, we can explore the integration of machine learning with AsterixDB through Python UDFs. We will utilize the AsterixDB Python integration through &lt;a href="https://asterixdb.apache.org/docs/0.9.7/udf.html" target="_blank" rel="noopener">user-defined functions&lt;/a> to connect the AsterixDB backend with &lt;a href="https://scikit-learn.org/stable/index.html" target="_blank" rel="noopener">scikit-learn&lt;/a> to build some unsupervised and supervised models for the data. For example, we can cluster the crimes based on their location and other attributes to find interesting patterns or hotspots.&lt;/p></description></item><item><title>CephFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/cephfs/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/cephfs/</guid><description>&lt;p>&lt;a href="https://docs.ceph.com/en/latest/cephfs/" target="_blank" rel="noopener">CephFS&lt;/a> is a distributed file system on top of &lt;a href="https://ceph.io" target="_blank" rel="noopener">Ceph&lt;/a>. It is implemented as a distributed metadata service (MDS) that uses dynamic subtree balancing to trade parallelism for locality during continually changing workloads. Clients that mount a CephFS file system connect to the MDS and acquire capabilities as they traverse the file namespace. Capabilities not only convey metadata but can also implement strong consistency semantics by granting and revoking the ability of clients to cache data locally.&lt;/p>
&lt;h3 id="cephfs-namespace-traversal-offloading">CephFS namespace traversal offloading&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Ceph&lt;/code>, &lt;code>filesystems&lt;/code>, &lt;code>metadata&lt;/code>, &lt;code>programmable storage&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Ceph / MDS&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:carlosm@ucsc.edu">Carlos Maltzahn&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The frequency of metadata service (MDS) requests relative to the amount of data accessed can severely affect the performance of distributed file systems like CephFS, especially for workloads that randomly access a large number of small files as is commonly the case for machine learning workloads: they purposefully randomize access for training and evaluation to prevent overfitting. The datasets of these workloads are read-only and therefore do not require strong coherence mechanisms that metadata services provide by default.&lt;/p>
&lt;p>The key idea of this project is to reduce the frequency of MDS requests by offloading namespace traversal, i.e. the need to open a directory, list its entries, open each subdirectory, etc. Each of these operations usually requires a separate MDS request. Offloading namespace traversal refers to a client’s ability to request the metadata (and associated read-only capabilities) of an entire subtree with one request, thereby offloading the traversal work for tree discovery to the MDS.&lt;/p>
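&lt;p>A back-of-the-envelope sketch of why this helps (in Python, purely illustrative; the real work happens in the C++ client and MDS), counting MDS round trips for a small dataset tree:&lt;/p>

```python
def mds_requests_per_dir(tree):
    """Conventional traversal: roughly one MDS round trip per directory
    that is opened and listed."""
    return 1 + sum(mds_requests_per_dir(d)
                   for d in tree.get("dirs", {}).values())

def mds_requests_offloaded(tree):
    """With traversal offloading, the client fetches the metadata and
    read-only capabilities of the whole subtree in a single request."""
    return 1

# A toy machine learning dataset layout: one directory per class.
dataset = {"dirs": {f"class{i}": {"dirs": {}} for i in range(3)}}
```

&lt;p>For a real dataset with thousands of directories, the per-directory count grows linearly while the offloaded count stays constant, which is exactly the saving this project targets.&lt;/p>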
&lt;p>Once the basic functionality is implemented, this project can be expanded to address optimization opportunities, e.g. describing regular tree structures as a closed form expression in the tree’s root, shortcutting tree discovery.&lt;/p></description></item><item><title>DirtViz (2022)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/dirtviz/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/dirtviz/</guid><description>&lt;p>DirtViz is a project to visualize data collected from
sensors deployed in sensor networks. We have deployed a number of
sensors measuring qualities like soil moisture, temperature, current
and voltage in outdoor settings. This project involves extending (or
replacing) our existing plotting scripts to create a fully-fledged
dataviz tool tailored to the types of data collected from embedded
systems sensor networks.&lt;/p>
&lt;h3 id="visualize-sensor-data">Visualize Sensor Data&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Visualization&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: JavaScript, Python, Bash, web servers, Git, embedded systems&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Easy/Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 175 hours&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Develop a set of visualization tools (ideally web based) that easily allows users to zoom in on date ranges, change axes, etc.&lt;/li>
&lt;li>Document the tool thoroughly for future maintenance&lt;/li>
&lt;li>If interested, you could also help us investigate correlations between different data streams&lt;/li>
&lt;/ul></description></item><item><title>Eusocial Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/eusocial/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/eusocial/</guid><description>&lt;p>As storage devices get faster, data management tasks rob the host of CPU cycles and main memory bandwidth. The &lt;a href="https://cross.ucsc.edu/projects/eusocialpage.html" target="_blank" rel="noopener">Eusocial project&lt;/a> aims to create a new interface to storage devices that can leverage existing and new CPU and main memory resources to take over data management tasks like availability, recovery, and migrations. The project refers to these storage devices as “eusocial” because we are inspired by eusocial insects like ants, termites, and bees, which as individuals are primitive but collectively accomplish amazing things.&lt;/p>
&lt;h3 id="dynamic-function-injection-for-rocksdb">Dynamic function injection for RocksDB&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Java&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="mailto:jliu120@ucsc.edu">Jianshen Liu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Recent research reveals that the compaction process in RocksDB can be altered to optimize future data access by changing the data layout in compaction levels. The benefit of this approach can be extended to different data layout optimizations based on application access patterns and requirements. In this project, we want to create an interface that allows users to dynamically inject layout optimization functions into RocksDB, using containerization technologies such as WebAssembly.&lt;/p>
&lt;ul>
&lt;li>Reference: Saxena, Hemant, et al. &amp;ldquo;Real-Time LSM-Trees for HTAP Workloads.&amp;rdquo; arXiv preprint arXiv:2101.06801 (2021).&lt;/li>
&lt;/ul>
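&lt;p>A minimal sketch of the injection interface described above, in plain Python rather than a sandboxed WebAssembly module, with all names hypothetical: a compaction routine accepts a user-supplied layout function that rearranges records before they are written to the next level.&lt;/p>

```python
# Hypothetical sketch: dynamically injected layout optimization.
# In the real project the function would be loaded from a sandboxed
# WebAssembly module; a plain Python callable stands in for it here.

def compact(records, layout_fn=None):
    """Merge records by key, then let an injected function
    rearrange the physical layout of the output level."""
    merged = sorted(records, key=lambda kv: kv[0])
    return layout_fn(merged) if layout_fn else merged

# Example injected optimization: co-locate hot keys at the front so
# future point lookups touch fewer blocks.
def hot_keys_first(records, hot=frozenset({"user42"})):
    return ([kv for kv in records if kv[0] in hot] +
            [kv for kv in records if kv[0] not in hot])

level = compact([("z", 1), ("user42", 2), ("a", 3)], hot_keys_first)
```

&lt;p>The point of the interface is that &lt;code>hot_keys_first&lt;/code> arrives at run time, so different applications can ship different layout policies without recompiling the storage engine.&lt;/p>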
&lt;h3 id="demonstrating-a-composable-storage-system-accelerated-by-memory-semantic-technologies">Demonstrating a composable storage system accelerated by memory semantic technologies&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Bash, Python, System architecture, Network fabrics&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="mailto:jliu120@ucsc.edu">Jianshen Liu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Over the last decade, the slowdown in performance improvement of general-purpose processors has driven system architectures to become increasingly heterogeneous. Domain-specific accelerator hardware (e.g., FPGAs, SmartNICs, TPUs, GPUs) is taking over many jobs from general-purpose processors. At the same time, network and storage device performance has improved tremendously, on a trajectory that far outpaces that of processors. Given this trend, a natural way to continue scaling storage system performance economically is to efficiently utilize and share resources from different nodes over the network. Several resource-sharing protocols already exist, such as CCIX, CXL, and Gen-Z. Among these, Gen-Z is the most interesting because, unlike RDMA, it enables remote memory access without exposing details to applications (i.e., no application changes). It would therefore be interesting to see whether, and to what extent, these technologies can help improve the performance of storage systems. This project requires building a demo system that uses some of these technologies (especially Gen-Z) and running selected applications/workloads to better understand the benefits.&lt;/p>
&lt;ul>
&lt;li>References: Gen-Z: An Open Memory Fabric for Future Data Processing Needs: &lt;a href="https://www.youtube.com/watch?v=JLb9nojNS8E" target="_blank" rel="noopener">https://www.youtube.com/watch?v=JLb9nojNS8E&lt;/a>, Pekon Gupta, SMART Modular; Gen-Z subsystem for Linux, &lt;a href="https://github.com/linux-genz" target="_blank" rel="noopener">https://github.com/linux-genz&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="when-will-rotational-media-users-abandon-sata-and-converge-to-nvme">When will Rotational Media Users abandon SATA and converge to NVMe?&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Skills:&lt;/strong> Entrepreneurial mind, interest in researching high technology markets&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Mentor:&lt;/strong> &lt;a href="mailto:carlosm@ucsc.edu">Carlos Maltzahn&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Goal:&lt;/strong> Determine the benefits, in particular market verticals such as genomics and health care, of converging the storage stack in data center computer systems to the NVMe device interface, even when devices include rotational media (aka disk drives). The key question: “When do people abandon SATA and SAS and converge to NVMe?”&lt;/p>
&lt;p>&lt;strong>Background:&lt;/strong> NVMe is a widely used device interface for fast storage devices such as flash that behave much more like random access memory than the traditional rotational media. Rotational media is accessed mostly via SATA and SAS, which have served the industry well for close to two decades. SATA in particular is much cheaper than NVMe. Now that NVMe is widely available and quickly advancing in functionality, an interesting question is whether there is a market for rotational media devices with NVMe interfaces, converging the storage stack to only one logical device interface, thereby enabling a common ecosystem and more efficient connectivity from multiple processes to storage devices.&lt;/p>
&lt;p>The NVMe 2.0 specification, which came out last year, has been restructured to support the increasingly diverse NVMe device environment (including rotational media). The extensibility of 2.0 encourages enhancements of independent command sets such as Zoned Namespaces (ZNS) and Key Value (NVMe-KV) while supporting transport protocols for NVMe over Fabrics (NVMe-oF). A lot of creative energy is now focused on advancing NVMe while SATA has not changed in 16 years. Having all storage devices connect the same way not only frees up space on motherboards but also enables new ways to manage drives, for example via NVMe-oF that allows drives to be networked without additional abstraction layers.&lt;/p>
&lt;p>&lt;strong>Suggested Project Structure:&lt;/strong> This is really just a suggestion for a starting point. As research progresses, a better structure might emerge.&lt;/p>
&lt;ol>
&lt;li>Convergence of software stack: seamless integration between rotational media and hot storage&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Direct tiering: one unified interface to place data among fast and slow devices on the same NVMe fabric depending on whether the data is hot or cold.&lt;/li>
&lt;/ul>
&lt;ol start="2">
&lt;li>Computational storage:&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>What are the architectures of computational NVMe devices? For example, offloading compute to an FPGA vs an onboard processor in a disk drive?&lt;/li>
&lt;li>Do market verticals such as genomics and health care favor one over the other? When do people abandon SATA and converge to NVMe?&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Project tasks:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Review current literature&lt;/li>
&lt;li>Survey what the industry is doing&lt;/li>
&lt;li>Join weekly meetings to discuss findings with Ph.D. students, experienced industry veterans, and faculty (Thursdays 2-3pm; can be adjusted if necessary)&lt;/li>
&lt;li>Product is a slide deck with lots of pictures&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Interesting links:&lt;/strong>&lt;br>
&lt;a href="https://www.opencompute.org/wiki/Storage/NVMeHDD" target="_blank" rel="noopener">https://www.opencompute.org/wiki/Storage/NVMeHDD&lt;/a>&lt;br>
&lt;a href="https://2021ocpglobal.fnvirtual.app/a/event/1714" target="_blank" rel="noopener">https://2021ocpglobal.fnvirtual.app/a/event/1714&lt;/a> (video and slides, requires $0 registration)&lt;br>
&lt;a href="https://www.storagereview.com/news/nvme-hdd-edges-closer-to-reality" target="_blank" rel="noopener">https://www.storagereview.com/news/nvme-hdd-edges-closer-to-reality&lt;/a>&lt;br>
&lt;a href="https://www.tomshardware.com/news/seagate-demonstrates-hdd-with-pcie-nvme-interface" target="_blank" rel="noopener">https://www.tomshardware.com/news/seagate-demonstrates-hdd-with-pcie-nvme-interface&lt;/a>&lt;br>
&lt;a href="https://nvmexpress.org/everything-you-need-to-know-about-the-nvme-2-0-specifications-and-new-technical-proposals/" target="_blank" rel="noopener">https://nvmexpress.org/everything-you-need-to-know-about-the-nvme-2-0-specifications-and-new-technical-proposals/&lt;/a>&lt;br>
&lt;a href="https://www.tomshardware.com/news/nvme-2-0-supports-hard-disk-drives" target="_blank" rel="noopener">https://www.tomshardware.com/news/nvme-2-0-supports-hard-disk-drives&lt;/a>&lt;/p></description></item><item><title>FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/fastensor/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/fastensor/" target="_blank" rel="noopener">FasTensor&lt;/a> is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used for scientific computing and is effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines. FasTensor execution engine exploits the structural-locality in the multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, parallel execution, and so on.&lt;/p>
&lt;h3 id="continuous-integration">Continuous Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:kwu@lbl.gov">John Wu&lt;/a>, &lt;a href="mailto:dbin@lbl.gov">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Develop a test suite for the public API of FasTensor&lt;/li>
&lt;li>Automate execution of the test suite&lt;/li>
&lt;li>Document the continuous integration process&lt;/li>
&lt;li>Develop performance testing suite&lt;/li>
&lt;/ul></description></item><item><title>FasTensor</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/lbl/fastensor/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/fastensor/" target="_blank" rel="noopener">FasTensor&lt;/a> is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used for scientific computing and are effective for expressing a wide range of computations for data analyses, including common aggregation operations from database management systems and advanced machine learning pipelines. The FasTensor execution engine exploits the structural locality in multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, parallel execution, and so on.&lt;/p>
&lt;h3 id="tensor-execution-engine-on-gpu">Tensor execution engine on GPU&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Tensor-based computing is needed by scientific applications and, increasingly, by AI model training. Most tensor libraries are hand-customized and optimized for GPUs, and most of them serve only one kind of application; for example, TensorFlow is optimized only for AI model training. Optimizing a generic tensor computing library on GPUs can benefit a wide range of applications. Our FasTensor, a generic tensor computing library, currently works efficiently only on CPUs. How to run FasTensor on GPUs remains unexplored. Research and development challenges include, but are not limited to: 1) how to maintain the structural locality of tensor data on the GPU; 2) how to reduce the performance loss when the structural locality of a tensor is broken on the GPU.&lt;/p>
&lt;ul>
&lt;li>Develop a mechanism to move user-defined computing kernels onto the GPU&lt;/li>
&lt;li>Evaluate the performance of the execution engine&lt;/li>
&lt;li>Document the execution mechanism&lt;/li>
&lt;li>Develop performance testing suite&lt;/li>
&lt;/ul>
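&lt;p>The stencil metaphor behind FasTensor&amp;rsquo;s user-defined kernels can be sketched in a few lines (a conceptual illustration only, not FasTensor&amp;rsquo;s actual C++ API): each output cell is computed from a small neighborhood of the input, which is exactly the structural locality a GPU port would need to preserve.&lt;/p>

```python
def stencil_1d(data, udf, halo=1):
    """Apply a user-defined function over a sliding neighborhood.
    Border cells without a full neighborhood are left untouched."""
    out = list(data)
    for i in range(halo, len(data) - halo):
        out[i] = udf(data[i - halo : i + halo + 1])
    return out

# A three-point moving average as the user-defined kernel.
smoothed = stencil_1d([1.0, 2.0, 6.0, 4.0], lambda w: sum(w) / len(w))
```

&lt;p>Because each output cell depends only on a fixed-size window, the engine can partition the array across threads or GPU blocks as long as each partition carries a halo of neighboring cells.&lt;/p>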
&lt;h3 id="continuous-integration">Continuous Integration&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Data Management&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, github&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (300 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-wu/">John Wu&lt;/a>, &lt;a href="mailto:dbin@lbl.gov">Bin Dong&lt;/a>, &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>Develop a test suite for the public API of FasTensor&lt;/li>
&lt;li>Automate execution of the test suite&lt;/li>
&lt;li>Document the continuous integration process&lt;/li>
&lt;/ul></description></item><item><title>HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/hdf5/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/hdf5/</guid><description>&lt;p>&lt;a href="https://portal.hdfgroup.org/display/knowledge/What&amp;#43;is&amp;#43;HDF5" target="_blank" rel="noopener">HDF5&lt;/a> is a unique technology suite that makes possible the management of extremely large and complex data collections.&lt;/p>
&lt;p>The HDF5 technology suite includes:&lt;/p>
&lt;ul>
&lt;li>A versatile data model that can represent very complex data objects and a wide variety of metadata.&lt;/li>
&lt;li>A completely portable file format with no limit on the number or size of data objects in the collection.&lt;/li>
&lt;li>A software library that runs on a range of computational platforms, from laptops to massively parallel systems, and implements a high-level API with C, C++, Fortran 90, and Java interfaces.&lt;/li>
&lt;li>A rich set of integrated performance features that allow for access time and storage space optimizations.&lt;/li>
&lt;li>Tools and applications for managing, manipulating, viewing, and analyzing the data in the collection.&lt;/li>
&lt;/ul>
&lt;h3 id="python-interface-to-hdf5-asynchronous-io">Python Interface to HDF5 Asynchronous I/O&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Python&lt;/code>, &lt;code>Async I/O&lt;/code>, &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, C, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>, &lt;a href="mailto:htang4@lbl.gov">Houjun Tang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>HDF5 is a well-known library for storing and accessing (known as &amp;ldquo;Input and Output&amp;rdquo; or I/O) data on high-performance computing systems. Recently, new technologies, such as asynchronous I/O and caching, have been developed to utilize fast memory and storage devices and to hide the I/O latency. Applications can take advantage of an asynchronous interface by scheduling I/O as early as possible and overlapping computation with I/O operations to improve overall performance. The existing HDF5 asynchronous I/O feature supports the C/C++ interface. This project involves the development and performance evaluation of a Python interface that would allow more Python-based scientific codes to use and benefit from the asynchronous I/O.&lt;/p></description></item><item><title>LiveHD (2022)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/livehd/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/livehd/</guid><description>&lt;p>Projects for &lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>. Lead Mentors: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="mailto:swang203@ucsc.edu">Sheng-Hong Wang&lt;/a>.&lt;/p>
&lt;h3 id="hif-tooling">HIF Tooling&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>HIF tooling&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Tools around Hardware Interchange Format (HIF) files&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/hif" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>HIF (&lt;a href="https://github.com/masc-ucsc/hif" target="_blank" rel="noopener">https://github.com/masc-ucsc/hif&lt;/a>) stands for Hardware Interchange Format.
It is designed to be an efficient binary representation with a simple API that
supports the generic graph and tree representations commonly used by hardware
tools. It is not designed to be a universal format, but rather a storage and
traversal format for hardware tools.&lt;/p>
&lt;p>LiveHD has 2 HIF interfaces, the tree (LNAST) and the graph (Lgraph). Both can
read/write HIF format. The idea of this project is to expand the hif repository
to create some small but useful tools around hif. Some projects:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>hif_diff + hif_patch: Create the equivalent of the diff/patch commands that
exist for text, but for HIF files. Since HIF files have a clearer
structure, some patch changes are more constrained or better understood
(IOs and dependences are explicit).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>hif_tree: Print the HIF hierarchy, somewhat similar to GNU tree but showing the HIF hierarchy.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>hif_grep: capacity to grep for some tokens and output a HIF file containing only those. Then hif_tree/hif_cat can show the contents.&lt;/p>
&lt;/li>
&lt;/ul>
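&lt;p>To make the hif_diff idea concrete, here is a toy structural diff over nested dicts standing in for HIF trees (purely illustrative; the real tool would operate on the HIF binary format through its API):&lt;/p>

```python
def tree_diff(old, new, path=""):
    """Return a list of (op, path) edits turning `old` into `new`.
    Dicts stand in for HIF tree nodes; leaves compare by value."""
    edits = []
    for key in sorted(set(old) | set(new)):
        p = f"{path}/{key}"
        if key not in new:
            edits.append(("remove", p))
        elif key not in old:
            edits.append(("add", p))
        elif isinstance(old[key], dict) and isinstance(new[key], dict):
            edits += tree_diff(old[key], new[key], p)  # recurse into subtree
        elif old[key] != new[key]:
            edits.append(("change", p))
    return edits

before = {"top": {"io": {"a": 1}, "sub": {"x": 0}}}
after  = {"top": {"io": {"a": 2}, "sub": {"x": 0}, "new": {}}}
```

&lt;p>Because every edit is addressed by an explicit structural path, a hif_patch counterpart can apply edits more safely than a line-oriented textual patch.&lt;/p>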
&lt;h3 id="mockturtle">Mockturtle&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Perform synthesis for graph in LiveHD using Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17, synthesis&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/cross.md#mockturtle" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>There are some issues with Mockturtle integration (new cells) and it is not using the latest Mockturtle library versions.
The goal is to use Mockturtle (&lt;a href="https://github.com/lsils/mockturtle" target="_blank" rel="noopener">https://github.com/lsils/mockturtle&lt;/a>) with LiveHD. The main characteristics:&lt;/p>
&lt;ul>
&lt;li>Use mockturtle to tmap to LUTs&lt;/li>
&lt;li>Use mockturtle to synthesize (optimize) logic&lt;/li>
&lt;li>Enable cut-rewrite as an option&lt;/li>
&lt;li>Enable hierarchy cross optimization (hier:true option)&lt;/li>
&lt;li>Use the graph labeling to find cluster to optimize&lt;/li>
&lt;li>Re-timing&lt;/li>
&lt;li>Map to LUTs only gates and non-wide arithmetic. E.g.: a 32-bit add is not mapped to LUTs, but a 2-bit add is.&lt;/li>
&lt;li>List of resources to not map:
&lt;ul>
&lt;li>Large ALUs. Large ALUs should have an OpenWare block (hardcoded in FPGAs and advanced adder options in ASIC)&lt;/li>
&lt;li>Multipliers and dividers&lt;/li>
&lt;li>Barrel shifters with non-trivial shifts (beyond 1-2 bits) selectable at run-time&lt;/li>
&lt;li>memories, luts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="query-shell">Query Shell&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Query Shell&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a console app that interacts with LiveHD to query parameters about designs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/cross.md#query-shell-not-lgshell-to-query-graphs" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;ul>
&lt;li>Based on replxx (like lgshell)&lt;/li>
&lt;li>Query bits, ports&amp;hellip; like
&lt;ul>
&lt;li>&lt;a href="https://github.com/rubund/netlist-analyzer" target="_blank" rel="noopener">https://github.com/rubund/netlist-analyzer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html" target="_blank" rel="noopener">https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>It would be cool if subsections (selected) parts can be visualized with something like &lt;a href="https://github.com/nturley/netlistsvg" target="_blank" rel="noopener">https://github.com/nturley/netlistsvg&lt;/a>&lt;/li>
&lt;li>The shell may be expanded to support simulation in the future&lt;/li>
&lt;li>Wavedrom/Duh dumps&lt;/li>
&lt;/ul>
&lt;p>Wavedrom and duh allow dumping bitfield information for structures. It would be interesting to explore dumping tables and bit
fields for Lgraph IOs, and structs/fields inside the module. It may be a way to integrate with the documentation generation.&lt;/p>
&lt;p>Example of queries: show path, show driver/sink of, do topo traversal,&amp;hellip;.&lt;/p>
&lt;p>An interesting extension would be to have some simple embedded language (TCL or ChaiScript or ???) to control queries more
easily and allow building functions/libraries.&lt;/p>
&lt;h3 id="lgraph-and-lnast-check-pass">Lgraph and LNAST check pass&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Lgraph and LNAST check pass&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a pass that checks the integrity/correctness of Lgraph and LNAST&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Large 350 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/cross.md#lgraph-and-lnast-check-pass" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Create a pass that checks that the Lgraph (and/or LNAST) is semantically
correct. The LNAST already has quite a few tests (pass.semantic), but it can be
further expanded. Some checks:&lt;/p>
&lt;ul>
&lt;li>No combinational loops&lt;/li>
&lt;li>No mismatch in bit widths&lt;/li>
&lt;li>No disconnected nodes&lt;/li>
&lt;li>Check for inefficient splits (do not split buses that can be combined)&lt;/li>
&lt;li>Transformation stages should not drop names if the same net is preserved&lt;/li>
&lt;li>No writes in LNAST that are never read&lt;/li>
&lt;li>All the edges are possible. E.g: no pin &amp;lsquo;C&amp;rsquo; in Sum_op&lt;/li>
&lt;/ul>
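&lt;p>For example, the combinational-loop check can be phrased as a topological sort over the netlist: if the sort cannot cover every node, the leftover nodes form a cycle. A sketch in Python with a made-up adjacency-list representation (the real pass would walk Lgraph node edges in C++):&lt;/p>

```python
from collections import deque

def has_combinational_loop(fanout):
    """Kahn's algorithm: if a topological order cannot cover every
    node, the leftovers form a combinational cycle. `fanout` maps
    each node to the nodes its output drives; every node appears
    as a key."""
    indeg = {n: 0 for n in fanout}
    for dsts in fanout.values():
        for d in dsts:
            indeg[d] += 1
    ready = deque(n for n, k in indeg.items() if k == 0)
    visited = 0
    while ready:
        n = ready.popleft()
        visited += 1
        for d in fanout[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return visited != len(fanout)

acyclic = {"a": ["b"], "b": ["c"], "c": []}
looped  = {"x": ["y"], "y": ["x"]}
```

&lt;p>Flop and latch outputs would be treated as graph sources so that only purely combinational cycles are flagged.&lt;/p>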
&lt;h3 id="unbitwidth">unbitwidth&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>unbitwidth&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Not all the variables need bitwidth information. Find the small subset&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/cross.md#unbitwidth-local-and-global-bitwidth" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>This pass is needed to create less verbose CHISEL and Pyrope code generation.&lt;/p>
&lt;p>The LGraph can have bitwidth information for each dpin. This is needed for
Verilog code generation, but not needed for Pyrope or CHISEL. CHISEL can
perform local bitwidth inference and Pyrope can perform global bitwidth
inference.&lt;/p>
&lt;p>A new pass should remove redundant bitwidth information. The information is
redundant because the pass/bitwidth can regenerate it if there is enough
details. The goal is to create a pass/unbitwidth that removes either local or
global bitwidth. The information left should be enough for the bitwidth pass to
regenerate it.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Local bitwidth: It is possible to leave the bitwidth information in many
places with the same results, but for CHISEL the inputs should be
sized. Storage (memories/flops) should have bitwidth when it cannot be
inferred from the inputs.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Global bitwidth: Pyrope bitwidth inference goes across the call hierarchy.
This means that a module could have no bitwidth information at all. We start
from the leaf nodes. If all the bits can be inferred given the inputs, the
module should have no bitwidth. In that case the bitwidth can be inferred from
outside.&lt;/p>
&lt;/li>
&lt;/ul></description></item><item><title>LiveHD (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd/</guid><description>&lt;p>Projects for &lt;a href="https://github.com/masc-ucsc/livehd" target="_blank" rel="noopener">LiveHD&lt;/a>.&lt;br>
Lead Mentors: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>.&lt;br>
Contributor(s): &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shahzaib-kashif/">Shahzaib Kashif&lt;/a>&lt;/p>
&lt;p>LiveHD is a &amp;ldquo;compiler&amp;rdquo; infrastructure for hardware design optimized for synthesis and simulation. The goal is to enable a more productive flow where the ASIC/FPGA designer can work with multiple hardware description languages like CHISEL, Pyrope, or Verilog.&lt;/p>
&lt;p>There are several projects available around LiveHD. A longer explanation and more project options are available at
&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects.md" target="_blank" rel="noopener">projects&lt;/a>. Contact the
mentors to find a project that fits your interests.&lt;/p>
&lt;p>A sample of helpful projects:&lt;/p>
&lt;h3 id="mockturtle">Mockturtle&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Perform synthesis for graph in LiveHD using Mockturtle&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17, synthesis&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_large.md#medium-parallel-and-hierarchical-synthesis-with-mockturtle" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Mockturtle (&lt;a href="https://github.com/lsils/mockturtle" target="_blank" rel="noopener">https://github.com/lsils/mockturtle&lt;/a>) is a synthesis tool partially
integrated with LiveHD. The goal of this task is to iron out bugs and issues
and to use the LiveHD Tasks API to parallelize the synthesis.&lt;/p>
&lt;p>Main features:&lt;/p>
&lt;ul>
&lt;li>The current synthesis divides the circuit into partitions. Each partition can be synthesized in parallel.&lt;/li>
&lt;li>Support hierarchical synthesis to optimize cross Lgraphs (cross verilog module optimization)&lt;/li>
&lt;/ul>
&lt;p>The goal is to use Mockturtle (&lt;a href="https://github.com/lsils/mockturtle" target="_blank" rel="noopener">https://github.com/lsils/mockturtle&lt;/a>) with LiveHD. The main characteristics:&lt;/p>
&lt;ul>
&lt;li>Use mockturtle to tmap to LUTs&lt;/li>
&lt;li>Use mockturtle to synthesize (optimize) logic&lt;/li>
&lt;li>Enable cut-rewrite as an option&lt;/li>
&lt;li>Enable hierarchy cross optimization (hier:true option)&lt;/li>
&lt;li>Use the graph labeling to find cluster to optimize&lt;/li>
&lt;li>Re-timing&lt;/li>
&lt;li>Map to LUTs only gates and non-wide arithmetic. E.g.: a 32-bit add is not mapped to LUTs, but a 2-bit add is.&lt;/li>
&lt;li>List of resources to not map:
&lt;ul>
&lt;li>Large ALUs. Large ALUs should have an OpenWare block (hardcoded in FPGAs and advanced adder options in ASIC)&lt;/li>
&lt;li>Multipliers and dividers&lt;/li>
&lt;li>Barrel shifters with non-trivial shifts (beyond 1-2 bits) selectable at run-time&lt;/li>
&lt;li>memories, luts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="livehd-console">LiveHD Console&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>LiveHD Console&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a console app that interacts with LiveHD to query parameters about designs&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Medium 175 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_small.md#medium-query-shell-not-lgshell-to-query-graphs" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>LiveHD currently uses replxx, but it is a shell/console library that is no longer maintained. The result is that it fails on newer versions of OSX.&lt;/p>
&lt;p>There is an alternative, Crossline (&lt;a href="https://github.com/jcwangxp/Crossline" target="_blank" rel="noopener">https://github.com/jcwangxp/Crossline&lt;/a>). This affects main/main.cpp and nothing else.&lt;/p>
&lt;p>In addition to replacing the current console with one that supports auto-completion, the plan is to add &amp;ldquo;query&amp;rdquo; capabilities to visualize some
of the LiveHD internals.&lt;/p>
&lt;ul>
&lt;li>Query bits, ports&amp;hellip; like
&lt;ul>
&lt;li>&lt;a href="https://github.com/rubund/netlist-analyzer" target="_blank" rel="noopener">https://github.com/rubund/netlist-analyzer&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html" target="_blank" rel="noopener">https://www.jameswhanlon.com/querying-logical-paths-in-a-verilog-design.html&lt;/a>&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>It would be useful if selected subsections could be visualized with something like &lt;a href="https://github.com/nturley/netlistsvg" target="_blank" rel="noopener">https://github.com/nturley/netlistsvg&lt;/a>&lt;/li>
&lt;li>The shell may be expanded to support simulation in the future&lt;/li>
&lt;li>Wavedrom/Duh dumps&lt;/li>
&lt;/ul>
&lt;p>Wavedrom and duh allow dumping bitfield information for structures. It would be interesting to explore dumping tables and bit
fields for Lgraph IOs, and structs/fields inside the module. It may be a way to integrate with the documentation generation.&lt;/p>
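As a sketch of what such a dump could produce, WaveDrom's bitfield renderer consumes a JSON "reg" description of named fields and widths. The field layout below is a made-up example (a RISC-V-style I-format split), not a real Lgraph structure.

```python
# Dump a list of (name, width) fields, LSB first, as a WaveDrom bitfield
# ("reg") description that the WaveDrom renderer can draw.
import json

def to_wavedrom_reg(fields):
    """fields: list of (name, width) pairs, LSB first."""
    return {"reg": [{"name": n, "bits": w} for n, w in fields]}

io = [("opcode", 7), ("rd", 5), ("funct3", 3), ("rs1", 5), ("imm", 12)]
desc = to_wavedrom_reg(io)
assert sum(f["bits"] for f in desc["reg"]) == 32  # fields tile the 32-bit word
print(json.dumps(desc))
```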
&lt;p>Examples of queries: show a path, show the driver/sink of a net, do a topological traversal, etc.&lt;/p>
&lt;h3 id="compiler-error-generation-pass">Compiler error generation pass&lt;/h3>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Title&lt;/td>
&lt;td>Lgraph and LNAST check pass&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Description&lt;/td>
&lt;td>Create a pass that checks the integrity/correctness of Lgraph and LNAST&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Mentor(s)&lt;/td>
&lt;td>Jose Renau and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Skills&lt;/td>
&lt;td>C++17&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Difficulty&lt;/td>
&lt;td>Medium&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Size&lt;/td>
&lt;td>Large 350 hours&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;a href="https://github.com/masc-ucsc/livehd/blob/master/docs/projects_small.md#medium-diagnostics" target="_blank" rel="noopener">Link&lt;/a>&lt;/td>
&lt;td>&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>Create a pass that checks that the Lgraph (and/or LNAST) is semantically
correct. The LNAST already has quite a few tests (pass.semantic), but it can be
further expanded. Some checks:&lt;/p>
&lt;ul>
&lt;li>No combinational loops&lt;/li>
&lt;li>No mismatch in bit widths&lt;/li>
&lt;li>No disconnected nodes&lt;/li>
&lt;li>Check for inefficient splits (do not split buses that can be combined)&lt;/li>
&lt;li>Transformation stages should not drop names if the same net is preserved&lt;/li>
&lt;li>No writes in LNAST that are never read&lt;/li>
&lt;li>All edges are legal. E.g., there is no pin &amp;lsquo;C&amp;rsquo; in a Sum_op&lt;/li>
&lt;/ul></description></item><item><title>Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/osavc/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/osavc/</guid><description>&lt;p>The OSAVC is a vehicle-agnostic open source hardware and software project. This project is designed to provide a real-time hardware controller adaptable to any vehicle type, suitable for aerial, terrestrial, marine, or extraterrestrial vehicles. It allows control researchers to develop state estimation algorithms, sensor calibration algorithms, and vehicle control models in a modular fashion such that once the hardware set has been developed switching algorithms requires only modifying one C function and recompiling.&lt;/p>
&lt;p>Lead mentor: &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>&lt;/p>
&lt;p>Projects for the OSAVC:&lt;/p>
&lt;h3 id="vehiclecraft-sensor-driver-development">Vehicle/Craft sensor driver development&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Driver code to integrate sensor to a microcontroller&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C, I2C, SPI, UART interfaces&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> Aaron Hunter&lt;/li>
&lt;/ul>
&lt;p>Help develop a sensor library for use in autonomous vehicles. Possible sensors include range finders, ping sensors, IMUs, GPS receivers, RC receivers, barometers, air speed sensors, etc. Code will be written in C using state machine methodology and non-blocking algorithms. Test the drivers on a Microchip microcontroller.&lt;/p>
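The driver style described above, a state machine that never blocks waiting on the bus, can be sketched as follows. The real drivers are written in C for a Microchip microcontroller; this Python model, including the bus interface and state names, is purely illustrative.

```python
# Non-blocking sensor driver as a state machine: tick() is called from the
# main loop and advances at most one state per call, never busy-waiting.
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    REQUEST = auto()
    WAIT = auto()
    READY = auto()

class SensorDriver:
    def __init__(self, bus):
        self.bus = bus            # abstract I2C/SPI/UART transport (hypothetical API)
        self.state = State.IDLE
        self.value = None

    def tick(self):
        """One state-machine step; returns immediately in every state."""
        if self.state is State.IDLE:
            self.state = State.REQUEST
        elif self.state is State.REQUEST:
            self.bus.start_read()          # kick off the transfer, don't wait
            self.state = State.WAIT
        elif self.state is State.WAIT:
            if self.bus.data_ready():      # poll; fall through if not ready yet
                self.value = self.bus.read()
                self.state = State.READY
        elif self.state is State.READY:
            self.state = State.IDLE        # consume and restart the cycle
```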
&lt;h3 id="path-finding-algorithm-using-opencv-and-machine-learning">Path finding algorithm using OpenCV and machine learning&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Computer vision, blob detection&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C/Python, OpenCV&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> Aaron Hunter&lt;/li>
&lt;/ul>
&lt;p>Use OpenCV to identify a track for an autonomous vehicle to follow. Build on previous work by developing a new model using EfficientDet and an existing training set of images. Port the model to TFlite and implement on the Coral USB Accelerator. Evaluate its performance against our previous efforts.&lt;/p>
&lt;h3 id="state-estimationsensor-fusion-algorithm-development">State estimation/sensor fusion algorithm development&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Kalman filtering, Mahony filter&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C/Python, Matlab/Simulink, numerical optimization algorithms&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 350 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> Aaron Hunter&lt;/li>
&lt;/ul>
&lt;p>Implement an optimal state estimation algorithm from a model. This model can be derived from a Kalman filter or some other state estimation filter (e.g., Mahony filter). The model takes sensor readings as input and provides an estimate of the state of a vehicle. Finally, convert the model to standard C using Simulink code generation or implement it in Python (for use on a single-board computer, e.g., a Raspberry Pi).&lt;/p></description></item><item><title>Open Source Autonomous Vehicle Controller</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/osavc/</guid><description>&lt;p>The OSAVC is a vehicle-agnostic open source hardware and software project. This project is designed to provide a real-time hardware controller adaptable to any vehicle type, suitable for aerial, terrestrial, marine, or extraterrestrial vehicles. It allows control researchers to develop state estimation algorithms, sensor calibration algorithms, and vehicle control models in a modular fashion such that once the hardware set has been developed switching algorithms requires only modifying one C function and recompiling.&lt;/p>
&lt;p>Lead mentor: &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>&lt;/p>
&lt;p>Projects for the OSAVC:&lt;/p>
&lt;h3 id="vehiclecraft-sensor-driver-development">Vehicle/Craft sensor driver development&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Driver code to integrate sensor to a microcontroller&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C, I2C, SPI, UART interfaces&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>, &lt;a href="mailto:caiespin@ucsc.edu">Carlos Espinosa&lt;/a>, Pavlo Vlastos&lt;/li>
&lt;/ul>
&lt;p>Help develop sensor libraries for use in autonomous vehicles. We are particularly interested in sensors for UAVs: airspeed sensors (pitot tubes) and barometers, but also proximity detectors (ultrasonic) and range sensors. Code will be written in C using state machine methodology and non-blocking algorithms. Test the drivers on a Microchip microcontroller.&lt;/p>
&lt;h3 id="technical-documentation">Technical Documentation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Documentation&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Technical writing, markdown language, website&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> Aaron Hunter/Carlos Espinosa/Pavlo Vlastos&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aniruddha-thakre/">Aniruddha Thakre&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Write a tutorial demonstrating how to start with an OSAVC and program it with the robotic equivalent of HelloWorld, moving on to more sophisticated applications. Create a web page interface to the OSAVC repo highlighting this tutorial. In this project you will start from scratch with an OSAVC PCB and bring it to life, while documenting it in a way that helps new users.&lt;/p>
&lt;h3 id="rosgazebo-robot-simulation">ROS/Gazebo Robot Simulation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Robot simulation with ROS/Gazebo&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong> ROS/Gazebo, Python&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong> 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong> &lt;a href="mailto:aamuhunt@ucsc.edu">Aaron Hunter&lt;/a>, &lt;a href="mailto:caiespin@ucsc.edu">Carlos Espinosa&lt;/a>, Pavlo Vlastos&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/damodar-datta-kancharla/">Damodar Datta Kancharla&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Generate a simulated world and a quadcopter model in ROS/Gazebo. Provide a link from Mavlink to ROS using the mavros package and simulate a real vehicle data stream to command the simulated quadcopter in Gazebo. At the same time return the image stream from Gazebo to allow for offline processing of ML models on the images.&lt;/p></description></item><item><title>OpenRAM</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/openram/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/openram/</guid><description>&lt;p>&lt;a href="https://github.com/VLSIDA/OpenRAM" target="_blank" rel="noopener">OpenRAM&lt;/a> is an award winning open-source Python framework to create the layout, netlists, timing and power models, placement and routing models, and other views necessary to use SRAMs in ASIC design. OpenRAM supports integration in both commercial and open-source flows with both predictive and fabricable technologies. Most recently, it has created memories that are included on all of the &lt;a href="https://efabless.com/open_shuttle_program/" target="_blank" rel="noopener">eFabless/Google/Skywater MPW tape-outs&lt;/a>.&lt;/p>
&lt;h3 id="replace-logging-framework-with-library">Replace logging framework with library&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>User Interfaces&lt;/code>, &lt;code>Python APIs&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>,&lt;a href="mailto:jcirimel@ucsc.edu">Jesse Cirimelli-Low&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Replace the custom logging framework in OpenRAM with &lt;a href="https://docs.python.org/3/library/logging.html" target="_blank" rel="noopener">Python logging&lt;/a> module. New logging should allow levels of detail as well as tags to enable/disable logging of particular features to aid debugging.&lt;/p>
&lt;h3 id="rom-generator">ROM generator&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>VLSI Design Basics&lt;/code>, &lt;code>Memories&lt;/code>, &lt;code>Python&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, VLSI&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium/Challenging&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Use the OpenRAM API to generate a Read-Only Memory (ROM) file from an input hex file. Project
will automatically generate a Spice netlist, layout, Verilog model and timing characterization.&lt;/p>
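The generator's first step, reading the input hex file into words, might look like the sketch below. The file format assumed here (one hex word per line, `//` comments, as Verilog's `$readmemh` accepts) is our assumption, not a documented OpenRAM interface.

```python
# Parse a hex file into integer words that a ROM generator would map onto
# bit cells; width checking catches words that don't fit the ROM's word size.
def read_hex_words(text: str, word_bits: int = 8) -> list[int]:
    words = []
    for line in text.splitlines():
        line = line.split("//")[0].strip()   # allow trailing comments
        if not line:
            continue
        value = int(line, 16)
        assert value < (1 << word_bits), f"{line} exceeds {word_bits} bits"
        words.append(value)
    return words

print(read_hex_words("0f\n3a // reset vector\n\nff"))  # [15, 58, 255]
```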
&lt;h3 id="register-file-generator">Register File generator&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>VLSI Design Basics&lt;/code>, &lt;code>Memories&lt;/code>, &lt;code>Python&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, VLSI&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium/Challenging&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Use the OpenRAM API to generate a Register File from standard library cells. Project
will automatically generate a Spice netlist, layout, Verilog model and timing characterization.&lt;/p>
&lt;h3 id="built-in-self-test-and-repair">Built-In Self Test and Repair&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>VLSI Design Basics&lt;/code>, &lt;code>Python&lt;/code>, &lt;code>Verilog&lt;/code>, &lt;code>Testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Verilog&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium/Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>, &lt;a href="mailto:bonal@ucsc.edu">Bugra Onal&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Finish integration of a parameterized Verilog module to support Built-In Self-Test and Repair
of OpenRAM memories using spare rows and columns.&lt;/p>
&lt;h3 id="layout-verses-schematic-lvs-visualization">Layout verses Schematic (LVS) visualization&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>VLSI Design Basics&lt;/code>, &lt;code>Python&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, VLSI, JSON&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>,&lt;a href="mailto:jcirimel@ucsc.edu">Jesse Cirimelli-Low&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create a visualization interface to debug layout versus schematic mismatches in the &lt;a href="https://github.com/RTimothyEdwards/magic" target="_blank" rel="noopener">Magic&lt;/a> layout editor. Results will be parsed from a JSON output of &lt;a href="https://github.com/RTimothyEdwards/netgen" target="_blank" rel="noopener">Netgen&lt;/a>.&lt;/p></description></item><item><title>OpenROAD - A Complete, Autonomous RTL-GDSII Flow for VLSI Designs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/openroad/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/openroad/</guid><description>&lt;p>&lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> is a front-runner in open-source semiconductor design automation tools and know-how. OpenROAD reduces barriers of access and tool costs to democratize system and product innovation in silicon. The OpenROAD tool and flow provide an autonomous, no-human-in-the-loop, 24-hour RTL-GDSII capability to support low-overhead design exploration and implementation through tapeout. We welcome a diverse community of designers, researchers, enthusiasts and entrepreneurs who use and contribute to OpenROAD to make a far-reaching impact.
Our mission is to democratize and advance design automation of semiconductor devices through leadership, innovation, and collaboration.&lt;/p>
&lt;p>OpenROAD is the key enabler of successful chip initiatives like the Google-sponsored &lt;a href="https://efabless.com" target="_blank" rel="noopener">Efabless&lt;/a> program, which has made possible more than 150 successful tapeouts by a diverse and global user community. The OpenROAD project repository is &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">https://github.com/The-OpenROAD-Project/OpenROAD&lt;/a>.&lt;/p>
&lt;p>Design of static RAMs in VLSI designs for good performance and area is generally time-consuming. Memory compilers significantly reduce design time for complex analog and mixed-signal designs by allowing designers to explore, verify and configure multiple variants and hence select a design that is optimal for area and performance. This project adds memory compiler support to &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD-flow-scripts" target="_blank" rel="noopener">OpenROAD-flow-scripts&lt;/a> based on popular PDKs, such as those provided by &lt;a href="https://github.com/vlsida/openram" target="_blank" rel="noopener">OpenRAM&lt;/a>.&lt;/p>
&lt;h3 id="openlane-memory-design-macro-floorplanning">OpenLane Memory Design Macro Floorplanning&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Memory Compilers&lt;/code>, &lt;code>OpenRAM&lt;/code>, &lt;code>Programmable RAM&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: python, basic knowledge of memory design, VLSI technology, PDK, Verilog&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>, &lt;a href="mailto:mehdi@umich.edu">Mehdi Saligane&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Improve and verify &lt;a href="https://github.com/The-OpenROAD-Project/OpenLane" target="_blank" rel="noopener">OpenLane&lt;/a> design planning with OpenRAM memories. Specifically, this project will utilize the macro placer/floorplanner and resolve any issues for memory placement. Issues that will need to be addressed may include power supply connectivity, ability to rotate memory macros, and solving pin-access issues.&lt;/p>
&lt;h3 id="openlane-memory-design-timing-analysis">OpenLane Memory Design Timing Analysis&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Memory Compilers&lt;/code>, &lt;code>OpenRAM&lt;/code>, &lt;code>Programmable RAM&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: python, basic knowledge of memory design, VLSI technology, PDK, Verilog&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>, &lt;a href="mailto:mehdi@umich.edu">Mehdi Saligane&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Improve and verify &lt;a href="https://github.com/The-OpenROAD-Project/OpenLane" target="_blank" rel="noopener">OpenLane&lt;/a> Static Timing Analysis using OpenRAM generated library files. Specifically, this will include verifying setup/hold conditions as well as creating additional checks such as minimum period, minimum pulse width, etc. Also, the project will add timing information to Verilog behavioral model.&lt;/p>
&lt;h3 id="openlane-memory-macro-pdk-support">OpenLane Memory Macro PDK Support&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Memory Compilers&lt;/code>, &lt;code>OpenRAM&lt;/code>, &lt;code>Programmable RAM&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: python, basic knowledge of memory design, VLSI technology, PDK, Verilog&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:mrg@ucsc.edu">Matthew Guthaus&lt;/a>, &lt;a href="mailto:mehdi@umich.edu">Mehdi Saligane&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integrate and verify FreePDK45 OpenRAM memories with an &lt;a href="https://github.com/The-OpenROAD-Project/OpenLane" target="_blank" rel="noopener">OpenLane&lt;/a> FreePDK45 design flow. OpenLane currently supports only Skywater 130nm PDK, but OpenROAD supports FreePDK45 (which is the same as Nangate45). This project will create a design using OpenRAM memories with the OpenLane flow using FreePDK45.&lt;/p>
&lt;h3 id="vlsi-power-planning-and-analysis">VLSI Power Planning and Analysis&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Power Planning for VLSI&lt;/code>, &lt;code>IR Drop Analysis&lt;/code>, &lt;code>Power grid Creation and Analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, tcl, VLSI Layout&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:mehdi@umich.edu">Mehdi Saligane&lt;/a>, &lt;a href="mailto:minghung@umich.edu">Ming-Hung&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Take the existing power planning (pdngen.tcl) module of openroad and recode the functionality in C++ ensuring that all of the unit tests on the existing code pass correctly. Work with a senior member of the team at ARM. Ensure that designs created are of good quality for power routing and overall power consumption.&lt;/p>
&lt;h3 id="demos-and-tutorials">Demos and Tutorials&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Demo Development&lt;/code>, &lt;code>Documentation&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design flow, tcl, shell scripts, Documentation, Markdown&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>For &lt;a href="https://github.com/The-OpenROAD-Project/OpenLane" target="_blank" rel="noopener">OpenLane&lt;/a>, develop demos showing:
The OpenLane flow and highight key features
GUI visualizations
Design Explorations and Experiments
Different design styles and particular challenges&lt;/p>
&lt;h3 id="comprehensive-flow-testing">Comprehensive Flow Testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Testing&lt;/code>, &lt;code>Documentation&lt;/code>, &lt;code>VLSI design basics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Knowledge of EDA tools, basics of VLSI design, tcl, shell scripts, Verilog, Layout&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop detailed test plans to test the OpenLane flow to expand coverage and advanced features. Add open-source designs to the regression test suite to improve tool quality and robustness. This includes design specification, configuration and creation of all necessary files for regression testing. Suggested sources: ISCAS benchmarks, OpenCores, LSOracle for a synthesis flow option.&lt;/p>
&lt;h3 id="enhance-gui-features">Enhance GUI features&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>GUI&lt;/code>, &lt;code>Visualization&lt;/code>, &lt;code>User Interfaces&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Qt&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>For &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a>, develop and enhance visualizations for EDA data and algorithms in the OpenROAD GUI. Allow deeper understanding of the tool results for users and tool internals for developers.&lt;/p>
&lt;h3 id="automate-opendb-code-generation">Automate OpenDB code Generation&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Database&lt;/code>, &lt;code>EDA&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Python, JSON, Jinja templating&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a>, &lt;a href="mailto:aspyrou@eng.ucsd.edu">Tom Spyrou&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>For &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a>- Automatic code generation for the OpenDB database which allows improvements to the data model with much less hand coding. Allow the generation of storage, serialization, and callback code from a custom schema description format.
r&lt;/p>
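A toy version of schema-driven code generation looks like this: a declarative field list drives the emission of repetitive storage and serialization code. The schema shape, class name, and emitted C++ snippet are invented for illustration; the actual project uses its own schema format with Jinja templating.

```python
# Generate C++ storage members and a serialization routine from a schema dict,
# so adding a field to the schema updates every generated view at once.
SCHEMA = {"class": "dbNet", "fields": [("name", "std::string"), ("weight", "int")]}

def gen_class(schema: dict) -> str:
    lines = [f"class {schema['class']} {{", " public:"]
    lines += [f"  {ctype} {name}_;" for name, ctype in schema["fields"]]
    lines.append("  void serialize(Stream& s) {")
    lines += [f"    s << {name}_;" for name, _ in schema["fields"]]
    lines += ["  }", "};"]
    return "\n".join(lines)

print(gen_class(SCHEMA))
```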
&lt;h3 id="implement-an-nlp-based-ai-bot-aimed-at-increasing-users-enhancing-usability-and-building-a-knowledge-base">Implement an NLP based AI bot aimed at increasing users, enhancing usability and building a knowledge base&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>AI&lt;/code>, &lt;code>ML&lt;/code>, &lt;code>Analytics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, ML libraries (e.g., Tensorflow, PyTorch)&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vitor-bandeira/">Vitor Bandeira&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/indira-iyer/">Indira Iyer&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The &lt;a href="https://github.com/The-OpenROAD-Project/OpenROAD" target="_blank" rel="noopener">OpenROAD&lt;/a> project contains a storehouse of knowledge in it&amp;rsquo;s Github repositories within Issues and Pull requests. Additionally, project related slack channels also hold useful information in the form of questions and answers, problems and solutions in conversation threads. Implement an AI analytics bot that filters, selects relevant discussions and classifies/records them into useful documentation and actionable issues. This should also directly track, increase project usage and report outcome metrics.&lt;/p></description></item><item><title>Package Management &amp; Reproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/packaging/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/packaging/</guid><description>&lt;p>Project ideas related to reproducibility and package management, especially as it relates to &lt;em>store type package managers&lt;/em> (&lt;a href="http://nixos.org/" target="_blank" rel="noopener">NixOS&lt;/a>, &lt;a href="https://guix.gnu.org/" target="_blank" rel="noopener">Guix&lt;/a> or &lt;a href="https://spack.io/" target="_blank" rel="noopener">Spack&lt;/a>).&lt;/p>
&lt;p>Lead Mentor: &lt;a href="https://users.soe.ucsc.edu/~fmzakari" target="_blank" rel="noopener">Farid Zakaria&lt;/a> &lt;a href="mailto:fmzakari@ucsc.edu">mailto:fmzakari@ucsc.edu&lt;/a>&lt;/p>
&lt;h3 id="investigate-the-dynamic-linking-landscape">Investigate the dynamic linking landscape&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Operating Systems&lt;/code> &lt;code>Compilers&lt;/code> &lt;code>Linux&lt;/code> &lt;code>Package Management&lt;/code> &lt;code>NixOS&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Experience with systems programming and Linux familiarity&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate to Challenging&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:fmzakari@ucsc.edu">Farid Zakaria&lt;/a> &amp;amp; &lt;a href="https://people.llnl.gov/scogland1" target="_blank" rel="noopener">Tom Scogland&lt;/a> &lt;a href="mailto:scogland1@llnl.gov">mailto:scogland1@llnl.gov&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Dynamic linking as specified in the ELF file format has gone largely unchallenged since its invention. With many new package management models that eschew the filesystem hierarchy standard (e.g., Nix, Guix and Spack), many of the idiosyncrasies that define the way in which libraries are discovered are no longer useful and are potentially harmful.&lt;/p>
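The discovery idiosyncrasies in question can be modeled roughly as follows: glibc's dynamic linker searches `DT_RPATH` (only when `DT_RUNPATH` is absent), then `LD_LIBRARY_PATH`, then `DT_RUNPATH`, then the ld.so cache and default directories. The sketch below is a deliberately simplified model of that order (see ld.so(8) for the full rules); the filesystem is stubbed as a dict.

```python
# Simplified model of the ELF dynamic-linker library search order.
def resolve(lib, fs, rpath=(), ld_library_path=(), runpath=(),
            cache=None, defaults=("/lib", "/usr/lib")):
    """fs: dict mapping directory -> set of library names present there."""
    dirs = []
    if not runpath:            # DT_RPATH is only honored when DT_RUNPATH is absent
        dirs += rpath
    dirs += ld_library_path
    dirs += runpath
    for d in dirs:
        if lib in fs.get(d, set()):
            return f"{d}/{lib}"
    if cache and lib in cache: # ld.so cache comes before the default directories
        return cache[lib]
    for d in defaults:
        if lib in fs.get(d, set()):
            return f"{d}/{lib}"
    return None
```

Tools like Shrinkwrap collapse this entire search into absolute paths recorded at build time, so the resolution above happens once instead of on every process start.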
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Continue development on &lt;a href="https://github.com/fzakaria/shrinkwrap" target="_blank" rel="noopener">Shrinkwrap&lt;/a> a tool to make dynamic library loading simpler and more robust.&lt;/li>
&lt;li>Evaluate its effectiveness across a wide range of binaries.&lt;/li>
&lt;li>Upstream contributions to &lt;a href="http://nixos.org/" target="_blank" rel="noopener">NixOS&lt;/a> or &lt;a href="https://guix.gnu.org/" target="_blank" rel="noopener">Guix&lt;/a> to leverage the improvement when suitable.&lt;/li>
&lt;li>Investigate alternative improvements to dynamic linking by writing a dynamic linker &amp;ldquo;loader wrapper&amp;rdquo; to explore new ideas.&lt;/li>
&lt;/ul></description></item><item><title>Polyphorm / PolyPhy</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/polyphorm/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/polyphorm/</guid><description>&lt;p>&lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">Polyphorm&lt;/a> is an agent-based system for reconstructing and visualizing &lt;em>optimal transport networks&lt;/em> defined over sparse data. Rooted in astronomy and inspired by nature, we have used Polyphorm to reconstruct the &lt;a href="https://youtu.be/5ILwq5OFuwY" target="_blank" rel="noopener">Cosmic web&lt;/a> structure, but also to discover network-like patterns in natural language data. You can find more details about our research &lt;a href="https://elek.pub/projects/Rhizome-Cosmology" target="_blank" rel="noopener">here&lt;/a>. Under the hood, Polyphorm uses a richer 3D scalar field representation of the reconstructed network, instead of a discrete representation like a graph or a mesh.&lt;/p>
&lt;p>&lt;strong>PolyPhy&lt;/strong> will be a Python-based redesigned version of Polyphorm, currently in the beginning of its development cycle. PolyPhy will be a multi-platform toolkit meant for a wide audience across different disciplines: astronomers, neuroscientists, data scientists and even artists and designers. All of the offered projects focus on PolyPhy, with a variety of topics including design, coding, and even research. Ultimately, PolyPhy will become a tool for discovering connections between different disciplines by creating quantitatively comparable structural analytics.&lt;/p>
&lt;h3 id="develop-website-for-polyphy">Develop website for PolyPhy&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Web Development&lt;/code> &lt;code>Dynamic Updates&lt;/code> &lt;code>UX&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> web development experience, good communicator, (HTML/CSS), (JavaScript)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a clean and welcoming website for the project. The organization needs to reflect the needs of PolyPhy users, but also provide a convenient entry point for interested project contributors. No excessive pop-ups or webjunk.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Port the contents of the &lt;a href="https://github.com/CreativeCodingLab/Polyphorm" target="_blank" rel="noopener">repository page&lt;/a> to a dedicated website.&lt;/li>
&lt;li>Design the structure of the website according to open-source best practices.&lt;/li>
&lt;li>Work with the visual designer (see below) in creating a coherent and organic presentation.&lt;/li>
&lt;li>Interactively link important metrics from the project dev environment as well as documentation.&lt;/li>
&lt;/ul>
&lt;h3 id="design-visual-experience-for-polyphys-website-and-presentations">Design visual experience for PolyPhy&amp;rsquo;s website and presentations&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Design&lt;/code> &lt;code>Art&lt;/code> &lt;code>UX&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> vector and bitmap drawing, sense for spatial symmetry and framing, (interactive content creation), (animation)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop visual content for the project using its main themes: nature-inspired computation, biomimetics, interconnected structures. Aid in designing visual structure of the website as well as other public-facing artifacts.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Design imagery and other graphical elements to visually (re-)present PolyPhy.&lt;/li>
&lt;li>Work with the technical writer (see below) in designing a coherent story.&lt;/li>
&lt;li>Work with the web developer (see above) in creating a coherent and organic presentation.&lt;/li>
&lt;/ul>
&lt;h3 id="write-polyphys-technical-story-and-content">Write PolyPhy&amp;rsquo;s technical story and content&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Writing&lt;/code> &lt;code>Documentation&lt;/code> &lt;code>Storytelling&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experienced writing structured text over 10 pages, well read, (technical or scientific education)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Integral to PolyPhy&amp;rsquo;s presentation is a story that the users and the project contributors can relate to. The objective is to develop the verbal part of that story, as well as major portions of technical documentation that matches it. The difficulty of the project is scalable.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context of the project.&lt;/li>
&lt;li>Write different pages of the project website.&lt;/li>
&lt;li>Work with mentors to improve project&amp;rsquo;s written community practices (diversity, communication).&lt;/li>
&lt;li>Write and edit narrative and explanatory parts of PolyPhy&amp;rsquo;s documentation.&lt;/li>
&lt;li>Work with the visual designer (see above) in designing a coherent story.&lt;/li>
&lt;/ul>
&lt;h3 id="video-tutorials-and-presentation-for-polyphy">Video tutorials and presentation for PolyPhy&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Video Presentation&lt;/code> &lt;code>Tutorials&lt;/code> &lt;code>Didactics&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> video editing, creating educational content, communication, (native or fluent in another language)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy-Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:deehrlic@ucsc.edu">Drew Ehrlich&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Create a public face for PolyPhy that reflects its history, context, and teaches its functionality to users in different degrees of familiarity.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Work with mentors on understanding the context and history of the project.&lt;/li>
&lt;li>Interview diverse project contributors.&lt;/li>
&lt;li>Create a video documenting PolyPhy&amp;rsquo;s history, with roots in astronomy, complex systems, fractals.&lt;/li>
&lt;li>Create a set of tutorial videos for starting and intermediate PolyPhy users.&lt;/li>
&lt;li>Create an accessible template for future tutorials.&lt;/li>
&lt;/ul>
&lt;h3 id="implement-heterogeneous-data-io-ops">Implement heterogeneous data I/O ops&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O Operations&lt;/code> &lt;code>File Conversion&lt;/code> &lt;code>Numerics&lt;/code> &lt;code>Testing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, experience working with scientific or statistical data, good debugging skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate-Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>By default, PolyPhy operates with an unordered set of points as an input and scalar fields (float ndarrays) as an output, but others are applicable as well. Design and implement interfaces to load and export different data formats (CSV, OBJ, HDF5, FITS&amp;hellip;) and modalities (points, meshes, density fields). The difficulty of the project can be scaled based on contributor&amp;rsquo;s interest.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Research which modalities are used by members of the target communities.&lt;/li>
&lt;li>Implement modular loaders for the inputs and an interface to PolyPhy core.&lt;/li>
&lt;li>Implement exporters for simulation datasets and visualization captures.&lt;/li>
&lt;li>Write testing code for the above.&lt;/li>
&lt;li>Integrate external packages as necessary.&lt;/li>
&lt;/ul>
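&lt;p>As a sketch of how the modular loaders might fit together: an extension-keyed registry dispatching to format-specific readers, each returning a common point-set representation. The registry, function names, and CSV column layout below are illustrative assumptions, not PolyPhy&amp;rsquo;s actual API.&lt;/p>

```python
import csv
import io

# Hypothetical registry mapping file extensions to loader callables.
LOADERS = {}

def register_loader(ext):
    def wrap(fn):
        LOADERS[ext] = fn
        return fn
    return wrap

@register_loader(".csv")
def load_csv_points(fileobj):
    """Read x,y,z columns from a CSV file into a list of point tuples."""
    reader = csv.DictReader(fileobj)
    return [(float(r["x"]), float(r["y"]), float(r["z"])) for r in reader]

def load_points(path, fileobj):
    """Dispatch to the loader registered for the file's extension."""
    for ext, fn in LOADERS.items():
        if path.endswith(ext):
            return fn(fileobj)
    raise ValueError("no loader registered for " + path)

# Example: an in-memory CSV standing in for a real file.
data = io.StringIO("x,y,z\n0.0,1.0,2.0\n3.0,4.0,5.0\n")
points = load_points("galaxies.csv", data)
```

Exporters for HDF5 or FITS would register the same way, keeping PolyPhy&amp;rsquo;s core unaware of individual file formats.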
&lt;h3 id="setup-cicd-for-polyphy">Setup CI/CD for PolyPhy&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Continuous Integration&lt;/code> &lt;code>Continuous Deployment&lt;/code> &lt;code>DevOps&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> experience with CI/CD, GitHub, Python package deployment&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:anishagoel14@gmail.com">Anisha Goel&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The objective is to set up a CI/CD pipeline that automates build testing and deployment of the software. The resulting process needs to be robust to contributor errors and work in the distributed conditions of a diverse contributor base.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Automate continuous building, testing, merging and deployment for PolyPhy in GitHub.&lt;/li>
&lt;li>Publish the CI/CD metrics and build assets to the project webpage.&lt;/li>
&lt;li>Work with other contributors in educating them about the best practices of using the developed CI/CD pipeline.&lt;/li>
&lt;li>Add support for automated packaging using common management systems (pip, Anaconda).&lt;/li>
&lt;/ul>
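&lt;p>A minimal GitHub Actions workflow along these lines would run the test suite on every push and pull request; job names, Python versions, and install commands are placeholders to adapt to the repository.&lt;/p>

```yaml
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10", "3.11"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      # Assumes the package declares a "dev" extra with test dependencies.
      - run: pip install -e ".[dev]"
      - run: pytest
```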
&lt;h3 id="refine-polyphys-ui-and-develop-new-functional-elements">Refine PolyPhy&amp;rsquo;s UI and develop new functional elements&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>UI/UX&lt;/code> &lt;code>Visual Experience&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python programming, UI/UX development experience, (knowledge of graphics)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:dabramov@ucsc.edu">David Abramov&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The key feature of PolyPhy is its interactivity. By interacting with the underlying simulation model, the user can adjust its parameters in real time and respond to its behavior. For instance, an astrophysics expert can load a dataset of 100k galaxies and reconstruct the large-scale structure of the intergalactic medium. A responsive UI combined with real-time visualization allows them to judge the fidelity of the reconstruction and make necessary changes.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Implement a platform-agnostic UI to house PolyPhy&amp;rsquo;s main rendering context as well as secondary analytics.&lt;/li>
&lt;li>Work with the visualization developer (see below) to integrate the rendering functionality.&lt;/li>
&lt;li>Optimize the UI&amp;rsquo;s performance.&lt;/li>
&lt;li>Test the implementation on different OS platforms.&lt;/li>
&lt;/ul>
&lt;h3 id="create-new-data-visualization-regimes">Create new data visualization regimes&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Interactive Visualization&lt;/code> &lt;code>Data Analytics&lt;/code> &lt;code>3D Rendering&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> basic graphics theory and math, Python, GPU programming, (previous experience visualizing novel datasets)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:dabramov@ucsc.edu">David Abramov&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Data visualization is one of the core components of PolyPhy, as it provides a real-time overview of the underlying MCPM simulation. Through the feedback provided by the visualization, PolyPhy users can adjust the simulation model and make new findings about the dataset. Various operations over the reconstructed data (e.g. spatial searching) as well as important statistical summaries also benefit from clear visual presentation.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Develop novel ways of visualizing scientific data in PolyPhy.&lt;/li>
&lt;li>Work with diverse data modalities - point clouds, graphs, scalar and vector fields.&lt;/li>
&lt;li>Add support for visualizing metadata, such as annotations and labels.&lt;/li>
&lt;li>Create UI elements for plotting statistical summaries computed in real-time.&lt;/li>
&lt;/ul>
&lt;h3 id="discrete-graph-extraction-from-simulated-scalar-fields">Discrete graph extraction from simulated scalar fields&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Graph Theory&lt;/code> &lt;code>Data Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> good understanding of discrete math and graph theory, Python, (GPU programming)&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:oelek@ucsc.edu">Oskar Elek&lt;/a>, &lt;a href="mailto:farhasan@nmsu.edu">Farhanul Hasan&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Develop a custom method for graph extraction from scalar field data produced by PolyPhy. Because PolyPhy typically produces network-like structures, representing these structures as weighted discrete graphs is very useful for efficiently navigating the data. The most important property of this abstracted representation is that it preserves the topology of the base scalar field by navigating the 1D ridges of the scalar field.&lt;/p>
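&lt;p>Once the ridge network is abstracted into weighted nodes and edges, operations such as shortest-path queries reduce to standard graph algorithms. A minimal sketch using Dijkstra&amp;rsquo;s algorithm over a toy adjacency map (the graph layout and weights are illustrative, not extracted data):&lt;/p>

```python
import heapq

INF = float("inf")

def shortest_path(graph, src, dst):
    """Dijkstra over an adjacency map {node: {neighbor: weight}}.

    Edge weights stand in for the throughput-derived costs between
    node locations in the extracted transport network.
    """
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if dist.get(v, INF) > nd:
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Walk predecessors back from the destination.
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    path.reverse()
    return path, dist[dst]

# Toy filament network: three nodes with weighted links.
g = {"a": {"b": 1.0, "c": 4.0}, "b": {"a": 1.0, "c": 1.5}, "c": {}}
path, cost = shortest_path(g, "a", "c")
```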
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Become familiar with different algorithms for graph growing and skeleton extraction.&lt;/li>
&lt;li>Implement the most suitable method in PolyPhy, interpreting the source scalar field as a throughput (transport) network. The weights of the resulting graph need to reflect the source throughputs between the respective node locations.&lt;/li>
&lt;li>Implement common graph operations, e.g. hierarchical clustering and reduction, shortest path between two nodes, range queries.&lt;/li>
&lt;li>Optimize the runtime of the implemented methods.&lt;/li>
&lt;li>Work with the visualization developer (see above) to visualize the resulting graphs.&lt;/li>
&lt;/ul></description></item><item><title>Proactive Data Containers (PDC)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/pdc/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/lbl/pdc/</guid><description>&lt;p>&lt;a href="https://sdm.lbl.gov/pdc/about.html" target="_blank" rel="noopener">Proactive Data Containers&lt;/a> (PDC) are containers within a locus of storage (memory, NVRAM, disk, etc.) that store science data in an object-oriented manner. Managing data as objects enables powerful optimization opportunities for data movement and
transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning.&lt;/p>
&lt;h3 id="python-interface-to-an-object-centric-data-management-system">Python interface to an object-centric data management system&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Python&lt;/code>, &lt;code>object-centric data management&lt;/code>, &lt;code>PDC&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, C, PDC&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>, &lt;a href="mailto:htang4@lbl.gov">Houjun Tang&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://sdm.lbl.gov/pdc/about.html" target="_blank" rel="noopener">Proactive Data Containers (PDC)&lt;/a> is an object-centric data management system for scientific data on high performance computing systems. It manages objects and their associated metadata within a locus of storage (memory, NVRAM, disk, etc.). Managing data as objects enables powerful optimization opportunities for data movement and transformations, and storage mechanisms that take advantage of the deep storage hierarchy and enable automated performance tuning. Currently PDC has a C interface. Providing a python interface would make it easier for more Python applications to utilize it.&lt;/p></description></item><item><title>Skyhook Data Management</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/skyhookdm/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/skyhookdm/</guid><description>&lt;p>&lt;a href="https://iris-hep.org/projects/skyhookdm.html" target="_blank" rel="noopener">SkyhookDM&lt;/a>&lt;/p>
&lt;p>The Skyhook Data Management project extends object storage with data
management functionality for tabular data. SkyhookDM enables storing and querying
tabular data in the &lt;a href="https://ceph.io" target="_blank" rel="noopener">Ceph&lt;/a> distributed object storage system. It thereby
turns Ceph into an &lt;a href="https://arrow.apache.org" target="_blank" rel="noopener">Apache Arrow&lt;/a>-native
storage system, utilizing the Arrow Dataset API to store and query data with server-side data processing, including selection and projection that can significantly reduce the data returned to the client.&lt;/p>
&lt;p>SkyhookDM is now part of Apache Arrow (see &lt;a href="https://arrow.apache.org/blog/2022/01/31/skyhook-bringing-computation-to-storage-with-apache-arrow/" target="_blank" rel="noopener">blog post&lt;/a>).&lt;/p>
&lt;hr>
&lt;h3 id="support-reading-from-skyhook-in-daskray-using-the-arrow-dataset-api">Support reading from Skyhook in Dask/Ray using the Arrow Dataset API&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Arrow&lt;/code>, &lt;code>Dask/Ray&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:jayjeetc@ucsc.edu">Jayjeet Chakraborty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Problem:&lt;/strong> Dask and Ray are parallel-computing frameworks similar to Apache Spark but native to the Python ecosystem. Each of these frameworks supports reading tabular data from different sources such as a local filesystem or cloud object stores, and both have recently added support for the Arrow Dataset API. Since the Arrow Dataset API supports Skyhook, we can leverage this capability to offload compute-heavy Parquet file decoding and decompression into the Ceph storage layer. This can speed up queries significantly, as CPU in the Dask/Ray workers is freed up for other processing tasks.&lt;/p>
&lt;h3 id="implement-gandiva-based-query-executor-in-skyhookdm">Implement Gandiva based query executor in SkyhookDM&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Arrow&lt;/code>, &lt;code>Gandiva&lt;/code>, &lt;code>SIMD&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 350 hours&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:jayjeetc@ucsc.edu">Jayjeet Chakraborty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Problem:&lt;/strong> &lt;a href="https://arrow.apache.org/blog/2018/12/05/gandiva-donation/" target="_blank" rel="noopener">Gandiva&lt;/a> allows efficient evaluation of query expressions through runtime code generation with LLVM. The generated code leverages SIMD instructions and is highly optimized for parallel processing on modern CPUs. It is natively supported by Arrow for compiling and executing expressions. SkyhookDM currently uses the Arrow Dataset API (which internally uses the Arrow Compute APIs) to execute query expressions inside the Ceph OSDs. Since the Arrow Dataset API does not currently support Gandiva, the goal of this project is to add Gandiva support to the Arrow Dataset API in order to accelerate query processing when offloaded to the storage layer. This will help Skyhook combat some of the performance issues caused by Arrow&amp;rsquo;s inefficient serialization interface.&lt;/p>
&lt;p>&lt;strong>References:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://arrow.apache.org/blog/2018/12/05/gandiva-donation/" target="_blank" rel="noopener">https://arrow.apache.org/blog/2018/12/05/gandiva-donation/&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.dremio.com/subsurface/increasing-performance-with-arrow-and-gandiva/" target="_blank" rel="noopener">https://www.dremio.com/subsurface/increasing-performance-with-arrow-and-gandiva/&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/apache/arrow/tree/master/cpp/src/gandiva" target="_blank" rel="noopener">https://github.com/apache/arrow/tree/master/cpp/src/gandiva&lt;/a>&lt;/li>
&lt;/ul>
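&lt;p>The compile-once, evaluate-many pattern behind Gandiva can be illustrated by analogy in plain Python; this stands in for Gandiva&amp;rsquo;s LLVM code generation and is not the Gandiva API:&lt;/p>

```python
def compile_filter(expr):
    """Compile a filter expression string into a reusable predicate.

    Gandiva compiles expressions to native code with LLVM and applies
    them to Arrow record batches; here Python's compile() stands in to
    show the compile-once, evaluate-many pattern.
    """
    code = compile(expr, "filter_expr", "eval")
    return lambda row: eval(code, {"__builtins__": {}}, row)

# Compile the expression once, then apply it to many rows.
pred = compile_filter("price > 10 and qty > 2")
rows = [{"price": 12, "qty": 3}, {"price": 8, "qty": 1}, {"price": 20, "qty": 9}]
matches = [r for r in rows if pred(r)]
```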
&lt;h3 id="add-ability-to-create-and-save-views-from-datasets">Add Ability to create and save views from Datasets&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Arrow&lt;/code>, &lt;code>Database views&lt;/code>, &lt;code>virtual datasets&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 175 hours&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:jayjeetc@ucsc.edu">Jayjeet Chakraborty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Problem:&lt;/strong> Workloads may repeat the same or similar queries over time. This causes repetition of I/O and compute operations, wasting resources. Saving previous computation in the form of materialized views can benefit future workload processing.&lt;/p>
&lt;p>&lt;strong>Solution:&lt;/strong> Add a method to the Dataset API to create views from queries and save each view as an object in a separate pool, with an object key generated from the query that created it.&lt;/p>
&lt;p>Reference:
&lt;a href="https://docs.dremio.com/working-with-datasets/virtual-datasets.html" target="_blank" rel="noopener">https://docs.dremio.com/working-with-datasets/virtual-datasets.html&lt;/a>&lt;/p>
&lt;hr>
&lt;h3 id="integrating-delta-lake-on-top-of-skyhookdm">Integrating Delta Lake on top of SkyhookDM&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>data lakes&lt;/code>, &lt;code>lake house&lt;/code>, &lt;code>distributed query processing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: 175 or 350 hours&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:jayjeetc@ucsc.edu">Jayjeet Chakraborty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;a href="https://delta.io/" target="_blank" rel="noopener">Delta Lake&lt;/a> is a new architecture for querying big data lakes through Spark, providing transactions.
An important benefit of this integration will be to provide an SQL interface for SkyhookDM functionality, through Spark SQL.
This project will further build upon our current work connecting Spark to SkyhookDM through the Arrow Dataset API.
This would allow us to run some of the TPC-DS queries (popular set of SQL queries for benchmarking databases) on SkyhookDM easily.&lt;/p>
&lt;p>Reference: &lt;a href="https://databricks.com/jp/wp-content/uploads/2020/08/p975-armbrust.pdf" target="_blank" rel="noopener">Delta Lake paper&lt;/a>&lt;/p></description></item><item><title>Efficient Communication with Key/Value Storage Devices</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore/</link><pubDate>Sun, 27 Feb 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/kvstore/</guid><description>&lt;p>Network key-value stores are used throughout the cloud as storage backends (e.g. AWS ShardStore) and are showing up in devices (e.g. NVMe KV SSDs). KV clients use traditional network sockets and POSIX APIs to communicate with the KV store. An advancement of the last two years is a new kernel interface that can be used in lieu of the POSIX API, namely &lt;code>io_uring&lt;/code>. This interface uses a set of shared-memory queues for kernel-to-user communication and permits zero-copy transfer of data. This scheme avoids the overhead of system calls and can improve performance.&lt;/p>
&lt;h3 id="implement-io_uring-communication-backend">Implement &lt;code>io_uring&lt;/code> communication backend&lt;/h3>
&lt;p>&lt;strong>Topics:&lt;/strong> performance, I/O, network, key-value, storage&lt;br>
&lt;strong>Difficulty:&lt;/strong> Medium&lt;br>
&lt;strong>Size:&lt;/strong> Medium or large (120 or 150 hours)&lt;br>
&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:philip.kufeldt@seagate.com">Philip Kufeldt (Seagate)&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/aldrin-montana/">Aldrin Montana&lt;/a> (UC Santa Cruz)&lt;br>
&lt;strong>Contributor(s):&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/manank-patel/">Manank Patel&lt;/a>&lt;/p>
&lt;p>Seagate has been using a network-based KV HDD as a research vehicle for computational storage. This research vehicle uses an open-source user library that implements a KV API by sending protobuf-based RPCs to a network KV store. Currently it uses the standard socket and POSIX APIs to communicate with the KV backend. This project would implement an &lt;code>io_uring&lt;/code> communication backend and compare the results of both implementations.&lt;/p></description></item><item><title>DirtViz 2.0 (2023)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</link><pubDate>Mon, 07 Feb 2022 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/dirtviz/</guid><description>&lt;p>DirtViz is a project to visualize data collected from sensors deployed in sensor networks. We have deployed a number of sensors measuring quantities like soil moisture, temperature, current, and voltage in outdoor settings. This project involves extending our existing visualization stack, DirtViz 1.0 (see GitHub), and expanding it to version 2.0. The project goal is to create a fully-fledged data-visualization tool tailored to the types of data collected from embedded-systems sensor networks.&lt;/p>
&lt;h3 id="visualize-sensor-data">Visualize Sensor Data&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Data Visualization, Analytics&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> javascript, python, bash, webservers, git, embedded systems&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy/Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large, 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/colleen-josephson/">Colleen Josephson&lt;/a>, &lt;a href="mailto:sonaderi@ucsc.edu">Sonia Naderi&lt;/a>, &lt;a href="mailto:sgtaylor@ucsc.edu">Stephen Taylor&lt;/a>, &lt;a href="mailto:jtmadden@ucsc.edu">John Madden&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Refine our web-based visualization tools to easily allow users to zoom in on date ranges, change axes, etc.&lt;/li>
&lt;li>Create a system for remote collaborators/citizen scientists to upload their own data in a secure manner&lt;/li>
&lt;li>Craft an intuitive navigation system so that data from deployment sites around the world can be easily viewed&lt;/li>
&lt;li>Document the tool thoroughly for future maintenance&lt;/li>
&lt;li>If interested, we are also open to you investigating correlations between different data streams and doing self-directed data analysis&lt;/li>
&lt;/ul></description></item></channel></rss>