AI | UCSC OSPO

NETAI: AI-Powered Network Anomaly Detection and Diagnostics Platform

Thu, 05 Feb 2026 00:00:00 +0000

NETAI (Network AI) is an AI-powered network anomaly detection and diagnostics platform for the National Research Platform (NRP). This project combines Kubernetes-native LLM integration, network performance monitoring, and predictive analytics to create an intelligent assistant for network operators. Students will work with cutting-edge technologies including Large Language Models (LLMs), Kubernetes, perfSONAR network measurements, time-series analysis, and containerized AI/ML workloads, while contributing to real-world applications in network operations and diagnostics.

The project involves developing a Kubernetes chatbot that leverages NRP’s managed LLM service (providing access to models like Qwen3-VL, GLM-4.7, and GPT-OSS) to help network operators understand complex network behaviors, diagnose anomalies, and receive natural language explanations of network issues. Students will integrate perfSONAR measurement data with traceroute path analysis to create an interactive network topology visualization, and develop AI/ML models for predictive network performance analysis using NRP’s GPU resources.

In addition, students will gain hands-on experience with fine-tuning LLMs on historical network diagnostics data, developing time-series forecasting models for network metrics, and implementing anomaly detection using deep learning techniques. The entire AI/ML pipeline will be containerized and deployed as Kubernetes workloads, utilizing GPU-enabled pods for model training and inference, ensuring scalability and seamless integration with existing NRP infrastructure.

The platform builds upon existing network diagnostics capabilities, combining end-to-end throughput measurements with detailed traceroute data to enable operators to visualize network paths, identify performance bottlenecks, and understand relationships between metrics and underlying infrastructure. The AI enhancement will provide predictive capabilities, automated incident reporting, and intelligent recommendations for network remediation strategies.

NETAI / LLM Integration & Kubernetes Chatbot

The proposed work includes developing a Kubernetes-native chatbot that integrates with NRP’s managed LLM service to provide intelligent network diagnostics assistance. Students will create a conversational interface that can answer questions about network performance, explain anomalies in natural language, and suggest remediation strategies. They will fine-tune LLMs on historical network diagnostics data, test results, and traceroute information to create domain-specific assistants. Students will implement RESTful APIs for chatbot interactions, develop prompt engineering strategies for network diagnostics, and create context-aware responses that incorporate real-time network telemetry. The chatbot will be deployed as Kubernetes services, utilizing GPU pods for inference and integrating with the existing diagnostics platform.

Topics: Large Language Models, Kubernetes, Chatbots, Natural Language Processing, Network Diagnostics, API Development
Skills: Python, Kubernetes, LLM APIs (Qwen3-VL, GLM-4.7, GPT-OSS), Prompt Engineering, REST APIs, Docker, GPU Computing
Difficulty: Hard
Size: Large (350 hours)
Mentors: Dmitry Mishin, Derek Weitzel

NETAI / Network Anomaly Detection Models

The proposed work includes developing deep learning models for network anomaly detection using historical perfSONAR and traceroute data. Students will create models that can identify slow links, high packet loss, excessive retransmits, and failed network tests automatically. They will implement anomaly detection algorithms using techniques such as autoencoders, LSTM networks, and transformer architectures. Students will train models on NRP’s GPU clusters using historical network telemetry stored in SQLite databases, develop feature engineering pipelines for network metrics, and create real-time inference services deployed as Kubernetes workloads. The models will be integrated into the diagnostics platform to provide automated anomaly detection alongside the interactive visualization.

Topics: Deep Learning, Anomaly Detection, Time-Series Analysis, Network Monitoring, Model Training, GPU Computing
Skills: Python, PyTorch/TensorFlow, scikit-learn, Pandas, NumPy, SQLite, Kubernetes, GPU Pods, MLOps
Difficulty: Hard
Size: Large (350 hours)
Mentors: Dmitry Mishin, Derek Weitzel

NETAI / Predictive Analytics & Forecasting

The proposed work includes developing predictive models that can forecast network performance degradation and identify patterns in network anomalies before they impact users. Students will create time-series forecasting models for network metrics such as throughput, latency, and packet loss, using techniques like ARIMA, Prophet, and deep learning-based forecasting. They will implement few-shot learning approaches to adapt models to new network topologies and measurement patterns, develop early warning systems for potential network issues, and create automated incident report generation using LLMs. Students will leverage NRP’s GPU resources for training forecasting models and deploy them as Kubernetes services for real-time predictions integrated with the diagnostics dashboard.

Topics: Time-Series Forecasting, Predictive Analytics, Machine Learning, Network Performance, Early Warning Systems, LLM Integration
Skills: Python, PyTorch/TensorFlow, Prophet, ARIMA, Pandas, NumPy, Time-Series Analysis, Kubernetes, GPU Computing
Difficulty: Hard
Size: Large (350 hours)
Mentors: Dmitry Mishin, Derek Weitzel

NETAI / Kubernetes Deployment & Infrastructure

The proposed work includes setting up Kubernetes-based infrastructure for deploying the entire NETAI platform, including LLM services, ML models, and the diagnostics dashboard. Students will create Helm charts for deploying containerized AI/ML workloads, configure GPU-enabled pods for model training and inference, and implement persistent storage solutions for maintaining historical network telemetry. They will develop GitLab CI/CD pipelines for automated testing and deployment, set up monitoring and observability using Prometheus and Grafana for tracking model performance and resource usage, and create scalable deployment strategies that leverage NRP’s distributed computing resources. Students will also integrate the platform with existing perfSONAR infrastructure and ensure seamless operation within the NRP cluster.

Topics: Kubernetes, DevOps, CI/CD, GPU Computing, Container Orchestration, Infrastructure as Code, Monitoring
Skills: Kubernetes, Helm, GitLab CI/CD, Prometheus, Grafana, Docker, GPU Pods, Persistent Storage, Infrastructure Automation
Difficulty: Medium to Hard
Size: Large (350 hours)
Mentors: Dmitry Mishin, Derek Weitzel

Project Resources

National Research Platform: https://nrp.ai/
NRP LLM Service: https://nrp.ai/documentation/userdocs/ai/llm-managed/
perfSONAR: https://www.perfsonar.net/
MaDDash: https://github.com/esnet/maddash
Network Monitoring Documentation: https://nrp.ai/documentation/

Background

This project addresses critical gaps in network performance monitoring for the National Research Platform by integrating AI/ML capabilities with existing perfSONAR-based diagnostics. The platform combines end-to-end network measurements with detailed path-level analysis, enhanced by intelligent AI assistants that can help operators understand complex network behaviors and predict potential issues. By leveraging NRP’s managed LLM service and GPU resources, students will create a Kubernetes-native system that scales across the distributed research network infrastructure, providing both real-time diagnostics and predictive analytics to improve network reliability and performance for researchers nationwide.

VINE: Precision Agriculture Data Platform & Digital Twin

Thu, 05 Feb 2026 00:00:00 +0000

VINE (Vineyard Intelligence Network & Environment) is an AI/ML research project focused on precision agriculture using the National Research Platform (NRP). This project leverages the innovative demonstration at Iron Horse Vineyards to study how AI and machine learning can optimize agricultural practices through data-driven insights. Students will work with cutting-edge AI/ML technologies, distributed computing on NRP, and large-scale data analysis, while contributing to real-world applications in sustainable agriculture and climate adaptation.

The project involves AI/ML research using agricultural data from Iron Horse Vineyards, leveraging the computational resources of the National Research Platform for training and deploying machine learning models. Students will work with agricultural datasets including sensor data, multi-spectral drone imagery, and historical records, developing models for predictive analytics, computer vision, and time-series forecasting. The integration of NRP’s distributed infrastructure enables scalable AI research that can process large volumes of sensor data, multi-spectral imagery, and historical agricultural records.

Students will gain hands-on experience with AI/ML model development for agricultural applications, learning how to analyze multi-spectral drone imagery, process time-series sensor data, and build predictive models for irrigation scheduling, pest detection, and harvest timing. They will deploy and train models on NRP’s Kubernetes clusters, utilize GPU resources for deep learning workloads, and work with agricultural datasets for comprehensive research. The project emphasizes using distributed computing on NRP to scale AI/ML experiments and create open, shareable datasets for collaborative research.

The platform builds upon the success demonstrated at Iron Horse Vineyards, where AI-driven analytics have shown potential for 10% water use reduction and improved yield optimization. This project aims to advance AI/ML research in precision agriculture by utilizing NRP’s computational capabilities, creating reproducible research that can benefit the broader agricultural and research communities.

VINE / Data Pipeline & Integration

The proposed work includes building data pipelines to ingest, process, and prepare agricultural data from Iron Horse Vineyards and other sources for AI/ML research. Students will develop pipelines to collect sensor data (soil moisture, temperature, CO2, weather), multi-spectral drone imagery, and historical agricultural records. They will create data validation and quality assurance processes, implement data preprocessing for ML model training, and develop data integration workflows that connect agricultural datasets with NRP computational resources. Students will also work on data sharing mechanisms to make processed datasets available for the research community.

Topics: Data Engineering, Time-Series Data, Data Preprocessing, Data Sharing, ML Data Pipelines
Skills: Python, Pandas, NumPy, Data Validation, REST APIs, Docker, Kubernetes, Data Processing
Difficulty: Medium to Hard
Size: Large (350 hours)
Mentors: Mohammad Firas Sada

VINE / AI/ML Models for Agricultural Analytics on NRP

The proposed work includes developing and training machine learning models for agricultural applications using the National Research Platform (NRP). Students will create models for predictive irrigation scheduling based on soil moisture, weather forecasts, and historical data. They will develop computer vision models for analyzing multi-spectral drone imagery to detect plant health, identify pests, and estimate yield. Students will also work on time-series forecasting models for predicting harvest timing and optimizing resource allocation. The project will involve training models on NRP’s GPU clusters, utilizing distributed training capabilities, and deploying models for real-time inference. Students will leverage agricultural datasets for training and validation, and contribute model outputs and insights for the research community.

Topics: Machine Learning, Computer Vision, Time-Series Analysis, Predictive Analytics, Agricultural AI, Distributed Training
Skills: Python, PyTorch/TensorFlow, scikit-learn, OpenCV, Pandas, NumPy, MLOps, NRP Kubernetes, GPU Computing
Difficulty: Hard
Size: Large (350 hours)
Mentors: Mohammad Firas Sada

VINE / Digital Twin & AI-Driven Visualization

The proposed work includes creating AI-enhanced digital twin systems for agricultural sites using computational resources on NRP. Students will develop 3D visualization systems (potentially using Omniverse or similar platforms) to represent vineyards and farms, integrate AI model predictions into the digital twin for real-time insights, and create interactive dashboards for monitoring and analysis. They will implement spatial data processing using ML models to map sensor locations and readings to geographic coordinates, and develop AI-driven simulation capabilities for testing different agricultural strategies (irrigation patterns, planting layouts, etc.) before implementation. Students will deploy visualization services on NRP infrastructure and integrate with agricultural data sources for real-time updates.

Topics: Digital Twin, AI-Enhanced Visualization, GIS, Spatial Data, ML-Driven Simulation, Real-Time Systems
Skills: Python, 3D Graphics (Omniverse/Unity/Blender), GIS tools, WebGL, React/Three.js, ML Integration, NRP Deployment
Difficulty: Hard
Size: Large (350 hours)
Mentors: Mohammad Firas Sada

VINE / Web Dashboard & NRP Integration Platform

The proposed work includes building a comprehensive web dashboard for visualizing agricultural data, AI model predictions, and research insights. Students will develop a full-stack web application using modern frameworks (React, Flask/FastAPI) deployed on the National Research Platform (NRP). The dashboard will display real-time sensor readings, historical trends from agricultural datasets, AI model predictions, and digital twin visualizations. Students will create API endpoints that integrate with NRP computational resources and agricultural data sources, implement role-based access control for researchers, and enable data export/sharing with the broader research community. The platform will support interactive data exploration tools and provide programmatic access to AI/ML models running on NRP.

Topics: Full-Stack Web Development, Data Visualization, API Development, NRP Deployment, ML Model Serving
Skills: React, Flask/FastAPI, PostgreSQL, D3.js/Plotly, Bootstrap/Tailwind CSS, REST APIs, Kubernetes, NRP APIs
Difficulty: Medium to Hard
Size: Large (350 hours)
Mentors: Mohammad Firas Sada

Project Resources

National Research Platform: https://nrp.ai/
Iron Horse Vineyards Project: https://gitlab.nrp-nautilus.io/ihv
Omniverse Integration: https://gitlab.nrp-nautilus.io/omniverse
CENIC Network: https://cenic.org/
CENIC Precision Agriculture Blog: https://nrp.ai/cenic-precision-agriculture-2025

Background

This project builds upon the successful demonstration at Iron Horse Vineyards, where CENIC, UC San Diego, and partners have created a living laboratory for precision agriculture. The VINE project focuses on AI/ML research using the National Research Platform (NRP) for computational resources. By leveraging NRP’s distributed infrastructure and GPU clusters, students can train and deploy sophisticated ML models for agricultural applications. The project works with agricultural datasets from Iron Horse Vineyards and aims to create open, shareable datasets for the research community. This approach creates a scalable, reproducible framework for AI/ML research in precision agriculture that can benefit researchers, educators, and practitioners nationwide.

AI Data Readiness Inspector (AIDRIN)

Fri, 30 Jan 2026 10:15:00 -0700

Garbage In, Garbage Out (GIGO) is a widely accepted quote in computer science across various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.

AIDRIN (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of data readiness for AI processes, covering a broad range of dimensions from the literature. AIDRIN uses metrics from traditional data quality assessment, such as completeness, outliers, and duplicates, to evaluate data. Furthermore, AIDRIN uses metrics specific to assessing AI data, such as feature importance, feature correlations, class imbalance, fairness, privacy, and compliance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles. AIDRIN provides visualizations and reports to assist data scientists in further investigating data readiness.

AIDRIN Multiple File Formats

The proposed work will include improvements in the AIDRIN framework to (1) add support for new file formats such as Zarr, ROOT, and HDF5; and (2) to allow providing custom data ingestion mechanisms.

Topics: data readiness, AI, data analysis
Skills: Python, C/C++, data analysis, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Scenic: A Language for Design and Verification of Autonomous Cyber-Physical Systems

Sat, 24 Jan 2026 00:00:00 +0000

Scenic is a probabilistic programming language for the design and verification of autonomous cyber-physical systems like self-driving cars. Scenic allows users to define scenarios for testing or training their system by putting a probability distribution on the system’s environment: the positions, orientations, and other properties of objects and agents, as well as their behaviors over time. Sampling these scenarios and running them in a simulator yields synthetic data which can be used to train or test a system. Since Scenic was released open-source in 2019, our group and many others in academia have used Scenic to find, diagnose, and fix bugs in autonomous cars, aircraft, robots, and other kinds of systems. In industry, it is being used by companies including Boeing, Meta, Deutsche Bahn, and Toyota in domains spanning autonomous driving, aviation, household robotics, railways, maritime, and virtual reality.

Our long-term goal is for Scenic to become a widely-used common representation and toolkit supporting the entire design lifecycle of AI-based cyber-physical systems. Towards this end, we have many summer projects available, ranging from adding new application domains to working on the Scenic compiler and sampler:

Extensions to the Scenic driving domain
Interfacing Scenic to new simulators
Scenic distribution visualizer

See the sections below for details.

Extensions to the Scenic Driving Domain

Topics: Autonomous Driving 3D modeling
Skills: Python; basic vector geometry
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

There are several potential goals of this project, including:

Supporting importing complex object information from simulators like CARLA.
Extending the domain to incorporate additional metadata, such as highway entrances and exits.
Fixing various bugs and limitations that exist in the driving domain (e.g. Issue #274 and Issue #295).

Interfacing Scenic to New Simulators

Topics: Simulation Autonomous Driving
Skills: Python
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

Scenic is designed to be easily-interfaced to new simulators. Depending on student interest, we could pick a simulator which would open up new kinds of applications for Scenic and write an interface for it. Some possibilities include:

The AWSIM driving simulator (to allow testing the Autoware open-source autonomous driving software stack)
The CarMaker driving simulator

The goal of the project would be to create an interface between Scenic and the new simulator and write scenarios demonstrating it. If time allows, we could do a case study on a realistic system for publication at an academic conference.

Tool to Visualize Scenario Distributions

Topics: Visualization
Skills: Python; basic visualization and graphics
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

A Scenic scenario represents a distribution over scenes, but it can be difficult to interpret what exactly this distribution represents. Being able to visualize this distribution would be helpful for understanding and reasoning about Scenarios.

The goal of this project would be to build on an existing prototype for visualizing these distributions, and to create a tool that can be used by the wider Scenic community.

MedJEPA: Self-Supervised Medical Image Representation Learning with JEPA

Mon, 19 Jan 2026 10:15:56 -0700

Project Description

[MedJEPA] Medical image analysis is fundamental to modern healthcare, enabling disease diagnosis, treatment planning, and patient monitoring across diverse clinical applications. In radiology and pathology, deep learning models support automated detection of abnormalities, tumor segmentation, and diagnostic assistance. Medical imaging modalities including X-rays, CT scans, MRI, ultrasound, and histopathology slides generate vast amounts of unlabeled data that could benefit from self-supervised representation learning. Clinical applications include cancer detection and staging, cardiovascular disease assessment, neurological disorder diagnosis, and infectious disease screening. In drug discovery and clinical research, analyzing medical images helps evaluate treatment efficacy, predict patient outcomes, and identify biomarkers for disease progression. Telemedicine and point-of-care diagnostics benefit from AI-powered image analysis that extends expert-level interpretation to underserved regions. However, medical imaging faces unique challenges: limited labeled datasets due to expensive expert annotation, patient privacy concerns restricting data sharing, domain shift across different imaging equipment and protocols, and the need for models that generalize across hospitals and populations. Traditional medical image analysis relies heavily on supervised learning with manually annotated labels, creating bottlenecks due to the scarcity and cost of expert annotations. Existing self-supervised methods applied to medical imaging often employ complex training procedures with numerous heuristics—momentum encoders, stop-gradients, teacher-student architectures, and carefully tuned augmentation strategies—that may not translate well across different medical imaging modalities and clinical contexts. These approaches struggle with domain-specific challenges such as subtle pathological features, high-resolution images, 3D volumetric data, and the need for interpretable representations that clinicians can trust. To address these challenges, we propose MedicalJEPA: Self-Supervised Medical Image Representation Learning with Joint-Embedding Predictive Architecture, which leverages the theoretically grounded LeJEPA framework for 2D medical images and V-JEPA principles for medical video and volumetric data, creating a unified, scalable, and heuristics-free approach specifically tailored for medical imaging applications. By utilizing the principled JEPA frameworks with objectives like Sketched Isotropic Gaussian Regularization (SIGReg), MedJEPA eliminates complex training heuristics while learning clinically meaningful representations from unlabeled medical images. Unlike conventional self-supervised methods that require extensive hyperparameter tuning and may not generalize across medical imaging modalities, MedicalJEPA provides a clean, theoretically motivated framework with minimal hyperparameters that adapts to diverse medical imaging contexts—from chest X-rays to histopathology slides to cardiac MRI sequences. The learned representations can support downstream tasks including disease classification, lesion detection, organ segmentation, and survival prediction, while requiring significantly fewer labeled examples for fine-tuning. This approach democratizes access to state-of-the-art medical AI by enabling effective learning from the vast amounts of unlabeled medical imaging data available in hospital archives, addressing the annotation bottleneck that has limited progress in medical AI.

Project Objectives

Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to apply Joint-Embedding Predictive Architecture (JEPA) frameworks to medical image representation learning, addressing the critical challenge of learning from limited labeled medical data. Medical imaging generates enormous amounts of unlabeled data, but supervised learning approaches are bottlenecked by the scarcity and cost of expert annotations. Existing self-supervised methods often rely on complex heuristics that don’t generalize well across diverse medical imaging modalities, equipment vendors, and clinical protocols. This project will leverage the theoretically grounded LeJEPA framework for 2D medical images (X-rays, histopathology slides, fundus images) and V-JEPA principles for temporal and volumetric medical data (cardiac MRI sequences, CT scans, surgical videos). The core challenge lies in adapting these heuristics-free, stable frameworks to medical imaging’s unique characteristics: subtle pathological features requiring fine-grained representations, high-resolution images demanding efficient processing, domain shift across hospitals and equipment, and the need for interpretable features that support clinical decision-making. The learned representations will be evaluated on diverse downstream clinical tasks including disease classification, lesion detection, organ segmentation, and prognosis prediction, with emphasis on few-shot learning scenarios that reflect real-world annotation constraints. Below is an outline of the methodologies and models that will be developed in this project.

Step 1: Medical Data Preparation: Develop data processing pipelines for diverse medical imaging modalities, implementing DICOM/NIfTI parsing, standardized preprocessing, and efficient data loading for self-supervised pre-training. Prepare 2D medical image datasets: Chest X-rays: ChestX-ray14, MIMIC-CXR, CheXpert for lung disease detection Histopathology: Camelyon16/17 (breast cancer), PCam (patch-level classification) Retinal imaging: EyePACS, APTOS (diabetic retinopathy), Messidor Dermatology: HAM10000, ISIC (skin lesion classification) Prepare 3D volumetric and temporal medical data: CT scans: LIDC-IDRI (lung nodules), Medical Segmentation Decathlon datasets MRI sequences: BraTS (brain tumors), ACDC (cardiac MRI), UK Biobank cardiac videos Medical video: Surgical procedure videos, endoscopy recordings, ultrasound sequences Implement medical imaging-specific preprocessing: intensity normalization, resolution standardization, handling of multi-channel medical images (different MRI sequences, RGB histopathology), and privacy-preserving anonymization. Design masking strategies appropriate for medical imaging: spatial masking for 2D images, volumetric masking for 3D scans, temporal masking for sequences, and anatomy-aware masking that respects organ boundaries. Create data loaders supporting high-resolution medical images, 3D volumes, and multi-modal inputs (e.g., multiple MRI sequences).
Step 2: JEPA Model Implementation for Medical Imaging: Implement LeJEPA for 2D medical images: Adapt joint-embedding predictive architecture for medical image characteristics (high resolution, subtle features, domain-specific patterns) Apply Sketched Isotropic Gaussian Regularization (SIGReg) to learn clinically meaningful embedding distributions Maintain single trade-off hyperparameter and heuristics-free training for reproducibility across medical imaging centers Support various encoder architectures: Vision Transformers for global context, ConvNets for local features, hybrid approaches Extend to V-JEPA for medical video and volumetric data: Spatiotemporal encoding for cardiac MRI sequences, surgical videos, and time-series medical imaging Temporal prediction objectives for understanding disease progression and treatment response 3D volume processing for CT and MRI scans with efficient memory management Multi-slice and multi-sequence learning for comprehensive medical imaging contexts Develop medical domain-specific enhancements: Multi-scale representation learning to capture both fine-grained pathological details and global anatomical context Interpretability mechanisms: attention visualization, feature attribution, and embedding space analysis for clinical validation Robustness to domain shift: training strategies that generalize across different scanners, protocols, and institutions Privacy-preserving training considerations compatible with medical data regulations (HIPAA, GDPR) Implement efficient training infrastructure: Support for distributed training across multiple GPUs for large medical imaging datasets Memory-efficient processing of high-resolution images and 3D volumes Checkpoint management and model versioning for clinical deployment pipelines Minimal-code implementation (≈50-100 lines) demonstrating framework simplicity
Step 3: Evaluation & Safety Validation: : Disease Classification Tasks: Multi-label chest X-ray classification: 14 pathology classes on ChestX-ray14, MIMIC-CXR Diabetic retinopathy grading: 5-class classification on EyePACS, APTOS Skin lesion classification: 7-class classification on HAM10000 Brain tumor classification: glioma grading on BraTS dataset Evaluate with linear probing, few-shot learning (5-shot, 10-shot), and full fine-tuning Lesion Detection and Segmentation: Lung nodule detection on LIDC-IDRI dataset Tumor segmentation on Medical Segmentation Decathlon tasks Polyp detection in colonoscopy videos Cardiac structure segmentation in MRI sequences Clinical Prediction Tasks: Survival prediction from histopathology slides Disease progression prediction from longitudinal imaging Treatment response assessment from pre/post imaging pairs Few-Shot and Low-Data Regime Evaluation: Systematic evaluation with 1%, 5%, 10%, 25%, 50% of labeled training data Comparison against supervised baselines and ImageNet pre-training Analysis of annotation efficiency: performance vs. number of labeled examples required

Project Deliverables

This project will deliver three components: software implementation, clinical evaluation, and practical deployment resources. The software implementing MedicalJEPA will be hosted on GitHub as an open-access repository with modular code supporting multiple medical imaging modalities (2D images, 3D volumes, videos), pre-trained model checkpoints on major medical imaging datasets (chest X-rays, histopathology, MRI), training and evaluation scripts with medical imaging-specific preprocessing pipelines, privacy-preserving training implementations compatible with clinical data regulations, and comprehensive documentation including tutorials for medical AI researchers and clinicians. The evaluation results will include benchmarks on 10+ medical imaging datasets across diverse modalities and clinical tasks, few-shot learning analysis demonstrating annotation efficiency gains, cross-institutional validation studies showing robustness to domain shift, interpretability visualizations enabling clinical validation of learned representations, and detailed comparisons against supervised baselines and existing medical self-supervised methods. .

NeuroHealth

Topics: Self-Supervised Medical Image Representation Learning with JEPA
Skills: Proficiency in Python, Pytorch, Github, JEPA
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Bin Dong, Linsey Pang

References:

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics - Randall Balestriero and Yann LeCun, arXiv 2024
Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) - Adrien Bardes et al., arXiv 2024
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture - Mahmoud Assran et al., CVPR 2023 (I-JEPA)
ChestX-ray14: Hospital-Scale Chest X-Ray Database - https://nihcc.app.box.com/v/ChestXray-NIHCC
Medical Segmentation Decathlon - http://medicaldecathlon.com/
MIMIC-CXR Database - https://physionet.org/content/mimic-cxr/
The Cancer Imaging Archive (TCIA) - https://www.cancerimagingarchive.net/
UK Biobank Imaging Study - https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/imaging-data

NeuroHealth: AI-Powered Health Assistant

Mon, 19 Jan 2026 10:15:56 -0700

Project Description

[NeuroHealth] Intelligent health assistance systems are increasingly essential for improving healthcare accessibility, patient engagement, and clinical decision support. In primary care and preventive medicine, AI assistants help users understand symptoms, schedule appropriate appointments, and receive preliminary health guidance. Telemedicine applications include triage support, appointment scheduling optimization, and patient education based on health inquiries. In chronic disease management, these systems provide medication reminders, lifestyle recommendations, and timely alerts for medical follow-ups. Healthcare navigation applications include finding appropriate specialists, understanding treatment options, and coordinating care across multiple providers. In wellness and preventive care, intelligent assistants enhance health literacy by delivering personalized health information, screening recommendations, and proactive health management strategies. By leveraging natural language understanding and medical knowledge integration, these systems enhance healthcare access, reduce unnecessary emergency visits, and empower users to make informed health decisions across diverse populations. Traditional health information systems often provide generic responses that fail to account for individual health contexts, medical history, and personal circumstances. Existing symptom checkers and health chatbots primarily rely on rule-based logic or simple decision trees, limiting their ability to understand nuanced health inquiries, reason about complex symptom patterns, or provide contextually appropriate guidance. These systems struggle with interpreting ambiguous descriptions, adapting to users’ health literacy levels, and generating personalized recommendations that account for individual medical constraints and preferences. To address these challenges, we propose NeuroHealth: AI-Powered Health Assistant, which leverages Large Language Models (LLMs) to create an intelligent conversational agent that synthesizes user health inquiries, symptom descriptions, and contextual information into actionable, personalized health guidance and appointment recommendations. By integrating LLM-based medical reasoning with structured clinical knowledge bases, NeuroHealth enhances symptom interpretation, appointment routing, and health education delivery. Unlike conventional systems that provide static responses from predetermined templates, NeuroHealth dynamically understands user intent, asks clarifying questions, assesses urgency levels, and generates appropriate recommendations—whether scheduling a doctor appointment, suggesting self-care measures, or directing users to emergency services. This fusion of LLM intelligence with validated medical knowledge enables a more accessible, adaptive, and helpful health assistance platform, bridging the gap between users seeking health information and appropriate medical care.

Project Objectives

Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to develop an AI-Powered Health Assistant (NeuroHealth) to improve healthcare accessibility and patient engagement through intelligent conversational guidance. Healthcare systems face significant challenges in providing timely, personalized health information and connecting patients with appropriate care resources. Traditional symptom checkers and health information systems often deliver generic, rule-based responses that fail to account for individual contexts and struggle with natural language understanding. To address these limitations, this project will leverage Large Language Models (LLMs) to create an intelligent health assistant that understands user health inquiries, interprets symptom descriptions, assesses urgency, and provides personalized recommendations including doctor appointment suggestions, self-care guidance, and healthcare navigation support. The core challenge lies in designing NeuroHealth as a safe, accurate, and user-friendly system capable of natural conversation, medical knowledge retrieval, and appropriate response generation while maintaining clinical safety guardrails. Unlike conventional health chatbots that follow rigid conversation flows, NeuroHealth will reason over user inputs, ask clarifying questions, and dynamically adapt responses based on context, resulting in more helpful, accurate, and appropriate health assistance. Below is an outline of the methodologies and models that will be developed in this project.

Step 1: Data Collection & Knowledge Base Construction: Develop a comprehensive medical knowledge base integrating validated health information sources, symptom databases, condition descriptions, and appointment routing guidelines. Collect and curate conversational health inquiry datasets from public medical Q&A forums, symptom checker logs, and healthcare chatbot interactions to create training and evaluation data. Design structured representations for symptoms, conditions, urgency levels, and appointment recommendations to enable effective retrieval and reasoning. Extract common health inquiry patterns, symptom descriptions, and user intent categories to inform conversation flow design. Data sources can include public medical knowledge bases such as MedlinePlus, Mayo Clinic health information, clinical practice guidelines, and synthetic patient inquiry scenarios based on common healthcare use cases. Implement data validation mechanisms to ensure medical accuracy and clinical safety compliance.
Step 2: Model Development: Design and implement an LLM-based conversational health assistant that integrates medical knowledge retrieval with natural language understanding and generation. Develop a Retrieval-Augmented Generation (RAG) architecture that grounds LLM responses in validated medical information sources, reducing hallucination risks and ensuring factual accuracy. Create prompt engineering strategies and reasoning frameworks that enable the system to: interpret symptom descriptions, assess urgency levels, ask appropriate clarifying questions, and generate personalized health guidance. Implement a multi-component architecture including: intent recognition, symptom extraction, urgency assessment, appointment recommendation generation, and response formatting modules. Develop clinical safety guardrails that detect high-risk scenarios requiring immediate medical attention and provide appropriate emergency guidance. Design conversation management strategies that maintain context across multi-turn dialogues and adapt to users’ health literacy levels. The baseline architecture can leverage state-of-the-art models such as GPT-4, Claude, or open-source alternatives like Llama, Qwen, combined with medical knowledge retrieval systems.
Step 3: Evaluation & Safety Validation: : Benchmark NeuroHealth against existing symptom checkers and health chatbots, evaluating on metrics including response accuracy, appropriateness of appointment recommendations, urgency assessment precision, and user satisfaction. Conduct human evaluation studies with healthcare professionals to assess clinical safety, response quality, and appropriateness of medical guidance. Perform adversarial testing to identify potential failure modes, unsafe responses, or inappropriate recommendations under edge cases. Conduct ablation studies to analyze the impact of retrieval-augmented generation, safety guardrails, and conversation management strategies on system performance. Evaluate system performance across diverse health inquiry types including acute symptoms, chronic condition management, preventive care questions, and healthcare navigation requests. Assess response quality across different user demographics and health literacy levels to ensure equitable access. Optimize inference efficiency and response latency for real-time conversational interaction across web and mobile platforms.

Project Deliverables

This project will deliver three components: model development, evaluation and validation, and interactive demonstration. The software implementing the NeuroHealth system will be hosted on GitHub as an open-access repository with comprehensive documentation, deployment guides, and API specifications. The evaluation results, including benchmark comparisons against existing systems, clinical safety assessments, and user study findings, will be published alongside the GitHub repository. An interactive demo showcasing the conversational interface, symptom interpretation capabilities, and appointment recommendation generation will be provided to illustrate real-world application scenarios.

NeuroHealth

Topics: AI-Powered Health Assistant
Skills: Proficiency in Python, Github, LLM
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Linsey Pang, Bin Dong

References:

Large Language Models in Healthcare - Singhal et al., Nature 2023
Med-PaLM: Large Language Models for Medical Question Answering - Singhal et al., arXiv 2022
Capabilities of GPT-4 on Medical Challenge Problems - Nori et al., arXiv 2023
MedlinePlus Medical Encyclopedia - https://medlineplus.gov/
Clinical Practice Guidelines Database - https://www.guidelines.gov/

Lynx Grader

Tue, 13 Jan 2026 13:00:00 -0800

The EduLinq Lynx Grader (also referred to as “autograder”) is an open source tool used by several courses at UCSC to safely and quickly grade programming assignments. Grading student code is something that may seem simple at first (you just need to run their code!), but quickly becomes exceeding complex as you get more into the details. Specifically, grading a student’s code securely while providing the “last mile” service of getting code from students and sending results to instructors/TAs and the course’s LMS (e.g., Canvas) can be very difficult. The Lynx Grader provides all of this in a free and open source project. The LINQS Lab has made many contributions to the maintain and improve the Lynx Grader.

As an open source project, there are endless opportunities for development, improvements, and collaboration. Here, we highlight some specific projects that will work well in the summer mentorship setting.

All students interested in LINQS projects for OSRE/GSoC 2026 should fill out this form. Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project. The form will stop accepting responses once the application window closes. Do not post on any of the project repositories about OSRE/GSoC (e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2026). Remember, these are active repositories that were not created for OSRE/GSoC.

LLM Detection

Topics: AI/ML LLM Research Backend
Skills: software development, backend, systems, data munging, go, docker
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

As Large Language Model (LLM) tools like ChatGPT become more common and powerful, instructors need tools to help determine if students are the actual authors of the code they submit. More classical instances of plagiarism are often discovered by code similarity tools like MOSS. However these tools are not sufficient for detecting code written not by a student, but by an AI model like ChatGPT or GitHub Copilot.

The task for this project is to create a system that provides a score indicating the system’s confidence that a given piece of code was written by an AI tool and not a student. This will supplement the existing code analysis tools in the Lynx Grader. There are many approaches to completing this task that will be considered. A more software development approach can consist of levering exiting systems to create a production-ready system, whereas a more research approach can consist of creating a novel approach complete with a paper and experiments.

There has been previous work on this issue, where a student did a survey of existing solutions, collection of initial datasets, and exploratory experiments on possible directions. This project would build off of this previous work.

See Also:

Code Analysis GUI

Topics: Frontend
Skills: software development, frontend, data munging, js, css, go
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Lynx Grader has existing functionality to analyze the code in a student’s submission for malicious content. Relevant to this project is that the Lynx Grader can run a pairwise similarity analysis against all submitted code. This is how most existing software plagiarism systems detect offending code. The existing infrastructure provides detailed statistics on code similarity, but does not currently have a visual way to display this data.

The task for this project is to create a web GUI using the Lynx Grader REST API to display the results of a code analysis. The size of this project depends on how many of the existing features are going to be supported by the web GUI.

See Also:

Web GUI

Topics: Frontend
Skills: software development, frontend, js, css
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Lynx Grader contains dozens of API endpoints, most directly representing a piece of functionality exposed to the user. All of these features are exposed in the Lynx Grader’s Python Interface. However, the Python interface is a purely command-line interface. And although command-line interface are objectively (read: subjectively) the best, a web GUI would be more accessible to a wider audience. The autograder already has a web GUI, but it does not cover all the features available in the Lynx Grader.

The task for this project is to augment the Lynx Grader’s web GUI with more features. Specifically, add support for more tools used to create and administer courses.

See Also:

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Tue, 22 Jul 2025 10:15:56 -0700

Midway Through OSRE

My Journey with LLMSeqRec

Hello from the Midpoint!

Hi everyone! I’m Connor Lee, a student at NYU studying Computer Science and Mathematics, and I’m excited to share the progress I’ve made halfway through the Open Source Research Experience (OSRE) with my project: LLMSeqRec – a large language model-enhanced sequential recommender system.

Over the past several weeks, I’ve had the opportunity to explore the intersection of recommender systems and large language models (LLMs), and it’s been a deep, challenging, and rewarding dive into building smarter, more contextual recommendation engines.

What is LLMSeqRec?

LLMSeqRec stands for LLM-Enhanced Contextual Sequential Recommender. Traditional sequential recommendation systems like SASRec are great at capturing patterns from user-item interactions, but they often fall short in two areas: understanding semantic context (e.g., item descriptions, reviews) and dealing with cold-start problems.

LLMSeqRec aims to address this by incorporating pretrained LLM embeddings into the recommendation pipeline. The goal is to enhance models like SASRec with semantic signals from text (like product reviews or titles), allowing them to better model user intent, long-range dependencies, and generalize to new items or users.

Progress So Far

✅ Baseline SASRec Runs

To establish a benchmark, I successfully ran the original SASRec implementation (in PyTorch) using both the MovieLens 1M and Amazon Beauty datasets. After debugging initial data formatting issues and adjusting batch sizes for local CPU/GPU compatibility, I automated training with scripts that let me scale to 200+ epochs to acheive the best performance in both Colab and on my MacBook via CPU.

Note: At this stage, we have not yet integrated LLMs into the model. These baseline runs (SASRec) serve as the control group for evaluating the future impact of LLM-based enhancements.

What’s Next

As I enter the second half of the OSRE, I’ll be shifting gears toward LLM integration, model evaluation, and running LLM-powered sequential recommendations using product metadata and contextual information. Here’s what’s ahead:

Designing pipelines to extract and align textual metadata with item sequences
Integrating LLM-generated embeddings into the recommender model
Evaluating performance changes across different dataset characteristics

📊 Experimental Results

We have not yet utilized LLMs in our current experiments. The results below reflect our reproduced baseline performance of SASRec across datasets.

Below are the performance curves on different test sets, where we evaluate model performance every 20 epochs during training:

Beauty Dataset Performance

Hit@10 performance on the test set for the Beauty dataset (every 20 epochs)

Training loss for the Beauty dataset

NDCG@10 performance on the test set for the Beauty dataset (every 20 epochs)

ML-1M Dataset Performance

Training loss for the ML-1M dataset

Hit@10 performance on the test set for the ML-1M dataset (every 20 epochs)

NDCG@10 performance on the test set for the ML-1M dataset (every 20 epochs)

These results demonstrate that our baseline SASRec reproductions are converging as expected and will serve as a solid foundation for comparison once LLM integration is complete.

Closing Thoughts

This project has been an exciting journey into both research and engineering and I’m excited to explore LLM-powered embedding integration in the upcoming phase.

I’m incredibly grateful to my mentors Dr. Linsey Pang and Dr. Bin Dong for their support and guidance throughout the project so far. I’m looking forward to sharing more technical results as we work toward building smarter, more adaptable recommender systems.

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Fri, 06 Jun 2025 10:15:56 -0700

Project Description

Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains. Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning. By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.

Project Objectives

Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on vanilla sequential Id orders. To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:

Step 1: Data Preprocessing & Feature Creation: Develop a data processing pipeline to parse user’s sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling; Extract user behavior sequences, items’ metadata, and temporal patterns to create context-aware sequential representations for training, validation and testing; The data source can be from Amazon open public data or Movie Lense data set. The data points creation can follow SASRec (in the reference 1).
Step 2: Model Development: Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context; Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions, reviews into the sequential recommendation process; The baseline model can be SASRec pytorch implementation.
Step 3: Evaluation: : Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating on accuracy, NDCG and cold-start performance; Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality; Optimize model inference speed and efficiency for real-time recommendation scenarios.

Project Deliverables

This project will deliver three components, software, model training, validation and performance evaluation and demo. The software which implements the above LLMSeqRec model will be hosted on the github repo as open-access repositories. The evaluation results and demo will be published along the github repo .

LLMSeqRec

Topics: LLM Enhanced Contextual Sequential Recommender
Skills: Proficiency in Python, Pytorch, Github, Self-attention, Transformer
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Linsey Pang, Bin Dong

References:

Self-Attentive Sequential Recommendation (SASRec)
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Amazon Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews
Movie Lense Data: https://grouplens.org/datasets/movielens/

Introduction

I’m Connor, a student at NYU studying CS and Math. This summer I’ve gotten the opportunity to work on LLMSeqRec under Dr. Bin Dong and Dr. Linsey Pang.

In today’s digital age, sequential recommender systems power everything from e-commerce suggestions to personalized content everywhere. However, traditional models fall short in capturing user intent, adapting to dynamic behavior, or tackling cold-start problems. That’s where LLMSeqRec comes in.

Problem Statement

Most sequential recommender systems rely heavily on historical user-item interactions and predefined embeddings. This approach limits their ability to understand nuanced user preferences, struggles to scale across domains, and performs poorly in scenarios like new users or sparse data. The absence of semantic and contextual modeling is a major gap in current solutions.

Overview of project

LLMSeqRec is a novel, LLM-enhanced sequential recommender framework that bridges this gap. By leveraging large language models (LLMs), it incorporates semantic embeddings and prompt-based contextual modeling to understand both user behavior and item metadata at a deeper level. The system explores two core approaches:

Embedding-based: LLMs generate embeddings from item attributes.
Prompt-based: LLMs receive full transaction history in natural language format and infer recommendations.

These techniques are tested using well-known datasets (e.g., Amazon, MovieLens), and evaluated with ranking metrics like NDCG@10 and Hit@10. The goal: deliver more accurate, context-rich, and explainable recommendations.

Next Steps

The project is currently progressing through stages including model training, embedding integration, and evaluation. Upcoming tasks include:

Fine-tuning enhanced models
Designing zero-/few-shot prompts
Running comparative experiments
Publishing findings and writing technical blogs

As part of the LLMSeqRec my proposal under the mentorship of Dr. Bin Dong and Dr. Linsey Pang.

Enhancing Reproducibility in Distributed AI Training: Leveraging Checkpointing and Metadata Analytics

Fri, 21 Feb 2025 09:00:00 -0700

Reproducibility in distributed AI training is a crucial challenge due to several sources of uncertainty, including stragglers, data variability, and inherent randomness. Stragglers—slower processing nodes in a distributed system—can introduce timing discrepancies that affect the synchronization of model updates, leading to inconsistent states across training runs. Data variability, stemming from non-deterministic data shuffling and differing data partitions across nodes, can also lead to variations in model performance. Additionally, inherent randomness in algorithm initialization, such as random weight beginnings and stochastic processes like dropout, further compounds these challenges. Reproducibility in AI is pivotal for ensuring the credibility of AI-driven scientific findings, akin to how reproducibility underpins traditional scientific research.

To enhance AI reproducibility, leveraging metadata analytics and visualization along with saved checkpoints offers a promising solution. Checkpointing in AI training is a pivotal technique that involves saving snapshots of a model and its parameters at regular intervals throughout the training process. This practice is essential for maintaining progress in the face of potential interruptions, such as hardware failures, and enables the resumption of training without having to restart from scratch. In the context of distributed AI training, checkpointing also provides a framework for analyzing and ensuring reproducibility, offering a means to systematically capture and review the training trajectory of models. Analyzing checkpoints can specifically help identify issues like stragglers, which are slower computing nodes in a distributed system that can impede synchronized progress. For example, by examining the time stamps and resource utilization data associated with each checkpoint, anomalies in processing time can be detected, revealing nodes that consistently lag behind others. This analysis enables teams to diagnose performance bottlenecks and optimize resource allocation across the distributed system, ensuring smoother and more consistent training runs. By combining checkpointing with metadata analytics, it becomes possible to pinpoint the exact training iterations where delays occur, thereby facilitating targeted investigations and solutions to improve overall system reproducibility and efficiency.

Workplan

The proposed work will include: 1) Setting up a checkpointing system within the distributed AI training framework to periodically save model states and metadata; 2) Designing a metadata analysis schema for populating model and system statistics from the saved checkpoints; 3) Conducting exploratory data analysis to identify patterns, anomalies, and sources of variability in the training process; 4) Creating visualization tools to represent metadata insights with collected statistics and patterns; 5) Using insights from metadata analytics and visualization to optimize resource distribution across the distributed system and mitigate straggler effects; and 6) Disseminating results and methodologies through academic papers, workshops, and open-source contributions.

Topics: Reproducibility AI distributed AI checkpoint metadata analysis
Skills: C/C++, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Luanzheng "Lenny" Guo

Enhancing Reproducibility in RAG Frameworks for Scientific Workflows

Thu, 20 Feb 2025 09:00:00 -0700

Retrieval-Augmented Generation (RAG) frameworks, which merge the capabilities of retrieval systems and generative models, significantly enhance the relevance and accuracy of responses produced by large language models (LLMs). These frameworks retrieve relevant documents from a large corpus and use these documents to inform the generative process, thereby improving the contextuality and precision of the generated content. Ensuring reproducibility in data queries using similarity search within these RAG frameworks is critical for maintaining the reliability and consistency of scientific workflows. Reproducibility ensures that the same input query consistently yields the same output, which is vital for scientific tasks that rely on precise and repeatable results. Inconsistencies can arise from various sources, affecting the trustworthiness of scientific outcomes. Differences in retrieval algorithms can lead to variable sets of documents being retrieved for the same query. Variations in data indexing methods can cause inconsistencies in how documents are ranked and accessed. The stochastic nature of LLM operations introduces an element of randomness in the generative process. Updates in datasets can also alter the baseline against which queries are processed and interpreted, leading to different results over time.

This proposal aims to address these reproducibility challenges in similarity searches within RAG frameworks. This work involves analyzing the root causes of non-determinism, benchmarking and validating the consistency of query results, implementing enhancements to minimize variability, and developing tools and best practices to ensure reproducibility. Reproducibility in data queries can be influenced by several factors, including updates in datasets, differences in retrieval algorithms, varying data indexing methods, and the stochastic nature of LLM operations. Each of these factors can cause variability in the documents retrieved and in the generated responses. Ensuring consistency in query results across different runs is crucial for maintaining the integrity of LLM-driven scientific research, allowing researchers to confidently build upon prior work and achieve reliable, trustworthy outcomes.

Workplan

The proposed work will include: (1) Identifying sources of non-determinism and variability, such as algorithmic differences and indexing methods, in RAG; (2) Utilizing standardized scientific datasets to benchmark the reproducibility of similarity search results across different RAG frameworks; (3) Establishing protocols for handling dataset updates to ensure that such changes do not impact the reproducibility of similarity search results; and (4) Implementing mechanisms to track and document updates to datasets, ensuring that changes are reflected consistently across all instances of the RAG framework. By addressing these areas, the proposed work aims to mitigate challenges related to reproducibility in similarity search queries within RAG frameworks, ultimately enhancing the reliability and trustworthiness of scientific research outcomes.

Topics: Reproducibility LLM RAG Scientific Workflows
Skills: C/C++, Python
Difficulty: Medium
Size: Large (350 hours)
Mentors: Luanzheng "Lenny" Guo

AI Data Readiness Inspector (AIDRIN)

Tue, 11 Feb 2025 10:15:00 -0700

Garbage In Garbage Out (GIGO) is a universally agreed quote by computer scientists from various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest considerable time and effort in preparing the data for AI.

AIDRIN (AI Data Readiness INspector) is a framework that provides a quantifiable assessment of the readiness of data for AI processes, covering a broad range of readiness dimensions available in the literature. AIDRIN uses metrics in traditional data quality assessment, such as completeness, outliers, and duplicates, for data evaluation. Furthermore, AIDRIN uses metrics specific to assess data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and FAIR (Findability, Accessibility, Interoperability, and Reusability) principle compliance. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data.

AIDRIN Visualizations and Science Gateway

The proposed work will include improvements in the AIDRIN framework to (1) enhance, extend, and optimize the visualizations of metrics related to all six pillars of AI data readiness and (2) set up a science gateway on NERSC or AWS cloud service.

Topics: data readiness AI
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

h5bench with AI workloads

Tue, 11 Feb 2025 10:15:00 -0700

h5bench is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.

Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performant parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library’s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.

h5bench with AI workloads

The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.

Topics: I/O HPC benchmarking
Skills: Python, C/C++, good communicator
Difficulty: Moderate
Size: Large (350 hours)
Mentors: Jean Luca Bez and Suren Byna

Scenic: A Language for Design and Verification of Autonomous Cyber-Physical Systems

Tue, 11 Feb 2025 00:00:00 +0000

3D Driving Scenarios
A Library for Aviation Scenarios
Interfacing Scenic to new simulators
Optimizing and parallelizing Scenic
Improvements and infrastructure for the VerifAI toolkit

See the sections below for details.

3D Driving Scenarios

Topics: Autonomous Driving 3D modeling
Skills: Python; basic vector geometry
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

Scenic scenarios written to test autonomous vehicles use the driving domain, a Scenic library defining driving-specific concepts including cars, pedestrians, roads, lanes, and intersections. The library extracts information about road networks, such as the shapes of lanes, from files in the standard OpenDRIVE format. Currently, we only generate 2D polygons for lanes, throwing away 3D information. While this suffices for many driving scenarios, it means we cannot properly model overpasses (the roads appear to overlap) or test driving scenarios where 3D geometry is important, such as hilly terrain.

The goals of this project are to extend our road network library to generate 3D meshes (instead of 2D polygons) for roads, write new Scenic scenarios which use this new capability, and (if time allows) test autonomous driving software using them.

A Library for Aviation Scenarios

Topics: Autonomous Aircraft
Skills: Python; ideally some aviation experience
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

We have used Scenic to find, diagnose, and fix bugs in software for autonomous aircraft: in particular, this paper studied a neural network-based automated taxiing system using the X-Plane flight simulator. We also have prototype interfaces to AirSim and Microsoft Flight Simulator. However, our experiments so far have mainly focused on simple scenarios involving a single aircraft.

The goal of this project is to develop an aviation library for Scenic (like the driving domain mentioned in the previous project) which will allow users to create complex aviation scenarios in a simulator-agnostic way. The library would define concepts for aircraft, flight paths, weather, etc. and allow importing real-world data about these. The student would demonstrate the library’s functionality by writing some example scenarios and testing either simple aircraft controllers or (if time allows) ML-based flight software.

Interfacing Scenic to New Simulators

Topics: Simulation Autonomous Driving Robotics LLMs
Skills: Python
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

The AWSIM driving simulator (to allow testing the Autoware open-source autonomous driving software stack)
The CoppeliaSim robotics simulator
NVIDIA’s Cosmos, an LLM which generates videos from text prompts
NVIDIA’s Omniverse (various applications, e.g. simulating virtual factories)
Various simulators for which we have prototype interfaces that could be generalized and made more usable, including MuJoCo and Isaac Sim

Optimizing and Parallelizing Scenic

Topics: Optimization Parallelization
Skills: Python
Difficulty: Moderate
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

Large-scale testing with Scenic, when one wants to generate thousands of simulations, can be very computationally-expensive. In some cases, the bottleneck is the simulator, and being able to easily run multiple simulations in parallel would greatly increase scalability. In others, Scenic itself spends substantial time trying to sample scenarios satisfying all the given constraints.

This project would explore a variety of approaches to speeding up scene and simulation generation in Scenic. Some possibilities include:

Parallelizing scene generation and simulation (e.g. using Ray)
Systematically profiling real-world Scenic programs to characterize the main bottlenecks and propose optimizations
JIT compiling Scenic’s internal sampling code (e.g. using Numba)

Improvements and Infrastructure for the VerifAI Toolkit

Topics: DevOps Documentation APIs
Skills: Python
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Daniel Fremont, Eric Vin

VerifAI is a toolkit for design and analysis of AI-based systems that builds on top of Scenic. It adds among other features the ability to perform falsification, intelligently searching for scenarios that will cause a system to behave in an undesirable way.

The goal of this project is to improve VerifAI’s development infrastructure, documentation, and ease of use, which are currently relatively poor compared to Scenic. Specific tasks could include:

Setting up continuous integration (CI) on GitHub
Creating processes to help users/developers submit issues and PRs and deal with them in a timely manner
Writing more documentation, including tutorials and examples (not only for end users of VerifAI but those wanting to develop custom falsification components, for example)
Refactoring VerifAI’s API to make it easier to use and extend

Autograder

Thu, 06 Feb 2025 13:00:00 -0800

The EduLinq Autograder is an open source tool used by several courses at UCSC to safely and quickly grade programming assignments. Grading student code is something that may seem simple at first (you just need to run their code!), but quickly becomes exceeding complex as you get more into the details. Specifically, grading a student’s code securely while providing the “last mile” service of getting code from students and sending results to instructors/TAs and the course’s LMS (e.g., Canvas) can be very difficult. The Autograder provides all of this in a free and open source project. The LINQS Lab has made many contributions to the maintain and improve the Autograder.

All students interested in LINQS projects for OSRE/GSoC 2025 should fill out this form. Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project. The form will stop accepting responses once the application window closes. Do not post on any of the project repositories about OSRE/GSoC (e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025). Remember, these are active repositories that were not created for OSRE/GSoC.

LLM Detection

Topics: AI/ML LLM Research Backend
Skills: software development, backend, systems, data munging, go, docker
Difficulty: Challenging
Size: Large (350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The task for this project is to create a system that provides a score indicating the system’s confidence that a given piece of code was written by an AI tool and not a student. This will supplement the existing code analysis tools in the Autograder. There are many approaches to completing this task that will be considered. A more software development approach can consist of levering exiting systems to create a production-ready system, whereas a more research approach can consist of creating a novel approach complete with a paper and experiments.

See Also:

Code Analysis GUI

Topics: Frontend
Skills: software development, frontend, data munging, js, css, go
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Autograder has existing functionality to analyze the code in a student’s submission for malicious content. Relevant to this project is that the Autograder can run a pairwise similarity analysis against all submitted code. This is how most existing software plagiarism systems detect offending code. The existing infrastructure provides detailed statistics on code similarity, but does not currently have a visual way to display this data.

The task for this project is to create a web GUI using the Autograder REST API to display the results of a code analysis. The size of this project depends on how many of the existing features are going to be supported by the web GUI.

See Also:

Web GUI

Topics: Frontend
Skills: software development, frontend, js, css
Difficulty: Easy
Size: Medium or Large (175 or 350 hours)
Mentors: Eriq Augustine, Fabrice Kurmann, Lise Getoor

The Autograder contains dozens of API endpoints, most directly representing a piece of functionality exposed to the user. All of these features are exposed in the Autograder’s Python Interface. However, the Python interface is a purely command-line interface. And although command-line interface are objectively (read: subjectively) the best, a web GUI would be more accessible to a wider audience. The autograder already has a web GUI, but it does not cover all the features available in the Autograder.

The task for this project is to augment the Autograder’s web GUI with more features. Specifically, add support for more tools used to create and administer courses.

See Also:

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Thu, 06 Feb 2025 10:15:56 -0700

Project Description

Project Objectives

Step 1: Data Preprocessing & Feature Creation: Develop a data processing pipeline to parse user’s sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling; Extract user behavior sequences, items’ metadata, and temporal patterns to create context-aware sequential representations for training, validation and testing; The data source can be from Amazon open public data or Movie Lense data set. The data points creation can follow SASRec (in the reference 1).
Step 2: Model Development: Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context; Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions, reviews into the sequential recommendation process; The baseline model can be SASRec pytorch implementation.
Step 3: Evaluation: : Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating on accuracy, NDCG and cold-start performance; Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality; Optimize model inference speed and efficiency for real-time recommendation scenarios.

Project Deliverables

LLMSeqRec

Topics: LLM Enhanced Contextual Sequential Recommender
Skills: Proficiency in Python, Pytorch, Github, Self-attention, Transformer
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Linsey Pang, Bin Dong

References:

Self-Attentive Sequential Recommendation (SASRec)
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Amazon Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews
Movie Lense Data: https://grouplens.org/datasets/movielens/

ReIDMM: Re-identifying Multiple Objects across Multiple Streams

Thu, 06 Feb 2025 10:15:56 -0700

Project Description

Re-identifying multiple objects across multiple streams (ReIDMM) is essential in scientific research and various industries. It involves tracking and analyzing entities across different viewpoints or time frames. In astronomy, ReIDMM helps track celestial objects like asteroids and space debris using multiple observatories. In biology and ecology, it enables the identification of animals across different camera traps and aids in tracking microscopic organisms in laboratory studies. In physics and engineering, it is used for tracking particles in high-energy physics experiments, monitoring structural changes in materials, and identifying robots or drones in lab automation. Beyond scientific applications, ReIDMM plays a critical role in industries such as retail, where it tracks customer behavior across multiple stores and improves sales and prevents theft. In smart cities, it supports traffic monitoring by identifying vehicles across intersections for improved traffic flow management. In manufacturing, it enables supply chain tracking by locating packages across conveyor belts and warehouse cameras. In autonomous systems, ReIDMM enhances multi-camera sensor fusion and warehouse robotics by identifying pedestrians, obstacles, and objects across different camera views.

Project Objectives

Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an open-source algorithm for multiple-object re-identification across diverse open-source data streams. As highlighted earlier, this method is expected to have wide-ranging applications in both scientific research and industry. Utilizing an open-source dataset, our focus will be on re-identifying common objects such as vehicles and pedestrians. The primary challenge lies in designing a unified algorithm, ReIDMM, capable of performing robust multi-object re-identification across multiple streams. Users will be able to tag any object as a target in a video or image for tracking across streams. Below is an outline of the algorithms to be developed in this project:

Step 1: Target Object Identification: Randomly select a target object from an image or video using object detection models such as YOLOv7. These models detect objects by generating bounding boxes around them. Target objects could include vehicles, pedestrians, animals, or other recognizable entities. This step ensures an initial object of interest is chosen for re-identification.
Step 2: Feature Extraction and Embedding: Once the target object is identified, extract relevant features such as bounding box coordinates, timestamp, location metadata (if available), and visual characteristics. A multimodal embedding approach is used, where these features are transformed into a numerical representation (embedding vector) that captures the object’s unique identity. This allows for efficient comparison across different images or videos.
Step 3: Searching and Matching: To find the target object in other images or videos: (1) Extract embeddings of all objects detected in the other images/videos; (2) Compute similarity between the target object’s embedding and those of all detected objects using metrics like cosine similarity or Euclidean distance. (3) Rank objects by similarity, returning the most probable matches. The highest-ranked results are likely to be the same object observed from different angles, lighting conditions, or time frames.

Project Deliverables

This project will deliver three things, software, evaluation results and demo. The software which implements the above ReIDMM algorithm will be hosted on the github repo as open-access repositories. The evaluation results and demo will be published along the github repo.

ReIDMM

Topics: ReIDMM: Re-identifying Multiple Objects across Multiple Streams`
Skills: Proficient in Python, Experience with images processing, machine learning
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Bin Dong, Linsey Pang

Reference:

OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for Chip Design

Sun, 19 Jan 2025 00:00:00 +0000

The OpenROAD project is a non-profit project, originally funded by DARPA with the aim of creating open-source EDA tools; an Autonomous flow from RTL-GDSII that completes < 24 hrs, to lower cost and boost innovation in IC design. This project is now supported by Precision Innovations.

OpenROAD massively scales and supports EWD (Education and Workforce Development) and supports a broad ecosystem making it a vital tool that supports a rapidly growing Semiconductor Industry.

OpenROAD is the fastest onramp to gain knowledge, skills and create pathways for great career opportunities in chip design. You will develop important software and hardware design skills by contributing to these interesting projects. You will also have the opportunity to work with mentors from the OpenROAD project and other industry experts.

We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact in the rapidly growing, global Semiconductor Industry.

Improving Code Quality in OpenROAD

Topics: Coding Best Practices in C++, Code Quality Tooling, Continuous Integration
Skills: C++
Difficulty: Medium
Size: Medium (175 hours)
Mentors: Matt Liberty & Arthur Koucher

OpenROAD is a large and complex program. This project is to improve the code quality through resolving issues flagged by tools like Coverity and clang-tidy. New tools like the clang sanitizers ASAN/TSAN/UBSAN should also be set up and integrated with the Jenkins CI.

GUI Testing in OpenROAD

Topics: Testing, Continuous Integration
Skills: C++, Qt
Difficulty: Medium
Size: Large (350 hours)
Mentors: Matt Liberty & Peter Gadfort

The OpenROAD GUI is a crucial set of functionality for users to see and investigate their design. GUI testing is specialized and rather different from standard unit testing. The GUI therefore needs improvements to its testing to cover both interaction and rendering. The GUI uses the Qt framework. An open-source testing tool like https://github.com/faaxm/spix will be set up and key tests developed. This will provide the framework for all future testing.

Rectilinear Floorplans in OpenROAD

Topics: Electronic Design Automation, Algorithms
Skills: C++, data structures and algorithms
Difficulty: Medium
Size: Large (350 hours)
Mentors: Eder Monteiro & Augusto Berndt

OpenROAD supports block floorplans that are rectangular in shape. Some designs may require more complex shapes to fit. This project extends the tool to support rectilinear polygon shapes as floorplans. This will require upgrading data structures and algorithms in various parts of OpenROAD including floor plan generation, pin placement, and global placement.

LEF Reader and Database Enhancements in OpenROAD

Topics: Electronic Design Automation, Database, Parsing
Skills: Boost Spirit parsers, Database, C++
Difficulty: Medium
Size: Medium (175 hours)
Mentors: Osama Hammad & Ethan Mahintorabi

LEF (Library Exchange Format) is a standard format for describing physical design rules for integrated circuits. OpenROAD has support for many constructs but some newer ones for advanced process nodes are not supported. This project is to support parsing such information and storing in the OpenDB for use by the rest of the tool.

ORAssistant - LLM Data Engineering and Testing

Topics: Large Language Model, Machine Learning, Data Engineering, Model Deployment, Testing, Full-Stack Development
Skills: large language model engineering, database, evaluation, CI/CD, open-source or related software development, full-stack
Difficulty: Medium
Size: Medium (175 hours)
Mentor: Jack Luar & Palaniappan R

This project is aimed at enhancing robustness and accuracy for OR Assistant, the conversational assistant for OpenROAD through comprehensive testing and evaluation. You will work with members of the OpenROAD team and other researchers to enhance the existing dataset to cover a wide range of use cases to deliver accurate responses more efficiently. This project will focus on data engineering and benchmarking and you will collaborate on a project on the LLM model engineering. Tasks include: creating evaluation pipelines, building databases to gather feedback, improving CI/CD, writing documentation, and improving the backend and frontend services as needed (non-exhaustive). You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.

ORAssistant - LLM Model Engineering

Topics: Large Language Model, Machine Learning, Model Architecture, Model Deployment
Skills: large language model engineering, prompt engineering, fine-tuning
Difficulty: Medium
Size: Medium (175 hours)
Mentor: Jack Luar & Palaniappan R

This project is aimed at enhancing robustness and accuracy for OR Assistant, the conversational assistant for OpenROAD through enhanced model architectures. You will work with members of the OpenROAD team and other researchers to explore alternate architectures beyond the existing RAG-based implementation. This project will focus on improving reliability and accuracy of the existing model architecture. You will collaborate on a tandem project on data engineering for OR assistant. Tasks include: reviewing and understanding the state-of-the-art in retrieval augmented generation, implementing best practices, caching prompts, improving relevance and accuracy metrics, writing documentation and improving the backend and frontend services as needed (non-exhaustive). You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.

Midway Through GSoC

Wed, 31 Jul 2024 00:00:00 +0000

Hello everyone! I’m Joel Tony, and I’m excited to share my progress update on the Drishti project as part of my Google Summer of Code (GSoC) experience. Over the past few weeks, I’ve been diving deep into the world of I/O visualization for scientific applications, and I’m thrilled to tell you about the strides we’ve made.

What is Drishti?

For those unfamiliar with Drishti, it’s an application used to visualize I/O traces of scientific applications. When running complex scientific applications, understanding their I/O behavior can be challenging. Drishti steps in to parse logs from various sources, with a primary focus on those collected using Darshan, a lightweight I/O characterization tool for HPC applications. Drishti provides human-interpretable insights on how to improve I/O performance based on these logs. While Drishti supports multiple log sources, our current work emphasizes Darshan logs due to their comprehensive I/O information. Additionally, Drishti offers visually appealing and easy-to-understand graphs to help users better grasp their application’s I/O patterns, making it easier to identify bottlenecks and optimize performance.

Progress and Challenges

Export Directory Feature

One of the first features I implemented was the export directory functionality. In earlier versions of Drishti, users couldn’t select where they wanted their output files to be saved. This became problematic when working with read-only log locations. I familiarized myself with the codebase, created a pull request, and successfully added this feature, allowing users to choose their preferred output location.

CI Improvements and Cross-Project Dependencies

While working on Drishti, I discovered the tight coupling between various tools in the HPC I/O organization, such as Drishti and DXT Explorer. This highlighted the need for improved Continuous Integration (CI) practices. We currently run about eight GitHub Actions for each pull request, but they don’t adequately test the interactions between different branches of these interconnected tools. This is an area we’ve identified for future improvement to ensure smoother integration and fewer conflicts between projects.

Refactoring for Multi-File Support

The bulk of my time was spent refactoring Drishti to extend its framework from parsing single Darshan files to handling multiple files. This task was more complex than it initially appeared, as Drishti’s insights are based on the contents of each Darshan file. When dealing with multiple files, we needed to find a way to aggregate the data meaningfully without sacrificing on performance.

The original codebase had a single, thousand-line function for parsing Darshan files. To improve this, I implemented a data class structure in Python. This refactoring allows for:

Better separation of computation and condition checking
Easier parallelization of processing multiple traces
Finer-grained profiling of performance bottlenecks
More flexibility in data manipulation and memory management

Learnings and Skills Gained

Through this process, I’ve gained valuable insights into:

Refactoring large codebases
Understanding and improving cross-project dependencies
Implementing data classes in Python for better code organization
Balancing performance with code readability and maintainability

Next Steps

As I move forward with the project, my focus will be on:

Adding unit tests for individual methods to ensure functionality
Exploring alternative data frame implementations like Polars for better performance
Developing aggregation methods for different types of data across multiple Darshan files
Optimizing memory usage and computational efficiency for large datasets

Conclusion

Working on Drishti has been an incredible learning experience. I’ve had the opportunity to tackle real-world challenges in scientific computing and I/O visualization. As we progress, I’m excited about the potential impact of these improvements on the scientific community’s ability to optimize their applications’ I/O performance.

I’m grateful for this opportunity and looking forward to the challenges and discoveries that lie ahead in the second half of my GSoC journey. Stay tuned for more updates as we continue to enhance Drishti!

If you have any questions or would like to learn more about the project, feel free to reach out to me. Let’s keep pushing the boundaries of scientific computing together!

Halfway Through GSOC: Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Sat, 20 Jul 2024 00:00:00 +0000

Hello, I’m Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University. I’m currently working on the AIIO / Graph Neural Network project under the guidance of Bin Dong and Suren Byna. Our project focuses on enhancing the AIIO framework to automatically diagnose I/O performance bottlenecks in high-performance computing (HPC) systems using Graph Neural Networks (GNNs).

Project Overview

Our primary goal is to tackle the persistent issue of I/O bottlenecks in HPC applications. Identifying these bottlenecks manually is often labor-intensive and prone to errors. By integrating GNNs into the AIIO framework, we aim to create an automated solution that can diagnose these bottlenecks with high accuracy, ultimately improving the efficiency and reliability of HPC systems.

Progress and Challenges

Over the past few weeks, my work has been centered on developing a robust data pre-processing pipeline. This pipeline is crucial for converting raw I/O log data into a graph format suitable for GNN analysis. The data pre-processing involves extracting relevant features from Darshan I/O logs, which include job-related information and performance metrics. One of the main challenges has been dealing with the heterogeneity and sparsity of the data, which can affect the accuracy of our models. To address this, we’ve focused on using correlation analysis to identify and select the most relevant features, ensuring that the dataset is well-structured and informative for GNN processing.

We’ve also started constructing the GNN model. The model is designed to capture the complex relationships between different I/O operations and their impact on system performance. This involves defining nodes and edges in the graph that represent job IDs, counter types, and their values. We explored different graph structures, including those that focus on counter types and those that incorporate more detailed information. While more detailed graphs offer better accuracy, they also require more computational resources.

Current Achievements

Data Pre-processing Pipeline: We have successfully developed and tested the pipeline to transform Darshan I/O logs into graph-structured data. This was a significant milestone, as it sets the foundation for all subsequent GNN modeling efforts.
GNN Model Construction: The initial version of our GNN model has been implemented. This model is now capable of learning from the graph data and making predictions about I/O performance bottlenecks.
Correlation Analysis for Graph Structure Design: We have used correlation analysis on the dataset to understand the relationships between I/O counters. This analysis has been instrumental in designing a more effective graph structure, helping to better capture the dependencies and interactions critical for accurate performance diagnosis.

Training for Different Graph Structures: We are currently training our model using various graph structures to determine the most effective configuration for accurate I/O performance diagnosis. This ongoing process aims to refine our approach and improve the model’s predictive accuracy.

Next Steps

Looking ahead, we plan to focus on several key areas:

Refinement and Testing: We’ll continue refining the GNN model, focusing on improving its accuracy and efficiency. This includes experimenting with different graph structures and training techniques.
SHAP Analysis: To enhance the interpretability of our model, we’ll incorporate SHAP (SHapley Additive exPlanations) values. This will help us understand the contribution of each feature to the model’s predictions, making it easier to identify critical factors in I/O performance.
Documentation and Community Engagement: As we make progress, we’ll document our methods and findings, sharing them with the broader community. This includes contributing to open-source repositories and engaging with other researchers in the field.

This journey has been both challenging and rewarding, and I am grateful for the support and guidance from my mentors and the community. I look forward to sharing more updates as we continue to advance this exciting project.

Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Fri, 14 Jun 2024 00:00:00 +0000

Hello, I am Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University, specializing in Artificial Intelligence. This summer, I will be working on the project AIIO / Graph Neural Network under the mentorship of Bin Dong and Suren Byna.

High-Performance Computing (HPC) applications often face performance issues due to I/O bottlenecks. Manually identifying these bottlenecks is time-consuming and error-prone. My project aims to enhance the AIIO framework by integrating a Graph Neural Network (GNN) model to automatically diagnose I/O performance bottlenecks at the job level. This involves developing a comprehensive data pre-processing pipeline, constructing and validating a tailored GNN model, and rigorously testing the model’s accuracy using test cases from the AIIO dataset.

Through this project, I seek to provide a sophisticated, AI-driven approach to understanding and improving I/O performance in HPC systems, ultimately contributing to more efficient and reliable HPC applications.

Drishti

Thu, 06 Jun 2024 00:00:00 +0000

Namaste everyone! 🙏🏻

I’m Joel Tony, a third-year Computer Science undergraduate at BITS Pilani, Goa, India. I’m truly honored to be part of this year’s Google Summer of Code program, working with the UC OSPO organization on a project that genuinely excites me. I’m particularly grateful to be working under the mentorship of Dr. Jean Luca Bez, a Research Scientist at Lawrence Berkeley National Laboratory, and Dr. Suren Byna, a Full Professor at the Ohio State University. Their expertise in high-performance computing and data systems is invaluable as I tackle this project.

My project, “Drishti: Visualization and Analysis of AI-based Applications”, aims to extend the Drishti framework to better support AI/ML workloads, focusing specifically on optimizing their Input/Output (I/O) performance. I/O refers to the data transfer between a computer’s memory and external storage devices like hard drives (HDDs) or solid-state drives (SSDs). As AI models and datasets continue to grow exponentially in size, efficient I/O management has become a critical bottleneck that can significantly impact the overall performance of these data-intensive workloads.

Drishti is an innovative, interactive web-based framework that helps users understand the I/O behavior of scientific applications by visualizing I/O traces and highlighting bottlenecks. It transforms raw I/O data into interpretable visualizations, making performance issues more apparent. Now, I’m working to adapt these capabilities for the unique I/O patterns of AI/ML workloads.

Through my studies in high-performance computing and working with tools like BeeGFS and Darshan, I’ve gained insights into the intricacies of I/O performance. However, adapting Drishti for AI/ML workloads presents new challenges. In traditional HPC, computing often dominates, but in the realm of AI, the tables have turned. As models grow by billions of parameters and datasets expand to petabytes, I/O has become the critical path. Training larger models or using richer datasets doesn’t just mean more computation; it means handling vastly more data. This shift makes I/O optimisation not just a performance tweak but a fundamental enabler of AI progress. By fine-tuning Drishti for AI/ML workloads, we aim to pinpoint I/O bottlenecks precisely, helping researchers streamline their data pipelines and unlock the full potential of their hardware.

As outlined in my proposal, my tasks are threefold:

Modularize Drishti’s codebase: Currently, it’s a single 1700-line file that handles multiple functionalities. I’ll be refactoring it into focused, maintainable modules, improving readability and facilitating future enhancements.
Enable multi-trace handling: Unlike traditional HPC apps that typically generate one trace file, most AI jobs produce multiple. I’ll build a layer to aggregate these, providing a comprehensive view of the application’s I/O behavior.
Craft AI/ML-specific recommendations: Current suggestions often involve MPI-IO or HDF5, which aren’t typical in ML frameworks like PyTorch or TensorFlow. I’ll create targeted recommendations that align with these frameworks’ data pipelines.

This summer, my mission is to make Drishti as fluent in AI/ML I/O patterns as it is in traditional HPC workloads. My goal is not just to adapt Drishti but to optimize it for the unique I/O challenges that AI/ML applications face. Whether it’s dealing with massive datasets, handling numerous small files, or navigating framework-specific data formats, we want Drishti to provide clear, actionable insights.

From classroom theories to hands-on projects, from understanding file systems to optimizing AI workflows, each step has deepened my appreciation for the complexities and potential of high-performance computing. This GSoC project is an opportunity to apply this knowledge in a meaningful way, contributing to a tool that can significantly impact the open-source community.

In today’s AI-driven world, the pace of innovation is often gated by I/O performance. A model that takes weeks to train due to I/O bottlenecks might, with optimized I/O, train in days—translating directly into faster iterations, more experiments, and ultimately, breakthroughs. By making I/O behavior in AI/ML applications more interpretable through Drishti, we’re not just tweaking code. We’re providing developers with the insights they need to optimize their data pipelines, turning I/O from a bottleneck into a catalyst for AI advancement.

I look forward to sharing updates as we adapt Drishti for the AI era, focusing squarely on optimizing I/O for AI/ML workloads. In doing so, we aim to accelerate not just data transfer but the very progress of AI itself. I’m deeply thankful to Dr. Jean Luca Bez and Prof. Suren Byna for their guidance in this endeavor and to the UC OSPO and GSoC communities for this incredible opportunity.

AIIO / Graph Neural Network

Wed, 17 Jan 2024 10:15:56 -0700

[AIIO] (https://github.com/hpc-io/aiio) revolutionizes the way for users to automatically tune the I/O performance of applications on HPC systems. It currently works on linear regression models but has more opportunities to work on heterogeneous data, such as programming info. This requires extending the linear regression model to more complex models, such as heterogeneous graph neural networks. The proposed work will include developing the graph neural work-based model to predict the I/O performance and interpretation.

AIIO / Graph Neural Network

Topics: AIIO/Graph Neural Network`
Skills: Python, Github, Machine Learning
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Bin Dong, Suren Byna

The Specific tasks of the project include:

Develop the data pre-processing pipeline to convert I/O logs into formats which are required by the Graph Neural Network
Build and test the Graph Neural Network to model the I/O performance for HPC applications.
Test and evaluate the accuracy of the Graph Neural Network with test cases from AIIO