<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>llm | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/llm/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/llm/index.xml" rel="self" type="application/rss+xml"/><description>llm</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Thu, 05 Feb 2026 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>llm</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/llm/</link></image><item><title>NETAI: AI-Powered Network Anomaly Detection and Diagnostics Platform</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsd/netai/</link><pubDate>Thu, 05 Feb 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsd/netai/</guid><description>&lt;p>NETAI (Network AI) is an AI-powered network anomaly detection and diagnostics platform for the National Research Platform (NRP). This project combines Kubernetes-native LLM integration, network performance monitoring, and predictive analytics to create an intelligent assistant for network operators. Students will work with cutting-edge technologies including Large Language Models (LLMs), Kubernetes, perfSONAR network measurements, time-series analysis, and containerized AI/ML workloads, while contributing to real-world applications in network operations and diagnostics.&lt;/p>
&lt;p>The project involves developing a &lt;strong>Kubernetes chatbot&lt;/strong> that leverages NRP&amp;rsquo;s managed LLM service (providing access to models like Qwen3-VL, GLM-4.7, and GPT-OSS) to help network operators understand complex network behaviors, diagnose anomalies, and receive natural language explanations of network issues. Students will integrate perfSONAR measurement data with traceroute path analysis to create an interactive network topology visualization, and develop &lt;strong>AI/ML models&lt;/strong> for predictive network performance analysis using NRP&amp;rsquo;s GPU resources.&lt;/p>
&lt;p>In addition, students will gain hands-on experience with &lt;strong>fine-tuning LLMs&lt;/strong> on historical network diagnostics data, developing &lt;strong>time-series forecasting models&lt;/strong> for network metrics, and implementing &lt;strong>anomaly detection&lt;/strong> using deep learning techniques. The entire AI/ML pipeline will be containerized and deployed as Kubernetes workloads, utilizing GPU-enabled pods for model training and inference, ensuring scalability and seamless integration with existing NRP infrastructure.&lt;/p>
&lt;p>The platform builds upon existing network diagnostics capabilities, combining end-to-end throughput measurements with detailed traceroute data to enable operators to visualize network paths, identify performance bottlenecks, and understand relationships between metrics and underlying infrastructure. The AI enhancement will provide predictive capabilities, automated incident reporting, and intelligent recommendations for network remediation strategies.&lt;/p>
&lt;h3 id="netai--llm-integration--kubernetes-chatbot">NETAI / LLM Integration &amp;amp; Kubernetes Chatbot&lt;/h3>
&lt;p>The proposed work includes developing a &lt;strong>Kubernetes-native chatbot&lt;/strong> that integrates with NRP&amp;rsquo;s managed LLM service to provide intelligent network diagnostics assistance. Students will create a conversational interface that can answer questions about network performance, explain anomalies in natural language, and suggest remediation strategies. They will fine-tune LLMs on historical network diagnostics data, test results, and traceroute information to create domain-specific assistants. Students will implement &lt;strong>RESTful APIs&lt;/strong> for chatbot interactions, develop &lt;strong>prompt engineering&lt;/strong> strategies for network diagnostics, and create &lt;strong>context-aware responses&lt;/strong> that incorporate real-time network telemetry. The chatbot will be deployed as Kubernetes services, utilizing GPU pods for inference and integrating with the existing diagnostics platform.&lt;/p>
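&lt;p>To make the integration concrete, below is a minimal Python sketch of a single chatbot turn. It assumes the managed LLM service exposes an OpenAI-compatible chat endpoint; the base URL, model name, and telemetry fields are illustrative placeholders rather than actual NRP values.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch: one context-aware diagnostics turn against an
# OpenAI-compatible endpoint (URL, model, and fields are placeholders).
from openai import OpenAI

client = OpenAI(base_url="https://llm.example.org/v1", api_key="YOUR_KEY")

def explain_anomaly(telemetry: dict, question: str) -> str:
    # Fold live telemetry into the system prompt so answers are context-aware.
    system = (
        "You are a network diagnostics assistant. Current telemetry: "
        f"throughput={telemetry['throughput_gbps']} Gbps, "
        f"packet loss={telemetry['packet_loss_pct']}%, "
        f"path={telemetry['path']}."
    )
    resp = client.chat.completions.create(
        model="example-model",  # placeholder for a service-hosted model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(explain_anomaly(
    {"throughput_gbps": 3.2, "packet_loss_pct": 4.5, "path": "ucsd -> chicago"},
    "Why did throughput drop on this path overnight?",
))
&lt;/code>&lt;/pre>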
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Large Language Models, Kubernetes, Chatbots, Natural Language Processing, Network Diagnostics, API Development&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, Kubernetes, LLM APIs (Qwen3-VL, GLM-4.7, GPT-OSS), Prompt Engineering, REST APIs, Docker, GPU Computing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="netai--network-anomaly-detection-models">NETAI / Network Anomaly Detection Models&lt;/h3>
&lt;p>The proposed work includes developing &lt;strong>deep learning models&lt;/strong> for network anomaly detection using historical perfSONAR and traceroute data. Students will create models that can identify slow links, high packet loss, excessive retransmits, and failed network tests automatically. They will implement &lt;strong>anomaly detection algorithms&lt;/strong> using techniques such as autoencoders, LSTM networks, and transformer architectures. Students will train models on NRP&amp;rsquo;s GPU clusters using historical network telemetry stored in SQLite databases, develop &lt;strong>feature engineering&lt;/strong> pipelines for network metrics, and create &lt;strong>real-time inference services&lt;/strong> deployed as Kubernetes workloads. The models will be integrated into the diagnostics platform to provide automated anomaly detection alongside the interactive visualization.&lt;/p>
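&lt;p>As a flavor of the modeling work, here is a small PyTorch sketch of a reconstruction-based detector: an autoencoder is trained on windows of normal telemetry, and a high reconstruction error on a new window flags it as anomalous. Feature counts and layer sizes are illustrative assumptions, not values prescribed by the project.&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

class MetricAutoencoder(nn.Module):
    """Compress flattened windows of network metrics and reconstruct them."""
    def __init__(self, n_features: int = 4, window: int = 32):
        super().__init__()
        d = n_features * window
        self.encoder = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 8))
        self.decoder = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MetricAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(batch):
    # batch: (B, n_features * window), drawn from normal traffic only
    opt.zero_grad()
    loss = loss_fn(model(batch), batch)
    loss.backward()
    opt.step()
    return loss.item()

def anomaly_score(window):
    # Higher reconstruction error = more anomalous
    with torch.no_grad():
        return loss_fn(model(window), window).item()
&lt;/code>&lt;/pre>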
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Deep Learning, Anomaly Detection, Time-Series Analysis, Network Monitoring, Model Training, GPU Computing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch/TensorFlow, scikit-learn, Pandas, NumPy, SQLite, Kubernetes, GPU Pods, MLOps&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="netai--predictive-analytics--forecasting">NETAI / Predictive Analytics &amp;amp; Forecasting&lt;/h3>
&lt;p>The proposed work includes developing &lt;strong>predictive models&lt;/strong> that can forecast network performance degradation and identify patterns in network anomalies before they impact users. Students will create &lt;strong>time-series forecasting models&lt;/strong> for network metrics such as throughput, latency, and packet loss, using techniques like ARIMA, Prophet, and deep learning-based forecasting. They will implement &lt;strong>few-shot learning approaches&lt;/strong> to adapt models to new network topologies and measurement patterns, develop &lt;strong>early warning systems&lt;/strong> for potential network issues, and create &lt;strong>automated incident report generation&lt;/strong> using LLMs. Students will leverage NRP&amp;rsquo;s GPU resources for training forecasting models and deploy them as Kubernetes services for real-time predictions integrated with the diagnostics dashboard.&lt;/p>
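&lt;p>A minimal sketch of the forecasting idea, using Prophet to predict hourly throughput with uncertainty bands that can back an early-warning rule. The CSV source and its contents are illustrative; Prophet only requires the ds/y column schema shown here.&lt;/p>
&lt;pre>&lt;code class="language-python">import pandas as pd
from prophet import Prophet

# Illustrative loader: historical throughput with columns
# ds (timestamp) and y (measured Gbps), as Prophet expects.
history = pd.read_csv("throughput_history.csv")

m = Prophet(interval_width=0.95)  # 95% uncertainty band
m.fit(history)

future = m.make_future_dataframe(periods=24, freq="h")  # next 24 hours
forecast = m.predict(future)[["ds", "yhat", "yhat_lower", "yhat_upper"]]

# Early-warning rule: an observation later falling below yhat_lower
# is a candidate degradation alert.
print(forecast.tail(24))
&lt;/code>&lt;/pre>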
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Time-Series Forecasting, Predictive Analytics, Machine Learning, Network Performance, Early Warning Systems, LLM Integration&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, PyTorch/TensorFlow, Prophet, ARIMA, Pandas, NumPy, Time-Series Analysis, Kubernetes, GPU Computing&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="netai--kubernetes-deployment--infrastructure">NETAI / Kubernetes Deployment &amp;amp; Infrastructure&lt;/h3>
&lt;p>The proposed work includes setting up &lt;strong>Kubernetes-based infrastructure&lt;/strong> for deploying the entire NETAI platform, including LLM services, ML models, and the diagnostics dashboard. Students will create &lt;strong>Helm charts&lt;/strong> for deploying containerized AI/ML workloads, configure &lt;strong>GPU-enabled pods&lt;/strong> for model training and inference, and implement &lt;strong>persistent storage&lt;/strong> solutions for maintaining historical network telemetry. They will develop &lt;strong>GitLab CI/CD pipelines&lt;/strong> for automated testing and deployment, set up &lt;strong>monitoring and observability&lt;/strong> using Prometheus and Grafana for tracking model performance and resource usage, and create &lt;strong>scalable deployment strategies&lt;/strong> that leverage NRP&amp;rsquo;s distributed computing resources. Students will also integrate the platform with existing perfSONAR infrastructure and ensure seamless operation within the NRP cluster.&lt;/p>
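&lt;p>As a small illustration of GPU scheduling, the sketch below requests a single-GPU pod through the official Kubernetes Python client. The namespace, image, and pod name are placeholders; in the actual project the same spec would typically live in a Helm chart rather than imperative code.&lt;/p>
&lt;pre>&lt;code class="language-python">from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="netai-inference"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="inference",
                image="registry.example.org/netai-inference:latest",
                resources=client.V1ResourceRequirements(
                    # Ask the scheduler for one NVIDIA GPU on this pod.
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="netai", body=pod)
&lt;/code>&lt;/pre>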
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> Kubernetes, DevOps, CI/CD, GPU Computing, Container Orchestration, Infrastructure as Code, Monitoring&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Kubernetes, Helm, GitLab CI/CD, Prometheus, Grafana, Docker, GPU Pods, Persistent Storage, Infrastructure Automation&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/dmitry-mishin/">Dmitry Mishin&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/derek-weitzel/">Derek Weitzel&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="project-resources">Project Resources&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>National Research Platform&lt;/strong>: &lt;a href="https://nrp.ai/" target="_blank" rel="noopener">https://nrp.ai/&lt;/a>&lt;/li>
&lt;li>&lt;strong>NRP LLM Service&lt;/strong>: &lt;a href="https://nrp.ai/documentation/userdocs/ai/llm-managed/" target="_blank" rel="noopener">https://nrp.ai/documentation/userdocs/ai/llm-managed/&lt;/a>&lt;/li>
&lt;li>&lt;strong>perfSONAR&lt;/strong>: &lt;a href="https://www.perfsonar.net/" target="_blank" rel="noopener">https://www.perfsonar.net/&lt;/a>&lt;/li>
&lt;li>&lt;strong>MaDDash&lt;/strong>: &lt;a href="https://github.com/esnet/maddash" target="_blank" rel="noopener">https://github.com/esnet/maddash&lt;/a>&lt;/li>
&lt;li>&lt;strong>Network Monitoring Documentation&lt;/strong>: &lt;a href="https://nrp.ai/documentation/" target="_blank" rel="noopener">https://nrp.ai/documentation/&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="background">Background&lt;/h2>
&lt;p>This project addresses critical gaps in network performance monitoring for the National Research Platform by integrating AI/ML capabilities with existing perfSONAR-based diagnostics. The platform combines end-to-end network measurements with detailed path-level analysis, enhanced by intelligent AI assistants that can help operators understand complex network behaviors and predict potential issues. By leveraging NRP&amp;rsquo;s managed LLM service and GPU resources, students will create a Kubernetes-native system that scales across the distributed research network infrastructure, providing both real-time diagnostics and predictive analytics to improve network reliability and performance for researchers nationwide.&lt;/p></description></item><item><title>MedJEPA: Self-Supervised Medical Image Representation Learning with JEPA</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/medjepa/</link><pubDate>Mon, 19 Jan 2026 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/medjepa/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>[MedJEPA] Medical image analysis is fundamental to modern healthcare, enabling disease diagnosis, treatment planning, and patient monitoring across diverse clinical applications. In radiology and pathology, deep learning models support automated detection of abnormalities, tumor segmentation, and diagnostic assistance. Medical imaging modalities including X-rays, CT scans, MRI, ultrasound, and histopathology slides generate vast amounts of unlabeled data that could benefit from self-supervised representation learning. Clinical applications include cancer detection and staging, cardiovascular disease assessment, neurological disorder diagnosis, and infectious disease screening. In drug discovery and clinical research, analyzing medical images helps evaluate treatment efficacy, predict patient outcomes, and identify biomarkers for disease progression. Telemedicine and point-of-care diagnostics benefit from AI-powered image analysis that extends expert-level interpretation to underserved regions. However, medical imaging faces unique challenges: limited labeled datasets due to expensive expert annotation, patient privacy concerns restricting data sharing, domain shift across different imaging equipment and protocols, and the need for models that generalize across hospitals and populations.&lt;/p>
&lt;p>Traditional medical image analysis relies heavily on supervised learning with manually annotated labels, creating bottlenecks due to the scarcity and cost of expert annotations. Existing self-supervised methods applied to medical imaging often employ complex training procedures with numerous heuristics—momentum encoders, stop-gradients, teacher-student architectures, and carefully tuned augmentation strategies—that may not translate well across different medical imaging modalities and clinical contexts. These approaches struggle with domain-specific challenges such as subtle pathological features, high-resolution images, 3D volumetric data, and the need for interpretable representations that clinicians can trust. To address these challenges, we propose MedJEPA: Self-Supervised Medical Image Representation Learning with Joint-Embedding Predictive Architecture, which leverages the theoretically grounded LeJEPA framework for 2D medical images and V-JEPA principles for medical video and volumetric data, creating a unified, scalable, and heuristics-free approach specifically tailored for medical imaging applications.&lt;/p>
&lt;p>By utilizing the principled JEPA frameworks with objectives like Sketched Isotropic Gaussian Regularization (SIGReg), MedJEPA eliminates complex training heuristics while learning clinically meaningful representations from unlabeled medical images. Unlike conventional self-supervised methods that require extensive hyperparameter tuning and may not generalize across medical imaging modalities, MedJEPA provides a clean, theoretically motivated framework with minimal hyperparameters that adapts to diverse medical imaging contexts—from chest X-rays to histopathology slides to cardiac MRI sequences. The learned representations can support downstream tasks including disease classification, lesion detection, organ segmentation, and survival prediction, while requiring significantly fewer labeled examples for fine-tuning. This approach democratizes access to state-of-the-art medical AI by enabling effective learning from the vast amounts of unlabeled medical imaging data available in hospital archives, addressing the annotation bottleneck that has limited progress in medical AI.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to apply Joint-Embedding Predictive Architecture (JEPA) frameworks to medical image representation learning, addressing the critical challenge of learning from limited labeled medical data. Medical imaging generates enormous amounts of unlabeled data, but supervised learning approaches are bottlenecked by the scarcity and cost of expert annotations. Existing self-supervised methods often rely on complex heuristics that don&amp;rsquo;t generalize well across diverse medical imaging modalities, equipment vendors, and clinical protocols.&lt;/p>
&lt;p>This project will leverage the theoretically grounded LeJEPA framework for 2D medical images (X-rays, histopathology slides, fundus images) and V-JEPA principles for temporal and volumetric medical data (cardiac MRI sequences, CT scans, surgical videos). The core challenge lies in adapting these heuristics-free, stable frameworks to medical imaging&amp;rsquo;s unique characteristics: subtle pathological features requiring fine-grained representations, high-resolution images demanding efficient processing, domain shift across hospitals and equipment, and the need for interpretable features that support clinical decision-making. The learned representations will be evaluated on diverse downstream clinical tasks including disease classification, lesion detection, organ segmentation, and prognosis prediction, with emphasis on few-shot learning scenarios that reflect real-world annotation constraints. Below is an outline of the methodologies and models that will be developed in this project.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Medical Data Preparation&lt;/strong>:
Develop data processing pipelines for diverse medical imaging modalities, implementing DICOM/NIfTI parsing, standardized preprocessing, and efficient data loading for self-supervised pre-training.&lt;/p>
&lt;ul>
&lt;li>Prepare 2D medical image datasets:
&lt;ul>
&lt;li>Chest X-rays: ChestX-ray14, MIMIC-CXR, CheXpert for lung disease detection&lt;/li>
&lt;li>Histopathology: Camelyon16/17 (breast cancer), PCam (patch-level classification)&lt;/li>
&lt;li>Retinal imaging: EyePACS, APTOS (diabetic retinopathy), Messidor&lt;/li>
&lt;li>Dermatology: HAM10000, ISIC (skin lesion classification)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Prepare 3D volumetric and temporal medical data:
&lt;ul>
&lt;li>CT scans: LIDC-IDRI (lung nodules), Medical Segmentation Decathlon datasets&lt;/li>
&lt;li>MRI sequences: BraTS (brain tumors), ACDC (cardiac MRI), UK Biobank cardiac videos&lt;/li>
&lt;li>Medical video: surgical procedure videos, endoscopy recordings, ultrasound sequences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Implement medical imaging-specific preprocessing: intensity normalization, resolution standardization, handling of multi-channel medical images (different MRI sequences, RGB histopathology), and privacy-preserving anonymization.&lt;/li>
&lt;li>Design masking strategies appropriate for medical imaging: spatial masking for 2D images, volumetric masking for 3D scans, temporal masking for sequences, and anatomy-aware masking that respects organ boundaries.&lt;/li>
&lt;li>Create data loaders supporting high-resolution medical images, 3D volumes, and multi-modal inputs (e.g., multiple MRI sequences).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: JEPA Model Implementation for Medical Imaging&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Implement LeJEPA for 2D medical images:
&lt;ul>
&lt;li>Adapt the joint-embedding predictive architecture to medical image characteristics (high resolution, subtle features, domain-specific patterns)&lt;/li>
&lt;li>Apply Sketched Isotropic Gaussian Regularization (SIGReg) to learn clinically meaningful embedding distributions&lt;/li>
&lt;li>Maintain a single trade-off hyperparameter and heuristics-free training for reproducibility across medical imaging centers&lt;/li>
&lt;li>Support various encoder architectures: Vision Transformers for global context, ConvNets for local features, hybrid approaches&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Extend to V-JEPA for medical video and volumetric data:
&lt;ul>
&lt;li>Spatiotemporal encoding for cardiac MRI sequences, surgical videos, and time-series medical imaging&lt;/li>
&lt;li>Temporal prediction objectives for understanding disease progression and treatment response&lt;/li>
&lt;li>3D volume processing for CT and MRI scans with efficient memory management&lt;/li>
&lt;li>Multi-slice and multi-sequence learning for comprehensive medical imaging contexts&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Develop medical domain-specific enhancements:
&lt;ul>
&lt;li>Multi-scale representation learning to capture both fine-grained pathological details and global anatomical context&lt;/li>
&lt;li>Interpretability mechanisms: attention visualization, feature attribution, and embedding-space analysis for clinical validation&lt;/li>
&lt;li>Robustness to domain shift: training strategies that generalize across different scanners, protocols, and institutions&lt;/li>
&lt;li>Privacy-preserving training considerations compatible with medical data regulations (HIPAA, GDPR)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Implement efficient training infrastructure:
&lt;ul>
&lt;li>Support for distributed training across multiple GPUs for large medical imaging datasets&lt;/li>
&lt;li>Memory-efficient processing of high-resolution images and 3D volumes&lt;/li>
&lt;li>Checkpoint management and model versioning for clinical deployment pipelines&lt;/li>
&lt;li>Minimal-code implementation (≈50-100 lines) demonstrating framework simplicity (a toy training-step sketch follows this list)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation &amp;amp; Safety Validation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Disease classification tasks:
&lt;ul>
&lt;li>Multi-label chest X-ray classification: 14 pathology classes on ChestX-ray14 and MIMIC-CXR&lt;/li>
&lt;li>Diabetic retinopathy grading: 5-class classification on EyePACS and APTOS&lt;/li>
&lt;li>Skin lesion classification: 7-class classification on HAM10000&lt;/li>
&lt;li>Brain tumor classification: glioma grading on the BraTS dataset&lt;/li>
&lt;li>Evaluate with linear probing, few-shot learning (5-shot, 10-shot), and full fine-tuning&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Lesion detection and segmentation:
&lt;ul>
&lt;li>Lung nodule detection on the LIDC-IDRI dataset&lt;/li>
&lt;li>Tumor segmentation on Medical Segmentation Decathlon tasks&lt;/li>
&lt;li>Polyp detection in colonoscopy videos&lt;/li>
&lt;li>Cardiac structure segmentation in MRI sequences&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Clinical prediction tasks:
&lt;ul>
&lt;li>Survival prediction from histopathology slides&lt;/li>
&lt;li>Disease progression prediction from longitudinal imaging&lt;/li>
&lt;li>Treatment response assessment from pre/post imaging pairs&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Few-shot and low-data regime evaluation:
&lt;ul>
&lt;li>Systematic evaluation with 1%, 5%, 10%, 25%, and 50% of labeled training data&lt;/li>
&lt;li>Comparison against supervised baselines and ImageNet pre-training&lt;/li>
&lt;li>Analysis of annotation efficiency: performance vs. number of labeled examples required&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
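&lt;p>To ground the training recipe, here is a deliberately small PyTorch sketch of a JEPA-style step on 2D images: a masked context view is encoded and must predict the embedding of the full view, while a naive isotropic-Gaussian penalty on the batch embedding statistics stands in for SIGReg. The architecture, loss weight, and penalty form are toy assumptions for illustration, not the LeJEPA implementation.&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoder/predictor for single-channel images (e.g., chest X-ray crops).
encoder = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128),
)
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def isotropic_penalty(z):
    # Push batch embeddings toward zero mean and identity covariance
    # (a crude stand-in for SIGReg's sketched directional tests).
    mu = z.mean(dim=0)
    cov = (z - mu).T @ (z - mu) / (z.shape[0] - 1)
    return mu.pow(2).sum() + (cov - torch.eye(z.shape[1])).pow(2).sum()

def train_step(context_view, target_view, lam=0.1):
    z_ctx = encoder(context_view)  # masked view
    z_tgt = encoder(target_view)   # full view; no stop-gradient or EMA teacher
    loss = F.mse_loss(predictor(z_ctx), z_tgt) + lam * (
        isotropic_penalty(z_ctx) + isotropic_penalty(z_tgt)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
&lt;/code>&lt;/pre>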
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: software implementation, clinical evaluation, and practical deployment resources. The software implementing MedJEPA will be hosted on GitHub as an open-access repository with modular code supporting multiple medical imaging modalities (2D images, 3D volumes, videos), pre-trained model checkpoints on major medical imaging datasets (chest X-rays, histopathology, MRI), training and evaluation scripts with medical imaging-specific preprocessing pipelines, privacy-preserving training implementations compatible with clinical data regulations, and comprehensive documentation including tutorials for medical AI researchers and clinicians. The evaluation results will include benchmarks on 10+ medical imaging datasets across diverse modalities and clinical tasks, few-shot learning analysis demonstrating annotation efficiency gains, cross-institutional validation studies showing robustness to domain shift, interpretability visualizations enabling clinical validation of learned representations, and detailed comparisons against supervised baselines and existing medical self-supervised methods.&lt;/p>
&lt;h3 id="neurohealth">NeuroHealth&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Self-Supervised Medical Image Representation Learning with JEPA&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, PyTorch, GitHub, and JEPA&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics - Randall Balestriero and Yann LeCun, arXiv 2025&lt;/li>
&lt;li>Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) - Adrien Bardes et al., arXiv 2024&lt;/li>
&lt;li>Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture - Mahmoud Assran et al., CVPR 2023 (I-JEPA)&lt;/li>
&lt;li>ChestX-ray14: Hospital-Scale Chest X-Ray Database - &lt;a href="https://nihcc.app.box.com/v/ChestXray-NIHCC" target="_blank" rel="noopener">https://nihcc.app.box.com/v/ChestXray-NIHCC&lt;/a>&lt;/li>
&lt;li>Medical Segmentation Decathlon - &lt;a href="http://medicaldecathlon.com/" target="_blank" rel="noopener">http://medicaldecathlon.com/&lt;/a>&lt;/li>
&lt;li>MIMIC-CXR Database - &lt;a href="https://physionet.org/content/mimic-cxr/" target="_blank" rel="noopener">https://physionet.org/content/mimic-cxr/&lt;/a>&lt;/li>
&lt;li>The Cancer Imaging Archive (TCIA) - &lt;a href="https://www.cancerimagingarchive.net/" target="_blank" rel="noopener">https://www.cancerimagingarchive.net/&lt;/a>&lt;/li>
&lt;li>UK Biobank Imaging Study - &lt;a href="https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/imaging-data" target="_blank" rel="noopener">https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/imaging-data&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>NeuroHealth: AI-Powered Health Assistant</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/neurohealth/</link><pubDate>Mon, 19 Jan 2026 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/nelbl/neurohealth/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>[NeuroHealth] Intelligent health assistance systems are increasingly essential for improving healthcare accessibility, patient engagement, and clinical decision support. In primary care and preventive medicine, AI assistants help users understand symptoms, schedule appropriate appointments, and receive preliminary health guidance. Telemedicine applications include triage support, appointment scheduling optimization, and patient education based on health inquiries. In chronic disease management, these systems provide medication reminders, lifestyle recommendations, and timely alerts for medical follow-ups. Healthcare navigation applications include finding appropriate specialists, understanding treatment options, and coordinating care across multiple providers. In wellness and preventive care, intelligent assistants enhance health literacy by delivering personalized health information, screening recommendations, and proactive health management strategies. By leveraging natural language understanding and medical knowledge integration, these systems enhance healthcare access, reduce unnecessary emergency visits, and empower users to make informed health decisions across diverse populations.&lt;/p>
&lt;p>Traditional health information systems often provide generic responses that fail to account for individual health contexts, medical history, and personal circumstances. Existing symptom checkers and health chatbots primarily rely on rule-based logic or simple decision trees, limiting their ability to understand nuanced health inquiries, reason about complex symptom patterns, or provide contextually appropriate guidance. These systems struggle with interpreting ambiguous descriptions, adapting to users&amp;rsquo; health literacy levels, and generating personalized recommendations that account for individual medical constraints and preferences. To address these challenges, we propose NeuroHealth: AI-Powered Health Assistant, which leverages Large Language Models (LLMs) to create an intelligent conversational agent that synthesizes user health inquiries, symptom descriptions, and contextual information into actionable, personalized health guidance and appointment recommendations.&lt;/p>
&lt;p>By integrating LLM-based medical reasoning with structured clinical knowledge bases, NeuroHealth enhances symptom interpretation, appointment routing, and health education delivery. Unlike conventional systems that provide static responses from predetermined templates, NeuroHealth dynamically understands user intent, asks clarifying questions, assesses urgency levels, and generates appropriate recommendations—whether scheduling a doctor appointment, suggesting self-care measures, or directing users to emergency services. This fusion of LLM intelligence with validated medical knowledge enables a more accessible, adaptive, and helpful health assistance platform, bridging the gap between users seeking health information and appropriate medical care.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to develop an AI-Powered Health Assistant (NeuroHealth) to improve healthcare accessibility and patient engagement through intelligent conversational guidance. Healthcare systems face significant challenges in providing timely, personalized health information and connecting patients with appropriate care resources. Traditional symptom checkers and health information systems often deliver generic, rule-based responses that fail to account for individual contexts and struggle with natural language understanding.&lt;/p>
&lt;p>To address these limitations, this project will leverage Large Language Models (LLMs) to create an intelligent health assistant that understands user health inquiries, interprets symptom descriptions, assesses urgency, and provides personalized recommendations including doctor appointment suggestions, self-care guidance, and healthcare navigation support. The core challenge lies in designing NeuroHealth as a safe, accurate, and user-friendly system capable of natural conversation, medical knowledge retrieval, and appropriate response generation while maintaining clinical safety guardrails. Unlike conventional health chatbots that follow rigid conversation flows, NeuroHealth will reason over user inputs, ask clarifying questions, and dynamically adapt responses based on context, resulting in more helpful, accurate, and appropriate health assistance. Below is an outline of the methodologies and models that will be developed in this project.&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Data Collection &amp;amp; Knowledge Base Construction&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Develop a comprehensive medical knowledge base integrating validated health information sources, symptom databases, condition descriptions, and appointment routing guidelines.&lt;/li>
&lt;li>Collect and curate conversational health inquiry datasets from public medical Q&amp;amp;A forums, symptom checker logs, and healthcare chatbot interactions to create training and evaluation data.&lt;/li>
&lt;li>Design structured representations for symptoms, conditions, urgency levels, and appointment recommendations to enable effective retrieval and reasoning.&lt;/li>
&lt;li>Extract common health inquiry patterns, symptom descriptions, and user intent categories to inform conversation flow design.&lt;/li>
&lt;li>Data sources can include public medical knowledge bases such as MedlinePlus, Mayo Clinic health information, clinical practice guidelines, and synthetic patient inquiry scenarios based on common healthcare use cases.&lt;/li>
&lt;li>Implement data validation mechanisms to ensure medical accuracy and clinical safety compliance.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Model Development&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Design and implement an LLM-based conversational health assistant that integrates medical knowledge retrieval with natural language understanding and generation.&lt;/li>
&lt;li>Develop a Retrieval-Augmented Generation (RAG) architecture that grounds LLM responses in validated medical information sources, reducing hallucination risks and ensuring factual accuracy.&lt;/li>
&lt;li>Create prompt engineering strategies and reasoning frameworks that enable the system to interpret symptom descriptions, assess urgency levels, ask appropriate clarifying questions, and generate personalized health guidance.&lt;/li>
&lt;li>Implement a multi-component architecture including intent recognition, symptom extraction, urgency assessment, appointment recommendation generation, and response formatting modules.&lt;/li>
&lt;li>Develop clinical safety guardrails that detect high-risk scenarios requiring immediate medical attention and provide appropriate emergency guidance.&lt;/li>
&lt;li>Design conversation management strategies that maintain context across multi-turn dialogues and adapt to users&amp;rsquo; health literacy levels.&lt;/li>
&lt;li>The baseline architecture can leverage state-of-the-art models such as GPT-4, Claude, or open-source alternatives like Llama and Qwen, combined with medical knowledge retrieval systems (a minimal retrieval-augmented sketch follows this list).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation &amp;amp; Safety Validation&lt;/strong>:&lt;/p>
&lt;ul>
&lt;li>Benchmark NeuroHealth against existing symptom checkers and health chatbots, evaluating on metrics including response accuracy, appropriateness of appointment recommendations, urgency assessment precision, and user satisfaction.&lt;/li>
&lt;li>Conduct human evaluation studies with healthcare professionals to assess clinical safety, response quality, and appropriateness of medical guidance.&lt;/li>
&lt;li>Perform adversarial testing to identify potential failure modes, unsafe responses, or inappropriate recommendations under edge cases.&lt;/li>
&lt;li>Conduct ablation studies to analyze the impact of retrieval-augmented generation, safety guardrails, and conversation management strategies on system performance.&lt;/li>
&lt;li>Evaluate system performance across diverse health inquiry types including acute symptoms, chronic condition management, preventive care questions, and healthcare navigation requests.&lt;/li>
&lt;li>Assess response quality across different user demographics and health literacy levels to ensure equitable access.&lt;/li>
&lt;li>Optimize inference efficiency and response latency for real-time conversational interaction across web and mobile platforms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
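&lt;p>The sketch below illustrates the retrieval-augmented pattern from Step 2: embed a vetted knowledge base, retrieve the closest passages for a query, and instruct the model to answer only from that context. The endpoint, model names, and two-entry knowledge base are illustrative assumptions, and a real deployment would add the safety guardrails described above.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

# Tiny stand-in for a curated medical knowledge base.
kb = [
    "Chest pain with shortness of breath can signal a medical emergency.",
    "Mild seasonal allergies are often managed with antihistamines.",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")
kb_vecs = embedder.encode(kb, normalize_embeddings=True)

client = OpenAI(base_url="https://llm.example.org/v1", api_key="YOUR_KEY")

def answer(query: str, k: int = 2) -> str:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(kb_vecs @ q)[::-1][:k]  # cosine similarity on unit vectors
    context = "\n".join(kb[i] for i in top)
    prompt = (
        "Answer using ONLY the context below. If the context suggests an "
        "emergency, tell the user to seek immediate care.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    resp = client.chat.completions.create(
        model="example-model",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
&lt;/code>&lt;/pre>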
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: model development, evaluation and validation, and interactive demonstration. The software implementing the NeuroHealth system will be hosted on GitHub as an open-access repository with comprehensive documentation, deployment guides, and API specifications. The evaluation results, including benchmark comparisons against existing systems, clinical safety assessments, and user study findings, will be published alongside the GitHub repository. An interactive demo showcasing the conversational interface, symptom interpretation capabilities, and appointment recommendation generation will be provided to illustrate real-world application scenarios.&lt;/p>
&lt;h3 id="neurohealth">NeuroHealth&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: AI-Powered Health Assistant&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, GitHub, and LLMs&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>Large Language Models in Healthcare - Singhal et al., Nature 2023&lt;/li>
&lt;li>Med-PaLM: Large Language Models for Medical Question Answering - Singhal et al., arXiv 2022&lt;/li>
&lt;li>Capabilities of GPT-4 on Medical Challenge Problems - Nori et al., arXiv 2023&lt;/li>
&lt;li>MedlinePlus Medical Encyclopedia - &lt;a href="https://medlineplus.gov/" target="_blank" rel="noopener">https://medlineplus.gov/&lt;/a>&lt;/li>
&lt;li>Clinical Practice Guidelines Database - &lt;a href="https://www.guidelines.gov/" target="_blank" rel="noopener">https://www.guidelines.gov/&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Lynx Grader</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/autograder/</link><pubDate>Tue, 13 Jan 2026 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucsc/autograder/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">EduLinq Lynx Grader&lt;/a> (also referred to as &amp;ldquo;autograder&amp;rdquo;) is an open source tool used by several courses at UCSC
to safely and quickly grade programming assignments.
Grading student code is something that may seem simple at first (you just need to run their code!),
but quickly becomes exceedingly complex as you get more into the details.
Specifically, grading a student&amp;rsquo;s code securely while providing the &amp;ldquo;last mile&amp;rdquo; service of getting code from students
and sending results to instructors/TAs and the course&amp;rsquo;s LMS (e.g., Canvas) can be very difficult.
The Lynx Grader provides all of this in a free and open source project.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Lynx Grader.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2026 should fill out &lt;a href="https://forms.gle/Mr4YR3N35pWDb4uz7" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2026).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="llm-detection">LLM Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>AI/ML&lt;/code> &lt;code>LLM&lt;/code> &lt;code>Research&lt;/code> &lt;code>Backend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, systems, data munging, go, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener">Large Language Model (LLM)&lt;/a> tools like ChatGPT become more common and powerful,
instructors need tools to help determine if students are the actual authors of the code they submit.
More classical instances of plagiarism are often discovered by code similarity tools like &lt;a href="https://theory.stanford.edu/~aiken/moss/" target="_blank" rel="noopener">MOSS&lt;/a>.
However, these tools are not sufficient for detecting code written not by a student,
but by an AI model like &lt;a href="https://en.wikipedia.org/wiki/ChatGPT" target="_blank" rel="noopener">ChatGPT&lt;/a> or &lt;a href="https://en.wikipedia.org/wiki/GitHub_Copilot" target="_blank" rel="noopener">GitHub Copilot&lt;/a>.&lt;/p>
&lt;p>The task for this project is to create a system that provides a score indicating the system&amp;rsquo;s confidence that a given piece of code was written by an AI tool and not a student.
This will supplement the existing code analysis tools in the Lynx Grader.
There are many possible approaches to this task, and all will be considered.
A more software-development-oriented approach could leverage existing systems to build a production-ready detector,
whereas a more research-oriented approach could develop a novel method, complete with a paper and experiments.&lt;/p>
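&lt;p>As one hypothetical starting point (sketched in Python with a small GPT-2 stand-in purely for brevity; the grader itself is written in Go, and this is not the project&amp;rsquo;s chosen method), a classic baseline scores a submission by its perplexity under a language model, since machine-generated code tends to be unusually predictable to a related model:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for a code LM
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def perplexity(code: str) -> float:
    ids = tok(code, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token cross-entropy
    return float(torch.exp(loss))

# Lower perplexity = more "predictable" code. A threshold calibrated on
# known human and AI submissions would turn this into a confidence score.
print(perplexity("def add(a, b):\n    return a + b"))
&lt;/code>&lt;/pre>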
&lt;p>There has been &lt;a href="https://github.com/anvichip/AI-code-detection-ML/blob/main/experiment/report.md" target="_blank" rel="noopener">previous work on this issue&lt;/a>,
where a student did a survey of existing solutions, collection of initial datasets, and exploratory experiments on possible directions.
This project would build off of this previous work.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">Repository for Lynx Grader Server&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/140" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="code-analysis-gui">Code Analysis GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, data munging, js, css, go&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Lynx Grader has existing functionality to analyze the code in a student&amp;rsquo;s submission for malicious content.
Relevant to this project is that the Lynx Grader can run a pairwise similarity analysis against all submitted code.
This is how most existing software plagiarism systems detect offending code.
The existing infrastructure provides detailed statistics on code similarity,
but does not currently have a visual way to display this data.&lt;/p>
&lt;p>The task for this project is to create a web GUI using the Lynx Grader REST API
to display the results of a code analysis.
The size of this project depends on how many of the existing features are going to be supported by the web GUI.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Lynx Grader Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/142" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/internal/model/analysis.go#L78" target="_blank" rel="noopener">Pairwise Code Analysis Type&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py/blob/v0.6.16/tests/api/testdata/courses/assignments/analysis/courses_assignments_submissions_analysis_pairwise_wait.json" target="_blank" rel="noopener">Sample API Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="web-gui">Web GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, js, css&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre26@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre26@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Lynx Grader contains dozens of &lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">API endpoints&lt;/a>,
most directly representing a piece of functionality exposed to the user.
All of these features are exposed in the &lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Lynx Grader&amp;rsquo;s Python Interface&lt;/a>.
However, the Python interface is a purely command-line interface.
And although command-line interfaces are objectively (read: subjectively) the best,
a web GUI would be more accessible to a wider audience.
The autograder already has a &lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">web GUI&lt;/a>,
but it does not cover all the features available in the Lynx Grader.&lt;/p>
&lt;p>The task for this project is to augment the Lynx Grader&amp;rsquo;s web GUI with more features.
Specifically, add support for more tools used to create and administer courses.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Lynx Grader Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/61" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">Lynx Grader API Endpoints&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Lynx Grader&amp;rsquo;s Python Interface&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Final Report: A Systematic Investigation into the Reproducibility of RAG Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250905-wbq321/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250905-wbq321/</guid><description>&lt;p>I&amp;rsquo;m Baiqiang, and this is the final report for the &lt;a href="https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/" target="_blank" rel="noopener">Enhancing Reproducibility in RAG Frameworks for Scientific Workflows&lt;/a> project, mentored by Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo and Dongfang Zhao. This project successfully developed a novel framework to quantitatively measure reproducibility in AI systems, yielding several surprising and impactful results.&lt;/p>
&lt;h3 id="the-challenge-the-need-for-systematic-measurement">The Challenge: The Need for Systematic Measurement&lt;/h3>
&lt;p>Retrieval-Augmented Generation (RAG) is a cornerstone of AI for science, but its reliability is often compromised by non-determinism. While this issue was a known concern, a fundamental challenge was the lack of standardized tools and methodologies to systematically measure and quantify the sources of this inconsistency. Without a rigorous way to analyze the problem, it was difficult to move beyond ad-hoc tests and establish the true root causes, hindering the development of truly trustworthy AI systems for science.&lt;/p>
&lt;h3 id="our-contribution-the-reprorag-framework">Our Contribution: The ReproRAG Framework&lt;/h3>
&lt;p>To address this gap, the central contribution of this project is &lt;strong>ReproRAG&lt;/strong>, a comprehensive, open-source benchmarking framework. ReproRAG is designed to systematically investigate sources of uncertainty across the entire RAG pipeline by:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Isolating Variables:&lt;/strong> It allows for controlled experiments on embedding models, numerical precision, retrieval algorithms, hardware configurations (CPU/GPU), and distributed execution environments.&lt;/li>
&lt;li>&lt;strong>Quantifying Uncertainty:&lt;/strong> It employs a suite of metrics—including Exact Match Rate, Jaccard Similarity, and Kendall&amp;rsquo;s Tau—to precisely measure the impact of each variable on the final retrieved results (a generic sketch of these comparisons follows this list).&lt;/li>
&lt;/ul>
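&lt;p>The run-to-run comparison logic looks roughly like the sketch below (generic implementations of the metrics, not ReproRAG&amp;rsquo;s exact code):&lt;/p>
&lt;pre>&lt;code class="language-python">from scipy.stats import kendalltau

def exact_match(a, b):
    return float(a == b)  # 1.0 only if the ranked lists are identical

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa &amp; sb) / len(sa | sb)

run1 = ["doc3", "doc1", "doc7", "doc2"]  # top-k doc IDs, run 1
run2 = ["doc3", "doc1", "doc2", "doc7"]  # same query, run 2

print(exact_match(run1, run2))  # 0.0 - the ordering differs
print(jaccard(run1, run2))      # 1.0 - same set of documents retrieved

# Rank agreement over the shared documents.
common = [d for d in run1 if d in run2]
tau, _ = kendalltau(
    [run1.index(d) for d in common],
    [run2.index(d) for d in common],
)
print(tau)
&lt;/code>&lt;/pre>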
&lt;h3 id="key-findings-a-new-hierarchy-of-uncertainty">Key Findings: A New Hierarchy of Uncertainty&lt;/h3>
&lt;p>Our large-scale empirical study using ReproRAG challenged common assumptions and established a clear hierarchy of what actually impacts reproducibility.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Core Algorithms Are Not the Problem:&lt;/strong> Our most surprising finding is that modern retrieval libraries like FAISS are perfectly reproducible out-of-the-box. Across all tested index types (including approximate ones like HNSW and IVF) and execution environments (single-node CPU/GPU and multi-node distributed systems), we achieved perfect run-to-run reproducibility (1.000 scores on all metrics) when environmental factors like random seeds were controlled. This falsifies the common hypothesis that approximate nearest neighbor algorithms are a primary source of randomness.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Embedding Model Choice is a Dominant Source of Variation:&lt;/strong> We found that the choice of the embedding model is a dominant factor driving result variation. When comparing outputs from different state-of-the-art models (BGE, E5, Qwen) for the same query, the agreement was very low (e.g., Overlap Coefficient of ~0.43-0.54). This means a scientific conclusion drawn with one model may not be reproducible with another, as they are fundamentally &amp;ldquo;seeing&amp;rdquo; different evidence.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Environmental Factors Introduce Measurable &amp;ldquo;Drift&amp;rdquo;:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Numerical Precision:&lt;/strong> Changing floating-point precision (e.g., FP32 vs. FP16) was a guaranteed source of variation, but it caused a small and quantifiable &amp;ldquo;embedding drift&amp;rdquo; rather than chaotic changes.&lt;/li>
&lt;li>&lt;strong>Data Insertion:&lt;/strong> Incrementally adding new data to an index caused a predictable &amp;ldquo;displacement&amp;rdquo; of old results, not a re-shuffling. The relative ranking of the remaining original documents was perfectly stable (Kendall&amp;rsquo;s Tau of 1.000).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Common Determinism Flags Can Be Ineffective:&lt;/strong> Our tests showed that popular software-level controls, like &lt;code>cudnn.deterministic&lt;/code> flags in PyTorch, had no observable effect on the output of modern transformer-based embedding models. This underscores the necessity of empirical validation over assuming that framework settings work as advertised.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>This project successfully shifted the focus of the RAG reproducibility problem. The key challenge is not to fix supposedly &amp;ldquo;random&amp;rdquo; algorithms, but to rigorously control the entire experimental environment. We delivered &lt;strong>ReproRAG&lt;/strong>, a framework that empowers researchers to do just that. Our findings provide actionable insights for the community: efforts to improve reproducibility should focus less on the retrieval algorithms themselves and more on disciplined management of embedding models, data versioning, and numerical precision.&lt;/p></description></item><item><title>Mid-Term Report: Uncovering the True Sources of Non-Reproducibility in AI for Science</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250725-wbq321/</link><pubDate>Fri, 01 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250725-wbq321/</guid><description>&lt;p>Hello, I&amp;rsquo;m Baiqiang. I’m excited to share a mid-term update from the &lt;a href="https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/" target="_blank" rel="noopener">Enhancing Reproducibility in RAG Frameworks for Scientific Workflows&lt;/a> project. This journey, mentored by Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo and Dongfang Zhao, has taken a fascinating and unexpected turn, leading to a much deeper understanding of what it takes to build truly reliable AI for science.&lt;/p>
&lt;h3 id="the-search-for-an-invisible-bug">The Search for an Invisible Bug&lt;/h3>
&lt;p>As a quick recap, our project tackles the critical problem of &lt;strong>non-determinism&lt;/strong> in Retrieval-Augmented Generation (RAG) systems. For science to be trustworthy, it must be repeatable. If an AI system gives different answers to the same question, it fails this fundamental test. Our initial goal, outlined in my &lt;a href="https://www.overleaf.com/read/fcbxtpngdnhw#8cc2c8" target="_blank" rel="noopener">proposal&lt;/a>, was to find and fix the sources of this inconsistency, which we believed lay within the retrieval algorithms themselves.&lt;/p>
&lt;p>To do this, we built a comprehensive testing framework capable of running thousands of controlled experiments. We designed it to meticulously measure the consistency of retrieval results while varying everything from the indexing algorithm to the underlying hardware.&lt;/p>
&lt;h3 id="a-surprising-discovery-the-usual-suspect-is-innocent">A Surprising Discovery: The Usual Suspect is Innocent&lt;/h3>
&lt;p>The common wisdom in the community is that high-performance, approximate search libraries like FAISS are a major source of randomness. We put this to the test, running repeated queries against various index types, including complex ones like &lt;code>HNSW&lt;/code> and &lt;code>IndexIVF&lt;/code>.&lt;/p>
&lt;p>Our results were clear and surprising: &lt;strong>FAISS is remarkably reproducible out of the box.&lt;/strong> When run on a consistent hardware and software stack, it returns the exact same results, every single time. The library appears to have robust internal seed management that ensures deterministic behavior.&lt;/p>
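&lt;p>A condensed version of this determinism check, with toy sizes standing in for the real corpora:&lt;/p>
&lt;pre>&lt;code class="language-python">import faiss
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((10_000, 64)).astype("float32")  # corpus vectors
xq = rng.standard_normal((5, 64)).astype("float32")       # query vectors

index = faiss.IndexHNSWFlat(64, 32)  # HNSW index, M=32 links per node
index.add(xb)

_, ids1 = index.search(xq, 10)
_, ids2 = index.search(xq, 10)
print(np.array_equal(ids1, ids2))  # True in our runs: repeat queries match
&lt;/code>&lt;/pre>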
&lt;p>This finding was a pivotal moment. The non-reproducibility that researchers observe in practice is real, but it doesn&amp;rsquo;t come from where we expected. The problem isn&amp;rsquo;t the algorithm itself, but the environment it runs in. Our investigation immediately shifted to find the real culprits.&lt;/p>
&lt;h3 id="pinpointing-the-true-sources-of-non-determinism">Pinpointing the True Sources of Non-Determinism&lt;/h3>
&lt;p>Our framework quickly helped us identify the true sources of inconsistency:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Hardware-Induced Variation (CPU vs. GPU):&lt;/strong> This is the most significant factor. Running the exact same retrieval code can produce different document rankings and even different document sets when executed on a CPU versus a GPU. This is likely due to subtle differences in floating-point arithmetic and library optimizations in the hardware stack.&lt;/li>
&lt;li>&lt;strong>The Impact of Numerical Precision:&lt;/strong> We also confirmed that changing the floating-point precision of the data (e.g., from FP32 to FP16) can introduce small numerical variations that are just large enough to reorder the results, potentially changing the evidence the LLM receives (see the toy sketch after this list).&lt;/li>
&lt;/ol>
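&lt;p>The precision effect is easy to reproduce in miniature: cast the same vectors to FP16 and compare the top-k rankings. The data here is synthetic, purely for illustration.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(1)
emb32 = rng.standard_normal((1_000, 384)).astype(np.float32)
query = rng.standard_normal(384).astype(np.float32)

def topk(embs, q, k=10):
    return np.argsort(embs @ q)[::-1][:k]

ids32 = topk(emb32, query)
ids16 = topk(
    emb32.astype(np.float16).astype(np.float32),
    query.astype(np.float16).astype(np.float32),
)
# Rounding to FP16 perturbs scores just enough to reorder near-ties.
print(np.array_equal(ids32, ids16))
&lt;/code>&lt;/pre>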
&lt;h3 id="our-mission-refined-building-tools-for-environmental-control">Our Mission Refined: Building Tools for Environmental Control&lt;/h3>
&lt;p>This discovery has sharpened our project&amp;rsquo;s mission. The challenge is not to &amp;ldquo;fix&amp;rdquo; a supposedly random algorithm, but to develop the tools and best practices to control for the entire experimental environment. Our focus for the second half of the project is to:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Develop a Hardware-Aware Configuration Tracker:&lt;/strong> We are building a tool that goes beyond logging software versions. It will capture the critical details of the hardware environment—CPU/GPU model, CUDA version, etc.—and link them directly to an experiment&amp;rsquo;s results.&lt;/li>
&lt;li>&lt;strong>Create a Cross-Environment Validation Suite:&lt;/strong> Our open-source benchmarking suite will empower researchers to test their own pipelines. Crucially, it will help them identify and diagnose inconsistencies when moving workflows between different machines, such as from a local laptop to a cloud-based GPU.&lt;/li>
&lt;li>&lt;strong>Establish New Best Practices:&lt;/strong> We will distill our findings into clear, actionable guidance. The key recommendation is no longer just about choosing the right algorithm, but ensuring a consistent and well-documented hardware and software environment to guarantee reproducible outcomes.&lt;/li>
&lt;/ol>
&lt;p>By following the evidence, we’ve uncovered the root cause of a critical problem in AI-driven research. We are now developing the solutions needed to manage it, paving the way for a future where scientific discoveries powered by AI are built on a foundation of verifiable trust.&lt;/p></description></item><item><title>LLMSeqRec: LLM Enhanced Contextual Sequential Recommender</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/</link><pubDate>Tue, 22 Jul 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/</guid><description>&lt;h1 id="midway-through-osre">Midway Through OSRE&lt;/h1>
&lt;h2 id="my-journey-with-llmseqrec">My Journey with LLMSeqRec&lt;/h2>
&lt;h3 id="hello-from-the-midpoint">Hello from the Midpoint!&lt;/h3>
&lt;p>Hi everyone! I’m Connor Lee, a student at NYU studying Computer Science and Mathematics, and I’m excited to share the progress I’ve made halfway through the Open Source Research Experience (OSRE) with my project: &lt;strong>LLMSeqRec&lt;/strong> – a large language model-enhanced sequential recommender system.&lt;/p>
&lt;p>Over the past several weeks, I’ve had the opportunity to explore the intersection of recommender systems and large language models (LLMs), and it’s been a deep, challenging, and rewarding dive into building smarter, more contextual recommendation engines.&lt;/p>
&lt;hr>
&lt;h3 id="what-is-llmseqrec">What is LLMSeqRec?&lt;/h3>
&lt;p>&lt;strong>LLMSeqRec&lt;/strong> stands for &lt;strong>LLM-Enhanced Contextual Sequential Recommender&lt;/strong>. Traditional sequential recommendation systems like SASRec are great at capturing patterns from user-item interactions, but they often fall short in two areas: understanding &lt;strong>semantic context&lt;/strong> (e.g., item descriptions, reviews) and dealing with &lt;strong>cold-start&lt;/strong> problems.&lt;/p>
&lt;p>LLMSeqRec aims to address this by incorporating &lt;strong>pretrained LLM embeddings&lt;/strong> into the recommendation pipeline. The goal is to enhance models like SASRec with semantic signals from text (like product reviews or titles), allowing them to better model user intent and long-range dependencies, and to generalize to new items or users.&lt;/p>
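&lt;p>One minimal way to picture that integration: keep the learned item-ID embeddings and add a projection of frozen, pre-computed LLM text embeddings on top. The sketch below is only illustrative; the additive fusion, the dimensions, and the random stand-in embeddings are assumptions, not the project&amp;rsquo;s final architecture.&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

class FusedItemEmbedding(nn.Module):
    """Learned ID embedding plus a projection of a frozen LLM text embedding."""
    def __init__(self, num_items, id_dim, text_emb):
        super().__init__()
        self.id_emb = nn.Embedding(num_items, id_dim)
        self.register_buffer("text_emb", text_emb)  # frozen semantic signal
        self.proj = nn.Linear(text_emb.size(1), id_dim)

    def forward(self, item_ids):
        return self.id_emb(item_ids) + self.proj(self.text_emb[item_ids])

# Toy usage: 1000 items, 64-d ID embeddings, 384-d text embeddings.
emb = FusedItemEmbedding(1000, 64, torch.randn(1000, 384))
print(emb(torch.tensor([[1, 2, 3]])).shape)  # torch.Size([1, 3, 64])
&lt;/code>&lt;/pre>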
&lt;hr>
&lt;h3 id="progress-so-far">Progress So Far&lt;/h3>
&lt;h4 id="-baseline-sasrec-runs">✅ Baseline SASRec Runs&lt;/h4>
&lt;p>To establish a benchmark, I successfully ran the original SASRec implementation (in PyTorch) using both the &lt;strong>MovieLens 1M&lt;/strong> and &lt;strong>Amazon Beauty&lt;/strong> datasets. After debugging initial data formatting issues and adjusting batch sizes for local CPU/GPU compatibility, I automated training with scripts that let me scale to &lt;strong>200+ epochs&lt;/strong> to achieve the best performance, both in Colab and on my MacBook via CPU.&lt;/p>
&lt;p>&lt;strong>Note:&lt;/strong> At this stage, we have not yet integrated LLMs into the model. These baseline runs (SASRec) serve as the control group for evaluating the future impact of LLM-based enhancements.&lt;/p>
&lt;hr>
&lt;h3 id="whats-next">What’s Next&lt;/h3>
&lt;p>As I enter the second half of the OSRE, I’ll be shifting gears toward &lt;strong>LLM integration, model evaluation, and running LLM-powered sequential recommendations using product metadata and contextual information&lt;/strong>. Here&amp;rsquo;s what’s ahead:&lt;/p>
&lt;ul>
&lt;li>Designing pipelines to extract and align textual metadata with item sequences&lt;/li>
&lt;li>Integrating LLM-generated embeddings into the recommender model&lt;/li>
&lt;li>Evaluating performance changes across different dataset characteristics&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h3 id="-experimental-results">📊 Experimental Results&lt;/h3>
&lt;p>We have &lt;strong>not yet utilized LLMs&lt;/strong> in our current experiments. The results below reflect our &lt;strong>reproduced baseline performance of SASRec&lt;/strong> across datasets.&lt;/p>
&lt;p>Below are the &lt;strong>performance curves on different test sets&lt;/strong>, where we evaluate model performance every 20 epochs during training:&lt;/p>
&lt;h4 id="beauty-dataset-performance">Beauty Dataset Performance&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Beauty Hit@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_6d3cf991cc5172e392edbb398afef774.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_91a98a3d515a172aed7283ab8b04a8b6.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/beauty-hr_hu655ec71a9ef1f87543ab22378365f6fe_152488_6d3cf991cc5172e392edbb398afef774.webp"
width="760"
height="497"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Hit@10 performance on the test set for the Beauty dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Beauty Loss Training" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_f4e0cc23660b4c974056c8b5d603c0ca.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_7c62f735e3e920d3561bd9113c662533.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/beauty-loss-epoch_huc2cddabd12f6ed04444e319cba850bc9_141963_f4e0cc23660b4c974056c8b5d603c0ca.webp"
width="760"
height="489"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Training loss for the Beauty dataset&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Beauty NDCG@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_a1a39dc055b888f5de47c25c87ccf913.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_3e4c7d0050bef8ec9f8f7928c2c6c7af.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/beauty-ndcg_hu4bef43ef38566a5009aa70da37ebbc50_151414_a1a39dc055b888f5de47c25c87ccf913.webp"
width="760"
height="483"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>NDCG@10 performance on the test set for the Beauty dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;h4 id="ml-1m-dataset-performance">ML-1M Dataset Performance&lt;/h4>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ML-1M Loss Training" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_828aa4c04e00024c863cb89e245d358a.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_d913a345a32ce7ac5bcff72438283a01.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/m1-m1-loss-epoch_hua4b125e87ed4debb93bde68ff9b86489_146604_828aa4c04e00024c863cb89e245d358a.webp"
width="760"
height="490"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Training loss for the ML-1M dataset&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ML-1M Hit@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_8e8f20a29b2657093b23e780efd1d072.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_257879da50059e5cc3e64fd8ed1d9d72.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-hr_huaac170547624f58b168df1545691a3d4_153677_8e8f20a29b2657093b23e780efd1d072.webp"
width="760"
height="484"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>Hit@10 performance on the test set for the ML-1M dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="ML-1M NDCG@10 Performance" srcset="
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_dfd4334fbae2a7067cf9f91b1595e36b.webp 400w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_271754cbcc6eac53f84162a93b670d17.webp 760w,
/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250722-connor/ml-m1-ndcg_huad2935749c06fc72562e3df395457d92_144728_dfd4334fbae2a7067cf9f91b1595e36b.webp"
width="760"
height="488"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;em>NDCG@10 performance on the test set for the ML-1M dataset (every 20 epochs)&lt;/em>&lt;/p>
&lt;p>These results demonstrate that our &lt;strong>baseline SASRec reproductions&lt;/strong> are converging as expected and will serve as a solid foundation for comparison once LLM integration is complete.&lt;/p>
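&lt;p>For readers unfamiliar with the metrics in these plots, here is a minimal sketch of how Hit@10 and NDCG@10 are typically computed in SASRec-style evaluation, where the held-out item is ranked against sampled negatives. The example ranks are made up.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def hit_and_ndcg_at_k(rank, k=10):
    """Metrics for one test user; `rank` is the 0-indexed position of the
    held-out item among (1 positive + N sampled negatives)."""
    if rank &lt; k:
        return 1.0, 1.0 / np.log2(rank + 2)
    return 0.0, 0.0

# Toy example: ranks of the true next item for five test users.
ranks = [0, 3, 12, 7, 1]
hits, ndcgs = zip(*(hit_and_ndcg_at_k(r) for r in ranks))
print("Hit@10 :", np.mean(hits))   # 0.8
print("NDCG@10:", np.mean(ndcgs))
&lt;/code>&lt;/pre>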
&lt;hr>
&lt;h3 id="closing-thoughts">Closing Thoughts&lt;/h3>
&lt;p>This project has been an exciting journey into both research and engineering, and I’m eager to explore &lt;strong>LLM-powered embedding integration&lt;/strong> in the upcoming phase.&lt;/p>
&lt;p>I’m incredibly grateful to my mentors &lt;strong>Dr. Linsey Pang and Dr. Bin Dong&lt;/strong> for their support and guidance throughout the project so far. I’m looking forward to sharing more technical results as we work toward building smarter, more adaptable recommender systems.&lt;/p></description></item><item><title>Enhancing Reproducibility in RAG Frameworks for Scientific Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250625-wbq321/</link><pubDate>Wed, 25 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250625-wbq321/</guid><description>&lt;p>Hello, I&amp;rsquo;m Baiqiang. As part of the &lt;a href="https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/" target="_blank" rel="noopener">Enhancing Reproducibility in RAG Frameworks for Scientific Workflows&lt;/a> project, I am excited to introduce my work on a crucial challenge in modern computational science. My &lt;a href="https://www.overleaf.com/read/fcbxtpngdnhw#8cc2c8" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo at Pacific Northwest National Laboratory and Dongfang Zhao at the University of Washington aims to enhance the reproducibility of AI-driven scientific workflows.&lt;/p>
&lt;h3 id="the-problem-a-crisis-of-confidence-in-ai-for-science">The Problem: A Crisis of Confidence in AI for Science&lt;/h3>
&lt;p>Large Language Models (LLMs) are transforming scientific research, from accelerating literature reviews to generating novel hypotheses. However, their power is matched by their pitfalls: a tendency to &amp;ldquo;hallucinate&amp;rdquo; facts and a lack of transparency. Retrieval-Augmented Generation (RAG) was developed as a powerful solution, grounding LLM outputs in factual evidence retrieved from a specific knowledge base (like a database of scientific papers).&lt;/p>
&lt;p>But a hidden problem lurks within RAG: &lt;strong>non-determinism&lt;/strong>. The very first step of a RAG system—the similarity search that finds relevant documents—can produce different results even when asked the same question. Variations in indexing algorithms, data updates, or even the underlying software can change which documents are retrieved. For science, this is a critical flaw. If an experiment cannot be repeated with the same results, its conclusions cannot be trusted. This project tackles that challenge head-on.&lt;/p>
&lt;h3 id="our-mission-forging-a-path-to-reproducible-rag">Our Mission: Forging a Path to Reproducible RAG&lt;/h3>
&lt;p>This project proposes a comprehensive solution to systematically identify, measure, and mitigate non-determinism in RAG frameworks. Our goal is to empower researchers to build and use AI tools with confidence.&lt;/p>
&lt;p>Our approach is built on four key pillars:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Systematic Analysis:&lt;/strong> We will conduct a deep dive into popular RAG components (like FAISS, ScaNN, and HNSW) to pinpoint the exact sources of randomness and variability.&lt;/li>
&lt;li>&lt;strong>Rigorous Benchmarking:&lt;/strong> We will develop a public, open-source benchmarking suite using standardized scientific datasets (from PubMed, arXiv, etc.). This will allow anyone to quantitatively measure the reproducibility of their own RAG pipeline using clear metrics like retrieval overlap and rank correlation (a minimal sketch follows this list).&lt;/li>
&lt;li>&lt;strong>Targeted Enhancements:&lt;/strong> Based on our findings, we will implement practical solutions, including:
&lt;ul>
&lt;li>Promoting deterministic algorithms and configurations.&lt;/li>
&lt;li>Building robust data versioning and provenance tracking tools (inspired by DVC and Git LFS).&lt;/li>
&lt;li>Creating tools for precise configuration management to capture the entire experimental setup.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Practical Guidance and Open Source Tools:&lt;/strong> We will distill our insights into comprehensive documentation, reusable code examples, and best practices. All tools and findings will be contributed back to the open-source community.&lt;/li>
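&lt;/ol>
&lt;p>To illustrate the second pillar, here is a minimal sketch of the two metrics named above, computed over two retrieval runs of the same query. The document IDs are invented for the example, and the actual suite will treat edge cases (empty overlap, ties) more carefully.&lt;/p>
&lt;pre>&lt;code class="language-python">from scipy.stats import kendalltau

def retrieval_overlap(run_a, run_b):
    """Jaccard overlap between two retrieved ID lists."""
    a, b = set(run_a), set(run_b)
    return len(a &amp; b) / len(a | b)

def rank_correlation(run_a, run_b):
    """Kendall's tau over the documents retrieved by both runs."""
    shared = [d for d in run_a if d in set(run_b)]
    ranks_in_b = [run_b.index(d) for d in shared]
    tau, _ = kendalltau(range(len(shared)), ranks_in_b)
    return tau

run_cpu = [17, 4, 93, 2, 56]  # doc IDs from a CPU run
run_gpu = [17, 93, 4, 2, 61]  # same query, same index, on a GPU
print(retrieval_overlap(run_cpu, run_gpu))  # 0.666...
print(rank_correlation(run_cpu, run_gpu))
&lt;/code>&lt;/pre>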
</description></item><item><title>LLMSeqRec: LLM Enhanced Contextual Sequential Recommender</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250614-connor/</link><pubDate>Fri, 06 Jun 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/sf/llmseqrec/20250614-connor/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains. Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning.
By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on vanilla sequential ID orders.
To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Data Preprocessing &amp;amp; Feature Creation&lt;/strong>:
Develop a data processing pipeline to parse users’ sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling; Extract user behavior sequences, items’ metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing; The data source can be Amazon open public data or the MovieLens dataset. The data point creation can follow SASRec (reference 1); a minimal split sketch follows this list.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Model Development&lt;/strong>:
Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context; Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions and reviews, into the sequential recommendation process; The baseline model can be the SASRec PyTorch implementation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation&lt;/strong>:
Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating on accuracy, NDCG and cold-start performance; Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality; Optimize model inference speed and efficiency for real-time recommendation scenarios.&lt;/p>
&lt;/li>
&lt;/ul>
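&lt;p>As referenced in Step 1, here is a minimal sketch of the SASRec-style leave-one-out split, where each user&amp;rsquo;s last interaction becomes the test target and the second-to-last the validation target. The tuple layout and demo data are illustrative assumptions.&lt;/p>
&lt;pre>&lt;code class="language-python">from collections import defaultdict

def build_splits(interactions):
    """interactions: iterable of (user_id, item_id, timestamp) tuples."""
    seqs = defaultdict(list)
    for user, item, ts in sorted(interactions, key=lambda x: x[2]):
        seqs[user].append(item)  # per-user sequence in time order
    train, valid, test = {}, {}, {}
    for user, items in seqs.items():
        if len(items) &lt; 3:  # too short to split; train on everything
            train[user] = items
            continue
        train[user] = items[:-2]
        valid[user] = items[-2]
        test[user] = items[-1]
    return train, valid, test

demo = [(1, "a", 10), (1, "b", 20), (1, "c", 30), (1, "d", 40)]
print(build_splits(demo))  # ({1: ['a', 'b']}, {1: 'c'}, {1: 'd'})
&lt;/code>&lt;/pre>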
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: software, model training with validation and performance evaluation, and a demo. The software implementing the LLMSeqRec model described above will be hosted on GitHub as an open-access repository. The evaluation results and demo will be published alongside the GitHub repo.&lt;/p>
&lt;h3 id="llmseqrec">LLMSeqRec&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: LLM Enhanced Contextual Sequential Recommender&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, PyTorch, GitHub, Self-attention, Transformer&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>Self-Attentive Sequential Recommendation (SASRec)&lt;/li>
&lt;li>BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer&lt;/li>
&lt;li>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/li>
&lt;li>Amazon Dataset: &lt;a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews" target="_blank" rel="noopener">https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews&lt;/a>&lt;/li>
&lt;li>MovieLens Data: &lt;a href="https://grouplens.org/datasets/movielens/" target="_blank" rel="noopener">https://grouplens.org/datasets/movielens/&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>I&amp;rsquo;m Connor, a student at NYU studying CS and Math. This summer I&amp;rsquo;ve gotten the opportunity to work on LLMSeqRec under Dr. Bin Dong and Dr. Linsey Pang.&lt;/p>
&lt;p>In today’s digital age, sequential recommender systems power everything from e-commerce suggestions to personalized content everywhere. However, traditional models fall short in capturing user intent, adapting to dynamic behavior, or tackling cold-start problems. That’s where LLMSeqRec comes in.&lt;/p>
&lt;h2 id="problem-statement">Problem Statement&lt;/h2>
&lt;p>Most sequential recommender systems rely heavily on historical user-item interactions and predefined embeddings. This approach limits their ability to understand nuanced user preferences, struggles to scale across domains, and performs poorly in scenarios like new users or sparse data. The absence of semantic and contextual modeling is a major gap in current solutions.&lt;/p>
&lt;h2 id="overview-of-project">Overview of project&lt;/h2>
&lt;p>LLMSeqRec is a novel, LLM-enhanced sequential recommender framework that bridges this gap. By leveraging large language models (LLMs), it incorporates semantic embeddings and prompt-based contextual modeling to understand both user behavior and item metadata at a deeper level. The system explores two core approaches:&lt;/p>
&lt;ul>
&lt;li>Embedding-based: LLMs generate embeddings from item attributes.&lt;/li>
&lt;li>Prompt-based: LLMs receive full transaction history in natural language format and infer recommendations.&lt;/li>
&lt;/ul>
&lt;p>These techniques are tested using well-known datasets (e.g., Amazon, MovieLens), and evaluated with ranking metrics like NDCG@10 and Hit@10. The goal: deliver more accurate, context-rich, and explainable recommendations.&lt;/p>
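&lt;p>For the prompt-based approach, a rough sketch of what a rendered input might look like is below; the wording and item names are purely illustrative and not the project&amp;rsquo;s final prompt template.&lt;/p>
&lt;pre>&lt;code class="language-python">def build_prompt(history, candidates):
    """Render interaction history as natural language for an LLM to rank."""
    lines = [f"A user interacted with, in order: {', '.join(history)}."]
    lines.append("Rank the following candidates by how likely the user "
                 "is to interact with them next:")
    lines += [f"{i + 1}. {c}" for i, c in enumerate(candidates)]
    return "\n".join(lines)

print(build_prompt(
    ["wireless mouse", "mechanical keyboard", "USB hub"],
    ["monitor arm", "laptop stand", "desk mat"]))
&lt;/code>&lt;/pre>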
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>The project is currently progressing through stages including model training, embedding integration, and evaluation. Upcoming tasks include:&lt;/p>
&lt;ul>
&lt;li>Fine-tuning enhanced models&lt;/li>
&lt;li>Designing zero-/few-shot prompts&lt;/li>
&lt;li>Running comparative experiments&lt;/li>
&lt;li>Publishing findings and writing technical blogs&lt;/li>
&lt;/ul>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/LLMSeqRec">LLMSeqRec&lt;/a> project, I am working on my &lt;a href="https://drive.google.com/file/d/1cs9lsjacSJUbXWzTfcHIukfKFwKJjUZF/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of Dr. Bin Dong and Dr. Linsey Pang.&lt;/p></description></item><item><title>Enhancing Reproducibility in RAG Frameworks for Scientific Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</link><pubDate>Thu, 20 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</guid><description>&lt;p>Retrieval-Augmented Generation (RAG) frameworks, which merge the capabilities of retrieval systems and generative models, significantly enhance the relevance and accuracy of responses produced by large language models (LLMs). These frameworks retrieve relevant documents from a large corpus and use these documents to inform the generative process, thereby improving the contextuality and precision of the generated content. Ensuring reproducibility in data queries using similarity search within these RAG frameworks is critical for maintaining the reliability and consistency of scientific workflows. Reproducibility ensures that the same input query consistently yields the same output, which is vital for scientific tasks that rely on precise and repeatable results. Inconsistencies can arise from various sources, affecting the trustworthiness of scientific outcomes. Differences in retrieval algorithms can lead to variable sets of documents being retrieved for the same query. Variations in data indexing methods can cause inconsistencies in how documents are ranked and accessed. The stochastic nature of LLM operations introduces an element of randomness in the generative process. Updates in datasets can also alter the baseline against which queries are processed and interpreted, leading to different results over time.&lt;/p>
&lt;p>This proposal aims to address these reproducibility challenges in similarity searches within RAG frameworks. This work involves analyzing the root causes of non-determinism, benchmarking and validating the consistency of query results, implementing enhancements to minimize variability, and developing tools and best practices to ensure reproducibility. Reproducibility in data queries can be influenced by several factors, including updates in datasets, differences in retrieval algorithms, varying data indexing methods, and the stochastic nature of LLM operations. Each of these factors can cause variability in the documents retrieved and in the generated responses. Ensuring consistency in query results across different runs is crucial for maintaining the integrity of LLM-driven scientific research, allowing researchers to confidently build upon prior work and achieve reliable, trustworthy outcomes.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: (1) Identifying sources of non-determinism and variability, such as algorithmic differences and indexing methods, in RAG; (2) Utilizing standardized scientific datasets to benchmark the reproducibility of similarity search results across different RAG frameworks; (3) Establishing protocols for handling dataset updates to ensure that such changes do not impact the reproducibility of similarity search results; and (4) Implementing mechanisms to track and document updates to datasets, ensuring that changes are reflected consistently across all instances of the RAG framework. By addressing these areas, the proposed work aims to mitigate challenges related to reproducibility in similarity search queries within RAG frameworks, ultimately enhancing the reliability and trustworthiness of scientific research outcomes.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>LLM&lt;/code> &lt;code>RAG&lt;/code> &lt;code>Scientific Workflows&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Autograder</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</link><pubDate>Thu, 06 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/autograder/</guid><description>&lt;p>The &lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">EduLinq Autograder&lt;/a> is an open source tool used by several courses at UCSC
to safely and quickly grade programming assignments.
Grading student code is something that may seem simple at first (you just need to run their code!),
but quickly becomes exceedingly complex as you get into the details.
Specifically, grading a student&amp;rsquo;s code securely while providing the &amp;ldquo;last mile&amp;rdquo; service of getting code from students
and sending results to instructors/TAs and the course&amp;rsquo;s LMS (e.g., Canvas) can be very difficult.
The Autograder provides all of this in a free and open source project.
The &lt;a href="https://linqs.org" target="_blank" rel="noopener">LINQS Lab&lt;/a> has made many contributions to the maintain and improve the Autograder.&lt;/p>
&lt;p>As an open source project, there are endless opportunities for development, improvements, and collaboration.
Here, we highlight some specific projects that will work well in the summer mentorship setting.&lt;/p>
&lt;p>All students interested in LINQS projects for OSRE/GSoC 2025 should fill out &lt;a href="https://forms.gle/RxGqnQiCDeHSX6tq6" target="_blank" rel="noopener">this form&lt;/a>.
Towards the end of the application window, we will contact those who we believe to be a good fit for a LINQS project.
The form will stop accepting responses once the application window closes.
Do not post on any of the project repositories about OSRE/GSoC
(e.g., comment on an issue that you want to tackle it as a part of OSRE/GSoC 2025).
Remember, these are active repositories that were not created for OSRE/GSoC.&lt;/p>
&lt;h3 id="llm-detection">LLM Detection&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>AI/ML&lt;/code> &lt;code>LLM&lt;/code> &lt;code>Research&lt;/code> &lt;code>Backend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, backend, systems, data munging, go, docker&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Challenging&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>As &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" target="_blank" rel="noopener">Large Language Model (LLM)&lt;/a> tools like ChatGPT become more common and powerful,
instructors need tools to help determine if students are the actual authors of the code they submit.
More classical instances of plagiarism are often discovered by code similarity tools like &lt;a href="https://theory.stanford.edu/~aiken/moss/" target="_blank" rel="noopener">MOSS&lt;/a>.
However, these tools are not sufficient for detecting code written not by a student,
but by an AI model like &lt;a href="https://en.wikipedia.org/wiki/ChatGPT" target="_blank" rel="noopener">ChatGPT&lt;/a> or &lt;a href="https://en.wikipedia.org/wiki/GitHub_Copilot" target="_blank" rel="noopener">GitHub Copilot&lt;/a>.&lt;/p>
&lt;p>The task for this project is to create a system that provides a score indicating the system&amp;rsquo;s confidence that a given piece of code was written by an AI tool and not a student.
This will supplement the existing code analysis tools in the Autograder.
There are many possible approaches to this task that will be considered.
A more software-development-oriented approach can consist of leveraging existing systems to create a production-ready tool,
whereas a more research-oriented approach can consist of creating a novel method complete with a paper and experiments.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server" target="_blank" rel="noopener">Repository for Autograder Server&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/140" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="code-analysis-gui">Code Analysis GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, data munging, js, css, go&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder has existing functionality to analyze the code in a student&amp;rsquo;s submission for malicious content.
Relevant to this project is that the Autograder can run a pairwise similarity analysis against all submitted code.
This is how most existing software plagiarism systems detect offending code.
The existing infrastructure provides detailed statistics on code similarity,
but does not currently have a visual way to display this data.&lt;/p>
&lt;p>The task for this project is to create a web GUI using the Autograder REST API
to display the results of a code analysis.
The size of this project depends on how many of the existing features are going to be supported by the web GUI.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/142" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/internal/model/analysis.go#L78" target="_blank" rel="noopener">Pairwise Code Analysis Type&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py/blob/main/tests/api/testdata/courses/assignments/submit/analysis/course_assignments_submissions_analysis_pairwise_wait.json" target="_blank" rel="noopener">Sample API Data&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="web-gui">Web GUI&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Frontend&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, frontend, js, css&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Easy&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:linqs.osre25@gmail.com">Eriq Augustine&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Fabrice Kurmann&lt;/a>, &lt;a href="mailto:linqs.osre25@gmail.com">Lise Getoor&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The Autograder contains dozens of &lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">API endpoints&lt;/a>,
most directly representing a piece of functionality exposed to the user.
All of these features are exposed in the &lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>.
However, the Python interface is purely command-line.
And although command-line interfaces are objectively (read: subjectively) the best,
a web GUI would be more accessible to a wider audience.
The Autograder already has a web GUI,
but it does not cover all the features available in the Autograder.&lt;/p>
&lt;p>The task for this project is to augment the Autograder&amp;rsquo;s web GUI with more features.
Specifically, add support for more tools used to create and administer courses.&lt;/p>
&lt;p>See Also:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-web" target="_blank" rel="noopener">Repository for Autograder Web GUI&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/issues/61" target="_blank" rel="noopener">GitHub Issue&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-server/blob/main/resources/api.json" target="_blank" rel="noopener">Autograder API Endpoints&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/edulinq/autograder-py" target="_blank" rel="noopener">Autograder&amp;rsquo;s Python Interface&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>LLMSeqRec: LLM Enhanced Contextual Sequential Recommender</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/llmseqrec/</link><pubDate>Thu, 06 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/sf/llmseqrec/</guid><description>&lt;h3 id="project-description">Project Description&lt;/h3>
&lt;p>Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains. Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning.
By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.&lt;/p>
&lt;h3 id="project-objectives">Project Objectives&lt;/h3>
&lt;p>Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on vanilla sequential ID orders.
To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Step 1: Data Preprocessing &amp;amp; Feature Creation&lt;/strong>:
Develop a data processing pipeline to parse users’ sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling; Extract user behavior sequences, items’ metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing; The data source can be Amazon open public data or the MovieLens dataset. The data point creation can follow SASRec (reference 1).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 2: Model Development&lt;/strong>:
Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context; Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions and reviews, into the sequential recommendation process; The baseline model can be the SASRec PyTorch implementation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Step 3: Evaluation&lt;/strong>:
Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating on accuracy, NDCG and cold-start performance; Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality; Optimize model inference speed and efficiency for real-time recommendation scenarios.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h3 id="project-deliverables">Project Deliverables&lt;/h3>
&lt;p>This project will deliver three components: software, model training with validation and performance evaluation, and a demo. The software implementing the LLMSeqRec model described above will be hosted on GitHub as an open-access repository. The evaluation results and demo will be published alongside the GitHub repo.&lt;/p>
&lt;h3 id="llmseqrec">LLMSeqRec&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: LLM Enhanced Contextual Sequential Recommender&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Proficiency in Python, PyTorch, GitHub, Self-attention, Transformer&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Difficult&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/linsey-pang/">Linsey Pang&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bin-dong/">Bin Dong&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="references">References:&lt;/h3>
&lt;ul>
&lt;li>Self-Attentive Sequential Recommendation (SASRec)&lt;/li>
&lt;li>BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer&lt;/li>
&lt;li>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks&lt;/li>
&lt;li>Amazon Dataset: &lt;a href="https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews" target="_blank" rel="noopener">https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews&lt;/a>&lt;/li>
&lt;li>MovieLens Data: &lt;a href="https://grouplens.org/datasets/movielens/" target="_blank" rel="noopener">https://grouplens.org/datasets/movielens/&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>OpenROAD - An Open-Source, Autonomous RTL-GDSII Flow for Chip Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/openroad/openroad/</link><pubDate>Sun, 19 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/openroad/openroad/</guid><description>&lt;p>The &lt;a href="https://theopenroadproject.org" target="_blank" rel="noopener">OpenROAD&lt;/a> project is a non-profit project, originally funded by DARPA with the aim of creating open-source EDA tools: an autonomous RTL-to-GDSII flow that completes in &amp;lt; 24 hrs, to lower cost and boost innovation in IC design. This project is now supported by &lt;a href="precisioninno.com">Precision Innovations&lt;/a>.&lt;/p>
&lt;p>OpenROAD scales massively, supports EWD (Education and Workforce Development), and sustains a broad ecosystem, making it a vital tool for a rapidly growing semiconductor industry.&lt;/p>
&lt;p>OpenROAD is the fastest on-ramp to gain knowledge and skills and to create pathways to great career opportunities in chip design. You will develop important software and hardware design skills by contributing to these interesting projects. You will also have the opportunity to work with mentors from the OpenROAD project and other industry experts.&lt;/p>
&lt;p>We welcome a diverse community of designers, researchers, enthusiasts, software engineers and entrepreneurs to use and contribute to OpenROAD and make a far-reaching impact in the rapidly growing, global Semiconductor Industry.&lt;/p>
&lt;h3 id="improving-code-quality-in-openroad">Improving Code Quality in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Coding Best Practices in C++&lt;/code>, &lt;code>Code Quality Tooling&lt;/code>, &lt;code>Continuous Integration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/arthur-koucher/">Arthur Koucher&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>OpenROAD is a large and complex program. This project is to improve the code quality through resolving issues flagged by tools like Coverity and clang-tidy. New tools like the clang sanitizers ASAN/TSAN/UBSAN should also be set up and integrated with the Jenkins CI.&lt;/p>
&lt;h3 id="gui-testing-in-openroad">GUI Testing in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Testing&lt;/code>, &lt;code>Continuous Integration&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Qt&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/matt-liberty/">Matt Liberty&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/peter-gadfort/">Peter Gadfort&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The OpenROAD GUI is a crucial set of functionality for users to see and investigate their design. GUI testing is specialized and rather different from standard unit testing. The GUI therefore needs improvements to its testing to cover both interaction and rendering. The GUI uses the Qt framework. An open-source testing tool like &lt;a href="https://github.com/faaxm/spix" target="_blank" rel="noopener">https://github.com/faaxm/spix&lt;/a> will be set up and key tests developed. This will provide the framework for all future testing.&lt;/p>
&lt;h3 id="rectilinear-floorplans-in-openroad">Rectilinear Floorplans in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Electronic Design Automation&lt;/code>, &lt;code>Algorithms&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, data structures and algorithms&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/eder-monteiro/">Eder Monteiro&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/augusto-berndt/">Augusto Berndt&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>OpenROAD supports block floorplans that are rectangular in shape. Some designs may require more complex shapes to fit. This project extends the tool to support rectilinear polygon shapes as floorplans. This will require upgrading data structures and algorithms in various parts of OpenROAD including floor plan generation, pin placement, and global placement.&lt;/p>
&lt;h3 id="lef-reader-and-database-enhancements-in-openroad">LEF Reader and Database Enhancements in OpenROAD&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Electronic Design Automation&lt;/code>, &lt;code>Database&lt;/code>, &lt;code>Parsing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Boost Spirit parsers, Database, C++&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/osama-hammad/">Osama Hammad&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ethan-mahintorabi/">Ethan Mahintorabi&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>LEF (Library Exchange Format) is a standard format for describing physical design rules for integrated circuits. OpenROAD supports many constructs, but some newer ones for advanced process nodes are not yet handled. This project is to support parsing such information and storing it in the OpenDB for use by the rest of the tool.&lt;/p>
&lt;h3 id="orassistant---llm-data-engineering-and-testing">ORAssistant - LLM Data Engineering and Testing&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Data Engineering&lt;/code>, &lt;code>Model Deployment&lt;/code>, &lt;code>Testing&lt;/code>, &lt;code>Full-Stack Development&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, database, evaluation, CI/CD, open-source or related software development, full-stack&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project is aimed at enhancing robustness and accuracy for &lt;a href="https://woset-workshop.github.io/PDFs/2024/11_ORAssistant_A_Custom_RAG_ba.pdf" target="_blank" rel="noopener">OR Assistant&lt;/a>, the &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">conversational assistant for OpenROAD&lt;/a>, through comprehensive testing and evaluation. You will work with members of the OpenROAD team and other researchers to expand the existing dataset to cover a wide range of use cases and deliver accurate responses more efficiently. This project will focus on data engineering and benchmarking, and you will collaborate on a tandem project on LLM model engineering. Tasks include: creating evaluation pipelines, building databases to gather feedback, improving CI/CD, writing documentation, and improving the backend and frontend services as needed (non-exhaustive). You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.&lt;/p>
&lt;h3 id="orassistant---llm-model-engineering">ORAssistant - LLM Model Engineering&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Model&lt;/code>, &lt;code>Machine Learning&lt;/code>, &lt;code>Model Architecture&lt;/code>, &lt;code>Model Deployment&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: large language model engineering, prompt engineering, fine-tuning&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium (175 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jack-luar/">Jack Luar&lt;/a> &amp;amp; &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/palaniappan-r/">Palaniappan R&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>This project is aimed at enhancing robustness and accuracy for &lt;a href="https://woset-workshop.github.io/PDFs/2024/11_ORAssistant_A_Custom_RAG_ba.pdf" target="_blank" rel="noopener">OR Assistant&lt;/a>, the &lt;a href="https://github.com/The-OpenROAD-Project/ORAssistant" target="_blank" rel="noopener">conversational assistant for OpenROAD&lt;/a> through enhanced model architectures. You will work with members of the OpenROAD team and other researchers to explore alternate architectures beyond the existing RAG-based implementation. This project will focus on improving reliability and accuracy of the existing model architecture. You will collaborate on a tandem project on data engineering for OR assistant. Tasks include: reviewing and understanding the state-of-the-art in retrieval augmented generation, implementing best practices, caching prompts, improving relevance and accuracy metrics, writing documentation and improving the backend and frontend services as needed (non-exhaustive). You will gain valuable experience and skills in understanding chip design flows and applications. Open to proposals from all levels of ML practitioners.&lt;/p></description></item><item><title>Final Blogpost: HDEval's LLM Benchmarking for HDL Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/</link><pubDate>Wed, 21 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/</guid><description>&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>Hello everyone! I&amp;rsquo;m Ashwin Bardhwaj, an undergraduate student studying at UC Berkeley. As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Micro Architecture Santa Cruz (MASC)&lt;/a> my &lt;a href="https://drive.google.com/file/d/1Fnr85lqrTs7OBohfHfSZI2K3wZU3zJm0/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a> looks to create a suite of benchmark programs for &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a>.&lt;/p>
&lt;p>The goal of this project is to create large-scale Verilog programs in order to benchmark the capability of LLMs to develop HDL code. Throughout this project, I have created 3 large Verilog testbenches: a 3-Stage RISC-V processor, a Gameboy Emulator, and Sorts. The benchmark programs would lose their effectiveness if LLMs such as ChatGPT scraped them from GitHub repositories and learned from them. As a result, the code itself cannot be made public; this post covers the test report for all 3 of these projects.&lt;/p>
&lt;h1 id="3-stage-risc-v-processor">3 Stage RISC V Processor&lt;/h1>
&lt;p>This is a pipelined RISC processor developed to handle RV32I instructions. A 3-stage processor will typically contain Fetch, Decode, and Execute stages. As a result, every instruction will take exactly 3 clock cycles. For this processor, instructions can be formatted into R, I (Load), S (Store), B (Cond), and J (Jump and Link) type instructions. Once a 32-bit instruction is fetched from the location in memory specified by the pc (Program Counter) register, it is sent to be decoded by the &amp;ldquo;decode unit&amp;rdquo;. Through decoding an instruction, we can determine the exact operation code, the register locations of the 2 operands (rs1 and rs2), and the destination register (rd) at which to write the calculated result. After decoding, an activation flag is sent to the execute stage, which accesses the register file at addresses rs1 and rs2 in order to get the correct operand data. The data and operation are then sent to the ALU to compute the result based on the opcode. The result is then written back into the register file at the rd address, the program counter is incremented, and the next instruction is fetched.&lt;/p>
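&lt;p>To make the decode step concrete, here is a small Python sketch (standing in for the project&amp;rsquo;s Verilog, which cannot be shared) that slices the fixed RV32I fields out of a 32-bit instruction word. The field positions follow the RISC-V base ISA encoding.&lt;/p>
&lt;pre>&lt;code class="language-python">def decode_rv32i(instr):
    """Extract the fixed bit fields of a 32-bit RV32I instruction."""
    return {
        "opcode": instr &amp; 0x7F,          # bits 0-6
        "rd":     (instr >> 7) &amp; 0x1F,   # bits 7-11
        "funct3": (instr >> 12) &amp; 0x07,  # bits 12-14
        "rs1":    (instr >> 15) &amp; 0x1F,  # bits 15-19
        "rs2":    (instr >> 20) &amp; 0x1F,  # bits 20-24
        "funct7": (instr >> 25) &amp; 0x7F,  # bits 25-31
    }

# add x3, x1, x2 encodes to 0x002081B3
print(decode_rv32i(0x002081B3))
# {'opcode': 51, 'rd': 3, 'funct3': 0, 'rs1': 1, 'rs2': 2, 'funct7': 0}
&lt;/code>&lt;/pre>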
&lt;p>The prompts for each module in this processor have been generated and tested against the GPT 3 Turbo and GPT 4o models as examples. In the RISC-V tab of my test report, I have provided the exact prompts and the results after running them through MASC&amp;rsquo;s &lt;a href="https://github.com/masc-ucsc/hdlagent" target="_blank" rel="noopener">HDLAgent&lt;/a> tool, which can access the APIs of many LLMs.&lt;/p>
&lt;h1 id="gameboy-emulator">Gameboy Emulator&lt;/h1>
&lt;p>The Gameboy Emulator is a Verilog implementation of the classic GameBoy console that was widely popular in the 1990s. The main aspects of the GameBoy covered in this project are the Z80-like CPU; memory objects such as RAM, VRAM, and ROM; the PPU (Picture Processing Unit); and other peripherals. Instructions are given to the CISC (variable-length instruction) CPU, where they are decoded and executed based on the details and expectations of each specific instruction. Timing is a real concern, and significant effort went into ensuring that instructions are parsed and run predictably and effectively. Instructions from the ROM may take between 1 and 4 clock cycles to run depending on their requirements. For example, the instruction &amp;ldquo;LD B, HL&amp;rdquo;, which loads the data found at the 16-bit address given by registers H and L into register B, is a 2-cycle instruction: the first cycle decodes the HL address and fetches the data at that location, while the second cycle takes the new input data and writes it into register B. This requires accurate timing control between the different aspects of the GameBoy.
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Gameboy Emulator Top Level Wave File" srcset="
/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_1c31333f2eab882478c68b3e4fe07ef4.webp 400w,
/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_afc571aac140f2cd4e9e117826b4bf3a.webp 760w,
/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_1200x1200_fit_q75_h2_lanczos.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240821-ashwinbardhwaj/Gameboy_Wave_File_hueac052dcb2b1c9a531ecc9cf3de73e1f_112493_1c31333f2eab882478c68b3e4fe07ef4.webp"
width="760"
height="402"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
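&lt;p>To make the timing concern concrete, a 2-cycle load like &amp;ldquo;LD B, HL&amp;rdquo; can be sketched as a tiny state machine. The signal names below are invented for illustration, not the private benchmark code, and memory is assumed to return data one cycle after the read strobe:&lt;/p>
&lt;pre>&lt;code class="language-verilog">// Hypothetical 2-cycle "LD B, HL" timing sketch: cycle 1 presents the
// HL address to memory, cycle 2 latches the returned byte into B.
module ld_b_hl (
    input  wire        clk,
    input  wire        rst,
    input  wire        start,      // instruction dispatched this cycle
    input  wire [7:0]  h, l,       // HL register pair
    input  wire [7:0]  mem_rdata,  // valid the cycle after mem_rd
    output reg  [15:0] mem_addr,
    output reg         mem_rd,
    output reg  [7:0]  b,
    output reg         done
);
  reg busy;
  always @(posedge clk) begin
    if (rst) begin
      busy &lt;= 1'b0; done &lt;= 1'b0; mem_rd &lt;= 1'b0;
    end else begin
      done   &lt;= 1'b0;
      mem_rd &lt;= 1'b0;
      if (start &amp;&amp; !busy) begin    // cycle 1: drive the HL address
        mem_addr &lt;= {h, l};
        mem_rd   &lt;= 1'b1;
        busy     &lt;= 1'b1;
      end else if (busy) begin     // cycle 2: write fetched byte into B
        b    &lt;= mem_rdata;
        busy &lt;= 1'b0;
        done &lt;= 1'b1;
      end
    end
  end
endmodule
&lt;/code>&lt;/pre>
&lt;p>Longer 3- or 4-cycle instructions follow the same pattern with additional states.&lt;/p>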
&lt;p>The Picture Processing Unit is also an integral feature of the GameBoy. Three layers, called Background, Window, and Sprite, are combined into the classic GameBoy screens we know today. While the Background and Window data are consistently fetched from VRAM at fixed points in the clock cycle, the sprites and sprite attributes are accessed using DMA (Direct Memory Access) from OAM (Object Attribute Memory). This reduces the CPU load and speeds up access to sprite data.&lt;/p>
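&lt;p>As a rough illustration of why DMA helps, a hypothetical OAM DMA engine (again, invented names, not the benchmark code) can copy all 160 sprite-attribute bytes on its own while the CPU keeps executing, assuming a combinational read of the source memory:&lt;/p>
&lt;pre>&lt;code class="language-verilog">// Hypothetical OAM DMA sketch: copies 160 bytes (40 sprites x 4
// attribute bytes) from a source page into OAM without the CPU.
module oam_dma (
    input  wire        clk,
    input  wire        rst,
    input  wire        start,
    input  wire [7:0]  src_page,   // high byte of the source address
    input  wire [7:0]  src_rdata,  // combinational read data at src_addr
    output reg  [15:0] src_addr,
    output reg  [7:0]  oam_addr,
    output reg  [7:0]  oam_wdata,
    output reg         oam_we,
    output reg         active
);
  always @(posedge clk) begin
    if (rst) begin
      active &lt;= 1'b0;
      oam_we &lt;= 1'b0;
    end else begin
      oam_we &lt;= 1'b0;
      if (start &amp;&amp; !active) begin
        active   &lt;= 1'b1;
        src_addr &lt;= {src_page, 8'h00};
      end else if (active) begin
        oam_addr  &lt;= src_addr[7:0];  // 0..159 within the page
        oam_wdata &lt;= src_rdata;
        oam_we    &lt;= 1'b1;
        if (src_addr[7:0] == 8'd159)
          active &lt;= 1'b0;            // all 160 bytes copied
        else
          src_addr &lt;= src_addr + 16'd1;
      end
    end
  end
endmodule
&lt;/code>&lt;/pre>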
&lt;h1 id="deliverables">Deliverables&lt;/h1>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>HDEval Test Report&lt;/strong>: The &lt;a href="https://docs.google.com/spreadsheets/d/1vDh_k75h0sG8JGRDDZcdBM4AprVcw9l1/edit?usp=sharing&amp;amp;ouid=102173779464961795129&amp;amp;rtpof=true&amp;amp;sd=true" target="_blank" rel="noopener">HDEval Test Report&lt;/a> contains the module prompts for each testbench, the results after testing on GPT 3 turbo and 4o, and test cases to ensure code correctness and reliability.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>HDEval Repo&lt;/strong>: &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a> contains the encrypted versions of the YAML files that encapsulate the code, prompts, and additional data.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h1 id="next-steps">Next Steps&lt;/h1>
&lt;p>Given these benchmarks, it is important to keep tracking the ability of LLMs to generate HDL code. Beyond GPT 3 Turbo and GPT 4o, I would like these benchmarks to be applied to more models so that we can track their growth and stay informed about their effectiveness in HDL and hardware design.&lt;/p>
&lt;h1 id="previous-blogs">Previous Blogs&lt;/h1>
&lt;p>Please feel free to check out my previous blogs!&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240611-ashwinbardhwaj/">First Blog&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240718-ashwinbardhwaj/">Midterm Blog&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Thank you for reading!&lt;/p></description></item><item><title>Midterm Blogpost: HDEval's LLM Benchmarking for HDL Design</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240718-ashwinbardhwaj/</link><pubDate>Thu, 18 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240718-ashwinbardhwaj/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello! My name is &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ashwin-bardhwaj/">Ashwin Bardhwaj&lt;/a>, and I am an electrical engineering and computer science student based in San Diego, CA. For the past 6 weeks, I have been working closely with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> on the &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a> project. The aim of this project is to create multiple project-sized HDL benchmarks to evaluate how well existing LLMs can generate Verilog/Chisel code. These benchmarks will include my own &amp;ldquo;golden&amp;rdquo; HDL implementation of each project as well as the respective English prompts to guide the LLM. I am excited to work with these tools, which have the potential to become a valuable resource for HDL design. So far, I have successfully created the first benchmark, a pipelined 3-stage RISC-V core, and I am working through my second project, a Gameboy Emulator.&lt;/p>
&lt;h2 id="risc-v-implementation">RISC-V Implementation&lt;/h2>
&lt;p>Over this past month and a half, I have successfully completed my first benchmark, which focuses on creating, modeling, and testing a pipelined 3-stage RISC-V core. The core uses the fetch, decode, and execute structure and is functional for most RV32I instructions. I synthesized and simulated my Verilog using Icarus Verilog and displayed the waveforms in GTKWave. After development, a good portion of time was spent creating and tuning the English explanation of each Verilog module. After running these benchmark files through several LLM APIs, we compared the existing &amp;ldquo;golden&amp;rdquo; modules with the generated ones and noticed that more recent LLMs such as GPT 4o and Claude 3 perform much better at creating syntactically correct and efficient code.&lt;/p>
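&lt;p>For readers unfamiliar with that flow, a minimal testbench sketch that produces a GTKWave-viewable dump looks roughly like this (the module under test is a placeholder; the real benchmark testbenches are private):&lt;/p>
&lt;pre>&lt;code class="language-verilog">// Minimal simulation testbench sketch: drives a clock and reset and
// writes a VCD waveform file that GTKWave can display.
`timescale 1ns/1ps
module tb;
  reg clk = 0;
  reg rst = 1;
  always #5 clk = ~clk;            // 10 ns period clock

  // instantiate the (hypothetical) module under test here, e.g.:
  // my_core dut (.clk(clk), .rst(rst));

  initial begin
    $dumpfile("tb.vcd");           // waveform output for GTKWave
    $dumpvars(0, tb);              // dump every signal under tb
    repeat (2) @(posedge clk);
    rst = 0;
    repeat (100) @(posedge clk);
    $finish;
  end
endmodule
&lt;/code>&lt;/pre>
&lt;p>With Icarus Verilog this is compiled and run as &lt;code>iverilog -o tb tb.v &amp;&amp; vvp tb&lt;/code>, after which &lt;code>gtkwave tb.vcd&lt;/code> opens the waveform.&lt;/p>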
&lt;p>In addition, I have created a tool that parses the Verilog and instruction files into the necessary JSON structure for testing on various models.&lt;/p>
&lt;h2 id="gameboy-emulator">Gameboy Emulator&lt;/h2>
&lt;p>I am also in the process of developing the second benchmark, which targets a Gameboy emulator. This will challenge the LLMs much more than the RISC-V project because, apart from the custom CISC CPU, the model must also understand how to handle the various other blocks of the hardware system, including memory, the picture processing unit (PPU), the sound processing unit (SPU), input/output systems like the buttons and cartridge, and interrupt handlers. As a result, it challenges the model to understand the system as a whole when creating each individual module.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>As we move into the second half of the project, I will continue working on my Gameboy emulator. I have already fully developed and tested the Z80-esque CPU, DMA, and interrupt handler, but I still need to work on the display and sound interfaces. I will also continue to evaluate and run these tests over a wider range of LLMs to get a better picture of which models and versions are best suited for HDL design, as well as the direction these models are heading.&lt;/p></description></item><item><title>HDEval: Benchmarking LLMs that Generate Verilog/Chisel Modules From Natural Language</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240611-ashwinbardhwaj/</link><pubDate>Tue, 14 May 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/ucsc/livehd/20240611-ashwinbardhwaj/</guid><description>&lt;p>Hi everyone!&lt;/p>
&lt;p>I&amp;rsquo;m Ashwin Bardhwaj, currently pursuing a bachelor&amp;rsquo;s in Electrical Engineering and Computer Science at UC Berkeley. I was recently involved in a project to implement a secure hardware encryption enclave in Verilog. That&amp;rsquo;s why I was excited to work with the MASC group to evaluate how well existing generalized LLMs (such as ChatGPT 4 or StarCoder) can generate accurate Verilog/Chisel code from English and assist in the hardware development process.&lt;/p>
&lt;p>As part of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucsc/livehd">Micro Architecture Santa Cruz (MASC)&lt;/a>, my &lt;a href="https://drive.google.com/file/d/1Fnr85lqrTs7OBohfHfSZI2K3wZU3zJm0/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jose-renau/">Jose Renau&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/sakshi-garg/">Sakshi Garg&lt;/a>, aims to create a suite of benchmark programs for &lt;a href="https://github.com/masc-ucsc/hdeval" target="_blank" rel="noopener">HDEval&lt;/a>.&lt;/p>
&lt;p>The deliverable of this project is a set of large HDL benchmarks, each with a respective set of prompts. Using yosys to perform a Logic Equivalence Check, we can prove through formal verification that the generated code exhibits the same behavior as the benchmark. In addition, we can consider the performance and resource utilization of the generated code as a metric.&lt;/p>
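&lt;p>The formal check itself runs inside yosys (via its &lt;code>equiv_make&lt;/code>/&lt;code>equiv_simple&lt;/code>/&lt;code>equiv_status&lt;/code> passes), but the underlying idea can be sketched as a simulation-level miter in Verilog: instantiate the golden and the generated modules side by side, drive both with the same stimulus, and flag any output mismatch. The module and port names below are hypothetical:&lt;/p>
&lt;pre>&lt;code class="language-verilog">// Simulation-level analogue of an equivalence check (the actual LEC is
// done formally in yosys): a miter comparing golden vs. generated RTL.
// alu_golden and alu_generated are hypothetical module names.
`timescale 1ns/1ps
module miter_tb;
  reg  [31:0] a, b;
  reg  [3:0]  op;
  wire [31:0] y_gold, y_gen;

  alu_golden    gold (.a(a), .b(b), .op(op), .y(y_gold));
  alu_generated gen  (.a(a), .b(b), .op(op), .y(y_gen));

  integer i;
  initial begin
    for (i = 0; i &lt; 1000; i = i + 1) begin
      a = $random; b = $random; op = $random;
      #1; // let combinational logic settle
      if (y_gold !== y_gen)
        $display("MISMATCH op=%h a=%h b=%h gold=%h gen=%h",
                 op, a, b, y_gold, y_gen);
    end
    $finish;
  end
endmodule
&lt;/code>&lt;/pre></description></item></channel></rss>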