Bin Dong | UCSC OSPO

MedJEPA: Self-Supervised Medical Image Representation Learning with JEPA

Mon, 19 Jan 2026 10:15:56 -0700

Project Description

[MedJEPA] Medical image analysis is fundamental to modern healthcare, enabling disease diagnosis, treatment planning, and patient monitoring across diverse clinical applications. In radiology and pathology, deep learning models support automated detection of abnormalities, tumor segmentation, and diagnostic assistance. Medical imaging modalities including X-rays, CT scans, MRI, ultrasound, and histopathology slides generate vast amounts of unlabeled data that could benefit from self-supervised representation learning. Clinical applications include cancer detection and staging, cardiovascular disease assessment, neurological disorder diagnosis, and infectious disease screening. In drug discovery and clinical research, analyzing medical images helps evaluate treatment efficacy, predict patient outcomes, and identify biomarkers for disease progression. Telemedicine and point-of-care diagnostics benefit from AI-powered image analysis that extends expert-level interpretation to underserved regions. However, medical imaging faces unique challenges: limited labeled datasets due to expensive expert annotation, patient privacy concerns restricting data sharing, domain shift across different imaging equipment and protocols, and the need for models that generalize across hospitals and populations. Traditional medical image analysis relies heavily on supervised learning with manually annotated labels, creating bottlenecks due to the scarcity and cost of expert annotations. Existing self-supervised methods applied to medical imaging often employ complex training procedures with numerous heuristics—momentum encoders, stop-gradients, teacher-student architectures, and carefully tuned augmentation strategies—that may not translate well across different medical imaging modalities and clinical contexts. 
These approaches struggle with domain-specific challenges such as subtle pathological features, high-resolution images, 3D volumetric data, and the need for interpretable representations that clinicians can trust. To address these challenges, we propose MedJEPA: Self-Supervised Medical Image Representation Learning with Joint-Embedding Predictive Architecture, which leverages the theoretically grounded LeJEPA framework for 2D medical images and V-JEPA principles for medical video and volumetric data, creating a unified, scalable, and heuristics-free approach tailored to medical imaging applications. By using principled JEPA objectives such as Sketched Isotropic Gaussian Regularization (SIGReg), MedJEPA eliminates complex training heuristics while learning clinically meaningful representations from unlabeled medical images. Unlike conventional self-supervised methods that require extensive hyperparameter tuning and may not generalize across medical imaging modalities, MedJEPA provides a clean, theoretically motivated framework with minimal hyperparameters that adapts to diverse medical imaging contexts, from chest X-rays to histopathology slides to cardiac MRI sequences. The learned representations can support downstream tasks including disease classification, lesion detection, organ segmentation, and survival prediction, while requiring significantly fewer labeled examples for fine-tuning. This approach democratizes access to state-of-the-art medical AI by enabling effective learning from the vast amounts of unlabeled medical imaging data available in hospital archives, addressing the annotation bottleneck that has limited progress in medical AI.
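To make the idea behind a sketched Gaussian regularizer more concrete, here is a simplified illustration. It is not the LeJEPA implementation: it only shows the general pattern of projecting embeddings onto random 1D directions and measuring how far each projection is from a standard Gaussian, here by comparing characteristic functions on a small grid. All function names, the grid, and the weighting are assumptions made for this sketch.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gaussianity_gap_1d(samples, t_max=3.0, n_points=25):
    """Weighted squared gap between the empirical characteristic function
    of `samples` and that of N(0, 1), summed over a grid on [-t_max, t_max]."""
    step = 2.0 * t_max / (n_points - 1)
    total = 0.0
    for k in range(n_points):
        t = -t_max + k * step
        re = sum(math.cos(t * z) for z in samples) / len(samples)
        im = sum(math.sin(t * z) for z in samples) / len(samples)
        target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1)
        weight = math.exp(-0.5 * t * t)  # assumed weighting window
        total += weight * ((re - target) ** 2 + im ** 2) * step
    return total


def sketched_gaussian_penalty(embeddings, n_directions=16, seed=0):
    """Average the 1D gap over random projections of the embeddings."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    gaps = []
    for _ in range(n_directions):
        u = random_unit_vector(dim, rng)
        projected = [sum(ui * xi for ui, xi in zip(u, x)) for x in embeddings]
        gaps.append(gaussianity_gap_1d(projected))
    return sum(gaps) / len(gaps)
```

In this toy version, embeddings drawn from an isotropic Gaussian receive a small penalty, while collapsed embeddings (all identical) receive a clearly larger one, which is the behavior such a regularizer is meant to encourage.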

Project Objectives

Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to apply Joint-Embedding Predictive Architecture (JEPA) frameworks to medical image representation learning, addressing the critical challenge of learning from limited labeled medical data. Medical imaging generates enormous amounts of unlabeled data, but supervised learning approaches are bottlenecked by the scarcity and cost of expert annotations. Existing self-supervised methods often rely on complex heuristics that don’t generalize well across diverse medical imaging modalities, equipment vendors, and clinical protocols. This project will leverage the theoretically grounded LeJEPA framework for 2D medical images (X-rays, histopathology slides, fundus images) and V-JEPA principles for temporal and volumetric medical data (cardiac MRI sequences, CT scans, surgical videos). The core challenge lies in adapting these heuristics-free, stable frameworks to medical imaging’s unique characteristics: subtle pathological features requiring fine-grained representations, high-resolution images demanding efficient processing, domain shift across hospitals and equipment, and the need for interpretable features that support clinical decision-making. The learned representations will be evaluated on diverse downstream clinical tasks including disease classification, lesion detection, organ segmentation, and prognosis prediction, with emphasis on few-shot learning scenarios that reflect real-world annotation constraints. Below is an outline of the methodologies and models that will be developed in this project.

Step 1: Medical Data Preparation: Develop data processing pipelines for diverse medical imaging modalities, implementing DICOM/NIfTI parsing, standardized preprocessing, and efficient data loading for self-supervised pre-training.
  Prepare 2D medical image datasets:
    Chest X-rays: ChestX-ray14, MIMIC-CXR, CheXpert for lung disease detection
    Histopathology: Camelyon16/17 (breast cancer), PCam (patch-level classification)
    Retinal imaging: EyePACS, APTOS (diabetic retinopathy), Messidor
    Dermatology: HAM10000, ISIC (skin lesion classification)
  Prepare 3D volumetric and temporal medical data:
    CT scans: LIDC-IDRI (lung nodules), Medical Segmentation Decathlon datasets
    MRI sequences: BraTS (brain tumors), ACDC (cardiac MRI), UK Biobank cardiac videos
    Medical video: Surgical procedure videos, endoscopy recordings, ultrasound sequences
  Implement medical imaging-specific preprocessing: intensity normalization, resolution standardization, handling of multi-channel medical images (different MRI sequences, RGB histopathology), and privacy-preserving anonymization.
  Design masking strategies appropriate for medical imaging: spatial masking for 2D images, volumetric masking for 3D scans, temporal masking for sequences, and anatomy-aware masking that respects organ boundaries.
  Create data loaders supporting high-resolution medical images, 3D volumes, and multi-modal inputs (e.g., multiple MRI sequences).
Step 2: JEPA Model Implementation for Medical Imaging:
  Implement LeJEPA for 2D medical images:
    Adapt joint-embedding predictive architecture for medical image characteristics (high resolution, subtle features, domain-specific patterns)
    Apply Sketched Isotropic Gaussian Regularization (SIGReg) to learn clinically meaningful embedding distributions
    Maintain single trade-off hyperparameter and heuristics-free training for reproducibility across medical imaging centers
    Support various encoder architectures: Vision Transformers for global context, ConvNets for local features, hybrid approaches
  Extend to V-JEPA for medical video and volumetric data:
    Spatiotemporal encoding for cardiac MRI sequences, surgical videos, and time-series medical imaging
    Temporal prediction objectives for understanding disease progression and treatment response
    3D volume processing for CT and MRI scans with efficient memory management
    Multi-slice and multi-sequence learning for comprehensive medical imaging contexts
  Develop medical domain-specific enhancements:
    Multi-scale representation learning to capture both fine-grained pathological details and global anatomical context
    Interpretability mechanisms: attention visualization, feature attribution, and embedding space analysis for clinical validation
    Robustness to domain shift: training strategies that generalize across different scanners, protocols, and institutions
    Privacy-preserving training considerations compatible with medical data regulations (HIPAA, GDPR)
  Implement efficient training infrastructure:
    Support for distributed training across multiple GPUs for large medical imaging datasets
    Memory-efficient processing of high-resolution images and 3D volumes
    Checkpoint management and model versioning for clinical deployment pipelines
    Minimal-code implementation (≈50-100 lines) demonstrating framework simplicity
Step 3: Evaluation & Safety Validation:
  Disease Classification Tasks:
    Multi-label chest X-ray classification: 14 pathology classes on ChestX-ray14, MIMIC-CXR
    Diabetic retinopathy grading: 5-class classification on EyePACS, APTOS
    Skin lesion classification: 7-class classification on HAM10000
    Brain tumor classification: glioma grading on BraTS dataset
    Evaluate with linear probing, few-shot learning (5-shot, 10-shot), and full fine-tuning
  Lesion Detection and Segmentation:
    Lung nodule detection on LIDC-IDRI dataset
    Tumor segmentation on Medical Segmentation Decathlon tasks
    Polyp detection in colonoscopy videos
    Cardiac structure segmentation in MRI sequences
  Clinical Prediction Tasks:
    Survival prediction from histopathology slides
    Disease progression prediction from longitudinal imaging
    Treatment response assessment from pre/post imaging pairs
  Few-Shot and Low-Data Regime Evaluation:
    Systematic evaluation with 1%, 5%, 10%, 25%, 50% of labeled training data
    Comparison against supervised baselines and ImageNet pre-training
    Analysis of annotation efficiency: performance vs. number of labeled examples required
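The low-data evaluation in Step 3 needs labeled subsets at fixed fractions of the training set. The following is one possible way to draw them, a minimal sketch that assumes a single label per image (multi-label datasets such as ChestX-ray14 would need a different stratification). The function name and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_fraction(labels, fraction, seed=0):
    """Return sorted indices of a subset holding about `fraction` of each
    class, keeping at least one example per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


LABEL_FRACTIONS = [0.01, 0.05, 0.10, 0.25, 0.50]  # fractions listed in Step 3

if __name__ == "__main__":
    toy_labels = ["normal"] * 800 + ["pneumonia"] * 200  # hypothetical labels
    for frac in LABEL_FRACTIONS:
        subset = stratified_fraction(toy_labels, frac)
        print(f"{frac:.0%}: {len(subset)} labeled examples")
```

Fixing the seed keeps the subsets identical across pre-training methods, so that differences in downstream accuracy are not caused by different labeled samples.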

Project Deliverables

This project will deliver three components: software implementation, clinical evaluation, and practical deployment resources. The software implementing MedJEPA will be hosted on GitHub as an open-access repository with modular code supporting multiple medical imaging modalities (2D images, 3D volumes, videos), pre-trained model checkpoints on major medical imaging datasets (chest X-rays, histopathology, MRI), training and evaluation scripts with medical imaging-specific preprocessing pipelines, privacy-preserving training implementations compatible with clinical data regulations, and comprehensive documentation including tutorials for medical AI researchers and clinicians. The evaluation results will include benchmarks on 10+ medical imaging datasets across diverse modalities and clinical tasks, few-shot learning analysis demonstrating annotation efficiency gains, cross-institutional validation studies showing robustness to domain shift, interpretability visualizations enabling clinical validation of learned representations, and detailed comparisons against supervised baselines and existing medical self-supervised methods.

MedJEPA

Topics: Self-Supervised Medical Image Representation Learning with JEPA
Skills: Proficiency in Python, PyTorch, GitHub, JEPA
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Bin Dong, Linsey Pang

References:

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics - Randall Balestriero and Yann LeCun, arXiv 2025
Revisiting Feature Prediction for Learning Visual Representations from Video (V-JEPA) - Adrien Bardes et al., arXiv 2024
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture - Mahmoud Assran et al., CVPR 2023 (I-JEPA)
ChestX-ray14: Hospital-Scale Chest X-Ray Database - https://nihcc.app.box.com/v/ChestXray-NIHCC
Medical Segmentation Decathlon - http://medicaldecathlon.com/
MIMIC-CXR Database - https://physionet.org/content/mimic-cxr/
The Cancer Imaging Archive (TCIA) - https://www.cancerimagingarchive.net/
UK Biobank Imaging Study - https://www.ukbiobank.ac.uk/enable-your-research/about-our-data/imaging-data

NeuroHealth: AI-Powered Health Assistant

Mon, 19 Jan 2026 10:15:56 -0700

Project Description

[NeuroHealth] Intelligent health assistance systems are increasingly essential for improving healthcare accessibility, patient engagement, and clinical decision support. In primary care and preventive medicine, AI assistants help users understand symptoms, schedule appropriate appointments, and receive preliminary health guidance. Telemedicine applications include triage support, appointment scheduling optimization, and patient education based on health inquiries. In chronic disease management, these systems provide medication reminders, lifestyle recommendations, and timely alerts for medical follow-ups. Healthcare navigation applications include finding appropriate specialists, understanding treatment options, and coordinating care across multiple providers. In wellness and preventive care, intelligent assistants enhance health literacy by delivering personalized health information, screening recommendations, and proactive health management strategies. By leveraging natural language understanding and medical knowledge integration, these systems enhance healthcare access, reduce unnecessary emergency visits, and empower users to make informed health decisions across diverse populations. Traditional health information systems often provide generic responses that fail to account for individual health contexts, medical history, and personal circumstances. Existing symptom checkers and health chatbots primarily rely on rule-based logic or simple decision trees, limiting their ability to understand nuanced health inquiries, reason about complex symptom patterns, or provide contextually appropriate guidance. These systems struggle with interpreting ambiguous descriptions, adapting to users’ health literacy levels, and generating personalized recommendations that account for individual medical constraints and preferences. 
To address these challenges, we propose NeuroHealth: AI-Powered Health Assistant, which leverages Large Language Models (LLMs) to create an intelligent conversational agent that synthesizes user health inquiries, symptom descriptions, and contextual information into actionable, personalized health guidance and appointment recommendations. By integrating LLM-based medical reasoning with structured clinical knowledge bases, NeuroHealth enhances symptom interpretation, appointment routing, and health education delivery. Unlike conventional systems that provide static responses from predetermined templates, NeuroHealth dynamically understands user intent, asks clarifying questions, assesses urgency levels, and generates appropriate recommendations—whether scheduling a doctor appointment, suggesting self-care measures, or directing users to emergency services. This fusion of LLM intelligence with validated medical knowledge enables a more accessible, adaptive, and helpful health assistance platform, bridging the gap between users seeking health information and appropriate medical care.
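The three outcomes named above (emergency services, a doctor appointment, or self-care) can be pictured as a small routing step. The sketch below is only an illustration of the control flow: the phrase lists are hypothetical placeholders rather than clinical guidance, and in the proposed system an LLM with safety guardrails would take the place of the keyword checks.

```python
from dataclasses import dataclass
from enum import Enum


class Urgency(Enum):
    EMERGENCY = "emergency"
    ROUTINE = "routine"
    SELF_CARE = "self_care"


@dataclass
class Recommendation:
    urgency: Urgency
    action: str


# Hypothetical phrase lists for illustration only; not clinical guidance.
RED_FLAGS = ("chest pain", "trouble breathing", "fainted")
ROUTINE_FLAGS = ("for two weeks", "keeps coming back", "getting worse")


def route(message: str) -> Recommendation:
    """Map a free-text inquiry to one of the three outcomes named in the text."""
    text = message.lower()
    # Red flags are checked first so the safety path always wins.
    if any(flag in text for flag in RED_FLAGS):
        return Recommendation(Urgency.EMERGENCY, "Direct the user to emergency services.")
    if any(flag in text for flag in ROUTINE_FLAGS):
        return Recommendation(Urgency.ROUTINE, "Suggest scheduling a doctor appointment.")
    return Recommendation(Urgency.SELF_CARE, "Offer self-care guidance and ask a clarifying question.")
```

For example, `route("My cough keeps coming back")` returns a routine recommendation, while any message containing a red-flag phrase is routed to the emergency path regardless of what else it says.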

Project Objectives

Aligned with the vision of the 2026 Open Source Research Experience (OSRE), this project aims to develop an AI-Powered Health Assistant (NeuroHealth) to improve healthcare accessibility and patient engagement through intelligent conversational guidance. Healthcare systems face significant challenges in providing timely, personalized health information and connecting patients with appropriate care resources. Traditional symptom checkers and health information systems often deliver generic, rule-based responses that fail to account for individual contexts and struggle with natural language understanding. To address these limitations, this project will leverage Large Language Models (LLMs) to create an intelligent health assistant that understands user health inquiries, interprets symptom descriptions, assesses urgency, and provides personalized recommendations including doctor appointment suggestions, self-care guidance, and healthcare navigation support. The core challenge lies in designing NeuroHealth as a safe, accurate, and user-friendly system capable of natural conversation, medical knowledge retrieval, and appropriate response generation while maintaining clinical safety guardrails. Unlike conventional health chatbots that follow rigid conversation flows, NeuroHealth will reason over user inputs, ask clarifying questions, and dynamically adapt responses based on context, resulting in more helpful, accurate, and appropriate health assistance. Below is an outline of the methodologies and models that will be developed in this project.

Step 1: Data Collection & Knowledge Base Construction: Develop a comprehensive medical knowledge base integrating validated health information sources, symptom databases, condition descriptions, and appointment routing guidelines. Collect and curate conversational health inquiry datasets from public medical Q&A forums, symptom checker logs, and healthcare chatbot interactions to create training and evaluation data. Design structured representations for symptoms, conditions, urgency levels, and appointment recommendations to enable effective retrieval and reasoning. Extract common health inquiry patterns, symptom descriptions, and user intent categories to inform conversation flow design. Data sources can include public medical knowledge bases such as MedlinePlus, Mayo Clinic health information, clinical practice guidelines, and synthetic patient inquiry scenarios based on common healthcare use cases. Implement data validation mechanisms to ensure medical accuracy and clinical safety compliance.
Step 2: Model Development: Design and implement an LLM-based conversational health assistant that integrates medical knowledge retrieval with natural language understanding and generation. Develop a Retrieval-Augmented Generation (RAG) architecture that grounds LLM responses in validated medical information sources, reducing hallucination risks and ensuring factual accuracy. Create prompt engineering strategies and reasoning frameworks that enable the system to: interpret symptom descriptions, assess urgency levels, ask appropriate clarifying questions, and generate personalized health guidance. Implement a multi-component architecture including: intent recognition, symptom extraction, urgency assessment, appointment recommendation generation, and response formatting modules. Develop clinical safety guardrails that detect high-risk scenarios requiring immediate medical attention and provide appropriate emergency guidance. Design conversation management strategies that maintain context across multi-turn dialogues and adapt to users’ health literacy levels. The baseline architecture can leverage state-of-the-art models such as GPT-4, Claude, or open-source alternatives like Llama, Qwen, combined with medical knowledge retrieval systems.
Step 3: Evaluation & Safety Validation: Benchmark NeuroHealth against existing symptom checkers and health chatbots, evaluating on metrics including response accuracy, appropriateness of appointment recommendations, urgency assessment precision, and user satisfaction. Conduct human evaluation studies with healthcare professionals to assess clinical safety, response quality, and appropriateness of medical guidance. Perform adversarial testing to identify potential failure modes, unsafe responses, or inappropriate recommendations under edge cases. Conduct ablation studies to analyze the impact of retrieval-augmented generation, safety guardrails, and conversation management strategies on system performance. Evaluate system performance across diverse health inquiry types including acute symptoms, chronic condition management, preventive care questions, and healthcare navigation requests. Assess response quality across different user demographics and health literacy levels to ensure equitable access. Optimize inference efficiency and response latency for real-time conversational interaction across web and mobile platforms.
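The Retrieval-Augmented Generation step described in Step 2 can be sketched in a few lines: retrieve the passages most similar to the user's question, then place them in the prompt so the model answers from that context. The version below is a toy under stated assumptions: it uses bag-of-words cosine similarity instead of learned embeddings, the knowledge-base passages in the usage note are placeholder text, and the prompt wording and function names are hypothetical.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "i", "have", "is", "are", "to", "of", "with", "for", "my"}


def bag_of_words(text):
    """Lowercased word counts with a few common stopwords removed."""
    return Counter(w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(passages, key=lambda p: cosine(q, bag_of_words(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Assemble a prompt that grounds the answer in retrieved passages."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Answer using only the context below. If the context is insufficient, say so.\n"
        f"Context:\n{context}\n"
        f"User question: {query}\n"
    )
```

With a placeholder knowledge base such as `["Seasonal flu commonly causes fever, cough, and body aches.", "A sprained ankle is often managed with rest, ice, compression, and elevation."]`, the question "I have a fever and a cough" retrieves the flu passage first. The resulting prompt string would then be sent to whichever LLM the system uses.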

Project Deliverables

This project will deliver three components: model development, evaluation and validation, and interactive demonstration. The software implementing the NeuroHealth system will be hosted on GitHub as an open-access repository with comprehensive documentation, deployment guides, and API specifications. The evaluation results, including benchmark comparisons against existing systems, clinical safety assessments, and user study findings, will be published alongside the GitHub repository. An interactive demo showcasing the conversational interface, symptom interpretation capabilities, and appointment recommendation generation will be provided to illustrate real-world application scenarios.

NeuroHealth

Topics: AI-Powered Health Assistant
Skills: Proficiency in Python, GitHub, LLM
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Linsey Pang, Bin Dong

References:

Large Language Models in Healthcare - Singhal et al., Nature 2023
Med-PaLM: Large Language Models for Medical Question Answering - Singhal et al., arXiv 2022
Capabilities of GPT-4 on Medical Challenge Problems - Nori et al., arXiv 2023
MedlinePlus Medical Encyclopedia - https://medlineplus.gov/
Clinical Practice Guidelines Database - https://www.guidelines.gov/

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Tue, 22 Jul 2025 10:15:56 -0700

Midway Through OSRE

My Journey with LLMSeqRec

Hello from the Midpoint!

Hi everyone! I’m Connor Lee, a student at NYU studying Computer Science and Mathematics, and I’m excited to share the progress I’ve made halfway through the Open Source Research Experience (OSRE) with my project: LLMSeqRec – a large language model-enhanced sequential recommender system.

Over the past several weeks, I’ve had the opportunity to explore the intersection of recommender systems and large language models (LLMs), and it’s been a deep, challenging, and rewarding dive into building smarter, more contextual recommendation engines.

What is LLMSeqRec?

LLMSeqRec stands for LLM-Enhanced Contextual Sequential Recommender. Traditional sequential recommendation systems like SASRec are great at capturing patterns from user-item interactions, but they often fall short in two areas: understanding semantic context (e.g., item descriptions, reviews) and dealing with cold-start problems.

LLMSeqRec aims to address this by incorporating pretrained LLM embeddings into the recommendation pipeline. The goal is to enhance models like SASRec with semantic signals from text (like product reviews or titles), allowing them to better model user intent, long-range dependencies, and generalize to new items or users.
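As a rough illustration of why text embeddings can help with new items (not the project's actual code): an item with no interaction history can still be compared with known items through the embeddings of their titles or reviews. The three-dimensional vectors below are made-up stand-ins for LLM embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_known_items(new_item_vec, known_item_vecs, k=2):
    """Rank already-seen items by text-embedding similarity to a new item."""
    ranked = sorted(
        known_item_vecs,
        key=lambda item: cosine(new_item_vec, known_item_vecs[item]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for LLM text embeddings (assumed values).
known = {
    "lipstick": [0.9, 0.1, 0.0],
    "mascara": [0.8, 0.2, 0.1],
    "shampoo": [0.1, 0.9, 0.2],
}
new_lip_gloss = [0.85, 0.15, 0.05]
print(nearest_known_items(new_lip_gloss, known))
```

A sequential model that only sees item IDs has nothing to go on for the new lip gloss, whereas the text-based neighbors give it a plausible starting point.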

Progress So Far

✅ Baseline SASRec Runs

To establish a benchmark, I successfully ran the original SASRec implementation (in PyTorch) on both the MovieLens 1M and Amazon Beauty datasets. After debugging initial data formatting issues and adjusting batch sizes for local CPU/GPU compatibility, I automated training with scripts that let me scale to 200+ epochs to achieve the best performance, both in Colab and on my MacBook’s CPU.

Note: At this stage, we have not yet integrated LLMs into the model. These baseline runs (SASRec) serve as the control group for evaluating the future impact of LLM-based enhancements.

What’s Next

As I enter the second half of the OSRE, I’ll be shifting gears toward LLM integration, model evaluation, and running LLM-powered sequential recommendations using product metadata and contextual information. Here’s what’s ahead:

Designing pipelines to extract and align textual metadata with item sequences
Integrating LLM-generated embeddings into the recommender model
Evaluating performance changes across different dataset characteristics
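The first item above, aligning textual metadata with item sequences, could look roughly like the following. This is a minimal sketch under assumptions: the metadata field names (`title`, `category`, `description`) and the placeholder for missing items are hypothetical, and real Amazon or MovieLens metadata would need its own field mapping.

```python
def item_text(meta):
    """Join the metadata fields that exist into one string for an LLM encoder."""
    parts = [meta.get("title", ""), meta.get("category", ""), meta.get("description", "")]
    return " | ".join(p.strip() for p in parts if p and p.strip())


def align_sequence(item_ids, metadata, missing="[no metadata]"):
    """Return one text per interaction, in the same order as the ID sequence."""
    return [item_text(metadata.get(i, {})) or missing for i in item_ids]
```

Keeping the output the same length and order as the ID sequence means each text (or its embedding) can be looked up by position next to the item ID the recommender already uses.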

📊 Experimental Results

We have not yet utilized LLMs in our current experiments. The results below reflect our reproduced baseline performance of SASRec across datasets.

Below are the performance curves on different test sets, where we evaluate model performance every 20 epochs during training:

Beauty Dataset Performance

[Figure: Hit@10 on the test set for the Beauty dataset, evaluated every 20 epochs]

[Figure: Training loss for the Beauty dataset]

[Figure: NDCG@10 on the test set for the Beauty dataset, evaluated every 20 epochs]

ML-1M Dataset Performance

[Figure: Training loss for the ML-1M dataset]

[Figure: Hit@10 on the test set for the ML-1M dataset, evaluated every 20 epochs]

[Figure: NDCG@10 on the test set for the ML-1M dataset, evaluated every 20 epochs]

These results demonstrate that our baseline SASRec reproductions are converging as expected and will serve as a solid foundation for comparison once LLM integration is complete.
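For readers unfamiliar with the two metrics, here is a small sketch of how Hit@10 and NDCG@10 are commonly computed when each user has a single held-out target item. It is an illustration rather than the evaluation script used for the curves; how the candidate set is built (for example, which negatives are sampled) is left to the caller, and the function names are hypothetical.

```python
import math


def rank_of_target(scores, target_index):
    """0-based rank of the target among candidates (higher score is better)."""
    target_score = scores[target_index]
    return sum(1 for s in scores if s > target_score)


def hit_at_k(rank, k=10):
    """1 if the target is in the top k, else 0."""
    return 1.0 if rank < k else 0.0


def ndcg_at_k(rank, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0


def evaluate(users, k=10):
    """Average Hit@K and NDCG@K over (scores, target_index) pairs."""
    ranks = [rank_of_target(scores, target) for scores, target in users]
    hit = sum(hit_at_k(r, k) for r in ranks) / len(ranks)
    ndcg = sum(ndcg_at_k(r, k) for r in ranks) / len(ranks)
    return hit, ndcg
```

Hit@10 only asks whether the target made the top ten, while NDCG@10 also rewards placing it nearer the top, which is why the two curves can move differently during training.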

Closing Thoughts

This project has been an exciting journey into both research and engineering and I’m excited to explore LLM-powered embedding integration in the upcoming phase.

I’m incredibly grateful to my mentors Dr. Linsey Pang and Dr. Bin Dong for their support and guidance throughout the project so far. I’m looking forward to sharing more technical results as we work toward building smarter, more adaptable recommender systems.

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Fri, 06 Jun 2025 10:15:56 -0700

Project Description

Sequential Recommender Systems are widely used in scientific and business applications to analyze and predict patterns over time. In biology and ecology, they help track species behavior by suggesting related research on migration patterns and environmental changes. Medical applications include personalized treatment recommendations based on patient history and predicting disease progression. In physics and engineering, these systems optimize experimental setups by suggesting relevant past experiments or simulations. Environmental and climate science applications include forecasting climate trends and recommending datasets for monitoring deforestation or pollution. In business and e-commerce, sequential recommenders enhance user experiences by predicting consumer behavior, suggesting personalized products, and optimizing marketing strategies based on browsing and purchase history. By leveraging sequential dependencies, these recommender systems enhance research efficiency, knowledge discovery, and business decision-making across various domains. Traditional sequential recommendation systems rely on historical user interactions to predict future preferences, but they often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors. Existing models primarily use predefined embeddings and handcrafted features, limiting their ability to generalize across diverse recommendation scenarios. To address these challenges, we propose LLM Enhanced Contextual Sequential Recommender (LLMSeqRec), which leverages Large Language Models (LLMs) to enrich sequential recommendations with deep contextual understanding and adaptive reasoning. By integrating LLM-generated embeddings and contextual representations, LLMSeqRec enhances user intent modeling, cold-start recommendations, and long-range dependencies in sequential data. 
Unlike traditional models that rely solely on structured interaction logs, LLMSeqRec dynamically interprets and augments sequences with semantic context, leading to more accurate and personalized recommendations. This fusion of LLM intelligence with sequential modeling enables a more scalable, adaptable, and explainable recommender system, bridging the gap between traditional sequence-based approaches and advanced AI-driven recommendations.

Project Objectives

Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an LLM-Enhanced Contextual Sequential Recommender (LLMSeqRec) to improve sequential recommendation accuracy across various scientific and business applications. Sequential recommender systems are widely used to analyze and predict patterns over time, assisting in fields such as biology, ecology, medicine, physics, engineering, environmental science, and e-commerce. However, traditional models often struggle with capturing complex contextual dependencies and adapting to dynamic user behaviors, as they primarily rely on plain sequences of item IDs. To address these limitations, this project will leverage Large Language Models (LLMs) to enhance context-aware sequential recommendations by dynamically integrating LLM-generated embeddings and contextual representations. The core challenge lies in designing LLMSeqRec, a unified and scalable model capable of enriching user intent modeling, mitigating cold-start issues, and capturing long-range dependencies within sequential data. Unlike conventional systems that rely solely on structured interaction logs, LLMSeqRec will interpret and augment sequences with semantic context, resulting in more accurate, adaptable, and explainable recommendations. Below is an outline of the methodologies and models that will be developed in this project:

Step 1: Data Preprocessing & Feature Creation: Develop a data processing pipeline to parse users’ sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling. Extract user behavior sequences, item metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing. The data can come from the Amazon public review data or the MovieLens dataset. Data point creation can follow SASRec (reference 1).
Step 2: Model Development: Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context. Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions and reviews, into the sequential recommendation process. The baseline model can be the PyTorch implementation of SASRec.
Step 3: Evaluation: Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating on accuracy, NDCG, and cold-start performance. Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality. Optimize model inference speed and efficiency for real-time recommendation scenarios.
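The data point creation in Step 1 can be sketched as follows, assuming the leave-one-out protocol commonly used with SASRec: each user's interactions are ordered by time, the last item is held out for testing, the second-to-last for validation, and the rest are used for training. The input row format `(user, item, timestamp)` and the function names are assumptions for this example.

```python
from collections import defaultdict


def build_sequences(interactions):
    """Group (user, item, timestamp) rows into per-user item lists ordered by time."""
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    return {
        user: [item for _, item in sorted(rows, key=lambda r: r[0])]
        for user, rows in by_user.items()
    }


def leave_one_out(sequences):
    """Split each sequence into train / validation / test parts."""
    train, valid, test = {}, {}, {}
    for user, items in sequences.items():
        if len(items) < 3:
            # Too short to hold anything out: keep everything for training.
            train[user], valid[user], test[user] = list(items), [], []
        else:
            train[user], valid[user], test[user] = items[:-2], [items[-2]], [items[-1]]
    return train, valid, test
```

Holding out the most recent items rather than random ones keeps the evaluation faithful to the sequential setting, where the model must predict what comes next.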

Project Deliverables

This project will deliver three components: software; model training, validation, and performance evaluation; and a demo. The software implementing the LLMSeqRec model described above will be hosted on GitHub as an open-access repository. The evaluation results and demo will be published alongside the GitHub repository.

LLMSeqRec

Topics: LLM Enhanced Contextual Sequential Recommender
Skills: Proficiency in Python, PyTorch, GitHub, Self-attention, Transformer
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Linsey Pang, Bin Dong

References:

Self-Attentive Sequential Recommendation (SASRec)
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Amazon Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews
MovieLens Data: https://grouplens.org/datasets/movielens/

Introduction

I’m Connor, a student at NYU studying CS and Math. This summer I’ve gotten the opportunity to work on LLMSeqRec under Dr. Bin Dong and Dr. Linsey Pang.

In today’s digital age, sequential recommender systems power everything from e-commerce suggestions to personalized content everywhere. However, traditional models fall short in capturing user intent, adapting to dynamic behavior, or tackling cold-start problems. That’s where LLMSeqRec comes in.

Problem Statement

Most sequential recommender systems rely heavily on historical user-item interactions and predefined embeddings. This approach limits their ability to understand nuanced user preferences, struggles to scale across domains, and performs poorly in scenarios like new users or sparse data. The absence of semantic and contextual modeling is a major gap in current solutions.

Overview of project

LLMSeqRec is a novel, LLM-enhanced sequential recommender framework that bridges this gap. By leveraging large language models (LLMs), it incorporates semantic embeddings and prompt-based contextual modeling to understand both user behavior and item metadata at a deeper level. The system explores two core approaches:

Embedding-based: LLMs generate embeddings from item attributes.
Prompt-based: LLMs receive full transaction history in natural language format and infer recommendations.
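
To make the prompt-based approach concrete, here is a minimal sketch of how a transaction history could be rendered as natural language. The function name `build_prompt`, the prompt wording, and the idea of passing a candidate list are all assumptions made for illustration; they are not taken from the LLMSeqRec code.

```python
from typing import Sequence


def build_prompt(history: Sequence[str], candidates: Sequence[str], k: int = 3) -> str:
    """Render an interaction history and candidate items as a text prompt.

    Hypothetical helper: the wording below is illustrative only.
    """
    # Number the history so the model can see the temporal order.
    history_lines = "\n".join(
        f"{i}. {title}" for i, title in enumerate(history, start=1)
    )
    candidate_lines = "\n".join(f"- {title}" for title in candidates)
    return (
        "The user interacted with these items, oldest first:\n"
        f"{history_lines}\n\n"
        "Candidate items:\n"
        f"{candidate_lines}\n\n"
        f"Rank the top {k} candidates the user is most likely to interact with next."
    )
```

The resulting string would then be sent to whichever LLM is being evaluated; parsing the model's answer back into item IDs is a separate step not shown here.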

These techniques are tested using well-known datasets (e.g., Amazon, MovieLens), and evaluated with ranking metrics like NDCG@10 and Hit@10. The goal: deliver more accurate, context-rich, and explainable recommendations.
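
As a rough illustration of the two ranking metrics, the sketch below computes Hit@k and NDCG@k for the common setting where each test case has exactly one held-out ground-truth item (the setting used in SASRec-style evaluation). The function names are assumptions for this example.

```python
import math
from typing import Hashable, Sequence


def hit_at_k(ranked_items: Sequence[Hashable], target: Hashable, k: int = 10) -> float:
    """1.0 if the held-out item appears in the top-k ranking, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items: Sequence[Hashable], target: Hashable, k: int = 10) -> float:
    """NDCG@k with a single relevant item.

    With one relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 2), where rank is the 0-based position of the target.
    """
    top_k = list(ranked_items[:k])
    if target not in top_k:
        return 0.0
    rank = top_k.index(target)
    return 1.0 / math.log2(rank + 2)
```

Averaging these per-user values over all test users gives the reported Hit@10 and NDCG@10.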

Next Steps

The project is currently progressing through stages including model training, embedding integration, and evaluation. Upcoming tasks include:

Fine-tuning enhanced models
Designing zero-/few-shot prompts
Running comparative experiments
Publishing findings and writing technical blogs

This work is part of the LLMSeqRec project, which I am pursuing under the mentorship of Dr. Bin Dong and Dr. Linsey Pang.

LLMSeqRec: LLM Enhanced Contextual Sequential Recommender

Thu, 06 Feb 2025 10:15:56 -0700

Project Description

Project Objectives

Step 1: Data Preprocessing & Feature Creation: Develop a data processing pipeline that parses users’ sequential interaction behaviors into sequential data points for LLM-based embeddings and contextual sequential transformer modeling. Extract user behavior sequences, item metadata, and temporal patterns to create context-aware sequential representations for training, validation, and testing. The data source can be the public Amazon review data or the MovieLens dataset. Data point creation can follow SASRec (reference 1).
Step 2: Model Development: Design and implement LLM-enhanced sequential recommendation models, integrating pretrained language models to augment user-item interactions with semantic context. Develop an adaptive mechanism to incorporate external contextual signals, such as product descriptions and reviews, into the sequential recommendation process. The baseline model can be the SASRec PyTorch implementation.
Step 3: Evaluation: Benchmark LLMSeqRec against state-of-the-art sequential recommenders, evaluating accuracy, NDCG, and cold-start performance. Conduct ablation studies to analyze the impact of LLM-generated embeddings on recommendation quality. Optimize model inference speed and efficiency for real-time recommendation scenarios.
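
The data point creation in Step 1 can be sketched as follows, assuming the leave-one-out protocol used in the SASRec paper: each user's last interaction is held out for testing, the second-to-last for validation, and the rest is used for training. The tuple layout `(user, item, timestamp)` and the function names are assumptions for this example, not the project's actual pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interaction = Tuple[str, str, int]  # (user, item, timestamp): assumed layout


def build_sequences(interactions: Iterable[Interaction]) -> Dict[str, List[str]]:
    """Group interactions by user and order each user's items by timestamp."""
    by_user: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for user, item, timestamp in interactions:
        by_user[user].append((timestamp, item))
    return {
        user: [item for _, item in sorted(events, key=lambda e: e[0])]
        for user, events in by_user.items()
    }


def leave_one_out(sequences: Dict[str, List[str]]):
    """Split each sequence into train / validation / test parts."""
    train, valid, test = {}, {}, {}
    for user, seq in sequences.items():
        if len(seq) < 3:
            # Too short to hold anything out: keep everything for training.
            train[user], valid[user], test[user] = list(seq), [], []
        else:
            train[user], valid[user], test[user] = seq[:-2], [seq[-2]], [seq[-1]]
    return train, valid, test
```

Item metadata (titles, descriptions, reviews) would be joined to these sequences afterwards so that an LLM can produce embeddings for each item.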

Project Deliverables

LLMSeqRec

Topics: LLM Enhanced Contextual Sequential Recommender
Skills: Proficiency in Python, PyTorch, GitHub, self-attention, Transformer
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Linsey Pang, Bin Dong

References:

Self-Attentive Sequential Recommendation (SASRec)
BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Amazon Dataset: https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews
MovieLens Data: https://grouplens.org/datasets/movielens/

ReIDMM: Re-identifying Multiple Objects across Multiple Streams

Thu, 06 Feb 2025 10:15:56 -0700

Project Description

Re-identifying multiple objects across multiple streams (ReIDMM) is essential in scientific research and various industries. It involves tracking and analyzing entities across different viewpoints or time frames. In astronomy, ReIDMM helps track celestial objects like asteroids and space debris using multiple observatories. In biology and ecology, it enables the identification of animals across different camera traps and aids in tracking microscopic organisms in laboratory studies. In physics and engineering, it is used for tracking particles in high-energy physics experiments, monitoring structural changes in materials, and identifying robots or drones in lab automation. Beyond scientific applications, ReIDMM plays a critical role in industries such as retail, where it tracks customer behavior across multiple stores to improve sales and prevent theft. In smart cities, it supports traffic monitoring by identifying vehicles across intersections for improved traffic flow management. In manufacturing, it enables supply chain tracking by locating packages across conveyor belts and warehouse cameras. In autonomous systems, ReIDMM enhances multi-camera sensor fusion and warehouse robotics by identifying pedestrians, obstacles, and objects across different camera views.

Project Objectives

Aligned with the vision of the 2025 Open Source Research Experience (OSRE), this project aims to develop an open-source algorithm for multiple-object re-identification across diverse open-source data streams. As highlighted earlier, this method is expected to have wide-ranging applications in both scientific research and industry. Utilizing an open-source dataset, our focus will be on re-identifying common objects such as vehicles and pedestrians. The primary challenge lies in designing a unified algorithm, ReIDMM, capable of performing robust multi-object re-identification across multiple streams. Users will be able to tag any object as a target in a video or image for tracking across streams. Below is an outline of the algorithms to be developed in this project:

Step 1: Target Object Identification: Randomly select a target object from an image or video using object detection models such as YOLOv7. These models detect objects by generating bounding boxes around them. Target objects could include vehicles, pedestrians, animals, or other recognizable entities. This step ensures an initial object of interest is chosen for re-identification.
Step 2: Feature Extraction and Embedding: Once the target object is identified, extract relevant features such as bounding box coordinates, timestamp, location metadata (if available), and visual characteristics. A multimodal embedding approach is used, where these features are transformed into a numerical representation (embedding vector) that captures the object’s unique identity. This allows for efficient comparison across different images or videos.
Step 3: Searching and Matching: To find the target object in other images or videos: (1) extract embeddings of all objects detected in the other images or videos; (2) compute the similarity between the target object’s embedding and those of all detected objects using metrics like cosine similarity or Euclidean distance; (3) rank objects by similarity, returning the most probable matches. The highest-ranked results are likely to be the same object observed from different angles, lighting conditions, or time frames.
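
The matching in Step 3 can be sketched in a few lines, assuming the embeddings from Step 2 are already available as plain lists of floats. The names `cosine_similarity` and `rank_matches` and the dictionary layout are assumptions for this example; a real system would likely use a vectorized library and an approximate nearest-neighbor index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_matches(
    target: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every detected object against the target and return the best top_k."""
    scored = [
        (object_id, cosine_similarity(target, embedding))
        for object_id, embedding in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Swapping in Euclidean distance only changes the scoring function and the sort direction (smaller distance means a better match).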

Project Deliverables

This project will deliver three things: software, evaluation results, and a demo. The software implementing the ReIDMM algorithm described above will be hosted on GitHub as an open-access repository. The evaluation results and demo will be published alongside the GitHub repository.

ReIDMM

Topics: ReIDMM: Re-identifying Multiple Objects across Multiple Streams
Skills: Proficient in Python, experience with image processing and machine learning
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Bin Dong, Linsey Pang

Reference:

Final Report: Stream processing support for FasTensor

Fri, 30 Aug 2024 00:00:00 +0000

Project Description

FasTensor is a scientific computing library specialized in performing computations over dense matrices that exhibit spatial locality, a characteristic often found in physical phenomena data. Our GSoC'24 project aimed to enhance FasTensor by enabling it to ingest and process live data streams from sensors and scientific equipment.

What is FasTensor?

Imagine you’re working on a physical simulation or solving partial differential equations (PDEs). You’ve discretized your PDE, but now you face a new challenge: you need to run your computations fast and parallelize them across massive compute clusters.

At this point, you find yourself describing a stencil [1] operation. But should you really spend your time tinkering with loop orders, data layouts, and countless other side-quests unrelated to your core problem?

This is where FasTensor comes in: Describe your computation as a stencil, and it takes care of ensuring optimal execution. FasTensor lets you focus on the science, not the implementation details.
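
For readers new to the term, here is a toy stencil in Python: each output value is computed from a fixed neighborhood of the input. This is only a sketch of the idea and not FasTensor's API (FasTensor is a C++ library); the names `apply_stencil` and `three_point_average` are made up for this example.

```python
from typing import Callable, List, Sequence


def apply_stencil(
    data: Sequence[float],
    kernel: Callable[[float, float, float], float],
) -> List[float]:
    """Apply a 3-point kernel at every interior point of a 1-D array.

    Boundary values are copied unchanged, which is one of several
    possible boundary treatments.
    """
    out = list(data)
    for i in range(1, len(data) - 1):
        out[i] = kernel(data[i - 1], data[i], data[i + 1])
    return out


def three_point_average(left: float, center: float, right: float) -> float:
    """A simple smoothing kernel: the mean of a cell and its two neighbors."""
    return (left + center + right) / 3.0
```

The user writes only the kernel; deciding how to partition `data`, exchange boundary cells between processes, and order the loops is the part a stencil engine takes off your hands.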

Repository Links

FasTensor: https://github.com/BinDong314/FasTensor
My fork: https://github.com/my-name/FasTensor/tree/ftstream

PR(s)

Work done this summer

Develop Streaming simulator: FTStream

I was first tasked by Dr. Bin Dong with developing a stream simulator for testing the streaming capability of FasTensor. For testing purposes, a stream is characterized by file size, count, and arrival interval. FTStream can generate streams of various sizes and intervals, up to the theoretical limits of the disk and filesystem. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe!

Writing this tool was an adventure in throughput testing and exploring APIs. I wrote multiple drivers, each for a different whim and hijinks of systems in the HPC world. Here’s a brief journey through the APIs we explored:

HDF5 APIs: Pretty fast in flush-to-disk operation, but the API design strongly binds to file handles, which inhibits high throughput duplication.
HDF5 VFL and VOL: We dabbled in these dark arts, but there be dragons! Keeping a long-term view of maintenance, we dropped the idea.
POSIX O_DIRECT: This involved getting your buffers aligned right and handling remainders correctly. A step up, but not quite at the theoretical limits.
Linux AIO: Streaming is a latency-sensitive domain; to reach the theoretical limits, every syscall saved matters. Linux AIO allowed us to batch syscalls with io_submit(). It took a few testing sessions to get the right combination of queue depth, buffer size, and alignment.

We settled on O_DIRECT + Linux AIO. Feel free to modify ftstream/fastflush.h to suit your needs.
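
To give a feel for the "aligned buffers and remainders" bookkeeping that O_DIRECT requires, here is a small arithmetic sketch. It assumes a 4096-byte alignment; the real requirement depends on the device and filesystem, and this is not the code in ftstream/fastflush.h. The function names are made up for this example.

```python
from typing import Tuple

BLOCK_SIZE = 4096  # assumed alignment; real code should query the device


def split_for_direct_io(nbytes: int, block_size: int = BLOCK_SIZE) -> Tuple[int, int]:
    """Split a payload into a block-aligned part and a trailing remainder.

    The aligned part can be written with O_DIRECT; the remainder needs
    separate handling (for example padding, or a buffered write).
    """
    remainder = nbytes % block_size
    return nbytes - remainder, remainder


def padded_length(nbytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Smallest multiple of block_size that can hold nbytes."""
    return -(-nbytes // block_size) * block_size
```

For a 10,000-byte payload this gives an 8,192-byte aligned part plus a 1,808-byte remainder, or a 12,288-byte buffer if you choose to pad instead.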

Stream Support

FasTensor has just one simple paradigm: you give it a data source, an output data store, and your transform, and it handles all the behind-the-scenes grunt work of computing over big datasets so you can focus on your research.

We aimed to achieve the same for streaming: Drop in the STREAM keyword, append a pattern identifying your stream, and use your usual transform.

Voila! Now your previous FasTensor code supports live data streams.

Technical tidbits:

Implements a manager-worker pattern, which gives us the flexibility to implement different stream semantics in the future, such as windowing and CPU- and memory-based load balancing
Supports streams of indefinite size
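
The manager-worker pattern can be sketched with Python threads and a queue. This only illustrates the pattern and is not FasTensor's implementation, which is written in C++ for HPC systems; `run_manager_worker` and its parameters are assumptions for this example.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List


def run_manager_worker(
    chunks: Iterable[Any],
    transform: Callable[[Any], Any],
    num_workers: int = 2,
) -> List[Any]:
    """Manager hands out chunks; workers apply `transform` to each one."""
    tasks: "queue.Queue" = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            item = tasks.get()
            if item is None:  # sentinel: no more work
                break
            index, chunk = item
            value = transform(chunk)
            with lock:
                results[index] = value

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    # The manager consumes the stream lazily, so `chunks` may be a generator.
    for index, chunk in enumerate(chunks):
        tasks.put((index, chunk))
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    return [results[i] for i in sorted(results)]
```

Because the manager is the only component that sees the whole stream, policies such as windowing or load-aware scheduling can be added there without touching the worker code. A truly unbounded stream would emit results as they complete rather than collecting them in a list as this sketch does.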

Challenges

HPC has its fair share of challenges. Things you take for granted might not be available there, and it takes a while to adjust to paradigms of scale and parallelization.

For example, when developing FTStream, we found O_DIRECT is available on some parallel file systems like GPFS but not supported on Lustre/CFS. We developed a separate MPIO driver for FTStream that will be upstreamed once thoroughly tested on Lustre.

Future Work

Implement windowing and explore more advanced stream semantics.
Implement support for defining workload policies.
Optimize interleaving IO and Compute.

References

[1] Anshu Dubey. 2014. Stencils in Scientific Computations. In Proceedings of the Second Workshop on Optimizing Stencil Computations (WOSC ‘14). Association for Computing Machinery, New York, NY, USA, 57. https://doi.org/10.1145/2686745.2686756

Acknowledgement

I struck gold when it comes to mentors.

Dr. Bin Dong was really kind and supportive throughout the journey, from the very first steps of giving a tour around the codebase to giving me a lot of freedom to experiment, refactor, and refine.

Dr. John Wu was encouraging and nurturing of budding talent. We had great research presentations every Monday apart from usual mentor interactions, where different research groups presented their talks and students were invited to present their progress.

I’ve come across Quantum computing many times in the news, but I never thought I’d get a frontline preview from the researchers working at the bleeding edge at the Lawrence Berkeley National Laboratory (LBL).

This GSoC experience, made possible by Google and UC OSPO, has been invaluable for my growth as a developer and researcher.

For people interested in HPC, ML, Systems, or Reproducibility, I encourage you all to apply to UC OSPO. It’s been an incredible journey, and I’m grateful for every moment of it!

Streaming into the Future: Adding Real-Time Processing to FasTensor

Tue, 30 Jul 2024 00:00:00 +0000

Hey there, HPC enthusiasts and fellow coders! I’m excited to share my progress on this summer’s Google Summer of Code project under UC OSPO’s FasTensor. Here’s a glimpse into how we’re pushing the boundaries of real-time data processing.

The Big Picture: FasTensor and HPC Challenges

First, a quick refresher: FasTensor is our go-to tool for handling dense arrays in scientific computing. It tackles three major HPC challenges:

Optimizing computations
Distributing data efficiently
Balancing workloads across computing cores

FasTensor excels at these tasks, especially when dealing with data that has structural locality - a common feature in scientific computing. Here, the Stencil computations come in handy, capturing data locality for operations like solving partial differential equations in physical simulations.

The Mission: Bringing FasTensor into Real-Time

While FasTensor is great at processing existing data, the next frontier is handling live data streams from scientific instruments and sensors. That’s where my GSoC project comes in: adding stream processing capabilities to FasTensor.

Progress Highlights:

Building a Stream Simulator

We’ve created FTStream, a nifty tool that simulates data streams. It can generate streams of various sizes and intervals, pushing the limits of what your disk can handle. We’re talking speeds up to 2.5 GiB/s on a non-parallel NVMe! This tool is crucial because many scientific instruments, from particle accelerators to radio telescopes, generate massive amounts of data at incredible speeds, and we need to be able to simulate that. For context, that’s faster than a 10MP RGB camera shooting at 35 frames per second, which generates data at ~1 GiB/s.
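
The camera comparison can be checked with a quick calculation. The sketch assumes uncompressed frames with 3 bytes per pixel (8 bits per RGB channel), which the post does not state explicitly; the function name is made up for this example.

```python
def camera_rate_gib_per_s(megapixels: float, bytes_per_pixel: int, fps: float) -> float:
    """Raw data rate of an uncompressed camera stream in GiB/s."""
    bytes_per_second = megapixels * 1_000_000 * bytes_per_pixel * fps
    return bytes_per_second / 2**30


# 10 MP, 3 bytes per pixel (assumed 8-bit RGB), 35 frames per second
rate = camera_rate_gib_per_s(10, 3, 35)
print(f"{rate:.2f} GiB/s")  # prints "0.98 GiB/s"
```

So the camera produces just under 1 GiB/s, which the 2.5 GiB/s simulator exceeds by roughly a factor of 2.5.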

Optimizing I/O Strategies

We’ve been experimenting with various I/O approaches to optimize high-speed data stream handling.

Exploring Streaming Semantics

We’re investigating various ways to express and execute stream transformations, to ensure that FasTensor can handle a wide range of streaming computations.

Developing I/O Drivers

We’ve developed two new I/O drivers, based on Linux AIO and MPI-IO, to ingest incoming data smoothly and maintain stream consistency.

What’s Next?

Putting It All Together

We’re in the final stretch of integrating all these components into a seamless stream processing system.

Rigorous Testing

We’ll push our stream processing to its limits, simulating diverse data flows to ensure rock-solid performance in any scientific setting.

HPC Environment Validation

The ultimate test will be running our new streaming capabilities in real HPC environments, checking how they perform with different I/O setups and computing paradigms.

Wrapping Up

This summer has been a whirlwind of coding, testing, and learning. We’re making significant strides in bringing real-time processing capabilities to FasTensor, which could open up exciting new possibilities in scientific computing and data analysis. Stay tuned for more updates as we finalize this feature. If you’re interested in the nitty-gritty technical details or want to check out the code, feel free to reach out or check our project repository. Happy coding, and may your computations be ever faster!

Heterogeneous Graph Neural Networks for I/O Performance Bottleneck Diagnosis

Fri, 14 Jun 2024 00:00:00 +0000

Hello, I am Mahdi Banisharifdehkordi, a Ph.D. student in Computer Science at Iowa State University, specializing in Artificial Intelligence. This summer, I will be working on the project AIIO / Graph Neural Network under the mentorship of Bin Dong and Suren Byna.

High-Performance Computing (HPC) applications often face performance issues due to I/O bottlenecks. Manually identifying these bottlenecks is time-consuming and error-prone. My project aims to enhance the AIIO framework by integrating a Graph Neural Network (GNN) model to automatically diagnose I/O performance bottlenecks at the job level. This involves developing a comprehensive data pre-processing pipeline, constructing and validating a tailored GNN model, and rigorously testing the model’s accuracy using test cases from the AIIO dataset.

Through this project, I seek to provide a sophisticated, AI-driven approach to understanding and improving I/O performance in HPC systems, ultimately contributing to more efficient and reliable HPC applications.

Stream Processing support for FasTensor

Thu, 13 Jun 2024 00:00:00 +0000

Hi, I’m Aditya Narayan 👋

I’m a frequent visitor to the town square of theoretical CS, operations (Ops), and robust high-performance systems. Sometimes I indulge myself with insights on Computing and Biology, and other times I enjoy the accounts of minefield experiences in the systems world. Luckily, this summer, OSRE offered an opportunity that happened to be at the perfect intersection of my interests.

This summer, I will be working on a scientific computing library called FasTensor that offers a parallel computing structure called Stencil, widely popular in the scientific computing world to solve PDEs for Physical Simulations and Convolutions on Signals, among its many uses. I am excited to introduce my mentors, Dr. Bin Dong and Dr. John Wu of the Scientific Data Management Group at Lawrence Berkeley National Laboratory (LBNL). They bring invaluable expertise to the project.

They recognized the need for a tensor processing library that provided dedicated support for big datasets with inherent structural locality, often found in the scientific computing world, which was lacking in popular open-source MapReduce or Key-Value based frameworks.

More often than not, the operations performed on these datasets are composed of computations involving neighboring elements. This motivated the development of the FasTensor library.

I will be working on providing a Stream Processing interface that enables online data processing of large-scale datasets as they arrive from Data Producers. The project focuses on offering rich interfaces for managing and composing streams, supporting common scientific data formats like HDF5, and integrating fault tolerance and reliability mechanisms.

I am thrilled to work on the FasTensor project because I believe it has the potential to make a significant impact by enabling researchers to implement a rich set of computations on their big datasets in an easy and intuitive manner.

After all, FasTensor has just one simple paradigm: A -> Transform(F(x), B),

and it handles all the behind-the-scenes grunt work of handling big datasets so you can focus on your research.

Stay tuned for updates and feel free to collaborate!

AIIO / Graph Neural Network

Wed, 17 Jan 2024 10:15:56 -0700

AIIO (https://github.com/hpc-io/aiio) revolutionizes how users automatically tune the I/O performance of applications on HPC systems. It currently relies on linear regression models but could also exploit heterogeneous data, such as programming information. Doing so requires extending the linear regression model to more complex models, such as heterogeneous graph neural networks. The proposed work includes developing a graph neural network-based model to predict and interpret I/O performance.

AIIO / Graph Neural Network

Topics: AIIO/Graph Neural Network
Skills: Python, GitHub, Machine Learning
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Bin Dong, Suren Byna

The specific tasks of the project include:

Develop the data pre-processing pipeline to convert I/O logs into formats which are required by the Graph Neural Network
Build and test the Graph Neural Network to model the I/O performance for HPC applications.
Test and evaluate the accuracy of the Graph Neural Network with test cases from AIIO
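
As a rough idea of what the pre-processing task involves, the sketch below turns a list of log records into the node and edge lists of a small heterogeneous graph with two node types, jobs and files. The record fields (`job`, `file`, `bytes`) and the relation name `accesses` are hypothetical and do not reflect the actual AIIO log format.

```python
from typing import Dict, Iterable, List, Mapping, Tuple

Edge = Tuple[int, int, float]  # (job index, file index, bytes moved)


def build_hetero_graph(records: Iterable[Mapping[str, object]]):
    """Map job and file names to integer node IDs and collect typed edges."""
    node_ids: Dict[str, Dict[str, int]] = {"job": {}, "file": {}}
    edges: List[Edge] = []
    for record in records:
        jobs, files = node_ids["job"], node_ids["file"]
        # setdefault assigns the next free index the first time a name is seen.
        job_index = jobs.setdefault(str(record["job"]), len(jobs))
        file_index = files.setdefault(str(record["file"]), len(files))
        edges.append((job_index, file_index, float(record["bytes"])))
    return node_ids, {("job", "accesses", "file"): edges}
```

Graph libraries for heterogeneous GNNs typically expect this kind of per-type node indexing plus one edge list per relation, so a structure like this is a natural hand-off point to model code.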

FasTensor / Stream Processing

Wed, 17 Jan 2024 10:15:56 -0700

FasTensor (https://github.com/BinDong314/FasTensor) is a generic tensor processing engine that scales from single nodes to thousands of nodes on HPC systems. FasTensor supports applications ranging from traditional SQL queries to complex DFT solvers in scientific applications. It has a 1000X performance advantage over MapReduce and Spark in supporting generic data processing functions on tensor structures. In this project, we propose to expand FasTensor with streaming functionality to support online data processing. Specifically, participants in this project will develop a stream endpoint for retrieving live data output from applications such as DAS. The stream endpoint maintains a pointer to the data, which could be an n-dimensional subset of a tensor.

FasTensor / Stream Processing

Topics: FasTensor/Streaming Processing
Skills: C++, GitHub
Difficulty: Difficult
Size: Large (350 hours)
Mentor: Bin Dong, John Wu

The specific tasks of the project include:

Building a mock workflow based on our DAS application (https://github.com/BinDong314/DASSA) to test stream processing. The mock workflow comprises a data producer, which generates DAS data, and a data consumer, which processes the data.
Developing a Stream Endpoint (e.g., an I/O driver) to iteratively read dynamically growing data from a directory. The stream endpoint essentially includes open, read, and write functions, plus a pointer that remembers the current file position.
Integrating the Stream Endpoint into the FasTensor library.
Evaluating the performance of the mock workflow with the new Stream Endpoint.
Documenting the execution mechanism.
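
The Stream Endpoint idea can be sketched as a small class that remembers which files it has already consumed and returns only the new ones on each read. This is a Python illustration of the concept, not the C++ I/O driver the project calls for; the class name `DirectoryStreamEndpoint` and its method are made up for this example.

```python
import os
from typing import List, Set, Tuple


class DirectoryStreamEndpoint:
    """Iteratively read files that appear in a growing directory."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self.seen: Set[str] = set()  # plays the role of the stream pointer

    def read_new(self) -> List[Tuple[str, bytes]]:
        """Return (name, contents) for files added since the previous call."""
        new_names = sorted(
            name
            for name in os.listdir(self.directory)
            if name not in self.seen
            and os.path.isfile(os.path.join(self.directory, name))
        )
        batch = []
        for name in new_names:
            with open(os.path.join(self.directory, name), "rb") as handle:
                batch.append((name, handle.read()))
            self.seen.add(name)
        return batch
```

A consumer would call `read_new()` in a loop, sleeping between polls. This sketch assumes the producer makes each file visible only once it is complete (for example by writing to a temporary name and renaming); otherwise the endpoint could pick up partially written files.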

FasTensor

Mon, 07 Nov 2022 10:15:56 -0700

FasTensor is a parallel execution engine for user-defined functions on multidimensional arrays. The user-defined functions follow the stencil metaphor used in scientific computing and are effective for expressing a wide range of data analysis computations, including common aggregation operations from database management systems and advanced machine learning pipelines. The FasTensor execution engine exploits the structural locality of multidimensional arrays to automate data management operations such as file I/O, data partitioning, communication, and parallel execution.

Tensor execution engine on GPU

Topics: Data Management, Analytics
Skills: C++, GitHub
Difficulty: Difficult
Size: Large (350 hours)
Mentor: John Wu, Bin Dong, Suren Byna

Tensor-based computing is needed by scientific applications and, increasingly, by advanced AI model training. Most tensor libraries are hand-customized and optimized for GPUs, and most of them serve only one kind of application. For example, TensorFlow is optimized only for AI model training. Optimizing generic tensor computing libraries on GPUs can benefit a wide range of applications. FasTensor, as a generic tensor computing library, currently works efficiently only on CPUs. How to run FasTensor on GPUs is still unexplored. Research and development challenges include, but are not limited to: 1) how to maintain the structural locality of tensor data on GPU; 2) how to reduce the performance loss when the structural locality of a tensor is broken on GPU.

Develop a mechanism to move user-defined computing kernels onto GPU
Evaluate the performance of the execution engine
Document the execution mechanism
Develop performance testing suite

Continuous Integration

Topics: Data Management, Analytics
Skills: C++, GitHub
Difficulty: Medium
Size: Large (300 hours)
Mentor: John Wu, Bin Dong, Suren Byna

Develop a test suite for the public API of FasTensor
Automate execution of the test suite
Document the continuous integration process