<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Luanzheng "Lenny" Guo | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/index.xml" rel="self" type="application/rss+xml"/><description>Luanzheng "Lenny" Guo</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/avatar_hu3b26cb84cddba6ce485d3f67769a7aa3_62486_270x270_fill_q75_lanczos_center.jpg</url><title>Luanzheng "Lenny" Guo</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/</link></image><item><title>Reproducible CXL Emulation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucmerced/cxl_emu/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucmerced/cxl_emu/</guid><description>&lt;p>Compute Express Link (CXL) is an emerging memory interconnect standard that enables shared, coherent memory across CPUs, accelerators, and multiple hosts, unlocking new possibilities in hyperscale, HPC, and disaggregated systems. However, because access to real multi-host CXL hardware is limited, it is difficult for researchers and students to experiment with, evaluate, and reproduce results on advanced CXL topologies.
&lt;a href="https://github.com/cxl-emu/OCEAN" target="_blank" rel="noopener">OCEAN&lt;/a> (Open-source CXL Emulation At Hyperscale) is a full-stack CXL emulation platform built on QEMU that enables detailed emulation of CXL 3.0 memory systems, including multi-host shared memory pools, coherent fabric topologies, and latency modeling. This project will create reproducible experiment pipelines, automated deployment workflows, and user-friendly tutorials so that others can reliably run and extend CXL emulation experiments without requiring specialized hardware.&lt;/p>
&lt;h3 id="reproducible-cxl-emulation-for-multi-host-memory-systems">Reproducible CXL Emulation for Multi-Host Memory Systems&lt;/h3>
&lt;p>Streamline multi-host CXL emulation without specialized hardware.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>CXL Emulation&lt;/code> &lt;code>Memory Systems&lt;/code> &lt;code>Reproducibility&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Virtualization (QEMU), Scripting, Performance Modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrafi@ucmerced.edu">Mujahid Al Rafi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Create automated deployment scripts and configuration templates for OCEAN-based CXL emulation topologies (single-host and multi-host).&lt;/li>
&lt;li>Develop a standardized experiment harness for running memory performance benchmarks (e.g., OSU micro-benchmarks, STREAM-style tests) in emulated CXL environments.&lt;/li>
&lt;li>Build reproducible experiment pipelines that others can run to evaluate latency, bandwidth, and scaling properties of CXL memory systems.&lt;/li>
&lt;li>Produce tutorials, documentation, and reproducibility artifacts to guide new users through setup, execution, and analysis.&lt;/li>
&lt;li>Package and contribute all scripts, configurations, and documentation back to the OCEAN open-source repository.&lt;/li>
&lt;/ul>
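As a sketch of what the deployment-script task might produce, the snippet below generates a QEMU command line for a minimal single-host CXL topology with one emulated type-3 memory expander. The flag names follow upstream QEMU's CXL documentation; OCEAN's actual launch wrappers may differ, so treat this as an illustrative template rather than OCEAN's interface.

```python
# Deployment-script helper sketch: build argv for a QEMU guest exposing one
# emulated CXL type-3 memory device. Device/option names follow the QEMU CXL
# docs; paths and sizes are placeholders.

def build_qemu_cxl_cmd(memdev_path="/tmp/cxl-mem0.raw",
                       lsa_path="/tmp/cxl-lsa0.raw", mem_size="4G"):
    """Return the argv list for a single-host CXL emulation guest."""
    argv = [
        "qemu-system-x86_64",
        "-machine", "q35,cxl=on",              # enable CXL on the q35 machine
        "-m", "4G", "-smp", "4",
        # Host-file-backed memory that becomes the CXL device's media,
        # plus a label storage area (LSA), as in the QEMU pmem example.
        "-object", "memory-backend-file,id=cxl-mem0,share=on,"
                   "mem-path=" + memdev_path + ",size=" + mem_size,
        "-object", "memory-backend-file,id=cxl-lsa0,share=on,"
                   "mem-path=" + lsa_path + ",size=" + mem_size,
        # CXL host bridge, root port, and the type-3 memory expander.
        "-device", "pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1",
        "-device", "cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2",
        "-device", "cxl-type3,bus=root_port13,persistent-memdev=cxl-mem0,"
                   "lsa=cxl-lsa0,id=cxl-pmem0",
        # Fixed memory window the guest uses to map the device.
        "-M", "cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=" + mem_size,
    ]
    return argv

print(" ".join(build_qemu_cxl_cmd()))
```

A multi-host configuration template would extend this by pointing several guests' memory backends at the same shared host file.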
&lt;h3 id="exploring-security-and-isolation-in-cxl-based-memory-systems">Exploring Security and Isolation in CXL-Based Memory Systems&lt;/h3>
&lt;p>Investigate security and isolation properties of CXL-based memory systems using software emulation.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>CXL Systems&lt;/code> &lt;code>Security&lt;/code> &lt;code>Memory Isolation&lt;/code> &lt;code>Side Channel&lt;/code> &lt;code>Emulation&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Virtualization (QEMU), Scripting, Computer Architecture, Security&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrafi@ucmerced.edu">Mujahid Al Rafi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Study the CXL memory model and fabric architecture to identify potential security and isolation risks in multi-host shared memory environments (e.g., contention, timing variation, and resource interference).&lt;/li>
&lt;li>Set up multi-host or multi-VM CXL emulation environments using OCEAN that mimic realistic multi-tenant deployments.&lt;/li>
&lt;li>Design and implement reproducible micro-benchmarks to measure timing, bandwidth contention, or observable interference through shared CXL memory pools.&lt;/li>
&lt;li>Analyze how fabric configuration choices (e.g., topology, latency injection, memory partitioning, or allocation policies) affect isolation and leakage behavior.&lt;/li>
&lt;li>Explore and prototype mitigation strategies—such as memory partitioning, throttling, or policy-driven allocation—and evaluate their effectiveness using the emulation platform.&lt;/li>
&lt;/ul></description></item><item><title>AI for Science: Automating Domain Specific Tasks with Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</link><pubDate>Sun, 23 Feb 2025 21:30:56 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</guid><description>&lt;p>Recent advancements in Large Language Models (LLMs) have transformed various fields by demonstrating remarkable capabilities in processing and generating human-like text. This project aims to explore the development of an open-source framework that leverages LLMs to enhance discovery across specialized domains.&lt;/p>
&lt;p>The proposed framework will enable LLMs to analyze and interpret complex datasets, automate routine tasks, and uncover novel insights. A key focus will be on equipping LLMs with domain-specific expertise, particularly in areas where specialized tools &amp;ndash; such as ANDES &amp;ndash; are not widely integrated with LLM-based solutions. By bridging this gap, the framework will empower researchers and professionals to harness LLMs as intelligent assistants capable of navigating and utilizing niche computational tools effectively.&lt;/p>
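The tool-interoperability idea can be sketched as a small registry layer: domain tools are registered with a description the LLM can see, and the model's tool-call requests are dispatched to real functions. All names below (such as run_power_flow) are hypothetical stand-ins for ANDES entry points, not its actual API.

```python
# Minimal sketch of a tool-integration layer for an LLM framework.
# Tools are registered with a natural-language description; a model's
# tool-call request (name + arguments) is dispatched to the wrapped function.

TOOLS = {}

def register_tool(name, description, func):
    """Expose a domain function to the LLM as a callable tool."""
    TOOLS[name] = {"description": description, "func": func}

def dispatch(tool_call):
    """Execute a tool call of the form {'name': ..., 'arguments': {...}}."""
    tool = TOOLS[tool_call["name"]]
    return tool["func"](**tool_call["arguments"])

# Hypothetical wrapper around a power-system simulation step (stub result).
def run_power_flow(case_file):
    return {"case": case_file, "converged": True}

register_tool("run_power_flow",
              "Solve the power-flow equations for a grid case file.",
              run_power_flow)

result = dispatch({"name": "run_power_flow",
                   "arguments": {"case_file": "ieee14.xlsx"}})
print(result)
```

In the actual framework, the registry's descriptions would be rendered into the model's prompt or tool schema so the LLM can decide when to call each tool.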
&lt;h3 id="ai-for-science-automating-domain-specific-tasks-with-large-language-models">AI for Science: Automating Domain Specific Tasks with Large Language Models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>AI for Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Experience with LLMs, Prompt Engineering, Fine-Tuning, LLM Frameworks&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium-Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Designing an extensible framework that facilitates the integration of LLMs with specialized software and datasets.&lt;/li>
&lt;li>Developing methodologies for fine-tuning LLMs to act as domain experts.&lt;/li>
&lt;li>Implementing strategies for improving tool interoperability, allowing LLMs to interact seamlessly with less commonly used but critical analytical platforms.&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in Distributed AI Training: Leveraging Checkpointing and Metadata Analytics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</link><pubDate>Fri, 21 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</guid><description>&lt;p>Reproducibility in distributed AI training is a crucial challenge due to several sources of uncertainty, including stragglers, data variability, and inherent randomness. Stragglers—slower processing nodes in a distributed system—can introduce timing discrepancies that affect the synchronization of model updates, leading to inconsistent states across training runs. Data variability, stemming from non-deterministic data shuffling and differing data partitions across nodes, can also lead to variations in model performance. Additionally, inherent randomness in algorithm initialization, such as random weight initialization and stochastic processes like dropout, further compounds these challenges. Reproducibility in AI is pivotal for ensuring the credibility of AI-driven scientific findings, akin to how reproducibility underpins traditional scientific research.&lt;/p>
&lt;p>To enhance AI reproducibility, leveraging metadata analytics and visualization along with saved checkpoints offers a promising solution. Checkpointing in AI training is a pivotal technique that involves saving snapshots of a model and its parameters at regular intervals throughout the training process. This practice is essential for maintaining progress in the face of potential interruptions, such as hardware failures, and enables the resumption of training without having to restart from scratch. In the context of distributed AI training, checkpointing also provides a framework for analyzing and ensuring reproducibility, offering a means to systematically capture and review the training trajectory of models. Analyzing checkpoints can specifically help identify issues like stragglers, which are slower computing nodes in a distributed system that can impede synchronized progress. For example, by examining the time stamps and resource utilization data associated with each checkpoint, anomalies in processing time can be detected, revealing nodes that consistently lag behind others. This analysis enables teams to diagnose performance bottlenecks and optimize resource allocation across the distributed system, ensuring smoother and more consistent training runs. By combining checkpointing with metadata analytics, it becomes possible to pinpoint the exact training iterations where delays occur, thereby facilitating targeted investigations and solutions to improve overall system reproducibility and efficiency.&lt;/p>
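The straggler analysis described above can be sketched in a few lines: given per-node step timings recorded alongside each checkpoint, flag nodes whose mean step time is far above the cluster average. The metadata layout here is illustrative, not a fixed format.

```python
# Sketch of checkpoint-metadata analysis: detect straggler nodes from the
# per-node step times saved with each checkpoint.
from statistics import mean

def find_stragglers(checkpoint_meta, slowdown=1.5):
    """checkpoint_meta maps node_id to a list of step times (seconds).
    Returns the sorted ids of nodes whose mean step time exceeds
    `slowdown` times the cluster-wide average."""
    per_node_avg = {node: mean(times) for node, times in checkpoint_meta.items()}
    cluster_avg = mean(per_node_avg.values())
    return sorted(node for node, avg in per_node_avg.items()
                  if avg > slowdown * cluster_avg)

meta = {
    "node0": [1.0, 1.1, 0.9],
    "node1": [1.0, 1.0, 1.1],
    "node2": [2.9, 3.0, 3.1],   # consistently lags behind the others
}
print(find_stragglers(meta))     # → ['node2']
```

In practice the timing data would be read from the checkpoint files' saved metadata rather than an in-memory dict, but the analysis step is the same.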
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: 1) Setting up a checkpointing system within the distributed AI training framework to periodically save model states and metadata; 2) Designing a metadata analysis schema for populating model and system statistics from the saved checkpoints; 3) Conducting exploratory data analysis to identify patterns, anomalies, and sources of variability in the training process; 4) Creating visualization tools to represent metadata insights with collected statistics and patterns; 5) Using insights from metadata analytics and visualization to optimize resource distribution across the distributed system and mitigate straggler effects; and 6) Disseminating results and methodologies through academic papers, workshops, and open-source contributions.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>AI&lt;/code> &lt;code>distributed AI&lt;/code> &lt;code>checkpoint&lt;/code> &lt;code>metadata analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in RAG Frameworks for Scientific Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</link><pubDate>Thu, 20 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</guid><description>&lt;p>Retrieval-Augmented Generation (RAG) frameworks, which merge the capabilities of retrieval systems and generative models, significantly enhance the relevance and accuracy of responses produced by large language models (LLMs). These frameworks retrieve relevant documents from a large corpus and use these documents to inform the generative process, thereby improving the contextuality and precision of the generated content. Ensuring reproducibility in data queries using similarity search within these RAG frameworks is critical for maintaining the reliability and consistency of scientific workflows. Reproducibility ensures that the same input query consistently yields the same output, which is vital for scientific tasks that rely on precise and repeatable results. Inconsistencies can arise from various sources, affecting the trustworthiness of scientific outcomes. Differences in retrieval algorithms can lead to variable sets of documents being retrieved for the same query. Variations in data indexing methods can cause inconsistencies in how documents are ranked and accessed. The stochastic nature of LLM operations introduces an element of randomness in the generative process. Updates in datasets can also alter the baseline against which queries are processed and interpreted, leading to different results over time.&lt;/p>
&lt;p>This proposal aims to address these reproducibility challenges in similarity searches within RAG frameworks. This work involves analyzing the root causes of non-determinism, benchmarking and validating the consistency of query results, implementing enhancements to minimize variability, and developing tools and best practices to ensure reproducibility. Reproducibility in data queries can be influenced by several factors, including updates in datasets, differences in retrieval algorithms, varying data indexing methods, and the stochastic nature of LLM operations. Each of these factors can cause variability in the documents retrieved and in the generated responses. Ensuring consistency in query results across different runs is crucial for maintaining the integrity of LLM-driven scientific research, allowing researchers to confidently build upon prior work and achieve reliable, trustworthy outcomes.&lt;/p>
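One concrete, easily fixed source of non-determinism is tie-breaking: when two documents score equally for a query, their order can vary across runs or index builds. A reproducible retriever breaks ties on a stable key such as the document id. The toy exact-search sketch below illustrates the idea; real RAG stacks use approximate-nearest-neighbor indexes, where the same principle applies at the ranking stage.

```python
# Deterministic top-k similarity search: ties are broken by doc id,
# so repeated runs over the same corpus return the same ordering.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query, corpus, k=2):
    """corpus maps doc_id to an embedding vector."""
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in corpus.items()]
    # Sort by descending score, then ascending doc_id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:k]]

corpus = {"doc_b": [1.0, 0.0], "doc_a": [1.0, 0.0], "doc_c": [0.0, 1.0]}
print(top_k([1.0, 0.0], corpus))   # doc_a and doc_b tie; doc_a wins by id
```

Without the secondary sort key, the tied pair's order would depend on dict iteration order and could differ between index builds.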
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: (1) Identifying sources of non-determinism and variability, such as algorithmic differences and indexing methods, in RAG; (2) Utilizing standardized scientific datasets to benchmark the reproducibility of similarity search results across different RAG frameworks; (3) Establishing protocols for handling dataset updates to ensure that such changes do not impact the reproducibility of similarity search results; and (4) Implementing mechanisms to track and document updates to datasets, ensuring that changes are reflected consistently across all instances of the RAG framework. By addressing these areas, the proposed work aims to mitigate challenges related to reproducibility in similarity search queries within RAG frameworks, ultimately enhancing the reliability and trustworthiness of scientific research outcomes.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>LLM&lt;/code> &lt;code>RAG&lt;/code> &lt;code>Scientific Workflows&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Exploration of I/O Reproducibility with HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</link><pubDate>Wed, 19 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</guid><description>&lt;p>Parallel I/O is a critical component in high-performance computing (HPC), allowing multiple processes to read and write data concurrently from a shared storage system. &lt;a href="https://github.com/HDFGroup/hdf5" target="_blank" rel="noopener">HDF5&lt;/a>—a widely adopted data model and library for managing complex scientific data—supports parallel I/O but introduces challenges in I/O reproducibility, where repeated executions do not always produce identical results. This lack of reproducibility can stem from non-deterministic execution orders, variations in collective buffering strategies, and race conditions in metadata and dataset chunking operations within HDF5’s parallel I/O hierarchy. Moreover, many HDF5 operations that leverage &lt;a href="%28https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf%29">MPI I/O&lt;/a> require collective communication; that is, all processes within a communicator must participate in operations such as metadata creation, chunk allocation, and data aggregation. These collective calls ensure that the file structure and data layout remain consistent across processes, but they also introduce additional synchronization complexity that can impact reproducibility if not properly managed. In HPC scientific workflows, consistent I/O reproducibility is essential for accurate debugging, validation, and benchmarking, ensuring that scientific results are both verifiable and trustworthy. 
Tools such as &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a>—a suite of I/O kernels designed to exercise HDF5 I/O on parallel file systems—play an important role in identifying these reproducibility challenges, tuning performance, and ultimately supporting the overall robustness of large-scale scientific applications.&lt;/p>
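A lightweight way to check I/O reproducibility across repeated parallel runs is to hash each dataset in a canonical (index) order and compare digests between runs. The sketch below uses plain Python lists to stay self-contained; with HDF5 one would read each dataset back (for example via h5py) and hash its bytes the same way.

```python
# Reproducibility check sketch: checksum dataset contents in canonical order
# so two runs can be compared regardless of which rank wrote which chunk.
import hashlib
import struct

def dataset_digest(values):
    """Checksum a sequence of floats, encoded in a fixed byte order."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("!d", v))   # fixed big-endian double encoding
    return h.hexdigest()

run1 = [0.1 * i for i in range(1000)]    # dataset contents from run A
run2 = [0.1 * i for i in range(1000)]    # run B, possibly written by other ranks
print(dataset_digest(run1) == dataset_digest(run2))   # identical data → True
```

Digest mismatches between runs then localize which datasets were affected by ordering or buffering differences, which is exactly the signal the benchmarking step needs.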
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing parallel I/O operations in &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">HDF5&lt;/a> with &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> miniapps, (2) exploring and validating potential reproducibility challenges within the parallel I/O hierarchy (e.g., MPI I/O), and (3) implementing solutions to address parallel I/O reproducibility.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Parallel I/O&lt;/code> &lt;code>MPI-I/O&lt;/code> &lt;code>Reproducibility&lt;/code> &lt;code>HPC&lt;/code> &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/wei-zhang/">Wei Zhang&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Smart Batching for Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</guid><description>&lt;p>Sequence tokenization is a crucial step during Large Language Model training, fine-tuning, and inference. User prompts and training data are tokenized and zero-padded before being fed to the model in batches. This process allows models to interpret human language by breaking down complex sentences into simple token units that are numerically represented in a token set. However, the process of sequence padding for maintaining batch dimensions can introduce unnecessary overhead if batching is not properly done.&lt;/p>
&lt;p>In this project, we introduce Smart Batching, which dynamically groups the sequences in a fine-tuning dataset into batches by length. This minimizes the zero padding required during batching, which can improve fine-tuning and inference speed. We also compare this method against other commonly used batching practices (Longest Sequence, Random Shuffling) on key metrics such as runtime and model accuracy.&lt;/p>
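The core idea can be sketched in a few lines: sort sequences by token length, cut batches from the sorted order, and pad each batch only to its own longest sequence. Helper names here are illustrative; an actual implementation would plug into a HuggingFace data collator and track original indices so labels stay aligned.

```python
# Smart Batching sketch: length-sorted batching versus arrival-order batching,
# compared by the fraction of batch cells wasted on zero padding.

def pad_groups(groups):
    """Zero-pad each group of sequences to that group's own max length."""
    out = []
    for g in groups:
        width = max(len(s) for s in g)
        out.append([s + [0] * (width - len(s)) for s in g])
    return out

def smart_batches(seqs, batch_size):
    """Sort by length, then cut batches from the sorted order."""
    order = sorted(seqs, key=len)
    return pad_groups([order[i:i + batch_size]
                       for i in range(0, len(order), batch_size)])

def padding_overhead(seqs, batches):
    """Fraction of batch cells that are padding rather than real tokens."""
    cells = sum(len(b) * len(b[0]) for b in batches)
    return 1 - sum(len(s) for s in seqs) / cells

seqs = [[1] * n for n in (2, 9, 3, 8)]          # token sequences of mixed length
smart = smart_batches(seqs, 2)
naive = pad_groups([seqs[0:2], seqs[2:4]])      # arrival-order batching
print(round(padding_overhead(seqs, smart), 3))  # → 0.083
print(round(padding_overhead(seqs, naive), 3))  # → 0.353
```

Grouping the length-2 and length-3 sequences together (and the 8 with the 9) keeps each batch nearly padding-free, which is exactly the overhead the project aims to measure at scale.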
&lt;h3 id="project-title">Project Title&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>Fine-Tuning&lt;/code> &lt;code>AI&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, PyTorch, Large Language Models&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Implement an open source smart batching framework based on HuggingFace to allow for dynamically grouping sequences of similar token lengths into batches&lt;/li>
&lt;li>Analyze runtime, padding, and model accuracy with smart batching and other commonly used batching practices&lt;/li>
&lt;li>Apply smart batching with distributed fine-tuning and observe large language model outputs&lt;/li>
&lt;/ul></description></item></channel></rss>