<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PNNL | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/pnnl/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/pnnl/index.xml" rel="self" type="application/rss+xml"/><description>PNNL</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 30 Jan 2026 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>PNNL</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/pnnl/</link></image><item><title>Reproducible CXL Emulation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucmerced/cxl_emu/</link><pubDate>Fri, 30 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/ucmerced/cxl_emu/</guid><description>&lt;p>Compute Express Link (CXL) is an emerging memory interconnect standard that enables shared, coherent memory across CPUs, accelerators, and multiple hosts, unlocking new possibilities in hyperscale, HPC, and disaggregated systems. However, because access to real multi-host CXL hardware is limited, it is difficult for researchers and students to experiment with, evaluate, and reproduce results on advanced CXL topologies.
&lt;a href="https://github.com/cxl-emu/OCEAN" target="_blank" rel="noopener">OCEAN&lt;/a> (Open-source CXL Emulation At Hyperscale) is a full-stack CXL emulation platform built on QEMU that enables detailed emulation of CXL 3.0 memory systems, including multi-host shared memory pools, coherent fabric topologies, and latency modeling. This project will create reproducible experiment pipelines, automated deployment workflows, and user-friendly tutorials so that others can reliably run and extend CXL emulation experiments without requiring specialized hardware.&lt;/p>
&lt;h3 id="reproducible-cxl-emulation-for-multi-host-memory-systems">Reproducible CXL Emulation for Multi-Host Memory Systems&lt;/h3>
&lt;p>Streamline multi-host CXL emulation without specialized hardware.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>CXL emulation&lt;/code> &lt;code>Memory Systems&lt;/code> &lt;code>Reproducibility&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Virtualization (QEMU), Scripting, Performance Modeling&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrafi@ucmerced.edu">Mujahid Al Rafi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Create automated deployment scripts and configuration templates for OCEAN-based CXL emulation topologies (single-host and multi-host).&lt;/li>
&lt;li>Develop a standardized experiment harness for running memory performance benchmarks (e.g., OSU micro-benchmarks, STREAM-style tests) in emulated CXL environments.&lt;/li>
&lt;li>Build reproducible experiment pipelines that others can run to evaluate latency, bandwidth, and scaling properties of CXL memory systems.&lt;/li>
&lt;li>Produce tutorials, documentation, and reproducibility artifacts to guide new users through setup, execution, and analysis.&lt;/li>
&lt;li>Package and contribute all scripts, configurations, and documentation back to the OCEAN open-source repository.&lt;/li>
&lt;/ul>
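&lt;p>As a concrete point of reference for the deployment-script task above, here is a minimal Python sketch (not OCEAN's actual tooling) that assembles and launches a single-host QEMU guest with one emulated CXL type-3 memory device. The device flags follow upstream QEMU's CXL documentation, on which OCEAN builds; exact option names vary across QEMU versions, and every path, size, and image name below is a placeholder.&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal sketch (not OCEAN's actual tooling): assemble and launch a
# single-host QEMU guest with one emulated CXL type-3 memory device.
# Flags follow upstream QEMU's CXL documentation and can vary by QEMU
# version; all paths, sizes, and image names are placeholders.
import subprocess

def qemu_cxl_cmd(image="guest.qcow2",
                 memdev="/tmp/cxl-mem0.raw", lsa="/tmp/cxl-lsa0.raw"):
    return [
        "qemu-system-x86_64",
        "-machine", "q35,cxl=on",            # enable CXL on the machine model
        "-m", "4G", "-smp", "4",
        "-drive", f"file={image},format=qcow2",
        # Backing files for the type-3 device's memory and label storage area.
        "-object", f"memory-backend-file,id=cxl-mem0,share=on,mem-path={memdev},size=256M",
        "-object", f"memory-backend-file,id=cxl-lsa0,share=on,mem-path={lsa},size=256M",
        # CXL host bridge -> root port -> type-3 memory expander.
        "-device", "pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1",
        "-device", "cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=2",
        "-device", "cxl-type3,bus=root_port0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0",
        # Fixed memory window routed to the host bridge.
        "-machine", "cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G",
    ]

if __name__ == "__main__":
    subprocess.run(qemu_cxl_cmd(), check=True)
&lt;/code>&lt;/pre>
&lt;p>A real deployment template would render these arguments from a topology description (number of hosts, root ports, window sizes) rather than hard-coding them, which is exactly what the configuration-template task targets.&lt;/p>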
&lt;h3 id="exploring-security-and-isolation-in-cxl-based-memory-systems">Exploring Security and Isolation in CXL-Based Memory Systems&lt;/h3>
&lt;p>Investigate security and isolation properties of CXL-based memory systems using software emulation.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>CXL Systems&lt;/code> &lt;code>Security&lt;/code> &lt;code>Memory Isolation&lt;/code> &lt;code>Side Channel&lt;/code> &lt;code>Emulation&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Virtualization (QEMU), Scripting, Computer Architecture, Security&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:mrafi@ucmerced.edu">Mujahid Al Rafi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>.&lt;/li>
&lt;/ul>
&lt;p>Tasks:&lt;/p>
&lt;ul>
&lt;li>Study the CXL memory model and fabric architecture to identify potential security and isolation risks in multi-host shared memory environments (e.g., contention, timing variation, and resource interference).&lt;/li>
&lt;li>Set up multi-host or multi-VM CXL emulation environments using OCEAN that mimic realistic multi-tenant deployments.&lt;/li>
&lt;li>Design and implement reproducible micro-benchmarks to measure timing, bandwidth contention, or observable interference through shared CXL memory pools.&lt;/li>
&lt;li>Analyze how fabric configuration choices (e.g., topology, latency injection, memory partitioning, or allocation policies) affect isolation and leakage behavior.&lt;/li>
&lt;li>Explore and prototype mitigation strategies—such as memory partitioning, throttling, or policy-driven allocation—and evaluate their effectiveness using the emulation platform.&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in Distributed AI Training: Leveraging Checkpointing and Metadata Analytics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</link><pubDate>Fri, 21 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/reproducibility_w_checkpoint/</guid><description>&lt;p>Reproducibility in distributed AI training is a crucial challenge due to several sources of uncertainty, including stragglers, data variability, and inherent randomness. Stragglers—slower processing nodes in a distributed system—can introduce timing discrepancies that affect the synchronization of model updates, leading to inconsistent states across training runs. Data variability, stemming from non-deterministic data shuffling and differing data partitions across nodes, can also lead to variations in model performance. Additionally, inherent algorithmic randomness, such as random weight initialization and stochastic operations like dropout, further compounds these challenges. Reproducibility in AI is pivotal for ensuring the credibility of AI-driven scientific findings, akin to how reproducibility underpins traditional scientific research.&lt;/p>
&lt;p>To enhance AI reproducibility, leveraging metadata analytics and visualization along with saved checkpoints offers a promising solution. Checkpointing in AI training is a pivotal technique that involves saving snapshots of a model and its parameters at regular intervals throughout the training process. This practice is essential for maintaining progress in the face of potential interruptions, such as hardware failures, and enables the resumption of training without having to restart from scratch. In the context of distributed AI training, checkpointing also provides a framework for analyzing and ensuring reproducibility, offering a means to systematically capture and review the training trajectory of models. Analyzing checkpoints can specifically help identify issues like stragglers, which are slower computing nodes in a distributed system that can impede synchronized progress. For example, by examining the time stamps and resource utilization data associated with each checkpoint, anomalies in processing time can be detected, revealing nodes that consistently lag behind others. This analysis enables teams to diagnose performance bottlenecks and optimize resource allocation across the distributed system, ensuring smoother and more consistent training runs. By combining checkpointing with metadata analytics, it becomes possible to pinpoint the exact training iterations where delays occur, thereby facilitating targeted investigations and solutions to improve overall system reproducibility and efficiency.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: 1) Setting up a checkpointing system within the distributed AI training framework to periodically save model states and metadata; 2) Designing a metadata analysis schema for populating model and system statistics from the saved checkpoints; 3) Conducting exploratory data analysis to identify patterns, anomalies, and sources of variability in the training process; 4) Creating visualization tools to represent metadata insights with collected statistics and patterns; 5) Using insights from metadata analytics and visualization to optimize resource distribution across the distributed system and mitigate straggler effects; and 6) Disseminating results and methodologies through academic papers, workshops, and open-source contributions.&lt;/p>
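&lt;p>To make workplan steps 1-3 concrete, the sketch below shows one possible shape for the checkpoint-plus-metadata path: each rank saves its state snapshot alongside a small JSON record of timing metadata, and a simple analysis pass flags stragglers. This is an illustrative, framework-agnostic sketch; the function names, file layout, and two-sigma threshold are assumptions, not part of an existing codebase.&lt;/p>
&lt;pre>&lt;code class="language-python"># Minimal, framework-agnostic sketch: persist per-rank checkpoints with
# timing metadata (workplan step 1), then flag stragglers from that
# metadata (steps 2-3). All names and the file layout are illustrative.
import json, os, time
from statistics import mean, stdev

def save_checkpoint_with_metadata(step, rank, state_bytes, iter_seconds, outdir="ckpts"):
    os.makedirs(outdir, exist_ok=True)
    with open(f"{outdir}/step{step}_rank{rank}.bin", "wb") as f:
        f.write(state_bytes)                  # serialized model/optimizer state
    meta = {"step": step, "rank": rank,
            "wall_time": time.time(),         # when this rank reached the checkpoint
            "iter_seconds": iter_seconds}     # time spent since the previous checkpoint
    with open(f"{outdir}/step{step}_rank{rank}.json", "w") as f:
        json.dump(meta, f)

def find_stragglers(metas, threshold=2.0):
    """Flag ranks whose iteration time sits more than `threshold`
    standard deviations above the mean for one checkpoint step."""
    times = [m["iter_seconds"] for m in metas]
    mu, sigma = mean(times), stdev(times)     # needs metadata from 2+ ranks
    return sorted(m["rank"] for m in metas if m["iter_seconds"] > mu + threshold * sigma)
&lt;/code>&lt;/pre>
&lt;p>In a full system, the metadata record would also carry resource-utilization counters, as described above, feeding the analysis and visualization steps of the workplan.&lt;/p>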
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>AI&lt;/code> &lt;code>distributed AI&lt;/code> &lt;code>checkpoint&lt;/code> &lt;code>metadata analysis&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Enhancing Reproducibility in RAG Frameworks for Scientific Workflows</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</link><pubDate>Thu, 20 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/llm_rag_reproducibility/</guid><description>&lt;p>Retrieval-Augmented Generation (RAG) frameworks, which merge the capabilities of retrieval systems and generative models, significantly enhance the relevance and accuracy of responses produced by large language models (LLMs). These frameworks retrieve relevant documents from a large corpus and use these documents to inform the generative process, thereby improving the contextuality and precision of the generated content. Ensuring reproducibility in data queries using similarity search within these RAG frameworks is critical for maintaining the reliability and consistency of scientific workflows. Reproducibility ensures that the same input query consistently yields the same output, which is vital for scientific tasks that rely on precise and repeatable results. Inconsistencies can arise from various sources, affecting the trustworthiness of scientific outcomes. Differences in retrieval algorithms can lead to variable sets of documents being retrieved for the same query. Variations in data indexing methods can cause inconsistencies in how documents are ranked and accessed. The stochastic nature of LLM operations introduces an element of randomness in the generative process. Updates in datasets can also alter the baseline against which queries are processed and interpreted, leading to different results over time.&lt;/p>
&lt;p>This proposal aims to address these reproducibility challenges in similarity searches within RAG frameworks. The work involves analyzing the root causes of non-determinism, benchmarking and validating the consistency of query results, implementing enhancements to minimize variability, and developing tools and best practices to ensure reproducibility. Each of the factors above (dataset updates, retrieval-algorithm differences, indexing variations, and the stochastic nature of LLM operations) can cause variability both in the documents retrieved and in the generated responses. Ensuring consistency in query results across runs is crucial for maintaining the integrity of LLM-driven scientific research, allowing researchers to confidently build on prior work and achieve reliable, trustworthy outcomes.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include: (1) Identifying sources of non-determinism and variability, such as algorithmic differences and indexing methods, in RAG; (2) Utilizing standardized scientific datasets to benchmark the reproducibility of similarity search results across different RAG frameworks; (3) Establishing protocols for handling dataset updates to ensure that such changes do not impact the reproducibility of similarity search results; and (4) Implementing mechanisms to track and document updates to datasets, ensuring that changes are reflected consistently across all instances of the RAG framework. By addressing these areas, the proposed work aims to mitigate challenges related to reproducibility in similarity search queries within RAG frameworks, ultimately enhancing the reliability and trustworthiness of scientific research outcomes.&lt;/p>
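&lt;p>As a minimal illustration of the benchmarking step above (step 2), the sketch below re-issues each query several times and fingerprints the returned ranking, so any run-to-run divergence is surfaced directly. The &lt;code>search&lt;/code> callable is a stand-in for whatever retriever a given RAG framework exposes (e.g., a vector-index query); only the fingerprinting and comparison logic is meant to carry over.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of a reproducibility probe for similarity search: re-run each
# query and compare digests of the returned ranking. `search(query, k)`
# is a placeholder for any retriever returning an ordered list of doc IDs.
import hashlib

def result_fingerprint(doc_ids):
    """Order-sensitive digest of a retrieved ranking."""
    return hashlib.sha256("|".join(map(str, doc_ids)).encode()).hexdigest()

def reproducibility_check(search, queries, k=10, runs=5):
    """Re-issue each query `runs` times; report queries whose top-k
    ranking is not bit-identical across runs."""
    unstable = {}
    for q in queries:
        digests = {result_fingerprint(search(q, k)) for _ in range(runs)}
        if len(digests) > 1:        # same query, more than one distinct ranking
            unstable[q] = len(digests)
    return unstable                 # empty dict: reproducible on this workload
&lt;/code>&lt;/pre>
&lt;p>Queries flagged this way give a starting point for isolating the responsible factor, such as approximate-index randomness or an intervening dataset update.&lt;/p>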
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Reproducibility&lt;/code> &lt;code>LLM&lt;/code> &lt;code>RAG&lt;/code> &lt;code>Scientific Workflows&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Exploration of I/O Reproducibility with HDF5</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</link><pubDate>Wed, 19 Feb 2025 09:00:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/pnnl/h5_reproducibility/</guid><description>&lt;p>Parallel I/O is a critical component in high-performance computing (HPC), allowing multiple processes to read and write data concurrently from a shared storage system. &lt;a href="https://github.com/HDFGroup/hdf5" target="_blank" rel="noopener">HDF5&lt;/a>—a widely adopted data model and library for managing complex scientific data—supports parallel I/O but introduces challenges in I/O reproducibility, where repeated executions do not always produce identical results. This lack of reproducibility can stem from non-deterministic execution orders, variations in collective buffering strategies, and race conditions in metadata and dataset chunking operations within HDF5’s parallel I/O hierarchy. Moreover, many HDF5 operations that leverage &lt;a href="%28https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf%29">MPI I/O&lt;/a> require collective communication; that is, all processes within a communicator must participate in operations such as metadata creation, chunk allocation, and data aggregation. These collective calls ensure that the file structure and data layout remain consistent across processes, but they also introduce additional synchronization complexity that can impact reproducibility if not properly managed. In HPC scientific workflows, consistent I/O reproducibility is essential for accurate debugging, validation, and benchmarking, ensuring that scientific results are both verifiable and trustworthy. Tools such as &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a>—a suite of I/O kernels designed to exercise HDF5 I/O on parallel file systems—play an important role in identifying these reproducibility challenges, tuning performance, and ultimately supporting the overall robustness of large-scale scientific applications.&lt;/p>
&lt;h3 id="workplan">Workplan&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing parallel I/O operations in &lt;a href="https://www.hdfgroup.org/wp-content/uploads/2020/02/20200206_ECPTutorial-final.pdf" target="_blank" rel="noopener">HDF5&lt;/a> with &lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> miniapps, (2) exploring and validating potential reproducibility challenges within the parallel I/O hierarchy (e.g., MPI I/O), and (3) implementing solutions to address parallel I/O reproducibility.&lt;/p>
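&lt;p>For orientation on item (1), the sketch below shows the smallest parallel-HDF5 write pattern that h5bench-style kernels exercise, expressed with &lt;code>h5py&lt;/code> and &lt;code>mpi4py&lt;/code> rather than the C API: dataset creation is a collective call that every rank must make, and the write is routed through MPI-IO's collective path. It assumes an MPI-enabled h5py build; the file name and sizes are placeholders.&lt;/p>
&lt;pre>&lt;code class="language-python"># Sketch of a tiny h5bench-style parallel write, using h5py + mpi4py
# (assumes an MPI-enabled h5py build; names and sizes are placeholders).
# Run with: mpiexec -n 4 python write_test.py
import numpy as np
from mpi4py import MPI
import h5py

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
N = 1024                                    # elements written per rank

with h5py.File("repro_test.h5", "w", driver="mpio", comm=comm) as f:
    # Dataset creation is collective: every rank must make this call.
    dset = f.create_dataset("x", (nprocs * N,), dtype="f8")
    data = np.full(N, rank, dtype="f8")
    with dset.collective:                   # route the write through collective MPI-IO
        dset[rank * N:(rank + 1) * N] = data

# Comparing checksums of repro_test.h5 across repeated runs is a direct
# way to surface non-determinism in the I/O path.
&lt;/code>&lt;/pre>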
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Parallel I/O&lt;/code> &lt;code>MPI-I/O&lt;/code> &lt;code>Reproducibility&lt;/code> &lt;code>HPC&lt;/code> &lt;code>HDF5&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Medium&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;quot;Lenny&amp;quot; Guo&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/wei-zhang/">Wei Zhang&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>