wbq-321 | UCSC OSPO

Final Report: A Systematic Investigation into the Reproducibility of RAG Systems

Fri, 05 Sep 2025 00:00:00 +0000

I’m Baiqiang, and this is the final report for the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project, mentored by Luanzheng “Lenny” Guo and Dongfang Zhao. This project successfully developed a novel framework to quantitatively measure reproducibility in AI systems, yielding several surprising and impactful results.

The Challenge: The Need for Systematic Measurement

Retrieval-Augmented Generation (RAG) is a cornerstone of AI for science, but its reliability is often compromised by non-determinism. While this issue was a known concern, a fundamental challenge was the lack of standardized tools and methodologies to systematically measure and quantify the sources of this inconsistency. Without a rigorous way to analyze the problem, it was difficult to move beyond ad-hoc tests and establish the true root causes, hindering the development of truly trustworthy AI systems for science.

Our Contribution: The ReproRAG Framework

To address this gap, the central contribution of this project is ReproRAG, a comprehensive, open-source benchmarking framework. ReproRAG is designed to systematically investigate sources of uncertainty across the entire RAG pipeline by:

Isolating Variables: It allows for controlled experiments on embedding models, numerical precision, retrieval algorithms, hardware configurations (CPU/GPU), and distributed execution environments.
Quantifying Uncertainty: It employs a suite of metrics—including Exact Match Rate, Jaccard Similarity, and Kendall’s Tau—to precisely measure the impact of each variable on the final retrieved results.
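To make these metrics concrete, here is a minimal sketch of how they could be computed for two ranked lists of document IDs. It is an illustration only, not ReproRAG's actual implementation: the function names are hypothetical, the rankings are assumed to contain no duplicate IDs, and Kendall's Tau is computed only over documents that appear in both lists (the simple tau-a form without ties).

```python
from itertools import combinations


def exact_match(run_a, run_b):
    """1.0 if both runs returned the same IDs in the same order, else 0.0."""
    return 1.0 if list(run_a) == list(run_b) else 0.0


def jaccard(run_a, run_b):
    """Set overlap |A & B| / |A | B|, ignoring rank order."""
    set_a, set_b = set(run_a), set(run_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists agree trivially
    return len(set_a & set_b) / len(set_a | set_b)


def kendall_tau(run_a, run_b):
    """Rank agreement over the documents present in both rankings."""
    pos_b = {doc: i for i, doc in enumerate(run_b)}
    common = [doc for doc in run_a if doc in pos_b]  # kept in run_a order
    if len(common) < 2:
        return 1.0  # nothing to disagree about
    concordant = discordant = 0
    for first, second in combinations(common, 2):  # first precedes second in run_a
        if pos_b[first] < pos_b[second]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical top-4 results for the same query from two runs.
run_a = ["d1", "d2", "d3", "d4"]
run_b = ["d1", "d3", "d2", "d5"]
print(exact_match(run_a, run_b), jaccard(run_a, run_b), kendall_tau(run_a, run_b))
```

Averaging `exact_match` over a set of queries would give a rate; how ReproRAG aggregates its scores is not specified here, so that step is left out.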

Key Findings: A New Hierarchy of Uncertainty

Our large-scale empirical study using ReproRAG challenged common assumptions and established a clear hierarchy of what actually impacts reproducibility.

Core Algorithms Are Not the Problem: Our most surprising finding is that modern retrieval libraries like FAISS are reproducible from run to run once environmental factors such as random seeds are controlled. Across all tested index types (including approximate ones like HNSW and IVF) and execution environments (single-node CPU/GPU and multi-node distributed systems), repeated runs of the same configuration scored 1.000 on every metric. This contradicts the common hypothesis that approximate nearest neighbor algorithms are a primary source of randomness.
Embedding Model Choice is a Dominant Source of Variation: When we compared the documents retrieved by different state-of-the-art models (BGE, E5, Qwen) for the same query, agreement was low (e.g., an Overlap Coefficient of roughly 0.43 to 0.54). A scientific conclusion drawn with one model may therefore not be reproducible with another, because the models are fundamentally “seeing” different evidence.
Environmental Factors Introduce Measurable “Drift”:
- Numerical Precision: Changing floating-point precision (e.g., FP32 vs. FP16) was a guaranteed source of variation, but it caused a small and quantifiable “embedding drift” rather than chaotic changes.
- Data Insertion: Incrementally adding new data to an index caused a predictable “displacement” of old results, not a re-shuffling. The relative ranking of the remaining original documents was perfectly stable (Kendall’s Tau of 1.000).
Common Determinism Flags Can Be Ineffective: Our tests showed that popular software-level controls, such as the torch.backends.cudnn.deterministic flag in PyTorch, had no observable effect on the output of the transformer-based embedding models we tested. This underscores the need to validate determinism empirically rather than assume that framework settings work as advertised.
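Two of the measurements above are easy to illustrate. The sketch below computes the Overlap Coefficient between two result sets and checks the "displacement, not re-shuffling" pattern, meaning that the original documents still present after an insertion keep their relative order. The function names and the sample document IDs are made up for illustration and are not taken from ReproRAG.

```python
def overlap_coefficient(run_a, run_b):
    """|A & B| / min(|A|, |B|): the share of the smaller set found in the other."""
    set_a, set_b = set(run_a), set(run_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def surviving_order_preserved(before, after):
    """True if documents present in both rankings appear in the same relative order."""
    kept = set(after)
    original = set(before)
    survivors_before = [doc for doc in before if doc in kept]
    survivors_after = [doc for doc in after if doc in original]
    return survivors_before == survivors_after


# Hypothetical top-4 results for one query from two embedding models.
model_a = ["p1", "p2", "p3", "p4"]
model_b = ["p3", "p9", "p1", "p8"]
print(overlap_coefficient(model_a, model_b))  # 2 shared IDs out of 4

# Hypothetical top-5 before and after adding documents "new7" and "new9".
top_before = ["p1", "p2", "p3", "p4", "p5"]
top_after = ["p1", "new7", "p2", "p3", "new9"]
print(surviving_order_preserved(top_before, top_after))
```

In the second example, p4 and p5 are displaced from the top 5 by the new documents, while p1, p2, and p3 keep their order, which corresponds to a Kendall’s Tau of 1.000 over the surviving documents.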

Conclusion

This project successfully shifted the focus of the RAG reproducibility problem. The key challenge is not to fix supposedly “random” algorithms, but to rigorously control the entire experimental environment. We delivered ReproRAG, a framework that empowers researchers to do just that. Our findings provide actionable insights for the community: efforts to improve reproducibility should focus less on the retrieval algorithms themselves and more on disciplined management of embedding models, data versioning, and numerical precision.

Mid-Term Report: Uncovering the True Sources of Non-Reproducibility in AI for Science

Fri, 01 Aug 2025 00:00:00 +0000

Hello, I’m Baiqiang. I’m excited to share a mid-term update from the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project. This journey, mentored by Luanzheng “Lenny” Guo and Dongfang Zhao, has taken a fascinating and unexpected turn, leading to a much deeper understanding of what it takes to build truly reliable AI for science.

The Search for an Invisible Bug

As a quick recap, our project tackles the critical problem of non-determinism in Retrieval-Augmented Generation (RAG) systems. For science to be trustworthy, it must be repeatable. If an AI system gives different answers to the same question, it fails this fundamental test. Our initial goal, outlined in my proposal, was to find and fix the sources of this inconsistency, which we believed lay within the retrieval algorithms themselves.

To do this, we built a comprehensive testing framework capable of running thousands of controlled experiments. We designed it to meticulously measure the consistency of retrieval results while varying everything from the indexing algorithm to the underlying hardware.
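The core idea of such a harness can be sketched in a few lines: run the same queries several times, fingerprint the ranked results of each run, and check whether every fingerprint is identical. This is a simplified illustration under stated assumptions, not the project's framework. The toy corpus, `toy_search`, and `flaky_search` are hypothetical stand-ins for a real index such as FAISS.

```python
import hashlib
import json
import random


def fingerprint(results):
    """Stable SHA-256 digest of a list of ranked ID lists."""
    payload = json.dumps(results, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_run_to_run_stable(search_fn, queries, n_runs=5):
    """True if every repeated run returns exactly the same ranked results."""
    digests = {
        fingerprint([search_fn(query) for query in queries])
        for _ in range(n_runs)
    }
    return len(digests) == 1


# Toy corpus standing in for a real document index (assumption).
CORPUS = {
    "doc0": "protein folding with deep learning",
    "doc1": "vector search with approximate nearest neighbors",
    "doc2": "deep learning for climate models",
    "doc3": "reproducible nearest neighbor search",
}


def toy_search(query, k=3):
    """Rank by shared-word count; ties are broken by document ID."""
    words = set(query.split())
    ranked = sorted(
        CORPUS, key=lambda doc: (-len(words & set(CORPUS[doc].split())), doc)
    )
    return ranked[:k]


def flaky_search(query, k=3):
    """Deliberately unseeded random retrieval, to show what a failure looks like."""
    return random.sample(sorted(CORPUS), k)


queries = ["deep learning", "nearest neighbor search", "climate models"]
print(is_run_to_run_stable(toy_search, queries))
print(is_run_to_run_stable(flaky_search, queries))
```

Hashing the full ranked output makes the check strict: any change in membership or order, for any query, produces a different digest.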

A Surprising Discovery: The Usual Suspect is Innocent

The common wisdom in the community is that high-performance, approximate search libraries like FAISS are a major source of randomness. We put this to the test, running repeated queries against various index types, including complex ones like HNSW and IndexIVF.

Our results were clear and surprising: FAISS is remarkably reproducible out of the box. When run on a consistent hardware and software stack, it returns the exact same results, every single time. The library appears to have robust internal seed management that ensures deterministic behavior.

This finding was a pivotal moment. The non-reproducibility that researchers observe in practice is real, but it doesn’t come from where we expected. The problem isn’t the algorithm itself, but the environment it runs in. Our investigation immediately shifted to find the real culprits.

Pinpointing the True Sources of Non-Determinism

Our framework quickly helped us identify the true sources of inconsistency:

Hardware-Induced Variation (CPU vs. GPU): This is the most significant factor. Running the exact same retrieval code can produce different document rankings and even different document sets when executed on a CPU versus a GPU. This is likely due to subtle differences in floating-point arithmetic and library optimizations in the hardware stack.
The Impact of Numerical Precision: We also confirmed that changing the floating-point precision of the data (e.g., from FP32 to FP16) can introduce small numerical variations that are just large enough to reorder the results, potentially changing the evidence the LLM receives.
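The precision effect can be reproduced in miniature without any GPU. The sketch below rounds randomly generated vectors to half precision, measures the per-component drift this introduces, and compares the resulting top-10 rankings. It is a toy illustration and not one of our experiments: the vectors are random rather than real embeddings, the dimensions are arbitrary, and the full-precision baseline is Python's native 64-bit float standing in for FP32.

```python
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_k(query, docs, k):
    """Indices of the k highest inner-product scores; ties broken by index."""
    scores = [dot(query, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: (-scores[i], i))[:k]


rng = random.Random(0)  # fixed seed so the experiment itself is repeatable
dim, n_docs = 32, 200   # arbitrary toy sizes (assumption)
docs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_docs)]
query = [rng.uniform(-1, 1) for _ in range(dim)]

# Same data, stored at half precision.
docs_fp16 = [[to_fp16(x) for x in doc] for doc in docs]
query_fp16 = [to_fp16(x) for x in query]

# Largest change to any single vector component caused by rounding.
max_drift = max(
    abs(a - b)
    for doc, doc16 in zip(docs, docs_fp16)
    for a, b in zip(doc, doc16)
)

rank_full = top_k(query, docs, 10)
rank_half = top_k(query_fp16, docs_fp16, 10)
print("max component drift:", max_drift)
print("same top-10 order:", rank_full == rank_half)
```

The drift per component is bounded by half-precision rounding error, so it stays small. Whether it reorders the top results depends on how close neighboring scores are.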

Our Mission Refined: Building Tools for Environmental Control

This discovery has sharpened our project’s mission. The challenge is not to “fix” a supposedly random algorithm, but to develop the tools and best practices to control for the entire experimental environment. Our focus for the second half of the project is to:

Develop a Hardware-Aware Configuration Tracker: We are building a tool that goes beyond logging software versions. It will capture the critical details of the hardware environment—CPU/GPU model, CUDA version, etc.—and link them directly to an experiment’s results.
Create a Cross-Environment Validation Suite: Our open-source benchmarking suite will empower researchers to test their own pipelines. Crucially, it will help them identify and diagnose inconsistencies when moving workflows between different machines, such as from a local laptop to a cloud-based GPU.
Establish New Best Practices: We will distill our findings into clear, actionable guidance. The key recommendation is no longer just about choosing the right algorithm, but ensuring a consistent and well-documented hardware and software environment to guarantee reproducible outcomes.
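As a rough illustration of the first item, a configuration tracker could capture the environment as a dictionary, derive a short identifier from it, and attach both to every set of results. The sketch below is one possible shape for such a tool, not the tracker we are building. The function names and the record layout are hypothetical, and GPU details are only collected if PyTorch happens to be installed.

```python
import hashlib
import json
import platform
import sys


def capture_environment():
    """Collect software and hardware details that can affect retrieval results."""
    env = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
    }
    try:
        import torch  # optional dependency, used only to query CUDA and GPU details

        env["torch"] = str(torch.__version__)
        env["cuda"] = torch.version.cuda  # None for CPU-only builds
        env["gpu"] = (
            torch.cuda.get_device_name(0) if torch.cuda.is_available() else None
        )
    except ImportError:
        env["torch"] = None
    return env


def environment_id(env):
    """Short, stable identifier derived from the captured environment."""
    blob = json.dumps(env, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def tag_results(results, env=None):
    """Attach the environment and its identifier to a set of experiment results."""
    env = env or capture_environment()
    return {
        "environment_id": environment_id(env),
        "environment": env,
        "results": results,
    }


record = tag_results({"query-1": ["doc3", "doc1"]})
print(json.dumps(record, indent=2))
```

Two result files with different `environment_id` values were produced on different stacks, so a disagreement between them can be traced to the environment before anyone suspects the algorithm.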

By following the evidence, we’ve uncovered the root cause of a critical problem in AI-driven research. We are now developing the solutions needed to manage it, paving the way for a future where scientific discoveries powered by AI are built on a foundation of verifiable trust.

Enhancing Reproducibility in RAG Frameworks for Scientific Workflows

Wed, 25 Jun 2025 00:00:00 +0000

Hello, I’m Baiqiang. As part of the Enhancing Reproducibility in RAG Frameworks for Scientific Workflows project, I am excited to introduce my work on a crucial challenge in modern computational science. My proposal, developed under the mentorship of Luanzheng “Lenny” Guo at Pacific Northwest National Laboratory and Dongfang Zhao at the University of Washington, aims to enhance the reproducibility of AI-driven scientific workflows.

The Problem: A Crisis of Confidence in AI for Science

Large Language Models (LLMs) are transforming scientific research, from accelerating literature reviews to generating novel hypotheses. However, their power is matched by their pitfalls: a tendency to “hallucinate” facts and a lack of transparency. Retrieval-Augmented Generation (RAG) was developed as a powerful solution, grounding LLM outputs in factual evidence retrieved from a specific knowledge base (like a database of scientific papers).

But a hidden problem lurks within RAG: non-determinism. The very first step of a RAG system—the similarity search that finds relevant documents—can produce different results even when asked the same question. Variations in indexing algorithms, data updates, or even the underlying software can change which documents are retrieved. For science, this is a critical flaw. If an experiment cannot be repeated with the same results, its conclusions cannot be trusted. This project tackles that challenge head-on.

Our Mission: Forging a Path to Reproducible RAG

This project proposes a comprehensive solution to systematically identify, measure, and mitigate non-determinism in RAG frameworks. Our goal is to empower researchers to build and use AI tools with confidence.

Our approach is built on four key pillars:

Systematic Analysis: We will conduct a deep dive into popular retrieval components (libraries like FAISS and ScaNN, and index algorithms like HNSW) to pinpoint the exact sources of randomness and variability.
Rigorous Benchmarking: We will develop a public, open-source benchmarking suite using standardized scientific datasets (from PubMed, arXiv, etc.). This will allow anyone to quantitatively measure the reproducibility of their own RAG pipeline using clear metrics like retrieval overlap and rank correlation.
Targeted Enhancements: Based on our findings, we will implement practical solutions, including:
- Promoting deterministic algorithms and configurations.
- Building robust data versioning and provenance tracking tools (inspired by DVC and Git LFS).
- Creating tools for precise configuration management to capture the entire experimental setup.
Practical Guidance and Open Source Tools: We will distill our insights into comprehensive documentation, reusable code examples, and best practices. All tools and findings will be contributed back to the open-source community.
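To give a flavor of the data versioning idea, the sketch below fingerprints a corpus by hashing each document's content, derives a single version identifier from those hashes, and reports what changed between two snapshots. It is a simplified illustration in the spirit of content-addressed tools, not a design we have committed to, and all names in it are hypothetical.

```python
import hashlib


def document_digest(text):
    """Content hash of a single document."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def corpus_manifest(docs):
    """Map each document ID to the hash of its content."""
    return {doc_id: document_digest(text) for doc_id, text in docs.items()}


def corpus_version(manifest):
    """Single identifier for a corpus snapshot, independent of insertion order."""
    hasher = hashlib.sha256()
    for doc_id in sorted(manifest):
        hasher.update(f"{doc_id}:{manifest[doc_id]}\n".encode("utf-8"))
    return hasher.hexdigest()


def diff_manifests(old, new):
    """List document IDs that were added, removed, or modified between snapshots."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(doc_id for doc_id in shared if old[doc_id] != new[doc_id]),
    }


# Hypothetical corpus snapshots before and after an update.
snapshot_v1 = {"paper-a": "alpha", "paper-b": "beta"}
snapshot_v2 = {"paper-a": "alpha", "paper-b": "beta (revised)", "paper-c": "gamma"}

manifest_v1 = corpus_manifest(snapshot_v1)
manifest_v2 = corpus_manifest(snapshot_v2)
print(corpus_version(manifest_v1) == corpus_version(manifest_v2))
print(diff_manifests(manifest_v1, manifest_v2))
```

Recording the corpus version alongside each retrieval result would make it possible to tell whether two runs disagree because the data changed or because something else did.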