<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>final-report | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/final-report/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/final-report/index.xml" rel="self" type="application/rss+xml"/><description>final-report</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 05 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>final-report</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/final-report/</link></image><item><title>Final Report: A Systematic Investigation into the Reproducibility of RAG Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250905-wbq321/</link><pubDate>Fri, 05 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/pnnl/llm_rag_reproducibility/20250905-wbq321/</guid><description>&lt;p>I&amp;rsquo;m Baiqiang, and this is the final report for the &lt;a href="https://ucsc-ospo.github.io/project/osre25/pnnl/llm_rag_reproducibility/" target="_blank" rel="noopener">Enhancing Reproducibility in RAG Frameworks for Scientific Workflows&lt;/a> project, mentored by Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo and Dongfang Zhao. This project successfully developed a novel framework to quantitatively measure reproducibility in AI systems, yielding several surprising and impactful results.&lt;/p>
&lt;h3 id="the-challenge-the-need-for-systematic-measurement">The Challenge: The Need for Systematic Measurement&lt;/h3>
&lt;p>Retrieval-Augmented Generation (RAG) is a cornerstone of AI for science, but its reliability is often compromised by non-determinism. While this non-determinism was a known concern, the field lacked standardized tools and methodologies to systematically measure and quantify its sources. Without a rigorous way to analyze the problem, it was difficult to move beyond ad-hoc tests to establish true root causes, hindering the development of trustworthy AI systems for science.&lt;/p>
&lt;h3 id="our-contribution-the-reprorag-framework">Our Contribution: The ReproRAG Framework&lt;/h3>
&lt;p>To address this gap, the central contribution of this project is &lt;strong>ReproRAG&lt;/strong>, a comprehensive, open-source benchmarking framework. ReproRAG is designed to systematically investigate sources of uncertainty across the entire RAG pipeline by:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Isolating Variables:&lt;/strong> It allows for controlled experiments on embedding models, numerical precision, retrieval algorithms, hardware configurations (CPU/GPU), and distributed execution environments.&lt;/li>
&lt;li>&lt;strong>Quantifying Uncertainty:&lt;/strong> It employs a suite of metrics, including Exact Match Rate, Jaccard Similarity, Overlap Coefficient, and Kendall&amp;rsquo;s Tau, to precisely measure the impact of each variable on the final retrieved results (see the sketch after this list).&lt;/li>
&lt;/ul>
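&lt;p>To make these metrics concrete, here is a minimal sketch of how two retrieval runs can be compared. The function names and the list-of-document-IDs representation are illustrative, not ReproRAG&amp;rsquo;s actual API:&lt;/p>
&lt;pre>&lt;code class="language-python"># Illustrative implementations of the comparison metrics described above.
from scipy.stats import kendalltau

def exact_match(run_a, run_b):
    """1.0 only if both runs returned the same documents in the same order."""
    return float(run_a == run_b)

def jaccard(run_a, run_b):
    """Set overlap of the two result lists: intersection over union."""
    a, b = set(run_a), set(run_b)
    return len(a &amp; b) / len(a | b)

def overlap_coefficient(run_a, run_b):
    """Intersection size over the smaller list; used for cross-model comparisons."""
    a, b = set(run_a), set(run_b)
    return len(a &amp; b) / min(len(a), len(b))

def rank_agreement(run_a, run_b):
    """Kendall's Tau over the documents that both runs retrieved."""
    shared = [d for d in run_a if d in set(run_b)]
    ranks_a = [run_a.index(d) for d in shared]
    ranks_b = [run_b.index(d) for d in shared]
    tau, _ = kendalltau(ranks_a, ranks_b)
    return tau

# Two hypothetical top-5 result lists for the same query:
run1 = ["doc3", "doc7", "doc1", "doc9", "doc4"]
run2 = ["doc3", "doc7", "doc9", "doc1", "doc4"]
print(exact_match(run1, run2))     # 0.0: order differs
print(jaccard(run1, run2))         # 1.0: same set of documents
print(rank_agreement(run1, run2))  # 0.8: one swapped pair
&lt;/code>&lt;/pre>
&lt;p>Identical runs score 1.000 on all of these; the cross-model comparisons in the findings below are where the scores drop sharply.&lt;/p>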
&lt;h3 id="key-findings-a-new-hierarchy-of-uncertainty">Key Findings: A New Hierarchy of Uncertainty&lt;/h3>
&lt;p>Our large-scale empirical study using ReproRAG challenged common assumptions and established a clear hierarchy of what actually impacts reproducibility.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Core Algorithms Are Not the Problem:&lt;/strong> Our most surprising finding is that modern retrieval libraries like FAISS are fully deterministic once environmental factors such as random seeds are controlled. Across all tested index types (including approximate ones like HNSW and IVF) and execution environments (single-node CPU/GPU and multi-node distributed systems), we achieved perfect run-to-run reproducibility (1.000 scores on all metrics). This falsifies the common hypothesis that approximate nearest neighbor algorithms are a primary source of randomness (a minimal version of this check is sketched after this list).&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Embedding Model Choice is a Dominant Source of Variation:&lt;/strong> We found that the choice of the embedding model is a dominant factor driving result variation. When comparing outputs from different state-of-the-art models (BGE, E5, Qwen) for the same query, the agreement was very low (e.g., Overlap Coefficient of ~0.43-0.54). This means a scientific conclusion drawn with one model may not be reproducible with another, as they are fundamentally &amp;ldquo;seeing&amp;rdquo; different evidence.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Environmental Factors Introduce Measurable &amp;ldquo;Drift&amp;rdquo;:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Numerical Precision:&lt;/strong> Changing floating-point precision (e.g., FP32 vs. FP16) was a guaranteed source of variation, but it caused a small and quantifiable &amp;ldquo;embedding drift&amp;rdquo; rather than chaotic changes (a drift measurement is sketched after this list).&lt;/li>
&lt;li>&lt;strong>Data Insertion:&lt;/strong> Incrementally adding new data to an index caused a predictable &amp;ldquo;displacement&amp;rdquo; of old results, not a re-shuffling. The relative ranking of the remaining original documents was perfectly stable (Kendall&amp;rsquo;s Tau of 1.000).&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Common Determinism Flags Can Be Ineffective:&lt;/strong> Our tests showed that popular software-level controls, like the &lt;code>cudnn.deterministic&lt;/code> flag in PyTorch, had no observable effect on the output of modern transformer-based embedding models. This underscores the necessity of empirical validation over assuming that framework settings work as advertised (a minimal check is sketched after this list).&lt;/p>
&lt;/li>
&lt;/ol>
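&lt;p>For reference, the run-to-run check behind finding 1 can be reproduced in a few lines of FAISS: build the same approximate (IVF) index twice from identical data with a fixed clustering seed, and compare the retrieved IDs. This is a minimal sketch with synthetic data and illustrative parameters, not ReproRAG&amp;rsquo;s actual configuration:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
import faiss

def build_and_search(seed=1234):
    rng = np.random.default_rng(0)  # identical synthetic corpus every run
    corpus = rng.random((10_000, 128), dtype=np.float32)
    queries = rng.random((100, 128), dtype=np.float32)

    quantizer = faiss.IndexFlatL2(128)
    index = faiss.IndexIVFFlat(quantizer, 128, 64)  # 64 IVF clusters
    index.cp.seed = seed  # pin the k-means seed used during training
    index.train(corpus)
    index.add(corpus)
    _, ids = index.search(queries, 10)  # top-10 document IDs per query
    return ids

ids_run1 = build_and_search()
ids_run2 = build_and_search()
print("exact match:", np.array_equal(ids_run1, ids_run2))  # True
&lt;/code>&lt;/pre>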
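&lt;p>The &amp;ldquo;embedding drift&amp;rdquo; in finding 3 can be measured by embedding the same text at FP32 and FP16 and comparing the resulting vectors. A minimal sketch, assuming a GPU and an illustrative &lt;code>sentence-transformers&lt;/code> model (not necessarily one of the models we benchmarked):&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np
from sentence_transformers import SentenceTransformer

texts = ["What is the boiling point of water at sea level?"]

# Same model at two precisions; .half() casts the weights to FP16.
model_fp32 = SentenceTransformer("all-MiniLM-L6-v2", device="cuda")
emb_fp32 = model_fp32.encode(texts, normalize_embeddings=True)

model_fp16 = SentenceTransformer("all-MiniLM-L6-v2", device="cuda").half()
emb_fp16 = model_fp16.encode(texts, normalize_embeddings=True)

# Cosine similarity between the two versions of the same embedding;
# values just under 1.0 indicate small but nonzero drift.
cos = float(np.dot(emb_fp32[0], emb_fp16[0].astype(np.float32)))
print(f"cosine(FP32, FP16) = {cos:.6f}")
&lt;/code>&lt;/pre>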
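&lt;p>Finally, finding 4 is exactly the kind of claim that should be checked empirically rather than assumed: set the popular determinism flags, run the same embedding twice, and compare the outputs bit for bit. A minimal sketch with an illustrative model and input:&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
from sentence_transformers import SentenceTransformer

# The commonly recommended software-level determinism controls:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.manual_seed(0)

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = ["Retrieval-augmented generation for scientific workflows."]

# Whether the flags "work" is an empirical question: compare two runs.
emb_a = model.encode(texts)
emb_b = model.encode(texts)
print("bitwise identical:", bool((emb_a == emb_b).all()))
&lt;/code>&lt;/pre>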
&lt;h3 id="conclusion">Conclusion&lt;/h3>
&lt;p>This project successfully shifted the focus of the RAG reproducibility problem. The key challenge is not to fix supposedly &amp;ldquo;random&amp;rdquo; algorithms, but to rigorously control the entire experimental environment. We delivered &lt;strong>ReproRAG&lt;/strong>, a framework that empowers researchers to do just that. Our findings provide actionable insights for the community: efforts to improve reproducibility should focus less on the retrieval algorithms themselves and more on disciplined management of embedding models, data versioning, and numerical precision.&lt;/p></description></item></channel></rss>