<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>benchmarking | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/benchmarking/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/benchmarking/index.xml" rel="self" type="application/rss+xml"/><description>benchmarking</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 14 Jul 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/benchmarking/</link></image><item><title>Midway Through GSoC</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</link><pubDate>Mon, 14 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</guid><description>&lt;h1 id="midway-through-gsoc">Midway Through GSoC&lt;/h1>
&lt;p>Hello everyone! I’m Pratham Devadiga, and I’m thrilled to share a midterm progress update on my &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/GcstSGAO" target="_blank" rel="noopener">GSoC 2025 project&lt;/a> with the Open Source Research Experience (OSRE). My project is focused on building the &lt;strong>first open-source billion-scale vector embeddings dataset&lt;/strong> from &lt;strong>real-world open source code&lt;/strong> to support benchmarking of Approximate Nearest Neighbor (ANN) algorithms and facilitate research in Retrieval-Augmented Generation (RAG).&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The goal of this project is to address a critical gap in the ecosystem: existing ANN benchmarks are either synthetic or limited in scale. With the explosion of code-focused LLMs and embedding models, there&amp;rsquo;s a pressing need for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High-volume, high-dimensional vector datasets&lt;/strong> built from real-world data (open-source codebases).&lt;/li>
&lt;li>&lt;strong>Open, reproducible benchmarks&lt;/strong> that reflect realistic RAG workloads.&lt;/li>
&lt;li>A dataset that can be used to evaluate &lt;strong>ANN libraries&lt;/strong> like FAISS, HNSW, and Annoy on massive and practical retrieval tasks.&lt;/li>
&lt;/ul>
&lt;p>Our approach is to use high-quality open-source code repositories to extract meaningful code chunks, encode them into vector embeddings using open models, and make these datasets publicly available with metadata for downstream benchmarking and analysis.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>We’ve made substantial foundational progress in the first half of the coding period. Key highlights:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tested multiple embedding models&lt;/strong> such as &lt;code>CodeBERT&lt;/code>, &lt;code>MiniLM-L6-v2&lt;/code>, and &lt;code>all-mpnet-base-v2&lt;/code>, evaluating trade-offs in speed, dimensionality, and GPU memory.&lt;/li>
&lt;li>&lt;strong>Selected &lt;code>codebert-base&lt;/code>&lt;/strong> (768d) as the current model for phase one due to its stable performance and manageable resource footprint.&lt;/li>
&lt;li>Implemented and validated a complete &lt;strong>script pipeline&lt;/strong> to:
&lt;ul>
&lt;li>Traverse large open-source repositories.&lt;/li>
&lt;li>Extract and chunk code intelligently (functions, classes, modules).&lt;/li>
&lt;li>Encode code into embeddings and attach metadata (repo, file path, license).&lt;/li>
&lt;li>Store results efficiently in parquet and NumPy formats.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Tested all components&lt;/strong> of the pipeline on sample datasets using multi-GPU setups, ensuring compatibility and robustness.&lt;/li>
&lt;/ul>
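&lt;p>As a rough illustration of the pipeline above, here is a minimal, self-contained sketch: chunking uses Python’s &lt;code>ast&lt;/code> module, and the embedder is a deterministic hashing placeholder standing in for &lt;code>codebert-base&lt;/code>. The function names are illustrative, not the project’s actual code:&lt;/p>

```python
# Sketch of the chunk-embed-store pipeline. The embedder is a hashing
# placeholder standing in for a real model such as codebert-base.
import ast
import hashlib
import numpy as np

DIM = 768  # matches the 768d codebert-base embeddings used in phase one

def chunk_source(source, path="example.py"):
    """Split a Python file into function/class-level chunks with metadata."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,
                "name": node.name,
                "code": ast.get_source_segment(source, node),
            })
    return chunks

def embed(text, dim=DIM):
    """Placeholder embedder: hash the text into a fixed-size float vector."""
    digest = hashlib.sha256(text.encode()).digest()
    seed = int.from_bytes(digest[:8], "big") % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim).astype(np.float32)

def run_pipeline(source):
    chunks = chunk_source(source)
    vectors = np.stack([embed(c["code"]) for c in chunks])
    # In the real pipeline these would be written to parquet + .npy shards.
    return chunks, vectors
```

&lt;p>A real run would swap &lt;code>embed&lt;/code> for batched model inference and write the chunk metadata to parquet alongside the &lt;code>.npy&lt;/code> embedding shards.&lt;/p>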
&lt;h2 id="challenges-and-learnings">Challenges and Learnings&lt;/h2>
&lt;p>Building a billion-scale dataset from real-world codebases is no small task. Here&amp;rsquo;s what we’ve encountered and learned along the way:&lt;/p>
&lt;h3 id="1-multi-gpu-pipeline-design">1. Multi-GPU Pipeline Design&lt;/h3>
&lt;p>Naively parallelizing the embedding process caused memory overflow and deadlocks due to model reloading across processes. We refactored the code using &lt;code>torch.multiprocessing&lt;/code> and pinned GPU contexts to avoid such issues, improving throughput on multi-GPU machines.&lt;/p>
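&lt;p>The pattern that fixed this, sketched below with only the standard library, is one worker process per GPU, with the model loaded exactly once in the worker initializer. The &lt;code>torch&lt;/code>-specific calls are noted in comments, and all names are illustrative:&lt;/p>

```python
# One-model-per-process pattern: each worker takes a device id once and
# loads the model once. Re-loading the model per task was the source of
# the memory overflows and deadlocks described above.
import multiprocessing as mp

_DEVICE = None

def _init_worker(device_queue):
    # Pin this worker to one device and load the model exactly once.
    global _DEVICE
    _DEVICE = device_queue.get()
    # with torch: torch.cuda.set_device(_DEVICE); model = load_model()

def _embed_batch(batch):
    # Stand-in for batched model inference on the pinned device.
    return [(_DEVICE, len(text)) for text in batch]

def embed_all(batches, device_ids=(0, 1)):
    queue = mp.Manager().Queue()
    for d in device_ids:
        queue.put(d)
    with mp.Pool(len(device_ids), _init_worker, (queue,)) as pool:
        return pool.map(_embed_batch, batches)
```

&lt;p>&lt;code>pool.map&lt;/code> preserves batch order, so results line up with inputs even though batches run on different devices.&lt;/p>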
&lt;h3 id="2-embedding-trade-offs">2. Embedding Trade-offs&lt;/h3>
&lt;p>We experimented with larger models but found that their generation time and memory use were too high to be practical in early phases. This helped us narrow down to scalable configurations for initial dataset generation.&lt;/p>
&lt;h3 id="3-preparing-for-scale">3. Preparing for Scale&lt;/h3>
&lt;p>Although the full-scale embeddings have not yet been generated, all scripts are now &lt;strong>modular, parallelized, and reproducible&lt;/strong>, ensuring a smooth transition to billion-scale data generation in the second half.&lt;/p>
&lt;h2 id="whats-next">What’s Next&lt;/h2>
&lt;p>The second half of the project will focus on:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Scaling up embedding generation&lt;/strong> to &amp;gt;1B code chunks across hundreds of open-source repositories.&lt;/li>
&lt;li>&lt;strong>Running benchmarks&lt;/strong> using FAISS, HNSW, and Annoy on these embeddings.&lt;/li>
&lt;li>&lt;strong>Releasing the dataset&lt;/strong> on Hugging Face and AWS S3 with sharded access and metadata.&lt;/li>
&lt;li>&lt;strong>Writing a detailed benchmarking report&lt;/strong> comparing speed, accuracy, and memory trade-offs across ANN algorithms.&lt;/li>
&lt;/ul>
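&lt;p>The core benchmark metric in these comparisons is recall@k against exact brute-force ground truth. Below is a minimal NumPy sketch; in the real benchmarks the candidate lists would come from FAISS, HNSW, or Annoy rather than being computed by hand:&lt;/p>

```python
# Recall@k of an ANN index versus exact brute-force ground truth.
import numpy as np

def exact_topk(base, queries, k):
    """Brute-force nearest neighbours by squared L2 distance (ground truth)."""
    d2 = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def recall_at_k(ann_ids, true_ids):
    """Fraction of the true top-k neighbours that the ANN index recovered."""
    k = true_ids.shape[1]
    hits = [len(set(a).intersection(t)) for a, t in zip(ann_ids, true_ids)]
    return float(np.mean(hits)) / k
```

&lt;p>An exact index scores recall 1.0 by construction; an ANN index trades some recall for speed, which is exactly the trade-off the report will quantify.&lt;/p>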
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>This journey so far has taught me a lot about building large-scale ML pipelines, managing real-world compute constraints, and ensuring reproducibility for research-grade datasets. I&amp;rsquo;m grateful to my mentor &lt;strong>Jayjeet Chakraborty&lt;/strong> and the OSRE team for their continuous support and guidance.&lt;/p>
&lt;p>Excited for the next half, where the real scale begins!&lt;/p>
&lt;p>Stay tuned for updates. You can find more about the project on my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">OSRE project page&lt;/a>.&lt;/p></description></item><item><title>Benchmarking the Future: Exploring High-Speed Scientific Data Streaming</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/</link><pubDate>Sun, 06 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/</guid><description>&lt;p>Hello! I&amp;rsquo;m &lt;a href="https://ucsc-ospo.github.io/author/ankitkat042/" target="_blank" rel="noopener">Ankit Kumar&lt;/a>, and although I&amp;rsquo;m a bit late with this introduction post due to a busy period filled with interviews and college formalities, I&amp;rsquo;m excited to share my journey with the OSRE 2025 program and the fascinating world of scientific data streaming.&lt;/p>
&lt;h2 id="about-me">About Me&lt;/h2>
&lt;p>I&amp;rsquo;m currently pursuing my BTech degree at the Indraprastha Institute of Information Technology Delhi (IIIT Delhi) and am based in New Delhi, India. As I approach graduation, I&amp;rsquo;m thrilled to be working on a project that perfectly aligns with my interests in systems and networking.&lt;/p>
&lt;p>My passion for technology has led me through various experiences:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Software Developer at CloudLabs&lt;/strong>: I worked at a platform founded by &lt;a href="https://faculty.iiitd.ac.in/~sumit/" target="_blank" rel="noopener">Dr. Sumit J Darak&lt;/a> that facilitates remote access to actual FPGA boards on a slot basis, making hardware experimentation accessible to students worldwide.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data Mining Intern at &lt;a href="https://tasktracker.in/" target="_blank" rel="noopener">TaskTracker.in&lt;/a>&lt;/strong>: This experience gave me insights into large-scale data processing and analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Undergraduate Researcher&lt;/strong>: Currently working under &lt;a href="https://faculty.iiitd.ac.in/~mukulika/" target="_blank" rel="noopener">Dr. Mukulika Maity&lt;/a> on benchmarking QUIC and TCP protocols across different environments including bare metal, virtual machines, and containers.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>I chose this OSRE project because it represents an incredible opportunity to work with some of the best minds in the industry at Argonne National Laboratory (ANL) while diving deep into cutting-edge networking technologies.&lt;/p>
&lt;h2 id="my-project-scistream-performance-analysis">My Project: SciStream Performance Analysis&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream">SciStream project&lt;/a>, I&amp;rsquo;m focusing on two critical aspects of high-performance scientific data streaming:&lt;/p>
&lt;h3 id="1-tcpudp-performace-benchmarking">1. TCP/UDP Performace Benchmarking&lt;/h3>
&lt;p>I&amp;rsquo;m conducting comprehensive benchmarking of SSH and TLS tunnels using various open-source tools and parameters. This work is crucial for understanding how different protocols and their overhead impact the performance of real-time scientific data streaming. The goal is to provide researchers with evidence-based recommendations for moving and processing their data at high speed without compromising performance.&lt;/p>
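&lt;p>As a simplified illustration of two of the metrics involved, the sketch below measures time-to-first-byte and sustained throughput over a local socket pair; in the actual experiments the sender would sit behind the SSH or TLS tunnel under test:&lt;/p>

```python
# Measure time-to-first-byte (TTFB) and sustained throughput over a
# local socket pair. No tunnel here: this is only the measurement shape.
import socket
import threading
import time

def stream_and_measure(total_mb=8, chunk_kb=64):
    rx, tx = socket.socketpair()
    payload = b"x" * (chunk_kb * 1024)
    n_chunks = (total_mb * 1024) // chunk_kb

    def sender():
        for _ in range(n_chunks):
            tx.sendall(payload)
        tx.close()

    threading.Thread(target=sender, daemon=True).start()
    start = time.perf_counter()
    first = rx.recv(65536)          # first bytes arriving marks TTFB
    ttfb = time.perf_counter() - start
    received = len(first)
    while True:
        data = rx.recv(65536)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    rx.close()
    return {
        "ttfb_s": ttfb,
        "throughput_mbps": received * 8 / elapsed / 1e6,
        "received_bytes": received,
    }
```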
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="benchmarking_meme.png" srcset="
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_30852e26909ae3e8a70a243539d202b3.webp 400w,
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_cc337ea13c6aaff8c44a6cc4b452a3e3.webp 760w,
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_30852e26909ae3e8a70a243539d202b3.webp"
width="760"
height="754"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="2-quic-proxy-exploration">2. QUIC Proxy Exploration&lt;/h3>
&lt;p>I&amp;rsquo;m exploring different QUIC proxy implementations to understand their potential advantages over traditional TCP+TLS proxies in scientific workflows. QUIC, the protocol that powers modern web applications like YouTube, offers promising features for scientific data streaming, but comprehensive benchmarking is needed to validate its benefits.&lt;/p>
&lt;h2 id="working-with-cutting-edge-testbeds">Working with Cutting-Edge Testbeds&lt;/h2>
&lt;p>Currently, I&amp;rsquo;m conducting experiments using both the &lt;strong>&lt;a href="https://portal.fabric-testbed.net/" target="_blank" rel="noopener">FABRIC testbed&lt;/a>&lt;/strong> and &lt;strong>&lt;a href="https://www.es.net/" target="_blank" rel="noopener">ESnet testbed&lt;/a>&lt;/strong>. These platforms provide access to real high-speed network infrastructure, allowing me to test protocols and configurations under realistic conditions that mirror actual scientific computing environments.&lt;/p>
&lt;h2 id="the-team-experience">The Team Experience&lt;/h2>
&lt;p>These past two weeks have been incredibly rewarding, working alongside:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/alain-zhang-672086205/" target="_blank" rel="noopener">Alain Zhang&lt;/a>&lt;/strong> - my project mate from UC San Diego, cool guy.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/castroflavio/" target="_blank" rel="noopener">Flavio Castro&lt;/a>&lt;/strong> - My project mentor and manager, goto person for my issues. currently at anl as a research development software engineer.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.anl.gov/profile/joaquin-chung" target="_blank" rel="noopener">Joaquin Chung&lt;/a>&lt;/strong> - Super mentor, brains behind the project. His guidance on the project is super valubale.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.anl.gov/profile/rajkumar-kettimuthu" target="_blank" rel="noopener">Rajkumar Kettimuthu&lt;/a>&lt;/strong> - Lead Scientist in our project whose comments on our paper critique are invaluable.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/seena-vazifedunn/" target="_blank" rel="noopener">Seena Vazifedunn&lt;/a>&lt;/strong> - Graduate Research Assistant at University of Chicago. He asks very relevant and important questions during our report presentation and his feedbacks are very insightful.&lt;/li>
&lt;/ul>
&lt;p>The collaborative nature of this project has been fantastic, combining perspectives from different institutions and backgrounds to tackle complex networking challenges.&lt;/p>
&lt;p>Stay tuned for updates!&lt;/p>
&lt;hr>
&lt;p>&lt;em>This work is part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream">SciStream project&lt;/a> at Argonne National Laboratory, reimagining how scientific data moves across modern research infrastructure.&lt;/em>&lt;/p></description></item><item><title>Building a Billion-Scale Vector Embeddings Dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</guid><description>&lt;h1 id="billion-vector-embeddings-dataset">Billion Vector Embeddings Dataset&lt;/h1>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">Billion-Scale Embeddings Dataset project&lt;/a>, my &lt;a href="GSoC-proposal.pdf">proposal&lt;/a> under the mentorship of &lt;strong>Jayjeet Chakraborty&lt;/strong> aims to create the first large-scale, real-world vector embeddings dataset—bridging the critical gap in Approximate Nearest Neighbor (ANN) benchmarks and Retrieval-Augmented Generation (RAG) systems.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Existing ANN benchmarks often fall short—they’re either synthetic (like SIFT) or too small-scale (≤1M vectors). With the rapid evolution of LLM-based vector search systems (e.g., OpenAI’s 3072d &lt;code>text-embedding-3-large&lt;/code>), there&amp;rsquo;s a growing need for:&lt;/p>
&lt;ul>
&lt;li>High-dimensional (&amp;gt;1000d), large-scale (&amp;gt;100M) embeddings&lt;/li>
&lt;li>Real-world distributions (Wikipedia-scale text)&lt;/li>
&lt;li>Open, reproducible benchmarks for the community&lt;/li>
&lt;/ul>
&lt;h2 id="project-goals">Project Goals&lt;/h2>
&lt;ul>
&lt;li>Generate &lt;strong>1 billion&lt;/strong> embeddings from English Wikipedia using open-source models.&lt;/li>
&lt;li>Create multiple dimensional variants: &lt;strong>1024d&lt;/strong>, &lt;strong>4096d&lt;/strong>, and &lt;strong>8192d&lt;/strong>.&lt;/li>
&lt;li>Deduplicate, compress, and store embeddings with rich metadata (URL, timestamps, models).&lt;/li>
&lt;li>Benchmark ANN performance on FAISS, HNSW, and Annoy.&lt;/li>
&lt;li>Distribute the dataset via HuggingFace &amp;amp; AWS S3 with shard-level access.&lt;/li>
&lt;/ul>
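&lt;p>The deduplication step above can be sketched as content-hashing each chunk and keeping only the first occurrence, carrying metadata along with the kept entries. This is a minimal sketch with illustrative field names, not the project’s actual schema:&lt;/p>

```python
# Hash-based deduplication of text chunks, preserving first occurrence
# and its metadata (url, timestamp, model). Field names are illustrative.
import hashlib

def deduplicate(chunks):
    """chunks: list of dicts with a 'text' key plus arbitrary metadata."""
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept
```

&lt;p>At billion scale the &lt;code>seen&lt;/code> set would be sharded or replaced with an on-disk structure, but the first-occurrence-wins logic stays the same.&lt;/p>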
&lt;h2 id="open-source-impact">Open Source Impact&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>ANN Libraries&lt;/strong>: Enable reproducible benchmarking for real-world workloads.&lt;/li>
&lt;li>&lt;strong>RAG Systems&lt;/strong>: Evaluate and optimize retrieval at scale using real Wikipedia text.&lt;/li>
&lt;li>&lt;strong>Researchers&lt;/strong>: Conduct large-scale studies on dimensionality, ANN accuracy, and compression trade-offs.&lt;/li>
&lt;/ul>
&lt;hr></description></item><item><title>h5bench with AI workloads</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench-with-ai-workloads">h5bench with AI workloads&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Architecting the Future of Scientific Data: Multi-Site Streaming Without Compromise</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</link><pubDate>Mon, 10 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</guid><description>&lt;p>Data is generated at ever-increasing rates, yet it’s often processed more slowly than it’s collected. Scientific instruments frequently operate below their full capacity or discard
valuable data due to network bottlenecks, security domain mismatches, and insufficient real-time processing capabilities.&lt;/p>
&lt;p>&lt;a href="https://github.com/scistream/scistream-proto" target="_blank" rel="noopener">SciStream&lt;/a> reimagines how scientific data moves across modern research infrastructure by providing a framework for high-speed (+100Gbps)
memory-to-memory streaming that doesn’t compromise on security. Whether connecting scientific instruments to analysis clusters or bridging across institutional boundaries, SciStream provides the foundation for next-generation scientific
workflows.&lt;/p>
&lt;p>Building on our &lt;a href="https://dl.acm.org/doi/abs/10.1145/3502181.3531475" target="_blank" rel="noopener">published research&lt;/a>, we’re now expanding the framework’s capabilities through open-source development and community
collaboration. These projects offer an opportunity for
students to gain hands-on experience with cutting-edge networking and security technologies used in high-performance computing (HPC), cloud infrastructure, and large-scale scientific
experiments.&lt;/p>
&lt;h3 id="scistream-securebench-a-framework-for-benchmarking-security-protocols-in-scientific-data-streaming">SciStream-SecureBench: A Framework for Benchmarking Security Protocols in Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Security Protocols, Network Performance, Data Streaming, Reproducibility, High-throughput Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Scripting, Linux, Network Protocol Analysis, Containers, Benchmarking tools&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered why large scientific experiments need to move massive amounts of data securely and quickly? While TLS and SSH are standard for secure data transfer,
there’s a surprising lack of benchmarks that evaluate their performance in high-speed scientific workflows. This project aims to fill this gap by developing a benchmarking suite that
measures how different security configurations impact real-time scientific data streaming.&lt;/p>
&lt;h3 id="specific-tasks-of-the-project-include">&lt;strong>Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing benchmarking tools that measure key security performance metrics like handshake latency, throughput stability, and computational overhead.&lt;/li>
&lt;li>Running &lt;strong>real-world experiments&lt;/strong> on research testbeds (Chameleon, FABRIC) to simulate scientific data patterns.&lt;/li>
&lt;li>Automating comparative analysis between TLS and SSH, with focus on streaming-specific metrics like &lt;strong>time-to-first-byte and sustained throughput&lt;/strong>.&lt;/li>
&lt;li>Documenting best practices for security protocol selection in high-performance streaming.&lt;/li>
&lt;/ul>
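&lt;p>As a hedged sketch of the comparative-analysis task, the snippet below summarizes repeated throughput samples per tunnel configuration into the kinds of summary statistics such a report would compare; the configuration and metric names are illustrative:&lt;/p>

```python
# Summarize repeated throughput measurements per configuration
# (e.g. TLS vs. SSH) into mean, standard deviation, and p95.
import statistics

def summarize(samples_mbps):
    """Summarize repeated throughput samples for one tunnel configuration."""
    cuts = statistics.quantiles(samples_mbps, n=20)
    return {
        "mean": statistics.mean(samples_mbps),
        "stdev": statistics.stdev(samples_mbps),
        "p95": cuts[18],  # 19th of 19 cut points = 95th percentile
    }

def compare(results):
    """results: dict mapping configuration name to a list of samples."""
    return {name: summarize(samples) for name, samples in results.items()}
```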
&lt;h3 id="why-this-matters-for-your-career">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain expertise in &lt;strong>network security and performance analysis&lt;/strong>, highly valued in cybersecurity, cloud computing, and HPC.&lt;/li>
&lt;li>Work on a &lt;strong>real research challenge&lt;/strong> with potential for publication.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-streambench-comparative-analysis-of-scientific-streaming-frameworks">SciStream-StreamBench: Comparative Analysis of Scientific Streaming Frameworks&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Data Streaming Protocols, Network Performance, Benchmarking, Distributed Systems, Real-time Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, ZeroMQ, EPICS/PVAccess, Linux, Performance Analysis, Visualization&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scientific experiments generate enormous amounts of streaming data, but how do we choose the best framework for handling it efficiently? Despite the widespread use of ZeroMQ and
&lt;a href="https://dl.acm.org/doi/10.1145/3624062.3624610" target="_blank" rel="noopener">PVApy&lt;/a>,
there’s little systematic benchmarking comparing their performance. This project will develop &lt;strong>real-world benchmarks&lt;/strong> to evaluate how different frameworks handle scientific data in
&lt;strong>high-speed environments&lt;/strong>.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Designing benchmarking methodologies to assess key performance metrics like &lt;strong>synchronization overhead, time-to-first-data, and throughput stability&lt;/strong>.&lt;/li>
&lt;li>Developing a test harness that simulates real-world streaming conditions (network variability, concurrent streams, dynamic data rates).&lt;/li>
&lt;li>Running experiments on &lt;strong>Chameleon and FABRIC testbeds&lt;/strong>.&lt;/li>
&lt;li>Automating data collection and visualization to highlight performance trends.&lt;/li>
&lt;li>Documenting best practices and framework-specific optimizations.&lt;/li>
&lt;/ul>
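&lt;p>A minimal sketch of such a test harness, using only the standard library: a producer emits messages with variable delays to simulate network variability, and the consumer records time-to-first-data and inter-arrival jitter. A real harness would put ZeroMQ or PVApy on the wire instead of an in-process queue:&lt;/p>

```python
# Simulated streaming run: variable producer delays, consumer-side
# measurement of time-to-first-data and inter-arrival jitter.
import queue
import random
import statistics
import threading
import time

def run_stream(n_messages=50, seed=7):
    rng = random.Random(seed)
    q = queue.Queue()

    def producer():
        for i in range(n_messages):
            time.sleep(rng.uniform(0.0, 0.002))  # variable network delay
            q.put(("msg", i))
        q.put(None)  # end-of-stream marker

    start = time.perf_counter()
    threading.Thread(target=producer, daemon=True).start()
    arrivals = []
    while True:
        item = q.get()
        if item is None:
            break
        arrivals.append(time.perf_counter() - start)
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return {
        "time_to_first_data_s": arrivals[0],
        "jitter_s": statistics.stdev(gaps),
        "messages": len(arrivals),
    }
```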
&lt;h3 id="why-this-matters-for-your-career-1">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Get hands-on experience with &lt;strong>real-time data processing&lt;/strong> and &lt;strong>network performance analysis&lt;/strong>.&lt;/li>
&lt;li>Learn benchmarking techniques useful for &lt;strong>distributed systems, cloud computing, and high-performance networking&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-quic-next-generation-proxy-architecture-for-scientific-data-streaming">SciStream-QUIC: Next-Generation Proxy Architecture for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: QUIC Protocol, Network Proxies, Performance Analysis, Protocol Design, Hardware Acceleration&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python/C++, Network Programming, QUIC (quiche/aioquic), Linux, Performance Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered how YouTube loads videos faster than traditional web pages? That’s because of &lt;strong>QUIC&lt;/strong>, a next-generation protocol designed for speed and security. Initial evaluations
of federated streaming architectures (&lt;a href="https://par.nsf.gov/servlets/purl/10380551" target="_blank" rel="noopener">INDIS'22
paper&lt;/a>) suggest potential benefits of QUIC, but comprehensive benchmarking is
needed. This project explores whether &lt;strong>QUIC-based proxies&lt;/strong> can outperform traditional &lt;strong>TCP+TLS&lt;/strong> proxies for scientific data streaming, potentially revolutionizing how researchers move
large datasets.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include-1">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing a &lt;strong>QUIC-based proxy&lt;/strong> optimized for scientific workflows.&lt;/li>
&lt;li>Running benchmarks to compare &lt;strong>QUIC vs. traditional TLS proxies&lt;/strong>.&lt;/li>
&lt;li>Investigating &lt;strong>hardware encryption offloading&lt;/strong> for QUIC and TLS.&lt;/li>
&lt;li>Designing &lt;strong>reproducible experiments&lt;/strong> using Chameleon and FABRIC testbeds.&lt;/li>
&lt;li>Documenting best practices for deploying &lt;strong>QUIC proxies in HPC environments&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="why-this-matters-for-your-career-2">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain experience in &lt;strong>cutting-edge networking protocols&lt;/strong> used in cloud computing (Google, Cloudflare, etc.).&lt;/li>
&lt;li>Learn about &lt;strong>hardware acceleration&lt;/strong> and its role in high-speed networking.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-auth-modern-authentication-and-user-interface-for-scientific-data-streaming">SciStream-Auth: Modern Authentication and User Interface for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Authentication Systems, UI/UX Design, Security Integration, Scientific Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Web Development (React/Vue), OAuth 2.0/SAML, Security Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Not a security expert? You can still contribute by designing an interactive front-end!&lt;/p>
&lt;p>In today&amp;rsquo;s scientific computing landscape, authentication and user experience often act as barriers to adoption rather than enabling seamless collaboration. While SciStream excels at
high-speed data transfer, its reliance on a single authentication provider and command-line interface limits its accessibility. This project aims to transform SciStream into a more
versatile platform by implementing a modular authentication system and developing an intuitive graphical interface.&lt;/p>
&lt;p>By expanding beyond Globus Auth to support multiple authentication frameworks, we can enable broader adoption across different scientific communities while maintaining robust security.
Coupled with a modern GUI that visualizes real-time streaming activity, this enhancement will make SciStream more accessible to researchers—allowing them to focus on their science rather
than wrestling with complex configurations.&lt;/p>
&lt;p>This project will design a user-friendly interface that makes secure scientific data streaming as intuitive as using a cloud storage service. You&amp;rsquo;ll also gain hands-on experience with
authentication methods used by industry leaders like Google and Facebook, while directly improving access to scientific data.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include-2">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Design and implementation of a pluggable authentication system supporting multiple providers (OAuth 2.0, SAML, OpenID Connect, certificate-based auth)&lt;/li>
&lt;li>Development of a modern, responsive GUI using web technologies that provides real-time visualization of system status&lt;/li>
&lt;li>Creation of comprehensive security testing protocols to validate the authentication implementations&lt;/li>
&lt;li>Implementation of session management and secure credential handling within the GUI&lt;/li>
&lt;li>Design of an intuitive interface for managing streaming configurations and monitoring data flows&lt;/li>
&lt;li>Creation of documentation and examples to help facilities integrate their preferred authentication mechanisms&lt;/li>
&lt;/ul></description></item><item><title>h5bench</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench--reporting-and-enhancing">h5bench / Reporting and Enhancing&lt;/h3>
&lt;p>The proposed work will include standardizing and enhancing the reports generated by the suite, and integrating additional I/O kernels (e.g., HACC-IO).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="h5bench--compression">h5bench / Compression&lt;/h3>
&lt;p>The proposed work will focus on including compression capabilities into the h5bench core access patterns through HDF5 filters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>, &lt;code>compression&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>