<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>benchmarking | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/benchmarking/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/benchmarking/index.xml" rel="self" type="application/rss+xml"/><description>benchmarking</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 14 Jul 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/benchmarking/</link></image><item><title>Midway Through GSoC</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</link><pubDate>Mon, 14 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</guid><description>&lt;h1 id="midway-through-gsoc">Midway Through GSoC&lt;/h1>
&lt;p>Hello everyone! I’m Pratham Devadiga, and I’m thrilled to share a midterm progress update on my &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/GcstSGAO" target="_blank" rel="noopener">GSoC 2025 project&lt;/a> with the Open Source Research Experience (OSRE). My project is focused on building the &lt;strong>first open-source billion-scale vector embeddings dataset&lt;/strong> from &lt;strong>real-world open source code&lt;/strong> to support benchmarking of Approximate Nearest Neighbor (ANN) algorithms and facilitate research in Retrieval-Augmented Generation (RAG).&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The goal of this project is to address a critical gap in the ecosystem: existing ANN benchmarks are either synthetic or limited in scale. With the explosion of code-focused LLMs and embedding models, there&amp;rsquo;s a pressing need for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High-volume, high-dimensional vector datasets&lt;/strong> built from real-world data (open-source codebases).&lt;/li>
&lt;li>&lt;strong>Open, reproducible benchmarks&lt;/strong> that reflect realistic RAG workloads.&lt;/li>
&lt;li>A dataset that can be used to evaluate &lt;strong>ANN libraries&lt;/strong> like FAISS, HNSW, and Annoy on massive and practical retrieval tasks.&lt;/li>
&lt;/ul>
&lt;p>Our approach is to use high-quality open-source code repositories to extract meaningful code chunks, encode them into vector embeddings using open models, and make these datasets publicly available with metadata for downstream benchmarking and analysis.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>We’ve made substantial foundational progress in the first half of the coding period. Key highlights:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tested multiple embedding models&lt;/strong> such as &lt;code>CodeBERT&lt;/code>, &lt;code>MiniLM-L6-v2&lt;/code>, and &lt;code>all-mpnet-base-v2&lt;/code>, evaluating trade-offs in speed, dimensionality, and GPU memory.&lt;/li>
&lt;li>&lt;strong>Selected &lt;code>codebert-base&lt;/code>&lt;/strong> (768d) as the current model for phase one due to its stable performance and manageable resource footprint.&lt;/li>
&lt;li>Implemented and validated a complete &lt;strong>script pipeline&lt;/strong> to:
&lt;ul>
&lt;li>Traverse large open-source repositories.&lt;/li>
&lt;li>Extract and chunk code intelligently (functions, classes, modules).&lt;/li>
&lt;li>Encode code into embeddings and attach metadata (repo, file path, license).&lt;/li>
&lt;li>Store results efficiently in parquet and NumPy formats.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Tested all components&lt;/strong> of the pipeline on sample datasets using multi-GPU setups, ensuring compatibility and robustness.&lt;/li>
&lt;/ul>
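&lt;p>As a rough illustration of the pipeline above, here is a minimal, self-contained sketch: chunking uses Python’s &lt;code>ast&lt;/code> module, and the embedder is a deterministic hashing placeholder standing in for &lt;code>codebert-base&lt;/code>. The function names are illustrative, not the project’s actual code:&lt;/p>

```python
# Sketch of the chunk-embed-store pipeline. The embedder is a hashing
# placeholder standing in for a real model such as codebert-base.
import ast
import hashlib
import numpy as np

DIM = 768  # matches the 768d codebert-base embeddings used in phase one

def chunk_source(source, path="example.py"):
    """Split a Python file into function/class-level chunks with metadata."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            chunks.append({
                "path": path,
                "name": node.name,
                "code": ast.get_source_segment(source, node),
            })
    return chunks

def embed(text, dim=DIM):
    """Placeholder embedder: hash the text into a fixed-size float vector."""
    digest = hashlib.sha256(text.encode()).digest()
    seed = int.from_bytes(digest[:8], "big") % (2**32)
    rng = np.random.default_rng(seed)
    return rng.standard_normal(dim).astype(np.float32)

def run_pipeline(source):
    chunks = chunk_source(source)
    vectors = np.stack([embed(c["code"]) for c in chunks])
    # In the real pipeline these would be written to parquet + .npy shards.
    return chunks, vectors
```

&lt;p>A real run would swap &lt;code>embed&lt;/code> for batched model inference and write the chunk metadata to parquet alongside the &lt;code>.npy&lt;/code> embedding shards.&lt;/p>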
&lt;h2 id="challenges-and-learnings">Challenges and Learnings&lt;/h2>
&lt;p>Building a billion-scale dataset from real-world codebases is no small task. Here&amp;rsquo;s what we’ve encountered and learned along the way:&lt;/p>
&lt;h3 id="1-multi-gpu-pipeline-design">1. Multi-GPU Pipeline Design&lt;/h3>
&lt;p>Naively parallelizing the embedding process caused memory overflow and deadlocks due to model reloading across processes. We refactored the code using &lt;code>torch.multiprocessing&lt;/code> and pinned GPU contexts to avoid such issues, improving throughput on multi-GPU machines.&lt;/p>
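&lt;p>The pattern that fixed this, sketched below with only the standard library, is one worker process per GPU, with the model loaded exactly once in the worker initializer. The &lt;code>torch&lt;/code>-specific calls are noted in comments, and all names are illustrative:&lt;/p>

```python
# One-model-per-process pattern: each worker takes a device id once and
# loads the model once. Re-loading the model per task was the source of
# the memory overflows and deadlocks described above.
import multiprocessing as mp

_DEVICE = None

def _init_worker(device_queue):
    # Pin this worker to one device and load the model exactly once.
    global _DEVICE
    _DEVICE = device_queue.get()
    # with torch: torch.cuda.set_device(_DEVICE); model = load_model()

def _embed_batch(batch):
    # Stand-in for batched model inference on the pinned device.
    return [(_DEVICE, len(text)) for text in batch]

def embed_all(batches, device_ids=(0, 1)):
    queue = mp.Manager().Queue()
    for d in device_ids:
        queue.put(d)
    with mp.Pool(len(device_ids), _init_worker, (queue,)) as pool:
        return pool.map(_embed_batch, batches)
```

&lt;p>&lt;code>pool.map&lt;/code> preserves batch order, so results line up with inputs even though batches run on different devices.&lt;/p>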
&lt;h3 id="2-embedding-trade-offs">2. Embedding Trade-offs&lt;/h3>
&lt;p>We experimented with larger models but found that their generation time and memory use were too high to be practical in early phases. This helped us narrow down to scalable configurations for initial dataset generation.&lt;/p>
&lt;h3 id="3-preparing-for-scale">3. Preparing for Scale&lt;/h3>
&lt;p>Although the full-scale embeddings have not yet been generated, all scripts are now &lt;strong>modular, parallelized, and reproducible&lt;/strong>, ensuring a smooth transition to billion-scale data generation in the second half.&lt;/p>
&lt;h2 id="whats-next">What’s Next&lt;/h2>
&lt;p>The second half of the project will focus on:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Scaling up embedding generation&lt;/strong> to &amp;gt;1B code chunks across hundreds of open-source repositories.&lt;/li>
&lt;li>&lt;strong>Running benchmarks&lt;/strong> using FAISS, HNSW, and Annoy on these embeddings.&lt;/li>
&lt;li>&lt;strong>Releasing the dataset&lt;/strong> on Hugging Face and AWS S3 with sharded access and metadata.&lt;/li>
&lt;li>&lt;strong>Writing a detailed benchmarking report&lt;/strong> comparing speed, accuracy, and memory trade-offs across ANN algorithms.&lt;/li>
&lt;/ul>
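&lt;p>The core benchmark metric in these comparisons is recall@k against exact brute-force ground truth. Below is a minimal NumPy sketch; in the real benchmarks the candidate lists would come from FAISS, HNSW, or Annoy rather than being computed by hand:&lt;/p>

```python
# Recall@k of an ANN index versus exact brute-force ground truth.
import numpy as np

def exact_topk(base, queries, k):
    """Brute-force nearest neighbours by squared L2 distance (ground truth)."""
    d2 = ((queries[:, None, :] - base[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def recall_at_k(ann_ids, true_ids):
    """Fraction of the true top-k neighbours that the ANN index recovered."""
    k = true_ids.shape[1]
    hits = [len(set(a).intersection(t)) for a, t in zip(ann_ids, true_ids)]
    return float(np.mean(hits)) / k
```

&lt;p>An exact index scores recall 1.0 by construction; an ANN index trades some recall for speed, which is exactly the trade-off the report will quantify.&lt;/p>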
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>This journey so far has taught me a lot about building large-scale ML pipelines, managing real-world compute constraints, and ensuring reproducibility for research-grade datasets. I&amp;rsquo;m grateful to my mentor &lt;strong>Jayjeet Chakraborty&lt;/strong> and the OSRE team for their continuous support and guidance.&lt;/p>
&lt;p>Excited for the next half, where the real scale begins!&lt;/p>
&lt;p>Stay tuned for updates. You can find more about the project on my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">OSRE project page&lt;/a>.&lt;/p></description></item><item><title>Benchmarking the Future: Exploring High-Speed Scientific Data Streaming</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/</link><pubDate>Sun, 06 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/</guid><description>&lt;p>Hello! I&amp;rsquo;m &lt;a href="https://ucsc-ospo.github.io/author/ankitkat042/" target="_blank" rel="noopener">Ankit Kumar&lt;/a>, and although I&amp;rsquo;m a bit late with this introduction post due to a busy period filled with interviews and college formalities, I&amp;rsquo;m excited to share my journey with the OSRE 2025 program and the fascinating world of scientific data streaming.&lt;/p>
&lt;h2 id="about-me">About Me&lt;/h2>
&lt;p>I&amp;rsquo;m currently pursuing my BTech degree at the Indraprastha Institute of Information Technology Delhi (IIIT Delhi) and am based in New Delhi, India. As I approach graduation, I&amp;rsquo;m thrilled to be working on a project that perfectly aligns with my interests in systems and networking.&lt;/p>
&lt;p>My passion for technology has led me through various experiences:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Software Developer at CloudLabs&lt;/strong>: I worked at a platform founded by &lt;a href="https://faculty.iiitd.ac.in/~sumit/" target="_blank" rel="noopener">Dr. Sumit J Darak&lt;/a> that facilitates remote access to actual FPGA boards on a slot basis, making hardware experimentation accessible to students worldwide.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Data Mining Intern at &lt;a href="https://tasktracker.in/" target="_blank" rel="noopener">TaskTracker.in&lt;/a>&lt;/strong>: This experience gave me insights into large-scale data processing and analysis.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Undergraduate Researcher&lt;/strong>: Currently working under &lt;a href="https://faculty.iiitd.ac.in/~mukulika/" target="_blank" rel="noopener">Dr. Mukulika Maity&lt;/a> on benchmarking QUIC and TCP protocols across different environments including bare metal, virtual machines, and containers.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>I chose this OSRE project because it represents an incredible opportunity to work with some of the best minds in the industry at Argonne National Laboratory (ANL) while diving deep into cutting-edge networking technologies.&lt;/p>
&lt;h2 id="my-project-scistream-performance-analysis">My Project: SciStream Performance Analysis&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream">SciStream project&lt;/a>, I&amp;rsquo;m focusing on two critical aspects of high-performance scientific data streaming:&lt;/p>
&lt;h3 id="1-tcpudp-performace-benchmarking">1. TCP/UDP Performace Benchmarking&lt;/h3>
&lt;p>I&amp;rsquo;m conducting comprehensive benchmarking of SSH and TLS tunnels using various open-source tools and parameters. This work is crucial for understanding how different protocols and their overhead impact the performance of real-time scientific data streaming. The goal is to provide researchers with evidence-based recommendations for moving and processing their data at high speed without compromising performance.&lt;/p>
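&lt;p>As a simplified illustration of two of the metrics involved, the sketch below measures time-to-first-byte and sustained throughput over a local socket pair; in the actual experiments the sender would sit behind the SSH or TLS tunnel under test:&lt;/p>

```python
# Measure time-to-first-byte (TTFB) and sustained throughput over a
# local socket pair. No tunnel here: this is only the measurement shape.
import socket
import threading
import time

def stream_and_measure(total_mb=8, chunk_kb=64):
    rx, tx = socket.socketpair()
    payload = b"x" * (chunk_kb * 1024)
    n_chunks = (total_mb * 1024) // chunk_kb

    def sender():
        for _ in range(n_chunks):
            tx.sendall(payload)
        tx.close()

    threading.Thread(target=sender, daemon=True).start()
    start = time.perf_counter()
    first = rx.recv(65536)          # first bytes arriving marks TTFB
    ttfb = time.perf_counter() - start
    received = len(first)
    while True:
        data = rx.recv(65536)
        if not data:
            break
        received += len(data)
    elapsed = time.perf_counter() - start
    rx.close()
    return {
        "ttfb_s": ttfb,
        "throughput_mbps": received * 8 / elapsed / 1e6,
        "received_bytes": received,
    }
```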
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="benchmarking_meme.png" srcset="
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_30852e26909ae3e8a70a243539d202b3.webp 400w,
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_cc337ea13c6aaff8c44a6cc4b452a3e3.webp 760w,
/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/anl/scistream/20250706-ankitkat042/benchmarking_meme_hu421a7ebb13e86e740532f6be0545cf9e_1301468_30852e26909ae3e8a70a243539d202b3.webp"
width="760"
height="754"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h3 id="2-quic-proxy-exploration">2. QUIC Proxy Exploration&lt;/h3>
&lt;p>I&amp;rsquo;m exploring different QUIC proxy implementations to understand their potential advantages over traditional TCP+TLS proxies in scientific workflows. QUIC, the protocol that powers modern web applications like YouTube, offers promising features for scientific data streaming, but comprehensive benchmarking is needed to validate its benefits.&lt;/p>
&lt;h2 id="working-with-cutting-edge-testbeds">Working with Cutting-Edge Testbeds&lt;/h2>
&lt;p>Currently, I&amp;rsquo;m conducting experiments using both the &lt;strong>&lt;a href="https://portal.fabric-testbed.net/" target="_blank" rel="noopener">FABRIC testbed&lt;/a>&lt;/strong> and &lt;strong>&lt;a href="https://www.es.net/" target="_blank" rel="noopener">ESnet testbed&lt;/a>&lt;/strong>. These platforms provide access to real high-speed network infrastructure, allowing me to test protocols and configurations under realistic conditions that mirror actual scientific computing environments.&lt;/p>
&lt;h2 id="the-team-experience">The Team Experience&lt;/h2>
&lt;p>These past two weeks have been incredibly rewarding, working alongside:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/alain-zhang-672086205/" target="_blank" rel="noopener">Alain Zhang&lt;/a>&lt;/strong> - my project mate from UC San Diego, cool guy.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/castroflavio/" target="_blank" rel="noopener">Flavio Castro&lt;/a>&lt;/strong> - My project mentor and manager, goto person for my issues. currently at anl as a research development software engineer.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.anl.gov/profile/joaquin-chung" target="_blank" rel="noopener">Joaquin Chung&lt;/a>&lt;/strong> - Super mentor, brains behind the project. His guidance on the project is super valubale.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.anl.gov/profile/rajkumar-kettimuthu" target="_blank" rel="noopener">Rajkumar Kettimuthu&lt;/a>&lt;/strong> - Lead Scientist in our project whose comments on our paper critique are invaluable.&lt;/li>
&lt;li>&lt;strong>&lt;a href="https://www.linkedin.com/in/seena-vazifedunn/" target="_blank" rel="noopener">Seena Vazifedunn&lt;/a>&lt;/strong> - Graduate Research Assistant at University of Chicago. He asks very relevant and important questions during our report presentation and his feedbacks are very insightful.&lt;/li>
&lt;/ul>
&lt;p>The collaborative nature of this project has been fantastic, combining perspectives from different institutions and backgrounds to tackle complex networking challenges.&lt;/p>
&lt;p>Stay tuned for updates!&lt;/p>
&lt;hr>
&lt;p>&lt;em>This work is part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream">SciStream project&lt;/a> at Argonne National Laboratory, reimagining how scientific data moves across modern research infrastructure.&lt;/em>&lt;/p></description></item><item><title>Building a Billion-Scale Vector Embeddings Dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</guid><description>&lt;h1 id="billion-vector-embeddings-dataset">Billion Vector Embeddings Dataset&lt;/h1>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">Billion-Scale Embeddings Dataset project&lt;/a>, my &lt;a href="GSoC-proposal.pdf">proposal&lt;/a> under the mentorship of &lt;strong>Jayjeet Chakraborty&lt;/strong> aims to create the first large-scale, real-world vector embeddings dataset—bridging the critical gap in Approximate Nearest Neighbor (ANN) benchmarks and Retrieval-Augmented Generation (RAG) systems.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Existing ANN benchmarks often fall short—they’re either synthetic (like SIFT) or too small-scale (≤1M vectors). With the rapid evolution of LLM-based vector search systems (e.g., OpenAI’s 3072d &lt;code>text-embedding-3-large&lt;/code>), there&amp;rsquo;s a growing need for:&lt;/p>
&lt;ul>
&lt;li>High-dimensional (&amp;gt;1000d), large-scale (&amp;gt;100M) embeddings&lt;/li>
&lt;li>Real-world distributions (Wikipedia-scale text)&lt;/li>
&lt;li>Open, reproducible benchmarks for the community&lt;/li>
&lt;/ul>
&lt;h2 id="project-goals">Project Goals&lt;/h2>
&lt;ul>
&lt;li>Generate &lt;strong>1 billion&lt;/strong> embeddings from English Wikipedia using open-source models.&lt;/li>
&lt;li>Create multiple dimensional variants: &lt;strong>1024d&lt;/strong>, &lt;strong>4096d&lt;/strong>, and &lt;strong>8192d&lt;/strong>.&lt;/li>
&lt;li>Deduplicate, compress, and store embeddings with rich metadata (URL, timestamps, models).&lt;/li>
&lt;li>Benchmark ANN performance on FAISS, HNSW, and Annoy.&lt;/li>
&lt;li>Distribute the dataset via HuggingFace &amp;amp; AWS S3 with shard-level access.&lt;/li>
&lt;/ul>
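&lt;p>The deduplication step above can be sketched as content-hashing each chunk and keeping only the first occurrence, carrying metadata along with the kept entries. This is a minimal sketch with illustrative field names, not the project’s actual schema:&lt;/p>

```python
# Hash-based deduplication of text chunks, preserving first occurrence
# and its metadata (url, timestamp, model). Field names are illustrative.
import hashlib

def deduplicate(chunks):
    """chunks: list of dicts with a 'text' key plus arbitrary metadata."""
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk["text"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept
```

&lt;p>At billion scale the &lt;code>seen&lt;/code> set would be sharded or replaced with an on-disk structure, but the first-occurrence-wins logic stays the same.&lt;/p>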
&lt;h2 id="open-source-impact">Open Source Impact&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>ANN Libraries&lt;/strong>: Enable reproducible benchmarking for real-world workloads.&lt;/li>
&lt;li>&lt;strong>RAG Systems&lt;/strong>: Evaluate and optimize retrieval at scale using real Wikipedia text.&lt;/li>
&lt;li>&lt;strong>Researchers&lt;/strong>: Conduct large-scale studies on dimensionality, ANN accuracy, and compression trade-offs.&lt;/li>
&lt;/ul>
&lt;hr></description></item><item><title>h5bench with AI workloads</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</link><pubDate>Tue, 11 Feb 2025 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/lbl/h5bench-ai/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench-with-ai-workloads">h5bench with AI workloads&lt;/h3>
&lt;p>The proposed work will include (1) analyzing and characterizing AI workloads that rely on HDF5 datasets, (2) extracting a kernel of their I/O operations, and (3) implementing and validating the kernel in h5bench.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/suren-byna/">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Architecting the Future of Scientific Data: Multi-Site Streaming Without Compromise</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</link><pubDate>Mon, 10 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/anl/scistream/</guid><description>&lt;p>Data is generated at ever-increasing rates, yet it’s often processed more slowly than it’s collected. Scientific instruments frequently operate below their full capacity or discard
valuable data due to network bottlenecks, security domain mismatches, and insufficient real-time processing capabilities.&lt;/p>
&lt;p>&lt;a href="https://github.com/scistream/scistream-proto" target="_blank" rel="noopener">SciStream&lt;/a> reimagines how scientific data moves across modern research infrastructure by providing a framework for high-speed (+100Gbps)
memory-to-memory streaming that doesn’t compromise on security. Whether connecting scientific instruments to analysis clusters or bridging across institutional boundaries, SciStream provides the foundation for next-generation scientific
workflows.&lt;/p>
&lt;p>Building on our &lt;a href="https://dl.acm.org/doi/abs/10.1145/3502181.3531475" target="_blank" rel="noopener">published research&lt;/a>, we’re now expanding the framework’s capabilities through open-source development and community
collaboration. These projects offer an opportunity for
students to gain hands-on experience with cutting-edge networking and security technologies used in high-performance computing (HPC), cloud infrastructure, and large-scale scientific
experiments.&lt;/p>
&lt;h3 id="scistream-securebench-a-framework-for-benchmarking-security-protocols-in-scientific-data-streaming">SciStream-SecureBench: A Framework for Benchmarking Security Protocols in Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Security Protocols, Network Performance, Data Streaming, Reproducibility, High-throughput Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Scripting, Linux, Network Protocol Analysis, Containers, Benchmarking tools&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered why large scientific experiments need to move massive amounts of data securely and quickly? While TLS and SSH are standard for secure data transfer,
there’s a surprising lack of benchmarks that evaluate their performance in high-speed scientific workflows. This project aims to fill this gap by developing a benchmarking suite that
measures how different security configurations impact real-time scientific data streaming.&lt;/p>
&lt;h3 id="specific-tasks-of-the-project-include">&lt;strong>Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing benchmarking tools that measure key security performance metrics like handshake latency, throughput stability, and computational overhead.&lt;/li>
&lt;li>Running &lt;strong>real-world experiments&lt;/strong> on research testbeds (Chameleon, FABRIC) to simulate scientific data patterns.&lt;/li>
&lt;li>Automating comparative analysis between TLS and SSH, with focus on streaming-specific metrics like &lt;strong>time-to-first-byte and sustained throughput&lt;/strong>.&lt;/li>
&lt;li>Documenting best practices for security protocol selection in high-performance streaming.&lt;/li>
&lt;/ul>
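&lt;p>As a hedged sketch of the comparative-analysis task, the snippet below summarizes repeated throughput samples per tunnel configuration into the kinds of summary statistics such a report would compare; the configuration and metric names are illustrative:&lt;/p>

```python
# Summarize repeated throughput measurements per configuration
# (e.g. TLS vs. SSH) into mean, standard deviation, and p95.
import statistics

def summarize(samples_mbps):
    """Summarize repeated throughput samples for one tunnel configuration."""
    cuts = statistics.quantiles(samples_mbps, n=20)
    return {
        "mean": statistics.mean(samples_mbps),
        "stdev": statistics.stdev(samples_mbps),
        "p95": cuts[18],  # 19th of 19 cut points = 95th percentile
    }

def compare(results):
    """results: dict mapping configuration name to a list of samples."""
    return {name: summarize(samples) for name, samples in results.items()}
```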
&lt;h3 id="why-this-matters-for-your-career">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain expertise in &lt;strong>network security and performance analysis&lt;/strong>, highly valued in cybersecurity, cloud computing, and HPC.&lt;/li>
&lt;li>Work on a &lt;strong>real research challenge&lt;/strong> with potential for publication.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-streambench-comparative-analysis-of-scientific-streaming-frameworks">SciStream-StreamBench: Comparative Analysis of Scientific Streaming Frameworks&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Data Streaming Protocols, Network Performance, Benchmarking, Distributed Systems, Real-time Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, ZeroMQ, EPICS/PVAccess, Linux, Performance Analysis, Visualization&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Scientific experiments generate enormous amounts of streaming data, but how do we choose the best framework for handling it efficiently? Despite the widespread use of ZeroMQ and
&lt;a href="https://dl.acm.org/doi/10.1145/3624062.3624610" target="_blank" rel="noopener">PVApy&lt;/a>,
there’s little systematic benchmarking comparing their performance. This project will develop &lt;strong>real-world benchmarks&lt;/strong> to evaluate how different frameworks handle scientific data in
&lt;strong>high-speed environments&lt;/strong>.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Designing benchmarking methodologies to assess key performance metrics like &lt;strong>synchronization overhead, time-to-first-data, and throughput stability&lt;/strong>.&lt;/li>
&lt;li>Developing a test harness that simulates real-world streaming conditions (network variability, concurrent streams, dynamic data rates).&lt;/li>
&lt;li>Running experiments on &lt;strong>Chameleon and FABRIC testbeds&lt;/strong>.&lt;/li>
&lt;li>Automating data collection and visualization to highlight performance trends.&lt;/li>
&lt;li>Documenting best practices and framework-specific optimizations.&lt;/li>
&lt;/ul>
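&lt;p>A minimal sketch of such a test harness, using only the standard library: a producer emits messages with variable delays to simulate network variability, and the consumer records time-to-first-data and inter-arrival jitter. A real harness would put ZeroMQ or PVApy on the wire instead of an in-process queue:&lt;/p>

```python
# Simulated streaming run: variable producer delays, consumer-side
# measurement of time-to-first-data and inter-arrival jitter.
import queue
import random
import statistics
import threading
import time

def run_stream(n_messages=50, seed=7):
    rng = random.Random(seed)
    q = queue.Queue()

    def producer():
        for i in range(n_messages):
            time.sleep(rng.uniform(0.0, 0.002))  # variable network delay
            q.put(("msg", i))
        q.put(None)  # end-of-stream marker

    start = time.perf_counter()
    threading.Thread(target=producer, daemon=True).start()
    arrivals = []
    while True:
        item = q.get()
        if item is None:
            break
        arrivals.append(time.perf_counter() - start)
    gaps = [b - a for a, b in zip(arrivals, arrivals[1:])]
    return {
        "time_to_first_data_s": arrivals[0],
        "jitter_s": statistics.stdev(gaps),
        "messages": len(arrivals),
    }
```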
&lt;h3 id="why-this-matters-for-your-career-1">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Get hands-on experience with &lt;strong>real-time data processing&lt;/strong> and &lt;strong>network performance analysis&lt;/strong>.&lt;/li>
&lt;li>Learn benchmarking techniques useful for &lt;strong>distributed systems, cloud computing, and high-performance networking&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-quic-next-generation-proxy-architecture-for-scientific-data-streaming">SciStream-QUIC: Next-Generation Proxy Architecture for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: QUIC Protocol, Network Proxies, Performance Analysis, Protocol Design, Hardware Acceleration&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python/C++, Network Programming, QUIC (quiche/aioquic), Linux, Performance Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Ever wondered how YouTube loads videos faster than traditional web pages? That’s because of &lt;strong>QUIC&lt;/strong>, a next-generation protocol designed for speed and security. Initial evaluations
of federated streaming architectures (&lt;a href="https://par.nsf.gov/servlets/purl/10380551" target="_blank" rel="noopener">INDIS'22
paper&lt;/a>) suggest potential benefits of QUIC, but comprehensive benchmarking is
needed. This project explores whether &lt;strong>QUIC-based proxies&lt;/strong> can outperform traditional &lt;strong>TCP+TLS&lt;/strong> proxies for scientific data streaming, potentially revolutionizing how researchers move
large datasets.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include-1">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Developing a &lt;strong>QUIC-based proxy&lt;/strong> optimized for scientific workflows.&lt;/li>
&lt;li>Running benchmarks to compare &lt;strong>QUIC vs. traditional TLS proxies&lt;/strong>.&lt;/li>
&lt;li>Investigating &lt;strong>hardware encryption offloading&lt;/strong> for QUIC and TLS.&lt;/li>
&lt;li>Designing &lt;strong>reproducible experiments&lt;/strong> using Chameleon and FABRIC testbeds.&lt;/li>
&lt;li>Documenting best practices for deploying &lt;strong>QUIC proxies in HPC environments&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h3 id="why-this-matters-for-your-career-2">&lt;strong>Why This Matters for Your Career&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Gain experience in &lt;strong>cutting-edge networking protocols&lt;/strong> used in cloud computing (Google, Cloudflare, etc.).&lt;/li>
&lt;li>Learn about &lt;strong>hardware acceleration&lt;/strong> and its role in high-speed networking.&lt;/li>
&lt;/ul>
&lt;h3 id="scistream-auth-modern-authentication-and-user-interface-for-scientific-data-streaming">SciStream-Auth: Modern Authentication and User Interface for Scientific Data Streaming&lt;/h3>
&lt;p>&lt;strong>Project Idea Description:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Authentication Systems, UI/UX Design, Security Integration, Scientific Computing&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Web Development (React/Vue), OAuth 2.0/SAML, Security Analysis&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/joaquin-chung/">Joaquin Chung&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/flavio-castro/">Flavio Castro&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Not a security expert? You can still contribute by designing an interactive front-end!&lt;/p>
&lt;p>In today&amp;rsquo;s scientific computing landscape, authentication and user experience often act as barriers to adoption rather than enabling seamless collaboration. While SciStream excels at
high-speed data transfer, its reliance on a single authentication provider and command-line interface limits its accessibility. This project aims to transform SciStream into a more
versatile platform by implementing a modular authentication system and developing an intuitive graphical interface.&lt;/p>
&lt;p>By expanding beyond Globus Auth to support multiple authentication frameworks, we can enable broader adoption across different scientific communities while maintaining robust security.
Coupled with a modern GUI that visualizes real-time streaming activity, this enhancement will make SciStream more accessible to researchers—allowing them to focus on their science rather
than wrestling with complex configurations.&lt;/p>
&lt;p>This project will design a user-friendly interface that makes secure scientific data streaming as intuitive as using a cloud storage service. You&amp;rsquo;ll also gain hands-on experience with
authentication methods used by industry leaders like Google and Facebook, while directly improving access to scientific data.&lt;/p>
&lt;h3 id="the-specific-tasks-of-the-project-include-2">&lt;strong>The Specific Tasks of the Project Include&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>Design and implementation of a pluggable authentication system supporting multiple providers (OAuth 2.0, SAML, OpenID Connect, certificate-based auth)&lt;/li>
&lt;li>Development of a modern, responsive GUI using web technologies that provides real-time visualization of system status&lt;/li>
&lt;li>Creation of comprehensive security testing protocols to validate the authentication implementations&lt;/li>
&lt;li>Implementation of session management and secure credential handling within the GUI&lt;/li>
&lt;li>Design of an intuitive interface for managing streaming configurations and monitoring data flows&lt;/li>
&lt;li>Creation of documentation and examples to help facilities integrate their preferred authentication mechanisms&lt;/li>
&lt;/ul></description></item><item><title>h5bench</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</link><pubDate>Tue, 30 Jan 2024 10:15:00 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/lbl/h5bench/</guid><description>&lt;p>&lt;a href="https://github.com/hpc-io/h5bench" target="_blank" rel="noopener">h5bench&lt;/a> is a suite of parallel I/O benchmarks or kernels representing I/O patterns that are commonly used in HDF5 applications on high performance computing systems. h5bench measures I/O performance from various aspects, including the I/O overhead, and observed I/O rate.&lt;/p>
&lt;p>Parallel I/O is a critical technique for moving data between compute and storage subsystems of supercomputers. With massive amounts of data produced or consumed by compute nodes, high-performance parallel I/O is essential. I/O benchmarks play an important role in this process; however, there is a scarcity of I/O benchmarks representative of current workloads on HPC systems. Toward creating representative I/O kernels from real-world applications, we have created h5bench, a set of I/O kernels that exercise HDF5 I/O on parallel file systems in numerous dimensions. Our focus on HDF5 is due to the parallel I/O library&amp;rsquo;s heavy usage in various scientific applications running on supercomputing systems. The various tests benchmarked in the h5bench suite include I/O operations (read and write), data locality (arrays of basic data types and arrays of structures), array dimensionality (1D arrays, 2D meshes, 3D cubes), and I/O modes (synchronous and asynchronous). h5bench measurements can be used to identify performance bottlenecks and their root causes and evaluate I/O optimizations. As the I/O patterns of h5bench are diverse and capture the I/O behaviors of various HPC applications, this study will be helpful to the broader supercomputing and I/O community.&lt;/p>
&lt;h3 id="h5bench--reporting-and-enhancing">h5bench / Reporting and Enhancing&lt;/h3>
&lt;p>The proposed work will include standardizing and enhancing the reports generated by the suite, and integrating additional I/O kernels (e.g., HACC-IO).&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Python, C/C++, good communicator&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="h5bench--compression">h5bench / Compression&lt;/h3>
&lt;p>The proposed work will focus on including compression capabilities into the h5bench core access patterns through HDF5 filters.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>I/O&lt;/code> &lt;code>HPC&lt;/code> &lt;code>benchmarking&lt;/code>, &lt;code>compression&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C/C++, Python, HDF5&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jean-luca-bez/">Jean Luca Bez&lt;/a> and &lt;a href="mailto:sbyna@lbl.gov">Suren Byna&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>