<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>distributed systems | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/distributed-systems/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/distributed-systems/index.xml" rel="self" type="application/rss+xml"/><description>distributed systems</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Sat, 31 Jan 2026 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>distributed systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/distributed-systems/</link></image><item><title>Reconfigurable and Placement-Aware Replication for Edge Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/umass/edge-replication/</link><pubDate>Sat, 31 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/umass/edge-replication/</guid><description>&lt;h2 id="project-description">Project Description&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Distributed systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Rust, Java, Go, Python, Bash scripting, Linux, Docker.&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="mailto:fikurnia@cs.umass.edu">Fadhil I. Kurnia&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Modern replicated systems are typically evaluated under static configurations with fixed replica placement. However, real-world edge deployments are highly dynamic: workloads shift geographically, edge nodes join or fail, and latency conditions change over time. Our existing testbed provides reproducible evaluation for replicated systems but lacks support for dynamic reconfiguration and adaptive edge placement policies.&lt;/p>
&lt;p>This project extends the existing open testbed to support:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Dynamic Replica Reconfiguration&lt;/p>
&lt;ul>
&lt;li>Membership changes (add/remove replicas)&lt;/li>
&lt;li>Leader migration and shard movement&lt;/li>
&lt;li>Online reconfiguration cost measurement (latency spikes, recovery overhead, state transfer cost)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Edge-Aware Placement Policies&lt;/p>
&lt;ul>
&lt;li>Demand-aware placement based on geographic workload skew&lt;/li>
&lt;li>Latency-aware and bandwidth-aware replica selection&lt;/li>
&lt;li>Comparison of static vs. adaptive placement strategies&lt;/li>
&lt;li>Evaluation under real-world latency matrices (e.g., US metro-level or cloud region traces)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>What-if Simulation Framework&lt;/p>
&lt;ul>
&lt;li>Replay workload traces with time-varying demand&lt;/li>
&lt;li>Simulate hundreds of edge sites with realistic network conditions&lt;/li>
&lt;li>Quantify trade-offs between consistency, availability, reconfiguration overhead, and cost&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
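&lt;p>To make the placement-policy idea concrete, the following is a hypothetical sketch (not part of the existing testbed) of a greedy latency-minimizing heuristic: given a client-to-site latency matrix, it repeatedly opens the candidate site that most reduces the total client-to-nearest-replica latency.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Hypothetical sketch: greedy latency-minimizing replica placement.
def greedy_placement(latency, num_replicas):
    """latency[c][s] is the latency from client c to candidate site s."""
    num_clients = len(latency)
    num_sites = len(latency[0])
    chosen = []
    # Best latency per client so far (infinite before any replica exists).
    best = [float("inf")] * num_clients
    for _ in range(num_replicas):
        def cost_if_added(site):
            return sum(min(best[c], latency[c][site]) for c in range(num_clients))
        # Open the unchosen site that yields the lowest total latency.
        site = min((s for s in range(num_sites) if s not in chosen),
                   key=cost_if_added)
        chosen.append(site)
        for c in range(num_clients):
            best[c] = min(best[c], latency[c][site])
    return chosen
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>A facility-location or k-means policy would plug into the same interface, differing only in how the next site is scored.&lt;/p>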
&lt;p>The outcome will be an &lt;a href="https://distrobench.org" target="_blank" rel="noopener">open-source framework&lt;/a> that enables researchers to evaluate not only steady-state replication performance, but also how systems behave under churn, scaling events, and demand shifts, which are central challenges in real edge environments.&lt;/p>
&lt;h3 id="expected-deliverables">Expected Deliverables&lt;/h3>
&lt;ul>
&lt;li>Reconfiguration abstraction layer (API for membership &amp;amp; placement changes)&lt;/li>
&lt;li>Placement policy plugin framework (k-means, facility-location heuristics, latency-minimizing, cost-aware)&lt;/li>
&lt;li>Trace-driven dynamic workload engine&lt;/li>
&lt;li>Public benchmark scenarios and reproducible experiment scripts&lt;/li>
&lt;li>Artifact-ready documentation and evaluation report&lt;/li>
&lt;/ul></description></item><item><title>[Final Blog] Distrobench: Distributed Protocol Benchmark</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250830-panjisri/</link><pubDate>Sat, 30 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250830-panjisri/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>This is the final blog for our contribution to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fadhil-kurnia/">Fadhil Kurnia&lt;/a> for the OSRE program.&lt;/p>
&lt;p>&lt;a href="https://github.com/fadhilkurnia/distro" target="_blank" rel="noopener">Distrobench&lt;/a> is a framework for evaluating the performance of replication/coordination protocols for distributed systems. The framework standardizes benchmarking by allowing different protocols to be tested under an identical workload, and supports both local and remote deployment of the protocols. The protocols tested are restricted to a key-value store application and are categorized by &lt;a href="https://jepsen.io/consistency/models" target="_blank" rel="noopener">consistency model&lt;/a>, programming language, and persistency (whether the implementation stores its data in memory or on disk).&lt;/p>
&lt;p>All the benchmark results are stored in a &lt;code>data.json&lt;/code> file which can be viewed through a webpage we have provided. A user can clone the git repository, benchmark different protocols on their own machine or in a cluster of remote machines, then view the results locally. We also provided a &lt;a href="https://distrobench.org" target="_blank" rel="noopener">webpage&lt;/a> that shows our own benchmark results which ran on 3 Amazon EC2 t2.micro instances.&lt;br>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_2eb41220c4287bdc730b38c76a5643f8.webp 400w,
/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_789a9a55850eed73f3a681f8423873cf.webp 760w,
/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250830-panjisri/image_hu785d614b38f6808c04fc85bf3c31eb36_153748_2eb41220c4287bdc730b38c76a5643f8.webp"
width="760"
height="381"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="how-to-run-a-benchmark-on-distrobench">How to run a benchmark on Distrobench&lt;/h2>
&lt;p>Before running a benchmark with Distrobench, the protocol to be benchmarked must first be built. This allows the script to initialize the protocol instance for a local benchmark or to send the binaries to the remote machines. The remote machine running the protocol does not need the source code of the protocol implementation, but it does require the dependencies for running that specific protocol, such as Java, Docker, rsync, etc. The following commands build the &lt;a href="https://github.com/ailidani/paxi" target="_blank" rel="noopener">ailidani/paxi&lt;/a> project, which needs no additional dependencies to run on a remote machine:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-sh" data-lang="sh">&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Clone the Distrobench repository &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">git clone git@github.com:fadhilkurnia/distro.git
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Clone the Paxi repository and build the binary &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> distro/sut/ailidani.paxi
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">git clone git@github.com:ailidani/paxi.git
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> paxi/bin/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">./build.sh
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="c1"># Go back to the Distrobench root directory &amp;amp; run python script &lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="nb">cd&lt;/span> ../../../..
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">python main.py
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>By default, the script starts 3 local instances of a Paxi protocol implementation chosen by the user through the CLI. The user can modify the number of running instances and whether they are deployed locally or on remote machines by changing the contents of the &lt;code>.env&lt;/code> file in the root directory. The following are the contents of the default &lt;code>.env&lt;/code> file:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">NUM_OF_NODES=3
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">SSH_KEY=ssh-key.pem
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">REMOTE_USERNAME=ubuntu
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PUBLIC_IP1=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PUBLIC_IP2=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PUBLIC_IP3=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PRIVATE_IP1=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PRIVATE_IP2=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">PRIVATE_IP3=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">CLIENT_IP=127.0.0.1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">OUTPUT=data.json
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When running a remote benchmark, a ssh-key should also be added in the root directory to allow the use of ssh and rsync from within the python script. All machines must also allow TCP connection through port 2000-2300 and port 3000-3300 because that would be the port range for communication between the running instances as well as for the YCSB benchmark. Running the benchmark requires the use of at least 3 nodes because it is the minimum number of nodes to support most protocols (5 nodes recommended).&lt;/p>
&lt;p>To view the benchmark result in the web page locally, move &lt;code>data.json&lt;/code> into the &lt;code>docs/&lt;/code> directory and run &lt;code>python -m http.server 8000&lt;/code>. The page is then accessible through &lt;code>http://localhost:8000&lt;/code>.&lt;/p>
&lt;h2 id="deep-dive-on-how-distrobench-works">Deep dive on how Distrobench works&lt;/h2>
&lt;p>The following is the project structure of the Distrobench repository:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">distro/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── main.py // Main python script for running benchmark
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── data.json // Output file for main.py
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── README.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── .env // Config for running the benchmark
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── docs/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── index.html // Web page to show benchmark results
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── data.json // Output file displayed by web page
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── README.md
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">├── src/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ ├── utils/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">│ └── ycsb/ // Submodule for YCSB
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">└── sut/ // Systems under test
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── ailidani.paxi/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> └── run.py // Protocol-specific benchmark script called by main.py
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── apache.zookeeper/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── etcd-io.etcd/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── fadhilkurnia.xdn/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── holipaxos-artifect.holipaxos/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ├── otoolep.hraftd/
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> └── tikv.tikv/
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;code>main.py&lt;/code> will automatically detect directories inside &lt;code>sut/&lt;/code> and will call the main function inside &lt;code>run.py&lt;/code>. The following is the structure of &lt;code>run.py&lt;/code> written in pseudocode style:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">FUNCTION main(run_ycsb: Function, nodes: List of Nodes, ssh: Dictionary)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> node_data = map_ip_port(nodes)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">    SWITCH user_input
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> CASE 0:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> start()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> RETURN
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> CASE 1:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> stop()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> RETURN
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> CASE 2:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> client_data = []
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> FOR EACH item IN node_data
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> ADD item.client_addr TO client_data
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> END FOR
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> run_ycsb(client_data)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> RETURN
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> END SWITCH
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">FUNCTION start()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Start the protocol instance (local or remote)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">FUNCTION stop()
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Stop the protocol instance (local or remote)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">FUNCTION map_ip_port(nodes: List of Nodes) -&amp;gt; List of Dictionary
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> // Generate port numbers based on the protocol requirements
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">END FUNCTION
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The .env file provides both public and private IP addresses for versatility when running a remote benchmark. The private IPs are used for communication between remote machines when they are in the same network group. For our own benchmark, four t2.micro EC2 instances were deployed in the same network group: three ran the protocol and the fourth acted as the YCSB client. It is possible to use your local machine as the YCSB client instead of another remote machine by setting &lt;code>CLIENT_IP&lt;/code> in the .env file to &lt;code>127.0.0.1&lt;/code>. Using a remote machine as the YCSB client minimizes the impact of network latency between the client and the protocol servers.&lt;/p>
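&lt;p>How the .env file is turned into the node records handed to each &lt;code>run.py&lt;/code> can be sketched roughly as follows (a simplified, hypothetical version; the real &lt;code>main.py&lt;/code> may structure this differently):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Hypothetical sketch of parsing the .env file into node records.
def parse_env(text):
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key] = value
    nodes = []
    for i in range(1, int(env["NUM_OF_NODES"]) + 1):
        nodes.append({
            "public_ip": env["PUBLIC_IP%d" % i],
            "private_ip": env["PRIVATE_IP%d" % i],
        })
    return nodes, env.get("CLIENT_IP", "127.0.0.1")
&lt;/code>&lt;/pre>&lt;/div>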
&lt;p>The main tasks of the &lt;code>start()&lt;/code> function can be broken down into the following:&lt;/p>
&lt;ol>
&lt;li>Generate custom configuration files for each remote machine instance (this may differ between implementations: some do not require a config file because they support flag parameters out of the box, while others require multiple configuration files per instance)&lt;/li>
&lt;li>rsync binaries into the remote machine (If running a remote benchmark)&lt;/li>
&lt;li>Start the instances&lt;/li>
&lt;/ol>
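&lt;p>In a dry-run form, the tasks above amount to building one rsync and one ssh command per node. The sketch below is illustrative only: the binary path and server flags are hypothetical, and the real script also generates protocol-specific configs before this step.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Hypothetical dry-run sketch of start(): build, but do not execute,
# the per-node rsync and ssh commands.
def build_start_commands(nodes, ssh_key, user, binary="paxi/bin/server"):
    commands = []
    for i, node in enumerate(nodes):
        host = "%s@%s" % (user, node["public_ip"])
        # 1. Per-node config generation would happen here (protocol-specific).
        # 2. Copy the prebuilt binary to the remote machine.
        commands.append(["rsync", "-az", "-e", "ssh -i " + ssh_key,
                         binary, host + ":~/"])
        # 3. Start the instance in the background over ssh
        #    (the server flags are illustrative only).
        commands.append(["ssh", "-i", ssh_key, host,
                         "nohup ./server -id %d" % (i + 1)])
    return commands
&lt;/code>&lt;/pre>&lt;/div>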
&lt;p>The &lt;code>stop()&lt;/code> function is much simpler: it kills the process running the protocol and optionally removes the copied binary files on the remote machine. The &lt;code>run_ycsb()&lt;/code> function passed into &lt;code>run.py&lt;/code> is defined in &lt;code>main.py&lt;/code> and currently supports two types of workload:&lt;/p>
&lt;ol>
&lt;li>Read-heavy: A single-client workload with 95% read and 5% update (write) operations&lt;/li>
&lt;li>Update-heavy: A single-client workload with 50% read and 50% update (write) operations&lt;/li>
&lt;/ol>
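&lt;p>In YCSB terms, the read-heavy workload corresponds roughly to a CoreWorkload property file like the one below. The property names are standard YCSB; the record count shown is an assumption, since only the operation count and read/update mix are fixed above.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback"># Read-heavy workload: 95% reads, 5% updates, 1000 operations
workload=site.ycsb.workloads.CoreWorkload
recordcount=1000
operationcount=1000
readproportion=0.95
updateproportion=0.05
scanproportion=0
insertproportion=0
&lt;/code>&lt;/pre>&lt;/div>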
&lt;p>A new workload can be added inside the &lt;code>src/ycsb/workloads&lt;/code> directory. Both workloads above run only 1000 operations, which may not be enough to properly evaluate the performance of the protocols. It should also be noted that while YCSB supports a &lt;code>scan&lt;/code> operation, it is not used in our benchmark because none of the tested protocols implement it.&lt;/p>
&lt;h3 id="how-to-implement-a-new-protocol-in-distrobench">How to implement a new protocol in Distrobench&lt;/h3>
&lt;p>Adding a new protocol to Distrobench requires implementing two main components: a Python integration script (&lt;code>run.py&lt;/code>) and a YCSB database binding for benchmarking.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Create the protocol directory structure&lt;/p>
&lt;ul>
&lt;li>Create a new directory under &lt;code>sut/&lt;/code> using the format &lt;code>yourrepo.yourprotocol/&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Write &lt;code>run.py&lt;/code> integration&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Put the script inside the &lt;code>yourrepo.yourprotocol/&lt;/code> directory&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Must have the &lt;code>main(run_ycsb, nodes, ssh)&lt;/code> function.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Add start/stop/benchmark menu options&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Handle local (127.0.0.1) and remote deployment&lt;/p>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Create YCSB client&lt;/p>
&lt;ul>
&lt;li>Make Java class extending YCSB&amp;rsquo;s DB class&lt;/li>
&lt;li>Put inside &lt;code>src/ycsb/yourprotocol/src/main/java/site/ycsb/yourprotocol&lt;/code>&lt;/li>
&lt;li>Implement &lt;code>read()&lt;/code>, &lt;code>insert()&lt;/code>, &lt;code>update()&lt;/code>, &lt;code>delete()&lt;/code> methods&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Register your client&lt;/p>
&lt;ul>
&lt;li>Register your client to &lt;code>src/pom.xml&lt;/code>, &lt;code>src/ycsb/bin/binding.properties&lt;/code>, and &lt;code>src/ycsb/bin/ycsb&lt;/code>.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>Build and test&lt;/p>
&lt;ul>
&lt;li>Run &lt;code>cd src/ycsb &amp;amp;&amp;amp; mvn clean package&lt;/code>&lt;/li>
&lt;li>Run &lt;code>python main.py&lt;/code>&lt;/li>
&lt;li>Select your protocol and test it&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
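&lt;p>For step 4, the entry in &lt;code>binding.properties&lt;/code> is typically a single line mapping a short binding name to the fully qualified client class (the names below are placeholders for your protocol):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback"># src/ycsb/bin/binding.properties (placeholder names)
yourprotocol:site.ycsb.yourprotocol.YourProtocolClient
&lt;/code>&lt;/pre>&lt;/div>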
&lt;h2 id="protocols-which-have-been-tested">Protocols which have been tested&lt;/h2>
&lt;p>Distrobench has tested 20 different distributed consensus protocols across 7 different implementation projects.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;a href="https://github.com/ailidani/paxi" target="_blank" rel="noopener">ailidani/paxi&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability, Eventual&lt;/li>
&lt;li>Protocol : Paxos, EPaxos, SDpaxos, WPaxos, ABD, chain, VPaxos, WanKeeper, KPaxos, Paxos_groups, Dynamo, Blockchain, M2Paxos, HPaxos.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/apache/zookeeper" target="_blank" rel="noopener">apache/zookeeper&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Java&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability + Primary Integrity&lt;/li>
&lt;li>Protocol : ZooKeeper implements ZAB (ZooKeeper Atomic Broadcast)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/etcd-io/etcd" target="_blank" rel="noopener">etcd-io/etcd&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Raft&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/fadhilkurnia/xdn" target="_blank" rel="noopener">fadhilkurnia/xdn&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Java, Rust&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability, Linearizability + Primary Integrity&lt;/li>
&lt;li>Protocol : Gigapaxos&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/Zhiying12/holipaxos-artifect" target="_blank" rel="noopener">Zhiying12/holipaxos-artifect&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go, Rust&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Holipaxos, Omnipaxos, Multipaxos&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/otoolep/hraftd" target="_blank" rel="noopener">otoolep/hraftd&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Go&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Raft&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;a href="https://github.com/tikv/tikv" target="_blank" rel="noopener">tikv/tikv&lt;/a>&lt;/p>
&lt;ul>
&lt;li>Programming Language : Rust&lt;/li>
&lt;li>Persistency : On-Disk&lt;/li>
&lt;li>Consistency Model : Linearizability&lt;/li>
&lt;li>Protocol : Raft&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;ul>
&lt;li>When attempting to benchmark HoliPaxos, the main challenge was handling versions that rely on persistent storage with RocksDB. Since some implementations are written in Go, it was necessary to find compatible versions of RocksDB and gRocksDB (for example, RocksDB 10.5.1 works with gRocksDB 1.10.2). Another difficulty was that RocksDB is resource-intensive to compile, and in our project we did not have sufficient CPU capacity on the remote machine to build RocksDB and run remote benchmarks.&lt;/li>
&lt;li>Some projects did not compile successfully at first and required minor modifications to run.&lt;/li>
&lt;/ul>
&lt;h2 id="conclusion-and-future-improvements">Conclusion and future improvements&lt;/h2>
&lt;p>The current benchmark results show the performance of all the mentioned protocols in terms of throughput and benchmark runtime. The results are subject to revision because they may not reflect the best performance of the protocols due to an unoptimized deployment script. We also plan to switch to a more powerful EC2 instance type because t2.micro does not have enough resources to support RocksDB or TiKV.&lt;/p>
&lt;p>In the near future, additional features will be added to Distrobench such as:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Multi-Client Support:&lt;/strong> The YCSB client will start multiple clients which will send requests in parallel to different servers in the group.&lt;/li>
&lt;li>&lt;strong>Commit Versioning:&lt;/strong> Label each benchmark result with the commit hash of the protocol&amp;rsquo;s repository version, allowing comparison of different versions of the same project.&lt;/li>
&lt;li>&lt;strong>Adding more Primary-Backup, Sequential, Causal, and Eventual consistency protocols:&lt;/strong> Implementations with support for a consistency model other than linearizability and one that provides an existing key-value store application are notoriously difficult to find.&lt;/li>
&lt;li>&lt;strong>Benchmark on node failure&lt;/strong>&lt;/li>
&lt;li>&lt;strong>Benchmark on the addition of a new node&lt;/strong>&lt;/li>
&lt;/ul></description></item><item><title>[Final] Reproducibility of Interactive Notebooks in Distributed Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/08202025-rahmad/</link><pubDate>Wed, 20 Aug 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/depaul/notebook-rep/08202025-rahmad/</guid><description>&lt;p>I am sharing an overview of my project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad">Reproducibility of Interactive Notebooks in Distributed Environments&lt;/a> and the work that I did this summer.&lt;/p>
&lt;h1 id="project-overview">Project Overview&lt;/h1>
&lt;p>This project aims to improve the reproducibility of interactive notebooks executed in a distributed environment. Notebooks such as those in the &lt;a href="https://jupyter.org/" target="_blank" rel="noopener">Jupyter&lt;/a> environment have become increasingly popular and are widely used in the scientific community due to their ease of use and portability. Reproducing these notebooks is a challenging task, especially in a distributed cluster environment.&lt;/p>
&lt;p>In the distributed environments we consider, the notebook code is divided into manager and worker code. The manager code is the main entry point of the program; it divides the task at hand into one or more worker codes that run in a parallel, distributed fashion. We utilize several open-source tools to package and containerize the application code so that it can be reproduced across different machines and environments. They include &lt;a href="https://github.com/radiant-systems-lab/sciunit" target="_blank" rel="noopener">Sciunit&lt;/a>, &lt;a href="https://github.com/radiant-systems-lab/Flinc" target="_blank" rel="noopener">FLINC&lt;/a>, and &lt;a href="https://cctools.readthedocs.io/en/stable/taskvine/" target="_blank" rel="noopener">TaskVine&lt;/a>. These are the high-level goals of this project:&lt;/p>
&lt;ol>
&lt;li>Generate execution logs for a notebook program.&lt;/li>
&lt;li>Generate code and data dependencies for notebook programs in an automated manner.&lt;/li>
&lt;li>Utilize the generated dependencies at various granularities to automate the deployment and execution of notebooks in a parallel and distributed environment.&lt;/li>
&lt;li>Audit and package the notebook code running in a distributed environment.&lt;/li>
&lt;li>Overall, support efficient reproducibility of notebook programs.&lt;/li>
&lt;/ol>
&lt;h1 id="progress-highlights">Progress Highlights&lt;/h1>
&lt;p>Here are the details of the work that I did during this summer.&lt;/p>
&lt;h2 id="generation-of-execution-logs">Generation of Execution Logs&lt;/h2>
&lt;p>We generate execution logs for the notebook programs in a distributed environment using the Linux utility &lt;a href="https://man7.org/linux/man-pages/man1/strace.1.html" target="_blank" rel="noopener">strace&lt;/a>, which records every system call made by the notebook, including all files accessed during its execution. We collect separate logs for the manager and the worker code since they are executed on different machines and have different dependencies. By recording the entire notebook execution, we capture all libraries, packages, and data files referenced during execution in the form of execution logs. These logs are then utilized for further analyses.&lt;/p>
&lt;h2 id="extracting-software-dependencies">Extracting Software Dependencies&lt;/h2>
&lt;p>When a library such as a Python package like &lt;em>Numpy&lt;/em> is used by the notebook program, an entry is made in the execution log containing the complete path of the accessed library file(s) along with additional information. We analyze the execution logs for both the manager and the workers to find and list all dependencies. So far, we are limited to Python packages, though the methodology is general and can be used to find dependencies for any programming language. For Python packages, version numbers are also obtained by querying package managers like &lt;em>pip&lt;/em> or &lt;em>Conda&lt;/em> on the local system.&lt;/p>
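&lt;p>As a simplified illustration of this analysis, package names can be pulled out of strace &lt;code>open&lt;/code>/&lt;code>openat&lt;/code> entries by matching the &lt;code>site-packages&lt;/code> component of each path (the log line shown in the comment is an assumption about the format; the actual pipeline is more involved):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Simplified sketch: extract Python package names from an strace log.
# Assumes lines such as:
#   openat(AT_FDCWD, "/usr/lib/python3.10/site-packages/numpy/core/x.py", O_RDONLY) = 3
import re

SITE_PKG = re.compile(r'"[^"]*site-packages/([A-Za-z0-9_]+)/')

def packages_from_log(log_text):
    """Return the sorted set of package names seen in open/openat calls."""
    found = set()
    for line in log_text.splitlines():
        if "openat(" in line or "open(" in line:
            match = SITE_PKG.search(line)
            if match:
                found.add(match.group(1))
    return sorted(found)
&lt;/code>&lt;/pre>&lt;/div>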
&lt;h2 id="extracting-data-dependencies">Extracting Data Dependencies&lt;/h2>
&lt;p>We utilize similar execution logs to identify which data files were used by the notebook program. The list of logged files also contains various configuration or settings files used by certain packages and libraries. These files are removed from the list of data dependencies through post-processing that analyzes file paths.&lt;/p>
&lt;h2 id="testing-the-pipeline">Testing the Pipeline&lt;/h2>
&lt;p>We have conducted our experiments on three use cases from different domains, using between 5 and 10 workers: distributed image convolution, climate trend analysis, and high-energy physics experiment analysis. The results so far are promising, with good accuracy and only a slight running-time overhead.&lt;/p>
&lt;h2 id="processing-at-cell-level">Processing at Cell-level&lt;/h2>
&lt;p>We perform the same steps of log generation and data and software dependency extraction at the level of individual cells in a notebook instead of once for the whole notebook. As a result, we generate software and data dependencies at the level of individual notebook cells. This is achieved by interrupting control flow before and after execution of each cell to write special instructions to the execution log for marking boundaries of cell execution. We then analyze the intervals between these instructions to identify which files and Python packages are accessed by each specific cell. We use this information to generate the list of software dependencies used by that cell only.&lt;/p>
&lt;p>We also capture data dependencies by analyzing the execution logs generated by overriding the &lt;em>open&lt;/em> function call used to access various files.&lt;/p>
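&lt;p>The cell-boundary marking can be sketched as wrapping each cell&amp;rsquo;s execution with &lt;code>open()&lt;/code> calls on sentinel paths, so that the markers appear in the execution log between the cell&amp;rsquo;s own system calls (a hypothetical simplification of the actual mechanism):&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python"># Hypothetical sketch: sentinel files mark cell boundaries in the log.
import os
import tempfile

MARKER_DIR = tempfile.mkdtemp(prefix="cell_markers_")

def touch(path):
    # The open() call itself is what shows up in the strace log.
    with open(path, "w") as f:
        f.write("")

def run_cell(index, cell_fn):
    touch(os.path.join(MARKER_DIR, "CELL_%d_START" % index))
    try:
        return cell_fn()
    finally:
        touch(os.path.join(MARKER_DIR, "CELL_%d_END" % index))
&lt;/code>&lt;/pre>&lt;/div>
&lt;p>Any file or package access logged between a cell&amp;rsquo;s START and END markers can then be attributed to that cell.&lt;/p>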
&lt;h2 id="distributed-notebook-auditing">Distributed Notebook Auditing&lt;/h2>
&lt;p>In order to execute and audit workloads in parallel, we use &lt;a href="https://github.com/radiant-systems-lab/parallel-sciunit" target="_blank" rel="noopener">Sciunit Parallel&lt;/a>, which uses GNU Parallel for efficient parallel execution of tasks. The user specifies the number of tasks or machines, and the workload is then distributed across them. Once the execution completes, the containerized executions are gathered at the host location.&lt;/p>
&lt;h2 id="efficient-reproducibility-with-checkpointing">Efficient Reproducibility with Checkpointing&lt;/h2>
&lt;p>An important challenge with Jupyter notebooks is that re-executing them can be unnecessarily time-consuming and resource-intensive, especially when most cells remain unchanged. We worked on &lt;a href="https://github.com/talha129/NBRewind/tree/master" target="_blank" rel="noopener">NBRewind&lt;/a>, a lightweight tool that accelerates notebook re-execution by avoiding redundant computation. It integrates checkpointing, application virtualization, and content-based deduplication, and it supports two kinds of checkpoints: incremental and full-state. Incremental checkpoints store notebook states and dependencies once across multiple cells and afterwards record only their deltas; full-state checkpoints store the complete state after each cell. During restore, NBRewind reuses the outputs of unchanged cells, enabling efficient re-execution. Our empirical evaluation demonstrates that NBRewind can significantly reduce both notebook audit and repeat times with incremental checkpoints.&lt;/p>
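&lt;p>The content-based deduplication behind incremental checkpoints can be sketched as follows (illustrative Python, not NBRewind&amp;rsquo;s actual implementation; the class name is hypothetical):&lt;/p>

```python
import hashlib
import pickle

class CheckpointStore:
    """Content-addressed checkpoint store: identical cell states are kept
    once, so a checkpoint of mostly-unchanged cells adds almost no data."""
    def __init__(self):
        self.blobs = {}        # content hash -> serialized state
        self.checkpoints = []  # each checkpoint maps cell id -> hash

    def save(self, states):
        snapshot = {}
        for cell_id, state in states.items():
            blob = pickle.dumps(state)
            digest = hashlib.sha256(blob).hexdigest()
            self.blobs.setdefault(digest, blob)  # dedup: store blob once
            snapshot[cell_id] = digest
        self.checkpoints.append(snapshot)

    def restore(self, index):
        return {cell_id: pickle.loads(self.blobs[digest])
                for cell_id, digest in self.checkpoints[index].items()}
```

Saving a second checkpoint where only one cell changed stores only that cell's new blob; unchanged cells are referenced by hash.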
&lt;p>I am very happy about the experience I have had in this project, and I would encourage other students to join this program in the future.&lt;/p></description></item><item><title>Mid-term Blog: Building a Simulator for Benchmarking Replicated Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-mchan/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-mchan/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>Hello there, I&amp;rsquo;m Michael. In this report, I&amp;rsquo;ll be sharing my progress as part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fadhil-kurnia/">Fadhil Kurnia&lt;/a>.&lt;/p>
&lt;h2 id="about-the-project">About the Project&lt;/h2>
&lt;p>The goal of the project is to build a &lt;em>language-agnostic&lt;/em> interface that enables communication between clients and any consensus protocol such as MultiPaxos, Raft, Zookeeper Atomic Broadcast (ZAB), and others. Currently, many of these protocols implement their own custom mechanisms for the client to communicate with the group of peers in the network. An implementation of MultiPaxos from the &lt;a href="https://arxiv.org/abs/2405.11183" target="_blank" rel="noopener">MultiPaxos Made Complete&lt;/a> paper, for example, uses a custom Protobuf definition for the packets clients send to the MultiPaxos system. With the support of a generalized interface, different consensus protocols can be tested under the same workload to compare their performance objectively.&lt;/p>
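&lt;p>A rough sketch of what such a generalized interface could look like (illustrative Python; the class and function names here are hypothetical, not the project&amp;rsquo;s actual API):&lt;/p>

```python
from abc import ABC, abstractmethod

class ConsensusProtocol(ABC):
    """Protocol-agnostic surface: any protocol that can order commands
    can sit behind the same client-facing interface."""
    @abstractmethod
    def propose(self, command):
        ...

def handle_request(protocol, method, key, value=None):
    """Map client operations onto protocol-agnostic commands."""
    if method == "GET":
        return protocol.propose(("get", key))
    if method == "PUT":
        return protocol.propose(("put", key, value))
    raise ValueError("unsupported method: " + method)

class SingleNodeProtocol(ConsensusProtocol):
    """Stand-in 'protocol' applying commands to a local dict, used here
    only to show the interface is independent of the protocol behind it."""
    def __init__(self):
        self.store = {}

    def propose(self, command):
        if command[0] == "put":
            self.store[command[1]] = command[2]
            return "OK"
        return self.store.get(command[1])
```

Any protocol implementing `propose` — a MultiPaxos leader, a Raft node, a ZAB primary — could be dropped in without changing the client-facing code.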
&lt;h2 id="progress">Progress&lt;/h2>
&lt;ul>
&lt;li>
&lt;p>&lt;strong>Literature Study:&lt;/strong>
Reviewed papers and implementations of various protocols including GigaPaxos, Raft, Viewstamped Replication (VSR), and ZAB. Analysis focused on their log replication strategies, fault handling, and performance implications.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Development of Custom Protocol:&lt;/strong>
Two custom protocols are currently under development and will serve as initial test subjects for the testbed:&lt;/p>
&lt;ul>
&lt;li>A modified GigaPaxos protocol&lt;/li>
&lt;li>A Primary-Backup Replication protocol with strict log ordering similar to ZAB (logs are ordered based on the sequence proposed by the primary)&lt;/li>
&lt;/ul>
&lt;p>Most of my time has been spent working on the two protocols, particularly on snapshotting and state transfer functionality in the Primary-Backup protocol. Ideally, the testbed should be able to evaluate protocol performance in scenarios involving node failure or a new node being added. In these scenarios, different protocol implementations often vary in their decision of whether to take periodic snapshots or to roll forward whenever possible and generate a snapshot only when necessary.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>Early in the project, the initial goal was to benchmark different consensus protocols using arbitrary full-stack web applications as their workload. Different protocols would replicate a full-stack application running inside Docker containers across multiple nodes, and the testbed would send requests for them to coordinate between those nodes. In fact, the two custom protocols being worked on were made specifically to fit these constraints.&lt;/p>
&lt;p>Developing a custom protocol that supports the replication of a Docker container is in itself already a difficult task. Abstracting away the functionality for communicating with the Docker containers, as well as handling entry logs and snapshotting the state, is an order of magnitude more complicated.&lt;/p>
&lt;p>As mentioned in the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250613-mchan/">first blog&lt;/a>, applications can be categorized into two types: deterministic and non-deterministic. The coordination of these two types is handled in very different ways. Most consensus protocols support only deterministic systems, such as key-value stores, and can&amp;rsquo;t easily handle coordination of complex services or external side effects. Supporting non-deterministic applications would require abstracting over protocol-specific log structures, which effectively restricts the interface to protocols that conform to that abstraction, defeating the goal of making the interface broadly usable and protocol-agnostic.&lt;/p>
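&lt;p>The difference between the two coordination styles can be sketched side by side (illustrative Python with a toy &lt;code>Node&lt;/code> class, not the testbed&amp;rsquo;s actual code): deterministic applications can replay the request on every replica, while non-deterministic ones must execute once and replicate the resulting state.&lt;/p>

```python
class Node:
    """Toy key-value replica used to contrast the two strategies."""
    def __init__(self):
        self.state = {}

    def execute(self, request):
        key, value = request
        self.state[key] = value
        return "OK"

    def snapshot(self):
        return dict(self.state)

    def apply_state(self, state):
        self.state = dict(state)

def replicate(deterministic, request, nodes):
    """Deterministic apps: replay the request on every replica.
    Non-deterministic apps: execute once, then ship the resulting state."""
    if deterministic:
        return [node.execute(request) for node in nodes]
    response = nodes[0].execute(request)
    for node in nodes[1:]:
        node.apply_state(nodes[0].snapshot())
    return [response]
```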
&lt;p>Furthermore, allowing &lt;strong>any&lt;/strong> existing protocol to run something as complex as a stateful Docker container, without the protocol itself even knowing, adds another layer of complexity to the system.&lt;/p>
&lt;h2 id="future-goals">Future Goals&lt;/h2>
&lt;p>Given these challenges, I decided to pivot to using only key-value stores as the benchmark application. This aligns with most existing protocol implementations, which typically use key-value stores. The main focus now is to implement an interface that supports HTTP requests from clients to any arbitrary protocol.&lt;/p></description></item><item><title>Midterm Blog: Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/</link><pubDate>Fri, 25 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/</guid><description>&lt;p>Hello! I&amp;rsquo;m Panji Sri Kuncara Wisma and I want to share my midterm progress on the &amp;ldquo;Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&amp;rdquo; project under the mentorship of Fadhil I. Kurnia.&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The goal of our project is to create an open testbed that enables fair, reproducible evaluation of different consensus protocols (Paxos variants, EPaxos, Raft, etc.) when deployed at network edges. Currently, researchers struggle to compare these systems because they lack standardized evaluation environments and often rely on mock implementations of proprietary systems.&lt;/p>
&lt;p>XDN (eXtensible Distributed Network) is one of the important consensus systems we plan to evaluate in our benchmarking testbed. Built on GigaPaxos, it allows deployment of replicated stateful services across edge locations. As part of preparing our benchmarking framework, we need to ensure that the systems we evaluate, including XDN, are robust for fair comparison.&lt;/p>
&lt;h2 id="progress">Progress&lt;/h2>
&lt;p>As part of preparing our benchmarking tool, I have been working on refactoring XDN&amp;rsquo;s FUSE filesystem from C++ to Rust. This work is essential for creating a stable and reliable XDN platform.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="System Architecture" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_5600401ae6570bf38b96fa89a080f4f7.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_6d3b555dbec3bdb305839eda9b227acf.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/fuselog_design_hu4e0250a1afb641f82d064bca3b5b892d_118470_5600401ae6570bf38b96fa89a080f4f7.webp"
width="760"
height="439"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The diagram above illustrates how the FUSE filesystem integrates with XDN&amp;rsquo;s distributed architecture. On the left, we see the standard FUSE setup where applications interact with the filesystem through the kernel&amp;rsquo;s VFS layer. On the right, the distributed replication flow is shown: Node 1 runs &lt;code>fuselog_core&lt;/code> which captures filesystem operations and generates statediffs, while Nodes 2 and 3 run &lt;code>fuselog_apply&lt;/code> to receive and apply these statediffs, maintaining replica consistency across the distributed system.&lt;/p>
&lt;p>This FUSE component is critical for XDN&amp;rsquo;s operation as it enables transparent state capture and replication across edge nodes. By refactoring this core component from C++ to Rust, we&amp;rsquo;re hopefully strengthening the foundation for fair benchmarking comparisons in our testbed.&lt;/p>
&lt;h3 id="core-work-c-to-rust-fuse-filesystem-migration">Core Work: C++ to Rust FUSE Filesystem Migration&lt;/h3>
&lt;p>XDN relies on a FUSE (Filesystem in Userspace) component to capture filesystem operations and generate &amp;ldquo;statediffs&amp;rdquo; - records of changes that get replicated across edge nodes. The original C++ implementation worked but had memory safety concerns and limited optimization capabilities.&lt;/p>
&lt;p>I worked on refactoring from C++ to Rust, implementing several improvements:&lt;/p>
&lt;p>&lt;strong>New Features Added:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Zstd Compression&lt;/strong>: Reduces statediff payload sizes&lt;/li>
&lt;li>&lt;strong>Adaptive Compression&lt;/strong>: Intelligently chooses compression strategies&lt;/li>
&lt;li>&lt;strong>Advanced Pruning&lt;/strong>: Removes redundant operations (duplicate chmod/chown, created-then-deleted files)&lt;/li>
&lt;li>&lt;strong>Bincode Serialization&lt;/strong>: Helps avoid manual serialization code and reduces the risk of related bugs&lt;/li>
&lt;li>&lt;strong>Extended Operations&lt;/strong>: Added support for additional filesystem operations (mkdir, symlink, hardlinks, etc.)&lt;/li>
&lt;/ul>
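&lt;p>The pruning pass can be illustrated with a small sketch (Python for brevity, though the actual component is in Rust; the operation tuples are hypothetical): later chmod/chown operations supersede earlier ones on the same path, and files that are created and then deleted contribute nothing to the statediff.&lt;/p>

```python
def prune(ops):
    """Drop redundant statediff operations: keep only the last
    chmod/chown per path, and drop files created then deleted."""
    created, created_then_deleted = set(), set()
    for entry in ops:
        op, path = entry[0], entry[1]
        if op == "create":
            created.add(path)
        elif op == "delete" and path in created:
            created_then_deleted.add(path)
    last_meta = {}
    for i, entry in enumerate(ops):
        if entry[0] in ("chmod", "chown"):
            last_meta[(entry[0], entry[1])] = i
    pruned = []
    for i, entry in enumerate(ops):
        op, path = entry[0], entry[1]
        if path in created_then_deleted:
            continue  # the file never survives, so none of its ops matter
        if op in ("chmod", "chown") and last_meta[(op, path)] != i:
            continue  # superseded by a later chmod/chown on the same path
        pruned.append(entry)
    return pruned
```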
&lt;p>&lt;strong>Architectural Improvements:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Memory Safety&lt;/strong>: Rust&amp;rsquo;s ownership system helps prevent common memory management issues&lt;/li>
&lt;li>&lt;strong>Type Safety&lt;/strong>: Using Rust enums instead of integer constants for better type checking&lt;/li>
&lt;/ul>
&lt;h2 id="findings">Findings&lt;/h2>
&lt;p>The optimizations performed as expected:&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Database Performance Comparison" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_cb1ea5caaa82d543dfeabd0c97f7c4fe.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_d65f44ef3f769dddda7f0211b94ad6b6.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/performance_hudc10c2ffc95d775aedb0a1dad587d6fd_55711_cb1ea5caaa82d543dfeabd0c97f7c4fe.webp"
width="760"
height="433"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Statediff Size Reductions:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>MySQL workload&lt;/strong>: 572MB → 29.6MB (95% reduction)&lt;/li>
&lt;li>&lt;strong>PostgreSQL workload&lt;/strong>: 76MB → 11.9MB (84% reduction)&lt;/li>
&lt;li>&lt;strong>SQLite workload&lt;/strong>: 4MB → 29KB (99% reduction)&lt;/li>
&lt;/ul>
&lt;p>The combination of write coalescing, pruning, and compression proves especially effective for database workloads, where many operations involve small changes to large files.&lt;/p>
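&lt;p>The write-coalescing idea can be sketched as follows (illustrative Python, not the Rust implementation; write entries are hypothetical &lt;code>(path, offset, data)&lt;/code> tuples): overlapping or adjacent writes to the same file are merged into one contiguous region, so many small writes to a large file ship as a single range.&lt;/p>

```python
def coalesce_writes(writes):
    """Merge overlapping or adjacent writes to the same file so that many
    small writes become one write per contiguous region."""
    by_path = {}
    for path, offset, data in writes:
        by_path.setdefault(path, []).append((offset, data))
    coalesced = []
    for path, regions in by_path.items():
        regions.sort()  # process in offset order
        cur_off, cur_buf = regions[0][0], bytearray(regions[0][1])
        for off, data in regions[1:]:
            if off <= cur_off + len(cur_buf):  # overlaps or touches
                rel = off - cur_off
                cur_buf[rel:rel + len(data)] = data  # later write wins
            else:
                coalesced.append((path, cur_off, bytes(cur_buf)))
                cur_off, cur_buf = off, bytearray(data)
        coalesced.append((path, cur_off, bytes(cur_buf)))
    return coalesced
```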
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Rust vs C&amp;#43;&amp;#43; Performance Comparison" srcset="
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_2adee964972897a04e60327dcfe9675e.webp 400w,
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_dd86a6fc0dabbac3beb17266f1f49002.webp 760w,
/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250725-panjisri/latency_hu3b080735c91d058ad2f9cf67a54d5f14_21553_2adee964972897a04e60327dcfe9675e.webp"
width="760"
height="470"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Performance Comparison:&lt;/strong>
Remarkably, the Rust implementation matches or exceeds C++ performance:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>POST operations&lt;/strong>: 30% faster (10.5ms vs 15ms)&lt;/li>
&lt;li>&lt;strong>DELETE operations&lt;/strong>: 33% faster (10ms vs 15ms)&lt;/li>
&lt;li>&lt;strong>Overall latency&lt;/strong>: Consistently better (9ms vs 11ms)&lt;/li>
&lt;/ul>
&lt;h2 id="current-challenges">Current Challenges&lt;/h2>
&lt;p>While the core implementation is complete and functional, I&amp;rsquo;m currently debugging occasional latency spikes that occur under specific workload patterns. These edge cases need to be resolved before moving on to the benchmarking phase, as inconsistent performance could compromise the reliability of the evaluation.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>With the FUSE filesystem foundation nearly complete, next steps include:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Resolve latency spike issues&lt;/strong> and complete XDN stabilization&lt;/li>
&lt;li>&lt;strong>Build benchmarking framework&lt;/strong> - a comparison tool that can systematically evaluate different consensus protocols with standardized metrics.&lt;/li>
&lt;li>&lt;strong>Run systematic evaluation&lt;/strong> across protocols&lt;/li>
&lt;/ol>
&lt;p>The optimized filesystem will hopefully provide a stable base for reproducible performance comparisons between distributed consensus protocols.&lt;/p></description></item><item><title>Developing an Open Testbed for Edge Replication System Evaluation</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250615-panjisri/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250615-panjisri/</guid><description>&lt;p>Hi, I&amp;rsquo;m Panji. I&amp;rsquo;m currently contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> under the mentorship of Fadhil I. Kurnia. You can find more details on the project proposal &lt;a href="https://drive.google.com/file/d/1CFT5CJJXbQlVPz8_A9Dxkjl7oRjESdli/view?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>The primary challenge we&amp;rsquo;re addressing is the current difficulty in fairly comparing different edge replication systems. To fix this, we&amp;rsquo;re trying to build a testing platform with four key parts. We&amp;rsquo;re collecting real data about how people actually use edge services, creating a tool that can simulate realistic user traffic across many locations, building a system that mimics network delays between hundreds of edge servers, and packaging everything into an open-source toolkit.&lt;/p>
&lt;p>This will let researchers test different coordination methods like EPaxos, Raft, and others using the same data and conditions. We hope this will help provide researchers with a more standardized way to evaluate their systems. We&amp;rsquo;re working with multiple programming languages and focusing on making complex edge computing scenarios accessible to everyone in the research community.&lt;/p>
&lt;p>One of the most interesting aspects of this project is tackling the challenge of creating realistic simulations that accurately reflect the performance characteristics different coordination protocols would exhibit in actual edge deployments. The end goal is to provide the research community with a standardized, reproducible environment for edge replication.&lt;/p></description></item><item><title>Building a Simulator for Benchmarking Replicated Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250613-mchan/</link><pubDate>Sat, 14 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/umass/edge-replication/20250613-mchan/</guid><description>&lt;p>Hi, I&amp;rsquo;m Michael. I&amp;rsquo;m currently contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/">Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fadhil-kurnia/">Fadhil Kurnia&lt;/a>. You can find more details on the project proposal &lt;a href="https://drive.google.com/file/d/1LQCPu1h9vXAbdL6AX_E9S43dsIOndTyW/view?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>We aim to create a system that tests and evaluates the performance of different consensus protocols and consistency models under the same application and workload. Both are tested on various replicated black-box applications: the testbed can deploy any arbitrary stateful application on multiple machines (nodes) as long as it is packaged as a Docker image, and the consensus protocol synchronizes the stateful part of the application (in most cases, the database). The goal is that by the end of this project, the testbed will provide the functionality and abstractions needed to create and test new consensus protocols.&lt;/p>
&lt;p>One major challenge in implementing this is handling replication of the running Docker containers. Generally, the services that can be deployed in this system are of two types:&lt;/p>
&lt;ol>
&lt;li>A deterministic application, which always returns the same output when given the same input (e.g., a simple CRUD app)&lt;/li>
&lt;li>A non-deterministic application, which may return different outputs when given the same input (e.g., an LLM, which may return a different response to the same prompt)&lt;/li>
&lt;/ol>
&lt;p>These two application types require different consensus protocol implementations. In the case of a deterministic application, since every request will always yield the same response (and the same changes inside the application&amp;rsquo;s database), the replication protocol can simply replicate the request to all nodes. In a non-deterministic application, by contrast, the replication protocol synchronizes the state of the database directly, since the same request may produce a different response.&lt;/p></description></item><item><title>Reproducibility of Interactive Notebooks in Distributed Environments</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad/</link><pubDate>Thu, 12 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/06122025-rahmad/</guid><description>&lt;p>Hello! I am Raza, currently a Ph.D. student in Computer Science at DePaul University. This summer, I will be working on reproducibility of notebooks in distributed environments, mentored by Prof. &lt;a href="https://ucsc-ospo.github.io/author/tanu-malik/" target="_blank" rel="noopener">Tanu Malik&lt;/a>. Here is a summary of my project proposal.&lt;/p>
&lt;p>Interactive notebooks are web-based systems which enable encapsulating code, data, and their outputs for sharing and reproducibility. They have gained wide popularity in scientific computing due to their ease of use and portability. However, reproducing notebooks in different target environments remains challenging because notebooks do not carry the computational environment in which they are executed. This becomes even more challenging in distributed cluster environments where a notebook must be prepared to run on multiple nodes. In this project, we plan to (i) extend &lt;a href="https://github.com/radiant-systems-lab/Flinc" target="_blank" rel="noopener">FLINC&lt;/a>, an open-source user-space tool for distributed environments such that it can package notebook executions into notebook containers for execution and sharing across distributed environments, and (ii) integrate the extended Flinc with &lt;a href="https://cctools.readthedocs.io/en/stable/taskvine/" target="_blank" rel="noopener">TaskVine&lt;/a>, which provides the framework and orchestration to enable distributed notebook execution in high performance computing environments.&lt;/p>
&lt;p>&lt;a href="https://docs.google.com/document/d/1ilm-yMEq-UTiJPGMl8tQc3Anl5cKM5RD2sUGInLjLbU" target="_blank" rel="noopener">You can read my complete proposal here.&lt;/a>&lt;/p>
&lt;p>I am excited to work on this project and learn from the experience here!&lt;/p></description></item><item><title>Open Testbed for Reproducible Evaluation of Replicated Systems at the Edges</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/</link><pubDate>Sat, 15 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/umass/edge-replication/</guid><description>&lt;h2 id="project-description">Project Description&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Distributed systems&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Java, Go, Python, Bash scripting, Linux, Docker.&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="mailto:fikurnia@cs.umass.edu">Fadhil I. Kurnia&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Replication is commonly employed to improve system availability and reduce latency. By maintaining multiple copies, the system can continue operating even if some replicas fail, thereby ensuring consistent availability. Placing replicas closer to users further decreases latency by minimizing the distance data must travel. A typical illustration of these advantages is a Content Delivery Network (CDN), where distributing content to edge servers can yield latencies of under 10 milliseconds when users and content are in the same city.&lt;/p>
&lt;p>In recent times, numerous edge datastores have emerged, allowing dynamic data to be served directly from network-edge replicas. Each of these replicated systems may employ different coordination protocols to synchronize replicas, leading to varied performance and consistency characteristics. For instance, Workers KV relies on a push-based coordination mechanism that provides eventual consistency, whereas Cloudflare Durable Objects and Turso deliver stronger consistency guarantees. Additionally, researchers have introduced various coordination protocols—such as SwiftPaxos, EPaxos, OPaxos, WPaxos, Raft, PANDO, and QuePaxa—each exhibiting its own performance profile, especially when used in geo-distributed deployments.&lt;/p>
&lt;p>This project aims to develop an open testbed for evaluating replicated systems and their coordination protocols under edge deployment. Currently, researchers face challenges in fairly comparing different replicated systems, as they often lack control over replica placement. Many previous studies on coordination protocols and replicated systems relied on mock implementations, particularly for well-known systems like Dynamo and Spanner, which are not open source. An open testbed would provide a standardized environment where researchers can compare various replicated systems, classes of coordination protocols, and specific protocol implementations using common benchmarks. Since the performance of replicated systems and coordination protocols varies depending on the application, workload, and replica placement, this testbed would offer a more systematic and fair evaluation framework. Furthermore, by enabling easier testing and validation, the testbed could accelerate the adoption of research prototypes in the industry.&lt;/p>
&lt;h2 id="project-deliverables">Project Deliverables&lt;/h2>
&lt;ul>
&lt;li>Compilation of traces and applications from various open traces and open benchmarks.&lt;/li>
&lt;li>Distributed workload generator to run the traces and applications.&lt;/li>
&lt;li>Test framework to simulate latency of 100s of edge servers for measurement.&lt;/li>
&lt;li>Open artifact of the traces, applications, workload generator, and test framework, published on GitHub.&lt;/li>
&lt;/ul></description></item><item><title>ReproNB: Reproducibility of Interactive Notebook Systems</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/</link><pubDate>Mon, 26 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/depaul/repronb/</guid><description>&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> HPC, MPI, distributed systems&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Difficult&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large; 350 hours&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/tanu-malik/">Tanu Malik&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Notebooks have gained wide popularity in scientific computing. A notebook is both a web-based interactive front-end to program workflows and a lightweight container for sharing code and its output. Reproducing notebooks in different target environments, however, is a challenge. Notebooks do not share the computational environment in which they are executed. Consequently, despite being shareable, they are often not reproducible. We have developed &lt;a href="https://github.com/depaul-dice/Flinc" target="_blank" rel="noopener">FLINC&lt;/a> (see also &lt;a href="https://dice.cs.depaul.edu/pdfs/pubs/C31.pdf" target="_blank" rel="noopener">eScience'22 paper&lt;/a>) to address this problem. However, it currently does not support all forms of experiments, especially those relating to HPC. In this project we will extend FLINC to HPC experiments. This will involve using record-and-replay mechanisms such as &lt;a href="https://kento.github.io/code/" target="_blank" rel="noopener">ReMPI&lt;/a> and &lt;a href="https://rr-project.org/" target="_blank" rel="noopener">rr&lt;/a> within FLINC.&lt;/p>
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;p>The project deliverable will be a set of HPC experiments that are packaged with FLINC and available on Chameleon.&lt;/p>
&lt;strong>Skills:&lt;/strong> Java, Python, bash scripting, perf, Linux internals&lt;br>
&lt;strong>Difficulty:&lt;/strong> Hard&lt;br>
&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;br>
&lt;strong>Mentors:&lt;/strong> &lt;strong>&lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/bogdan-bo-stoica/">Bogdan &amp;quot;Bo&amp;quot; Stoica&lt;/a> (contact person)&lt;/strong>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a>&lt;/p>
&lt;p>&lt;strong>Project Idea Description&lt;/strong>&lt;/p>
&lt;p>Large-scale distributed systems are integral to the infrastructure of a wide range of applications and services.
The continuous evolution of these systems requires ongoing efforts to address inherent faults which span a variety of issues including availability, consistency, concurrency, configuration, durability, error-handling, integrity, performance, and security.
Recent developments in the field and the rise of cloud computing have been marked by a notable increase in the scale at which such systems operate.&lt;/p>
&lt;p>This increase in scale introduces specific challenges, particularly in terms of system reliability and performance.
As distributed systems expand beyond single machines, addressing the growing demands for computation, memory and storage becomes more difficult.
This underlying complexity leads to the emergence of scalability bugs — defects that surface in large-scale deployments, yet do not reveal themselves in a small-scale setting.&lt;/p>
&lt;p>To better understand scalability bugs, we set out to investigate a set of scalability issues documented over the last 5 years from 10 popular open-source large-scale systems.
These bugs have led to significant operational challenges, such as system downtime, reduced responsiveness, data loss, and data corruption.
Moreover, addressing them required extensive collaboration and problem-solving efforts among engineers and bug reporters, with discussions often spanning a month or more.&lt;/p>
&lt;p>We observed that traditional bug finding techniques are insufficient for detecting scalability bugs since these defects are triggered by a mixture of scale-related aspects not properly investigated by previous approaches.
These characteristics include the number of components involved, the system load and workload size, the reliability of recovery protocols, and the magnitude of intermediate failures.
Although previous research examined some of these aspects, it has typically done so either in isolation (individually), or without providing a comprehensive understanding of the fundamental bug patterns, symptoms, root causes, fixes, and, more importantly, how easily these bugs can be reproduced in-house.&lt;/p>
&lt;p>Therefore, the main goal of this project is to systematically understand, characterize, and document the challenges associated with scalability bugs, at-large.
Our approach is twofold: first, to analyze scalability bugs in terms of reproducibility, and second, to develop methodologies for triggering them and measuring their impact.
Specifically, we aim to:&lt;/p>
&lt;ol>
&lt;li>Provide detailed accounts of bug reproduction experiences for a diverse set of recently reported scalability bugs from our benchmark applications;&lt;/li>
&lt;li>Identify specific challenges that prevent engineers from reproducing certain scalability bugs and investigate how prevalent these obstacles are;&lt;/li>
&lt;li>Create a suite of protocols to effectively trigger and quantify the impact of scalability bugs, facilitating their investigation in smaller-scale environments.&lt;/li>
&lt;/ol>
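&lt;p>The triggering-and-quantification protocols of item 3 can be pictured as a scale sweep: run the same workload at increasing scales and flag cost metrics that grow super-linearly. The sketch below is purely illustrative (the function names and the quadratic toy metric are our own, not part of the project); a real protocol would deploy the benchmark system and measure it.&lt;/p>

```python
# Sketch of a scale-sweep protocol for triggering a scalability bug and
# quantifying its impact. All names and the toy metric are illustrative;
# a real protocol would deploy the benchmark system and measure it.

def sweep_scales(run_workload, scales):
    """Run the same workload at each scale and record a cost metric."""
    return {n: run_workload(n) for n in scales}

def flags_superlinear(results, factor=2.0):
    """Flag a potential scalability bug when the cost grows much faster
    than the scale itself (here: more than factor times the scale ratio)."""
    scales = sorted(results)
    for small, large in zip(scales, scales[1:]):
        cost_ratio = results[large] / results[small]
        scale_ratio = large / small
        if cost_ratio > factor * scale_ratio:
            return True
    return False

# Toy stand-in: a cost that grows quadratically with scale, mimicking a
# defect whose symptom only becomes visible in large deployments.
simulated = sweep_scales(lambda n: 0.01 * n * n, [16, 64, 256, 1024])
print(flags_superlinear(simulated))  # → True
```

&lt;p>A sweep whose cost grows only linearly with scale would not be flagged, which is the desired behavior for a healthy system.&lt;/p>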
&lt;p>&lt;strong>Project Deliverable&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>A set of Trovi replayable artifacts enabling other researchers to easily reproduce scalability bugs for our benchmark applications;&lt;/li>
&lt;li>A set of Jupyter notebook scripts that allow each step of our investigation to be conveniently replayed;&lt;/li>
&lt;li>A detailed breakdown of the challenges faced when reproducing scalability bugs and how these obstacles differ from those related to more “traditional” types of bugs.&lt;/li>
&lt;/ul></description></item><item><title>ScaleBugs: Reproducible Scalability Bugs</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</link><pubDate>Tue, 07 Feb 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ucdavis/scalebugs/</guid><description>&lt;p>Scalable systems lay the foundations of the modern information industry. HPC data centers tend to have hundreds to thousands of nodes in their clusters. The use of “extreme-scale” distributed systems has given birth to a new type of bug: scalability bugs. As the name suggests, scalability bugs may manifest depending on the scale of a run, so their symptoms may only be observable in large-scale deployments, not in small or medium ones. For example, &lt;a href="https://issues.apache.org/jira/browse/CASSANDRA-6127" target="_blank" rel="noopener">Cassandra-6127&lt;/a> is a scalability bug detected in the popular distributed database Cassandra. The bug causes unnecessary CPU usage, but the symptom is not observed unless ~1000 nodes are deployed. This illustrates the main challenge of studying scalability bugs: they are extremely difficult to reproduce without deploying the system at large scale.&lt;/p>
&lt;p>In this project, our goal is to build a dataset of &lt;strong>reproducible&lt;/strong> scalability bugs. To achieve this, we will go through existing bug reports for popular distributed systems, including Cassandra, HDFS, Ignite, and Kafka. For each bug report, we determine whether the reported bug depends on the scale of the run, such as the number of nodes used. For each scale-dependent bug collected, we will then craft a workload that reproduces it. Our workloads will exercise specific functionalities of the system under different configurations (e.g., different numbers of nodes), and we will observe the impact on performance. For example, a successful reproduction should show performance degrading as the number of nodes increases.&lt;/p>
&lt;h3 id="building-a-dataset-of-reproducible-scalability-bugs">Building a Dataset of Reproducible Scalability Bugs&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: Scalability systems, bug patterns, reproducibility, bug dataset&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Linux Shell, Docker, Java, Python&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/cindy-rubio-gonzalez/">Cindy Rubio González&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/hao-nan-zhu/">Hao-Nan Zhu&lt;/a>&lt;/li>
&lt;li>&lt;strong>Contributor(s)&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/goodness-ayinmode/">Goodness Ayinmode&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/zahra-nabila-maharani/">Zahra Nabila Maharani&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The student will build a dataset of reproducible scalability bugs. Each bug artifact in the dataset will contain (1) the buggy and fixed versions of the scalability system, (2) a runtime environment that ensures reproducibility, and (3) a workload shell script that demonstrates the symptoms of the bug at different scales.&lt;/p>
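&lt;p>As a concrete (hypothetical) illustration of the three artifact components, each bug could be described by a small manifest like the one below. The field names are our own invention, not a prescribed schema; the bug ID, node count, and CPU symptom come from the Cassandra-6127 example above.&lt;/p>

```python
# One possible manifest layout for a reproducible scalability-bug
# artifact. Field names are illustrative, not a prescribed schema.

artifact = {
    "bug_id": "CASSANDRA-6127",
    "versions": {                  # (1) buggy and fixed system versions
        "buggy": "system source at the commit that exhibits the bug",
        "fixed": "system source at the commit that resolves it",
    },
    "environment": {               # (2) runtime environment for reproducibility
        "container": "Dockerfile pinning OS, JDK, and dependencies",
    },
    "workload": {                  # (3) scale-parameterized workload driver
        "script": "run_workload.sh",
        "scales": [3, 10, 100, 1000],   # node counts to sweep (illustrative)
        "symptom": "unnecessary CPU usage at ~1000 nodes",
    },
}
```

&lt;p>Keeping the environment and workload alongside the two system versions is what lets another researcher rerun the sweep and observe the symptom without rediscovering the setup.&lt;/p>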
&lt;h4 id="specific-tasks">Specific Tasks&lt;/h4>
&lt;ul>
&lt;li>Work with the mentors to understand the context of the project.&lt;/li>
&lt;li>Learn the background of scalability systems.&lt;/li>
&lt;li>Inspect the bug reports from Apache JIRA and identify scale-dependent bugs.&lt;/li>
&lt;li>Craft shell scripts to trigger the exact scalability bug described by the bug report.&lt;/li>
&lt;li>Organize the reproducible scalability bugs and write documentation on how to build the code and trigger each bug.&lt;/li>
&lt;/ul></description></item><item><title>Adaptive Load Balancers for Low-latency Multi-hop Networks</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/adaptiveload/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/adaptiveload/</guid><description>&lt;p>This project aims to design efficient, adaptive link-level load balancers for networks that handle different kinds of traffic, in particular networks where flows are heterogeneous in terms of their round trip times. Geo-distributed data centers are one such example. With the large-scale deployment of 5G in the near future, there will be even more applications, including bulk transfers of videos and photos as well as augmented and virtual reality applications, that take advantage of 5G’s low-latency service. With the development of Web3.0 and the Metaverse, network workloads across data centers are only going to get more varied and challenging. All of this adds up to heavy, bulky data being sent to data centers and over the backbone network, with varying quality-of-service requirements such as low latency, high throughput, and high-definition video streaming.&lt;/p>
&lt;p>Wide area network (WAN) flows are typically data-heavy tasks, such as backups taken for a particular data center. The interaction of data center and WAN traffic creates a very interesting scenario with its own challenges: the two kinds of traffic differ in link utilization and round trip times. Based on our literature review, there appears to be very little work on load balancers that address the interaction of data center and WAN traffic. This motivates the need for load balancers that take both into account in order to achieve high performance in more realistic scenarios.
This work proposes a load balancer that adapts to the kind of traffic it encounters by learning from network conditions and then predicting the optimal route for a given flow.&lt;/p>
&lt;p>Through this research we seek to contribute the following:&lt;/p>
&lt;ul>
&lt;li>Designing a load balancer that is adaptive to data center and WAN traffic and, more generally, to varied traffic conditions&lt;/li>
&lt;li>Learning the network setup in real time and predicting optimal paths&lt;/li>
&lt;li>Delivering low latency, high throughput, and increased network utilization&lt;/li>
&lt;/ul>
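&lt;p>A minimal sketch of the “learn from network conditions, then predict the route” idea, assuming per-path RTT samples are available: keep an exponentially weighted moving average per candidate path and route each new flow over the current best estimate. The class and its API are our own illustration, not the project’s actual design, which targets mixed data center and WAN traffic with far richer features.&lt;/p>

```python
# Minimal sketch of an adaptive route predictor: smooth per-path RTT
# samples with an EWMA and route each new flow over the best estimate.
# Illustrative only; the real project learns from many more signals.

class AdaptiveBalancer:
    def __init__(self, paths, alpha=0.2):
        self.alpha = alpha                    # EWMA smoothing weight
        self.rtt = {p: None for p in paths}   # smoothed RTT per path (ms)

    def observe(self, path, rtt_ms):
        """Fold a new RTT sample into the path's running estimate."""
        old = self.rtt[path]
        self.rtt[path] = rtt_ms if old is None else (
            self.alpha * rtt_ms + (1 - self.alpha) * old)

    def pick(self):
        """Route the next flow over the path with the lowest estimate;
        unmeasured paths are tried first so every path gets explored."""
        unseen = [p for p, r in self.rtt.items() if r is None]
        if unseen:
            return unseen[0]
        return min(self.rtt, key=self.rtt.get)

lb = AdaptiveBalancer(["dc-path", "wan-path"])
lb.observe("dc-path", 0.5)    # intra-data-center flow: sub-millisecond RTT
lb.observe("wan-path", 40.0)  # WAN flow: tens of milliseconds
print(lb.pick())              # → dc-path
```

&lt;p>The EWMA makes the estimate track changing conditions while damping noise; a learning-based balancer would replace this simple predictor with a model trained on past traffic samples.&lt;/p>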
&lt;h3 id="adaptive-dynamic-load-balancing-for-data-center-and-wan-traffic">Adaptive, Dynamic Load Balancing for data center and WAN traffic&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> data center networking, TCP/IP stack, congestion control, load balancing&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> C++, Python, Linux; experience with network simulators would be helpful&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate/Challenging&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:katia@soe.ucsc.edu">Katia Obraczka&lt;/a>, &lt;a href="mailto:akabbani@gmail.com">Abdul Kabbani&lt;/a>, &lt;a href="mailto:lakrishn@ucsc.edu">Lakshmi Krishnaswamy&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Understanding the OMNeT++ network simulator and creating simple networks and data center topologies to understand the simulation environment.&lt;/li>
&lt;li>Implementing existing load balancers on OMNeT++ and exploring the effect of different features of the load balancers with data center traffic and WAN traffic.&lt;/li>
&lt;li>Finding and testing WAN-specific traffic, such as video streaming and large database queries.&lt;/li>
&lt;li>Working with the mentors to develop a learning-based load balancer framework that learns from past traffic samples and network conditions to adapt dynamically to current conditions.&lt;/li>
&lt;/ul></description></item><item><title>CephFS</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/cephfs/</link><pubDate>Mon, 07 Nov 2022 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre22/ucsc/cephfs/</guid><description>&lt;p>&lt;a href="https://docs.ceph.com/en/latest/cephfs/" target="_blank" rel="noopener">CephFS&lt;/a> is a distributed file system on top of &lt;a href="https://ceph.io" target="_blank" rel="noopener">Ceph&lt;/a>. It is implemented as a distributed metadata service (MDS) that uses dynamic subtree balancing to trade parallelism for locality under continually changing workloads. Clients that mount a CephFS file system connect to the MDS and acquire capabilities as they traverse the file namespace. Capabilities not only convey metadata but can also implement strong consistency semantics by granting and revoking the ability of clients to cache data locally.&lt;/p>
&lt;h3 id="cephfs-namespace-traversal-offloading">CephFS namespace traversal offloading&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Ceph&lt;/code>, &lt;code>filesystems&lt;/code>, &lt;code>metadata&lt;/code>, &lt;code>programmable storage&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: C++, Ceph / MDS&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="mailto:carlosm@ucsc.edu">Carlos Maltzahn&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The frequency of metadata service (MDS) requests relative to the amount of data accessed can severely affect the performance of distributed file systems like CephFS, especially for workloads that randomly access a large number of small files as is commonly the case for machine learning workloads: they purposefully randomize access for training and evaluation to prevent overfitting. The datasets of these workloads are read-only and therefore do not require strong coherence mechanisms that metadata services provide by default.&lt;/p>
&lt;p>The key idea of this project is to reduce the frequency of MDS requests by offloading namespace traversal, i.e. the need to open a directory, list its entries, open each subdirectory, etc. Each of these operations usually requires a separate MDS request. Offloading namespace traversal refers to a client’s ability to request the metadata (and associated read-only capabilities) of an entire subtree with one request, thereby offloading the traversal work for tree discovery to the MDS.&lt;/p>
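&lt;p>A back-of-the-envelope illustration of the savings: modeling a directory tree as a nested dictionary, classic traversal costs one MDS round trip per directory, while an offloaded request fetches the whole subtree in one. The request-counting model below is a deliberate simplification, not CephFS’s actual protocol.&lt;/p>

```python
# Toy model of MDS round trips: per-directory traversal vs. a single
# offloaded subtree request. A nested dict stands in for a directory
# tree; files are leaves (None). Simplified, not the real protocol.

def mds_requests_per_dir(tree):
    """Classic traversal: one request to open/list each directory."""
    count = 1  # this directory
    for child in tree.values():
        if isinstance(child, dict):       # subdirectory: recurse
            count += mds_requests_per_dir(child)
    return count

def mds_requests_offloaded(tree):
    """Offloaded traversal: one request returns the whole subtree."""
    return 1

# A small read-only ML dataset: three class directories, files as leaves.
dataset = {
    "cats": {"a.jpg": None, "b.jpg": None},
    "dogs": {"c.jpg": None},
    "birds": {"d.jpg": None, "e.jpg": None},
}
print(mds_requests_per_dir(dataset))    # → 4 (root + three subdirectories)
print(mds_requests_offloaded(dataset))  # → 1
```

&lt;p>For a real training dataset with thousands of directories, the per-directory cost scales with the directory count while the offloaded cost stays constant, which is exactly why the random small-file access pattern of machine learning workloads benefits.&lt;/p>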
&lt;p>Once the basic functionality is implemented, this project can be expanded to address optimization opportunities, e.g., describing regular tree structures as a closed-form expression in the tree’s root, shortcutting tree discovery.&lt;/p></description></item></channel></rss>