operating systems | UCSC OSPO

IO logger: IO tracing in the modern computing era

Fri, 28 Feb 2025 00:00:00 +0000

Overview

Storage systems are critical components of modern computing infrastructures, and understanding their performance characteristics is essential for optimizing system efficiency. Many IO tracing studies date from twenty to thirty years ago, but the landscape has changed significantly with the advent of:

cloud computing, virtualization, and storage disaggregation on the server side;
ubiquitous fast wireless networking for end users, which makes remote storage feasible;
AI and ML workloads that generate and move massive amounts of data both in the cloud and on the edge.

In this project, we aim to develop an IO logger, a tool for tracing, logging and analyzing IO operations in various computing environments. The IO logger will capture detailed information about read and write operations, latency, throughput, and other metrics to help researchers and practitioners understand the behavior of storage systems under different workloads and configurations. By providing a comprehensive view of IO performance, the IO logger will enable users to identify bottlenecks, optimize resource utilization, and improve system efficiency.

This project will have two phases:

IO logger for *NIX systems: Develop a tool leveraging eBPF and other tools for tracing IO operations on Linux and other Unix-like systems. The tool will capture detailed information about disk reads and writes, network transfers, and other IO activities, providing insights into system performance. The tool will be open-sourced, and we will work with industry partners and testbeds to integrate it into existing monitoring and analysis tools. Moreover, we will collect and open source the IO traces to benefit the community.
IO logger for personal computing environments: Develop a tool for end users to trace IO operations on their personal devices, such as laptops, desktops, and mobile phones. We will design and implement tools for three platforms: Windows, macOS, and Android. We will use the tools to collect IO traces from volunteers and real-world applications, providing insights into storage usage, network activity, and application performance. The tool will be user-friendly, lightweight, and privacy-preserving, so that users can monitor their IO activities without compromising their data security.

Notable differences and challenges compared to existing work are:

More IO requests with rich features: open-source traces from previous work were all collected below the page cache, so they are often write-heavy, miss most IO requests, and lack features such as the process name. To address this, we will build a tool that also records requests served by the page cache, which requires the tool to be efficient and to impose no significant overhead on the running system.
Focus on new applications and workloads: most existing work dates from the 1990s, when the Internet was not yet widely used and applications mostly processed local data without communicating with the outside world. A few studies looked into mobile storage a decade ago, but the landscape has changed significantly since then, especially with the advent of AI and ML workloads that generate and move massive amounts of data both in the cloud and on the edge. This project will examine the differences and challenges brought by these new applications and workloads.
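
To make the first point concrete, here is a minimal Python sketch of what a richer trace record might look like and why a trace collected below the page cache looks different. The record layout, field names, and helper functions are all hypothetical and are not a format the project has defined. Treating a request as either a cache hit or a miss is also a simplification of how buffered writes actually reach the disk.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IOEvent:
    """One traced request; field names are illustrative, not a fixed trace format."""
    timestamp_ns: int
    process: str       # process name, a feature the text says older traces lack
    op: str            # "read" or "write"
    size_bytes: int
    cache_hit: bool    # True if the request was served by the page cache


def summarize_by_process(events):
    """Return {process: {"requests", "bytes", "cache_hits"}} for a list of events."""
    summary = defaultdict(lambda: {"requests": 0, "bytes": 0, "cache_hits": 0})
    for ev in events:
        entry = summary[ev.process]
        entry["requests"] += 1
        entry["bytes"] += ev.size_bytes
        entry["cache_hits"] += int(ev.cache_hit)
    return dict(summary)


def visible_below_cache(events):
    """Events that a trace collected below the page cache would still contain."""
    return [ev for ev in events if not ev.cache_hit]


# Made-up sample data: three reads (two absorbed by the cache) and one write.
sample = [
    IOEvent(1_000, "python", "read", 4096, True),
    IOEvent(2_000, "python", "read", 4096, True),
    IOEvent(3_000, "python", "read", 4096, False),
    IOEvent(4_000, "sqlite", "write", 8192, False),
]
per_process = summarize_by_process(sample)
below_cache = visible_below_cache(sample)
```

In this made-up sample the full trace has three reads and one write, while the below-cache view keeps only one read and one write. This is the write-heavy skew and request loss described above.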

Topics: tracing tool, operating system, eBPF, performance evaluation
Skills: C programming, system programming, eBPF, Linux kernel, mobile application development
Difficulty: Hard
Size: Large (350 hours).
Mentors: Juncheng Yang

Assessing and Enhancing CC-Snapshot for Reproducible Experiment Environments

Tue, 18 Feb 2025 00:00:00 +0000

Overview

A critical challenge in computer systems research reproducibility is establishing and sharing experimental environments. While open testbeds like Chameleon provide access to hardware resources, researchers still face significant barriers when attempting to recreate the precise software configurations, dependencies, and system states needed for reproducible experiments. Environment snapshotting tools offer a solution, but face technical challenges in consistently capturing running systems without introducing distortions or requiring disruptive system modifications. This project addresses these fundamental reproducibility barriers by enhancing CC-Snapshot, a tool that captures the experimental environment configured by the user on bare metal images, to create more reliable and consistent system captures that can be shared and redeployed without loss of fidelity.

CC-Snapshot is a tool on the Chameleon testbed that enables users to package their customized environments as complex images or appliances. By allowing researchers to share these environments easily, CC-Snapshot offers a powerful mechanism for reproducibility, ensuring that experiments can be replicated and extended by others.

In this project, you will review existing CC-Snapshot workflows, research the latest snapshotting technologies, and develop enhancements that improve the tool’s usability and reliability. This includes ensuring snapshots are created consistently (even when the OS is actively running), preserving the integrity of user systems, and exploring advanced features such as out-of-band snapshotting and API-based triggers.

Key Outcomes

Improved Snapshot Consistency: New methods to capture the full state of a disk without risking corruption or data inconsistency.
Enhanced Reproducibility: A refined workflow that allows researchers to reliably share custom environments, facilitating collaborative and repeatable experiments.
User-Friendly Tooling: Streamlined processes that reduce disruption to running systems—so installing dependencies or rebooting into special environments is less burdensome.
Exploratory Features (Stretch Goals): Advanced mechanisms to stream disk data in real time during snapshotting and to initiate snapshots via an API call (for parity with VM snapshots).

Topics: Cloud Computing, Systems & Infrastructure, Reproducibility, Operating System Internals

Skills: Linux / OS Concepts, Cloud Tools, Systems Programming / Scripting, DevOps / CI

Difficulty: Moderate

Size: Medium

Mentors: Michael Sherman, Mark Powers

Tasks:

Ensure Snapshot Consistency
- Reboot into a ramdisk and copy the offline disk.
- Use kexec to switch to/from a ramdisk environment without a full reboot.
- Change images to use a snapshot-capable filesystem (e.g., LVM) for safer live snapshots.
- Investigate additional methods (e.g., blog.benjojo.co.uk) for safely imaging live disks.
Prevent System Modifications During Snapshot
- Currently, CC-Snapshot installs dependencies (e.g., qemu-img) on the running system, affecting its state.
- In-Band Fix: Download and run tools in a temp directory with static linking, avoiding system-level changes.
- Out-of-Band Approach: Snapshots done via ramdisk or kexec do not require altering the running system.
API-Triggered Snapshots
- Extend or integrate with the Nova “snapshot instance” API to support the same workflow for bare metal.
- Leverage Ironic’s new “service steps” feature for an automated snapshot pipeline.
(Stretch Goal) Streaming Snapshots
- Modify the workflow to stream data directly to storage, rather than making a full local copy first.
- Explore incremental or differential snapshot techniques to reduce bandwidth usage and storage overhead.
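
As a rough illustration of the incremental idea in the stretch goal, the Python sketch below compares fixed-size chunks of a disk image against the digests recorded for a previous snapshot and yields only the chunks that changed. It is not how CC-Snapshot works today. The chunk size, the function names, and the choice of SHA-256 are assumptions made for the example, and a shrinking image is not handled.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; a real tool would use much larger blocks


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return one SHA-256 hex digest per fixed-size chunk of data."""
    return [
        hashlib.sha256(data[offset:offset + chunk_size]).hexdigest()
        for offset in range(0, len(data), chunk_size)
    ]


def changed_chunks(old_digests, new_data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (index, chunk) for chunks that are new or differ from the old snapshot.

    Being a generator, it hands chunks to the caller one at a time, so they
    could be streamed to storage instead of being collected locally first.
    """
    for index, offset in enumerate(range(0, len(new_data), chunk_size)):
        chunk = new_data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if index >= len(old_digests) or old_digests[index] != digest:
            yield index, chunk


# Stand-ins for disk contents at two points in time.
base = b"AAAABBBBCCCC"
current = b"AAAAXXXXCCCCDDDD"
delta = list(changed_chunks(chunk_digests(base), current))
```

Here only the modified second chunk and the newly appended fourth chunk end up in `delta`, so the unchanged data would not need to be transferred again.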

eBPF Monitoring Tools

Tue, 21 Feb 2023 00:00:00 +0000

eBPF is a technology that allows sandboxed programs to run in a privileged context such as the Linux kernel. eBPF is for operating systems what JavaScript is for web browsers: new functionality can be safely loaded and executed efficiently without restarting or continually upgrading the operating system or browser. eBPF is used to introduce new functionality into a running Linux kernel, including next-generation networking, observability, and security functionality. The following is just one of many possible ideas.

Implement Darshan functionality as eBPF tool

Topics: performance, I/O, workload characterization
Difficulty: Medium
Size: Medium or large (175 or 350 hours)
Mentors: Tyler Reddy

Darshan is an HPC I/O characterization tool that collects statistics using a lightweight design that makes it suitable for full-time deployment. Darshan is an interposer library that catches and counts IO requests (open, write, read, etc.) to a file or file system and keeps the counters in buckets in a data structure that can be queried. For example, it counts how many reads were of small, medium, or large size.
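
The bucketed counting described above can be sketched in a few lines of Python. This is only an illustration of the idea: the bucket boundaries, labels, and the `IOCounters` class are hypothetical and are not taken from Darshan's actual implementation.

```python
import bisect
from collections import Counter

# Hypothetical bucket boundaries in bytes; each value starts the next bucket.
BUCKET_EDGES = [100, 1024, 10 * 1024, 100 * 1024, 1024 * 1024]
BUCKET_LABELS = ["0-100", "100-1K", "1K-10K", "10K-100K", "100K-1M", "1M+"]


class IOCounters:
    """Counts operations and keeps a per-operation histogram of request sizes."""

    def __init__(self):
        self.ops = Counter()        # e.g. {"read": 4, "open": 1}
        self.size_hist = Counter()  # e.g. {("read", "1K-10K"): 2}

    def record(self, op, size=None):
        """Count one operation; sized operations also go into a size bucket."""
        self.ops[op] += 1
        if size is not None:
            label = BUCKET_LABELS[bisect.bisect_right(BUCKET_EDGES, size)]
            self.size_hist[(op, label)] += 1


# Made-up request stream standing in for intercepted calls.
counters = IOCounters()
counters.record("open")
for size in (50, 4096, 4096, 2 * 1024 * 1024):
    counters.record("read", size)
counters.record("write", 512)
```

Querying `counters.size_hist[("read", "1K-10K")]` then answers a question such as "how many medium-sized reads were issued". An eBPF version would presumably keep comparable counters in eBPF maps updated from kernel-side probes, but that design is an assumption here.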

Because Darshan is an interposer library, users must link their application with it. Implementing the same functionality in eBPF would make it transparent to users. Darshan already defines the set of functions to implement, and the programmer could build and test them in eBPF on a Linux machine. The result could be a broadly available, generally useful open tool, and just one of perhaps hundreds of examples of eBPF-based tools that the open community could leverage.
