performance | UCSC OSPO

eBPF Monitoring Tools

Tue, 21 Feb 2023 00:00:00 +0000

eBPF is a technology that allows sandboxed programs to run in a privileged context such as the Linux kernel. eBPF is to operating systems what JavaScript is to web browsers: new functionality can be loaded safely and executed efficiently without restarting or continually upgrading the operating system or browser. eBPF is used to introduce new functionality into a running Linux kernel, including next-generation networking, observability, and security functionality. The following is just one of many possible ideas.

Implement Darshan functionality as an eBPF tool

Topics: performance, I/O, workload characterization
Difficulty: Medium
Size: Medium or large (175 or 350 hours)
Mentors: Tyler Reddy

Darshan is an HPC I/O characterization tool that collects statistics using a lightweight design, which makes it suitable for full-time deployment. Darshan is an interposer library that catches and counts I/O requests (open, write, read, etc.) to a file or file system and keeps the counters in buckets in a data structure that can be queried. For example, it counts how many reads were of small, medium, or large size.
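The bucketed counters described above can be pictured with a minimal Python sketch. It is an illustration only, not Darshan's implementation: the class name `IOCounters`, the bucket edges, and the bucket labels are all assumptions chosen for the example, and Darshan's real counters may be organized differently.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in bytes; Darshan's real bucket edges may differ.
BUCKET_EDGES = [100, 1024, 10 * 1024, 100 * 1024, 1024 * 1024]
BUCKET_LABELS = ["0-100", "101-1K", "1K-10K", "10K-100K", "100K-1M", "1M+"]


class IOCounters:
    """Counts I/O calls per file, split by operation and by request size."""

    def __init__(self):
        self.ops = Counter()    # (path, op) -> number of calls
        self.sizes = Counter()  # (path, op, bucket label) -> number of calls

    def record(self, path, op, nbytes=None):
        """Records one call; nbytes is given only for sized calls such as read/write."""
        self.ops[(path, op)] += 1
        if nbytes is not None:
            # bisect_left places a size equal to an edge in the lower bucket.
            label = BUCKET_LABELS[bisect_left(BUCKET_EDGES, nbytes)]
            self.sizes[(path, op, label)] += 1

    def histogram(self, path, op):
        """Returns the per-bucket call counts for one file and operation."""
        return {label: self.sizes[(path, op, label)] for label in BUCKET_LABELS}


counters = IOCounters()
counters.record("/data/out.bin", "open")
counters.record("/data/out.bin", "read", 64)
counters.record("/data/out.bin", "read", 4096)
counters.record("/data/out.bin", "read", 4096)
print(counters.histogram("/data/out.bin", "read"))
```

In an eBPF version, the `record` step would presumably run inside the kernel on each intercepted call and the histogram would live in an eBPF map, but that design is left open by the project description.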

Because Darshan is an interposer library, users must link their applications with it. Implementing the same functionality in eBPF would make it transparent to users. Darshan already defines the full set of functions, so it could provide the list of functions to implement, and the programmer could build and test them in eBPF on a Linux machine. The result could be a broadly available open tool that would be generally useful, and just one of perhaps hundreds of examples of eBPF-based tools that could be available in the open community for all to leverage.

GPU Emulator for Easy Reproducibility of DNN Training

Sun, 05 Feb 2023 00:00:00 +0000

Deep Neural Networks (DNNs) have achieved success in many machine learning (ML) tasks, including image recognition, video classification, and natural language processing. Nonetheless, training DNN models is highly computation intensive and usually requires running complex computations on GPUs, which are an expensive and scarce resource. As a result, much research on DNN training is delayed by the lack of access to GPUs. However, many research prototypes don’t require GPUs themselves, only the performance profiles of GPUs. For example, research on DNN training storage systems doesn’t need to run real computations on GPUs; it only needs to know how much time each GPU computation will take. Meanwhile, GPU performance in DNN training is predictable and reproducible, as every batch of training performs a deterministic sequence of mathematical operations on a fixed amount of data.

Therefore, in this project we seek to build a GPU emulator platform on PyTorch to easily reproduce DNN training without using real GPUs. We will measure the performance profiles of GPU computations for different models, GPU types, and batch sizes. Based on the measured GPU performance profiles, we will build a platform to emulate GPU behavior and reproduce DNN training using CPUs only. We will make the platform and the measurements open-source, allowing other researchers to reproduce the performance measurements and easily conduct research on DNN training systems. We will also encourage the community to enrich the database by adding GPU performance measurements for their own models and GPU types. We will be the first to build and release this kind of GPU emulator for DNN training, and we believe researchers and the community can benefit greatly from it, especially as more GPU performance profiles are added by the community.
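The core idea, replacing GPU computation with a wait of the profiled duration, can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the project's design: the names `PROFILES`, `EmulatedGPU`, and `run_epoch` are hypothetical, the timings are placeholders rather than measurements, and a real platform would hook into PyTorch instead of a hand-written loop.

```python
import time

# Hypothetical profile table: (model, gpu_type, batch_size) -> seconds per training step.
# The numbers are placeholders for illustration, not measurements.
PROFILES = {
    ("resnet50", "gpu-a", 32): 0.010,
    ("resnet50", "gpu-a", 64): 0.018,
}


class EmulatedGPU:
    """Stands in for a GPU by waiting for the profiled duration of each step."""

    def __init__(self, model, gpu_type, profiles=PROFILES, sleep=time.sleep):
        self.model = model
        self.gpu_type = gpu_type
        self.profiles = profiles
        self.sleep = sleep  # injectable so tests can skip real waiting
        self.emulated_seconds = 0.0

    def train_step(self, batch_size):
        key = (self.model, self.gpu_type, batch_size)
        if key not in self.profiles:
            raise KeyError(f"no profile measured for {key}")
        duration = self.profiles[key]
        self.sleep(duration)  # no computation happens; only time passes
        self.emulated_seconds += duration
        return duration


def run_epoch(gpu, num_batches, batch_size, load_batch):
    """Runs one emulated epoch; load_batch is the real (non-GPU) work under study."""
    for i in range(num_batches):
        load_batch(i)
        gpu.train_step(batch_size)
    return gpu.emulated_seconds
```

A storage researcher could pass their own data-loading function as `load_batch` and observe how it interacts with the emulated compute time, for example `run_epoch(EmulatedGPU("resnet50", "gpu-a"), 100, 32, my_loader)`, where `my_loader` is whatever loading code is being studied.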

Building a platform to emulate GPU performance in DNN training

Topics: DNN training, reproducibility, GPU emulator, performance measurement
Skills: Linux, Python, PyTorch, deep learning
Difficulty: Medium
Size: 350 hours
Mentor(s): Vijay Chidambaram, Yeonju Ro
Contributor(s): Haoran Wu

The student will measure GPU performance profiles for different models and GPU types, and on that basis build a platform that emulates GPU behavior and makes DNN training easy to reproduce. The GPU performance measurements should be open-source and reproducible, so that other researchers can reproduce the results and add GPU profiles for their own needs.

Specific tasks:

Work with mentors on understanding the context of the project.
Study and get familiar with the PyTorch DNN training pipelines.
Measure GPU performance profiles for different DNN models and GPU types.
Based on the GPU performance measurements, build a platform to emulate GPU behavior and reproduce DNN training without using real GPUs.
Organize and document the code to make it reproducible for the community.

FlashNet: Towards Reproducible Data Science for Storage System

Thu, 02 Feb 2023 00:00:00 +0000

The Data Storage Research Vision 2025, organized in an NSF workshop, calls for more “AI for storage” research. However, performing ML-for-storage research can be a daunting task for new storage researchers. The person must know both the storage side and the ML side, as if studying two different fields at the same time. This project aims to answer these questions:

How can we encourage data scientists to look into storage problems?
How can we create a transparent platform that allows such decoupling?
Within the storage/ML community, can we create two collaborative communities, the storage engineers and the storage data scientists?

In the ML/deep learning community, the large ImageNet benchmarks have spurred research in image recognition. Similarly, we would like to provide benchmarks that foster storage research in ML-based per-I/O latency prediction. Therefore, we present FlashNet, a reproducible data science platform for storage systems. As a first step toward this larger goal, we use I/O latency prediction as a case study, so FlashNet has been built for I/O latency prediction tasks. With FlashNet, data engineers can collect the I/O traces of various devices. Data scientists can then train ML models to predict I/O latency based on those traces. All traces, results, and code will be shared on the FlashNet training ground platform, which uses Chameleon Trovi for better reproducibility.
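To make the hand-off from traces to models concrete, the following Python sketch turns a list of trace records into features and binary fast/slow labels that a classifier could train on. It is an illustration under assumptions, not FlashNet's pipeline: the `IORecord` fields, the function names, and the choice of a 90th-percentile cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IORecord:
    # Hypothetical trace fields; a real trace format may carry different columns.
    size_bytes: int
    queue_depth: int
    latency_us: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling division
    return ordered[rank - 1]


def label_trace(records, slow_pct=90):
    """Returns (features, labels); label 1 marks an I/O slower than the cutoff."""
    cutoff = percentile([r.latency_us for r in records], slow_pct)
    features = [(r.size_bytes, r.queue_depth) for r in records]
    labels = [1 if r.latency_us > cutoff else 0 for r in records]
    return features, labels


trace = [IORecord(4096, depth, float(lat)) for depth, lat in enumerate(range(1, 11))]
features, labels = label_trace(trace)
print(sum(labels), "of", len(labels), "I/Os labeled slow")
```

The features and labels could then be fed to any classifier. Which features and cutoff actually work well is exactly the kind of question the platform is meant to let data scientists explore.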

In this project, we plan to improve the modularity of the FlashNet pipeline and develop the Chameleon Trovi packages. We will also continue to improve the performance of our binary-class and multiclass classifiers and test them on the new production traces that we collected from the SNIA IOTTA public trace repository. Finally, we will optimize the deployment of our continual-learning mechanism and test it in a cloud system environment. To the best of our knowledge, we are building the world's first end-to-end data science platform for storage systems.

Building FlashNet Platform

Topics: Storage systems, reproducibility, machine learning, continual learning
Skills: C++, Python, PyTorch, experience with machine learning pipelines
Difficulty: Medium
Size: Large (350 hours)
Mentors: Haryadi S. Gunawi
Contributor(s): Justin Shin, Maharani Ayu Putri Irawan

Build an open-source platform to enable collaboration between the storage and ML communities, specifically to provide a common platform for advancing data science research for storage systems. The platform will be able to reproduce and evaluate different ML models and architectures, dataset patterns, data preprocessing techniques, and feature engineering strategies.

Specific tasks:

Work with mentors on understanding the context of the project.
Reproduce the FlashNet evaluation results from prior works.
Build and improve FlashNet components based on the existing blueprint.
Collect and analyze the FlashNet evaluation results.

Efficient Communication with Key/Value Storage Devices

Sun, 27 Feb 2022 00:00:00 +0000

Network key-value (KV) stores are used throughout the cloud as storage backends (e.g., AWS ShardStore) and are showing up in devices (e.g., NVMe KV SSDs). KV clients use traditional network sockets and POSIX APIs to communicate with the KV store. An advancement of the last two years is a new kernel interface, io_uring, that can be used in lieu of the POSIX API. This interface uses a set of shared memory queues for communication between the kernel and user space and permits zero-copy transfer of data. This scheme avoids much of the system call overhead and can improve performance.
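The queue-based scheme can be pictured with a toy model in Python: one ring carries requests from the application to the kernel, and a second ring carries completions back. This sketch is not the io_uring API and performs no real I/O. The names `Ring` and `ToyKernel`, the request tuple layout, and the in-memory dictionary standing in for the KV store are all assumptions for illustration; a real backend would use the kernel interface, typically through a C library such as liburing.

```python
class Ring:
    """Fixed-size single-producer/single-consumer ring with head and tail indices."""

    def __init__(self, entries):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a power of two")
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced by the consumer
        self.tail = 0  # advanced by the producer

    def full(self):
        return self.tail - self.head == len(self.slots)

    def push(self, item):
        if self.full():
            return False
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


class ToyKernel:
    """Plays the kernel's role: drains the submission ring, posts completions."""

    def __init__(self, sq, cq, store):
        self.sq, self.cq, self.store = sq, cq, store

    def process(self):
        # Stop early if the completion ring has no room, so no result is lost.
        while not self.cq.full():
            req = self.sq.pop()
            if req is None:
                break
            user_data, op, key, value = req
            if op == "put":
                self.store[key] = value
                result = 0
            else:  # treat anything else as "get"
                result = self.store.get(key)
            self.cq.push((user_data, result))


sq, cq = Ring(4), Ring(8)
kernel = ToyKernel(sq, cq, store={})
sq.push((1, "put", "color", "blue"))  # several requests are queued...
sq.push((2, "get", "color", None))
kernel.process()                      # ...and handled in one batch
print(cq.pop(), cq.pop())
```

The point of the model is the batching: the application queues several requests and collects several completions without one call per operation, which is where the reduction in system call overhead would come from.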

Implement `io_uring` communication backend

Topics: performance, I/O, network, key-value, storage
Difficulty: Medium
Size: Medium or large (120 or 150 hours)
Mentors: Philip Kufeldt (Seagate), Aldrin Montana (UC Santa Cruz)
Contributor(s): Manank Patel

Seagate has been using a network-based KV HDD as a research vehicle for computational storage. This research vehicle uses an open-source user library that implements a KV API by sending protobuf-based RPCs over the network to a KV store. The library currently uses the standard socket and POSIX APIs to communicate with the KV backend. This project would implement an io_uring communication backend and compare the results of both implementations.