<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>SoR'23 | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/sor23/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/category/sor23/index.xml" rel="self" type="application/rss+xml"/><description>SoR'23</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Wed, 25 Oct 2023 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>SoR'23</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/category/sor23/</link></image><item><title>Final Blog: Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20231030-ren.450/</link><pubDate>Wed, 25 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20231030-ren.450/</guid><description>&lt;p>In my final blog, I will first introduce the project, then describe the achievements since the midterm and summarize our experiments. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao Yu&lt;/a>, aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>In my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230802-ren.450/">midterm blog&lt;/a>, I took three parameters in the PostgreSQL config to test the performance of the TPC-C benchmark and obtained some initial results on how each parameter separately affects throughput. After the midterm, I continued the experiments with four parameters (shared_buffers, min_wal_size, max_wal_size and effective_cache_size), testing more values and combining the parameters to measure their joint effect on performance. These parameters relate to memory consumption, checkpoints and planner cost in the database server. You can refer to my previous blog for details.&lt;/p>
&lt;p>For the experiment, we continue to measure the throughput of the benchmark with the scale factor set to 10 while incrementing the number of worker terminals. All database server settings are left at their defaults except the four parameters we tune. For shared_buffers, we choose six values, from the default 128MB up to 8GB. For each shared_buffers setting, effective_cache_size takes three values, from the default 4GB up to 16GB. Finally, for each effective_cache_size setting we tune min_wal_size and max_wal_size as a tuple: min_wal_size has two values and max_wal_size has four, six values in total across the two parameters. We run three rounds for each setting and report the average of the three throughput numbers.&lt;/p>
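&lt;p>To make the sweep concrete, here is a minimal Python sketch of the experiment loop. The value grids and the run_tpcc helper are placeholders for illustration (the exact values tested are in the linked results spreadsheet), not our actual harness.&lt;/p>

```python
import itertools
import statistics

# Illustrative value grids; the exact values tested are recorded in the
# linked results spreadsheet.
SHARED_BUFFERS = ["128MB", "256MB", "512MB", "1GB", "4GB", "8GB"]  # 6 values
EFFECTIVE_CACHE_SIZE = ["4GB", "8GB", "16GB"]                      # 3 values
MIN_WAL_SIZE = ["80MB", "1GB"]                                     # 2 values
MAX_WAL_SIZE = ["1GB", "2GB", "4GB", "8GB"]                        # 4 values

def run_tpcc(config):
    """Placeholder for one TPC-C round: the real harness writes `config`
    into postgresql.conf, restarts the server, runs the benchmark, and
    returns the measured throughput."""
    return 1000.0  # stub value

def sweep(rounds=3):
    """Run every combination of settings, averaging `rounds` rounds each."""
    results = {}
    for sb, ecs, mn, mx in itertools.product(
            SHARED_BUFFERS, EFFECTIVE_CACHE_SIZE, MIN_WAL_SIZE, MAX_WAL_SIZE):
        config = {"shared_buffers": sb, "effective_cache_size": ecs,
                  "min_wal_size": mn, "max_wal_size": mx}
        results[(sb, ecs, mn, mx)] = statistics.mean(
            run_tpcc(config) for _ in range(rounds))
    return results

grid = sweep()
```

&lt;p>With real measurements in place of the stub, grid maps each 4-tuple of settings to its average throughput.&lt;/p>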
&lt;p>Based on the &lt;a href="https://docs.google.com/spreadsheets/d/12OeSwZGq2G4-YGY5BTH5uZbVcAaxcZqYhqciCaBiF2E/edit?usp=sharing" target="_blank" rel="noopener">results&lt;/a>, the observations are consistent with the conclusions from the midterm blog. The throughput of the benchmark can be improved by tuning shared_buffers and max_wal_size, whereas effective_cache_size and min_wal_size have no obvious effect for this benchmark. The improvement is limited once shared_buffers and max_wal_size exceed a certain value.&lt;/p>
&lt;p>In our experiment, we chose only four parameters for one benchmark. The experiments are expensive in terms of time, and there are more values of the above-mentioned parameters left to test. This experiment also suggests that sampling a subset of settings may be enough to generate observations that match those from a full, extensive artifact evaluation.&lt;/p></description></item><item><title>Public Artifact and Data Visualization: A Journey to Empower</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/</link><pubDate>Tue, 24 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/</guid><description>&lt;p>Hello friends!&lt;/p>
&lt;p>As we draw the curtains on our project titled &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/intel/artifactviz">Public Artifact and Data Visualization&lt;/a>, we&amp;rsquo;re thrilled to present the incredible advancements we&amp;rsquo;ve achieved since our mid-term update. Our mission has been to foster a deeper understanding of data and empower users to make informed decisions. Let&amp;rsquo;s delve into the remarkable evolution of our project.&lt;/p>
&lt;h2 id="unveiling-new-functionalities">Unveiling New Functionalities&lt;/h2>
&lt;ol>
&lt;li>Modular Architecture: Your Way, Your Choice&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>At the core of our project is a modular architecture designed to cater to your unique preferences. We firmly believe that choice empowers users. Thus, we&amp;rsquo;ve given you the option to select between a Graphical User Interface (GUI) and a Command-Line Interface (CLI). It&amp;rsquo;s about providing a platform that adapts to your specific requirements and style of interaction.&lt;/li>
&lt;/ul>
&lt;ol start="2">
&lt;li>Real-time Backend Environment Monitoring: Data as it Happens&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Real-time monitoring of backend environment data is at the heart of our project. It&amp;rsquo;s not just about collecting data; it&amp;rsquo;s about providing continuous insights into system performance. This feature empowers you to make real-time, data-driven decisions—an essential capability in today&amp;rsquo;s fast-paced computing landscape.&lt;/li>
&lt;/ul>
&lt;ol start="3">
&lt;li>Visualizing Environment Variables: Clarity Amidst Complexity&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>We&amp;rsquo;ve placed a strong emphasis on user-friendly data visualization. Our enhancements enable you to navigate through detected variables effortlessly and compare iterations within different buckets. The result is a visual representation of complex data, making it easier to comprehend and analyze.&lt;/li>
&lt;/ul>
&lt;ol start="4">
&lt;li>Predefined Monitoring Commands: Your Head Start&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>We understand that monitoring can be a daunting task. To simplify the process, we&amp;rsquo;ve introduced predefined monitoring commands such as mpstat and iostat. These templates serve as a launchpad for monitoring common system metrics, helping you get started quickly and efficiently.&lt;/li>
&lt;/ul>
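&lt;p>As an illustration of how such predefined monitoring templates might be stored and instantiated, here is a small Python sketch; the template names and fields are assumptions for this example, not the platform&amp;rsquo;s actual schema.&lt;/p>

```python
# Hypothetical registry of predefined monitoring commands; the names
# and template strings are illustrative assumptions.
PREDEFINED = {
    "cpu": "mpstat -P ALL {interval} {count}",  # per-CPU utilization
    "io":  "iostat -dx {interval} {count}",     # per-device I/O stats
}

def build_command(name: str, interval: int = 1, count: int = 5) -> str:
    """Fill a predefined template with a sampling interval (seconds)
    and a sample count, ready to hand to a shell or subprocess."""
    return PREDEFINED[name].format(interval=interval, count=count)
```

&lt;p>A user could then tweak only the interval and count, or copy a template and customize it, without writing the full command from scratch.&lt;/p>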
&lt;ol start="5">
&lt;li>Comprehensive Customization: Tailoring the Experience&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>Recognizing that every user has unique needs, our platform now offers extensive documentation. This documentation serves as a guide, enabling users to fine-tune their monitoring commands. It&amp;rsquo;s about tailoring the platform to match your specific requirements and preferences. The power to customize is firmly in your hands.&lt;/li>
&lt;/ul>
&lt;ol start="6">
&lt;li>Import and Export Functionality: Seamless Collaboration&lt;/li>
&lt;/ol>
&lt;ul>
&lt;li>In an era where collaboration and data management are essential, we&amp;rsquo;ve introduced the capability to import and export environment data. This feature simplifies data management and supports collaborative efforts, making it easy to share monitoring data and conduct analysis across various environments.&lt;/li>
&lt;/ul>
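&lt;p>Conceptually, the import/export feature boils down to serializing environment records to a portable format. The sketch below shows a minimal JSON round trip; the record shape and version field are illustrative assumptions, not our actual file format.&lt;/p>

```python
import json

def export_environment(records, path):
    """Write a list of environment records to a JSON file."""
    with open(path, "w") as f:
        json.dump({"version": 1, "records": records}, f, indent=2)

def import_environment(path):
    """Read environment records back from a JSON file."""
    with open(path) as f:
        payload = json.load(f)
    return payload["records"]
```

&lt;p>Because the payload is plain JSON, it can be shared between collaborators or moved between GUI and CLI installations.&lt;/p>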
&lt;h2 id="exploring-our-repositories">Exploring Our Repositories&lt;/h2>
&lt;p>
As mentioned earlier, we have completed the core functionalities of our platform, and we would love to have you try it out and provide us with valuable feedback. Here are the links to our repositories where you can explore and experiment with our platform:
&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentGUI" target="_blank" rel="noopener">GUI Repository&lt;/a> and &lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentCLI" target="_blank" rel="noopener">CLI Repository&lt;/a>
&lt;ul>
&lt;li>The journey begins with a choice. Our repositories cater to a diverse range of user preferences. Inside the README.md file of the GUI repository, you&amp;rsquo;ll find meticulous installation instructions to guide you through setting up the Graphical User Interface (GUI). It&amp;rsquo;s your portal to a user-friendly experience.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/test-experiment" target="_blank" rel="noopener">Sample Repository&lt;/a>
&lt;ul>
&lt;li>For those eager to embark on their monitoring journey, our Sample Repository is a valuable resource. It provides scripts that not only enable you to run our program but also serve as templates. These templates are designed to simplify the monitoring of your own programs, tailored to your unique requirements.
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h2 id="project-demo">Project Demo&lt;/h2>
&lt;p>
To provide you with a glimpse of what our project can do, here are some demo images showcasing the capabilities and features of &amp;ldquo;Public Artifact and Data Visualization.&amp;rdquo;
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_39e93d5df25c8b9261ed5b60f3a49091.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_435c9e662168ef7e029d1c36702fca84.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature1_hub6eb130c638c788b954d77fd05b17dc2_80420_39e93d5df25c8b9261ed5b60f3a49091.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_df590742e12a23dea8d1f3414c9e5c16.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_b47182cd4c3ea07108c723e7c18875e4.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature2_hu70b57a5e19005e11cc3a42881b456609_84702_df590742e12a23dea8d1f3414c9e5c16.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="" srcset="
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_6117bb9125bca9a4f63ad1631b5f7bcc.webp 400w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_3979a5588d47e6a37a482b5f2184d3af.webp 760w,
/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20231024-zjyhhhhh/feature3_hu17639a210c97ec1be7d726068aef2aa2_44169_6117bb9125bca9a4f63ad1631b5f7bcc.webp"
width="736"
height="656"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="thank-you-for-joining-us">Thank You for Joining Us&lt;/h2>
&lt;p>
We appreciate your support and participation in this journey of data visualization and empowerment. Our commitment to enhancing the world of data comprehension remains unwavering. As we mark the end of this chapter, we eagerly anticipate the exciting future that awaits in the realm of data visualization. The path doesn&amp;rsquo;t end here; it&amp;rsquo;s just the beginning of a new chapter in our collective exploration of data&amp;rsquo;s potential.
&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training -- Final Blog Post</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20231006-haoranwu/</link><pubDate>Fri, 06 Oct 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20231006-haoranwu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>For the second half of the project, I spent some time reproducing the figures and then focused on digging into the PyTorch source code to characterize the inter-GPU computation time (1 GPU vs. 2 GPUs).&lt;/p>
&lt;h4 id="summarization">Summarization&lt;/h4>
&lt;ul>
&lt;li>Finished reproducing Figures 3, 4, 5, and 6 from &lt;a href="https://ospo.ucsc.edu/project/osre23/utexas/gpuemulator" target="_blank" rel="noopener">GPU Emulator for Easy Reproducibility of DNN Training&lt;/a>.&lt;/li>
&lt;li>Explored inter-GPU computation in order to reproduce Figure 9.&lt;/li>
&lt;/ul>
&lt;h4 id="reporsitory-of-reproducing-figures">Reporsitory of Reproducing Figures&lt;/h4>
&lt;p>I have placed the repository of the deliverable here: &lt;a href="https://github.com/FarFlyField/OSRE_DELIVERABLE/tree/main" target="_blank" rel="noopener">https://github.com/FarFlyField/OSRE_DELIVERABLE/tree/main&lt;/a>
To use the repository, any CPU-only machine will do; alternatively, you can rent a GPU from Chameleon and compare its results with the emulator&amp;rsquo;s.&lt;/p>
&lt;p>The repository documents how to set everything up and how to interpret the data it produces; you will need to understand the spreadsheet and some of the graphing files.&lt;/p>
&lt;h4 id="study-of-inter-gpu-computation">Study of Inter-GPU Computation&lt;/h4>
&lt;p>I have dissected the PyTorch source code to identify the computation-time differences between using 1 GPU and 2 GPUs (inter-GPU computation time). The most significant difference arises during the forward pass. The following behaviors make the computation time longer when using 2 GPUs:&lt;/p>
&lt;ul>
&lt;li>When using 1 GPU, PyTorch puts the images used to train the model onto the GPU in the main application once and for all. With 2 GPUs, however, PyTorch transfers the images before each forward pass using a parallel function. The function does two things:
&lt;ul>
&lt;li>It splits the images into multiple sections and puts them onto the respective GPUs.&lt;/li>
&lt;li>It replicates the model being trained onto each of the GPUs and creates threads to train the replicas in parallel on the separate GPUs.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>These two steps account for a major part of the computation-time difference when training on multiple GPUs, and they also make the measured transfer time for 2 GPUs smaller, because the time spent transferring images is counted toward computation time.&lt;/li>
&lt;li>After the forward pass finishes, the parallel function gathers the outputs from the two GPUs and sends them to the first GPU.&lt;/li>
&lt;/ul>
&lt;p>After gathering the outputs onto the first GPU, the code trains on the next batch, repeating the steps of scattering the images, replicating the model, running the parallel forward pass, and gathering the outputs.&lt;/p>
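&lt;p>The scatter, replicate, parallel forward, and gather steps described above can be sketched in pure Python, with threads standing in for GPUs; the function names are illustrative stand-ins, not PyTorch&amp;rsquo;s internals.&lt;/p>

```python
import threading

def scatter(batch, n_devices):
    """Split a batch into n_devices roughly equal chunks."""
    k = (len(batch) + n_devices - 1) // n_devices
    return [batch[i * k:(i + 1) * k] for i in range(n_devices)]

def parallel_forward(model, batch, n_devices=2):
    chunks = scatter(batch, n_devices)       # scatter inputs across devices
    replicas = [model] * n_devices           # stand-in for model replication
    outputs = [None] * n_devices

    def worker(i):
        # each "device" runs forward on its own chunk in its own thread
        outputs[i] = [replicas[i](x) for x in chunks[i]]

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_devices)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # gather all outputs back onto "device 0", preserving batch order
    return [y for chunk in outputs for y in chunk]
```

&lt;p>The extra scatter/replicate/gather work in this pattern is exactly where the 2-GPU computation time picks up overhead relative to the single-GPU path.&lt;/p>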
&lt;p>The second significant difference, which I’m working on right now, is how PyTorch runs the backward functions; these are broadly similar to the forward pass but differ in important details. I have identified the loss.backward() call in our application code as the only contributor to the time difference in computation time. Here are a few tasks I completed after locating it:&lt;/p>
&lt;ul>
&lt;li>Recorded the functions’ call stacks when using 1 GPU and 2 GPUs.&lt;/li>
&lt;li>Recorded the time spent in each function in the call stack.&lt;/li>
&lt;li>Identified inconsistencies in the measurements, then repeated and verified until consistency was reached.&lt;/li>
&lt;/ul>
&lt;p>I have finished the basic measurements and drafted the call stack, but I haven’t figured out the exact differences yet. Because most of the functions are implemented in C++, printing out their inputs to evaluate them will be slightly harder, but doable.&lt;/p>
&lt;p>The data recorded and analyzed are placed here:
&lt;a href="https://docs.google.com/spreadsheets/d/1vFj-UE3mjtsHIc5OesKX1sDvr6fpPwtPUMl0pM3V8SA/edit?usp=sharing" target="_blank" rel="noopener">https://docs.google.com/spreadsheets/d/1vFj-UE3mjtsHIc5OesKX1sDvr6fpPwtPUMl0pM3V8SA/edit?usp=sharing&lt;/a>
Summarized doc:
&lt;a href="https://docs.google.com/document/d/10XWNwCZ3kLzy4i6WgJ6KPsujEs2X1gzXblZtUoqMuJw/edit" target="_blank" rel="noopener">https://docs.google.com/document/d/10XWNwCZ3kLzy4i6WgJ6KPsujEs2X1gzXblZtUoqMuJw/edit&lt;/a>&lt;/p></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding (Midterm)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230801-zhiyanw/</link><pubDate>Sat, 05 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230801-zhiyanw/</guid><description>&lt;p>Hi Everyone,&lt;/p>
&lt;p>I hope everything goes well! This is my second blog post for my project &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/MultiLevelErasure">Reproducible Evaluation of Multi-level Erasure Coding&lt;/a> under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>, and Meng Wang. In summary, my project aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations. The details are in this &lt;a href="https://docs.google.com/document/d/1dO1aING1QcSB---XklzUjNz0usVh7qWffVGC3GZq2AE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>.&lt;/p>
&lt;p>In the course of these few weeks, I&amp;rsquo;ve completed several tasks to achieve the aim of this project, including&lt;/p>
&lt;ul>
&lt;li>Literature Review&lt;/li>
&lt;li>Studying the Erasure Coding Simulator and Creating Reproducible Evaluations, with the following policies
&lt;ul>
&lt;li>Clustered/Declustered Local-level SLEC&lt;/li>
&lt;li>Clustered/Declustered Network-level SLEC&lt;/li>
&lt;li>MLEC with C/C, C/D, D/C, D/D configuration&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="literature-review">Literature Review&lt;/h2>
&lt;p>Prior to developing the simulator, my first step was to delve into the literature on distinct erasure coding policies. To understand a simulator for a complex erasure coding policy such as MLEC, I wanted to start from simpler EC policies and then extend my knowledge to more complex ones. Moreover, I also aimed to contrast the durability of MLEC with comparable EC policies like LRC in my evaluations, making it vital to understand the implementation of these policies.&lt;/p>
&lt;p>Over the first week, I read several papers on different chunk placement policies for erasure coding, including LRC (Local Reconstruction Codes), CL-LRC (Combined Locality for Local Reconstruction Codes), SODP (Single Overlap Declustered Parity), and MLEC (Multi-Level Erasure Coding). These papers offered a fundamental comprehension of each policy, their respective advantages and drawbacks, and their practical usage in production environments.&lt;/p>
&lt;h2 id="simulator-reproduction">Simulator Reproduction&lt;/h2>
&lt;p>After gaining some understanding from the papers I read, I started to study the EC simulator by building one myself. I obtained the MLEC simulator from the mentors. However, the simulator lacks documentation and guides, making it hard for others to reproduce evaluation results. It is also complicated to understand, as it simulates various EC schemes, chunk placements, and rebuild policies, amounting to 13,000 lines of code. Therefore, my goal is to understand the design and implementation details of the simulator, after which I will create guides for reproducible evaluations.&lt;/p>
&lt;p>In order to fully understand the simulator, the best way is to rebuild the simulator by myself. The simulator is designed to mimic disk failures over the span of a year under varying chunk placement policies. Once successfully rebuilt, the simulator will enable me to assess the durability of MLEC in relation to other widely-used chunk placement policies. I followed the given simulator and rewrote it on my own in Python.&lt;/p>
&lt;p>Based on the skeleton of the given simulator, I first rebuilt a simple simulator that simulates SLEC (single-level erasure coding, in both local and network settings) with clustered parities. With the given arguments, the simulator can run an arbitrary number of iterations, each simulating disk failures over one year. The simulator then counts the iterations in which there is a data loss. The ratio of failed iterations to total executed iterations estimates the annual probability of data loss, and its complement gives the durability of the erasure coding policy. This simulation allows us to evaluate the durability of SLEC, laying the foundation for the later evaluation of MLEC.&lt;/p>
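&lt;p>The core of this iteration loop can be sketched as a small Monte Carlo estimate for a single clustered n+m stripe with no repair; the real simulator additionally models repair, bandwidth, placement, and rebuild policies, and the parameter values below are illustrative only.&lt;/p>

```python
import random

def simulate_year(n=10, m=2, annual_fail_prob=0.02, rng=random):
    """Return True if data is lost in one simulated year, i.e. if more
    than m of the n+m disks in the stripe fail (no repair modeled)."""
    failures = sum(annual_fail_prob > rng.random() for _ in range(n + m))
    return failures > m

def estimate_durability(iterations=100_000, seed=42, **kw):
    """Durability = 1 - (failed iterations / total iterations)."""
    rng = random.Random(seed)
    losses = sum(simulate_year(rng=rng, **kw) for _ in range(iterations))
    return 1.0 - losses / iterations
```

&lt;p>With repair modeled, far fewer iterations see a data loss, which is why the real simulator needs the splitting technique mentioned later to estimate very high durabilities.&lt;/p>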
&lt;p>Next, I extended my simulator from local-level SLEC implementation by adding more policies. I began by introducing a network-level SLEC policy with clustered parities. This differs slightly from the local-level EC as it necessitates the consideration of factors like network bandwidth within the simulator.&lt;/p>
&lt;p>In addition, I have delved deeper into simulating declustered parities and successfully discovered a method to simulate disk failures. Basically, the simulator generates failures within a one-year timeframe and subsequently repairs them using priority queues. The disks associated with stripes experiencing the most failures are given the highest repair priority. With this construction, the simulator is capable of simulating local-level declustered parities, with the ability to specify parameters.&lt;/p>
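&lt;p>The repair-priority idea above can be sketched with a max-heap keyed on per-stripe failure counts; the naming below is illustrative, not the simulator&amp;rsquo;s actual data structures.&lt;/p>

```python
import heapq

def repair_order(stripe_failures):
    """stripe_failures: {stripe_id: number_of_failed_disks}.
    Returns stripe ids in descending order of failure count, so the
    disks in the most-degraded stripes are repaired first."""
    # heapq is a min-heap, so negate the counts to get a max-heap
    heap = [(-count, stripe) for stripe, count in stripe_failures.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, stripe = heapq.heappop(heap)
        order.append(stripe)
    return order
```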
&lt;p>Upon successfully simulating local-level declustered parities, the construction of the simulator for network level declustered parities was rather straightforward. I then validated it using the simulator and math models provided by the mentors. The results perfectly agree with each other, which proves the correctness of my understanding for the SLEC declustered placements. By implementing the simulator myself, I strengthened my understanding of erasure coding designs and the simulation techniques, which equipped me with a solid foundation to continue to reproduce MLEC simulations.&lt;/p>
&lt;p>Based on the knowledge gained from implementing the SLEC simulators myself, I then reverse-engineered the MLEC simulator provided by the mentors from their MLEC paper. I chose to start from the simplest policy, clustered parities at both levels. After spending considerable time digging into the simulator source code, I was able to understand the simulation workflow, the different repair methods it implements, and the splitting method it uses to simulate high durabilities. I then revised my simulator based on this understanding. I also ran a few experiments using the same configuration setups as specified in the paper. The results agree well with those in the paper, which verifies my reproduction work.&lt;/p>
&lt;h2 id="technical-issues">Technical Issues&lt;/h2>
&lt;p>In the process of building the MLEC simulator, I&amp;rsquo;ve encountered many issues, both conceptual and technical. The mentors were super helpful and responsive throughout, so I was able to make steady progress.&lt;/p>
&lt;h2 id="summary">Summary&lt;/h2>
&lt;p>Overall, I&amp;rsquo;ve rebuilt a Python simulator for various EC policies, and the simulator can successfully reproduce the results from the paper.&lt;/p>
&lt;h2 id="next-steps">Next Steps&lt;/h2>
&lt;p>My next step is to package the simulator into a reproducible Trovi artifact, so others can reproduce evaluations of the performance and durability of various EC policies, in particular MLEC.&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time (Midterm Blog Post)</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</link><pubDate>Thu, 03 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230803-charishulu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, our goal is to characterize the tools in genomic workflows in terms of system metrics and data quality, in order to build machine learning models that predict the elapsed time of genomic workflows. While Shayantan (another contributor) did the analysis on data quality metrics, I contributed the system metrics analysis. We are getting closer to that goal: we have collected the datasets and performed some analysis.&lt;/p>
&lt;h2 id="steps">Steps&lt;/h2>
&lt;p>In this project, we selected the DNA-Seq Pipeline as the workflow to be analyzed. This pipeline consists of four tools for processing single-end reads, namely BWA-mem, Samtool-view, Picard-SortSam, and Picard-MarkDuplicates. We executed each tool using various configurations and stored system metrics for each execution. To do this, we took two steps:&lt;/p>
&lt;ul>
&lt;li>Step 1: Building the tools execution environment.&lt;/li>
&lt;li>Step 2: Developing a program to execute the tools with various configurations and collect runtime metrics (e.g., CPU, RSS, VSZ, and I/O) automatically.&lt;/li>
&lt;/ul>
&lt;h2 id="execution-environment">Execution Environment&lt;/h2>
&lt;p>Tools are executed on Chameleon instances by submitting them via Slurm. The machine used for collecting system metrics is a Haswell instance at the Chameleon Texas site. This instance uses an Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz with the following specifications.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Number of CPUs&lt;/th>
&lt;td>48&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of threads per core&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of cores per socket&lt;/th>
&lt;td>12&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Number of sockets&lt;/th>
&lt;td>2&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>In this experiment, we use n+1 instances: n compute nodes and 1 master node. Each execution is done by submitting a job, i.e., a tool with a certain configuration, from the master node, and it is processed by one of the compute nodes. For the tools to be executable on every node, we set up the master node as a shared storage location exported over NFS. This shared storage holds the input files and the commands for executing the tools, so all nodes can access them without having to download and install them locally.&lt;/p>
&lt;h2 id="executing-and-collecting-system-metrics">Executing and Collecting System Metrics&lt;/h2>
&lt;p>Tools are executed under various configurations by varying parameters such as input size, number of allocated CPUs, allocated memory, and number of threads. For example, for BWA-mem the numbers of values tried for allocated CPUs, allocated memory, and threads are 5, 4, and 5 respectively, over 10 different input files, so there are 5 x 4 x 5 x 10 = 1000 configuration combinations. Each configuration is executed 8 times, yielding 8000 data points. Configuration details can be seen in the following table.&lt;/p>
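&lt;p>The combination count can be reproduced with a quick enumeration; the memory values are assumed here to be in GB, since the units are not stated.&lt;/p>

```python
import itertools

# BWA-mem parameter grid from the text
files   = list(range(10))    # 10 input files
cpus    = [2, 4, 8, 16, 32]  # allocated CPUs (5 values)
memory  = [8, 16, 32, 64]    # allocated memory, GB assumed (4 values)
threads = [2, 4, 8, 16, 32]  # thread counts (5 values)

configs = list(itertools.product(files, cpus, memory, threads))
data_points = len(configs) * 8  # 8 repetitions per configuration
```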
&lt;table>
&lt;tr>
&lt;th>&lt;/th>
&lt;th>#repetitions&lt;/th>
&lt;th>#files&lt;/th>
&lt;th>#allocated CPU&lt;/th>
&lt;th>#allocated memory&lt;/th>
&lt;th>#threads&lt;/th>
&lt;th>total&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>BWA-mem&lt;/th>
&lt;td>8&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Samtool-view&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-Sortsam&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;td>10&lt;/td>
&lt;td>10&lt;/td>
&lt;td>2, 4, 8, 16, 32&lt;/td>
&lt;td>8, 16, 32, 64&lt;/td>
&lt;td>-&lt;/td>
&lt;td>2000&lt;/td>
&lt;/tr>
&lt;/table>
&lt;p>Meanwhile, to run the tools, we use the following commands:&lt;/p>
&lt;ul>
&lt;li>BWA-mem&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$BWA&lt;/span> mem -t &lt;span class="nv">$threads&lt;/span> &lt;span class="nv">$REF_DIR&lt;/span>/hg19.fa &lt;span class="si">${&lt;/span>&lt;span class="nv">INPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>*.fastq &amp;gt; &lt;span class="si">${&lt;/span>&lt;span class="nv">OUTPUT_DIR&lt;/span>&lt;span class="si">}&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Samtool-view&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">&lt;span class="nv">$SAMTOOLS&lt;/span> view &lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.sam -Shb -o &lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-SortSam&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> SortSam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">SORT_ORDER&lt;/span>&lt;span class="o">=&lt;/span>coordinate &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>
&lt;li>Picard-MarkDuplicates&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">java -jar &lt;span class="nv">$PICARD&lt;/span> MarkDuplicates &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">CREATE_INDEX&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nb">true&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">INPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$INPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">OUTPUT&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>.bam &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">METRICS_FILE&lt;/span>&lt;span class="o">=&lt;/span>&lt;span class="nv">$OUTPUT_DIR&lt;/span>/&lt;span class="si">${&lt;/span>&lt;span class="nv">sra_id&lt;/span>&lt;span class="si">}&lt;/span>_rmd.txt &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span>&lt;span class="nv">VALIDATION_STRINGENCY&lt;/span>&lt;span class="o">=&lt;/span>STRICT
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In Slurm, each job has a job ID, and the &lt;code>scontrol listpids&lt;/code> command reports the mapping from job IDs to PIDs. Using this mapping, we can obtain system metrics for a job by reading the &lt;code>/proc/$PID&lt;/code> virtual filesystem, which exposes CPU usage, physical memory, virtual memory, bytes read, and bytes written at a given moment. To collect this data, we record these features together with a timestamp at one-second intervals throughout the execution.&lt;/p>
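&lt;p>As a rough sketch of such a collector (assuming a Linux &lt;code>/proc&lt;/code> layout; the helper names below are illustrative, not the exact scripts we ran), the per-PID sampling could look like:&lt;/p>

```python
import time

def parse_status(text):
    """Extract physical (VmRSS) and virtual (VmSize) memory, in kB,
    from the contents of /proc/PID/status."""
    fields = {}
    for line in text.splitlines():
        if line.startswith(("VmRSS:", "VmSize:")):
            key, value = line.split(":", 1)
            fields[key] = int(value.strip().split()[0])  # value looks like "  1234 kB"
    return fields

def parse_io(text):
    """Extract cumulative read/write bytes from /proc/PID/io."""
    fields = {}
    for line in text.splitlines():
        if line.startswith(("read_bytes:", "write_bytes:")):
            key, value = line.split(":", 1)
            fields[key] = int(value)
    return fields

def sample(pid):
    """Take one timestamped sample for a single PID (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        status = parse_status(f.read())
    with open("/proc/%d/io" % pid) as f:
        io = parse_io(f.read())
    return {"ts": time.time(), **status, **io}

# The collector would call sample(pid) once per second for every PID
# that `scontrol listpids` maps to the Slurm job.
```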
&lt;h2 id="results">Results&lt;/h2>
&lt;p>We have also calculated the correlation of each feature with elapsed time. For BWA-mem, the features with an absolute correlation above 0.5 are input size, average CPU usage, and output file size, which is in SAM format. For Samtool-view, they are input size, average CPU usage, and output size in BAM format.
For SortSam, they are input size, write operations, and BAM output size. For MarkDuplicates, they are input size and BAM output size.&lt;/p>
&lt;table>
&lt;tr>
&lt;th>Features\Tools&lt;/th>
&lt;th>BWA-mem&lt;/th>
&lt;th>Samtool-view&lt;/th>
&lt;th>Picard-SortSam&lt;/th>
&lt;th>Picard-MarkDuplicates&lt;/th>
&lt;/tr>
&lt;tr>
&lt;th>Allocated CPU&lt;/th>
&lt;td>-0.145&lt;/td>
&lt;td>-0.095&lt;/td>
&lt;td>-0.179&lt;/td>
&lt;td>-0.156&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Allocated physical memory&lt;/th>
&lt;td>-0.010&lt;/td>
&lt;td>-0.038&lt;/td>
&lt;td>-0.069&lt;/td>
&lt;td>0.132&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Input size&lt;/th>
&lt;td>&lt;b>0.583&lt;/b>&lt;/td>
&lt;td>&lt;b>0.651&lt;/b>&lt;/td>
&lt;td>&lt;b>0.937&lt;/b>&lt;/td>
&lt;td>&lt;b>0.922&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Threads&lt;/th>
&lt;td>-0.072&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average CPU&lt;/th>
&lt;td>&lt;b>-0.607&lt;/b>&lt;/td>
&lt;td>&lt;b>-0.567&lt;/b>&lt;/td>
&lt;td>-0.479&lt;/td>
&lt;td>-0.480&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak CPU&lt;/th>
&lt;td>-0.175&lt;/td>
&lt;td>0.174&lt;/td>
&lt;td>-0.170&lt;/td>
&lt;td>0.046&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average RSS&lt;/th>
&lt;td>0.040&lt;/td>
&lt;td>0.034&lt;/td>
&lt;td>0.131&lt;/td>
&lt;td>0.182&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak RSS&lt;/th>
&lt;td>0.068&lt;/td>
&lt;td>0.046&lt;/td>
&lt;td>0.314&lt;/td>
&lt;td>0.175&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Average VSZ&lt;/th>
&lt;td>0.032&lt;/td>
&lt;td>-0.349&lt;/td>
&lt;td>-0.127&lt;/td>
&lt;td>0.090&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Peak VSZ&lt;/th>
&lt;td>0.048&lt;/td>
&lt;td>0.074&lt;/td>
&lt;td>-0.130&lt;/td>
&lt;td>0.088&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Write bytes&lt;/th>
&lt;td>0.037&lt;/td>
&lt;td>0.190&lt;/td>
&lt;td>&lt;b>0.735&lt;/b>&lt;/td>
&lt;td>0.244&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Read bytes&lt;/th>
&lt;td>-0.031&lt;/td>
&lt;td>0.109&lt;/td>
&lt;td>0.070&lt;/td>
&lt;td>0.110&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output SAM size&lt;/th>
&lt;td>&lt;b>0.589&lt;/b>&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAM size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>&lt;b>0.763&lt;/b>&lt;/td>
&lt;td>&lt;b>0.934&lt;/b>&lt;/td>
&lt;td>&lt;b>0.923&lt;/b>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;th>Output BAI size&lt;/th>
&lt;td>-&lt;/td>
&lt;td>-&lt;/td>
&lt;td>0.400&lt;/td>
&lt;td>0.399&lt;/td>
&lt;/tr>
&lt;/table>
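&lt;p>The table entries are plain Pearson coefficients; a minimal sketch of the computation (the data points below are made up for illustration, not our measured values):&lt;/p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-job records: (input size in GB, elapsed time in s).
input_size = [1.2, 2.5, 3.1, 4.8, 6.0]
elapsed = [310, 520, 640, 980, 1150]

r = pearson(input_size, elapsed)
# Features whose |r| exceeds 0.5 are the ones we treat as strongly correlated.
assert abs(r) > 0.5
```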
&lt;h2 id="future-works">Future Works&lt;/h2>
&lt;p>For further work, we will analyze the features whose correlation with elapsed time falls below an absolute value of 0.5. These features may in fact be correlated with elapsed time, but the relationship can be masked when correlations are computed over the whole dataset. So we also need to calculate the per-feature correlations with the data grouped by input file. Then, we will create a machine learning model to predict elapsed time.&lt;/p></description></item><item><title>[FLASHNET]: Leveraging ML-augmented I/O in Linux</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230802-justin08784/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230802-justin08784/</guid><description>&lt;p>Hello everyone,&lt;/p>
&lt;p>This is my second blog post for SoR 2023. As you may recall from my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/">initial blogpost&lt;/a>, I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet/">Flashnet&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>.&lt;/p>
&lt;p>I&amp;rsquo;ve been assigned two major tasks under Flashnet:&lt;/p>
&lt;ol>
&lt;li>Perform post-training quantization (PTQ) on existing Flashnet models&lt;/li>
&lt;li>Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication&lt;/li>
&lt;/ol>
&lt;h2 id="task-1-perform-post-training-quantization-ptq-on-existing-flashnet-models">Task 1: Perform post-training quantization (PTQ) on existing Flashnet models&lt;/h2>
&lt;p>Since all of our models are currently built using the keras API, I decided to use the tensorflow-lite library, which supports direct conversion. Unfortunately, I encountered several persistent bugs while attempting to apply full-integer quantization on our binary neural network model:&lt;/p>
&lt;h3 id="shapedimension-distortion">Shape/dimension distortion:&lt;/h3>
&lt;p>Bug description: The quantized tflite model produces outputs of shape (8, 1), the same as the input shape, whereas the original model produces single-value outputs of shape (1, 1).&lt;/p>
&lt;p>Status: Resolved&lt;/p>
&lt;ul>
&lt;li>The original model has an input dimension of 8 for each input/x-value and there could be several inputs grouped in a single batch.&lt;/li>
&lt;li>Input/batch size is also determined implicitly in the normalization layer of the original model&lt;/li>
&lt;li>However, the &amp;ldquo;interpreter&amp;rdquo; in the quantized model runs inference one input at a time, so the batch size needs to be explicitly set to 1, i.e. the shape of a single input, (1, 8)&lt;/li>
&lt;li>Doing so resolves the model distortion&lt;/li>
&lt;/ul>
&lt;h3 id="incorrect-y-value-range">Incorrect y-value range:&lt;/h3>
&lt;p>Bug description: There is no variation in the quantized model outputs (i.e. it emits the same value for every input row)&lt;/p>
&lt;p>In the original model, each inference output is a floating point value between 0 and 1. Outputs also vary according to input. This output is rounded towards 0 or 1 using a 0.5 standard cutoff (i.e. x &amp;gt; 0.5 → x = 1). Since the quantized model condenses 32-bit floats into 8-bit integers, we should expect a similar variation in output values across an 8-bit integer range.&lt;/p>
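&lt;p>For intuition on the expected behavior, full-integer quantization maps each float to an 8-bit integer through a scale and zero point. The toy example below is not the tflite internals, and the scale/zero-point values are assumptions; it only shows how outputs in [0, 1] should spread across the int8 range:&lt;/p>

```python
def quantize(x, scale, zero_point):
    """Affine quantization: q = round(x / scale) + zero_point, clamped to int8."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# For sigmoid-like outputs in [0, 1], one plausible parameterization is
# scale = 1/255 with zero_point = -128, so the full int8 range is used.
scale, zp = 1.0 / 255, -128

qs = [quantize(x, scale, zp) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
# A healthy quantized model should show variation like this across inputs;
# an identical q value for every input (what we observed) signals broken weights.
assert len(set(qs)) == 5
```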
&lt;p>Printing the quantized model weights, I discovered that the weights may explode or vanish during the quantization process, i.e. their values blow up toward infinity or collapse to 0, and therefore cannot carry any meaningful signal. The likely consequence is that the inference output always equals the bias matrix (since the Wx term in y = Wx + B gets zeroed out).&lt;/p>
&lt;p>Status: Open&lt;/p>
&lt;ul>
&lt;li>Multiple potential causes were considered, without any success:
&lt;ul>
&lt;li>Improper quantization of inputs/outputs&lt;/li>
&lt;li>Insufficient training time/number of epochs&lt;/li>
&lt;li>Incompatible model type/structure&lt;/li>
&lt;li>Incompatible tensorflow-lite version&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>At this point, I concluded that tensorflow-lite is too bug-ridden for further attempts with the library to be worthwhile.&lt;/li>
&lt;/ul>
&lt;h2 id="task-2-implement-a-rocksdb-client-to-interface-with-the-flashnet-kernel-with-3-way-replication">Task 2: Implement a rocksDB client (to interface with the Flashnet kernel) with 3-way replication&lt;/h2>
&lt;p>RocksDB is an embedded database for key-value data. Our Flashnet team is currently implementing a Flashnet client in Ceph, so they have tasked me with exploring an implementation in RocksDB as an alternative.&lt;/p>
&lt;p>I&amp;rsquo;ve started on this segment of the project only recently, so my current work is still in its formative stages. As of writing, I have primarily been setting up software (on a new Chameleon instance), running toy db examples, and learning the basic terminology from the RocksDB documentation.&lt;/p>
&lt;h2 id="future-work">Future work&lt;/h2>
&lt;p>I expect to continue working on Task 1 (do quantization from ground-up or use a different library) and Task 2 as detailed above. I also hope to implement a transformer-based model to supplement our existing suite of Flashnet models.&lt;/p></description></item><item><title>[Midterm] FlashNet: Towards Reproducible Continual Learning for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/</guid><description>&lt;h2 id="mid-term-report">Mid-Term Report&lt;/h2>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">FlashNet&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1EhJm3kqrpybOkpXiiRMfqVxGeKe9iIsh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;strong>Daniar Kurniawan&lt;/strong>, aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques. We focus on predicting I/O latency to decide whether an I/O should be failed over to another SSD. The following sections describe the work, the major milestones achieved, accomplishments, and challenges during the first half of the summer.&lt;/p>
&lt;h2 id="work-description-major-milestones-achieved-and-accomplishments">Work Description, Major Milestones Achieved, and Accomplishments&lt;/h2>
&lt;p>For the first half of the summer, I implemented the continual learning pipeline for the model along with several drift detection algorithms, and then evaluated their effectiveness. Below is a detailed description of each subtask.&lt;/p>
&lt;h3 id="1-continual-learning-pipeline">1. Continual Learning pipeline&lt;/h3>
&lt;p>Firstly, I designed the pipeline. As shown in the figure below, the pipeline contains four main modules, namely initial train, retrain, inference, and monitor.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Pipeline Flowchart" srcset="
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_2067e763ad30087275106bc5b2921a5a.webp 400w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_fcd6d4a25c164fcfc872329662c36fa5.webp 760w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/cl-pipeline_hubf4c27ce042fa200bb9ef46ed6f9b5dd_194399_2067e763ad30087275106bc5b2921a5a.webp"
width="760"
height="249"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>The modules were first developed in Python using a linear regression model. It turned out that linear regression was not expressive enough and yielded poor accuracy. To overcome this problem, I introduced more models and learning tasks.&lt;/p>
&lt;p>Hence, in the final implementation, we have random forest and neural network models for both the regression and classification tasks. These models outperform linear regression, and the pipeline has also been optimized.&lt;/p>
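&lt;p>To make the module interplay concrete, here is a stripped-down sketch of the pipeline loop; &lt;code>train&lt;/code> and &lt;code>detect_drift&lt;/code> below are trivial stand-ins for the real models and detectors, and the window size is an assumption:&lt;/p>

```python
import collections

WINDOW = 1000  # hypothetical monitoring window, in number of I/Os

def train(samples):
    """Stand-in for the initial-train / retrain modules: returns a trivial
    'model' that always predicts the mean latency it was trained on."""
    mean = sum(samples) / len(samples)
    return lambda features: mean

def detect_drift(reference, recent):
    """Stand-in for the monitor module: flag drift when the recent mean
    moves far from the reference mean (the real detector is statistical)."""
    ref = sum(reference) / len(reference)
    rec = sum(recent) / len(recent)
    return abs(rec - ref) > 0.5 * ref

def pipeline(latencies):
    reference = list(latencies[:WINDOW])
    model = train(reference)                    # initial train
    recent = collections.deque(maxlen=WINDOW)
    retrains = 0
    for lat in latencies[WINDOW:]:
        model(lat)                              # inference: failover decision
        recent.append(lat)                      # monitor recent I/Os
        if len(recent) == WINDOW and detect_drift(reference, recent):
            reference = list(recent)            # retrain on the new regime
            model = train(reference)
            retrains += 1
            recent.clear()
    return retrains
```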
&lt;h3 id="2-drift-detection-algorithms">2. Drift detection algorithms&lt;/h3>
&lt;p>Sometimes the model&amp;rsquo;s performance may degrade when recent I/Os have different characteristics from the data it was trained on. Hence, there should be a retrain process, and something must trigger it. The trigger could be as simple as a periodic schedule, or it could use a technique called drift detection. Retraining too often incurs a large computational overhead, while retraining too seldom causes performance degradation. Hence, we should build a good, reliable drift detection algorithm that can sense the presence of concept and covariate drift in recent data.&lt;/p>
&lt;p>In order to build a good algorithm, I first used heuristics derived from an understanding of how latency and throughput change over time. However, the results turned out not to be very good. Thus, I have been relying on statistical tests as the drift detector. So far, the Kolmogorov-Smirnov test&amp;ndash;commonly known as the ks-test&amp;ndash;is the best drift detector.&lt;/p>
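&lt;p>For reference, the two-sample ks statistic is simply the largest vertical gap between the two empirical CDFs. A self-contained sketch follows (in practice a library implementation such as scipy&amp;rsquo;s would be used; ties are handled simplistically here, which is adequate for continuous latency data):&lt;/p>

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of the two samples (simple tie handling)."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i != n and j != m:
        if b[j] >= a[i]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

# Similar windows give a small statistic, a shifted window a large one;
# drift is declared when the statistic crosses a chosen threshold.
same = ks_statistic(list(range(100)), list(range(100)))
shifted = ks_statistic(list(range(100)), list(range(50, 150)))
assert shifted > same
```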
&lt;h3 id="3-evaluation">3. Evaluation&lt;/h3>
&lt;p>The featured image in the headline of this blog, also shown below, is the result of the evaluation. I evaluated the models and drift detection algorithms using Cumulative Distribution Function (CDF) graphs, to see whether any latency tail cut is achieved.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Evaluation" srcset="
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_4866e846612d96725d801519edf06392.webp 400w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_9203cd36fc4c6de03e02a799cd564f1d.webp 760w,
/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230807-rannnayy/featured_hua13ad1b86612ea35a1f0d083114566fc_25432_4866e846612d96725d801519edf06392.webp"
width="760"
height="396"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>During the implementation, I encountered several challenges as follows,&lt;/p>
&lt;h3 id="1-choice-of-model">1. Choice of Model&lt;/h3>
&lt;p>Since we want to integrate the pipeline into real storage systems, we had to be mindful of the model choice. Classical machine learning models are lighter than deep learning models, but deep learning models offer higher accuracy and are thus preferable when resources allow. Hence, I implemented both and examined their effectiveness.&lt;/p>
&lt;h3 id="2-choice-of-drift-detection-algorithm">2. Choice of Drift Detection Algorithm&lt;/h3>
&lt;p>Continual learning was chosen for this task because the workload may change over time, so the model may need to be retrained. The implication is that we need a condition that triggers retraining. As training a model is costly, we need to retrain mindfully; thus, we use a drift detection algorithm to decide whether retraining is needed.&lt;/p>
&lt;p>There are two types of drift detection algorithms, namely statistical tests and model-based drift detection. To minimize overhead, we pick statistical tests. Various such algorithms exist; I picked five of them to implement and evaluate.&lt;/p>
&lt;h2 id="plan">Plan&lt;/h2>
&lt;p>For the second half of the summer, I am going to study Riak and create a Chameleon Trovi artifact for deploying Riak in a cluster.&lt;/p></description></item><item><title>Midterm Blog Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230802-ren.450/</link><pubDate>Wed, 02 Aug 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230802-ren.450/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a>, aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>Based on our project proposal, the first step is to test the benchmark application on the targeted systems. We pick the open-source database system PostgreSQL as the target. We test the TPC-C benchmark on PostgreSQL under default settings and measure its throughput, setting the scale factor to 10 and incrementing the number of worker terminals. The database server settings are all defaults, and we take these results as the baseline. In order to test more parameters and system settings, we need to choose a combination of parameters that yields optimal throughput.&lt;/p>
&lt;p>We use an online tool &lt;a href="https://pgtune.leopard.in.ua/#/" target="_blank" rel="noopener">PGTune&lt;/a>, which aims to tune PostgreSQL config by the hardware. We select shared_buffer, min/max_wal_size and effective_cache_size as first set of parameters to measure. They are related to memory consumption, checkpoints and planner cost in the database server. Based on PostgreSQL &lt;a href="https://www.postgresql.org/docs/current/runtime-config.html" target="_blank" rel="noopener">official documentation&lt;/a>, shared_buffer sets the amount of memory the database server uses for shared memory buffers. Max_wal_size sets the maximum size to let the WAL grow during automatic checkpoints. Larger settings for shared_buffers usually require a corresponding increase in max_wal_size, in order to spread out the process of writing large quantities of new or changed data over a longer period of time. Effective_cache_size sets the planner&amp;rsquo;s assumption about the effective size of the disk cache that is available to a single query. This is factored into estimates of the cost of using an index; a higher value makes it more likely index scans will be used, a lower value makes it more likely sequential scans will be used.&lt;/p>
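&lt;p>For a concrete sense of the knobs involved, a PGTune-style &lt;code>postgresql.conf&lt;/code> fragment might look like the following (the values are illustrative for a hypothetical 32 GB dedicated server, not the exact settings from our experiments):&lt;/p>

```ini
# postgresql.conf fragment (hypothetical 32 GB dedicated server)
shared_buffers = 8GB            # memory for the shared buffer pool (~25% of RAM)
effective_cache_size = 24GB     # planner's estimate of the OS disk cache
min_wal_size = 2GB              # lower bound for recycled WAL
max_wal_size = 8GB              # WAL growth allowed between automatic checkpoints
```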
&lt;p>We conduct the experiments by setting the parameters in increments and comparing the throughput against each other and the baseline. Based on the results, the throughput of the benchmark with a larger shared_buffer and max_wal_size is up to 1.5x the performance under default settings, and the improvement from tuning max_wal_size is larger than that from tuning shared_buffer. An increased effective_cache_size has no effect on this benchmark workload compared to the system&amp;rsquo;s default value.&lt;/p>
&lt;p>There are more values of the above-mentioned parameters to test. Next, I will test those parameters with incremented values. Furthermore, we need to choose a combination of more parameters to reach optimal throughput. Also, by its own description, the tuning tool may not generate optimal values for very-high-memory systems. This requires us to test more possible parameters and values for better performance.&lt;/p></description></item><item><title>Mid-term blog post for Public Artifact Data and Visualization</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230731-zjyhhhhh/</link><pubDate>Mon, 31 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230731-zjyhhhhh/</guid><description>&lt;p>Over the past few weeks, our platform development has been progressing steadily, and we are excited to share the milestones we have achieved so far. As planned in our &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/intel/artifactviz/20230617-zjyhhhhh">introductory blog&lt;/a>, we have successfully laid the groundwork for the platform with the guidance and support of our mentor.&lt;/p>
&lt;h2 id="milestones-and-accomplishments">Milestones and Accomplishments&lt;/h2>
&lt;p>Here are some of the key functionalities we have implemented so far:&lt;/p>
&lt;ol>
&lt;li>Modular Architecture: We successfully designed the platform with a modular architecture, separating the Graphical User Interface (GUI) and Command-Line Interface (CLI) functionalities. This modularity allows users to interact with the platform in their preferred way.&lt;/li>
&lt;li>Experiment and Bucket Creation: Users can now create experiments, buckets (for storing different implementations of experiments), and iterations using either the GUI or CLI.&lt;/li>
&lt;li>Real-time Backend Environment Monitoring: Through the command line interface, users have the capability to control the monitoring of backend environment data, allowing for real-time tracking and analysis of important metrics.&lt;/li>
&lt;li>Visualizing Environment Variables: Users can now visualize detected environment variables on the platform. Moreover, they can compare iterations within different buckets and gain more insights by observing the timeseries data, such as CPU usage, in a graphical format.&lt;/li>
&lt;/ol>
&lt;h2 id="challenges">Challenges&lt;/h2>
&lt;p>In the early stages of designing our platform, we encountered significant challenges at the system design level. One of the most daunting obstacles we faced was devising an effective method to monitor backend environment variables. To tackle this obstacle, we engaged in extensive discussions and sought guidance from our mentor. After careful consideration, we decided to adopt a multi-process approach to monitor the backend environment variables effectively. Specifically, we devised a meticulous strategy of creating a separate process in the background for each specific metric we needed to monitor. By allocating a dedicated process to each metric, we ensured a streamlined and efficient monitoring process.&lt;/p>
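&lt;p>A stripped-down version of the one-process-per-metric idea (the sampler functions below are stand-ins for the real CPU and memory collectors, and a Linux-style &lt;code>fork&lt;/code> start method is assumed):&lt;/p>

```python
import multiprocessing as mp
import time
from queue import Empty

def cpu_sampler():
    return 0.5  # stand-in; real code would read CPU usage from /proc/stat

def mem_sampler():
    return 1024  # stand-in; real code would read memory from /proc/meminfo

def monitor(name, sampler, queue, stop, interval=0.05):
    """Runs in a dedicated background process: sample one metric until stopped."""
    while not stop.is_set():
        queue.put((name, time.time(), sampler()))
        time.sleep(interval)

def run_monitors(samplers, duration=0.3):
    """Spawn one process per metric, collect samples for `duration` seconds."""
    ctx = mp.get_context("fork")  # Linux start method assumed
    queue, stop = ctx.Queue(), ctx.Event()
    procs = [ctx.Process(target=monitor, args=(name, fn, queue, stop))
             for name, fn in samplers.items()]
    for p in procs:
        p.start()
    time.sleep(duration)
    stop.set()
    for p in procs:
        p.join()
    samples = []
    while True:
        try:
            samples.append(queue.get(timeout=0.2))
        except Empty:
            break
    return samples
```

&lt;p>Calling &lt;code>run_monitors({"cpu": cpu_sampler, "mem": mem_sampler})&lt;/code> returns timestamped samples for both metrics; the real platform swaps in user-configured samplers for each metric.&lt;/p>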
&lt;p>Currently, we are facing a challenge related to monitoring metrics. Since different users have varying monitoring requirements, it is impractical for us to manually write monitoring solutions for each user. To address this issue, we are actively working on implementing a pluggable design that allows users to configure their own monitoring preferences.&lt;/p>
&lt;p>Our approach involves providing users with the flexibility to define their custom configuration files or write monitoring programs following our documented guidelines. This way, users can specify the specific metrics they wish to monitor and tailor the monitoring process to their individual needs.&lt;/p>
&lt;h2 id="try-it-out">Try it Out!&lt;/h2>
&lt;p>As mentioned earlier, we have completed the core functionalities of our platform, and we would love to have you try it out and provide us with valuable feedback. Here are the links to our repositories where you can explore and experiment with our platform:&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentGUI" target="_blank" rel="noopener">GUI Repository&lt;/a> and &lt;a href="https://github.com/PublicExperimentDatabase/PublicExperimentCLI" target="_blank" rel="noopener">CLI Repository&lt;/a>
&lt;ul>
&lt;li>In the README.md file of GUI repo, you will find detailed installation instructions to set up the Graphical User Interface (GUI). Follow the steps provided to get started with our platform.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;a href="https://github.com/PublicExperimentDatabase/test-experiment" target="_blank" rel="noopener">Sample Repository&lt;/a>
&lt;ul>
&lt;li>In this repository, we have included scripts that allow you to run our program. Additionally, you can use these scripts as templates to monitor your own programs according to your specific requirements.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;p>We welcome you to take the platform for a test drive and feel free to raise any issues you encounter during the installation process. Your feedback is invaluable to us, as it helps us identify and address any potential installation challenges and improve the user experience.&lt;/p></description></item><item><title>Enhancing Drift Detection through Fine-Tuning Llama2</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/</link><pubDate>Sun, 30 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/</guid><description>&lt;p>Greetings everyone, I&amp;rsquo;m Kangrui. Over the past few weeks, we&amp;rsquo;ve dedicated our efforts and have consequently made significant progress in our drift detection methods. Now, I&amp;rsquo;m excited to present to you a detailed elaboration on how we prompted and fine-tuned Llama2 to efficiently carry out the drift detection task.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;h3 id="why-llm-in-drift-detection-method">Why LLM in drift detection method?&lt;/h3>
&lt;p>The use of large language models (LLMs) in drift detection methods presents numerous benefits that place it as a prominent solution in this domain.&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Rapid Development:&lt;/strong> LLMs are in the vanguard of technological advancement. This field is evolving rapidly with continuous enhancements in model architecture, training techniques, and data handling. With every new version, these models are showing an increasing capacity to understand and generate human-like text, pushing the limits of what is achievable in Natural Language Processing (NLP) and Artificial Intelligence (AI) as a whole.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Superior Performance:&lt;/strong> Traditional drift detection methodologies such as Page-Hinkley, EDDM, and HDDM have their merits and have found success in numerous scenarios. Even Deep Learning (DL) techniques, like training a predictive model based on error rates, have made significant strides in the field. However, when handling complex, high-dimensional, and real-time data, LLMs have demonstrated exceptional results. They are not only able to effectively predict and respond to drifts but also adapt to new trends more swiftly. Our experiments using LLMs like GPT-3.5-turbo have yielded impressive results, notably outperforming other methods.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="GPT-3.5-turbo Performance" srcset="
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_986c57531b096aac2ea5604c7942efed.webp 400w,
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_534b4ca0b9e767d820ed9b45d754db9f.webp 760w,
/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/gpt-3.5-performance_hudb1929583c62f83e6182026371c0950a_147441_986c57531b096aac2ea5604c7942efed.webp"
width="760"
height="303"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;em>Fig. 1: Concept drifts detected by GPT-3.5-turbo in the Cori dataset&lt;/em>&lt;/p>
&lt;ol start="3">
&lt;li>&lt;strong>Flexibility:&lt;/strong> One of the major advantages of using LLMs is their flexibility in dealing with different types of input and output. In contrast to traditional methods, which are confined to single feature concept drift detection and can only process numerical values, LLMs can handle a range of input types including text, numbers, and more complex data structures. This capability allows them to detect multi-feature concept drifts, thereby broadening the scope and complexity of problems they can tackle. Moreover, the generation capability of LLMs can provide rich and detailed output, facilitating more comprehensive insights into the detected drifts.&lt;/li>
&lt;/ol>
&lt;h2 id="why-llama2-in-drift-detection-method">Why Llama2 in drift detection method?&lt;/h2>
&lt;p>Llama2 presents a series of advantages that make it an excellent choice for applying LLMs to drift detection. Here&amp;rsquo;s a breakdown of the key reasons:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Performance Guarantee:&lt;/strong> As a newly released model, Llama2 has undergone extensive development and testing, providing a reliable guarantee of performance. It represents the cutting edge in AI technology, having benefited from the latest research and advancements in language model design.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Accessibility Guarantee:&lt;/strong> One significant advantage of Llama2 is that it is open-source. It is readily accessible on HuggingFace, which also provides a range of mature tools to fine-tune and deploy the model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Flexibility for Fine-Tuning:&lt;/strong> Llama2 comes in different sizes, such as 7B, 13B, and 70B parameters, which allows for flexibility in model selection based on the task&amp;rsquo;s requirements and computational resources.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;h2 id="data">Data&lt;/h2>
&lt;h3 id="dataset">Dataset&lt;/h3>
&lt;p>In our study, we employed &lt;a href="https://github.com/alipsgh/data-streams" target="_blank" rel="noopener">Synthetic data streams&lt;/a> for the fine-tuning of Llama2. Synthetic data streams serve as an invaluable resource for controlled experiments in the domain of drift detection. These curated datasets encompass varied types of drifts, providing us with the capability to assess the efficacy of our detection algorithms under diverse scenarios.&lt;/p>
&lt;p>Here is a brief introduction to the synthetic datasets we used:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Sine1 &amp;amp; Sine2:&lt;/strong> These datasets induce abrupt concept drift within a two-dimensional feature space. The classification rule, a sine function, dictates the instance labels, which are flipped at every drift point.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Mixed:&lt;/strong> This dataset, characterized by its combination of numeric and boolean features, uses a composite classification rule. The abrupt concept drift is simulated via a periodic reversal of class labels.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Stagger:&lt;/strong> This categorical dataset incorporates abrupt concept drift by periodically altering the classification rules tied to the features.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Circles &amp;amp; LED:&lt;/strong> These datasets are designed to simulate gradual concept drift. In Circles, the classification of instances is determined by their spatial relation to specific circles. LED imitates a seven-segment digit display, introducing drift by interchanging the pertinent attributes.&lt;/p>
&lt;/li>
&lt;/ol>
&lt;p>Typically, the synthetic datasets contain 100,000 or 1,000,000 instances. Concept drift occurs every 25,000 or 33,333 instances, portraying either abrupt concept drift (with a drifting period of 50 instances) or gradual concept drift (with a drifting period of 500 instances).&lt;/p>
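&lt;p>As a rough, illustrative sketch only (not the actual generator from the data-streams repository; the function name and parameters are our own), a SINE1-style stream with labels flipped at every drift point could look like this:&lt;/p>

```python
# Hypothetical sketch of a SINE1-style stream (our own naming, not the
# original generator): labels depend on whether y lies below sin(x),
# and the rule is flipped after every drift point.
import math
import random

def sine1_stream(n=100_000, drift_every=25_000, seed=0):
    rng = random.Random(seed)
    stream = []
    for i in range(n):
        x, y = rng.random(), rng.random()
        below = math.sin(x) > y                  # instance below the sine curve
        flipped = (i // drift_every) % 2 == 1    # rule reverses after each drift
        label = 'p' if below != flipped else 'n'
        stream.append([[round(x, 2), round(y, 2)], label])
    return stream

stream = sine1_stream(n=100, drift_every=25)
```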
&lt;h3 id="data-preprocessing-and-metrics">Data Preprocessing and Metrics&lt;/h3>
&lt;p>Given the token limit of Llama2 and the specific requirements of our project, we needed to transform the data into an appropriate format.&lt;/p>
&lt;p>As such, we processed each data stream into three sections: the &amp;lsquo;undrifted&amp;rsquo; period, the &amp;lsquo;drifting&amp;rsquo; period, and the &amp;lsquo;drifted&amp;rsquo; period. All instances in each section were randomly and independently drawn from the original data stream, summing up to a maximum of 100 instances. The number of instances for the undrifted and drifted periods ranged from 20 to 50, and for the drifting period, it ranged from 10 to 20.&lt;/p>
&lt;p>For instance, let&amp;rsquo;s consider a dataset containing 100,000 instances where the concept drift occurs every 25,000 instances, causing abrupt concept drift. To format a data point, we could draw 20 to 50 instances from the first 25,000 as the undrifted period. Then, we could draw 10 to 20 instances from the 25,001st to 25,050th instance as the drifting period. Finally, we would draw 10 to min(100 - num(undrifted period) - num(drifting period), 50) from the 25,051st to 50,050th instance as the drifted period. This newly formatted data stream would then be fed into Llama2.&lt;/p>
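&lt;p>The sampling scheme above can be sketched as follows; this is a hedged reconstruction, and the function and key names are ours, not those of the actual preprocessing code:&lt;/p>

```python
# Hedged reconstruction of the preprocessing described above; function and
# key names are ours. Draws the undrifted, drifting, and drifted periods
# from an input stream whose drift starts at index drift_start.
import random

def make_data_point(stream, drift_start=25_000, drift_len=50, cap=100, seed=0):
    rng = random.Random(seed)
    n_before = rng.randint(20, 50)                      # undrifted period
    n_during = rng.randint(10, 20)                      # drifting period
    n_after = rng.randint(10, min(cap - n_before - n_during, 50))
    before = rng.sample(stream[:drift_start], n_before)
    during = rng.sample(stream[drift_start:drift_start + drift_len], n_during)
    after = rng.sample(stream[drift_start + drift_len:2 * drift_start + drift_len], n_after)
    return {
        "before_period": [0, n_before - 1],
        "transition_period": [n_before, n_before + n_during - 1],
        "after_period": [n_before + n_during, n_before + n_during + n_after - 1],
        "data_stream": before + during + after,
    }
```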
&lt;p>We also included some additional information to assist Llama2&amp;rsquo;s inference process. A typical data point in our processed dataset includes:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-python" data-lang="python">&lt;span class="line">&lt;span class="cl">&lt;span class="p">{&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;before_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">0&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">31&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;transition_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">32&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">38&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;after_period&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">39&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">59&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;before_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">196&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">19963&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;transition_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">20002&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">20030&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;after_index&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="p">[&lt;/span>&lt;span class="mi">20310&lt;/span>&lt;span class="p">,&lt;/span> &lt;span class="mi">39984&lt;/span>&lt;span class="p">],&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;meta&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="s2">&amp;#34;Dataset: MIXED&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">v&amp;#39;s type is nominal, range is (&amp;#39;False&amp;#39;, &amp;#39;True&amp;#39;)&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">w&amp;#39;s type is nominal, range is (&amp;#39;False&amp;#39;, &amp;#39;True&amp;#39;)&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">x&amp;#39;s type is numeric&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">y&amp;#39;s type is numeric&lt;/span>&lt;span class="se">\n\t&lt;/span>&lt;span class="s2">class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)&lt;/span>&lt;span class="se">\n&lt;/span>&lt;span class="s2">&amp;#34;&lt;/span>&lt;span class="p">,&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &lt;span class="s2">&amp;#34;data_stream&amp;#34;&lt;/span>&lt;span class="p">:&lt;/span> &lt;span class="o">...&lt;/span>
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="p">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>From this dictionary, the &amp;ldquo;meta&amp;rdquo; and &amp;ldquo;data_stream&amp;rdquo; entries are fed into Llama2. The &amp;ldquo;transition_period&amp;rdquo; serves as the criterion: if Llama2&amp;rsquo;s answer lies within the &amp;ldquo;transition_period&amp;rdquo;, we deem it correct.&lt;/p>
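&lt;p>This correctness criterion amounts to a one-line check (a minimal sketch, with names of our choosing):&lt;/p>

```python
# Minimal helper for the criterion above: a predicted index counts as
# correct when it falls inside the labelled transition period.
def is_correct(answer_index, transition_period):
    lo, hi = transition_period
    return answer_index in range(lo, hi + 1)

is_correct(35, [32, 38])   # True: 35 lies inside the drifting period
is_correct(31, [32, 38])   # False: still in the undrifted period
```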
&lt;h2 id="llama2">Llama2&lt;/h2>
&lt;h3 id="inference">Inference&lt;/h3>
&lt;p>We experimented with three variations of prompts during the inference phase.&lt;/p>
&lt;p>&lt;strong>Prompt Version 1:&lt;/strong>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">[INST] &amp;lt;&amp;lt;SYS&amp;gt;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> You are a helpful, respectful, and honest assistant. Always provide the most helpful responses possible while ensuring safety. Ensure that your responses are socially unbiased, positive, and free from harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. If a question lacks coherence or sense, explain why instead of providing incorrect information. If you are uncertain about an answer, refrain from sharing false information.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;&amp;lt;/SYS&amp;gt;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Your task is to identify the index in a given data stream where the relationship between the features and labels begins to change. The data stream is formatted as a list, with each element being a two-element list: the first represents the features (also a list), and the second is the label. If your answer is &amp;#39;x&amp;#39;, it indicates that the data pattern starts shifting at the xth data point in the stream.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Here&amp;#39;s an example of the data&amp;#39;s metadata: Dataset: SINE1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> x&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> y&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> The given data stream is: [[[0.7, 0.07], &amp;#39;p&amp;#39;], [[0.45, 0.78], &amp;#39;n&amp;#39;], ..., [[0.64, 0.45], &amp;#39;n&amp;#39;]]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> Your task is to respond with a single index. No additional information is required.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[/INST]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Prompt Version 2:&lt;/strong>&lt;/p>
&lt;p>The same as Prompt 1, but with a specific range for the index response:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">Please provide an index ranging from 0 to 96. No additional information is required.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Prompt Version 3:&lt;/strong>&lt;/p>
&lt;p>This prompt uses an instruction-input-output design, which we adopted for fine-tuning:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">Below is an instruction paired with an input that provides further context. Write a response that appropriately completes the request.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Instruction:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Identify the index in a given data stream where the relationship between features and labels begins to change. The data stream is formatted as a list, each element being a two-element list: the first represents the features (also a list), and the second is the label. For instance, if the response is &amp;#39;x&amp;#39;, it means that the data pattern starts shifting at the xth data point in the stream. Only respond with an index, no further information is necessary.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Input:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Meta Data:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Dataset: SINE1
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> x&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> y&amp;#39;s type is numeric
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> class&amp;#39;s type is nominal, range is (&amp;#39;p&amp;#39;, &amp;#39;n&amp;#39;)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">Data stream:
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">[[[0.7, 0.07], &amp;#39;p&amp;#39;], [[0.45, 0.78], &amp;#39;n&amp;#39;], .., [[0.64, 0.45], &amp;#39;n&amp;#39;]]
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">### Response:
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Despite minor differences between Prompt Version 1 and Version 2, both suggested by Meta, the results varied significantly, a topic we will delve into in the following section. Prompt Version 3, employing the instruction-input-output structure, was used during our fine-tuning process.&lt;/p>
&lt;h3 id="fine-tuning">Fine-Tuning&lt;/h3>
&lt;p>We utilized the tools provided by &lt;a href="https://github.com/facebookresearch/llama-recipes" target="_blank" rel="noopener">llama-recipes&lt;/a> to fine-tune Llama2. The key command used to initiate the fine-tuning process is illustrated below:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-shell" data-lang="shell">&lt;span class="line">&lt;span class="cl">python llama_finetuning.py --use_peft &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --peft_method lora &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --quantization &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --model_name meta-llama/Llama-2-13b-chat-hf &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --output_dir ./fine_tuned_model/Llama-2-13b-chat-hf-test_finetune &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --dataset alpaca_dataset &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --batch_size_training &lt;span class="m">40&lt;/span> &lt;span class="se">\
&lt;/span>&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&lt;span class="se">&lt;/span> --num_epochs &lt;span class="m">1&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Some explanation of the parameters:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-text" data-lang="text">&lt;span class="line">&lt;span class="cl">--use_peft: This flag indicates the use of the Parameter-Efficient Fine-Tuning (PEFT) method. PEFT allows us to fine-tune the model more efficiently.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--peft_method lora: Here, we specify that LoRA (Low-Rank Adaptation) should be used as the PEFT method.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--quantization: This flag reduces the memory footprint of the model by loading its weights at reduced precision.
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">--dataset alpaca_dataset: Specifies the dataset setting used for fine-tuning, in this case, the &amp;#39;alpaca_dataset&amp;#39; indicates the instruction-input-output structure for fine-tuning.
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="results">Results&lt;/h2>
&lt;p>The performance of various models and prompt versions is depicted in Fig. 2.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="All Performance" srcset="
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_f4b54b1d163428a3bbdd2373c5e7d6c6.webp 400w,
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_ba09d14d8674a9735bf9bb60ce301dae.webp 760w,
/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230730-kangrui/performance_plot_hu026976f577cb17db71cb82cd3675225d_101027_f4b54b1d163428a3bbdd2373c5e7d6c6.webp"
width="760"
height="608"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;em>Fig. 2: Performance comparison of different models and prompt versions.&lt;/em>&lt;/p>
&lt;p>It is evident from the results that the design of the prompt has a significant impact on Llama2&amp;rsquo;s performance. Furthermore, due to computational resource constraints, we have only managed to fine-tune Llama2 on a portion of our dataset (approximately 1,000 instances). The entire training set consists of 19,000 instances, and the test set includes 5,000 instances. Despite these limitations, a performance increase is noticeable after fine-tuning.&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training -- Interim Blog Post</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/</link><pubDate>Sun, 30 Jul 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;h4 id="motivation">Motivation&lt;/h4>
&lt;p>The growing popularity of Deep Neural Networks has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects. Therefore, building an emulator can ameliorate the trouble of reserving GPUs, and the emulator can be modified to gather the profiles needed for optimization much quicker.&lt;/p>
&lt;h4 id="overture">Overture&lt;/h4>
&lt;p>The following sections introduce the completed tasks and detail each one. The contents are briefly summarized, presenting only the necessary information. We finished the following tasks:&lt;/p>
&lt;ul>
&lt;li>Literature Review&lt;/li>
&lt;li>Emulator implementation:
&lt;ul>
&lt;li>Time Profiling&lt;/li>
&lt;li>Pinned Memory&lt;/li>
&lt;li>Inter-GPUs Computation&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>Reproducing Figures&lt;/li>
&lt;/ul>
&lt;p>I will introduce each of them and explain its importance.&lt;/p>
&lt;h2 id="tasks--reason">Tasks + Reason&lt;/h2>
&lt;h4 id="literature-review">Literature Review&lt;/h4>
&lt;p>While waiting for the measurements, I started reading other GPU-related papers, especially those about GPU schedulers. We found that besides emulating computation and transfer time, we should also emulate the GPU memory profile in order to reproduce certain papers. Fortunately, this is doable: without actually using a GPU, we can emulate many aspects of it beyond just its timing. I found several papers that are theoretically reproducible, but they use TensorFlow while my current work targets PyTorch, so I need to keep looking for ones that use PyTorch.&lt;/p>
&lt;p>Afterwards, we did more paper reviews, looking over papers on GPU scheduling from 2018-2023 to see whether we could reproduce figures from other papers. We went over 150 papers searching for ones with a PyTorch implementation and an accompanying GitHub repository. We found about 15 papers built in PyTorch, 6 of which were published on GitHub.&lt;/p>
&lt;p>We found the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs” and its GitHub page. The paper has three badges of “Artifacts Available, Evaluated, and Reproduced.” The paper’s content is implemented in PyTorch which means we can probably emulate this paper’s result with the emulator we already have by adding more features. We have started testing out to see if we can set up a similar environment and reproduce the experiments in the paper. After checking out the reproducibility of the paper, we will try to reproduce it using our emulator, and we might add new features to our emulator during this process.&lt;/p>
&lt;p>Firstly, I tried to reproduce the figures in the paper “CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs”, but stopped after a considerable number of attempts because the README was incomplete and too hard to follow. I first headed to the paper’s GitHub repository. From the paper, I understood that GNN training differs from regular deep-learning training because of input irregularity, and that CoGNN’s algorithm helps schedule the jobs onto machines more effectively. However, when I tried to install the software following their environment README, I ran into many dependency issues, and barely any of the required packages installed successfully. The README in the software module was also unclear about how to run the experiments, and following the experiment setup did not give me the expected results. After struggling to complete even one suggested experiment, we decided to move on to other papers and abandoned this one, which reminded me of the importance of reproducibility once again.&lt;/p>
&lt;p>Secondly, we found another paper, “Beware of Fragmentation: Scheduling GPU-Sharing Workloads with Fragmentation Gradient Descent”. After reading it, we figured out that its main focus is distributing node resources (CPU, GPU) to the jobs dispatched by the Kubernetes scheduler, yielding less GPU fragmentation and higher resource utilization. The paper used a simulator to model a large number of nodes and ran the jobs in simulation. I successfully ran the experiments demonstrated in the repo and even created a smaller sample so that we could obtain results faster, because their original experiment runs 1,020 times, which would take about a month. However, as we dug deeper into the paper, we realized that their emulator is not a “real” one: although it is built on Kubernetes, the part used to create the figures is a pure simulator, which does not fit our goal of emulating only the GPU-related parts while running the rest of the system for real.&lt;/p>
&lt;h5 id="reason">Reason:&lt;/h5>
&lt;p>The purpose is to figure out which papers can be reproduced using the emulator, and what other features are needed for the emulator to work.&lt;/p>
&lt;h4 id="emulator-implementation">Emulator implementation&lt;/h4>
&lt;h5 id="time-profiling">Time Profiling&lt;/h5>
&lt;p>I profiled the performance of different GPUs, including CPU-to-GPU data transfer time and GPU computation time. These two quantities are fairly constant on a given GPU, so they can be emulated easily: profile once, then replay the profile in the emulation. We did this for 6 different GPUs: k80, rtx6000, m40, a100pcie, v100, and p100.&lt;/p>
&lt;p>After collecting the performance profiles of a few types of GPU nodes, I implemented the first naive version of the emulator. I used the recorded profile and the sleep() function to reproduce the amount of time each step needs. Since the time also varies with the operation issued, some simple arithmetic on the profiled numbers was implemented as well. The emulator runs on a CPU node, yet it reports the time profile of a GPU just as a real GPU node would.&lt;/p>
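&lt;p>A toy version of this naive emulator might look like the following; the profile numbers below are invented placeholders, not our measured values:&lt;/p>

```python
# Toy sketch of the naive timing emulator: profiled per-batch times (the
# numbers below are invented placeholders, not our measured profiles) are
# replayed with time.sleep() so CPU-only code exhibits GPU-like timing.
import time

PROFILE = {  # hypothetical seconds per batch of size 128
    ("p100", "resnet18"): {"h2d_copy": 0.012, "compute": 0.045},
}

def emulated_step(gpu, model, batch_size):
    entry = PROFILE[(gpu, model)]
    scale = batch_size / 128        # simple arithmetic on the profiled numbers
    for phase in ("h2d_copy", "compute"):
        time.sleep(entry[phase] * scale)

start = time.perf_counter()
emulated_step("p100", "resnet18", batch_size=128)
elapsed = time.perf_counter() - start
```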
&lt;h5 id="reason-1">Reason:&lt;/h5>
&lt;p>The time profile collected can be compared with Data Wait Time to conduct research on minimizing pipeline stall across different GPUs and models.&lt;/p>
&lt;h5 id="pinned-memory">Pinned Memory&lt;/h5>
&lt;p>GPU-based PyTorch uses pin-memory threads to copy data from shared memory (SHM) to pinned memory, but CPU-based PyTorch does not, so I needed to implement an emulation of these threads. Fortunately, the data copy time is predictable: I have already found that the pin-memory time has little to do with the number of workers or the model type, but only with the batch size. I still need to find out whether it depends on the GPU node, which I assume it does not at this point.&lt;/p>
&lt;p>While implementing the features, we first emulated the CPU-to-GPU transfer time and GPU computation time for the p100 GPU based on the profiled information. Another CUDA behavior that requires emulation is copying data from shared memory to pinned memory, so we measured and emulated the time of this copy. However, the emulator did not behave exactly like the real GPU, because we had emulated only the time cost of pinned memory, not its memory cost. To resolve this, we wrote a CPython module that manually allocates page-locked memory (which behaves the same as CUDA’s pinned_memory). Once this mechanism was in place, the emulator’s fundamental functions were complete and properly mimicked CUDA’s behaviors.&lt;/p>
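&lt;p>For illustration, the timing side of the pin-memory emulation can be modeled as a simple function of batch size, consistent with the observation that it depends chiefly on batch size; the coefficients here are invented placeholders, not fitted values:&lt;/p>

```python
# Illustrative-only linear model of pinned-memory copy time: per the
# measurements above it depends chiefly on batch size. The coefficients
# are invented placeholders, not fitted values.
PIN_BASE_US = 50.0          # hypothetical fixed overhead (microseconds)
PIN_US_PER_SAMPLE = 8.0     # hypothetical per-sample cost (microseconds)

def pin_copy_time_us(batch_size):
    return PIN_BASE_US + PIN_US_PER_SAMPLE * batch_size

pin_copy_time_us(128)   # 1074.0 microseconds under these placeholder numbers
```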
&lt;h5 id="reason-2">Reason:&lt;/h5>
&lt;p>After collecting the GPU profile, I did a comparison with the actual GPU but noticed some differences in their IO time, meaning there was a difference between the emulation-based Pytorch and the actual GPU-based Pytorch.&lt;/p>
&lt;h5 id="inter-gpus-computation">Inter-GPUs Computation&lt;/h5>
&lt;p>We worked on emulating inter-GPU computation time in order to reproduce Figure 9 in the DNN stall paper. This is one of the influential factors in multi-GPU training, so we decided to figure out how to implement this feature first. As the paper claims, the larger the batch size, the less time it takes to update the model; the overheads are proportionally larger at smaller batch sizes. Our current emulator, however, would report the same computation time, since we had not yet added features to emulate inter-GPU behavior. The first step was to rent a lease with 2 GPUs and observe the effect of inter-GPU communication on computation time. We found a small amount of overhead when running two GPUs instead of one on the p100 node. My job was to find out where and how these overheads arise and to emulate them in order to reproduce Figure 9. We used resnet18, 4 workers, and 10 batches to run a batch size of 128 on 1 GPU (Group A) and a batch size of 256 on 2 GPUs (Group B). With our current emulator, both experiments would report the same time to finish one batch; in reality, the computation time of Group B was longer than that of Group A, indicating overheads in computation time. I then dug into the PyTorch source code and successfully identified one contributor to this overhead.&lt;/p>
&lt;h5 id="reason-3">Reason:&lt;/h5>
&lt;p>To make the emulator more complete, so that it provides accurate emulation even when using more than 1 GPU on a machine.&lt;/p>
&lt;h4 id="reproducing-figures">Reproducing Figures&lt;/h4>
&lt;p>After implementing the emulator, we managed to use it to reproduce Figures 3, 4, 5, and 6 in the paper &lt;a href="https://vldb.org/pvldb/vol14/p771-mohan.pdf">“Analyzing and Mitigating Data Stalls in DNN Training”&lt;/a> after a series of experiments and tests. Some environments in the paper were not identical to what we ran in the past week, but the general patterns matched the expected hypothesis and measurements. We double-checked all the data and figures produced, found that our prototype meets our expectations, and concluded it was time to look for other papers to reproduce to make the emulator more interesting.
The original and the reproduced figures are shown below; you can see that the patterns reflect our expected results:
Original Figure 3:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure3" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_b55e965312579f5be79be0c6d21c853a.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_def01bfdab18fc7d262d08c4c2388828.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure3_hu4825ad7506f6235ff41682e84b760224_101661_b55e965312579f5be79be0c6d21c853a.webp"
width="710"
height="399"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 3:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure3" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_8b936fcb3ddfc9bb3592d7628c1f8641.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_7e97d9d32a7ef0dc9e6fd3817d59f028.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure3_hu7ba68c9ce0c1781ae4d515ec33f3be68_90176_8b936fcb3ddfc9bb3592d7628c1f8641.webp"
width="676"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 4:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure4" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_981ac3e3445c520fd3934870e4eddab4.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_d783b7ce2dba6c9c3d988fc836f9f0ca.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure4_huf7912859d14522ea127e57f50e29e6e8_97339_981ac3e3445c520fd3934870e4eddab4.webp"
width="687"
height="390"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 4:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_05469d8674723f8da1bb12f8f7e3e989.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_60c661b4eb6240ac2765c6540bcaf26c.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4_hubf1eec72de057c79eda887bdba386155_82657_05469d8674723f8da1bb12f8f7e3e989.webp"
width="695"
height="760"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure4a" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_6b8b84d9f461863c0223d3d2992f1557.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_732e04a2227ba54fb9fb64e0dbe90717.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure4a_huce40b0ef347fbe8c5dda7cd64b89a82a_83431_6b8b84d9f461863c0223d3d2992f1557.webp"
width="687"
height="701"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 5:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure5" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_c74d4ac38263a1767beef877888c7a0e.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_01b1945b0dc6c47dee7de7ade64b50bc.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure5_hua788af616c21c63d77c26ece571c44f2_48006_c74d4ac38263a1767beef877888c7a0e.webp"
width="549"
height="299"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 5:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure5" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_fdba37e325a306223f6a391e9a69a4ac.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_8c9f78e75740fb2681a32c842527250b.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure5_hu19c2897350abb8738bf9073d8758691e_57335_fdba37e325a306223f6a391e9a69a4ac.webp"
width="719"
height="750"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Original Figure 6:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="original_figure6" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_a24c7a3b53c18cb5a6f22a52536fd86f.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_5f6250df82bf4852c73c925bc3934b14.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/original_figure6_hu28381534680632c929008cb2ca5db00c_55137_a24c7a3b53c18cb5a6f22a52536fd86f.webp"
width="527"
height="304"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
Reproduced Figure 6:
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="reproduced_figure6" srcset="
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1c46178f0a8f8ccef8efa61f5fe40809.webp 400w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_97e1110894b3975f840ad24bcbc0df12.webp 760w,
/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230730-haoranwu/reproduced_figure6_hu932abfdff2db3be7f25d6ea6fa38efc4_57111_1c46178f0a8f8ccef8efa61f5fe40809.webp"
width="760"
height="626"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;h5 id="reason-4">Reason:&lt;/h5>
&lt;p>Our original goal was to reproduce papers, so reproducing their figures is an important step toward that goal.&lt;/p>
&lt;h2 id="summary--coming-future">Summary + Coming Future&lt;/h2>
&lt;p>We will keep working to complete the emulator and to pin down the exact mechanisms the implementation requires. We will also explore whether additional, more capable features can be added to the emulator.&lt;/p></description></item><item><title>Reproducible Analysis &amp; Models for Predicting Genomics Workflow Execution Time</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</link><pubDate>Fri, 16 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uga/genomicswfmodels/20230616-charishulu/</guid><description>&lt;p>Hi! I&amp;rsquo;m Charis, an undergraduate student in the IT and Big Data Analytics program at the Calvin Institute of Technology. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uga/genomicswfmodels/">Reproducible Analysis &amp;amp; Models for Predicting Genomics Workflow Execution Time&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1dFkC2A0HUVaWd6NpCbTjRZVfYxQ7jRxJ/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/in-kee-kim/">In Kee Kim&lt;/a> and &lt;strong>Martin Putra&lt;/strong>, aims to gain insight into features that are highly correlated with the execution times of genomics workflows and to build machine learning models for predicting workflow execution time.&lt;/p>
&lt;p>Genomics workflows exhibit a long-tail pattern in their execution times. According to the previous project team&amp;rsquo;s findings, approximately 2% of genomics workflows had execution times of up to 15 times the median, resulting in weeks of execution. Interestingly, input quality appears to play a role in these execution time differences. Therefore, we will analyze features such as the quality of the input data and the amount of resources allocated during execution to find the features that correlate with execution time. Based on these features, we will build a machine learning model that can predict the execution time of genomics workflows.&lt;/p>
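As a minimal illustration of the kind of model we have in mind, the sketch below fits an ordinary-least-squares predictor on synthetic workflow features. The feature names and coefficients are invented for illustration only; the real study will use measured features and may need nonlinear models.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical features: input size (GB), mean read quality, CPU cores allocated.
X = np.column_stack([
    rng.uniform(1, 100, n),   # input_size_gb
    rng.uniform(20, 40, n),   # mean_read_quality
    rng.integers(1, 33, n),   # cpu_cores
])

# Synthetic ground truth: runtime grows with input size, shrinks with
# read quality and core count (made-up coefficients), plus noise.
runtime = (5.0 * X[:, 0] + 2.0 * (40.0 - X[:, 1]) - 3.0 * X[:, 2]
           + 120.0 + rng.normal(0, 5, n))

# Fit a linear model with ordinary least squares (bias column appended).
A = np.column_stack([X, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, runtime, rcond=None)

pred = A @ coef
r2 = 1 - ((runtime - pred) ** 2).sum() / ((runtime - runtime.mean()) ** 2).sum()
print(f"R^2 on training data: {r2:.3f}")
```

In the actual project, the features would come from Slurm accounting data and input-quality metrics collected on Chameleon, rather than synthetic draws.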
&lt;p>By collaborating with Shayantan Banerjee (another contributor) who will study data quality, I will study the system metrics of genomics workflows both at workflow-level and tool-level. Metrics will be collected by running genomics workflows using the Slurm workload manager under various resource allocation conditions. Genomics workflows will be executed on Chameleon clusters of different sizes.&lt;/p></description></item><item><title>GPU Emulator for Easy Reproducibility of DNN Training</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230613-haoranwu/</link><pubDate>Tue, 13 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/utexas/gpuemulator/20230613-haoranwu/</guid><description>&lt;p>Hi! I’m Haoran Wu, a third year at the University of Chicago majoring in Economics and Computer Science. With my &lt;a href="https://docs.google.com/document/d/1CcNbvbNAmY0XkV9ckjHnILdMh92h1wqLUYqpT6qIsZY/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, I’m working on the &lt;a href="https://ospo.ucsc.edu/project/osre23/utexas/gpuemulator" target="_blank" rel="noopener">GPU Emulator for Easy Reproducibility of DNN Training&lt;/a> project with Professor &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/vijay-chidambaram/">Vijay Chidambaram&lt;/a>. A Deep Neural Network (DNN) is an advanced artificial neural network that employs multiple layers to process intricate patterns and relationships within data. It finds applications in various fields such as image and speech recognition, natural language processing, and predictive modeling. The layers in a DNN progressively extract higher-level features from raw input data, enabling the network to learn and generalize patterns effectively.&lt;/p>
&lt;p>The growing popularity of Deep Neural Networks has resulted in a substantial increase in demand for Graphics Processing Units (GPUs). GPUs are crucial for conducting matrix computations in DNN training and inference. However, they are expensive to purchase for personal use, and the limited availability of GPU resources in public research clouds like Chameleon further exacerbates the issue. This scarcity of resources can cause delays in DNN-related research projects.&lt;/p>
&lt;p>Nevertheless, not all DNN research experiments require the use of a GPU. System researchers, for instance, may be primarily interested in performance profiles and not necessarily in the accuracy of training or inference. These researchers might focus on optimizing the storage layer and data loading of DNN training. In such cases, a GPU emulator that accurately replicates GPU behavior without needing a physical GPU can fulfill their requirements. By utilizing a GPU emulator, system researchers can evaluate their system optimizations&amp;rsquo; performance without competing for limited GPU resources in the cloud, thereby avoiding unnecessary delays in their research progress. Our work will eventually be open source and benefit the community.&lt;/p></description></item><item><title>FlashNet: Towards Reproducible Continual Learning for Storage System</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230604-rannnayy/</link><pubDate>Sun, 04 Jun 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230604-rannnayy/</guid><description>&lt;p>Hello! I&amp;rsquo;m Rani, a third-year undergraduate student at Institut Teknologi Bandung majoring in Informatics. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">FlashNet&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1EhJm3kqrpybOkpXiiRMfqVxGeKe9iIsh/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a> and &lt;strong>Daniar Kurniawan&lt;/strong>, aims to implement and optimize the FlashNet model in real-world storage systems using continual learning techniques.&lt;/p>
&lt;p>In real-world workloads, the I/O stream changes over time. As a result, read/write performance varies and can introduce tail latency. We would like to predict the latency of each I/O read so that slow requests can be cut from the tail, improving the system&amp;rsquo;s performance. This project focuses on improving the FlashNet pipeline and introducing adaptability to the machine learning models built.&lt;/p>
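To make the idea concrete, here is a toy sketch (not FlashNet&amp;rsquo;s actual model) in which a synthetic trace correlates read latency with device queue depth, and a simple threshold rule flags likely-slow I/Os before they are issued. All numbers are invented for illustration.

```python
import random

random.seed(42)

# Synthetic trace: latency grows with queue depth plus an exponential tail.
trace = []
for _ in range(10_000):
    qdepth = random.randint(0, 64)
    latency_us = 50 + 15 * qdepth + random.expovariate(1 / 100)
    trace.append((qdepth, latency_us))

SLOW_US = 600  # hypothetical latency SLO: reads above this are "tail" I/Os

def predict_slow(qdepth, threshold=30):
    """Stand-in for the learned model: flag an I/O when the queue is deep."""
    return qdepth > threshold

correct = sum(predict_slow(q) == (lat > SLOW_US) for q, lat in trace)
accuracy = correct / len(trace)
print(f"prediction accuracy: {accuracy:.3f}")
```

FlashNet replaces the threshold rule with a model trained on richer per-I/O features; a read predicted to be slow can then be redirected (for example, to a replica) instead of waiting out the tail.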
&lt;p>During the summer, we plan to implement the continual learning pipeline using the machine learning models we built earlier in the project. Of course, continual learning isn&amp;rsquo;t truly continual without the ability to retrain itself. Thus, we will implement, evaluate, and test several drift detection algorithms. We will also build a visualization platform to evaluate and monitor the performance of the resulting models. Lastly, we plan to create Chameleon Trovi artifacts to demonstrate our experiments and make these implementations available and reproducible to the public.&lt;/p></description></item><item><title>Reproducible Evaluation of Multi-level Erasure Coding</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230531-zhiyanw/</link><pubDate>Wed, 31 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/ornl/multilevelerasure/20230531-zhiyanw/</guid><description>&lt;p>Hi! My name is Alex, an undergraduate student at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/ornl/MultiLevelErasure">Reproducible Evaluation of Multi-level Erasure Coding&lt;/a> project, my &lt;a href="https://docs.google.com/document/d/1dO1aING1QcSB---XklzUjNz0usVh7qWffVGC3GZq2AE/edit?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/john-bent/">John Bent&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/anjus-george/">Anjus George&lt;/a>, aims to build a platform to reproducibly evaluate the performance and durability of MLEC (Multi-Level Erasure Coding) for large-scale storage systems under different design configurations.&lt;/p>
&lt;p>To provide some context, Erasure Coding (EC) is a common approach to protect data from disk failures. Data centers nowadays increasingly use Multi-Level Erasure Coding (MLEC), a newly developed erasure coding method that aims to deal with the drawbacks of Single-Level Erasure Coding (SLEC). Despite its increasing popularity, there have not been many systematic studies to analyze and evaluate MLEC, which is the focus of this project.&lt;/p>
&lt;p>The evaluation will primarily be conducted through simulations, since modifying configurations in a real large-scale system is costly and impractical. The expected deliverables of this project will be:&lt;/p>
&lt;ul>
&lt;li>An MLEC simulator that can reproducibly simulate different configurations of the MLEC system, e.g. coding parameter selection, chunk placement scheme, repair method choice, etc.&lt;/li>
&lt;li>An analysis of the performance and durability tradeoffs between different MLEC design choices based on the evaluation results from the simulation&lt;/li>
&lt;li>Reproduced SLEC evaluation results using existing SLEC simulators&lt;/li>
&lt;li>A comparison between MLEC and SLEC on performance and durability tradeoffs&lt;/li>
&lt;li>Well-written documents and detailed guides on how to reproduce the evaluation results&lt;/li>
&lt;/ul>
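As a hint of what the simulator will estimate, the toy Monte Carlo below compares data-loss rates for a single-level 8+2 stripe against a simplified two-level scheme. The per-period failure probability and the coding parameters are invented for illustration; the real simulator must also model chunk placement, repair, and correlated failures.

```python
import random

random.seed(7)
P_FAIL = 0.01      # hypothetical per-period disk failure probability
TRIALS = 100_000

def stripe_lost(n_disks, parity):
    """One period of one stripe: lost when more than `parity` units fail."""
    failures = sum(random.random() > 1.0 - P_FAIL for _ in range(n_disks))
    return failures > parity

# Single-level 8+2: 10 disks, tolerates up to 2 failures.
slec_losses = sum(stripe_lost(10, 2) for _ in range(TRIALS))

# Toy two-level scheme: a network-level 8+2 stripe over 10 local 4+1 groups;
# a group is lost only when 2 or more of its 5 disks fail.
def mlec_lost():
    groups_lost = sum(stripe_lost(5, 1) for _ in range(10))
    return groups_lost > 2

mlec_losses = sum(mlec_lost() for _ in range(TRIALS))

print(f"SLEC loss rate:     {slec_losses / TRIALS:.2e}")
print(f"toy MLEC loss rate: {mlec_losses / TRIALS:.2e}")
```

Even this crude sketch shows why systematic evaluation matters: the two levels multiply protection, but at the cost of extra parity capacity and more complex repair paths, which is exactly the tradeoff space the simulator will explore.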
&lt;p>Our plan is to build the simulator throughout the summer. We hope our simulator and evaluation results can provide designers of large-scale storage systems with valuable insights for choosing the erasure coding configuration most appropriate to their needs.&lt;/p></description></item><item><title>[FLASHNET]: Leveraging ML-augmented I/O in Linux</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/</link><pubDate>Tue, 30 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/uchicago/flashnet/20230530-justin08784/</guid><description>&lt;p>Hi! I&amp;rsquo;m Justin, an undergraduate at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/uchicago/flashnet">Flashnet&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1gsNaYUYOgdN2ilpyPOmI7jjLeoZh219J/view" target="_blank" rel="noopener">proposal&lt;/a> under the mentorship of
&lt;strong>Daniar Kurniawan&lt;/strong> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/haryadi-s.-gunawi/">Haryadi S. Gunawi&lt;/a>, aims to port the Flashnet model into the Linux kernel.&lt;/p>
&lt;p>In this attempt, I will borrow architecture/design choices from LAKE (to take advantage of its integration of ML-focused hardware acceleration in the kernel) and evaluation criteria from LinnOS to test for model inference accuracy. I also plan to support latency &amp;ldquo;bucket&amp;rdquo; inference output to improve accuracy. Ultimately, my goal is to gain further insight into best practices for integrating ML models into real-life operating systems like Linux and to inform general design choices for the Flashnet pipeline.&lt;/p></description></item><item><title>Automatic Cluster Performance Shifts Detection Toolkit</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230527-kangrui/</link><pubDate>Sat, 27 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/anl/perfdrift/20230527-kangrui/</guid><description>&lt;p>Hi! I am Kangrui, a predoctoral student at the University of Chicago. As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/anl/perfdrift">Automatic Cluster Performance Shifts Detection Toolkit&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1AxpgWLzF3oKTFlD8q6JYS35CxxJ6c76X/view?usp=share_link" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;strong>Sandeep Madireddy&lt;/strong> and &lt;strong>Ray Andrew&lt;/strong>, aims to design a real-time performance shift detection algorithm for high-performance computing clusters with minimal overhead.&lt;/p>
&lt;p>This project focuses on developing a real-time performance shift detection algorithm tailored to heterogeneous workloads, aiming to promptly inform administrators about performance changes. The primary goal is to design an algorithm that efficiently detects shifts in real-time, with minimal system overheads.&lt;/p>
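One standard building block for this kind of detector is the Page-Hinkley test. The sketch below is a generic textbook version (not the project&amp;rsquo;s algorithm): it keeps O(1) state per stream and flags an upward mean shift in synthetic throughput samples.

```python
class PageHinkley:
    """Minimal Page-Hinkley mean-shift detector (upward shifts only)."""

    def __init__(self, delta=0.5, threshold=20.0):
        self.delta = delta          # magnitude of drift to tolerate
        self.threshold = threshold  # alarm threshold
        self.n = 0
        self.mean = 0.0             # running mean of the stream
        self.cum = 0.0              # cumulative deviation from the mean
        self.min_cum = 0.0          # smallest cumulative deviation seen

    def update(self, x):
        """Feed one observation; return True when a shift is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold

# Synthetic metric: stable around 100 for 200 steps, then shifting to 130.
stream = [100.0] * 200 + [130.0] * 100
detector = PageHinkley()
alarm_at = next((t for t, x in enumerate(stream) if detector.update(x)), None)
print(f"shift detected at t = {alarm_at}")
```

In practice, `delta` and `threshold` trade false-alarm rate against detection latency, and a production detector fed by Darshan-style I/O instrumentation would tune them per workload.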
&lt;p>In addition to algorithm development, we plan to enhance the Darshan toolkit&amp;rsquo;s functionality by integrating our algorithm, offering users early performance shift detection. This integration will aid administrators in making informed system utilization and scheduling decisions.&lt;/p>
&lt;p>To promote transparency and reproducibility, we&amp;rsquo;ll package our findings, scripts, and profiling data in Jupyter notebooks shared as Chameleon Trovi artifacts, enabling other researchers to reproduce our experiments easily.&lt;/p>
&lt;p>Looking ahead, we plan to expand the algorithm&amp;rsquo;s applicability to cater to diverse HPC workloads and infrastructures. Other areas of interest include its use in detecting shifts in financial markets or monitoring IoT data streams. Further refinement of our algorithm, to reduce overheads and improve real-time detection capabilities, is also part of our future endeavours. This task may involve evaluating various shift detection methods and noise filtering techniques.&lt;/p></description></item><item><title>Measuring Open-source Database Systems under TPC-C Benchmark with Unreported Settings</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230526-ren.450/</link><pubDate>Thu, 25 May 2023 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre23/osu/missingsettings/20230526-ren.450/</guid><description>&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre23/osu/missingsettings">Measuring Research Prototypes under Unreported Settings&lt;/a> project, my &lt;a href="https://drive.google.com/file/d/1ouFre-qMDCL_LiH5jFNUCOI1yAYHdWcS/view?usp=sharing" target="_blank" rel="noopener">proposal&lt;/a>, under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/yang-wang/">Yang Wang&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/miao-yu/">Miao YU&lt;/a>, aims to understand the impact of missing settings in artifact evaluation.&lt;/p>
&lt;p>The project plans to measure the impact of different missing settings in open-source database systems such as MySQL and PostgreSQL, particularly under the TPC-C benchmark. This requires running experiments with popular settings that are not reported and fixing any problems that arise during the experiments on the target systems. The project will compare performance characteristics and analyze the impact of missing settings on the performance of the target systems.&lt;/p></description></item></channel></rss>