Lihaowen (Jayce) Zhu | UCSC OSPO

Final Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Fri, 16 Aug 2024 00:00:00 +0000

Background

Hello, I’m Lihaowen (Jayce) Zhu, a 2024 SoR contributor for the FEP-bench project, under the mentorship of Yuyang (Roy) Huang. Before we started, let’s recap the goal of our project and our progress until mid term. The FEP-Bench project proposes to address the significant bottlenecks encountered during this phase, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. In order to solve these challenges, we have collected the basic information of various common datasets for different machine learning tasks, and corresponding preprocessing pipelines.

Methodology

Since our goal is to improve the efficiency of the machine learning preprocessing pipeline and keep the training process of the Deep Learning model busy, it means that we need to enhance the preprocessing throughput which is the feed rate from the preprocessing stage to the training stage. According to some previous works, we have a new way to look at the Deep Learning Preprocessing Pipelines. The preprocessing pipeline can be split into 2 parts. The first part contains steps that are run once (S1-Sm). We can call it the “offline” part. The second part includes all the rest steps, which are run at every iteration of training. We call it the ”online” part. After the offline preprocessing steps, the output data is written back to disk. Then the online preprocessing steps need to load that data from storage first and do the following operations. We can split the pipeline at any step, and each split is a preprocessing strategy. By using this method, some specific strategies can achieve a much higher final preprocessing throughput. Our project adopts this method to profile the performance of different strategies. And our goal is to maximize the final preprocessing throughput into training, for a specific pipeline. We want to make this an automatic process, rather than ask for extra user instructions or parameters.

Experiment

Next, we did the data preprocessing strategy experiment on the LibriSpeech dataset, which is an audio dataset for ML tasks like Auto Speech Recognition. The dataset size is 6.3 GB with almost 30000 samples. Each audio file is in a binary format FLAC. As a result, the first step of the preprocessing pipeline we use is decoding, which converts the binary data into arrays of floats. Then we applied some typical audio preprocessing steps of transformation (normalization, padding, extract loudest section) and augmentation (random cut, random shift audio, random mask, random add noise) to audio data. Finally, the audio data is converted to Log-Mel Spectrogram signal, which is commonly used in audio tasks like Speech Recognition and Speaker identification.

We have benchmarked the throughput performance and storage overhead of all possible strategy split points, and have seen some trade-offs between them. Both storage overhead and throughput speed-up use the fully online method as the baseline. What we’ve observed from our results is that the speed-up keeps increasing when we put operations into the offline part, and the storage consumption is very low for the strategies after audio decoding. Also, we analysed the performance of individual methods of transformation and augmentation steps. We find that the speed-up performance is quite stable between 1.0 and 1.2 across these methods, but some methods can have a high storage overhead, like normalization and random noise.

Another thing we observed during our experiments is that different dataset sizes can influence the preprocessing pipeline throughput. We found that the throughput speed-up of 10000 samples is almost double the speed-up of 5000 samples. It seems like a larger dataset size may lead to a higher speed-up. So, we were thinking that does every operation follows this pattern or only certain operations can have increasing throughput with increasing dataset size, and then did experiments about the throughput speed-ups on different dataset sizes of all operations in the audio preprocessing pipeline. The results showed that only the audio decoding step can have a great increase in speed-up for larger dataset sizes. But for transformation, augmentation and LMS, the throughputs always stay at a steady level. This indicates that the only audio decoding step can become faster and faster when the dataset size grows.

Conclusion

In our work, we have built up a collection of common datasets and their preprocessing pipelines for different machine-learning tasks. For the audio dataset LibriSpeech, we have done experiments about the trade-offs between throughput speed-ups and storage overhead, and dataset sizes. We have found that speed-ups keep increasing when more and more operations are divided into the offline part. Only the audio decoding step can become faster and faster when the dataset size grows.

Future works

In the near future, we still want to find the optimal preprocessing strategy by profiling only a small part of the original enormous dataset. The second thing is that besides the audio dataset, we must expand the range of our experiments on other datasets and ML tasks. Finally, we need to implement our goal of building an automatic system that decides the optimal strategy of a preprocessing pipeline.

Mid Term Blog: FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Thu, 18 Jul 2024 00:00:00 +0000

Introduction

Hello, I’m Lihaowen (Jayce) Zhu, a 2024 SoR contributor for the FEP-bench project, under the mentorship of Yuyang (Roy) Huang. The FEP-Bench project proposes to address the significant bottlenecks encountered during this phase, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the required resources of ML projects.

Motivation

Our research project is based on the context of Deep Neural Networks. To train a DNN, we first need a large amount of data. All raw data must be preprocessed by a data preprocessing pipeline, which is specific to different ML tasks. As usual, in a preprocessing pipeline, the data must be loaded from the disk and converted to the correct format, transformed and augmented. And then, it can be fed into the training stage. In common ML training tasks and datasets, the data preprocessing stage can consume almost 65% of the total training time. However, compared with the fast development of computing hardware including GPUs and TPUs, the speed of data preprocessing pipelines has not been improved by a lot and cannot keep up with these hardware innovations, which leads to a bottleneck in the efficiency of Deep Neural Network training.

The bottlenecks can be divided into 2 categories: the data side and the computation side. The data side bottleneck is mainly caused by the data transfer in the system, including data fetching, I/O bound, huge size of data, and complex data format. However, the computation side bottleneck can always happen during data preprocessing operations and data shuffling. For distributed Machine Learning training systems, gathering the distributed data can also lead to the computation side bottleneck.

Current Progress

In order to improve the efficiency of the machine learning preprocessing pipeline, we first need to understand and document the preprocessing workflows commonly used in machine learning, including pipelines of Natural Language Processing, Computer Vision, and Audio datasets. As a result, for the past month, we have built up a collection of common datasets for different machine learning tasks. The dataset types include NLP, CV, Audio, Linear Regression, Video and LiDAR. The machine learning job types are collected based on the dataset types, such as sentiment analysis for NLP, and image classification for CV. The data has either a structured or unstructured format. In addition, our collection contains the following attributes:

Data/Sample size
Typical preprocessing operations
Preprocessing difficulty: hard/easy
Input splittable
Output reusable
CPU/GPU/IO Bound
Dataset and preprocessing links.

By collecting all this data, we can gain an overview of all common preprocessing pipelines in the current machine learning research field, and build up a solid basis for the next phase of our project, which requires hard work on benchmark profiling. For example, for the Audio datasets, we focus on the LibriSpeech dataset. It contains 1000 hours of speech sampled at 16kHz, making it one of the largest publicly available datasets for speech recognition tasks. The typical preprocessing steps of the LibriSpeech dataset include feature extraction, label to integer conversion, and padding.

Challenges

During the first phase of the project, I met a lot of challenges as I had not been exposed to topics similar to this project. The first big problem was that I needed to learn the concepts of some machine learning tasks from scratch, such as NLP, so that I could have a better understanding of the common datasets and pipelines. Also, I needed to deeply review a lot of different preprocessing pipelines for each machine learning task, to make the table more comprehensive.

FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning

Wed, 12 Jun 2024 00:00:00 +0000

Hello, I’m Lihaowen (Jayce) Zhu, currently pursuing my Master of Science in Computer Science at the University of Chicago. I will be spending my summer working on the project FEP-Bench: Benchmarking for Enhanced Feature Engineering and Preprocessing in Machine Learning under the mentorship of Yuyang (Roy) Huang and Swami Sundararaman, my proposal.

The landscape of machine learning (ML) is profoundly impacted by the initial stages of feature engineering and data preprocessing. This phase, critical for the success of ML projects, is often the most time-consuming, representing about 80% of the effort in typical ML workflows. The FEP-Bench project proposes to address the significant bottlenecks encountered during this phase, particularly focusing on the challenges posed by data retrieval from data lakes and computational inefficiencies in data operations. By exploring innovative caching, prefetching, and heuristic strategies, this proposal aims to optimize the preprocessing workflow, thereby enhancing efficiency and reducing the required resources of ML projects.