<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Kyrillos Ishak | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kyrillos-ishak/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kyrillos-ishak/index.xml" rel="self" type="application/rss+xml"/><description>Kyrillos Ishak</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kyrillos-ishak/avatar_hu8f4700b1a96a053cd93e514df7e18caa_103379_270x270_fill_q75_lanczos_center.jpg</url><title>Kyrillos Ishak</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/kyrillos-ishak/</link></image><item><title>Understanding Data Leakage in Machine Learning: A Focus on TF-IDF</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240905-kyrillosishak/</link><pubDate>Thu, 05 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240905-kyrillosishak/</guid><description>&lt;p>Hello again!&lt;/p>
&lt;p>This is my final blog post, and I will be discussing the second set of educational materials I created for the 2024 Summer of Reproducibility Fellowship. As you may recall from my first post, I am working on the &lt;strong>Exploring Data Leakage in Applied ML: Reproducing Examples of Irreproducibility&lt;/strong> project with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> as my mentors.&lt;/p>
&lt;p>This blog post will explore how data leakage can occur during feature extraction, particularly with the commonly used &lt;strong>TF-IDF&lt;/strong> vectorizer, and its impact on model generalization.&lt;/p>
&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;p>In machine learning, data leakage is a critical issue that can severely impact model performance. It occurs when information from outside the training dataset is improperly used to create the model, leading to overly optimistic performance during evaluation. One common source of leakage comes from how features, such as those extracted using &lt;strong>TF-IDF&lt;/strong> (Term Frequency-Inverse Document Frequency), are handled. In this post, we&amp;rsquo;ll explore how data leakage can happen during feature extraction with TF-IDF and how it affects model accuracy.&lt;/p>
&lt;h1 id="what-is-tf-idf">What is TF-IDF?&lt;/h1>
&lt;p>TF-IDF is a method used to evaluate how important a word is in a document relative to a collection of documents. It consists of two components:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Term Frequency (TF)&lt;/strong>: Measures how frequently a term appears in a document.&lt;/li>
&lt;li>&lt;strong>Inverse Document Frequency (IDF)&lt;/strong>: Reduces the importance of terms that appear frequently across many documents.&lt;/li>
&lt;/ol>
&lt;p>Together, they provide a weighted value for each word, reflecting its importance relative to the dataset.&lt;/p>
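&lt;p>As a quick illustration, the tiny corpus below (entirely made up) shows this weighting in scikit-learn: a rare word ends up with a higher IDF than a word that appears in every document.&lt;/p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus: "banana" occurs in one document,
# while "the" occurs in all three.
corpus = [
    "the cat sat on the mat",
    "the dog ate the food",
    "one banana on the table",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

vocab = vectorizer.vocabulary_   # token to column index
idf = vectorizer.idf_            # learned IDF weight per column

# The rare word carries more weight than the ubiquitous one.
print(idf[vocab["banana"]] > idf[vocab["the"]])  # True
```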
&lt;h1 id="how-data-leakage-occurs-with-tf-idf">How Data Leakage Occurs with TF-IDF&lt;/h1>
&lt;p>Data leakage with TF-IDF happens when the inverse document frequency (IDF) is calculated using the entire dataset (including the test set) before splitting it into training and test sets. This means the model has access to information from the test set during training, leading to artificially inflated results. This is a subtle form of data leakage, as it often goes unnoticed.&lt;/p>
&lt;p>For example, if the word &amp;ldquo;banana&amp;rdquo; appears frequently in the test set, including those test documents when computing the IDF lowers the score for &amp;ldquo;banana&amp;rdquo;, so the model learns to downplay its significance during training. As a result, the model may fail to predict correctly when &amp;ldquo;banana&amp;rdquo; matters in the test data.&lt;/p>
&lt;h1 id="why-does-this-matter">Why Does This Matter?&lt;/h1>
&lt;p>If the test data is included when calculating the IDF, the model gains unintended insight into the test set&amp;rsquo;s word distribution. In real-world scenarios, the test data is supposed to be unseen during training. By allowing the model to see this information, you&amp;rsquo;re essentially reducing the uncertainty that the model should have about future data.&lt;/p>
&lt;h1 id="impact-of-data-leakage-on-model-performance">Impact of Data Leakage on Model Performance&lt;/h1>
&lt;p>Let&amp;rsquo;s consider two cases to understand the impact of data leakage in detail:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>When a word is rare in the training set but common in the test set&lt;/strong>: The model will underestimate the importance of this word during training, leading to poor performance when the word is critical in test documents.&lt;/li>
&lt;li>&lt;strong>When a word is common in the training set but rare in the test set&lt;/strong>: The model will overemphasize the word during training, leading to poor predictions when the word doesn’t appear as often in unseen data.&lt;/li>
&lt;/ol>
&lt;h3 id="case-study-data-leakage-in-tf-idf">Case Study: Data Leakage in TF-IDF&lt;/h3>
&lt;p>To see this effect in action, consider a small toy dataset where the presence of the word &amp;ldquo;banana&amp;rdquo; determines the label. If the word &amp;ldquo;banana&amp;rdquo; appears in a sentence, the label is 1; otherwise, the label is 0. Using &lt;strong>TF-IDF&lt;/strong> to vectorize the text, we train a machine learning model to predict this label.&lt;/p>
&lt;p>In the &lt;strong>first scenario&lt;/strong>, we calculate the &lt;strong>TF-IDF&lt;/strong> using the entire dataset before splitting it into training and testing sets. This causes data leakage since the model now knows the distribution of words across both sets. For instance, if &amp;ldquo;banana&amp;rdquo; is more common in the test set than the training set, the &lt;strong>IDF&lt;/strong> score for &amp;ldquo;banana&amp;rdquo; will be lower across the entire dataset, leading the model to downplay its importance.&lt;/p>
&lt;p>In the &lt;strong>second scenario&lt;/strong>, we calculate &lt;strong>TF-IDF&lt;/strong> only on the training set, ensuring that the test set remains unseen. This preserves the integrity of the test set, giving us a more realistic evaluation of the model&amp;rsquo;s performance.&lt;/p>
&lt;p>Between the two scenarios, the model&amp;rsquo;s accuracy differs drastically. When leakage is present, performance looks artificially high during development but drops on genuinely unseen data. Without leakage, the model is evaluated on truly unseen data, so the measured performance is a trustworthy estimate of how well it generalizes.&lt;/p>
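&lt;p>The two scenarios can be sketched in a few lines of scikit-learn; the toy sentences and labels below are illustrative stand-ins, not the exact data from the accompanying material.&lt;/p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Illustrative toy data: label is 1 exactly when "banana" occurs.
sentences = [
    "I ate a banana today", "the weather is nice",
    "banana bread is tasty", "we went for a walk",
    "a ripe banana is sweet", "the meeting ran long",
]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.33, random_state=0)

# Scenario 1 (leaky): IDF statistics are fit on ALL sentences,
# so word frequencies from the test set shape the training features.
leaky = TfidfVectorizer().fit(sentences)

# Scenario 2 (correct): fit on the training sentences only, then
# reuse the fitted vocabulary and IDF weights on the test set.
correct = TfidfVectorizer().fit(X_train)
X_train_ok = correct.transform(X_train)
X_test_ok = correct.transform(X_test)

# The leaky vectorizer has seen every word, including words that
# occur only in held-out sentences.
print(sorted(set(leaky.vocabulary_) - set(correct.vocabulary_)))
```

&lt;p>The only difference between the scenarios is which corpus the vectorizer is fit on; everything downstream of that single call inherits the leak.&lt;/p>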
&lt;h1 id="avoiding-data-leakage">Avoiding Data Leakage&lt;/h1>
&lt;p>Avoiding data leakage is essential for building reliable machine learning models that generalize well to new data. Here are a few guidelines to help prevent leakage:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Split the dataset before feature extraction&lt;/strong>: Always divide your data into training and test sets before applying any feature engineering techniques.&lt;/li>
&lt;li>&lt;strong>Ensure proper cross-validation&lt;/strong>: When using cross-validation, fit any feature extraction inside each fold, on that fold&amp;rsquo;s training portion only, so that no information from the held-out fold leaks into training.&lt;/li>
&lt;li>&lt;strong>Be cautious with time-series data&lt;/strong>: In time-series models, avoid using future data to predict past events, as this can lead to leakage.&lt;/li>
&lt;/ol>
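&lt;p>The first two guidelines can be enforced mechanically by placing feature extraction inside a scikit-learn pipeline, so that cross-validation re-fits it for every fold. A minimal sketch with made-up sentences:&lt;/p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Illustrative toy data, three examples per class.
sentences = [
    "banana smoothie recipe", "fresh banana pancakes", "a green banana",
    "rainy day again", "the train was late", "lunch at noon",
]
labels = [1, 1, 1, 0, 0, 0]

# Because the vectorizer lives inside the pipeline, cross_val_score
# re-fits TF-IDF on the training portion of each fold; the held-out
# fold never influences the IDF statistics it is scored with.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(pipe, sentences, labels, cv=3)
print(len(scores))  # one accuracy score per fold
```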
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Avoiding data leakage is crucial for building robust machine learning models. In the case of TF-IDF, ensuring that feature extraction is done &lt;strong>only on the training set&lt;/strong> and not on the entire dataset is key to preventing leakage. Properly addressing this issue leads to better generalization and more reliable models in real-world applications.&lt;/p>
&lt;p>This blog post provided a case study on how TF-IDF can introduce data leakage and why it&amp;rsquo;s important to carefully handle your dataset before feature extraction. By splitting your data properly and ensuring that no test data &amp;ldquo;leaks&amp;rdquo; into the training process, you can build models that truly reflect real-world performance.&lt;/p>
&lt;p>Thanks for reading!&lt;/p></description></item><item><title>Reproducing and addressing a Data Leakage issue: Duplicates in dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/</link><pubDate>Fri, 23 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/</guid><description>&lt;p>Hello!&lt;/p>
&lt;p>In this blog post, I will explore a common issue in machine learning called data leakage, using an example from the paper:&lt;/p>
&lt;blockquote>
&lt;p>Benedetti, P., Perri, D., Simonetti, M., Gervasi, O., Reali, G., Femminella, M. (2020). Skin Cancer Classification Using Inception Network and Transfer Learning. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2020. ICCSA 2020. Lecture Notes in Computer Science(), vol 12249. Springer, Cham. &lt;a href="https://doi.org/10.1007/978-3-030-58799-4_39" target="_blank" rel="noopener">https://doi.org/10.1007/978-3-030-58799-4_39&lt;/a> &lt;a href="https://arxiv.org/pdf/2111.02402v1" target="_blank" rel="noopener">arXiv&lt;/a>&lt;/p>
&lt;/blockquote>
&lt;h1 id="overview-of-the-paper">Overview of the Paper&lt;/h1>
&lt;p>In this paper, the authors use transfer learning on a pretrained convolutional neural network (CNN) to classify skin lesions in dermatoscopic images from the HAM10000 (Human Against Machine with 10,000 training images) dataset. The paper reports a final accuracy of 78.9% on the validation set.&lt;/p>
&lt;p>While this reported result appears to be impressive, there are concerns regarding the validity of this performance metric due to data leakage. Data leakage occurs when the model is trained or evaluated on data that it would not have access to during real-world deployment, leading to an overestimation of the model&amp;rsquo;s true performance.&lt;/p>
&lt;h1 id="identifying-data-leakage-in-the-original-paper">Identifying Data Leakage in the Original Paper&lt;/h1>
&lt;p>Upon closer inspection, it appears that the original experiment suffers from data leakage in two significant ways:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>Duplicate Images in Training and Validation Sets:&lt;/p>
&lt;p>The HAM10000 dataset contains near-duplicate images of the same lesions in both the training and validation sets. This results in the model seeing very similar images during training and then again during validation. Consequently, the model&amp;rsquo;s performance is artificially inflated because it has already been &amp;ldquo;trained&amp;rdquo; on images similar to those in the validation set, making the task easier than it should be.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Lesions" srcset="
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_dd4057c4bb0e4dc6092a43699881c4f4.webp 400w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_95478285e99e3f18a8b724b5a0a3dbb5.webp 760w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate_HAM10000_hu729e02c2ef4cc1a337c6f61174a87df8_81762_dd4057c4bb0e4dc6092a43699881c4f4.webp"
width="620"
height="104"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Lesions2" srcset="
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_a0e705292db1f7f683a77cd92f29edc0.webp 400w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_fae997e88feee11018435aebd9ed6c88.webp 760w,
/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240823-kyrillosishak/Near-duplicate2_HAM10000_hu1c922099c4dc23532306de6197bf4d86_99960_a0e705292db1f7f683a77cd92f29edc0.webp"
width="620"
height="104"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Using the Validation Set for Early Stopping and Final Evaluation:&lt;/p>
&lt;p>Another critical issue is the use of the validation set for both early stopping and final model evaluation. Early stopping is a technique where training is halted when the model&amp;rsquo;s performance on a validation set no longer improves, preventing overfitting. However, if this same validation set is later used to evaluate the model&amp;rsquo;s final performance, it can lead to overfitting on the validation data itself, resulting in an overly optimistic estimate of model accuracy.&lt;/p>
&lt;/li>
&lt;/ol>
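&lt;p>The first issue suggests a concrete fix: split by lesion rather than by image. Below is a minimal sketch using a group-aware splitter, assuming (as in the HAM10000 metadata file) that each image row carries a lesion_id; the table itself is a made-up stand-in.&lt;/p>

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Made-up stand-in for the HAM10000 metadata: several images
# (rows) can share one lesion_id.
meta = pd.DataFrame({
    "image_id":  [f"ISIC_{i:07d}" for i in range(8)],
    "lesion_id": ["HAM_0", "HAM_0", "HAM_1", "HAM_2",
                  "HAM_2", "HAM_3", "HAM_4", "HAM_4"],
})

# Grouping by lesion_id keeps all images of one lesion on the same
# side of the split, so near-duplicates cannot straddle it.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(meta, groups=meta["lesion_id"]))

train_lesions = set(meta.loc[train_idx, "lesion_id"])
val_lesions = set(meta.loc[val_idx, "lesion_id"])
print(train_lesions.intersection(val_lesions))  # set(): no lesion in both
```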
&lt;h1 id="our-reproduction-and-results">Our Reproduction and Results&lt;/h1>
&lt;p>To demonstrate the impact of these data leakage issues, we reproduced the experiment with corrected methodologies:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Corrected Data Split: We ensured that there were no duplicate images between the training and validation sets. This setup is crucial to simulate a realistic scenario where the model encounters completely unseen data during validation.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Separate Validation and Test Sets: We introduced a distinct test set to evaluate the final model performance, independent of the data used for early stopping.&lt;/p>
&lt;/li>
&lt;/ul>
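&lt;p>The corrected protocol amounts to a three-way split: the validation set steers early stopping, while the test set is consulted exactly once for the final number. A minimal sketch with made-up image ids:&lt;/p>

```python
from sklearn.model_selection import train_test_split

# Made-up image ids standing in for the dataset; a real split should
# additionally group images by lesion, as discussed above.
ids = [f"img_{i}" for i in range(100)]

# 70% train; the remaining 30% is divided evenly between validation
# (monitored for early stopping) and test (used once, at the end).
train_ids, rest = train_test_split(ids, test_size=0.3, random_state=0)
val_ids, test_ids = train_test_split(rest, test_size=0.5, random_state=0)

print(len(train_ids), len(val_ids), len(test_ids))  # 70 15 15
```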
&lt;p>&lt;strong>Results Comparison&lt;/strong>&lt;/p>
&lt;table>
&lt;tr>
&lt;td>&lt;/td>
&lt;td>Original results&lt;/td>
&lt;td>Our results&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Accuracy
&lt;/td>
&lt;td>
78.9%
&lt;/td>
&lt;td>
78.6%
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Number of epochs
&lt;/td>
&lt;td>
Approx. 42 epochs
&lt;/td>
&lt;td>
40 epochs
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Training size
&lt;/td>
&lt;td>
Unknown
&lt;/td>
&lt;td>
7000 samples
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Validation size
&lt;/td>
&lt;td>
478 samples
&lt;/td>
&lt;td>
478 samples
&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>
Confusion matrix
&lt;/td>
&lt;td>
&lt;img src="https://raw.githubusercontent.com/kyrillosishak/re-SkinCancer/main/assets/paper's_results.jpeg" />
&lt;/td>
&lt;td>
&lt;img src="https://raw.githubusercontent.com/kyrillosishak/re-SkinCancer/main/assets/Our_results.jpeg" />
&lt;/td>
&lt;/tr>
&lt;/table>
&lt;h1 id="analysis-of-the-results">Analysis of the Results&lt;/h1>
&lt;p>While our reproduced accuracy of 78.6% is close to the original reported accuracy, it is based on a properly separated training and validation set, avoiding the data leakage pitfalls of the original paper. The slight drop in accuracy further highlights the overestimation of the original model&amp;rsquo;s performance due to data leakage.&lt;/p>
&lt;p>Moreover, using a separate test set for final evaluation provides a more reliable measure of the model&amp;rsquo;s ability to generalize to new, unseen data. The confusion matrices show that our model&amp;rsquo;s performance is consistent across different lesion classes, confirming the robustness of the evaluation.&lt;/p>
&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>Data leakage is a common and often overlooked problem in applied machine learning, leading to misleading performance metrics and irreproducible results. By carefully examining and correcting these issues in our reproduction, we hope to provide a clearer understanding of the importance of proper data handling and validation practices.&lt;/p>
&lt;p>It is crucial for researchers and practitioners to be vigilant about data leakage and ensure that their models are trained, validated, and tested under realistic conditions. This not only ensures the credibility of their results but also enhances the real-world applicability of their models.&lt;/p>
&lt;p>Thank you for reading, and stay tuned for more insights on machine learning reproducibility!&lt;/p></description></item><item><title>Data leakage in applied ML: reproducing examples of irreproducibility</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240614-kyrillosishak/</link><pubDate>Fri, 14 Jun 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240614-kyrillosishak/</guid><description>&lt;p>Hello,&lt;/p>
&lt;p>I am Kyrillos Ishak, and I am happy to be part of SOR 2024. I am working on the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/">Data leakage in applied ML: reproducing examples of irreproducibility&lt;/a> project, and my &lt;a href="https://drive.google.com/file/d/1u9FGQqxlPMhceKwS_NJxIhkIrQVGIp-0/view" target="_blank" rel="noopener">proposal&lt;/a> was accepted.&lt;/p>
&lt;p>I am excited to work with &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> as my mentors. The objective of the project is to develop educational resources that can be adjusted by professors/instructors to explain specific data leakage problems. This involves ensuring the reproducibility of certain research papers that contain data preprocessing issues, then fixing these issues to demonstrate how they can affect the results.&lt;/p>
&lt;p>Data leakage is a problem caused when information from outside the training dataset is used to create the model. This issue can lead to overly optimistic performance estimates and, ultimately, models that do not perform well on new, unseen data.&lt;/p>
&lt;p>Despite the importance of addressing data leakage, many people from fields not closely related to computer science are often unfamiliar with it, even if they are aware of best practices for data preprocessing. Developing educational materials on this topic will greatly benefit them.&lt;/p>
&lt;p>I am excited to dive into the topic of data leakage in machine learning. Throughout the summer, I will be sharing regular updates and insightful blog posts on this subject. Stay tuned for more information!&lt;/p></description></item></channel></rss>