<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Shaivi Malik | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shaivi-malik/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shaivi-malik/index.xml" rel="self" type="application/rss+xml"/><description>Shaivi Malik</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shaivi-malik/avatar_hu0a917157a52c6dfbcf1801ea5197dcc2_3463216_270x270_fill_lanczos_center_3.png</url><title>Shaivi Malik</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/shaivi-malik/</link></image><item><title>Data Leakage in Applied ML: model uses features that are not legitimate</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240924-shaivimalik/</link><pubDate>Tue, 24 Sep 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240924-shaivimalik/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>I have been working on reproducing the results from &lt;strong>Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches&lt;/strong>. This study aimed to distinguish COVID-19 cases from normal and pneumonia cases using chest X-ray images. Since my last blog post, we have successfully reproduced the results using the VGG19 model, achieving 92% accuracy on the test set. However, the dataset has a significant demographic inconsistency: the normal and pneumonia chest X-ray images came from pediatric patients, while the COVID-19 chest X-ray images came from adults. This allowed the model to achieve high accuracy by learning features that were not clinically relevant.&lt;/p>
&lt;p>In &lt;a href="https://github.com/shaivimalik/covid_illegitimate_features/blob/main/notebooks/Correcting_Original_Result.ipynb" target="_blank" rel="noopener">Reproducing “Identification of COVID-19 samples from chest X-Ray images using deep learning: A comparison of transfer learning approaches” without Data Leakage&lt;/a>, we followed the methodology outlined in the paper, but with one key change: we used datasets containing only adult chest X-ray images. This time, the model achieved an accuracy of 51%, a drop of 41 percentage points from the earlier result. This confirms that the metrics reported in the paper were overly optimistic due to data leakage, where the model learned illegitimate features.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="GradCAM from husky vs wolf example " srcset="
/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_7bc94e0816aa962665434756bf41e27d.webp 400w,
/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_a160058d1708baa257daa63de5fada34.webp 760w,
/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240924-shaivimalik/gradcam_hu02772b80d1d95ff5ae817af6261a6059_438521_7bc94e0816aa962665434756bf41e27d.webp"
width="760"
height="329"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>To further illustrate this issue, we created a &lt;a href="https://github.com/shaivimalik/covid_illegitimate_features/blob/main/notebooks/Exploring_ConvNet_Activations.ipynb" target="_blank" rel="noopener">toy example&lt;/a> demonstrating how a model can learn illegitimate features. Trained on a small dataset of wolf and husky images, the model achieved an accuracy of 90%. However, this performance was the result of data leakage: all wolf images had snowy backgrounds, while all husky images had grassy backgrounds, so the model could separate the classes by background alone. When we trained the model on a dataset where both wolf and husky images had white backgrounds, the accuracy dropped to 70%. This shows that the earlier accuracy was an overly optimistic measure caused by data leakage.&lt;/p>
&lt;p>You can explore our work on the COVID-19 paper &lt;a href="https://github.com/shaivimalik/covid_illegitimate_features" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Lastly, I would like to thank &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a> for their support and guidance throughout my Summer of Reproducibility (SoR) journey.&lt;/p></description></item><item><title>Data Leakage in Applied ML</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/</link><pubDate>Tue, 13 Aug 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/</guid><description>&lt;p>Hello everyone!&lt;/p>
&lt;p>I have been working on reproducing the results from &lt;strong>Characterization of Term and Preterm Deliveries using Electrohysterograms Signatures&lt;/strong>. This paper aims to predict preterm birth using a Support Vector Machine with an RBF kernel. However, there is a major flaw in the methodology: &lt;strong>preprocessing on training and test set&lt;/strong>. This happens when preprocessing is performed on the entire dataset before it is split into training and test sets, so that information from the test samples ends up in the training data.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sample produced from test and training set samples" srcset="
/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_687703a1dee465e80fb3dbe262dd5860.webp 400w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_42051adaf7804083284553c10ca73861.webp 760w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/leakage_hu7171e85c8455cc3219721a2e3b71a711_62548_687703a1dee465e80fb3dbe262dd5860.webp"
width="760"
height="589"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Sample produced from training set samples" srcset="
/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_47e6397030251c1681ff92260f687641.webp 400w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_8bad9197813df4344757765d43878a56.webp 760w,
/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240813-shaivimalik/no_leakage_hu60ff986c558a17237e53708798334267_66856_47e6397030251c1681ff92260f687641.webp"
width="760"
height="594"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
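&lt;p>To make the flaw concrete, here is a minimal sketch, assuming the preprocessing step is an oversampler that creates synthetic samples by interpolating between pairs of existing samples. It is an illustration only, not the code from the paper or from the project repository: the data is random, and the names &lt;code>oversample&lt;/code>, &lt;code>count_leaky&lt;/code>, &lt;code>minority&lt;/code> and &lt;code>test_idx&lt;/code> are hypothetical. It counts how many synthetic samples were built from at least one held-out test sample, and for simplicity it treats every synthetic sample as training data.&lt;/p>

```python
import random


def oversample(data, indices, n_new, rng):
    """Create n_new synthetic points by interpolating between two samples.

    Only samples whose positions are listed in `indices` may be used as parents.
    Returns (point, parent_i, parent_j) tuples so each synthetic point can be
    traced back to the original samples it was built from.
    """
    synthetic = []
    for _ in range(n_new):
        i, j = rng.sample(indices, 2)  # two distinct parent samples
        t = rng.random()               # position along the segment between them
        point = [a + t * (b - a) for a, b in zip(data[i], data[j])]
        synthetic.append((point, i, j))
    return synthetic


def count_leaky(synthetic, test_idx):
    """Count synthetic points that have at least one parent in the test set."""
    return sum(1 for _, i, j in synthetic if i in test_idx or j in test_idx)


rng = random.Random(0)

# Hypothetical minority-class data: 20 samples with 2 features each.
minority = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
all_idx = list(range(20))
test_idx = set(range(15, 20))  # hold out the last 5 samples as the test set
train_idx = [i for i in all_idx if i not in test_idx]

# Leaky order: oversample the whole dataset first, split afterwards.
leaky = oversample(minority, all_idx, 100, rng)
leaky_count = count_leaky(leaky, test_idx)

# Correct order: split first, then oversample the training samples only.
clean = oversample(minority, train_idx, 100, rng)
clean_count = count_leaky(clean, test_idx)

print("synthetic samples derived from test data (oversample, then split):", leaky_count)
print("synthetic samples derived from test data (split, then oversample):", clean_count)
```

&lt;p>In this sketch, oversampling before the split produces synthetic samples that carry test-set information, while splitting first produces none. That is the contrast the two figures above illustrate.&lt;/p>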
&lt;p>Reproducing the published results came with its own challenges, including updating EHG-Oversampling to extract meaningful features from EHG signals and finding optimal hyperparameters for the model. Through our work on reproducing the published results and creating toy example notebooks, we have been able to demonstrate that data leakage leads to overly optimistic measures of model performance and that models trained with data leakage fail to generalize to real-world data. In such cases, performance on the test set doesn&amp;rsquo;t translate to performance in the real world.&lt;/p>
&lt;p>Next, I&amp;rsquo;ll be reproducing the results published in &lt;strong>Identification of COVID-19 Samples from Chest X-Ray Images Using Deep Learning: A Comparison of Transfer Learning Approaches&lt;/strong>.&lt;/p>
&lt;p>You can follow my work on the EHG paper &lt;a href="https://github.com/shaivimalik/medicine_preprocessing-on-entire-dataset" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>Stay tuned for more insights on data leakage and updates on our progress!&lt;/p></description></item><item><title>Data leakage in applied ML: reproducing examples from genomics, medicine and radiology</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240701-shaivimalik/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre24/nyu/data-leakage/20240701-shaivimalik/</guid><description>&lt;p>Hello everyone! I&amp;rsquo;m Shaivi Malik, a computer science and engineering student. I am thrilled to announce that I have been selected as a Summer of Reproducibility Fellow. I will be contributing to the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/nyu/data-leakage/">Data leakage in applied ML: reproducing examples of irreproducibility&lt;/a> project under the mentorship of &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/fraida-fund/">Fraida Fund&lt;/a> and &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/mohamed-saeed/">Mohamed Saeed&lt;/a>. You can find my proposal &lt;a href="https://drive.google.com/file/d/1WAsDif61O2fWgtkl75bQAnIcm2hryt8z/view?usp=sharing" target="_blank" rel="noopener">here&lt;/a>.&lt;/p>
&lt;p>This summer, we will reproduce studies from medicine, radiology and genomics. Through these studies, we&amp;rsquo;ll explore and demonstrate three types of data leakage:&lt;/p>
&lt;ol>
&lt;li>Preprocessing on training and test sets together&lt;/li>
&lt;li>Model uses features that are not legitimate&lt;/li>
&lt;li>Feature selection on training and test sets&lt;/li>
&lt;/ol>
&lt;p>For each paper, we will replicate the published results with and without the data leakage error, and present performance metrics for comparison. We will also provide explanatory materials and example questions to test understanding. All these resources will be bundled together in a dedicated repository for each paper.&lt;/p>
&lt;p>This project aims to address the need for accessible educational material on data leakage. These materials will be designed to be readily adopted by instructors teaching machine learning in a wide variety of contexts. They will be presented in a clear and easy-to-follow manner, catering to a broad range of backgrounds and raising awareness about the consequences of data leakage.&lt;/p>
&lt;p>Stay tuned for updates on my progress! You can follow me on &lt;a href="https://github.com/shaivimalik" target="_blank" rel="noopener">GitHub&lt;/a> and watch out for my upcoming blog posts.&lt;/p></description></item></channel></rss>