Qianru Zhang | UCSC OSPO

Final Blog: BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking

Thu, 29 Aug 2024 00:00:00 +0000

Hello! I’m Qianru! I have been contributing to the BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking project under the mentorship of Ziheng Duan. My project aims to provide a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics.

Motivation and Overview

The “BenchmarkST” project was driven by the need to address a critical challenge in spatial transcriptomics: the impact of sparse data on downstream tasks such as spatial domain identification. Sparse data can significantly degrade the performance of these tasks. For example, on a 10X Visium dataset of the human brain Dorsolateral Prefrontal Cortex (DLPFC), clustering the complete dataset with GraphST (a state-of-the-art clustering method) gave an ARI (Adjusted Rand Index) of 0.6347. With only 20% of the data, a common scenario, the ARI dropped to 0.1880. This gap highlights the importance of effective gene imputation, which can help restore the lost information and improve the accuracy of downstream analyses.
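
For readers unfamiliar with the metric, here is a minimal sketch of how an Adjusted Rand Index can be computed from two label assignments. It is a plain-Python illustration of the standard pair-counting formula, not the code used to produce the numbers above; the function name and the toy labels are made up for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    # Number of item pairs that share a contingency cell, a true class,
    # or a predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both clusterings are one big cluster
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy labels: annotated layers vs. cluster ids from some method.
annotated = ["L1", "L1", "L2", "L2", "L3", "L3"]
clustered = ["a", "a", "b", "c", "c", "c"]
print(f"ARI = {adjusted_rand_index(annotated, clustered):.4f}")
```

A score of 1 means the clustering matches the annotation exactly (up to renaming the clusters), and a score near 0 means the agreement is no better than chance.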

To tackle this issue, the BenchmarkST project led to the creation of the Impeller package. This package provides a standardized, easily accessible evaluation framework for gene imputation in spatial transcriptomics, offering preprocessed datasets, reproducible evaluation methods, and flexible inference interfaces. It spans different platforms, species, and organs, aiming to enhance the integrity and usability of spatial transcriptomics data.

What Was Accomplished

Development of the Impeller Package

Data Aggregation and Preprocessing:

We aggregated and preprocessed spatial transcriptomic datasets from multiple platforms (10X Visium, StereoSeq, SlideSeqV2), species (human, mouse), and organs (Dorsolateral Prefrontal Cortex, olfactory bulb). These datasets are readily available for download within the package.

Unified Evaluation Framework:

A reproducible framework was developed, integrating methods such as K-Nearest Neighbors (KNN) and the deep learning-based Impeller method, enabling users to easily evaluate the performance of different gene imputation techniques.
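
To give a feel for what a KNN baseline does, here is a minimal sketch of spatial nearest-neighbor imputation with NumPy. It is a generic illustration, not the implementation shipped in the package: the function name `knn_impute`, its parameters, and the choice to average over spatial neighbors (rather than, say, expression-space neighbors) are all assumptions made for this example.

```python
import numpy as np


def knn_impute(expr, coords, missing_mask, k=5):
    """Fill masked entries with the mean over the k spatially nearest spots.

    expr:         (n_spots, n_genes) expression matrix
    coords:       (n_spots, 2) spatial coordinates of each spot
    missing_mask: boolean (n_spots, n_genes), True where a value should be imputed
    """
    expr = np.asarray(expr, dtype=float)
    coords = np.asarray(coords, dtype=float)
    missing_mask = np.asarray(missing_mask, dtype=bool)
    k = min(k, expr.shape[0] - 1)

    # Pairwise Euclidean distances; a spot is never its own neighbor.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]  # (n_spots, k)

    observed = ~missing_mask
    imputed = expr.copy()
    for spot, gene in zip(*np.nonzero(missing_mask)):
        nbrs = neighbors[spot]
        usable = nbrs[observed[nbrs, gene]]  # neighbors with an observed value
        # Fall back to 0 when no neighbor has an observed value for this gene.
        imputed[spot, gene] = expr[usable, gene].mean() if usable.size else 0.0
    return imputed


# Toy example: four spots on a line, one gene, the value at spot 1 is hidden.
toy_coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
toy_expr = np.array([[1.0], [0.0], [3.0], [100.0]])
toy_mask = np.zeros_like(toy_expr, dtype=bool)
toy_mask[1, 0] = True
print(knn_impute(toy_expr, toy_coords, toy_mask, k=2))
```

The dense distance matrix keeps the sketch short but grows quadratically with the number of spots, so a tree-based neighbor search would be the more realistic choice on full datasets.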

Inference Interfaces:

We provided interfaces that allow users to apply gene imputation on custom datasets, offering the flexibility to predict any gene in any cell, maximizing the utility for diverse research needs.

Code Contributions and Documentation

Repository:

All code related to the Impeller package has been committed to the Impeller repository.

Link to Versions:

Here you can find all the versions made during the project, with detailed descriptions of each change.

README.md:

Detailed documentation on how to use the Impeller package, including installation instructions, usage examples, and explanations of the key components.

Halfway Through GSoC: My Experience and Learnings

Thu, 18 Jul 2024 00:00:00 +0000

Hello there! I’m Qianru, and this is my mid-term blog post for the 2024 Google Summer of Code. I am working on the BenchmarkST project, focusing on benchmarking gene imputation methods in spatial transcriptomics. My goal is to create a comprehensive, reproducible platform for evaluating these methods across various datasets and conditions.

In this post, I will share some of the progress I have made so far, the challenges I have faced, and how I overcame them. I will also highlight some specific accomplishments and what I plan to do next.

Achievements:

Developed the Python Package: I created the “Impeller” Python package, which includes tools for downloading example data, processing it, and training models. This package aims to standardize gene imputation tasks in spatial transcriptomics.
Example Data Integration: Successfully integrated various spatial transcriptomics datasets into the package for benchmarking purposes.
Benchmarking Framework: Established a framework for objective comparison of different gene imputation methodologies.

Python Package: Installation and Usage

You can install the package using pip:

pip install Impeller

Download Example Data

from Impeller import download_example_data
download_example_data()

Load and Process Data

from Impeller import load_and_process_example_data
data, val_mask, test_mask, x, original_x = load_and_process_example_data()
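
As a rough illustration of what names like val_mask, test_mask, x, and original_x could stand for, here is a minimal sketch that hides a random subset of the nonzero entries of an expression matrix. This is my reading of the names only: the helper `make_holdout_masks`, the split fractions, and the choice to zero out held-out values are assumptions for the example and may differ from what the package actually does.

```python
import numpy as np


def make_holdout_masks(original_x, val_frac=0.1, test_frac=0.1, seed=0):
    """Split nonzero entries into validation/test masks and hide them in a copy."""
    rng = np.random.default_rng(seed)
    original_x = np.asarray(original_x, dtype=float)

    # Only observed (nonzero) entries can serve as ground truth.
    rows, cols = np.nonzero(original_x)
    order = rng.permutation(rows.size)
    n_val = int(round(val_frac * rows.size))
    n_test = int(round(test_frac * rows.size))
    val_idx = order[:n_val]
    test_idx = order[n_val:n_val + n_test]

    val_mask = np.zeros(original_x.shape, dtype=bool)
    test_mask = np.zeros(original_x.shape, dtype=bool)
    val_mask[rows[val_idx], cols[val_idx]] = True
    test_mask[rows[test_idx], cols[test_idx]] = True

    # The model input: same matrix with the held-out values zeroed out.
    x = original_x.copy()
    x[val_mask | test_mask] = 0.0
    return val_mask, test_mask, x


# Toy matrix with 4 spots and 5 genes, all entries nonzero.
toy_original = np.arange(1, 21, dtype=float).reshape(4, 5)
toy_val, toy_test, toy_x = make_holdout_masks(toy_original)
print(toy_val.sum(), toy_test.sum())
```

Because the two masks are drawn from one permutation, no entry ends up in both the validation and the test set.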

Train Model

from Impeller import create_args, train
args = create_args()
test_l1_distance, test_cosine_sim, test_rmse = train(args, data, val_mask, test_mask, x, original_x)
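
The three values returned by training are named after common error measures. Here is a minimal sketch of how L1 distance, cosine similarity, and RMSE can be computed on held-out entries. The helper `imputation_metrics` is hypothetical, and details such as averaging over all masked entries at once (instead of per gene or per spot) are assumptions; the package may define its metrics differently.

```python
import numpy as np


def imputation_metrics(pred, truth, mask):
    """Return (mean L1 distance, cosine similarity, RMSE) over the masked entries."""
    mask = np.asarray(mask, dtype=bool)
    pred = np.asarray(pred, dtype=float)[mask]
    truth = np.asarray(truth, dtype=float)[mask]

    l1 = np.abs(pred - truth).mean()
    rmse = np.sqrt(((pred - truth) ** 2).mean())

    # Cosine similarity of the two flattened vectors; 0 if either is all zeros.
    denom = np.linalg.norm(pred) * np.linalg.norm(truth)
    cosine = float(pred @ truth / denom) if denom > 0 else 0.0
    return float(l1), cosine, float(rmse)


# Toy example: only the second row is held out for testing.
toy_truth = np.array([[1.0, 2.0], [3.0, 4.0]])
toy_pred = np.array([[1.0, 2.0], [3.0, 6.0]])
toy_test_mask = np.array([[False, False], [True, True]])
l1, cos, rmse = imputation_metrics(toy_pred, toy_truth, toy_test_mask)
print(f"L1={l1:.4f}  cosine={cos:.4f}  RMSE={rmse:.4f}")
```

Lower L1 distance and RMSE are better, while a cosine similarity closer to 1 means the imputed values follow the same pattern as the held-out truth.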

Challenges:

Reproducing the results of various gene imputation methods was not an easy task. I faced several challenges along the way:

Lack of Standardized Data: There was no standard benchmark dataset for gene imputation, so datasets from different platforms, species, and organs had to be integrated into the package before methods could be compared.
Reproducibility Issues: Some methods had incomplete or missing code, making it difficult to reproduce their results accurately.
Resource Limitations: Running large-scale experiments required significant computational resources, which posed constraints on the project timeline.

Future Work:

Moving forward, I plan to:

Extend the package’s functionalities to include more datasets and imputation methods.
Enhance the benchmarking framework for more comprehensive evaluations.
Collaborate with other researchers to validate and improve the package’s utility in the bioinformatics community.

I hope you found this update informative and interesting. If you have any questions or feedback, please feel free to contact me. Thank you for your attention and support!

BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking

Sun, 09 Jun 2024 00:00:00 +0000

Hello! My name is Qianru, and I will be working on a project to improve spatial transcriptomics during Google Summer of Code 2024. My project, Benchmarking Gene Imputation Methods for Spatial Transcriptomics, is mentored by Ziheng Duan and Cormac Flanagan. The goal is to create a standard platform to evaluate methods for filling in missing gene data, which is a big challenge in spatial transcriptomics. My proposal can be viewed here!

Spatial transcriptomics lets us see where genes are active in tissues, giving us insight into how cells interact in their natural environment. However, current methods often miss some gene data, making it hard to get a complete picture. Gene imputation can help fill in these gaps.

My project will:

Create a benchmark dataset to standardize gene imputation tasks across different platforms, species, and organs.

Compare various gene imputation methods to see how well they work in different scenarios.

Develop a user-friendly Python package with tools for gene imputation to help researchers improve their data.

I’m excited to contribute to this project and help advance the field of spatial transcriptomics by making data analysis more accurate and comprehensive.