<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Ziheng Duan | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/index.xml" rel="self" type="application/rss+xml"/><description>Ziheng Duan</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/avatar_hu952e6b10a9e3f6c24675851b69b8536c_377404_270x270_fill_q75_lanczos_center.jpg</url><title>Ziheng Duan</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/</link></image><item><title>Agent4Target: An Agent-based Evidence Aggregation Toolkit for Therapeutic Target Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/agent4target/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/agent4target/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> therapeutic target identification, drug discovery, evidence aggregation, AI agents, biomedical knowledge integration&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; experience with modern ML tooling preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning / AI:&lt;/strong> agent-based systems, workflow orchestration, weak supervision (basic), representation learning&lt;/li>
&lt;li>&lt;strong>Software Engineering:&lt;/strong> modular system design, APIs, CLI tools, documentation&lt;/li>
&lt;li>&lt;strong>Biomedical Knowledge (preferred):&lt;/strong> familiarity with drug–target databases (e.g., PHAROS, DepMap, Open Targets)&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Identifying and prioritizing high-quality therapeutic targets is a foundational yet challenging task in drug discovery. Modern target identification relies on aggregating heterogeneous evidence from multiple sources, including genetic perturbation screens, disease associations, chemical biology, and biomedical literature. These evidence sources are highly fragmented, noisy, and heterogeneous in both format and reliability.&lt;/p>
&lt;p>While large language models and AI agents have recently shown promise in automating scientific workflows, many existing approaches focus on end-to-end prediction or conversational interfaces. Such systems are often difficult to reproduce, extend, or integrate into existing research pipelines, limiting their practical adoption by the biomedical community.&lt;/p>
&lt;p>This project proposes &lt;strong>Agent4Target&lt;/strong>, an &lt;strong>agent-based evidence aggregation toolkit&lt;/strong> that reframes therapeutic target identification as a &lt;strong>structured, modular workflow&lt;/strong>. Instead of using agents for free-form reasoning, Agent4Target employs agents as &lt;strong>orchestrated components&lt;/strong> that systematically collect, normalize, score, and explain evidence supporting candidate therapeutic targets.&lt;/p>
&lt;p>The goal is to deliver a &lt;strong>reusable, open-source toolchain&lt;/strong> that can be integrated into diverse drug discovery workflows, independent of any single downstream prediction model or publication.&lt;/p>
&lt;hr>
&lt;h3 id="key-idea-and-technical-approach">&lt;strong>Key Idea and Technical Approach&lt;/strong>&lt;/h3>
&lt;p>Agent4Target models target identification as a multi-stage, agent-driven pipeline, coordinated by a central orchestrator:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Evidence Collector Agents&lt;/strong>&lt;br>
Specialized agents retrieve target-level evidence from heterogeneous sources, such as:&lt;/p>
&lt;ul>
&lt;li>Genetic perturbation and dependency data (e.g., DepMap)&lt;/li>
&lt;li>Target annotation and development status (e.g., PHAROS)&lt;/li>
&lt;li>Disease association scores (e.g., Open Targets)&lt;/li>
&lt;li>Automatically summarized literature evidence&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Normalization &amp;amp; Scoring Agent&lt;/strong>&lt;br>
Collected evidence is converted into a unified, structured schema using typed data models (e.g., JSON / Pydantic).&lt;br>
This agent performs:&lt;/p>
&lt;ul>
&lt;li>Evidence normalization across sources&lt;/li>
&lt;li>Confidence-aware scoring and aggregation&lt;/li>
&lt;li>Optional weighting or calibration strategies&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Explanation Agent&lt;/strong>&lt;br>
Rather than free-text generation, this agent produces &lt;strong>structured explanations&lt;/strong> that explicitly link scores to supporting evidence, enabling transparency and interpretability for downstream users.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Workflow Orchestrator&lt;/strong>&lt;br>
A lightweight orchestration layer (e.g., LangGraph or a state-machine-based controller) manages agent execution, dependencies, and failure handling, ensuring reproducibility and extensibility.&lt;/p>
&lt;/li>
&lt;/ol>
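&lt;p>A minimal sketch of the unified evidence schema and confidence-aware aggregation described in step 2. All names here (&lt;code>EvidenceRecord&lt;/code>, &lt;code>aggregate&lt;/code>) and field choices are illustrative assumptions, and stdlib dataclasses stand in for the typed Pydantic models the project would actually use:&lt;/p>

```python
from dataclasses import dataclass

# Illustrative only: field names and score semantics are assumptions,
# not the project's final schema.
@dataclass
class EvidenceRecord:
    target: str        # gene/protein symbol, e.g. "EGFR"
    source: str        # evidence source, e.g. "DepMap", "PHAROS", "Open Targets"
    raw_score: float   # score as reported by the source, normalized to [0, 1]
    confidence: float  # reliability weight assigned to the source

def aggregate(records):
    """Confidence-weighted mean score per target."""
    sums, weights = {}, {}
    for r in records:
        sums[r.target] = sums.get(r.target, 0.0) + r.confidence * r.raw_score
        weights[r.target] = weights.get(r.target, 0.0) + r.confidence
    return {t: sums[t] / weights[t] for t in sums if weights[t] > 0}

records = [
    EvidenceRecord("EGFR", "DepMap", 0.8, 0.9),
    EvidenceRecord("EGFR", "Open Targets", 0.6, 0.5),
    EvidenceRecord("KRAS", "PHAROS", 0.4, 0.7),
]
scores = aggregate(records)
```

&lt;p>A real implementation would also carry provenance fields (source record IDs, retrieval timestamps) so the Explanation Agent can link each aggregated score back to its supporting evidence.&lt;/p>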
&lt;p>This modular design allows individual agents to be replaced, extended, or reused without altering the overall system.&lt;/p>
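&lt;p>The orchestration layer in step 4 reduces to executing agent callables over a dependency graph. This stdlib sketch (agent names and payloads are hypothetical) uses a topological sort in place of a full framework such as LangGraph:&lt;/p>

```python
# Hypothetical minimal orchestrator: agents are callables that take and
# return a state dict; execution order comes from a topological sort of
# the declared dependencies.
from graphlib import TopologicalSorter

def run_pipeline(agents, deps, state=None):
    """agents: name -> callable(state) -> state; deps: name -> prerequisites."""
    state = dict(state or {})
    for name in TopologicalSorter(deps).static_order():
        state = agents[name](state)
    return state

agents = {
    "collect": lambda s: {**s, "evidence": ["depmap_hit", "pharos_tier"]},
    "score":   lambda s: {**s, "score": len(s["evidence"]) / 10},
    "explain": lambda s: {**s, "explanation": f"score {s['score']} from {s['evidence']}"},
}
deps = {"collect": set(), "score": {"collect"}, "explain": {"score"}}
result = run_pipeline(agents, deps)
```

&lt;p>Because each agent only reads and writes the shared state, swapping in a new Evidence Collector is a one-line change to the &lt;code>agents&lt;/code> mapping, which is exactly the replaceability property described above.&lt;/p>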
&lt;hr>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Design a Modular Agent-based Architecture&lt;/strong>
&lt;ul>
&lt;li>Define clear interfaces for evidence collection, normalization, scoring, and explanation agents.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Implement a Standardized Evidence Schema&lt;/strong>
&lt;ul>
&lt;li>Develop a unified data model for heterogeneous target-level evidence.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Build a Reproducible Orchestration Framework&lt;/strong>
&lt;ul>
&lt;li>Implement a deterministic, inspectable workflow for agent coordination.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Deliver a Community-Ready Toolkit&lt;/strong>
&lt;ul>
&lt;li>Provide CLI tools, example notebooks, and clear documentation to support adoption.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark and Case Studies&lt;/strong>
&lt;ul>
&lt;li>Demonstrate the toolkit on representative target identification scenarios using public datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Open-Source Agent4Target Codebase&lt;/strong>
&lt;ul>
&lt;li>A well-documented Python package with modular agent components.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Command-Line Interface (CLI)&lt;/strong>
&lt;ul>
&lt;li>Tools for running end-to-end evidence aggregation pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Standardized Output Schema&lt;/strong>
&lt;ul>
&lt;li>Machine-readable evidence summaries suitable for downstream modeling.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Example Notebooks and Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Demonstrations of usage and performance on real-world target identification tasks.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation&lt;/strong>
&lt;ul>
&lt;li>Installation guides, extension tutorials, and developer documentation.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;hr>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>Agent4Target provides a practical bridge between AI agents and real-world drug discovery workflows. By emphasizing structured evidence aggregation, reproducibility, and interpretability, this project enables researchers to systematically reason about therapeutic targets rather than relying on opaque, end-to-end models. The resulting toolkit can serve as a foundation for future work in AI-assisted drug discovery, weak supervision, and biomedical knowledge integration.&lt;/p></description></item><item><title>HistoMoE: A Histology-Guided Mixture-of-Experts Framework for Gene Expression Prediction</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/histomoe/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/histomoe/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> computational pathology, spatial transcriptomics, gene expression prediction, mixture-of-experts, multimodal learning&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; experience with PyTorch preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> CNNs / vision encoders, mixture-of-experts, multimodal representation learning&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> handling large-scale histology image patches and gene expression matrices&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> familiarity with spatial transcriptomics or scRNA-seq data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Histology imaging is one of the most widely available data modalities in biomedical research and clinical practice, capturing rich morphological information about tissues and disease states. In parallel, spatial transcriptomics (ST) technologies provide spatially resolved gene expression measurements, enabling unprecedented insights into tissue organization and cellular heterogeneity. However, the high cost and limited accessibility of ST experiments remain a major barrier to their widespread adoption.&lt;/p>
&lt;p>Predicting gene expression directly from histology images offers a promising alternative, enabling molecular-level inference from routinely collected pathology data. Existing approaches typically rely on a single global model that maps image embeddings to gene expression profiles. While effective to some extent, these models struggle to capture the strong organ-, tissue-, and cancer-specific heterogeneity that underlies gene expression patterns.&lt;/p>
&lt;p>This project proposes &lt;strong>HistoMoE&lt;/strong>, a &lt;strong>histology-guided mixture-of-experts (MoE) framework&lt;/strong> that explicitly models biological heterogeneity by learning &lt;strong>specialized expert models&lt;/strong> for different cancer types or organs, and dynamically routing histology image patches to the most relevant experts.&lt;/p>
&lt;h3 id="key-idea-and-technical-approach">&lt;strong>Key Idea and Technical Approach&lt;/strong>&lt;/h3>
&lt;p>As illustrated in the figure above, HistoMoE integrates multiple data modalities and learning components:&lt;/p>
&lt;ol>
&lt;li>
&lt;p>&lt;strong>Vision Encoder&lt;/strong>&lt;br>
Histology image patches are encoded into high-dimensional visual representations using a convolutional or transformer-based vision backbone.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Text / Metadata Encoder&lt;/strong>&lt;br>
Sample-level metadata (e.g., tissue type, organ, disease context) is encoded using a lightweight text or embedding model.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Gating Network&lt;/strong>&lt;br>
A gating network jointly considers image and metadata embeddings to infer routing weights over multiple &lt;strong>cancer- or organ-specific expert models&lt;/strong>.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>&lt;strong>Expert Models&lt;/strong>&lt;br>
Each expert specializes in modeling gene expression patterns for a specific biological context (e.g., CCRCC, COAD, LUAD), producing patch-level gene expression predictions.&lt;/p>
&lt;/li>
&lt;/ol>
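&lt;p>The routing step can be sketched in a few lines. This NumPy toy (the shapes, expert count, and linear "experts" are all illustrative assumptions — the project would use trained PyTorch modules) shows how gating weights computed from fused image-and-metadata embeddings mix per-expert predictions:&lt;/p>

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: 4 patches, 32-dim fused (image + metadata) embedding,
# 3 cancer/organ-specific experts, 5 predicted genes.
rng = np.random.default_rng(0)
fused = rng.normal(size=(4, 32))    # per-patch fused embedding
W_gate = rng.normal(size=(32, 3))   # gating weights, one column per expert
gate = softmax(fused @ W_gate)      # routing weights; each row sums to 1

# Each "expert" here is just a linear head predicting a 5-gene vector.
experts = [rng.normal(size=(32, 5)) for _ in range(3)]
per_expert = np.stack([fused @ W for W in experts])   # (experts, patches, genes)
pred = np.einsum("pe,epg->pg", gate, per_expert)      # gate-weighted mixture
```

&lt;p>Inspecting &lt;code>gate&lt;/code> directly is what enables the interpretability analysis below: the routing weights show which biological expert drove each patch-level prediction.&lt;/p>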
&lt;p>By explicitly modeling biological structure through expert specialization, HistoMoE aims to improve both &lt;strong>prediction accuracy&lt;/strong> and &lt;strong>interpretability&lt;/strong>, allowing researchers to understand which biological experts drive each prediction.&lt;/p>
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Design and Implement the HistoMoE Framework&lt;/strong>
&lt;ul>
&lt;li>Build a modular MoE architecture with pluggable vision encoders, gating networks, and expert models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Multimodal Routing and Expert Specialization&lt;/strong>
&lt;ul>
&lt;li>Explore how image features and metadata jointly inform expert selection.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmarking and Evaluation&lt;/strong>
&lt;ul>
&lt;li>Compare HistoMoE against single-model baselines on multiple cancer and organ-specific spatial transcriptomics datasets.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Interpretability Analysis&lt;/strong>
&lt;ul>
&lt;li>Analyze expert routing behavior to reveal biologically meaningful patterns.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Open-Source HistoMoE Codebase&lt;/strong>
&lt;ul>
&lt;li>Well-documented Python implementation with training, evaluation, and visualization tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark Results&lt;/strong>
&lt;ul>
&lt;li>Quantitative comparisons demonstrating improvements over non-expert baselines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization and Analysis Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for inspecting expert usage, routing weights, and gene-level predictions.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear instructions and examples to enable adoption by the research community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>HistoMoE introduces an expert-system perspective to histology-based gene expression prediction, bridging morphological and molecular representations through biologically informed specialization. By combining multimodal learning with mixture-of-experts modeling, this project advances the interpretability and accuracy of computational pathology methods and contributes toward scalable, cost-effective alternatives to spatial transcriptomics experiments.&lt;/p></description></item><item><title>StaR: A Stability-Aware Representation Learning Framework for Spatial Domain Identification</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/star/</link><pubDate>Tue, 20 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci/star/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> spatial transcriptomics, spatial domain identification, representation learning, model robustness&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong> Python; PyTorch experience preferred&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong> representation learning, clustering, robustness and stability analysis&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong> spatial transcriptomics preprocessing and evaluation (ARI, clustering metrics)&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong> familiarity with spatial transcriptomics or scRNA-seq data&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial domain identification is a fundamental task in spatial transcriptomics (ST), aiming to partition tissue sections into biologically meaningful regions based on spatially resolved gene expression profiles. These spatial domains often correspond to distinct anatomical structures, cellular compositions, or functional microenvironments, and serve as a critical foundation for downstream biological analysis.&lt;/p>
&lt;p>Despite rapid methodological progress, &lt;strong>most existing spatial domain identification methods are highly sensitive to random initialization&lt;/strong>. In practice, simply changing the random seed can lead to substantially different clustering results and large performance fluctuations, even when using identical hyperparameters and datasets. This instability severely undermines the reliability, reproducibility, and interpretability of spatial transcriptomics analyses.&lt;/p>
&lt;h3 id="problem-seed-sensitivity-and-unstable-representations">&lt;strong>Problem: Seed Sensitivity and Unstable Representations&lt;/strong>&lt;/h3>
&lt;p>Empirical evidence shows that state-of-the-art spatial domain identification models can exhibit substantial performance variance across random seeds. For example, the Adjusted Rand Index (ARI) may vary from relatively strong performance (e.g., ARI ≈ 0.65) to noticeably degraded yet still reasonable outcomes (e.g., ARI ≈ 0.50) solely due to different random initializations.&lt;/p>
&lt;p>By systematically evaluating models across &lt;strong>hundreds to thousands of random seeds&lt;/strong>, we observe that:&lt;/p>
&lt;ul>
&lt;li>Model performance landscapes are highly &lt;strong>rugged&lt;/strong>, with sharp cliffs and isolated high-performing regions.&lt;/li>
&lt;li>Standard training objectives implicitly favor brittle representations that are not robust to small perturbations in initialization or optimization trajectories.&lt;/li>
&lt;/ul>
&lt;p>These observations suggest that instability is not a peripheral issue, but rather a &lt;strong>structural limitation of current representation learning approaches&lt;/strong> for spatial transcriptomics.&lt;/p>
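&lt;p>Quantifying this instability requires comparing the cluster assignments produced under different seeds; the Adjusted Rand Index does exactly that, and is invariant to how cluster labels are numbered. A self-contained implementation, written from the standard contingency-table formula, might look like:&lt;/p>

```python
import numpy as np
from math import comb

def ari(labels_a, labels_b):
    """Adjusted Rand Index computed from the contingency table."""
    a_vals, a_inv = np.unique(labels_a, return_inverse=True)
    b_vals, b_inv = np.unique(labels_b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)), dtype=int)
    np.add.at(table, (a_inv, b_inv), 1)   # pairwise co-assignment counts
    sum_ij = sum(comb(int(n), 2) for n in table.ravel())
    sum_a = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_b = sum(comb(int(n), 2) for n in table.sum(axis=0))
    n_pairs = comb(int(table.sum()), 2)
    expected = sum_a * sum_b / n_pairs
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

x = np.array([0, 0, 1, 1, 2, 2])
same = ari(x, x)                  # identical partitions
relabeled = ari(x, (x + 1) % 3)   # same partition, different label ids
```

&lt;p>Running one model under many seeds and reporting the distribution of pairwise ARI values between runs (not just ARI against ground truth) is what exposes the rugged landscapes described above.&lt;/p>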
&lt;h3 id="proposed-solution-star">&lt;strong>Proposed Solution: StaR&lt;/strong>&lt;/h3>
&lt;p>This project proposes &lt;strong>StaR&lt;/strong>, a &lt;strong>Stability-Aware Representation Learning framework&lt;/strong> designed to explicitly address seed sensitivity in spatial domain identification.&lt;/p>
&lt;p>The core idea of StaR is to &lt;strong>learn representations that are robust to perturbations in model parameters and training dynamics&lt;/strong>, rather than optimizing solely for peak performance under a single random seed. Concretely, StaR introduces controlled noise or perturbations into the training process and encourages consistency across multiple perturbed model instances, guiding the model toward flatter and more stable regions of the parameter space.&lt;/p>
&lt;p>By prioritizing stability during representation learning, StaR aims to produce embeddings that:&lt;/p>
&lt;ul>
&lt;li>Yield consistent spatial domain assignments across random seeds&lt;/li>
&lt;li>Maintain competitive or improved clustering accuracy&lt;/li>
&lt;li>Better reflect underlying biological structure&lt;/li>
&lt;/ul>
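&lt;p>One simple instantiation of the perturbation idea is a consistency penalty: perturb copies of the model parameters with small noise and penalize disagreement between the resulting embeddings. This toy NumPy sketch (a single linear map stands in for the real encoder; all names and sizes are illustrative assumptions) conveys the shape of such a loss term:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(1)

def embed(X, W):
    """Toy encoder: a single linear map standing in for the real model."""
    return X @ W

# Hypothetical data: 100 spots, 50 genes, 8-dim embedding.
X = rng.normal(size=(100, 50))
W = rng.normal(size=(50, 8))

def consistency_penalty(X, W, n_perturb=4, sigma=0.01):
    """Mean squared disagreement between the base embedding and
    embeddings from noise-perturbed parameter copies."""
    base = embed(X, W)
    diffs = []
    for _ in range(n_perturb):
        W_p = W + sigma * rng.normal(size=W.shape)
        diffs.append(np.mean((embed(X, W_p) - base) ** 2))
    return float(np.mean(diffs))

penalty = consistency_penalty(X, W)
```

&lt;p>Added to the usual reconstruction or clustering objective, a term like this pushes training toward flatter regions of the parameter space, where small perturbations — including a different random seed — change the learned representation less.&lt;/p>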
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Characterize Instability in Existing Methods&lt;/strong>
&lt;ul>
&lt;li>Systematically quantify seed sensitivity across popular spatial domain identification models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Develop Stability-Aware Training Objectives&lt;/strong>
&lt;ul>
&lt;li>Design perturbation-based or consistency-driven losses that encourage robust representations.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Integrate StaR into Existing Pipelines&lt;/strong>
&lt;ul>
&lt;li>Apply StaR to widely used spatial transcriptomics workflows with minimal architectural changes.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Benchmarking&lt;/strong>
&lt;ul>
&lt;li>Evaluate StaR using clustering metrics (e.g., ARI) and stability metrics across multiple datasets and random seeds.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Biological Validation&lt;/strong>
&lt;ul>
&lt;li>Assess whether stability-aware representations preserve biologically meaningful spatial patterns.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>StaR Framework Implementation&lt;/strong>
&lt;ul>
&lt;li>An open-source Python implementation compatible with common spatial transcriptomics toolchains.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Stability Benchmarks&lt;/strong>
&lt;ul>
&lt;li>Comprehensive evaluations demonstrating reduced performance variance across seeds.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Visualization Tools&lt;/strong>
&lt;ul>
&lt;li>Tools for visualizing performance landscapes, stability surfaces, and spatial domain consistency.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>
&lt;ul>
&lt;li>Clear examples enabling researchers to adopt StaR in their own analyses.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>StaR addresses a critical yet underexplored challenge in spatial transcriptomics: &lt;strong>model instability and poor reproducibility&lt;/strong>. By shifting the focus from single-run performance to stability-aware representation learning, this project improves the reliability and trustworthiness of spatial domain identification methods. StaR has the potential to become a foundational component in robust spatial transcriptomics pipelines and to inspire broader adoption of stability-aware principles in biological representation learning.&lt;/p></description></item><item><title>RAG-ST: Retrieval-Augmented Generation for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</link><pubDate>Wed, 15 Jan 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/uci/rag-st/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene expression generation, retrieval-augmented generation, large models&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python and familiar with machine learning libraries such as PyTorch.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with spatial transcriptomics datasets and statistical modeling.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Machine Learning:&lt;/strong>
&lt;ul>
&lt;li>Understanding of vision models, retrieval-based systems, and MLP architectures.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (preferred):&lt;/strong>
&lt;ul>
&lt;li>Familiarity with scRNA-seq data integration and computational biology tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating RAG models, building a robust database, and ensuring interpretable predictions, this project involves substantial computational and data preparation work.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Spatial transcriptomics (ST) is a revolutionary technology that provides spatially resolved gene expression measurements, enabling researchers to study cellular behavior within tissues with unprecedented detail. This technology has transformed our understanding of complex biological systems, such as disease progression, tissue development, and cellular heterogeneity. However, the widespread adoption of ST is limited by its high cost and technical requirements.&lt;/p>
&lt;p>Histology imaging, on the other hand, is far more accessible and cost-effective. If gene expression could be accurately predicted from histology images, it would enable researchers to leverage these abundant images for high-resolution biological insights without the need for expensive spatial transcriptomics experiments. This task has immense potential to democratize spatial transcriptomics research and significantly reduce costs.&lt;/p>
&lt;h3 id="challenges-in-current-approaches">&lt;strong>Challenges in Current Approaches&lt;/strong>&lt;/h3>
&lt;p>Current methods for predicting gene expression from histology images typically involve:&lt;/p>
&lt;ol>
&lt;li>Using large vision models to encode histology image patches into embeddings.&lt;/li>
&lt;li>Employing Multi-Layer Perceptrons (MLPs) to map these embeddings to gene expression profiles.&lt;/li>
&lt;/ol>
&lt;p>While these approaches have shown promise, they suffer from two critical limitations:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Accuracy&lt;/strong>: The MLP-based mappings often fail to fully capture the biological complexity encoded in the histology images, leading to suboptimal predictions.&lt;/li>
&lt;li>&lt;strong>Interpretability&lt;/strong>: These models act as black boxes, providing no insight into the underlying biological rationale for the predictions. Researchers cannot determine why a specific gene expression profile was generated, limiting trust and utility in biological contexts.&lt;/li>
&lt;/ul>
&lt;h3 id="project-motivation">&lt;strong>Project Motivation&lt;/strong>&lt;/h3>
&lt;p>To overcome these limitations, this project proposes a novel &lt;strong>Retrieval-Augmented Generation (RAG)&lt;/strong> framework for spatial transcriptomics. Instead of relying solely on black-box MLPs, RAG-ST will:&lt;/p>
&lt;ul>
&lt;li>Retrieve relevant examples from a curated database of paired histology images, scRNA-seq data, and gene expression profiles.&lt;/li>
&lt;li>Use these retrieved examples to inform and enhance the generation process, resulting in predictions that are both more accurate and biologically interpretable.&lt;/li>
&lt;/ul>
&lt;p>This approach not only grounds predictions in biologically meaningful data but also provides transparency by revealing which database entries influenced the results.&lt;/p>
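&lt;p>The retrieval step might look like the following sketch, where cosine similarity selects the nearest database entries and a similarity-weighted average stands in for the learned generation model. The database shapes, sizes, and function names are illustrative assumptions:&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical database: 200 reference patches with 64-dim embeddings
# and paired 10-gene expression profiles.
db_emb = rng.normal(size=(200, 64))
db_expr = rng.normal(size=(200, 10))

def retrieve_and_predict(query_emb, k=5):
    """Cosine-similarity retrieval; the prediction is a similarity-weighted
    mean of retrieved profiles (a placeholder for the learned generator)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = db_emb / np.linalg.norm(db_emb, axis=1, keepdims=True)
    sims = d @ q
    top = np.argsort(sims)[-k:]              # indices of the k nearest entries
    w = sims[top] - sims[top].min() + 1e-8   # shift to non-negative weights
    pred = (w[:, None] * db_expr[top]).sum(0) / w.sum()
    return pred, top                         # prediction plus provenance

pred, top = retrieve_and_predict(rng.normal(size=64))
```

&lt;p>Returning &lt;code>top&lt;/code> alongside the prediction is the interpretability hook: it records exactly which database entries influenced each predicted expression profile.&lt;/p>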
&lt;h3 id="project-objectives">&lt;strong>Project Objectives&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Database Construction&lt;/strong>:
&lt;ul>
&lt;li>Curate a large and diverse database of histology images paired with scRNA-seq and gene expression data.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Model Development&lt;/strong>:
&lt;ul>
&lt;li>Develop a RAG framework combining vision-based encoders and retrieval-enhanced generation techniques.&lt;/li>
&lt;li>Incorporate interpretability mechanisms to link predicted gene expressions to retrieved examples.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Evaluation and Benchmarking&lt;/strong>:
&lt;ul>
&lt;li>Assess RAG-ST against state-of-the-art methods, focusing on accuracy, interpretability, and biological validity.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ol>
&lt;li>&lt;strong>Curated Database&lt;/strong>:
&lt;ul>
&lt;li>A publicly available, well-documented database of histology images and gene expression profiles.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>RAG-ST Framework&lt;/strong>:
&lt;ul>
&lt;li>An open-source Python implementation of the RAG-ST model, with retrieval, generation, and visualization tools.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Benchmark Results&lt;/strong>:
&lt;ul>
&lt;li>Comprehensive evaluations demonstrating the benefits of RAG-ST over conventional pipelines.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Documentation and Tutorials&lt;/strong>:
&lt;ul>
&lt;li>User-friendly guides to facilitate adoption by the spatial transcriptomics research community.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ol>
&lt;h3 id="impact">&lt;strong>Impact&lt;/strong>&lt;/h3>
&lt;p>By integrating retrieval-augmented generation with large models, RAG-ST represents a paradigm shift in spatial transcriptomics. It offers a cost-effective, accurate, and interpretable solution for gene expression prediction, democratizing access to high-quality spatial transcriptomic insights and fostering advancements in biological research.&lt;/p>
&lt;hr></description></item><item><title>BenchmarkST: Cross-Platform, Multi-Species Spatial Transcriptomics Gene Imputation Benchmarking</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</link><pubDate>Sat, 17 Feb 2024 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre24/uci/benchmarkst/</guid><description>&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> bioinformatics, spatial transcriptomics, gene imputation, benchmarking, cross-platform/species analysis&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong>
&lt;ul>
&lt;li>&lt;strong>Programming Languages:&lt;/strong>
&lt;ul>
&lt;li>Proficient in Python and/or R, the languages commonly used in bioinformatics.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Data Analysis:&lt;/strong>
&lt;ul>
&lt;li>Experience with statistical data analysis and machine learning models.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Bioinformatics Knowledge (not required but preferred):&lt;/strong>
&lt;ul>
&lt;li>Proficiency in bioinformatics and computational biology.&lt;/li>
&lt;li>Familiarity with spatial transcriptomics datasets and platforms.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Advanced&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours). Given the scope of integrating multi-platform, multi-species datasets and the complexity of benchmarking gene imputation methods, this project is substantial. It requires extensive data preparation, analysis, and validation phases, making it suitable for a larger time investment.&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/ziheng-duan/">Ziheng Duan&lt;/a> (contact person)&lt;/li>
&lt;/ul>
&lt;h3 id="project-idea-description">&lt;strong>Project Idea Description&lt;/strong>&lt;/h3>
&lt;p>Cellular life is orchestrated by the precise control of gene activation and silencing across spatial and temporal contexts. Understanding these spatiotemporal gene expression patterns is vital for advancing our knowledge of biological processes, from development and disease progression to adaptation. While single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to profile gene expression across thousands of cells simultaneously, its requirement for cell dissociation strips away the critical spatial context, limiting our comprehension of cellular interactions within their native environments. Recent strides in spatial transcriptomics have begun to bridge this gap by enabling spatially resolved gene expression measurements at single-cell or even sub-cellular resolution. These advances offer unparalleled opportunities to map gene expression within tissues, shedding light on the dynamic interactions between cells and their surroundings.&lt;/p>
&lt;p>Despite these technological advances, a significant challenge remains: the datasets generated by spatial transcriptomic technologies are often incomplete, marred by missing gene expression values due to various technical and biological constraints. This limitation severely impedes our ability to fully interpret these rich datasets and extract meaningful insights from them. Gene imputation emerges as a pivotal solution to this problem, aiming to fill in these missing data points, thereby enhancing the resolution, quality, and interpretability of spatial transcriptomic datasets.&lt;/p>
&lt;p>Recognizing the critical importance of this task, there is a pressing need for a unified benchmarking platform that can facilitate the evaluation and comparison of gene imputation methods across a diverse array of samples, spanning multiple sampling platforms, species, and organs. Currently, the bioinformatics and spatial transcriptomics fields lack such a standardized framework, hindering progress and innovation. To address this gap, our project aims to establish a comprehensive gene imputation dataset that encompasses a wide range of conditions and parameters. We intend to reproduce known methods and assess their efficacy, providing a solid and reproducible foundation for future advancements in this domain.&lt;/p>
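&lt;p>Imputation benchmarks of this kind are typically run by masking entries that were actually observed, imputing them, and scoring how well the held-out values are recovered. A minimal sketch with a per-gene-mean baseline (the synthetic data, masking rate, and metric choice are all illustrative assumptions):&lt;/p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical spot-by-gene expression matrix: 50 spots, 20 genes.
X = rng.poisson(5.0, size=(50, 20)).astype(float)

# Hold out roughly 10% of entries as the evaluation mask.
mask = rng.integers(0, 10, size=X.shape) == 0
X_obs = X.copy()
X_obs[mask] = np.nan

def impute_gene_mean(X_obs):
    """Baseline imputer: replace missing values with the per-gene mean."""
    out = X_obs.copy()
    col_means = np.nanmean(X_obs, axis=0)
    idx = np.where(np.isnan(out))
    out[idx] = col_means[idx[1]]
    return out

X_hat = impute_gene_mean(X_obs)
rmse = float(np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2)))
```

&lt;p>Each candidate method slots in where &lt;code>impute_gene_mean&lt;/code> sits; repeating the masking across platforms, species, and organs yields the standardized comparison the project calls for.&lt;/p>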
&lt;h3 id="project-deliverables">&lt;strong>Project Deliverables&lt;/strong>&lt;/h3>
&lt;ul>
&lt;li>A comprehensive, preprocessed benchmark dataset that spans multiple sampling platforms, species, and organs, aimed at standardizing gene imputation tasks in spatial transcriptomics.&lt;/li>
&lt;li>An objective comparison of state-of-the-art gene imputation methodologies, enhancing the understanding of their performance and applicability across diverse biological contexts.&lt;/li>
&lt;li>A user-friendly Python package offering a suite of gene imputation tools, designed to fulfill the research needs of the spatial transcriptomics community by improving data completeness and reproducibility.&lt;/li>
&lt;/ul></description></item></channel></rss>