<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Multimodal | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/multimodal/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/multimodal/index.xml" rel="self" type="application/rss+xml"/><description>Multimodal</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Thu, 29 Jan 2026 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>Multimodal</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/multimodal/</link></image><item><title>Omni-ST: Instruction-Driven Any-to-Any Multimodal Modeling for Spatial Transcriptomics</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci-ics/omni-st/</link><pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci-ics/omni-st/</guid><description>&lt;h2 id="project-description">Project description&lt;/h2>
&lt;p>Spatial transcriptomics (ST) integrates spatially resolved gene expression with tissue morphology, enabling the study of cellular organization, tissue architecture, and disease microenvironments. Modern ST datasets are inherently multimodal, combining histology images (H&amp;amp;E / IF), gene expression vectors, spatial graphs, cell annotations, and free-text pathology descriptions.&lt;/p>
&lt;p>However, most existing ST methods are task-specific and modality-siloed: separate models are trained for image-to-gene prediction, spatial domain identification, cell type classification, or text-based interpretation. This fragmentation limits cross-task generalization and scalability.&lt;/p>
&lt;p>
&lt;figure >
&lt;div class="d-flex justify-content-center">
&lt;div class="w-100" >&lt;img alt="Omni-ST overview" srcset="
/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_4023f7915e2a557bcacee3aecd015061.webp 400w,
/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_8d4e33b30dc811f95fb70a843df58532.webp 760w,
/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_1200x1200_fit_q75_h2_lanczos_3.webp 1200w"
src="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre26/uci-ics/omni-st/omni-st-overview_hu23ddd3d57afcbc47e213a42520991f5c_1307894_4023f7915e2a557bcacee3aecd015061.webp"
width="760"
height="664"
loading="lazy" data-zoomable />&lt;/div>
&lt;/div>&lt;/figure>
&lt;/p>
&lt;p>&lt;strong>Omni-ST&lt;/strong> proposes a single &lt;strong>instruction-driven any-to-any multimodal backbone&lt;/strong> that treats each spatial transcriptomics modality as a “language” and formulates all tasks as:&lt;/p>
&lt;p>&lt;strong>Instruction + Input Modality → Output Modality&lt;/strong>&lt;/p>
&lt;p>Natural language is elevated from auxiliary metadata to a &lt;strong>unifying interface&lt;/strong> that specifies task intent, target modality, and biological context. This paradigm enables flexible, interpretable, and extensible spatial reasoning within a single model.&lt;/p>
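&lt;p>To make this paradigm concrete, below is a minimal sketch of how individual tasks could be serialized as instruction-driven training samples; the field names, file names, and instruction wording are illustrative placeholders, not part of the project specification.&lt;/p>
&lt;pre>&lt;code class="language-python"># Illustrative examples of casting heterogeneous ST tasks as
# (instruction, input modality, output modality) triples.
# All field names and values are hypothetical placeholders.
samples = [
    {
        "instruction": "Predict the gene expression profile for this tissue patch.",
        "input":  {"modality": "image", "value": "patch_0132.png"},
        "output": {"modality": "gene_expression", "value": "spot_0132_counts.npy"},
    },
    {
        "instruction": "Identify the spatial domain of this spot.",
        "input":  {"modality": "gene_expression", "value": "spot_0132_counts.npy"},
        "output": {"modality": "label", "value": "tumor_boundary"},
    },
    {
        "instruction": "Describe the biology of the highlighted region.",
        "input":  {"modality": "region", "value": "region_07"},
        "output": {"modality": "text", "value": "Dense lymphocyte infiltration near the tumor margin."},
    },
]
&lt;/code>&lt;/pre>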
&lt;hr>
&lt;h3 id="project-idea-instruction-driven-any-to-any-modeling-for-spatial-transcriptomics">Project Idea: Instruction-Driven Any-to-Any Modeling for Spatial Transcriptomics&lt;/h3>
&lt;p>&lt;strong>Topics:&lt;/strong> spatial transcriptomics, multimodal learning, instruction tuning, computational pathology&lt;br>
&lt;strong>Skills:&lt;/strong> PyTorch, deep learning, Transformers, multimodal representation learning&lt;br>
&lt;strong>Difficulty:&lt;/strong> Hard&lt;br>
&lt;strong>Size:&lt;/strong> 350 hours&lt;/p>
&lt;p>&lt;strong>Mentor:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Xi Li&lt;/strong> — &lt;a href="mailto:xil43@uci.edu">xil43@uci.edu&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>&lt;strong>Essential information:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>Design a unified multimodal backbone with lightweight modality adapters for histology images, gene expression vectors, spatial graphs, and text.&lt;/li>
&lt;li>Use natural language instructions to condition model behavior, enabling any-to-any translation without task-specific heads (a minimal architectural sketch follows this list).&lt;/li>
&lt;li>Support core tasks including image → gene expression prediction, gene expression → cell type / spatial domain identification, region → text-based biological explanation, and text-based spatial retrieval.&lt;/li>
&lt;li>Evaluate the model across multiple spatial transcriptomics tasks within a single framework, emphasizing generalization and interpretability.&lt;/li>
&lt;li>Develop visualization and interpretation tools such as spatial maps and language-grounded explanations.&lt;/li>
&lt;/ul>
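&lt;p>As referenced above, the following is a minimal PyTorch sketch of the kind of architecture described: lightweight per-modality adapters project each modality into a shared token space, and a single transformer is conditioned on instruction tokens. All module names, dimensions, and layer choices are illustrative assumptions, not a prescribed design.&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
import torch.nn as nn

class OmniSTBackbone(nn.Module):
    """Illustrative sketch: a shared transformer over instruction tokens and
    modality tokens. All dimensions and module choices are placeholders."""

    def __init__(self, d_model=512, vocab_size=32000, n_genes=2000):
        super().__init__()
        # Lightweight modality adapters map each modality into the shared token space.
        self.text_embed = nn.Embedding(vocab_size, d_model)   # instructions and text
        self.image_adapter = nn.Linear(768, d_model)           # precomputed histology patch features
        self.gene_adapter = nn.Linear(n_genes, d_model)        # gene expression vectors
        self.graph_adapter = nn.Linear(256, d_model)           # spatial-graph node features
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(encoder_layer, num_layers=6)
        # Output projections are selected by the instruction rather than hard-wired per task.
        self.gene_head = nn.Linear(d_model, n_genes)
        self.text_head = nn.Linear(d_model, vocab_size)

    def forward(self, instruction_ids, modality_tokens, target_modality):
        # The instruction conditions the shared backbone on the task.
        instr = self.text_embed(instruction_ids)
        tokens = torch.cat([instr, modality_tokens], dim=1)
        hidden = self.backbone(tokens)
        pooled = hidden.mean(dim=1)
        if target_modality == "gene_expression":
            return self.gene_head(pooled)
        return self.text_head(pooled)

# Example: image-to-gene-expression prediction for a batch of 4 tissue patches.
model = OmniSTBackbone()
instruction_ids = torch.randint(0, 32000, (4, 16))   # tokenized instruction (placeholder)
patch_features = torch.randn(4, 196, 768)             # e.g. ViT patch features (placeholder)
image_tokens = model.image_adapter(patch_features)
pred_expression = model(instruction_ids, image_tokens, "gene_expression")
&lt;/code>&lt;/pre>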
&lt;p>&lt;strong>Expected deliverables:&lt;/strong>&lt;/p>
&lt;ul>
&lt;li>An open-source PyTorch implementation of the Omni-ST framework.&lt;/li>
&lt;li>Unified multitask benchmarks for spatial transcriptomics.&lt;/li>
&lt;li>Visualization and interpretation tools for spatial predictions.&lt;/li>
&lt;li>Documentation and tutorials demonstrating how to add new tasks via instructions.&lt;/li>
&lt;/ul></description></item><item><title>ReasonWorld</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</guid><description>&lt;h3 id="reasonworld-real-world-reasoning-with-a-long-term-world-model">ReasonWorld: Real-World Reasoning with a Long-Term World Model&lt;/h3>
&lt;p>A world model is an internal representation of an environment that an AI system constructs from external information in order to plan, reason, and interpret its surroundings. It stores the system’s understanding of relevant objects, spatial relationships, and/or states in the environment. Recent augmented reality (AR) and wearable technologies like Meta Aria glasses provide an opportunity to gather rich information from the real world in the form of vision, audio, and spatial data. Alongside this, large language models (LLMs), vision-language models (VLMs), and general machine learning algorithms have enabled nuanced understanding and processing of multimodal inputs, making it possible to label, summarize, and analyze experiences.&lt;/p>
&lt;p>With &lt;strong>ReasonWorld&lt;/strong>, we aim to use these technologies to enable advanced, structured reasoning about important objects, events, and spaces in real-world environments. With the help of wearable AR technology, the system captures real-world multimodal data, which we use to build a long-term memory modeling toolkit that supports features like:&lt;/p>
&lt;ul>
&lt;li>Longitudinal and structured data logging: Capture and storage of multimodal data (image, video, audio, location coordinates, etc.)&lt;/li>
&lt;li>Semantic summarization: Automatic scene labeling via LLMs/VLMs to identify key elements in the surroundings&lt;/li>
&lt;li>Efficient retrieval: Querying and revisiting past experiences to answer questions like “Where have I seen this painting before?” (a minimal retrieval sketch follows this list)&lt;/li>
&lt;li>Adaptability: Continuously refining the system’s understanding of the environment and the relationships between objects and locations.&lt;/li>
&lt;li>Adaptive memory prioritization: The pipeline assesses the contextual significance of captured data and retrieves the most significant items, retaining meaningful, structured representations rather than raw, unfiltered data.&lt;/li>
&lt;/ul>
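&lt;p>As noted in the retrieval item above, one possible shape for these memory records and their retrieval is sketched below with a plain in-memory index; the record fields, significance threshold, and scoring are illustrative assumptions rather than a fixed design.&lt;/p>
&lt;pre>&lt;code class="language-python">from dataclasses import dataclass, field
import numpy as np

@dataclass
class MemoryEvent:
    """Illustrative structured record for one captured AR event."""
    timestamp: float
    location: tuple            # e.g. room or GPS coordinates
    summary: str               # LLM/VLM-generated scene description
    significance: float        # contextual-significance score used for prioritization
    embedding: np.ndarray = field(default=None, repr=False)

class WorldMemory:
    """Minimal in-memory index; a vector database or knowledge graph would
    replace the list and brute-force search in a real pipeline."""

    def __init__(self):
        self.events = []

    def add(self, event, keep_threshold=0.3):
        # Adaptive prioritization: keep only observations judged significant.
        if event.significance >= keep_threshold:
            self.events.append(event)

    def query(self, query_embedding, top_k=5):
        # Rank stored events by cosine similarity to the query embedding.
        def score(ev):
            denom = np.linalg.norm(ev.embedding) * np.linalg.norm(query_embedding)
            return float(np.dot(ev.embedding, query_embedding) / denom)
        ranked = sorted(self.events, key=score, reverse=True)
        return ranked[:top_k]
&lt;/code>&lt;/pre>
&lt;p>A production version would swap the brute-force scan for a vector database or knowledge graph, as described in the indexing plans below.&lt;/p>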
&lt;p>This real-world reasoning framework with a long-term world model can function as a structured search engine for important objects and spaces, enabling:&lt;/p>
&lt;ul>
&lt;li>Recognizing and tracking significant objects, locations, and events&lt;/li>
&lt;li>Supporting spatial understanding and contextual analysis&lt;/li>
&lt;li>Facilitating structured documentation of environments and changes over time&lt;/li>
&lt;/ul>
&lt;h3 id="alignment-with-summer-of-reproducibility">Alignment with Summer of Reproducibility:&lt;/h3>
&lt;ul>
&lt;li>Core pipeline for AR data ingestion, event segmentation, summarization, and indexing (knowledge graph or vector database) would be made open-source.&lt;/li>
&lt;li>Clear documentation of each module and how the modules interact with one another&lt;/li>
&lt;li>The project can be tested with standardized datasets, simulated environments, and controlled real-world scenarios, promoting reproducibility&lt;/li>
&lt;li>Opportunities for innovation: a transparent, modular approach invites a broad community to propose novel extensions&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks:&lt;/h3>
&lt;ul>
&lt;li>Build a pipeline for real-time/batch ingestion and cleaning of data from the wearable AR device&lt;/li>
&lt;li>Build an event segmentation module that classifies whether the current object/event is contextually significant, filtering out less relevant observations.&lt;/li>
&lt;li>Use VLMs/LLMs to summarize events from the vision/audio/location data, storing the summaries for later retrieval in structured stores such as knowledge graphs or vector databases (a skeletal pipeline sketch follows this list).&lt;/li>
&lt;li>Optimize storage by prioritizing important objects and spaces based on contextual significance and frequency of access.&lt;/li>
&lt;li>Implement key information retrieval mechanisms&lt;/li>
&lt;li>Ensure reproducibility by providing datasets and scripts&lt;/li>
&lt;/ul>
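&lt;p>As referenced in the summarization task above, the following is a skeletal view of how these stages might chain together. Every function here is a hypothetical, trivially stubbed placeholder for a module the contributor would design; nothing below is an existing API.&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical stand-ins for modules the contributor would design; each is a
# trivial placeholder so the control flow below runs end to end.
def preprocess(frame):
    return frame

def segment_event(frame):
    # Event segmentation: keep only contextually significant observations.
    return frame if frame.get("significance", 0.0) >= 0.3 else None

def summarize_with_vlm(event):
    # A real pipeline would call an LLM/VLM on the vision/audio/location data.
    return "placeholder scene summary"

def run_pipeline(ar_stream, memory_store):
    """Skeletal loop: ingest, clean, segment, summarize, then index."""
    for frame in ar_stream:                      # 1. real-time/batch ingestion
        clean = preprocess(frame)                #    cleaning / normalization
        event = segment_event(clean)             # 2. event segmentation and filtering
        if event is None:
            continue
        summary = summarize_with_vlm(event)      # 3. semantic summarization
        memory_store.append({"frame": event, "summary": summary})  # 4. indexing

# Example usage on two synthetic frames: only the significant one is stored.
frames = [{"significance": 0.9}, {"significance": 0.1}]
store = []
run_pipeline(frames, store)
&lt;/code>&lt;/pre>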
&lt;h3 id="reasonworld">ReasonWorld&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Augmented reality&lt;/code> &lt;code>Multimodal learning&lt;/code> &lt;code>Computer vision for AR&lt;/code> &lt;code>LLM/VLM&lt;/code> &lt;code>Efficient data indexing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Machine Learning and AI, Augmented Reality and Hardware integration, Data Engineering &amp;amp; Storage Optimization&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:davisje@ucsc.edu">James Davis&lt;/a>, &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>