<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Daniel Wong | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/index.xml" rel="self" type="application/rss+xml"/><description>Daniel Wong</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>Daniel Wong</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/</link></image><item><title>AI for Science: Automating Domain Specific Tasks with Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</link><pubDate>Sun, 23 Feb 2025 21:30:56 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</guid><description>&lt;p>Recent advancements in Large Language Models (LLMs) have transformed various fields by demonstrating remarkable capabilities in processing and generating human-like text. This project aims to explore the development of an open-source framework that leverages LLMs to enhance discovery across specialized domains.&lt;/p>
&lt;p>The proposed framework will enable LLMs to analyze and interpret complex datasets, automate routine tasks, and uncover novel insights. A key focus will be on equipping LLMs with domain-specific expertise, particularly in areas where specialized tools &amp;ndash; such as ANDES &amp;ndash; are not widely integrated with LLM-based solutions. By bridging this gap, the framework will empower researchers and professionals to harness LLMs as intelligent assistants capable of navigating and utilizing niche computational tools effectively.&lt;/p>
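&lt;p>The tool-integration idea above can be sketched as a small registry that maps tool names to Python callables and dispatches a model-emitted JSON tool call. Everything here, including the &lt;code>run_power_flow&lt;/code> stub, is illustrative and hypothetical; a real integration would call the actual API of a tool such as ANDES.&lt;/p>

```python
import json

# Hypothetical sketch of exposing niche domain tools to an LLM.
# All names are illustrative; a real framework would wrap a domain
# tool's actual API instead of the canned stub below.
TOOL_REGISTRY = {}

def register_tool(name):
    """Record a callable under a tool name the LLM can refer to."""
    def wrapper(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return wrapper

@register_tool("run_power_flow")
def run_power_flow(case):
    # Stand-in for a call into a domain tool; returns a canned result.
    return {"case": case, "converged": True}

def dispatch(tool_call_json):
    """Execute an LLM-emitted tool call of the form
    {"tool": name, "arguments": {...}}."""
    call = json.loads(tool_call_json)
    fn = TOOL_REGISTRY[call["tool"]]
    return fn(**call["arguments"])
```

&lt;p>Under this sketch, the LLM only needs to emit a JSON object naming a registered tool; the dispatcher handles execution, which is what lets the model act as an assistant over software it was never trained to run directly.&lt;/p>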
&lt;h3 id="ai-for-science-automating-domain-specific-tasks-with-large-language-models">AI for Science: Automating Domain Specific Tasks with Large Language Models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>AI for Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Experience with LLMs, Prompt Engineering, Fine-Tuning, LLM Frameworks&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Designing an extensible framework that facilitates the integration of LLMs with specialized software and datasets.&lt;/li>
&lt;li>Developing methodologies for fine-tuning LLMs to act as domain experts.&lt;/li>
&lt;li>Implementing strategies for improving tool interoperability, allowing LLMs to interact seamlessly with less commonly used but critical analytical platforms.&lt;/li>
&lt;/ul></description></item><item><title>Smart Batching for Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</guid><description>&lt;p>Sequence tokenization is a crucial step in Large Language Model training, fine-tuning, and inference. User prompts and training data are tokenized and zero-padded before being fed to the model in batches. This process allows models to interpret human language by breaking complex sentences down into simple token units that are numerically represented in a token set. However, padding sequences to maintain batch dimensions can introduce unnecessary overhead if batching is not done carefully.&lt;/p>
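&lt;p>To make the padding overhead concrete, here is an illustrative sketch (names are ours, not the project's code) that zero-pads token-id sequences to the longest sequence in a batch and measures how many slots are wasted on padding.&lt;/p>

```python
# Illustrative only: zero-pad token-id sequences to the longest
# sequence in a batch, and measure how many slots are wasted padding.
def pad_batch(seqs, pad_id=0):
    """Pad each sequence to the batch's longest length."""
    width = max(len(s) for s in seqs)
    return [s + [pad_id] * (width - len(s)) for s in seqs]

def pad_fraction(seqs):
    """Fraction of the padded batch that is padding rather than data."""
    width = max(len(s) for s in seqs)
    total = width * len(seqs)
    real = sum(len(s) for s in seqs)
    return (total - real) / total
```

&lt;p>For example, batching one sequence of four tokens with two sequences of one token each pads the batch to width four, so half of the resulting tensor is zeros that the model must still process.&lt;/p>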
&lt;p>In this project, we introduce Smart Batching, which dynamically groups the sequences in a fine-tuning dataset into batches by length. This method aims to minimize the zero padding required during batching, which can improve fine-tuning and inference speed. We also compare it against other commonly used batching practices (Longest Sequence, Random Shuffling) on key metrics such as runtime and model accuracy.&lt;/p>
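&lt;p>A minimal sketch of the length-sorted batching idea (illustrative names, not the project's implementation): sorting sequences by length before chunking them into batches groups similar lengths together, shrinking the padding needed to square off each batch.&lt;/p>

```python
# Sketch of length-sorted ("smart") batching vs. batching in the
# given order; names are illustrative, not the project's code.
def make_batches(seqs, batch_size, sort_by_length=True):
    order = sorted(seqs, key=len) if sort_by_length else list(seqs)
    return [order[i:i + batch_size]
            for i in range(0, len(order), batch_size)]

def padding_slots(batches):
    """Total number of pad tokens needed across all batches."""
    total = 0
    for batch in batches:
        width = max(len(s) for s in batch)
        total += sum(width - len(s) for s in batch)
    return total
```

&lt;p>On eight sequences of lengths 1, 2, 9, 10, 3, 4, 11, and 12 with a batch size of four, length-sorted batching needs 12 pad tokens versus 36 for the original order, since short and long sequences are no longer mixed in the same batch.&lt;/p>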
&lt;h3 id="project-title">Project Title&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>Fine-Tuning&lt;/code> &lt;code>AI&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, PyTorch, Large Language Models&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Implement an open-source smart batching framework on top of Hugging Face that dynamically groups sequences of similar token length into batches.&lt;/li>
&lt;li>Compare runtime, padding overhead, and model accuracy between smart batching and other commonly used batching practices.&lt;/li>
&lt;li>Apply smart batching to distributed fine-tuning and evaluate large language model outputs.&lt;/li>
&lt;/ul></description></item></channel></rss>