<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Daniel Wong | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/index.xml" rel="self" type="application/rss+xml"/><description>Daniel Wong</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>Daniel Wong</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/</link></image><item><title>AI for Science: Automating Domain Specific Tasks with Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</link><pubDate>Sun, 23 Feb 2025 21:30:56 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/domain-automation/</guid><description>&lt;p>Recent advancements in Large Language Models (LLMs) have transformed various fields by demonstrating remarkable capabilities in processing and generating human-like text. This project aims to explore the development of an open-source framework that leverages LLMs to enhance discovery across specialized domains.&lt;/p>
&lt;p>The proposed framework will enable LLMs to analyze and interpret complex datasets, automate routine tasks, and uncover novel insights. A key focus will be on equipping LLMs with domain-specific expertise, particularly in areas where specialized tools &amp;ndash; such as ANDES &amp;ndash; are not widely integrated with LLM-based solutions. By bridging this gap, the framework will empower researchers and professionals to harness LLMs as intelligent assistants capable of navigating and utilizing niche computational tools effectively.&lt;/p>
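&lt;p>The tool-integration idea above can be sketched as a small registry that maps tool names to Python callables and dispatches a model-emitted JSON tool call. Everything here, including the &lt;code>run_power_flow&lt;/code> stub, is illustrative and hypothetical; a real integration would call the actual API of a tool such as ANDES.&lt;/p>

```python
import json

# Hypothetical sketch of exposing niche domain tools to an LLM.
# All names are illustrative; a real framework would wrap a domain
# tool's actual API instead of the canned stub below.
TOOL_REGISTRY = {}

def register_tool(name):
    """Record a callable under a tool name the LLM can refer to."""
    def wrapper(fn):
        TOOL_REGISTRY[name] = fn
        return fn
    return wrapper

@register_tool("run_power_flow")
def run_power_flow(case):
    # Stand-in for a call into a domain tool; returns a canned result.
    return {"case": case, "converged": True}

def dispatch(tool_call_json):
    """Execute an LLM-emitted tool call of the form
    {"tool": name, "arguments": {...}}."""
    call = json.loads(tool_call_json)
    fn = TOOL_REGISTRY[call["tool"]]
    return fn(**call["arguments"])
```

&lt;p>Under this sketch, the LLM only needs to emit a JSON object naming a registered tool; the dispatcher handles execution, which is what lets the model act as an assistant over software it was never trained to run directly.&lt;/p>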
&lt;h3 id="ai-for-science-automating-domain-specific-tasks-with-large-language-models">AI for Science: Automating Domain Specific Tasks with Large Language Models&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>AI for Science&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, Experience with LLMs, Prompt Engineering, Fine-Tuning, LLM Frameworks&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Medium to Hard&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Designing an extensible framework that facilitates the integration of LLMs with specialized software and datasets.&lt;/li>
&lt;li>Developing methodologies for fine-tuning LLMs to act as domain experts.&lt;/li>
&lt;li>Implementing strategies for improving tool interoperability, allowing LLMs to interact seamlessly with less commonly used but critical analytical platforms.&lt;/li>
&lt;/ul></description></item><item><title>Smart Batching for Large Language Models</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</link><pubDate>Sun, 09 Feb 2025 10:15:56 -0700</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucr/smartbatch/</guid><description>&lt;p>Sequence tokenization is a crucial step in Large Language Model training, fine-tuning, and inference. User prompts and training data are tokenized and zero-padded before being fed to the model in batches. This process allows models to interpret human language by breaking complex sentences down into simple token units that are numerically represented in a token set. However, padding sequences to maintain batch dimensions can introduce unnecessary overhead if batching is not done carefully.&lt;/p>
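&lt;p>To make the padding overhead concrete, here is an illustrative sketch (names are ours, not the project's code) that zero-pads token-id sequences to the longest sequence in a batch and measures how many slots are wasted on padding.&lt;/p>

```python
# Illustrative only: zero-pad token-id sequences to the longest
# sequence in a batch, and measure how many slots are wasted padding.
def pad_batch(seqs, pad_id=0):
    """Pad each sequence to the batch's longest length."""
    width = max(len(s) for s in seqs)
    return [s + [pad_id] * (width - len(s)) for s in seqs]

def pad_fraction(seqs):
    """Fraction of the padded batch that is padding rather than data."""
    width = max(len(s) for s in seqs)
    total = width * len(seqs)
    real = sum(len(s) for s in seqs)
    return (total - real) / total
```

&lt;p>For example, batching one sequence of four tokens with two sequences of one token each pads the batch to width four, so half of the resulting tensor is zeros that the model must still process.&lt;/p>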
&lt;p>In this project, we introduce Smart Batching, which dynamically groups the sequences in a fine-tuning dataset into batches by length. This method aims to minimize the zero padding required during batching, which can improve fine-tuning and inference speed. We also compare it against other commonly used batching practices (Longest Sequence, Random Shuffling) on key metrics such as runtime and model accuracy.&lt;/p>
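&lt;p>A minimal sketch of the length-sorted batching idea (illustrative names, not the project's implementation): sorting sequences by length before chunking them into batches groups similar lengths together, shrinking the padding needed to square off each batch.&lt;/p>

```python
# Sketch of length-sorted ("smart") batching vs. batching in the
# given order; names are illustrative, not the project's code.
def make_batches(seqs, batch_size, sort_by_length=True):
    order = sorted(seqs, key=len) if sort_by_length else list(seqs)
    return [order[i:i + batch_size]
            for i in range(0, len(order), batch_size)]

def padding_slots(batches):
    """Total number of pad tokens needed across all batches."""
    total = 0
    for batch in batches:
        width = max(len(s) for s in batch)
        total += sum(width - len(s) for s in batch)
    return total
```

&lt;p>On eight sequences of lengths 1, 2, 9, 10, 3, 4, 11, and 12 with a batch size of four, length-sorted batching needs 12 pad tokens versus 36 for the original order, since short and long sequences are no longer mixed in the same batch.&lt;/p>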
&lt;h3 id="project-title">Project Title&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics&lt;/strong>: &lt;code>Large Language Models&lt;/code> &lt;code>Fine-Tuning&lt;/code> &lt;code>AI&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills&lt;/strong>: Python, PyTorch, Large Language Models&lt;/li>
&lt;li>&lt;strong>Difficulty&lt;/strong>: Moderate&lt;/li>
&lt;li>&lt;strong>Size&lt;/strong>: Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentor&lt;/strong>: &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/daniel-wong/">Daniel Wong&lt;/a>, &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/luanzheng-lenny-guo/">Luanzheng &amp;ldquo;Lenny&amp;rdquo; Guo&lt;/a>&lt;/li>
&lt;/ul>
&lt;h3 id="project-tasks-and-milestones">Project Tasks and Milestones&lt;/h3>
&lt;ul>
&lt;li>Implement an open-source smart batching framework on top of Hugging Face that dynamically groups sequences of similar token length into batches.&lt;/li>
&lt;li>Compare runtime, padding overhead, and model accuracy between smart batching and other commonly used batching practices.&lt;/li>
&lt;li>Apply smart batching to distributed fine-tuning and evaluate large language model outputs.&lt;/li>
&lt;/ul></description></item></channel></rss>