<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>jayjeetc | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jayjeetc/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jayjeetc/index.xml" rel="self" type="application/rss+xml"/><description>jayjeetc</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Tue, 11 Feb 2025 13:00:00 -0800</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>jayjeetc</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/author/jayjeetc/</link></image><item><title>Vector Embeddings Dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings/</link><pubDate>Tue, 11 Feb 2025 13:00:00 -0800</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings/</guid><description>&lt;h3 id="vector-embeddings-dataset">Vector Embeddings Dataset&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Vector Embeddings&lt;/code> &lt;code>LLMs&lt;/code> &lt;code>Transformers&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> software development, APIs, scripting, Python&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or Large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:jayjeetc@ucsc.edu">Jayjeet Chakraborty&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Several datasets are available for benchmarking vector search algorithms (also known as approximate nearest neighbor, or ANN, algorithms), but none of them represent actual real-world workloads, because they usually contain small vectors of only a few hundred dimensions. For vector search experiments to reflect real-world workloads, we want datasets with several thousand dimensions, like the vectors generated by OpenAI's text-embedding models. This project aims to create a dataset of 1B embeddings from a Wikipedia dataset using open-source models. Ideally, we will produce three versions of this dataset, with 1024-, 4096-, and 8192-dimensional embeddings to start with.&lt;/p></description></item></channel></rss>