<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>gsoc | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/gsoc/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/gsoc/index.xml" rel="self" type="application/rss+xml"/><description>gsoc</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 29 Sep 2025 00:00:00 +0000</lastBuildDate><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>gsoc</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/gsoc/</link></image><item><title>Scenic-RoboSuite Integration: Building the First Working Prototype</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250929-sahil-tgs/</link><pubDate>Mon, 29 Sep 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250929-sahil-tgs/</guid><description>&lt;p>I&amp;rsquo;m &lt;a href="https://sahiltgs.super.site/" target="_blank" rel="noopener">Sahil&lt;/a>, presenting the first working prototype of the Scenic-RoboSuite integration. This &lt;a href="https://sahiltgs.super.site/gsoc/uc-ospo-proposal" target="_blank" rel="noopener">project&lt;/a> is being mentored by &lt;a href="https://ucsc-ospo.github.io/author/daniel-fremont/" target="_blank" rel="noopener">Daniel Fremont&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/eric-vin/" target="_blank" rel="noopener">Eric Vin&lt;/a>.&lt;/p>
&lt;p>After months of development, we have achieved a functional prototype of the &lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a>-&lt;a href="https://robosuite.ai/" target="_blank" rel="noopener">RoboSuite&lt;/a> interface. Researchers can now write basic declarative robotic manipulation scenarios in Scenic that execute with physics simulation in RoboSuite. While still in development, the prototype demonstrates the feasibility and potential of bridging probabilistic scenario generation with detailed robot control.&lt;/p>
&lt;h2 id="major-achievements">Major Achievements&lt;/h2>
&lt;h3 id="mjcf-xml-injection">MJCF XML Injection&lt;/h3>
&lt;p>The interface introduces direct MJCF XML support, allowing Scenic to build RoboSuite-native manipulable objects from raw XML definitions. Users can define custom objects with complex mesh geometries, textures, and physics properties directly in their Scenic scenarios:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">dragon_xml = &amp;#39;&amp;#39;&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;mujoco&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;asset&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;mesh name=&amp;#34;dragon_mesh&amp;#34; file=&amp;#34;dragon.stl&amp;#34; scale=&amp;#34;0.01 0.01 0.01&amp;#34;/&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;texture file=&amp;#34;dragon_texture.png&amp;#34;/&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;/asset&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;worldbody&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;body name=&amp;#34;object&amp;#34;&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;geom mesh=&amp;#34;dragon_mesh&amp;#34; type=&amp;#34;mesh&amp;#34;/&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;/body&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> &amp;lt;/worldbody&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;lt;/mujoco&amp;gt;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">&amp;#39;&amp;#39;&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">dragon = new CustomObject with mjcfXml dragon_xml
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The system automatically handles collision geometry generation, joint creation for physics, and asset file resolution.&lt;/p>
&lt;h3 id="complex-mesh-object-support">Complex Mesh Object Support&lt;/h3>
&lt;p>Import and manipulate arbitrary 3D models (STL, OBJ) with automatic mesh repair and texture mapping. The interface resolves file paths relative to Scenic files, copies assets to temporary directories for MuJoCo, and converts textures (JPG to PNG) when needed. This enables using custom robotic tools, industrial parts, or any 3D model in manipulation scenarios.&lt;/p>
&lt;h3 id="custom-arena-definition">Custom Arena Definition&lt;/h3>
&lt;p>Define complete custom environments using MJCF XML, extending beyond RoboSuite&amp;rsquo;s built-in arenas:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">custom_arena = new CustomArena with arenaXml localPath(&amp;#34;warehouse.xml&amp;#34;)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This allows creating specialized workspaces, factory floors, or research-specific environments while maintaining full physics simulation.&lt;/p>
&lt;h3 id="multi-robot-support">Multi-Robot Support&lt;/h3>
&lt;p>The interface handles multiple robots operating in the same workspace:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">robot1 = new Panda at (-0.5, 0, 0)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">robot2 = new UR5e at (0.5, 0, 0)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">table = new Table at (0, 0, 0.425)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Each robot maintains independent control and can execute coordinated or individual behaviors.&lt;/p>
&lt;h3 id="built-in-manipulation-behaviors">Built-in Manipulation Behaviors&lt;/h3>
&lt;p>Ready-to-use behaviors for immediate testing and development:&lt;/p>
&lt;ul>
&lt;li>&lt;code>MoveToPosition&lt;/code> - Precise end-effector positioning&lt;/li>
&lt;li>&lt;code>PickObject&lt;/code> - Automated grasping with approach and closure&lt;/li>
&lt;li>&lt;code>LiftToHeight&lt;/code> - Controlled lifting to target heights&lt;/li>
&lt;li>&lt;code>PickAndLift&lt;/code> - Complete pick-and-lift sequence&lt;/li>
&lt;/ul>
&lt;p>These behaviors use Operational Space Control (OSC) for intuitive 3D movement commands.&lt;/p>
&lt;h3 id="extended-environment-configuration">Extended Environment Configuration&lt;/h3>
&lt;p>The interface extends RoboSuite&amp;rsquo;s configurability through Scenic&amp;rsquo;s parameter system:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">param controller_config = {&amp;#39;type&amp;#39;: &amp;#39;OSC_POSITION&amp;#39;, &amp;#39;impedance&amp;#39;: &amp;#39;low&amp;#39;}
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">param camera_view = &amp;#39;robot0_eye_in_hand&amp;#39;
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">param lite_physics = True # Faster simulation for testing
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="example-probabilistic-pick-and-place">Example: Probabilistic Pick-and-Place&lt;/h2>
&lt;div class="highlight">&lt;pre tabindex="0" class="chroma">&lt;code class="language-fallback" data-lang="fallback">&lt;span class="line">&lt;span class="cl">model scenic.simulators.robosuite.model
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Randomly position cube on table
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">table = new Table at (0.6, 0, 0.425)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">cube = new Box on table,
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> with color (1, 0, 0, 1),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> with position (Range(-0.2, 0.2), Range(-0.2, 0.2), _)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"># Robot adapts to random cube position
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">behavior AdaptivePickup():
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> do PickAndLift(cube, height=1.1)
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl">ego = new Panda at (0, 0, 0),
&lt;/span>&lt;/span>&lt;span class="line">&lt;span class="cl"> with behavior AdaptivePickup()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Each scenario run generates a different cube position, testing the robot&amp;rsquo;s adaptive capabilities.&lt;/p>
&lt;h2 id="challenges-overcome">Challenges Overcome&lt;/h2>
&lt;h3 id="understanding-dual-architecture-paradigms">Understanding Dual Architecture Paradigms&lt;/h3>
&lt;p>RoboSuite and Scenic operate on fundamentally different principles. RoboSuite builds environments imperatively through MuJoCo XML composition, expecting complete scene specification upfront. Scenic generates scenes probabilistically through constraint solving, requiring geometric knowledge before simulation. Bridging these required developing a two-pass system where we first extract geometry from a temporary RoboSuite environment, update Scenic&amp;rsquo;s understanding, then create the final simulation. This architectural mismatch touched every aspect of the integration, from object creation to property updates.&lt;/p>
&lt;h3 id="discovering-and-extending-manipulationenv">Discovering and Extending ManipulationEnv&lt;/h3>
&lt;p>RoboSuite&amp;rsquo;s documentation focuses on using pre-built tasks, not creating custom environments. Through extensive source code analysis, we discovered that &lt;code>ManipulationEnv&lt;/code> was the key - it accepts robots as configuration while allowing customizable arenas and objects as components. This class became our foundation, but required significant extension. We implemented &lt;code>ScenicManipulationEnv&lt;/code> to intercept Scenic&amp;rsquo;s object configurations, handle dynamic arena selection (EmptyArena vs MultiTableArena based on scene content), and manage the complex initialization sequence where robots, arenas, and objects must be assembled in specific order for MuJoCo compilation.&lt;/p>
&lt;h3 id="xml-to-3d-mesh-pipeline">XML to 3D Mesh Pipeline&lt;/h3>
&lt;p>Converting MJCF XML to usable 3D meshes proved complex. MuJoCo uses XML to describe geometry, but Scenic needs actual mesh data for collision checking. We built a multi-stage pipeline: First, &lt;code>ElementTree&lt;/code> parses the XML to extract mesh references and primitive definitions. Then, we handle two paths - for mesh files, we load STL/OBJ files with trimesh and apply XML-specified transformations; for primitives (boxes, cylinders), we generate meshes programmatically. The challenge intensified with composite objects - a table might have a box tabletop and four cylinder legs. We developed &lt;code>ComponentExtractor&lt;/code> to analyze the MuJoCo scene graph, identify related geometries through naming patterns and hierarchy, and export each component as a separate GLB file with proper world transforms preserved.&lt;/p>
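&lt;p>As a rough illustration of the first stage only, the sketch below uses &lt;code>ElementTree&lt;/code> to separate mesh references from primitive geoms. It is a minimal sketch under stated assumptions, not the actual implementation in the interface: the function name &lt;code>summarize_mjcf&lt;/code> and the returned structure are hypothetical, and the sample document is built programmatically rather than loaded from a real asset.&lt;/p>

```python
import xml.etree.ElementTree as ET


def summarize_mjcf(root):
    """Split an MJCF tree into mesh assets and primitive geoms.

    Hypothetical helper for illustration; not part of Scenic or RoboSuite.
    """
    mesh_assets = {}
    for mesh in root.iter("mesh"):
        file_name = mesh.get("file")
        if file_name is None:
            continue
        # MuJoCo names an unnamed mesh after its file, minus path and extension.
        stem = file_name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        name = mesh.get("name", stem)
        scale = [float(v) for v in mesh.get("scale", "1 1 1").split()]
        mesh_assets[name] = {"file": file_name, "scale": scale}

    primitives = []
    for geom in root.iter("geom"):
        if geom.get("mesh") is not None:
            continue  # mesh-backed geoms are loaded from files instead
        # MJCF geoms default to type "sphere" when no type is given.
        size = [float(v) for v in geom.get("size", "").split()]
        primitives.append({"type": geom.get("type", "sphere"), "size": size})
    return mesh_assets, primitives


# Build a small sample: one mesh-backed geom plus a box primitive.
sample = ET.Element("mujoco")
asset = ET.SubElement(sample, "asset")
ET.SubElement(asset, "mesh", file="meshes/dragon.stl", scale="0.01 0.01 0.01")
body = ET.SubElement(ET.SubElement(sample, "worldbody"), "body", name="object")
ET.SubElement(body, "geom", type="mesh", mesh="dragon")
ET.SubElement(body, "geom", type="box", size="0.4 0.4 0.02")

meshes, prims = summarize_mjcf(sample)
```

&lt;p>In a pipeline like the one described, each mesh entry would then go to a mesh loader and each primitive to a programmatic mesh generator.&lt;/p>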
&lt;h3 id="file-path-resolution-discrepancies">File Path Resolution Discrepancies&lt;/h3>
&lt;p>Scenic and RoboSuite handle file paths completely differently. Scenic uses &lt;code>localPath()&lt;/code> for paths relative to the scenario file, while RoboSuite expects paths relative to its package structure or absolute paths. MJCF XML compounds this - mesh references can be relative to the XML file location, not the calling code. We implemented a sophisticated path resolution system: detect whether paths come from embedded XML (relative to Scenic file) or external XML files (relative to XML location), copy all referenced assets (meshes, textures) to temporary directories accessible to MuJoCo, and handle texture format conversion (JPG to PNG) when needed. This system transparently manages assets whether they&amp;rsquo;re in the Scenic project, RoboSuite package, or absolute paths, making the interface truly portable.&lt;/p>
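&lt;p>The resolution rule described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in the interface: &lt;code>resolve_asset&lt;/code> and &lt;code>stage_assets&lt;/code> are hypothetical names, only the standard library is used, and texture conversion is left out.&lt;/p>

```python
import shutil
import tempfile
from pathlib import Path


def resolve_asset(ref, scenic_dir, xml_path=None):
    """Resolve one asset reference to an absolute path.

    Assumed rule: absolute paths pass through, references from an external
    XML file resolve against that file's directory, and references from
    embedded XML resolve against the directory of the Scenic file.
    """
    path = Path(ref)
    if path.is_absolute():
        return path
    base = Path(xml_path).parent if xml_path is not None else Path(scenic_dir)
    return (base / path).resolve()


def stage_assets(refs, scenic_dir, xml_path=None):
    """Copy every referenced asset into a fresh temporary directory.

    Returns a dict mapping each original reference to its staged copy.
    Files are staged by base name, so this sketch assumes unique names.
    """
    staging = Path(tempfile.mkdtemp(prefix="staged_assets_"))
    staged = {}
    for ref in refs:
        source = resolve_asset(ref, scenic_dir, xml_path)
        target = staging / source.name
        shutil.copy2(source, target)
        staged[ref] = target
    return staged
```

&lt;p>Calling &lt;code>stage_assets(["meshes/dragon.stl"], scenic_dir)&lt;/code> would copy the mesh next to any other staged files, so the simulator only ever sees one flat, known directory.&lt;/p>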
&lt;h2 id="impact-and-applications">Impact and Applications&lt;/h2>
&lt;p>This bridge enables:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Research&lt;/strong>: Generate diverse manipulation scenarios for robot learning algorithms&lt;/li>
&lt;li>&lt;strong>Testing&lt;/strong>: Validate robotic systems against probabilistic task variations&lt;/li>
&lt;li>&lt;strong>Development&lt;/strong>: Rapid prototyping of manipulation tasks without manual scene setup&lt;/li>
&lt;li>&lt;strong>Education&lt;/strong>: Teach robotics concepts through declarative scenario specification&lt;/li>
&lt;/ul>
&lt;p>The integration makes complex robotic simulations accessible through Scenic&amp;rsquo;s intuitive language while preserving RoboSuite&amp;rsquo;s detailed physics and control capabilities.&lt;/p>
&lt;h2 id="documentation-and-resources">Documentation and Resources&lt;/h2>
&lt;p>The project includes:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Example scenarios&lt;/strong> demonstrating all features&lt;/li>
&lt;li>&lt;strong>Comprehensive STATUS.md&lt;/strong> tracking working features and known issues&lt;/li>
&lt;li>&lt;strong>Technical documentation&lt;/strong> in &lt;code>docs/&lt;/code> covering architecture and troubleshooting&lt;/li>
&lt;li>&lt;strong>Mesh extraction utilities&lt;/strong> for pre-processing and caching&lt;/li>
&lt;/ul>
&lt;h2 id="current-status-and-future-work">Current Status and Future Work&lt;/h2>
&lt;p>This prototype demonstrates that the Scenic-RoboSuite bridge is viable and functional. Basic features are working reliably:&lt;/p>
&lt;ul>
&lt;li>Single-robot manipulation scenarios execute successfully&lt;/li>
&lt;li>MJCF XML injection creates custom objects&lt;/li>
&lt;li>Pick-and-place behaviors operate consistently&lt;/li>
&lt;li>Multi-robot support functions in controlled scenarios&lt;/li>
&lt;/ul>
&lt;p>However, significant work remains:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Stability improvements&lt;/strong>: Some features work intermittently and need refinement&lt;/li>
&lt;li>&lt;strong>Velocity tracking&lt;/strong>: Full implementation awaits framework updates&lt;/li>
&lt;li>&lt;strong>Multi-robot coordination&lt;/strong>: Advanced synchronization primitives needed&lt;/li>
&lt;li>&lt;strong>Performance optimization&lt;/strong>: Mesh extraction and caching can be streamlined&lt;/li>
&lt;li>&lt;strong>Extended testing&lt;/strong>: More diverse scenarios and edge cases need validation&lt;/li>
&lt;/ul>
&lt;p>The prototype serves as a proof of concept, showing that probabilistic scenario specification can successfully drive physics-based robot simulation. The architecture is sound, the core features function, and the path forward is clear.&lt;/p>
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>This working prototype of the Scenic-RoboSuite integration represents significant progress toward bridging probabilistic programming with robotic simulation. We&amp;rsquo;ve successfully demonstrated that declarative scenario specification can control detailed physics simulation, opening new possibilities for robotic system development and testing.&lt;/p>
&lt;p>While not yet production-ready, the prototype provides a solid foundation for future development. Researchers can begin experimenting with basic manipulation scenarios, developers can test the interface with their use cases, and the community can contribute to making this bridge more robust and feature-complete.&lt;/p>
&lt;p>The challenges overcome - from understanding dual architectures to implementing XML-to-mesh pipelines - have resulted in a functional system that validates our approach. This prototype proves that Scenic&amp;rsquo;s elegant scenario language and RoboSuite&amp;rsquo;s detailed physics can work together, setting the stage for a powerful new tool in robotics research and development.&lt;/p></description></item><item><title>Robot Manipulation with Scenic-RoboSuite</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250730-sahil-tgs/</link><pubDate>Wed, 30 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250730-sahil-tgs/</guid><description>&lt;p>I&amp;rsquo;m &lt;a href="https://sahiltgs.super.site/" target="_blank" rel="noopener">Sahil&lt;/a>, continuing work on the Scenic-RoboSuite integration for GSoC 2025. This &lt;a href="https://sahiltgs.super.site/gsoc/uc-ospo-proposal" target="_blank" rel="noopener">project&lt;/a> is mentored by &lt;a href="https://ucsc-ospo.github.io/author/daniel-fremont/" target="_blank" rel="noopener">Daniel Fremont&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/eric-vin/" target="_blank" rel="noopener">Eric Vin&lt;/a>.&lt;/p>
&lt;p>Since the last update, the &lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a>-&lt;a href="https://robosuite.ai/" target="_blank" rel="noopener">RoboSuite&lt;/a> interface has made significant progress. The bidirectional bridge is now functional - robots can read sensor data and execute behaviors based on observations. However, these features are still in early stages and we&amp;rsquo;re working on making them more stable and consistent.&lt;/p>
&lt;p>We&amp;rsquo;ve integrated RoboSuite&amp;rsquo;s Operational Space Control into Scenic. This control method lets you command the robot&amp;rsquo;s hand directly in 3D space (like &amp;ldquo;move 10cm left&amp;rdquo;) instead of calculating complex joint rotations. While the integration works, it&amp;rsquo;s rough around the edges and we&amp;rsquo;re currently focused on stabilizing it across different scenarios.&lt;/p>
&lt;p>The main challenge was architectural - RoboSuite expects all robot commands bundled together each timestep, while Scenic processes them one by one. We solved this with a pending actions system that collects everything first, then executes in one go. Time synchronization was another challenge, matching Scenic&amp;rsquo;s steps with MuJoCo&amp;rsquo;s physics.&lt;/p>
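&lt;p>The pending actions idea can be sketched in a few lines. This is only an illustration under assumptions, not the code in the interface: the class name &lt;code>PendingActions&lt;/code>, the fixed per-robot action lengths, and the choice of a zero action for robots that issue no command are all hypothetical.&lt;/p>

```python
class PendingActions:
    """Collect per-robot actions, then emit one combined action vector.

    Hypothetical class for illustration; names and layout are assumptions.
    """

    def __init__(self, action_dims):
        # action_dims maps robot name to action length, in simulator order.
        self.action_dims = dict(action_dims)
        self._pending = {}

    def submit(self, robot, action):
        """Record one robot's action for the current timestep."""
        action = [float(v) for v in action]
        if len(action) != self.action_dims[robot]:
            raise ValueError(f"{robot} expects {self.action_dims[robot]} values")
        self._pending[robot] = action

    def flush(self):
        """Return the concatenated action and reset for the next timestep."""
        combined = []
        for robot, dim in self.action_dims.items():
            # Robots that submitted nothing this step get a zero action.
            combined.extend(self._pending.get(robot, [0.0] * dim))
        self._pending.clear()
        return combined


queue = PendingActions({"robot0": 4, "robot1": 4})
queue.submit("robot1", [0.1, 0.0, 0.0, -1.0])
step_action = queue.flush()  # would be passed to a single env.step(...) call
```

&lt;p>Collecting first and flushing once per timestep keeps the one-action-at-a-time style on the Scenic side while still handing the simulator a single bundled command.&lt;/p>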
&lt;p>We&amp;rsquo;ve implemented a basic pick-and-place behavior for testing. The robot reads sensor data, calculates where to move, and adjusts continuously. It can successfully grasp and lift objects, though consistency varies between runs. The system supports three robot models and works with RoboSuite&amp;rsquo;s pre-built environments.&lt;/p>
&lt;p>Custom world building is currently on hold. We&amp;rsquo;ve decided to focus on integrating existing RoboSuite features into Scenic first, then build Scenic&amp;rsquo;s capabilities like dynamic scenario randomization on top. For our first prototype, we&amp;rsquo;re aiming to extend the pick-and-place behavior into a full randomization demo - Scenic will randomly position the cube each run, and the robot will adapt to find and grasp it regardless of location.&lt;/p>
&lt;p>The next two weeks focus on stabilizing current features and preparing this randomized scenario prototype. Expanding the behavior library and supporting additional environments will come in future phases after we have a solid foundation.&lt;/p>
&lt;p>The core bridge between Scenic and RoboSuite is operational, but there&amp;rsquo;s significant work ahead to make it reliable and user-friendly.&lt;/p></description></item><item><title>Midway Through GSoC</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</link><pubDate>Mon, 14 Jul 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14072025-devadigapratham/</guid><description>&lt;h1 id="midway-through-gsoc">Midway Through GSoC&lt;/h1>
&lt;p>Hello everyone! I’m Pratham Devadiga, and I’m thrilled to share a midterm progress update on my &lt;a href="https://summerofcode.withgoogle.com/programs/2025/projects/GcstSGAO" target="_blank" rel="noopener">GSoC 2025 project&lt;/a> with the Open Source Research Experience (OSRE). My project is focused on building the &lt;strong>first open-source billion-scale vector embeddings dataset&lt;/strong> from &lt;strong>real-world open source code&lt;/strong> to support benchmarking of Approximate Nearest Neighbor (ANN) algorithms and facilitate research in Retrieval-Augmented Generation (RAG).&lt;/p>
&lt;h2 id="project-overview">Project Overview&lt;/h2>
&lt;p>The goal of this project is to address a critical gap in the ecosystem: existing ANN benchmarks are either synthetic or limited in scale. With the explosion of code-focused LLMs and embedding models, there&amp;rsquo;s a pressing need for:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>High-volume, high-dimensional vector datasets&lt;/strong> built from real-world data (open-source codebases).&lt;/li>
&lt;li>&lt;strong>Open, reproducible benchmarks&lt;/strong> that reflect realistic RAG workloads.&lt;/li>
&lt;li>A dataset that can be used to evaluate &lt;strong>ANN libraries&lt;/strong> like FAISS, HNSW, and Annoy on massive and practical retrieval tasks.&lt;/li>
&lt;/ul>
&lt;p>Our approach is to use high-quality open-source code repositories to extract meaningful code chunks, encode them into vector embeddings using open models, and make these datasets publicly available with metadata for downstream benchmarking and analysis.&lt;/p>
&lt;h2 id="progress-so-far">Progress So Far&lt;/h2>
&lt;p>We’ve made substantial foundational progress in the first half of the coding period. Key highlights:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Tested multiple embedding models&lt;/strong> such as &lt;code>codeBERT&lt;/code>, &lt;code>MiniLM-L6-v2&lt;/code>, and &lt;code>all-mpnet-base-v2&lt;/code>, evaluating trade-offs in speed, dimensionality, and GPU memory.&lt;/li>
&lt;li>&lt;strong>Selected &lt;code>codebert-base&lt;/code>&lt;/strong> (768d) as the current model for phase one due to its stable performance and manageable resource footprint.&lt;/li>
&lt;li>Implemented and validated a complete &lt;strong>script pipeline&lt;/strong> to:
&lt;ul>
&lt;li>Traverse large open-source repositories.&lt;/li>
&lt;li>Extract and chunk code intelligently (functions, classes, modules).&lt;/li>
&lt;li>Encode code into embeddings and attach metadata (repo, file path, license).&lt;/li>
&lt;li>Store results efficiently in parquet and NumPy formats.&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>&lt;strong>Tested all components&lt;/strong> of the pipeline on sample datasets using multi-GPU setups, ensuring compatibility and robustness.&lt;/li>
&lt;/ul>
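&lt;p>To make the extract-and-chunk step concrete, here is a minimal sketch that handles Python sources only, using the standard &lt;code>ast&lt;/code> module. It is not the project pipeline: the function name &lt;code>extract_chunks&lt;/code>, the record fields, and the restriction to top-level functions and classes are assumptions made for illustration.&lt;/p>

```python
import ast


def extract_chunks(source, repo, file_path, license_id):
    """Return one metadata record per top-level function or class.

    Hypothetical helper for illustration; field names are assumptions.
    """
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    chunks = []
    for node in tree.body:
        if not isinstance(node, wanted):
            continue  # skip imports, assignments, and other statements
        chunks.append({
            "repo": repo,
            "file_path": file_path,
            "license": license_id,
            "kind": type(node).__name__,
            "name": node.name,
            # Exact source text of the definition, ready to be embedded.
            "code": ast.get_source_segment(source, node),
        })
    return chunks


sample = '''
import os

def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''

chunks = extract_chunks(sample, "example/repo", "pkg/util.py", "MIT")
```

&lt;p>Each record keeps the code text next to its repository, path, and license, so a later stage can encode the &lt;code>code&lt;/code> field and store the rest as metadata.&lt;/p>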
&lt;h2 id="challenges-and-learnings">Challenges and Learnings&lt;/h2>
&lt;p>Building a billion-scale dataset from real-world codebases is no small task. Here&amp;rsquo;s what we’ve encountered and learned along the way:&lt;/p>
&lt;h3 id="1-multi-gpu-pipeline-design">1. Multi-GPU Pipeline Design&lt;/h3>
&lt;p>Naively parallelizing the embedding process caused memory overflow and deadlocks due to model reloading across processes. We refactored the code using &lt;code>torch.multiprocessing&lt;/code> and pinned GPU contexts to avoid such issues, improving throughput on multi-GPU machines.&lt;/p>
&lt;h3 id="2-embedding-trade-offs">2. Embedding Trade-offs&lt;/h3>
&lt;p>We experimented with larger models but found that their generation time and memory use were too high to be practical in early phases. This helped us narrow down to scalable configurations for initial dataset generation.&lt;/p>
&lt;h3 id="3-preparing-for-scale">3. Preparing for Scale&lt;/h3>
&lt;p>Although the embeddings are not generated yet, all scripts are now &lt;strong>modular, parallelized, and reproducible&lt;/strong>, ensuring a smooth transition to billion-scale data generation in the second half.&lt;/p>
&lt;h2 id="whats-next">What’s Next&lt;/h2>
&lt;p>The second half of the project will focus on:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Scaling up embedding generation&lt;/strong> to &amp;gt;1B code chunks across hundreds of open-source repositories.&lt;/li>
&lt;li>&lt;strong>Running benchmarks&lt;/strong> using FAISS, HNSW, and Annoy on these embeddings.&lt;/li>
&lt;li>&lt;strong>Releasing the dataset&lt;/strong> on Hugging Face and AWS S3 with sharded access and metadata.&lt;/li>
&lt;li>&lt;strong>Writing a detailed benchmarking report&lt;/strong> comparing speed, accuracy, and memory trade-offs across ANN algorithms.&lt;/li>
&lt;/ul>
&lt;h2 id="final-thoughts">Final Thoughts&lt;/h2>
&lt;p>This journey so far has taught me a lot about building large-scale ML pipelines, managing real-world compute constraints, and ensuring reproducibility for research-grade datasets. I&amp;rsquo;m grateful to my mentor &lt;strong>Jayjeet Chakraborty&lt;/strong> and the OSRE team for their continuous support and guidance.&lt;/p>
&lt;p>Excited for the next half, where the real scale begins!&lt;/p>
&lt;p>Stay tuned for updates. You can find more about the project on my &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">OSRE project page&lt;/a>.&lt;/p></description></item><item><title>Building a Billion-Scale Vector Embeddings Dataset</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/embeddings/14062025-devadigapratham/</guid><description>&lt;h1 id="billion-vector-embeddings-dataset">Billion Vector Embeddings Dataset&lt;/h1>
&lt;p>As part of the &lt;a href="https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/embeddings">Billion-Scale Embeddings Dataset project&lt;/a>, my &lt;a href="GSoC-proposal.pdf">proposal&lt;/a> under the mentorship of &lt;strong>Jayjeet Chakraborty&lt;/strong> aims to create the first large-scale, real-world vector embeddings dataset—bridging the critical gap in Approximate Nearest Neighbor (ANN) benchmarks and Retrieval-Augmented Generation (RAG) systems.&lt;/p>
&lt;h2 id="motivation">Motivation&lt;/h2>
&lt;p>Existing ANN benchmarks often fall short—they’re either synthetic (like SIFT) or too small-scale (≤1M vectors). With the rapid evolution of LLM-based vector search systems (e.g., OpenAI’s 3072d &lt;code>text-embedding-3-large&lt;/code>), there&amp;rsquo;s a growing need for:&lt;/p>
&lt;ul>
&lt;li>High-dimensional (&amp;gt;1000d), large-scale (&amp;gt;100M) embeddings&lt;/li>
&lt;li>Real-world distributions (Wikipedia-scale text)&lt;/li>
&lt;li>Open, reproducible benchmarks for the community&lt;/li>
&lt;/ul>
&lt;h2 id="project-goals">Project Goals&lt;/h2>
&lt;ul>
&lt;li>Generate &lt;strong>1 billion&lt;/strong> embeddings from English Wikipedia using open-source models.&lt;/li>
&lt;li>Create multiple dimensional variants: &lt;strong>1024d&lt;/strong>, &lt;strong>4096d&lt;/strong>, and &lt;strong>8192d&lt;/strong>.&lt;/li>
&lt;li>Deduplicate, compress, and store embeddings with rich metadata (URL, timestamps, models).&lt;/li>
&lt;li>Benchmark ANN performance on FAISS, HNSW, and Annoy.&lt;/li>
&lt;li>Distribute the dataset via Hugging Face &amp;amp; AWS S3 with shard-level access.&lt;/li>
&lt;/ul>
&lt;h2 id="open-source-impact">Open Source Impact&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>ANN Libraries&lt;/strong>: Enable reproducible benchmarking for real-world workloads.&lt;/li>
&lt;li>&lt;strong>RAG Systems&lt;/strong>: Evaluate and optimize retrieval at scale using real Wikipedia text.&lt;/li>
&lt;li>&lt;strong>Researchers&lt;/strong>: Conduct large-scale studies on dimensionality, ANN accuracy, and compression trade-offs.&lt;/li>
&lt;/ul>
&lt;hr></description></item><item><title>Introducing Scenic-RoboSuite Interface</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250616-sahil-tgs/</link><pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/report/osre25/ucsc/scenic/20250616-sahil-tgs/</guid><description>&lt;p>Hey! I&amp;rsquo;m &lt;a href="https://sahiltgs.super.site/" target="_blank" rel="noopener">Sahil&lt;/a>, working on integrating Scenic with RoboSuite for GSoC 2025. My &lt;a href="https://sahiltgs.super.site/gsoc/uc-ospo-proposal" target="_blank" rel="noopener">project&lt;/a> is mentored by &lt;a href="https://ucsc-ospo.github.io/author/daniel-fremont/" target="_blank" rel="noopener">Daniel Fremont&lt;/a> and &lt;a href="https://ucsc-ospo.github.io/author/eric-vin/" target="_blank" rel="noopener">Eric Vin&lt;/a>.&lt;/p>
&lt;p>I&amp;rsquo;m connecting &lt;a href="https://scenic-lang.org/" target="_blank" rel="noopener">Scenic&lt;/a> (a probabilistic programming language for scenarios) with &lt;a href="https://robosuite.ai/" target="_blank" rel="noopener">RoboSuite&lt;/a> (a robotics simulation framework). Basically, you write simple scenario descriptions and get complex 3D robot simulations automatically.&lt;/p>
&lt;p>So far, while learning how Scenic works, I&amp;rsquo;ve gotten the basic skeleton of the simulator interface working. I&amp;rsquo;ve implemented the simulator class and built a world model that can translate Scenic objects into RoboSuite&amp;rsquo;s simulator (which is MuJoCo-based). The interface now handles precise object placement in the world pretty well.&lt;/p>
&lt;p>One of the trickier parts was figuring out the translation logic between Scenic and RoboSuite. I managed to overcome this by building a system that automatically detects the shape of objects when moving between the two frameworks, which lays a foundation for more complex object mapping later on.&lt;/p>
&lt;p>I&amp;rsquo;ve also built some basic example scenarios to run and test with. I&amp;rsquo;m currently working on more complex examples and testing Scenic&amp;rsquo;s features like probabilistic object placement, constraint satisfaction, and spatial relationships between objects.&lt;/p>
&lt;p>In summary, the &amp;ldquo;Scenic to RoboSuite&amp;rdquo; part of the interface is pretty much done. For next week, I need to work on the &amp;ldquo;RoboSuite to Scenic&amp;rdquo; part - basically getting feedback and state information flowing back from the simulation. Achieving this will make a complete bridge and give us a working simulator interface, which is the first major milestone for the project.&lt;/p></description></item></channel></rss>