<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>ar/vr | UCSC OSPO</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/ar/vr/</link><atom:link href="https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/ar/vr/index.xml" rel="self" type="application/rss+xml"/><description>ar/vr</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><image><url>https://deploy-preview-1007--ucsc-ospo.netlify.app/media/logo_hub6795c39d7c5d58c9535d13299c9651f_74810_300x300_fit_lanczos_3.png</url><title>ar/vr</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/tag/ar/vr/</link></image><item><title>ReasonWorld</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</link><pubDate>Fri, 28 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/reason-world/</guid><description>&lt;h3 id="reasonworld-real-world-reasoning-with-a-long-term-world-model">ReasonWorld: Real-World Reasoning with a Long-Term World Model&lt;/h3>
&lt;p>A world model is essentially an internal representation of an environment that an AI system constructs from external information in order to plan, reason, and interpret its surroundings. It stores the system’s understanding of relevant objects, spatial relationships, and/or states in the environment. Recent augmented reality (AR) and wearable technologies like Meta Aria glasses provide an opportunity to gather rich information from the real world in the form of vision, audio, and spatial data. In parallel, large language models (LLMs), vision-language models (VLMs), and general machine learning algorithms have enabled nuanced understanding and processing of multimodal inputs, making it possible to label, summarize, and analyze experiences.&lt;/p>
&lt;p>With &lt;strong>ReasonWorld&lt;/strong>, we aim to use these technologies to enable advanced, structured reasoning about important objects, events, and spaces in real-world environments. Wearable AR technology lets the system capture real-world multimodal data, and we aim to build on this data a long-term memory modeling toolkit that supports features like:&lt;/p>
&lt;ul>
&lt;li>Longitudinal and structured data logging: Capturing and storing multimodal data (image, video, audio, location coordinates, etc.)&lt;/li>
&lt;li>Semantic summarization: Automatic scene labeling via LLMs/VLMs to identify key elements in the surroundings&lt;/li>
&lt;li>Efficient retrieval: For querying and revisiting past experiences and answering questions like “Where have I seen this painting before?”&lt;/li>
&lt;li>Adaptability: Continuously refining the system’s understanding of the environment and of relationships between objects and locations.&lt;/li>
&lt;li>Adaptive memory prioritization: The pipeline assesses the contextual significance of captured data and retrieves the items that matter most, so the model retains meaningful, structured representations rather than raw, unfiltered data (see the record sketch after this list).&lt;/li>
&lt;/ul>
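&lt;p>As a rough illustration of what such a structured representation could look like, the following is a minimal sketch of an event record; the field names and types are hypothetical, not a fixed design:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical sketch of a structured world-model record; fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class WorldEvent:
    timestamp: float                                  # capture time (seconds since epoch)
    location: tuple                                   # GPS or local coordinates
    modalities: dict = field(default_factory=dict)    # paths to image/video/audio clips
    summary: str = ""                                 # LLM/VLM-generated scene description
    entities: list = field(default_factory=list)      # key objects, e.g. ["painting", "doorway"]
    embedding: list = field(default_factory=list)     # vector used for similarity search
    significance: float = 0.0                         # score used for memory prioritization
&lt;/code>&lt;/pre>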
&lt;p>This real-world reasoning framework with a long-term world model can function as a structured search engine for important objects and spaces (a retrieval sketch follows the list below), enabling:&lt;/p>
&lt;ul>
&lt;li>Recognizing and tracking significant objects, locations, and events&lt;/li>
&lt;li>Supporting spatial understanding and contextual analysis&lt;/li>
&lt;li>Facilitating structured documentation of environments and changes over time&lt;/li>
&lt;/ul>
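&lt;p>Under the hypothetical record format sketched above, answering a question like “Where have I seen this painting before?” could reduce to a nearest-neighbour query over stored embeddings; the embedding function and event store are placeholders here:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical retrieval sketch: rank stored WorldEvent records by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def where_have_i_seen(query_embedding, events, top_k=3):
    """Return the top_k past events most similar to the query, e.g. sightings of a painting."""
    ranked = sorted(events, key=lambda e: cosine(query_embedding, e.embedding), reverse=True)
    return ranked[:top_k]
&lt;/code>&lt;/pre>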
&lt;h3 id="alignment-with-summer-of-reproducibility">Alignment with Summer of Reproducibility:&lt;/h3>
&lt;ul>
&lt;li>The core pipeline for AR data ingestion, event segmentation, summarization, and indexing (knowledge graph or vector database) would be made open source.&lt;/li>
&lt;li>Clear documentation of each module and of how the modules interact with one another&lt;/li>
&lt;li>The project could be tested with standardized datasets, simulated environments, and controlled real-world scenarios, promoting reproducibility&lt;/li>
&lt;li>Opportunities for innovation: a transparent, modular approach invites a broad community to propose novel extensions&lt;/li>
&lt;/ul>
&lt;h3 id="specific-tasks">Specific Tasks:&lt;/h3>
&lt;ul>
&lt;li>Build a pipeline for real-time or batch ingestion and cleaning of data captured by the wearable AR device (a minimal end-to-end sketch follows this list)&lt;/li>
&lt;li>Add an event segmentation module that classifies whether the current object or event is contextually significant, filtering out less relevant observations.&lt;/li>
&lt;li>Use VLMs/LLMs to summarize events from the vision, audio, and location data, storing the results in structures such as knowledge graphs or vector databases for later retrieval.&lt;/li>
&lt;li>Optimize storage by prioritizing important objects and spaces based on contextual significance and frequency of access.&lt;/li>
&lt;li>Implement key information retrieval mechanisms&lt;/li>
&lt;li>Ensure reproducibility by providing datasets and scripts&lt;/li>
&lt;/ul>
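&lt;p>The tasks above suggest a simple processing loop. The following is a minimal sketch under the record format above; the segmenter, summarizer, embedder, and index are placeholders for real models and databases:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical end-to-end sketch of the ReasonWorld pipeline.
def process_stream(frames, segmenter, summarizer, embedder, index, threshold=0.5):
    """Ingest captured frames, keep contextually significant events,
    summarize them, and index the structured records for later retrieval."""
    for frame in frames:                           # real-time or batch ingestion
        score = segmenter.significance(frame)      # event segmentation / filtering
        if score >= threshold:                     # drop less relevant observations
            summary = summarizer.describe(frame)   # LLM/VLM scene summary
            record = WorldEvent(
                timestamp=frame["timestamp"],
                location=frame["location"],
                modalities={"image": frame["image_path"]},
                summary=summary,
                embedding=embedder.embed(summary),
                significance=score,
            )
            index.add(record)                      # knowledge graph / vector database
    return index
&lt;/code>&lt;/pre>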
&lt;h3 id="reasonworld">ReasonWorld&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Augmented reality&lt;/code> &lt;code>Multimodal learning&lt;/code> &lt;code>Computer vision for AR&lt;/code> &lt;code>LLM/VLM&lt;/code> &lt;code>Efficient data indexing&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> Machine Learning and AI, Augmented Reality and Hardware integration, Data Engineering &amp;amp; Storage Optimization&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Hard&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:davisje@ucsc.edu">James Davis&lt;/a>, &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>AR4VIP</title><link>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ar4vip/</link><pubDate>Tue, 18 Feb 2025 00:00:00 +0000</pubDate><guid>https://deploy-preview-1007--ucsc-ospo.netlify.app/project/osre25/ucsc/ar4vip/</guid><description>&lt;p>We are interested in developing navigation aids for visually impaired people (VIP) using AR/VR technologies.
Our intended use is primarily indoors, or outdoors within private confines, e.g. a person&amp;rsquo;s backyard.
Using AR/VR headsets or smart glasses allows navigation without using a cane and frees
the users&amp;rsquo; hands for other tasks.&lt;/p>
&lt;h3 id="continue-development-on-meta-quest-3-headset">Continue Development on Meta Quest 3 Headset&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Dynamic scenes&lt;/code> &lt;code>Spatial audio&lt;/code> &lt;code>Proximity detection&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> AR/VR familiarity, WebXR, Unity, SLAM, good communicator, good documentation skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Medium or large (175 or 350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>, &lt;a href="mailto:davis@cs.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Continue development and field testing with the Meta Quest 3 headset.
See this &lt;a href="https://github.com/sail360/UCSC-VIP-Research" target="_blank" rel="noopener">repository page&lt;/a> for current status.&lt;/p>
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Improve spatial audio mapping&lt;/li>
&lt;li>Improve obstacle detection at different heights, with both pre-scanned geometry and dynamic objects, e.g. other people, pets, doors (a simple distance-to-cue sketch follows this list)&lt;/li>
&lt;li>Special handling of hazards e.g. stairs, uneven floors, etc.&lt;/li>
&lt;li>Explore and incorporate AI to help identify objects in the scene when requested by the user&lt;/li>
&lt;/ul>
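&lt;p>The real logic would live inside the Unity/WebXR application, but as a rough, platform-agnostic illustration of how proximity detection could drive spatial audio cues (thresholds and cue parameters are hypothetical):&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical sketch of mapping obstacle proximity to an audio cue.
def audio_cue(distance_m, height_m):
    """Return (beep_interval_s, pitch_hz) for an obstacle at the given distance
    and height; closer obstacles beep faster, head-height obstacles sound higher."""
    if distance_m > 3.0:
        return None                                  # far enough away: stay silent
    beep_interval = max(0.1, distance_m / 3.0)       # faster beeps as the obstacle gets closer
    pitch = 880.0 if height_m > 1.2 else 440.0       # head-height hazards use a higher pitch
    return (beep_interval, pitch)
&lt;/code>&lt;/pre>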
&lt;h3 id="new-development-on-smart-glasses">New Development on Smart Glasses&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Topics:&lt;/strong> &lt;code>Dynamic scenes&lt;/code> &lt;code>Spatial audio&lt;/code> &lt;code>Proximity detection&lt;/code>&lt;/li>
&lt;li>&lt;strong>Skills:&lt;/strong> AR/VR familiarity, WebXR, Unity, SLAM, good communicator, good documentation skills&lt;/li>
&lt;li>&lt;strong>Difficulty:&lt;/strong> Moderate&lt;/li>
&lt;li>&lt;strong>Size:&lt;/strong> Large (350 hours)&lt;/li>
&lt;li>&lt;strong>Mentors:&lt;/strong> &lt;a href="mailto:pang@soe.ucsc.edu">Alex Pang&lt;/a>, &lt;a href="mailto:davis@cs.ucsc.edu">James Davis&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>VR headsets are bulky and awkward, but they are currently more advanced than AR glasses in terms of programmability.
Ultimately, the form factor of smart glasses is more practical for extended use by our target users.
Many vendors are working on pushing out their own versions of smart glasses targeting various applications,
e.g. as an alternative for watching TV. We are interested in those that provide capabilities to support
spatial computing. Most of these will likely have their own brand-specific APIs. This project has two goals:
(a) develop a generic, brand-independent API, perhaps as extensions to WebXR, to support the overarching goal of a navigation
aid for VIP, and
(b) port the functionality of the VR version to smart glasses while taking advantage of smart-glass capabilities and sensors.&lt;/p>
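&lt;p>For goal (a), one possible shape for such an abstraction is sketched below; the interface and method names are hypothetical, and a real version would more likely be a WebXR extension or a Unity-side abstraction layer rather than Python:&lt;/p>
&lt;pre>&lt;code class="language-python"># Hypothetical sketch of a brand-independent smart-glasses interface.
# Each vendor SDK would be wrapped in a backend implementing this interface,
# so the navigation-aid logic itself stays portable across devices.
from abc import ABC, abstractmethod

class SmartGlasses(ABC):
    @abstractmethod
    def depth_frame(self):
        """Return the latest depth or scene-reconstruction data, if available."""

    @abstractmethod
    def play_spatial_audio(self, clip, direction, distance_m):
        """Play an audio cue so that it appears to come from the given direction."""

    @abstractmethod
    def describe_scene(self):
        """Return an AI-generated description of the current camera view."""
&lt;/code>&lt;/pre>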
&lt;p>Specific tasks:&lt;/p>
&lt;ul>
&lt;li>Explore current and soon-to-be-available smart glass options, e.g. Snap Spectacles, Xreal Air 2 Ultra, etc., and select a platform to work on (subject to cost and availability of an SDK). At a minimum, the glasses should have microphones, speakers, and cameras. Infrared cameras or other low-light capability is a plus, as is sufficient battery life or an option for quick battery swaps.&lt;/li>
&lt;li>Identify the support provided by the SDK, e.g. does it do real-time scene reconstruction? Does it support spatial audio? If it provides features beyond WebXR, add generic hooks to improve the portability of the code to other smart glasses.&lt;/li>
&lt;li>Port and extend the functionality from the Meta Quest 3 VR headset to the smart-glasses platform.&lt;/li>
&lt;li>Add AI support if the glasses support it.&lt;/li>
&lt;li>Provide documentation of work.&lt;/li>
&lt;/ul></description></item></channel></rss>