-
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Authors:
Sergio Arnaud,
Paul McVay,
Ada Martin,
Arjun Majumdar,
Krishna Murthy Jatavallabhula,
Phillip Thomas,
Ruslan Partsey,
Daniel Dugas,
Abha Gejji,
Alexander Sax,
Vincent-Pierre Berges,
Mikael Henaff,
Ayush Jain,
Ang Cao,
Ishita Prasad,
Mrinal Kalakrishnan,
Michael Rabbat,
Nicolas Ballas,
Mido Assran,
Oleksandr Maksymets,
Aravind Rajeswaran,
Franziska Meier
Abstract:
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
Submitted 18 April, 2025;
originally announced April 2025.
-
From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs
Authors:
Ang Cao,
Sergio Arnaud,
Oleksandr Maksymets,
Jianing Yang,
Ayush Jain,
Sriram Yenamandra,
Ada Martin,
Vincent-Pierre Berges,
Paul McVay,
Ruslan Partsey,
Aravind Rajeswaran,
Franziska Meier,
Justin Johnson,
Jeong Joon Park,
Alexander Sax
Abstract:
3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes--a six-order-of-magnitude gap that severely limits performance. We introduce LIFT-GS, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7% mAP on open-vocabulary instance segmentation (vs. 20.2% prior SOTA) and consistent 10-30% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2X, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: https://liftgs.github.io
Submitted 9 June, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Authors:
Lucas Lehnert,
Sainbayar Sukhbaatar,
DiJia Su,
Qinqing Zheng,
Paul McVay,
Michael Rabbat,
Yuandong Tian
Abstract:
While Transformers have enabled tremendous progress in various application settings, such architectures still trail behind traditional symbolic planners for solving complex decision-making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks. This is accomplished by training an encoder-decoder Transformer model to predict the search dynamics of the A* search algorithm. We fine-tune this model to obtain a Searchformer, a Transformer model that optimally solves previously unseen Sokoban puzzles 93.7% of the time, while using up to 26.8% fewer search steps than the A* implementation that was used for training initially. In our training method, A*'s search dynamics are expressed as a token sequence outlining when task states are added to and removed from the search tree during symbolic planning. Searchformer significantly outperforms baselines that predict the optimal plan directly with a 5-10× smaller model size and a 10× smaller training dataset. Lastly, we demonstrate how Searchformer scales to larger and more complex decision-making tasks with an improved percentage of solved tasks and shortened search dynamics.
Submitted 26 April, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
On Linear Separability under Linear Compression with Applications to Hard Support Vector Machine
Authors:
Paul McVay,
Tie Liu,
Krishna Narayanan
Abstract:
This paper investigates the theoretical problem of maintaining linear separability of the data-generating distribution under linear compression. While it has been long known that linear separability may be maintained by linear transformations that approximately preserve the inner products between the domain points, the limit to which the inner products are preserved in order to maintain linear separability was unknown. In this paper, we show that linear separability is maintained as long as the distortion of the inner products is smaller than the squared margin of the original data-generating distribution. The proof is mainly based on the geometry of hard support vector machines (SVM) extended from the finite set of training examples to the (possibly) infinite domain of the data-generating distribution. As applications, we derive bounds on the (i) compression length of random sub-Gaussian matrices; and (ii) generalization error for compressive learning with hard-SVM.
Submitted 2 February, 2022;
originally announced February 2022.