-
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Authors:
Sergio Arnaud,
Paul McVay,
Ada Martin,
Arjun Majumdar,
Krishna Murthy Jatavallabhula,
Phillip Thomas,
Ruslan Partsey,
Daniel Dugas,
Abha Gejji,
Alexander Sax,
Vincent-Pierre Berges,
Mikael Henaff,
Ayush Jain,
Ang Cao,
Ishita Prasad,
Mrinal Kalakrishnan,
Michael Rabbat,
Nicolas Ballas,
Mido Assran,
Oleksandr Maksymets,
Aravind Rajeswaran,
Franziska Meier
Abstract:
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world depl…
▽ More
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model.
△ Less
Submitted 18 April, 2025;
originally announced April 2025.
-
From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs
Authors:
Ang Cao,
Sergio Arnaud,
Oleksandr Maksymets,
Jianing Yang,
Ayush Jain,
Sriram Yenamandra,
Ada Martin,
Vincent-Pierre Berges,
Paul McVay,
Ruslan Partsey,
Aravind Rajeswaran,
Franziska Meier,
Justin Johnson,
Jeong Joon Park,
Alexander Sax
Abstract:
3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes--a six-order-of-magnitude gap that severely limits performance. We introduce $\textbf{LIFT-GS}$, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. L…
▽ More
3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes--a six-order-of-magnitude gap that severely limits performance. We introduce $\textbf{LIFT-GS}$, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with $25.7\%$ mAP on open-vocabulary instance segmentation (vs. $20.2\%$ prior SOTA) and consistent $10-30\%$ improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2X, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: https://liftgs.github.io
△ Less
Submitted 9 June, 2025; v1 submitted 27 February, 2025;
originally announced February 2025.
-
Memory Layers at Scale
Authors:
Vincent-Pierre Berges,
Barlas Oğuz,
Daniel Haziza,
Wen-tau Yih,
Luke Zettlemoyer,
Gargi Ghosh
Abstract:
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstre…
▽ More
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
△ Less
Submitted 20 December, 2024; v1 submitted 12 December, 2024;
originally announced December 2024.
-
Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots
Authors:
Xavier Puig,
Eric Undersander,
Andrew Szot,
Mikael Dallaire Cote,
Tsung-Yen Yang,
Ruslan Partsey,
Ruta Desai,
Alexander William Clegg,
Michal Hlavac,
So Yeon Min,
Vladimír Vondruš,
Theophile Gervet,
Vincent-Pierre Berges,
John M. Turner,
Oleksandr Maksymets,
Zsolt Kira,
Mrinal Kalakrishnan,
Jitendra Malik,
Devendra Singh Chaplot,
Unnat Jain,
Dhruv Batra,
Akshara Rai,
Roozbeh Mottaghi
Abstract:
We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real h…
▽ More
We present Habitat 3.0: a simulation platform for studying collaborative human-robot tasks in home environments. Habitat 3.0 offers contributions across three dimensions: (1) Accurate humanoid simulation: addressing challenges in modeling complex deformable bodies and diversity in appearance and motion, all while ensuring high simulation speed. (2) Human-in-the-loop infrastructure: enabling real human interaction with simulated robots via mouse/keyboard or a VR interface, facilitating evaluation of robot policies with human input. (3) Collaborative tasks: studying two collaborative tasks, Social Navigation and Social Rearrangement. Social Navigation investigates a robot's ability to locate and follow humanoid avatars in unseen environments, whereas Social Rearrangement addresses collaboration between a humanoid and robot while rearranging a scene. These contributions allow us to study end-to-end learned and heuristic baselines for human-robot collaboration in-depth, as well as evaluate them with humans in the loop. Our experiments demonstrate that learned robot policies lead to efficient task completion when collaborating with unseen humanoid agents and human partners that might exhibit behaviors that the robot has not seen before. Additionally, we observe emergent behaviors during collaborative task execution, such as the robot yielding space when obstructing a humanoid agent, thereby allowing the effective completion of the task by the humanoid agent. Furthermore, our experiments using the human-in-the-loop tool demonstrate that our automated evaluation with humanoids can provide an indication of the relative ordering of different policies when evaluated with real human collaborators. Habitat 3.0 unlocks interesting new features in simulators for Embodied AI, and we hope it paves the way for a new frontier of embodied human-AI interaction capabilities.
△ Less
Submitted 19 October, 2023;
originally announced October 2023.
-
What do we learn from a large-scale study of pre-trained visual representations in sim and real environments?
Authors:
Sneha Silwal,
Karmesh Yadav,
Tingfan Wu,
Jay Vakil,
Arjun Majumdar,
Sergio Arnaud,
Claire Chen,
Vincent-Pierre Berges,
Dhruv Batra,
Aravind Rajeswaran,
Mrinal Kalakrishnan,
Franziska Meier,
Oleksandr Maksymets
Abstract:
We present a large empirical investigation on the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study involves five different PVRs, each trained for five distinct manipulation or indoor navigation tasks. We performed this evaluation using three different robots and two different policy learning paradigms. From this effort, we c…
▽ More
We present a large empirical investigation on the use of pre-trained visual representations (PVRs) for training downstream policies that execute real-world tasks. Our study involves five different PVRs, each trained for five distinct manipulation or indoor navigation tasks. We performed this evaluation using three different robots and two different policy learning paradigms. From this effort, we can arrive at three insights: 1) the performance trends of PVRs in the simulation are generally indicative of their trends in the real world, 2) the use of PVRs enables a first-of-its-kind result with indoor ImageNav (zero-shot transfer to a held-out scene in the real world), and 3) the benefits from variations in PVRs, primarily data-augmentation and fine-tuning, also transfer to the real-world performance. See project website for additional details and visuals.
△ Less
Submitted 13 July, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-Per-Second
Authors:
Vincent-Pierre Berges,
Andrew Szot,
Devendra Singh Chaplot,
Aaron Gokaslan,
Roozbeh Mottaghi,
Dhruv Batra,
Eric Undersander
Abstract:
We present Galactic, a large-scale simulation and reinforcement-learning (RL) framework for robotic mobile manipulation in indoor environments. Specifically, a Fetch robot (equipped with a mobile base, 7DoF arm, RGBD camera, egomotion, and onboard sensing) is spawned in a home environment and asked to rearrange objects - by navigating to an object, picking it up, navigating to a target location, a…
▽ More
We present Galactic, a large-scale simulation and reinforcement-learning (RL) framework for robotic mobile manipulation in indoor environments. Specifically, a Fetch robot (equipped with a mobile base, 7DoF arm, RGBD camera, egomotion, and onboard sensing) is spawned in a home environment and asked to rearrange objects - by navigating to an object, picking it up, navigating to a target location, and then placing the object at the target location.
Galactic is fast. In terms of simulation speed (rendering + physics), Galactic achieves over 421,000 steps-per-second (SPS) on an 8-GPU node, which is 54x faster than Habitat 2.0 (7699 SPS). More importantly, Galactic was designed to optimize the entire rendering + physics + RL interplay since any bottleneck in the interplay slows down training. In terms of simulation+RL speed (rendering + physics + inference + learning), Galactic achieves over 108,000 SPS, which 88x faster than Habitat 2.0 (1243 SPS).
These massive speed-ups not only drastically cut the wall-clock training time of existing experiments, but also unlock an unprecedented scale of new experiments. First, Galactic can train a mobile pick skill to >80% accuracy in under 16 minutes, a 100x speedup compared to the over 24 hours it takes to train the same skill in Habitat 2.0. Second, we use Galactic to perform the largest-scale experiment to date for rearrangement using 5B steps of experience in 46 hours, which is equivalent to 20 years of robot experience. This scaling results in a single neural network composed of task-agnostic components achieving 85% success in GeometricGoal rearrangement, compared to 0% success reported in Habitat 2.0 for the same approach. The code is available at github.com/facebookresearch/galactic.
△ Less
Submitted 13 June, 2023;
originally announced June 2023.
-
Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
Authors:
Arjun Majumdar,
Karmesh Yadav,
Sergio Arnaud,
Yecheng Jason Ma,
Claire Chen,
Sneha Silwal,
Aryan Jain,
Vincent-Pierre Berges,
Pieter Abbeel,
Jitendra Malik,
Dhruv Batra,
Yixin Lin,
Oleksandr Maksymets,
Aravind Rajeswaran,
Franziska Meier
Abstract:
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of…
▽ More
We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual 'foundation models' for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 4.3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Next, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving competitive or superior performance than the best known results on all of the benchmarks in CortexBench. Finally, we present real-world hardware experiments, in which VC-1 and VC-1 (adapted) outperform the strongest pre-existing PVR. Overall, this paper presents no new techniques but a rigorous systematic evaluation, a broad set of findings about PVRs (that in some cases, refute those made in narrow domains in prior work), and open-sourced code and models (that required over 10,000 GPU-hours to train) for the benefit of the research community.
△ Less
Submitted 1 February, 2024; v1 submitted 31 March, 2023;
originally announced March 2023.
-
Offline Visual Representation Learning for Embodied Navigation
Authors:
Karmesh Yadav,
Ram Ramrakhya,
Arjun Majumdar,
Vincent-Pierre Berges,
Sachit Kuhar,
Dhruv Batra,
Alexei Baevski,
Oleksandr Maksymets
Abstract:
How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effectiv…
▽ More
How should we learn visual representations for embodied agents that must see and move? The status quo is tabula rasa in vivo, i.e. learning visual representations from scratch while also learning to move, potentially augmented with auxiliary tasks (e.g. predicting the action taken between two successive observations). In this paper, we show that an alternative 2-stage strategy is far more effective: (1) offline pretraining of visual representations with self-supervised learning (SSL) using large-scale pre-rendered images of indoor environments (Omnidata), and (2) online finetuning of visuomotor representations on specific tasks with image augmentations under long learning schedules. We call this method Offline Visual Representation Learning (OVRL). We conduct large-scale experiments - on 3 different 3D datasets (Gibson, HM3D, MP3D), 2 tasks (ImageNav, ObjectNav), and 2 policy learning algorithms (RL, IL) - and find that the OVRL representations lead to significant across-the-board improvements in state of art, on ImageNav from 29.2% to 54.2% (+25% absolute, 86% relative) and on ObjectNav from 18.1% to 23.2% (+5.1% absolute, 28% relative). Importantly, both results were achieved by the same visual encoder generalizing to datasets that were not seen during pretraining. While the benefits of pretraining sometimes diminish (or entirely disappear) with long finetuning schedules, we find that OVRL's performance gains continue to increase (not decrease) as the agent is trained for 2 billion frames of experience.
△ Less
Submitted 27 April, 2022;
originally announced April 2022.
-
On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning
Authors:
Andrew Cohen,
Ervin Teng,
Vincent-Pierre Berges,
Ruo-Ping Dong,
Hunter Henry,
Marwan Mattar,
Alexander Zook,
Sujoy Ganguly
Abstract:
The creation and destruction of agents in cooperative multi-agent reinforcement learning (MARL) is a critically under-explored area of research. Current MARL algorithms often assume that the number of agents within a group remains fixed throughout an experiment. However, in many practical problems, an agent may terminate before their teammates. This early termination issue presents a challenge: th…
▽ More
The creation and destruction of agents in cooperative multi-agent reinforcement learning (MARL) is a critically under-explored area of research. Current MARL algorithms often assume that the number of agents within a group remains fixed throughout an experiment. However, in many practical problems, an agent may terminate before their teammates. This early termination issue presents a challenge: the terminated agent must learn from the group's success or failure which occurs beyond its own existence. We refer to propagating value from rewards earned by remaining teammates to terminated agents as the Posthumous Credit Assignment problem. Current MARL methods handle this problem by placing these agents in an absorbing state until the entire group of agents reaches a termination condition. Although absorbing states enable existing algorithms and APIs to handle terminated agents without modification, practical training efficiency and resource use problems exist.
In this work, we first demonstrate that sample complexity increases with the quantity of absorbing states in a toy supervised learning task for a fully connected network, while attention is more robust to variable size input. Then, we present a novel architecture for an existing state-of-the-art MARL algorithm which uses attention instead of a fully connected layer with absorbing states. Finally, we demonstrate that this novel architecture significantly outperforms the standard architecture on tasks in which agents are created or destroyed within episodes as well as standard multi-agent coordination tasks.
△ Less
Submitted 6 June, 2022; v1 submitted 10 November, 2021;
originally announced November 2021.
-
Obstacle Tower: A Generalization Challenge in Vision, Control, and Planning
Authors:
Arthur Juliani,
Ahmed Khalifa,
Vincent-Pierre Berges,
Jonathan Harper,
Ervin Teng,
Hunter Henry,
Adam Crespi,
Julian Togelius,
Danny Lange
Abstract:
The rapid pace of recent research in AI has been driven in part by the presence of fast and challenging simulation environments. These environments often take the form of games; with tasks ranging from simple board games, to competitive video games. We propose a new benchmark - Obstacle Tower: a high fidelity, 3D, 3rd person, procedurally generated environment. An agent playing Obstacle Tower must…
▽ More
The rapid pace of recent research in AI has been driven in part by the presence of fast and challenging simulation environments. These environments often take the form of games; with tasks ranging from simple board games, to competitive video games. We propose a new benchmark - Obstacle Tower: a high fidelity, 3D, 3rd person, procedurally generated environment. An agent playing Obstacle Tower must learn to solve both low-level control and high-level planning problems in tandem while learning from pixels and a sparse reward signal. Unlike other benchmarks such as the Arcade Learning Environment, evaluation of agent performance in Obstacle Tower is based on an agent's ability to perform well on unseen instances of the environment. In this paper we outline the environment and provide a set of baseline results produced by current state-of-the-art Deep RL methods as well as human players. These algorithms fail to produce agents capable of performing near human level.
△ Less
Submitted 1 July, 2019; v1 submitted 4 February, 2019;
originally announced February 2019.
-
Unity: A General Platform for Intelligent Agents
Authors:
Arthur Juliani,
Vincent-Pierre Berges,
Ervin Teng,
Andrew Cohen,
Jonathan Harper,
Chris Elion,
Chris Goy,
Yuan Gao,
Hunter Henry,
Marwan Mattar,
Danny Lange
Abstract:
Recent advances in artificial intelligence have been driven by the presence of increasingly realistic and complex simulated environments. However, many of the existing environments provide either unrealistic visuals, inaccurate physics, low task complexity, restricted agent perspective, or a limited capacity for interaction among artificial agents. Furthermore, many platforms lack the ability to f…
▽ More
Recent advances in artificial intelligence have been driven by the presence of increasingly realistic and complex simulated environments. However, many of the existing environments provide either unrealistic visuals, inaccurate physics, low task complexity, restricted agent perspective, or a limited capacity for interaction among artificial agents. Furthermore, many platforms lack the ability to flexibly configure the simulation, making the simulated environment a black-box from the perspective of the learning system. In this work, we propose a novel taxonomy of existing simulation platforms and discuss the highest level class of general platforms which enable the development of learning environments that are rich in visual, physical, task, and social complexity. We argue that modern game engines are uniquely suited to act as general platforms and as a case study examine the Unity engine and open source Unity ML-Agents Toolkit. We then survey the research enabled by Unity and the Unity ML-Agents Toolkit, discussing the kinds of research a flexible, interactive and easily configurable general platform can facilitate.
△ Less
Submitted 6 May, 2020; v1 submitted 7 September, 2018;
originally announced September 2018.