Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

Authors: Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns

Abstract: We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.

Submitted 29 July, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: ICRA 2024. Project webpage with robot videos: https://www.robot-learning.uk/dream2real

arXiv:2311.08530 [pdf, other]

SceneScore: Learning a Cost Function for Object Arrangement

Authors: Ivan Kapelyukh, Edward Johns

Abstract: Arranging objects correctly is a key capability for robots which unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the desirability of a given arrangement. Our method "SceneScore" learns a cost function for arrangements, such that desirable, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images without requiring environment interaction or human supervision. Our model is represented by a graph neural network which learns object-object relations, using graphs constructed from images. Experiments demonstrate that the learned cost function can be used to predict poses for missing objects, generalise to novel objects using semantic features, and can be composed with other cost functions to satisfy constraints at inference time.

Submitted 14 November, 2023; originally announced November 2023.

Comments: Presented at CoRL 2023 LEAP Workshop. Webpage: https://sites.google.com/view/scenescore

arXiv:2210.02438 [pdf, other]

doi 10.1109/LRA.2023.3272516

DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics

Authors: Ivan Kapelyukh, Vitalis Vosylius, Edward Johns

Abstract: We introduce the first work to explore web-scale diffusion models for robotics. DALL-E-Bot enables a robot to rearrange objects in a scene, by first inferring a text description of those objects, then generating an image representing a natural, human-like arrangement of those objects, and finally physically arranging the objects according to that goal image. We show that this is possible zero-shot using DALL-E, without needing any further example arrangements, data collection, or training. DALL-E-Bot is fully autonomous and is not restricted to a pre-defined set of objects or scenes, thanks to DALL-E's web-scale pre-training. Encouraging real-world results, with both human studies and objective metrics, show that integrating web-scale diffusion models into robotics pipelines is a promising direction for scalable, unsupervised robot learning.

Submitted 4 May, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: Webpage and videos: https://www.robot-learning.uk/dall-e-bot Published in IEEE Robotics and Automation Letters (RA-L).

arXiv:2111.03112 [pdf, other]

My House, My Rules: Learning Tidying Preferences with Graph Neural Networks

Authors: Ivan Kapelyukh, Edward Johns

Abstract: Robots that arrange household objects should do so according to the user's preferences, which are inherently subjective and difficult to model. We present NeatNet: a novel Variational Autoencoder architecture using Graph Neural Network layers, which can extract a low-dimensional latent preference vector from a user by observing how they arrange scenes. Given any set of objects, this vector can then be used to generate an arrangement which is tailored to that user's spatial preferences, with word embeddings used for generalisation to new objects. We develop a tidying simulator to gather rearrangement examples from 75 users, and demonstrate empirically that our method consistently produces neat and personalised arrangements across a variety of rearrangement scenarios.

Submitted 4 November, 2021; originally announced November 2021.

Comments: Published at CoRL 2021. Webpage and video: https://www.robot-learning.uk/my-house-my-rules

Showing 1–4 of 4 results for author: Kapelyukh, I