Search | arXiv e-print repository

One-Shot Dual-Arm Imitation Learning

Abstract: We introduce One-Shot Dual-Arm Imitation Learning (ODIL), which enables dual-arm robots to learn precise and coordinated everyday tasks from just a single demonstration of the task. ODIL uses a new three-stage visual servoing (3-VS) method for precise alignment between the end-effector and target object, after which replay of the demonstration trajectory is sufficient to perform the task. This is achieved without requiring prior task or object knowledge, or additional data collection and training following the single demonstration. Furthermore, we propose a new dual-arm coordination paradigm for learning dual-arm tasks from a single demonstration. ODIL was tested on a real-world dual-arm robot, demonstrating state-of-the-art performance across six precise and coordinated tasks in both 4-DoF and 6-DoF settings, and showing robustness in the presence of distractor objects and partial occlusions. Videos are available at: https://www.robot-learning.uk/one-shot-dual-arm.

Submitted 9 March, 2025; originally announced March 2025.

Comments: Accepted at ICRA 2025. Project Webpage: https://www.robot-learning.uk/one-shot-dual-arm

arXiv:2411.12633 [pdf, other]

Instant Policy: In-Context Imitation Learning via Graph Diffusion

Authors: Vitalis Vosylius, Edward Johns

Abstract: Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly (without further training) from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem with a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations - arbitrary trajectories generated in simulation - as a virtually infinite pool of training data. Simulated and real experiments show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks. Code and videos are available at https://www.robot-learning.uk/instant-policy.

Submitted 25 April, 2025; v1 submitted 19 November, 2024; originally announced November 2024.

Comments: Code and videos are available on our project webpage at https://www.robot-learning.uk/instant-policy

arXiv:2410.19693 [pdf, other]

MILES: Making Imitation Learning Easy with Self-Supervision

Authors: Georgios Papagiannis, Edward Johns

Abstract: Data collection in imitation learning often requires significant, laborious human supervision, such as numerous demonstrations, and/or frequent environment resets for methods that incorporate reinforcement learning. In this work, we propose an alternative approach, MILES: a fully autonomous, self-supervised data collection paradigm, and we show that this enables efficient policy learning from just a single demonstration and a single environment reset. MILES autonomously learns a policy for returning to and then following the single demonstration, whilst being self-guided during data collection, eliminating the need for additional human interventions. We evaluated MILES across several real-world tasks, including tasks that require precise contact-rich manipulation such as locking a lock with a key. We found that, under the constraints of a single demonstration and no repeated environment resetting, MILES significantly outperforms state-of-the-art alternatives like imitation learning methods that leverage reinforcement learning. Videos of our experiments and code can be found on our webpage: www.robot-learning.uk/miles.

Submitted 25 October, 2024; originally announced October 2024.

Comments: Published at the Conference on Robot Learning (CoRL) 2024

arXiv:2408.00178 [pdf, other]

Adapting Skills to Novel Grasps: A Self-Supervised Approach

Authors: Georgios Papagiannis, Kamil Dreczkowski, Vitalis Vosylius, Edward Johns

Abstract: In this paper, we study the problem of adapting manipulation trajectories involving grasped objects (e.g. tools) defined for a single grasp pose to novel grasp poses. A common approach to address this is to define a new trajectory for each possible grasp explicitly, but this is highly inefficient. Instead, we propose a method to adapt such trajectories directly while only requiring a period of self-supervised data collection, during which a camera observes the robot's end-effector moving with the object rigidly grasped. Importantly, our method requires no prior knowledge of the grasped object (such as a 3D CAD model), it can work with RGB images, depth images, or both, and it requires no camera calibration. Through a series of real-world experiments involving 1360 evaluations, we find that self-supervised RGB data consistently outperforms alternatives that rely on depth images including several state-of-the-art pose estimation methods. Compared to the best-performing baseline, our method results in an average of 28.5% higher success rate when adapting manipulation trajectories to novel grasps on several everyday tasks. Videos of the experiments are available on our webpage at https://www.robot-learning.uk/adapting-skills

Submitted 31 July, 2024; originally announced August 2024.

Comments: Accepted at IROS 2024

arXiv:2407.12957 [pdf, other]

R+X: Retrieval and Execution from Everyday Human Videos

Authors: Georgios Papagiannis, Norman Di Palo, Pietro Vitiello, Edward Johns

Abstract: We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method (KAT) on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, robots can perform commanded skills immediately, without requiring a period of training on the retrieved videos. Experiments studying a range of everyday household tasks show that R+X succeeds at translating unlabelled human videos into robust robot skills, and that R+X outperforms several recent alternative methods. Videos and code are available at https://www.robot-learning.uk/r-plus-x.

Submitted 3 April, 2025; v1 submitted 17 July, 2024; originally announced July 2024.

Comments: Published at the IEEE International Conference on Robotics and Automation (ICRA) 2025

arXiv:2404.00117 [pdf, other]

Spectral approaches to stress relaxation in epithelial monolayers

Authors: Natasha Cowley, Christopher K. Revell, Emma Johns, Sarah Woolner, Oliver E. Jensen

Abstract: We investigate the viscoelastic relaxation to equilibrium of a disordered planar epithelium described using the cell vertex model. In its standard form, the model is formulated as coupled evolution equations for the locations of vertices of confluent polygonal cells. Exploiting the model's gradient-flow structure, we use singular-value decomposition to project modes of deformation of vertices onto modes of deformation of cells. We show how eigenmodes of discrete Laplacian operators (specified by constitutive assumptions related to dissipation and mechanical energy) provide a spatial basis for evolving fields, and demonstrate how the operators can incorporate approximations of conventional spatial derivatives. We relate the spectrum of relaxation times to the eigenvalues of the Laplacians, modified by corrections that account for the fact that the cell network (and therefore the Laplacians) evolve during relaxation to an equilibrium prestressed state, providing the monolayer with geometric stiffness. While dilational modes of the Laplacians capture rapid relaxation in some circumstances, showing diffusive dynamics, geometric stiffness is typically a dominant source of monolayer rigidity, as we illustrate for monolayers exposed to unsteady stretching deformations.

Submitted 29 March, 2024; originally announced April 2024.

Comments: 8 figures

arXiv:2403.19578 [pdf, other]

Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics

Authors: Norman Di Palo, Edward Johns

Abstract: We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences that emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, we show that these Transformers excel at translating tokenised visual keypoint observations into action trajectories, performing on par or better than state-of-the-art imitation learning (diffusion policies) in the low-data regime on a suite of real-world, everyday tasks. Rather than operating in the language domain as is typical, KAT leverages text-based Transformers to operate in the vision and action domains to learn general patterns in demonstration data for highly efficient imitation learning, indicating promising new avenues for repurposing natural language models for embodied tasks. Videos are available at https://www.robot-learning.uk/keypoint-action-tokens.

Submitted 17 October, 2024; v1 submitted 28 March, 2024; originally announced March 2024.

Comments: Published at Robotics: Science and Systems (RSS) 2024

arXiv:2402.13181 [pdf, other]

DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models

Authors: Norman Di Palo, Edward Johns

Abstract: We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both the image-level and pixel-level properties of vision foundation models enables unprecedented learning efficiency and generalisation. Videos and code are available at https://www.robot-learning.uk/dinobot.

Submitted 20 February, 2024; originally announced February 2024.

Comments: To appear at 2024 IEEE International Conference on Robotics and Automation (ICRA)

arXiv:2312.12345 [pdf, other]

On the Effectiveness of Retrieval, Alignment, and Replay in Manipulation

Authors: Norman Di Palo, Edward Johns

Abstract: Imitation learning with visual observations is notoriously inefficient when addressed with end-to-end behavioural cloning methods. In this paper, we explore an alternative paradigm which decomposes reasoning into three phases. First, a retrieval phase, which informs the robot what it can do with an object. Second, an alignment phase, which informs the robot where to interact with the object. And third, a replay phase, which informs the robot how to interact with the object. Through a series of real-world experiments on everyday tasks, such as grasping, pouring, and inserting objects, we show that this decomposition brings unprecedented learning efficiency, and effective inter- and intra-class generalisation. Videos are available at https://www.robot-learning.uk/retrieval-alignment-replay.

Submitted 19 December, 2023; originally announced December 2023.

Comments: Published in IEEE Robotics and Automation Letters (RA-L). (Accepted December 2023)

arXiv:2312.10807 [pdf, other]

Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

Authors: Hongkuan Zhou, Xiangtong Yao, Oier Mees, Yuan Meng, Ted Xiao, Yonatan Bisk, Jean Oh, Edward Johns, Mohit Shridhar, Dhruv Shah, Jesse Thomason, Kai Huang, Joyce Chai, Zhenshan Bing, Alois Knoll

Abstract: Language-conditioned robot manipulation is an emerging field aimed at enabling seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions conveyed in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robotic actions. In this comprehensive survey, we systematically explore recent advancements in language-conditioned robotic manipulation. We categorize existing methods into language-conditioned reward shaping, language-conditioned policy learning, neuro-symbolic artificial intelligence, and the utilization of foundational models (FMs) such as large language models (LLMs) and vision-language models (VLMs). Specifically, we analyze state-of-the-art techniques concerning semantic information extraction, environment and evaluation, auxiliary tasks, and task representation strategies. By conducting a comparative analysis, we highlight the strengths and limitations of current approaches in bridging language instructions with robot actions. Finally, we discuss open challenges and future research directions, focusing on potentially enhancing generalization capabilities and addressing safety issues in language-conditioned robot manipulators.

Submitted 17 February, 2025; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: 37 pages, 15 figures, 4 tables, 354 citations

arXiv:2312.04533 [pdf, other]

Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models

Authors: Ivan Kapelyukh, Yifei Ren, Ignacio Alzugaray, Edward Johns

Abstract: We introduce Dream2Real, a robotics framework which integrates vision-language models (VLMs) trained on 2D data into a 3D object rearrangement pipeline. This is achieved by the robot autonomously constructing a 3D representation of the scene, where objects can be rearranged virtually and an image of the resulting arrangement rendered. These renders are evaluated by a VLM, so that the arrangement which best satisfies the user instruction is selected and recreated in the real world with pick-and-place. This enables language-conditioned rearrangement to be performed zero-shot, without needing to collect a training dataset of example arrangements. Results on a series of real-world tasks show that this framework is robust to distractors, controllable by language, capable of understanding complex multi-object relations, and readily applicable to both tabletop and 6-DoF rearrangement tasks.

Submitted 29 July, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: ICRA 2024. Project webpage with robot videos: https://www.robot-learning.uk/dream2real

arXiv:2311.08530 [pdf, other]

SceneScore: Learning a Cost Function for Object Arrangement

Authors: Ivan Kapelyukh, Edward Johns

Abstract: Arranging objects correctly is a key capability for robots which unlocks a wide range of useful tasks. A prerequisite for creating successful arrangements is the ability to evaluate the desirability of a given arrangement. Our method "SceneScore" learns a cost function for arrangements, such that desirable, human-like arrangements have a low cost. We learn the distribution of training arrangements offline using an energy-based model, solely from example images without requiring environment interaction or human supervision. Our model is represented by a graph neural network which learns object-object relations, using graphs constructed from images. Experiments demonstrate that the learned cost function can be used to predict poses for missing objects, generalise to novel objects using semantic features, and can be composed with other cost functions to satisfy constraints at inference time.

Submitted 14 November, 2023; originally announced November 2023.

Comments: Presented at CoRL 2023 LEAP Workshop. Webpage: https://sites.google.com/view/scenescore

arXiv:2310.12238 [pdf, other]

Few-Shot In-Context Imitation Learning via Implicit Graph Alignment

Authors: Vitalis Vosylius, Edward Johns

Abstract: Consider the following problem: given a few demonstrations of a task across a few different objects, how can a robot learn to perform that same task on new, previously unseen objects? This is challenging because the large variety of objects within a class makes it difficult to infer the task-relevant relationship between the new objects and the objects in the demonstrations. We address this by formulating imitation learning as a conditional alignment problem between graph representations of objects. Consequently, we show that this conditioning allows for in-context learning, where a robot can perform a task on a set of new objects immediately after the demonstrations, without any prior knowledge about the object class or any further training. In our experiments, we explore and validate our design choices, and we show that our method is highly effective for few-shot learning of several real-world, everyday tasks, whilst outperforming baselines. Videos are available on our project webpage at https://www.robot-learning.uk/implicit-graph-alignment.

Submitted 18 October, 2023; originally announced October 2023.

Comments: Published at CoRL 2023. Videos are available on our project webpage at https://www.robot-learning.uk/implicit-graph-alignment

arXiv:2310.12077 [pdf, other]

One-Shot Imitation Learning: A Pose Estimation Perspective

Authors: Pietro Vitiello, Kamil Dreczkowski, Edward Johns

Abstract: In this paper, we study imitation learning under the challenging setting of: (1) only a single demonstration, (2) no further data collection, and (3) no prior task or object knowledge. We show how, with these constraints, imitation learning can be formulated as a combination of trajectory transfer and unseen object pose estimation. To explore this idea, we provide an in-depth study on how state-of-the-art unseen object pose estimators perform for one-shot imitation learning on ten real-world tasks, and we take a deep dive into the effects that camera calibration, pose estimation error, and spatial generalisation have on task success rates. For videos, please visit https://www.robot-learning.uk/pose-estimation-perspective.

Submitted 18 October, 2023; originally announced October 2023.

Comments: Published at the 7th Conference on Robot Learning (CoRL 2023). For more details please visit https://www.robot-learning.uk/pose-estimation-perspective
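
Editorial sketch (not from the paper): the abstract above frames one-shot imitation as trajectory transfer plus object pose estimation. One common way to read "trajectory transfer" is to apply the estimated change in object pose to every demonstrated end-effector pose. The example below illustrates that reading only; the function names, the yaw-only pose helper, and all numeric values are hypothetical, and the relative object pose is assumed to be given rather than estimated.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 homogeneous transform from a yaw angle (rad) and an xyz translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = translation
    return T


def transfer_trajectory(demo_traj, T_obj_demo, T_obj_test):
    """Map demonstrated end-effector poses onto a new object pose.

    All poses are expressed in the same world frame. The object's motion
    between demonstration and test time is applied to each waypoint, so the
    end-effector pose relative to the object is preserved.
    """
    T_delta = T_obj_test @ np.linalg.inv(T_obj_demo)
    return [T_delta @ T_ee for T_ee in demo_traj]


# Hypothetical values: the object moved and rotated by 45 degrees since the demo.
T_obj_demo = make_pose(0.0, [0.5, 0.0, 0.1])
T_obj_test = make_pose(np.pi / 4, [0.4, 0.2, 0.1])
demo_traj = [make_pose(0.0, [0.5, 0.0, 0.3]), make_pose(0.0, [0.5, 0.0, 0.15])]

new_traj = transfer_trajectory(demo_traj, T_obj_demo, T_obj_test)
```

In this reading, any error in the estimated object pose carries directly into every transferred waypoint, which is consistent with the abstract's interest in how pose estimation error affects task success.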

arXiv:2310.11604 [pdf, other]

doi 10.1109/LRA.2024.3410155

Language Models as Zero-Shot Trajectory Generators

Authors: Teyun Kwon, Norman Di Palo, Edward Johns

Abstract: Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation tasks, when given access to only object detection and segmentation vision models. We designed a single, task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers. Then we studied how well it can perform across 30 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigated which design choices in this prompt are the most important. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, prompts, and code are available at: https://www.robot-learning.uk/language-models-trajectory-generators.

Submitted 17 June, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: Published in IEEE Robotics and Automation Letters (Volume: 9, Issue: 7, July 2024, Pages: 6728-6735); 10 pages, 12 figures

Journal ref: IEEE Robotics and Automation Letters (Volume: 9, Issue: 7, July 2024, Pages: 6728-6735)

arXiv:2310.08864 [pdf, other]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Authors: Open X-Embodiment Collaboration, Abby O'Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie , et al. (269 additional authors not shown)

Abstract: Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.

Submitted 14 May, 2025; v1 submitted 13 October, 2023; originally announced October 2023.

Comments: Project website: https://robotics-transformer-x.github.io

arXiv:2305.09870

Crossing the Reality Gap in Tactile-Based Learning

Authors: Ya-Yen Tsai, Bidan Huang, Yu Zheng, Lei Han, Wang Wei Lee, Edward Johns

Abstract: Tactile sensors are believed to be essential in robotic manipulation, and prior works often rely on experts to reason about the sensor feedback and design a controller. With the recent advancement in data-driven approaches, complicated manipulation can be realised, but an accurate and efficient tactile simulation is necessary for policy training. To this end, we present an approach to model a commonly used pressure sensor array in simulation and to train a tactile-based manipulation policy with sim-to-real transfer in mind. Each taxel in our model is represented as a mass-spring-damper system, in which the parameters are iteratively identified as plausible ranges. This allows a policy to be trained with domain randomisation which improves its robustness to different environments. Then, we introduce encoders to further align the critical tactile features in a latent space. Finally, our experiments answer questions on tactile-based manipulation, tactile modelling and sim-to-real performance.

Submitted 22 May, 2023; v1 submitted 16 May, 2023; originally announced May 2023.

Comments: This work requires further improvement
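
Editorial sketch (not from the paper): the abstract above describes each taxel as a mass-spring-damper system whose parameters are known only as plausible ranges, sampled for domain randomisation. The example below shows what that could look like for a single taxel under a constant force. The parameter ranges, the fixed mass, the integration scheme (semi-implicit Euler), and the function names are all assumptions made for illustration, not values or code from the authors.

```python
import numpy as np


def simulate_taxel(force, mass, stiffness, damping, dt=1e-4, steps=5000):
    """Integrate m*x'' + c*x' + k*x = F with semi-implicit Euler; return final displacement."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        a = (force - damping * v - stiffness * x) / mass
        v += dt * a  # update velocity first
        x += dt * v  # then position, using the new velocity
    return x


def sample_taxel_params(rng):
    """Draw one set of taxel parameters from hypothetical plausible ranges."""
    return {
        "mass": 0.01,                             # kg, held fixed in this sketch
        "stiffness": rng.uniform(800.0, 1200.0),  # N/m
        "damping": rng.uniform(1.0, 3.0),         # N*s/m
    }


# Domain randomisation: each simulated episode sees a different taxel model.
rng = np.random.default_rng(0)
params = [sample_taxel_params(rng) for _ in range(3)]
displacements = [simulate_taxel(1.0, **p) for p in params]
```

Under a constant force the displacement settles to force divided by stiffness, so randomising stiffness changes the steady-state reading while damping and mass shape the transient.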

arXiv:2303.02506 [pdf, other]

Prismer: A Vision-Language Model with Multi-Task Experts

Authors: Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar

Abstract: Recent vision-language models have shown impressive multi-modal generation capabilities. However, typically they require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from multiple readily-available, pre-trained experts, and kept frozen during training. By leveraging experts from a wide range of domains, we show Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with the current state of the art, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.

Submitted 18 January, 2024; v1 submitted 4 March, 2023; originally announced March 2023.

Comments: Published at TMLR 2024. Project Page: https://shikun.io/projects/prismer Code: https://github.com/NVlabs/prismer

arXiv:2212.06111 [pdf, other]

Where To Start? Transferring Simple Skills to Complex Environments

Authors: Vitalis Vosylius, Edward Johns

Abstract: Robot learning provides a number of ways to teach robots simple skills, such as grasping. However, these skills are usually trained in open, clutter-free environments, and therefore would likely cause undesirable collisions in more complex, cluttered environments. In this work, we introduce an affordance model based on a graph representation of an environment, which is optimised during deployment to find suitable robot configurations to start a skill from, such that the skill can be executed without any collisions. We demonstrate that our method can generalise a priori acquired skills to previously unseen cluttered and constrained environments, in simulation and in the real world, for both a grasping and a placing task.

Submitted 12 December, 2022; originally announced December 2022.

Comments: Accepted at CoRL 2022. Videos are available on our project webpage at https://www.robot-learning.uk/where-to-start

arXiv:2210.17325 [pdf, other]

Real-time Mapping of Physical Scene Properties with an Autonomous Robot Experimenter

Authors: Iain Haughton, Edgar Sucar, Andre Mouton, Edward Johns, Andrew J. Davison

Abstract: Neural fields can be trained from scratch to represent the shape and appearance of 3D scenes efficiently. It has also been shown that they can densely map correlated properties such as semantics, via sparse interactions from a human labeller. In this work, we show that a robot can densely annotate a scene with arbitrary discrete or continuous physical properties via its own fully-autonomous experimental interactions, as it simultaneously scans and maps it with an RGB-D camera. A variety of scene interactions are possible, including poking with force sensing to determine rigidity, measuring local material type with single-pixel spectroscopy or predicting force distributions by pushing. Sparse experimental interactions are guided by entropy to enable high efficiency, with tabletop scene properties densely mapped from scratch in a few minutes from a few tens of interactions.

Submitted 31 October, 2022; originally announced October 2022.

arXiv:2210.02438 [pdf, other]

doi 10.1109/LRA.2023.3272516

DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics

Authors: Ivan Kapelyukh, Vitalis Vosylius, Edward Johns

Abstract: We introduce the first work to explore web-scale diffusion models for robotics. DALL-E-Bot enables a robot to rearrange objects in a scene, by first inferring a text description of those objects, then generating an image representing a natural, human-like arrangement of those objects, and finally physically arranging the objects according to that goal image. We show that this is possible zero-shot using DALL-E, without needing any further example arrangements, data collection, or training. DALL-E-Bot is fully autonomous and is not restricted to a pre-defined set of objects or scenes, thanks to DALL-E's web-scale pre-training. Encouraging real-world results, with both human studies and objective metrics, show that integrating web-scale diffusion models into robotics pipelines is a promising direction for scalable, unsupervised robot learning.

Submitted 4 May, 2023; v1 submitted 5 October, 2022; originally announced October 2022.

Comments: Webpage and videos: ( https://www.robot-learning.uk/dall-e-bot ) Published in IEEE Robotics and Automation Letters (RA-L)

arXiv:2209.02789 [pdf]

doi 10.1016/j.pocean.2022.102876

Lagrangian coherence and source of water of Loop Current Frontal Eddies in the Gulf of Mexico

Authors: Luna Hiron, Philippe Miron, Lynn K. Shay, William E. Johns, Eric P. Chassignet, Alexandra Bozec

Abstract: Loop Current Frontal Eddies (LCFEs) are known to intensify and assist in the Loop Current (LC) eddy shedding. These eddies can also modify the circulation in the eastern Gulf of Mexico (GoM) by attracting water and passive tracers such as chlorophyll and pollutants to the LC-LCFE front. During the 2010 Deepwater Horizon oil spill, part of the oil was entrained not only in the LC-LCFE front but also inside an LCFE, where it remained for weeks. This study assesses the ability of the LCFEs to transport water and passive tracers without exchange with the exterior (i.e., Lagrangian coherence) using altimetry and a high-resolution model. The following open questions are answered: (1) How long can the LCFEs remain Lagrangian coherent at and below the surface? (2) What is the source of water for the formation of LCFEs? (3) Can the formation of LCFEs attract shelf water? The results show that LCFEs are composed of waters originating from the outer band of the LC front, the region north of the LC, and the shelf, and potentially drive cross-shelf exchange of particles, water properties, and nutrients. At depth (~180 m), most LCFE water comes from the outer band of the LC front in the form of smaller frontal eddies. Once formed, LCFEs can transport water and passive tracers in their interior without exchange with the exterior for weeks: these eddies remained Lagrangian coherent for up to 25 days in the altimetry dataset and 18 days at the surface and 29 days at depth (~180 m) in the simulation. LCFEs can remain Lagrangian coherent up to a depth of ~560 m. Additional analyses confirm that the LCFE involved in the oil spill formed from water near the oil rig location. Temperature-salinity diagrams show that LCFEs are composed of GoM water as opposed to LC water. Thus, LCFE formation modifies the surrounding circulation and the transport of oil and other passive tracers in the eastern GoM.

Submitted 6 September, 2022; originally announced September 2022.

Journal ref: Progress in Oceanography, 102876, ISSN 0079-6611 (2022)

arXiv:2208.11658 [pdf, other]

AGO-Net: Association-Guided 3D Point Cloud Object Detection Network

Authors: Liang Du, Xiaoqing Ye, Xiao Tan, Edward Johns, Bo Chen, Errui Ding, Xiangyang Xue, Jianfeng Feng

Abstract: The human brain can effortlessly recognize and localize objects, whereas current 3D object detection methods based on LiDAR point clouds still report inferior performance for detecting occluded and distant objects: the point cloud appearance varies greatly due to occlusion, and has inherent variance in point densities along the distance to sensors. Therefore, designing feature representations robust to such point clouds is critical. Inspired by human associative recognition, we propose a novel 3D detection framework that associates intact features for objects via domain adaptation. We bridge the gap between the perceptual domain, where features are derived from real scenes with sub-optimal representations, and the conceptual domain, where features are extracted from augmented scenes that consist of non-occlusion objects with rich detailed information. A feasible method is investigated to construct conceptual scenes without external datasets. We further introduce an attention-based re-weighting module that adaptively strengthens the feature adaptation of more informative regions. The network's feature enhancement ability is exploited without introducing extra cost during inference, which is plug-and-play in various 3D detection frameworks. We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed. Experiments on nuScenes and Waymo datasets also validate the versatility of our method.

Submitted 24 August, 2022; originally announced August 2022.

Comments: 12 pages

arXiv:2204.02863 [pdf, other]

Demonstrate Once, Imitate Immediately (DOME): Learning Visual Servoing for One-Shot Imitation Learning

Authors: Eugene Valassakis, Georgios Papagiannis, Norman Di Palo, Edward Johns

Abstract: We present DOME, a novel method for one-shot imitation learning, where a task can be learned from just a single demonstration and then be deployed immediately, without any further data collection or training. DOME does not require prior task or object knowledge, and can perform the task in novel object configurations and with distractors. At its core, DOME uses an image-conditioned object segmentation network followed by a learned visual servoing network, to move the robot's end-effector to the same relative pose to the object as during the demonstration, after which the task can be completed by replaying the demonstration's end-effector velocities. We show that DOME achieves near 100% success rate on 7 real-world everyday tasks, and we perform several studies to thoroughly understand each individual component of DOME. Videos and supplementary material are available at: https://www.robot-learning.uk/dome .

Submitted 27 July, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

Comments: To be published at IROS 2022. 7 figures, 8 pages. Videos and supplementary material are available at: https://www.robot-learning.uk/dome

arXiv:2202.03091 [pdf, other]

Auto-Lambda: Disentangling Dynamic Task Relationships

Authors: Shikun Liu, Stephen James, Andrew J. Davison, Edward Johns

Abstract: Understanding the structure of multiple related tasks allows for multi-task learning to improve the generalisation ability of one or all of them. However, it usually requires training each pairwise combination of tasks together in order to capture task relationships, at an extremely high computational cost. In this work, we learn task relationships via an automated weighting framework, named Auto-Lambda. Unlike previous methods where task relationships are assumed to be fixed, Auto-Lambda is a gradient-based meta learning framework which explores continuous, dynamic task relationships via task-specific weightings, and can optimise any choice of combination of tasks through the formulation of a meta-loss; where the validation loss automatically influences task weightings throughout training. We apply the proposed framework to both multi-task and auxiliary learning problems in computer vision and robotics, and show that Auto-Lambda achieves state-of-the-art performance, even when compared to optimisation strategies designed specifically for each problem and data domain. Finally, we observe that Auto-Lambda can discover interesting learning behaviors, leading to new insights in multi-task learning. Code is available at https://github.com/lorenmt/auto-lambda.

Submitted 2 June, 2022; v1 submitted 7 February, 2022; originally announced February 2022.

Comments: Published at TMLR 2022. Project Page: https://shikun.io/projects/auto-lambda Code: https://github.com/lorenmt/auto-lambda

arXiv:2111.12867 [pdf, other]

Back to Reality for Imitation Learning

Authors: Edward Johns

Abstract: Imitation learning, and robot learning in general, emerged due to breakthroughs in machine learning, rather than breakthroughs in robotics. As such, evaluation metrics for robot learning are deeply rooted in those for machine learning, and focus primarily on data efficiency. We believe that a better metric for real-world robot learning is time efficiency, which better models the true cost to humans. This is a call to arms to the robot learning community to develop our own evaluation metrics, tailored towards the long-term goals of real-world robotics.

Submitted 24 November, 2021; originally announced November 2021.

Comments: Published at CoRL 2021, blue sky oral track

arXiv:2111.07447 [pdf, other]

Learning Multi-Stage Tasks with One Demonstration via Self-Replay

Authors: Norman Di Palo, Edward Johns

Abstract: In this work, we introduce a novel method to learn everyday-like multi-stage tasks from a single human demonstration, without requiring any prior object knowledge. Inspired by the recent Coarse-to-Fine Imitation Learning method, we model imitation learning as a learned object reaching phase followed by an open-loop replay of the demonstrator's actions. We build upon this for multi-stage tasks where, following the human demonstration, the robot can autonomously collect image data for the entire multi-stage task, by reaching the next object in the sequence and then replaying the demonstration, and then repeating in a loop for all stages of the task. We evaluate with real-world experiments on a set of everyday-like multi-stage tasks, which we show that our method can solve from a single demonstration. Videos and supplementary material can be found at https://www.robot-learning.uk/self-replay.

Submitted 14 November, 2021; originally announced November 2021.

Comments: Published at the 5th Conference on Robot Learning (CoRL) 2021

arXiv:2111.03112 [pdf, other]

My House, My Rules: Learning Tidying Preferences with Graph Neural Networks

Authors: Ivan Kapelyukh, Edward Johns

Abstract: Robots that arrange household objects should do so according to the user's preferences, which are inherently subjective and difficult to model. We present NeatNet: a novel Variational Autoencoder architecture using Graph Neural Network layers, which can extract a low-dimensional latent preference vector from a user by observing how they arrange scenes. Given any set of objects, this vector can then be used to generate an arrangement which is tailored to that user's spatial preferences, with word embeddings used for generalisation to new objects. We develop a tidying simulator to gather rearrangement examples from 75 users, and demonstrate empirically that our method consistently produces neat and personalised arrangements across a variety of rearrangement scenarios.

Submitted 4 November, 2021; originally announced November 2021.

Comments: Published at CoRL 2021. Webpage and video: https://www.robot-learning.uk/my-house-my-rules

arXiv:2111.01245 [pdf, other]

Learning Eye-in-Hand Camera Calibration from a Single Image

Authors: Eugene Valassakis, Kamil Dreczkowski, Edward Johns

Abstract: Eye-in-hand camera calibration is a fundamental and long-studied problem in robotics. We present a study on using learning-based methods for solving this problem online from a single RGB image, whilst training our models with entirely synthetic data. We study three main approaches: one direct regression model that directly predicts the extrinsic matrix from an image, one sparse correspondence model that regresses 2D keypoints and then uses PnP, and one dense correspondence model that uses regressed depth and segmentation maps to enable ICP pose estimation. In our experiments, we benchmark these methods against each other and against well-established classical methods, to find the surprising result that direct regression outperforms other approaches, and we perform noise-sensitivity analysis to gain further insights into these results.

Submitted 3 November, 2021; v1 submitted 1 November, 2021; originally announced November 2021.

Comments: Published at the 2021 Conference on Robot Learning (CoRL). Webpage and video: https://www.robot-learning.uk/learning-eye-in-hand-calibration

arXiv:2109.07559 [pdf, other]

Hybrid ICP

Authors: Kamil Dreczkowski, Edward Johns

Abstract: ICP algorithms typically involve a fixed choice of data association method and a fixed choice of error metric. In this paper, we propose Hybrid ICP, a novel and flexible ICP variant which dynamically optimises both the data association method and error metric based on the live image of an object and the current ICP estimate. We show that when used for object pose estimation, Hybrid ICP is more accurate and more robust to noise than other commonly used ICP variants. We also consider the setting where ICP is applied sequentially with a moving camera, and we study the trade-off between the accuracy of each ICP estimate and the number of ICP estimates available within a fixed amount of time.

Submitted 15 September, 2021; originally announced September 2021.

Comments: Published at IROS 2021. Webpage and video: https://www.robot-learning.uk/hybrid-icp

arXiv:2105.11283 [pdf, other]

Coarse-to-Fine for Sim-to-Real: Sub-Millimetre Precision Across Wide Task Spaces

Authors: Eugene Valassakis, Norman Di Palo, Edward Johns

Abstract: In this paper, we study the problem of zero-shot sim-to-real when the task requires both highly precise control with sub-millimetre error tolerance, and wide task space generalisation. Our framework involves a coarse-to-fine controller, where trajectories begin with classical motion planning using ICP-based pose estimation, and transition to a learned end-to-end controller which maps images to actions and is trained in simulation with domain randomisation. In this way, we achieve precise control whilst also generalising the controller across wide task spaces, and keeping the robustness of vision-based, end-to-end control. Real-world experiments on a range of different tasks show that, by exploiting the best of both worlds, our framework significantly outperforms purely motion planning methods, and purely learning-based methods. Furthermore, we answer a range of questions on best practices for precise sim-to-real transfer, such as how different image sensor modalities and image feature representations perform.

Submitted 29 July, 2021; v1 submitted 24 May, 2021; originally announced May 2021.

Comments: To be published at IROS 2021. 8 pages, 6 figures

arXiv:2105.06411 [pdf, other]

Coarse-to-Fine Imitation Learning: Robot Manipulation from a Single Demonstration

Authors: Edward Johns

Abstract: We introduce a simple new method for visual imitation learning, which allows a novel robot manipulation task to be learned from a single human demonstration, without requiring any prior knowledge of the object being interacted with. Our method models imitation learning as a state estimation problem, with the state defined as the end-effector's pose at the point where object interaction begins, as observed from the demonstration. By then modelling a manipulation task as a coarse, approach trajectory followed by a fine, interaction trajectory, this state estimator can be trained in a self-supervised manner, by automatically moving the end-effector's camera around the object. At test time, the end-effector moves to the estimated state through a linear path, at which point the original demonstration's end-effector velocities are simply replayed. This enables convenient acquisition of a complex interaction trajectory, without actually needing to explicitly learn a policy. Real-world experiments on 8 everyday tasks show that our method can learn a diverse range of skills from a single human demonstration, whilst also yielding a stable and interpretable controller.

Submitted 10 June, 2021; v1 submitted 13 May, 2021; originally announced May 2021.

Comments: Published at ICRA 2021. Webpage and video: https://www.robot-learning.uk/coarse-to-fine-imitation-learning

arXiv:2104.04465 [pdf, other]

Bootstrapping Semantic Segmentation with Regional Contrast

Authors: Shikun Liu, Shuaifeng Zhi, Edward Johns, Andrew J. Davison

Abstract: We present ReCo, a contrastive learning framework designed at a regional level to assist learning in semantic segmentation. ReCo performs semi-supervised or supervised pixel-level contrastive learning on a sparse set of hard negative pixels, with minimal additional memory footprint. ReCo is easy to implement, being built on top of off-the-shelf segmentation networks, and consistently improves performance in both semi-supervised and supervised semantic segmentation methods, achieving smoother segmentation boundaries and faster convergence. The strongest effect is in semi-supervised learning with very few labels. With ReCo, we achieve high-quality semantic segmentation models, requiring only 5 examples of each semantic class. Code is available at https://github.com/lorenmt/reco.

Submitted 31 January, 2022; v1 submitted 9 April, 2021; originally announced April 2021.

Comments: Published at ICLR 2022. Project Page: https://shikun.io/projects/regional-contrast. Code: https://github.com/lorenmt/reco

arXiv:2102.11003 [pdf, other]

DROID: Minimizing the Reality Gap using Single-Shot Human Demonstration

Authors: Ya-Yen Tsai, Hui Xu, Zihan Ding, Chong Zhang, Edward Johns, Bidan Huang

Abstract: Reinforcement learning (RL) has demonstrated great success in the past several years. However, most of the scenarios focus on simulated environments. One of the main challenges of transferring the policy learned in a simulated environment to real world, is the discrepancy between the dynamics of the two environments. In prior works, Domain Randomization (DR) has been used to address the reality gap for both robotic locomotion and manipulation tasks. In this paper, we propose Domain Randomization Optimization IDentification (DROID), a novel framework to exploit single-shot human demonstration for identifying the simulator's distribution of dynamics parameters, and apply it to training a policy on a door opening task. Our results show that the proposed framework can identify the difference in dynamics between the simulated and the real worlds, and thus improve policy transfer by optimizing the simulator's randomization ranges. We further illustrate that based on these same identified parameters, our method can generalize the learned policy to different but related tasks.

Submitted 23 February, 2021; v1 submitted 22 February, 2021; originally announced February 2021.

Comments: paper accepted and to be published in RA-L 2021

arXiv:2011.09586 [pdf, other]

SAFARI: Safe and Active Robot Imitation Learning with Imagination

Authors: Norman Di Palo, Edward Johns

Abstract: One of the main issues in Imitation Learning is the erroneous behavior of an agent when facing out-of-distribution situations, not covered by the set of demonstrations given by the expert. In this work, we tackle this problem by introducing a novel active learning and control algorithm, SAFARI. During training, it allows an agent to request further human demonstrations when these out-of-distribution situations are met. At deployment, it combines model-free acting using behavioural cloning with model-based planning to reduce state-distribution shift, using future state reconstruction as a test for state familiarity. We empirically demonstrate how this method increases the performance on a set of manipulation tasks with respect to passive Imitation Learning, by gathering more informative demonstrations and by minimizing state-distribution shift at test time. We also show how this method enables the agent to autonomously predict failure rapidly and safely.

Submitted 18 November, 2020; originally announced November 2020.

arXiv:2011.07112 [pdf, other]

Benchmarking Domain Randomisation for Visual Sim-to-Real Transfer

Authors: Raghad Alghonaim, Edward Johns

Abstract: Domain randomisation is a very popular method for visual sim-to-real transfer in robotics, due to its simplicity and ability to achieve transfer without any real-world images at all. Nonetheless, a number of design choices must be made to achieve optimal transfer. In this paper, we perform a comprehensive benchmarking study on these different choices, with two key experiments evaluated on a real-world object pose estimation task. First, we study the rendering quality, and find that a small number of high-quality images is superior to a large number of low-quality images. Second, we study the type of randomisation, and find that both distractors and textures are important for generalisation to novel environments.

Submitted 21 May, 2021; v1 submitted 13 November, 2020; originally announced November 2020.

Comments: Published at ICRA 2021. For project page, please visit: https://www.robot-learning.uk/benchmarking-domain-randomisation

arXiv:2008.06686 [pdf, other]

Crossing The Gap: A Deep Dive into Zero-Shot Sim-to-Real Transfer for Dynamics

Authors: Eugene Valassakis, Zihan Ding, Edward Johns

Abstract: Zero-shot sim-to-real transfer of tasks with complex dynamics is a highly challenging and unsolved problem. A number of solutions have been proposed in recent years, but we have found that many works do not present a thorough evaluation in the real world, or underplay the significant engineering effort and task-specific fine tuning that is required to achieve the published results. In this paper, we dive deeper into the sim-to-real transfer challenge, investigate why this is such a difficult problem, and present objective evaluations of a number of transfer methods across a range of real-world tasks. Surprisingly, we found that a method which simply injects random forces into the simulation performs just as well as more complex methods, such as those which randomise the simulator's dynamics parameters, or adapt a policy online using recurrent network architectures.

Submitted 15 August, 2020; originally announced August 2020.

Comments: To be published at IROS 2020. 8 pages, 6 figures. For supplementary material and code, please visit: https://www.robot-learning.uk/crossing-the-gap

arXiv:2008.03285 [pdf, other]

Physics-Based Dexterous Manipulations with Estimated Hand Poses and Residual Reinforcement Learning

Authors: Guillermo Garcia-Hernando, Edward Johns, Tae-Kyun Kim

Abstract: Dexterous manipulation of objects in virtual environments with our bare hands, by using only a depth sensor and a state-of-the-art 3D hand pose estimator (HPE), is challenging. While virtual environments are ruled by physics, e.g. object weights and surface frictions, the absence of force feedback makes the task challenging, as even slight inaccuracies on finger tips or contact points from HPE may make the interactions fail. Prior arts simply generate contact forces in the direction of the fingers' closures, when finger joints penetrate virtual objects. Although useful for simple grasping scenarios, they cannot be applied to dexterous manipulations such as in-hand manipulation. Existing reinforcement learning (RL) and imitation learning (IL) approaches train agents that learn skills by using task-specific rewards, without considering any online user input. In this work, we propose to learn a model that maps noisy input hand poses to target virtual poses, which introduces the needed contacts to accomplish the tasks on a physics simulator. The agent is trained in a residual setting by using a model-free hybrid RL+IL approach. A 3D hand pose estimation reward is introduced leading to an improvement on HPE accuracy when the physics-guided corrected target poses are remapped to the input space. As the model corrects HPE errors by applying minor but crucial joint displacements for contacts, this helps to keep the generated motion visually close to the user input. Since HPE sequences performing successful virtual interactions do not exist, a data generation scheme to train and evaluate the system is proposed. We test our framework in two applications that use hand pose estimates for dexterous manipulations: hand-object interactions in VR and hand-object motion reconstruction in-the-wild.

Submitted 7 August, 2020; originally announced August 2020.

Comments: To appear in IROS 2020

arXiv:2008.00892 [pdf, other]

Shape Adaptor: A Learnable Resizing Module

Authors: Shikun Liu, Zhe Lin, Yilin Wang, Jianming Zhang, Federico Perazzi, Edward Johns

Abstract: We present a novel resizing module for neural networks: shape adaptor, a drop-in enhancement built on top of traditional resizing layers, such as pooling, bilinear sampling, and strided convolution. Whilst traditional resizing layers have fixed and deterministic reshaping factors, our module allows for a learnable reshaping factor. Our implementation enables shape adaptors to be trained end-to-end without any additional supervision, through which network architectures can be optimised for each individual task, in a fully automated way. We performed experiments across seven image classification datasets, and results show that by simply using a set of our shape adaptors instead of the original resizing layers, performance increases consistently over human-designed networks, across all datasets. Additionally, we show the effectiveness of shape adaptors on two other applications: network compression and transfer learning. The source code is available at: https://github.com/lorenmt/shape-adaptor.

Submitted 10 August, 2020; v1 submitted 3 August, 2020; originally announced August 2020.

Comments: Published at ECCV 2020

arXiv:2004.00716 [pdf, other]

doi 10.1109/LRA.2020.2965392

Constrained-Space Optimization and Reinforcement Learning for Complex Tasks

Authors: Ya-Yen Tsai, Bo Xiao, Edward Johns, Guang-Zhong Yang

Abstract: Learning from Demonstration is increasingly used for transferring operator manipulation skills to robots. In practice, it is important to cater for limited data and imperfect human demonstrations, as well as underlying safety constraints. This paper presents a constrained-space optimization and reinforcement learning scheme for managing complex tasks. Through interactions within the constrained space, the reinforcement learning agent is trained to optimize the manipulation skills according to a defined reward function. After learning, the optimal policy is derived from the well-trained reinforcement learning agent, which is then implemented to guide the robot to conduct tasks that are similar to the experts' demonstrations. The effectiveness of the proposed method is verified with a robotic suturing task, demonstrating that the learned policy outperformed the experts' demonstrations in terms of the smoothness of the joint motion and end-effector trajectories, as well as the overall task completion time.

Submitted 1 April, 2020; originally announced April 2020.

Comments: Accepted for publication in RA-Letters and at ICRA 2020

Journal ref: IEEE Robotics and Automation Letters, 5(2) (2020) 682-689

arXiv:2004.00136 [pdf, other]

Sim-to-Real Transfer for Optical Tactile Sensing

Authors: Zihan Ding, Nathan F. Lepora, Edward Johns

Abstract: Deep learning and reinforcement learning methods have been shown to enable learning of flexible and complex robot controllers. However, the reliance on large amounts of training data often requires data collection to be carried out in simulation, with a number of sim-to-real transfer methods being developed in recent years. In this paper, we study these techniques for tactile sensing using the TacTip optical tactile sensor, which consists of a deformable tip with a camera observing the positions of pins inside this tip. We designed a model for soft body simulation which was implemented using the Unity physics engine, and trained a neural network to predict the locations and angles of edges when in contact with the sensor. Using domain randomisation techniques for sim-to-real transfer, we show how this framework can be used to accurately predict edges with less than 1 mm prediction error in real-world testing, without any real-world data at all.

Submitted 31 March, 2020; originally announced April 2020.

Comments: Accepted for publication at ICRA 2020. Website: https://www.robot-learning.uk/sim-to-real-tactile-icra-2020

arXiv:1910.10799 [pdf, other]

doi 10.1098/rspa.2019.0716

Force networks, torque balance and Airy stress in the planar vertex model of a confluent epithelium

Authors: Oliver E. Jensen, Emma Johns, Sarah Woolner

Abstract: The vertex model is a popular framework for modelling tightly packed biological cells, such as confluent epithelia. Cells are described by convex polygons tiling the plane and their equilibrium is found by minimizing a global mechanical energy, with vertex locations treated as degrees of freedom. Drawing on analogies with granular materials, we describe the force network for a localized monolayer and derive the corresponding discrete Airy stress function, expressed for each $N$-sided cell as $N$ scalars defined over kites covering the cell. We show how a torque balance (commonly overlooked in implementations of the vertex model) requires each internal vertex to lie at the orthocentre of the triangle formed by neighbouring edge centroids. Torque balance also places a geometric constraint on the stress in the neighbourhood of cellular trijunctions, and requires cell edges to be orthogonal to the links of a dual network that connect neighbouring cell centres and thereby triangulate the monolayer. We show how the Airy stress function depends on cell shape when a standard energy functional is adopted, and discuss implications for computational implementations of the model.

Submitted 2 April, 2020; v1 submitted 23 October, 2019; originally announced October 2019.

arXiv:1901.08933 [pdf, other]

Self-Supervised Generalisation with Meta Auxiliary Learning

Authors: Shikun Liu, Andrew J. Davison, Edward Johns

Abstract: Learning with auxiliary tasks can improve the ability of a primary task to generalise. However, this comes at the cost of manually labelling auxiliary data. We propose a new method which automatically learns appropriate labels for an auxiliary task, such that any supervised learning task can be improved without requiring access to any further data. The approach is to train two neural networks: a label-generation network to predict the auxiliary labels, and a multi-task network to train the primary task alongside the auxiliary task. The loss for the label-generation network incorporates the loss of the multi-task network, and so this interaction between the two networks can be seen as a form of meta learning with a double gradient. We show that our proposed method, Meta AuXiliary Learning (MAXL), outperforms single-task learning on 7 image datasets, without requiring any additional data. We also show that MAXL outperforms several other baselines for generating auxiliary labels, and is even competitive when compared with human-defined auxiliary labels. The self-supervised nature of our method leads to a promising new direction towards automated generalisation. Source code can be found at https://github.com/lorenmt/maxl.

Submitted 26 November, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

Comments: Published at Conference on Neural Information Processing Systems 2019

arXiv:1803.10704 [pdf, other]

End-to-End Multi-Task Learning with Attention

Authors: Shikun Liu, Edward Johns, Andrew J. Davison

Abstract: We propose a novel multi-task learning architecture, which allows learning of task-specific feature-level attention. Our design, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with a soft-attention module for each task. These modules allow for learning of task-specific features from the global features, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be trained end-to-end and can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. We evaluate our approach on a variety of datasets, across both image-to-image predictions and image classification tasks. We show that our architecture is state-of-the-art in multi-task learning compared to existing methods, and is also less sensitive to various weighting schemes in the multi-task loss function. Code is available at https://github.com/lorenmt/mtan.

Submitted 5 April, 2019; v1 submitted 28 March, 2018; originally announced March 2018.

Comments: Accepted at Computer Vision and Pattern Recognition (CVPR), 2019

arXiv:1711.02909 [pdf, other]

doi 10.1103/PhysRevE.97.052409

Mechanical characterization of disordered and anisotropic cellular monolayers

Authors: Alexander Nestor-Bergmann, Emma Johns, Sarah Woolner, Oliver E. Jensen

Abstract: We consider a cellular monolayer, described using a vertex-based model, for which cells form a spatially disordered array of convex polygons that tile the plane. Equilibrium cell configurations are assumed to minimize a global energy defined in terms of cell areas and perimeters; energy is dissipated via dynamic area and length changes, as well as cell neighbour exchanges. The model captures our observations of an epithelium from a Xenopus embryo showing that uniaxial stretching induces spatial ordering, with cells under net tension (compression) tending to align with (against) the direction of stretch, but with the stress remaining heterogeneous at the single-cell level. We use the vertex model to derive the linearized relation between tissue-level stress, strain and strain-rate about a deformed base state, which can be used to characterize the tissue's anisotropic mechanical properties; expressions for viscoelastic tissue moduli are given as direct sums over cells. When the base state is isotropic, the model predicts that tissue properties can be tuned to a regime with high elastic shear resistance but low resistance to area changes, or vice versa.

Submitted 24 February, 2018; v1 submitted 8 November, 2017; originally announced November 2017.

Comments: 9 figures

Journal ref: Phys. Rev. E 97, 052409 (2018)

arXiv:1707.02267 [pdf, other]

Transferring End-to-End Visuomotor Control from Simulation to Real World for a Multi-Stage Task

Authors: Stephen James, Andrew J. Davison, Edward Johns

Abstract: End-to-end control for robot manipulation and grasping is emerging as an attractive alternative to traditional pipelined approaches. However, end-to-end methods tend to either be slow to train, exhibit little or no generalisability, or lack the ability to accomplish long-horizon or multi-stage tasks. In this paper, we show how two simple techniques can lead to end-to-end (image to velocity) execution of a multi-stage task, which is analogous to a simple tidying routine, without having seen a single real image. This involves locating, reaching for, and grasping a cube, then locating a basket and dropping the cube inside. To achieve this, robot trajectories are computed in a simulator, to collect a series of control velocities which accomplish the task. Then, a CNN is trained to map observed images to velocities, using domain randomisation to enable generalisation to real world images. Results show that we are able to successfully accomplish the task in the real world with the ability to generalise to novel environments, including those with dynamic lighting conditions, distractor objects, and moving objects, including the basket itself. We believe our approach to be simple, highly scalable, and capable of learning long-horizon tasks that have until now not been shown with the state-of-the-art in end-to-end robot control.

Submitted 17 October, 2017; v1 submitted 7 July, 2017; originally announced July 2017.

Comments: 1st Conference on Robot Learning (CoRL 2017), Mountain View, United States

arXiv:1705.08260 [pdf]

Self-Supervised Siamese Learning on Stereo Image Pairs for Depth Estimation in Robotic Surgery

Authors: Menglong Ye, Edward Johns, Ankur Handa, Lin Zhang, Philip Pratt, Guang-Zhong Yang

Abstract: Robotic surgery has become a powerful tool for performing minimally invasive procedures, providing advantages in dexterity, precision, and 3D vision, over traditional surgery. One popular robotic system is the da Vinci surgical platform, which allows preoperative information to be incorporated into live procedures using Augmented Reality (AR). Scene depth estimation is a prerequisite for AR, as accurate registration requires 3D correspondences between preoperative and intraoperative organ models. In the past decade, there has been much progress on depth estimation for surgical scenes, such as using monocular or binocular laparoscopes [1,2]. More recently, advances in deep learning have enabled depth estimation via Convolutional Neural Networks (CNNs) [3], but training requires a large image dataset with ground truth depths. Inspired by [4], we propose a deep learning framework for surgical scene depth estimation using self-supervision for scalable data acquisition. Our framework consists of an autoencoder for depth prediction, and a differentiable spatial transformer for training the autoencoder on stereo image pairs without ground truth depths. Validation was conducted on stereo videos collected in robotic partial nephrectomy.

Submitted 17 May, 2017; originally announced May 2017.

Comments: A two-page short report to be presented at the Hamlyn Symposium on Medical Robotics 2017. An extension of this work is in progress

arXiv:1609.03759 [pdf, other]

3D Simulation for Robot Arm Control with Deep Q-Learning

Authors: Stephen James, Edward Johns

Abstract: Recent trends in robot arm control have seen a shift towards end-to-end solutions, using deep reinforcement learning to learn a controller directly from raw sensor data, rather than relying on a hand-crafted, modular pipeline. However, the high dimensionality of the state space often means that it is impractical to generate sufficient training data with real-world experiments. As an alternative solution, we propose to learn a robot controller in simulation, with the potential of then transferring this to a real robot. Building upon the recent success of deep Q-networks, we present an approach which uses 3D simulations to train a 7-DOF robotic arm in a control task without any prior knowledge. The controller accepts images of the environment as its only input, and outputs motor actions for the task of locating and grasping a cube, over a range of initial configurations. To encourage efficient learning, a structured reward function is designed with intermediate rewards. We also present preliminary results in direct transfer of policies over to a real robot, without any further training.

Submitted 13 December, 2016; v1 submitted 13 September, 2016; originally announced September 2016.

Comments: In NIPS 2016 Workshop: Deep Learning for Action and Interaction (https://sites.google.com/site/nips16interaction/)

arXiv:1608.02239 [pdf, other]

Deep Learning a Grasp Function for Grasping under Gripper Pose Uncertainty

Authors: Edward Johns, Stefan Leutenegger, Andrew J. Davison

Abstract: This paper presents a new method for parallel-jaw grasping of isolated objects from depth images, under large gripper pose uncertainty. Whilst most approaches aim to predict the single best grasp pose from an image, our method first predicts a score for every possible grasp pose, which we denote the grasp function. With this, it is possible to achieve grasping robust to the gripper's pose uncertainty, by smoothing the grasp function with the pose uncertainty function. Therefore, if the single best pose is adjacent to a region of poor grasp quality, that pose will no longer be chosen, and instead a pose will be chosen which is surrounded by a region of high grasp quality. To learn this function, we train a Convolutional Neural Network which takes as input a single depth image of an object, and outputs a score for each grasp pose across the image. Training data for this is generated by use of physics simulation and depth image simulation with 3D object meshes, to enable acquisition of sufficient data without requiring exhaustive real-world experiments. We evaluate with both synthetic and real experiments, and show that the learned grasp score is more robust to gripper pose uncertainty than when this uncertainty is not accounted for.

Submitted 7 August, 2016; originally announced August 2016.

Comments: IROS 2016

arXiv:1605.08359 [pdf, other]

Pairwise Decomposition of Image Sequences for Active Multi-View Recognition

Authors: Edward Johns, Stefan Leutenegger, Andrew J. Davison

Abstract: A multi-view image sequence provides a much richer capacity for object recognition than from a single image. However, most existing solutions to multi-view recognition typically adopt hand-crafted, model-based geometric methods, which do not readily embrace recent trends in deep learning. We propose to bring Convolutional Neural Networks to generic multi-view recognition, by decomposing an image sequence into a set of image pairs, classifying each pair independently, and then learning an object classifier by weighting the contribution of each pair. This allows for recognition over arbitrary camera trajectories, without requiring explicit training over the potentially infinite number of camera paths and lengths. Building these pairwise relationships then naturally extends to the next-best-view problem in an active recognition framework. To achieve this, we train a second Convolutional Neural Network to map directly from an observed image to next viewpoint. Finally, we incorporate this into a trajectory optimisation task, whereby the best recognition confidence is sought for a given trajectory length. We present state-of-the-art results in both guided and unguided multi-view recognition on the ModelNet dataset, and show how our method can be used with depth images, greyscale images, or both.

Submitted 26 May, 2016; originally announced May 2016.

Comments: CVPR 2016 (oral)
