-
Gemini Robotics: Bringing AI into the Physical World
Authors:
Gemini Robotics Team,
Saminda Abeyruwan,
Joshua Ainslie,
Jean-Baptiste Alayrac,
Montserrat Gonzalez Arenas,
Travis Armstrong,
Ashwin Balakrishna,
Robert Baruch,
Maria Bauza,
Michiel Blokzijl,
Steven Bohez,
Konstantinos Bousmalis,
Anthony Brohan,
Thomas Buschmann,
Arunkumar Byravan,
Serkan Cabi,
Ken Caluwaerts,
Federico Casarini,
Oscar Chang,
Jose Enrique Chen,
Xi Chen,
Hao-Tien Lewis Chiang,
Krzysztof Choromanski,
David D'Ambrosio,
Sudeep Dasari
, et al. (93 additional authors not shown)
Abstract:
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Lang…
▽ More
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realizes AI's potential in the physical world.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Learning from negative feedback, or positive feedback or both
Authors:
Abbas Abdolmaleki,
Bilal Piot,
Bobak Shahriari,
Jost Tobias Springenberg,
Tim Hertweck,
Rishabh Joshi,
Junhyuk Oh,
Michael Bloesch,
Thomas Lampe,
Nicolas Heess,
Jonas Buchli,
Martin Riedmiller
Abstract:
Existing preference optimization methods often assume scenarios where paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability in scenarios where only unpaired feedback--for example, either positive or negative--is available. To address this, we introduce a novel approach that decouples learning from positive and…
▽ More
Existing preference optimization methods often assume scenarios where paired preference feedback (preferred/positive vs. dis-preferred/negative examples) is available. This requirement limits their applicability in scenarios where only unpaired feedback--for example, either positive or negative--is available. To address this, we introduce a novel approach that decouples learning from positive and negative feedback. This decoupling enables control over the influence of each feedback type and, importantly, allows learning even when only one feedback type is present. A key contribution is demonstrating stable learning from negative feedback alone, a capability not well-addressed by current methods. Our approach builds upon the probabilistic framework introduced in (Dayan and Hinton, 1997), which uses expectation-maximization (EM) to directly optimize the probability of positive outcomes (as opposed to classic expected reward maximization). We address a key limitation in current EM-based methods: they solely maximize the likelihood of positive examples, while neglecting negative ones. We show how to extend EM algorithms to explicitly incorporate negative examples, leading to a theoretically grounded algorithm that offers an intuitive and versatile way to learn from both positive and negative feedback. We evaluate our approach for training language models based on human feedback as well as training policies for sequential decision-making problems, where learned value functions are available.
△ Less
Submitted 7 March, 2025; v1 submitted 5 October, 2024;
originally announced October 2024.
-
Game On: Towards Language Models as RL Experimenters
Authors:
Jingwei Zhang,
Thomas Lampe,
Abbas Abdolmaleki,
Jost Tobias Springenberg,
Martin Riedmiller
Abstract:
We propose an agent architecture that automates parts of the common reinforcement learning experiment workflow, to enable automated mastery of control domains for embodied agents. To do so, it leverages a VLM to perform some of the capabilities normally required of a human experimenter, including the monitoring and analysis of experiment progress, the proposition of new tasks based on past success…
▽ More
We propose an agent architecture that automates parts of the common reinforcement learning experiment workflow, to enable automated mastery of control domains for embodied agents. To do so, it leverages a VLM to perform some of the capabilities normally required of a human experimenter, including the monitoring and analysis of experiment progress, the proposition of new tasks based on past successes and failures of the agent, decomposing tasks into a sequence of subtasks (skills), and retrieval of the skill to execute - enabling our system to build automated curricula for learning. We believe this is one of the first proposals for a system that leverages a VLM throughout the full experiment cycle of reinforcement learning. We provide a first prototype of this system, and examine the feasibility of current models and techniques for the desired level of automation. For this, we use a standard Gemini model, without additional fine-tuning, to provide a curriculum of skills to a language-conditioned Actor-Critic algorithm, in order to steer data collection so as to aid learning new skills. Data collected in this way is shown to be useful for learning and iteratively improving control policies in a robotics domain. Additional examination of the ability of the system to build a growing library of skills, and to judge the progress of the training of those skills, also shows promising results, suggesting that the proposed architecture provides a potential recipe for fully automated mastery of tasks and domains for embodied agents.
△ Less
Submitted 5 September, 2024;
originally announced September 2024.
-
Real-World Fluid Directed Rigid Body Control via Deep Reinforcement Learning
Authors:
Mohak Bhardwaj,
Thomas Lampe,
Michael Neunert,
Francesco Romano,
Abbas Abdolmaleki,
Arunkumar Byravan,
Markus Wulfmeier,
Martin Riedmiller,
Jonas Buchli
Abstract:
Recent advances in real-world applications of reinforcement learning (RL) have relied on the ability to accurately simulate systems at scale. However, domains such as fluid dynamical systems exhibit complex dynamic phenomena that are hard to simulate at high integration rates, limiting the direct application of modern deep RL algorithms to often expensive or safety critical hardware. In this work,…
▽ More
Recent advances in real-world applications of reinforcement learning (RL) have relied on the ability to accurately simulate systems at scale. However, domains such as fluid dynamical systems exhibit complex dynamic phenomena that are hard to simulate at high integration rates, limiting the direct application of modern deep RL algorithms to often expensive or safety critical hardware. In this work, we introduce "Box o Flows", a novel benchtop experimental control system for systematically evaluating RL algorithms in dynamic real-world scenarios. We describe the key components of the Box o Flows, and through a series of experiments demonstrate how state-of-the-art model-free RL algorithms can synthesize a variety of complex behaviors via simple reward specifications. Furthermore, we explore the role of offline RL in data-efficient hypothesis testing by reusing past experiences. We believe that the insights gained from this preliminary study and the availability of systems like the Box o Flows support the way forward for developing systematic RL algorithms that can be generally applied to complex, dynamical systems. Supplementary material and videos of experiments are available at https://sites.google.com/view/box-o-flows/home.
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Offline Actor-Critic Reinforcement Learning Scales to Large Models
Authors:
Jost Tobias Springenberg,
Abbas Abdolmaleki,
Jingwei Zhang,
Oliver Groth,
Michael Bloesch,
Thomas Lampe,
Philemon Brakel,
Sarah Bechtle,
Steven Kapturowski,
Roland Hafner,
Nicolas Heess,
Martin Riedmiller
Abstract:
We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows similar scaling laws as supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We…
▽ More
We show that offline actor-critic reinforcement learning can scale to large models - such as transformers - and follows similar scaling laws as supervised learning. We find that offline actor-critic algorithms can outperform strong, supervised, behavioral cloning baselines for multi-task training on a large dataset containing both sub-optimal and expert behavior on 132 continuous control tasks. We introduce a Perceiver-based actor-critic model and elucidate the key model features needed to make offline RL work with self- and cross-attention modules. Overall, we find that: i) simple offline actor critic algorithms are a natural choice for gradually moving away from the currently predominant paradigm of behavioral cloning, and ii) via offline RL it is possible to learn multi-task policies that master many domains simultaneously, including real robotics tasks, from sub-optimal demonstrations or self-generated data.
△ Less
Submitted 8 February, 2024;
originally announced February 2024.
-
Mastering Stacking of Diverse Shapes with Large-Scale Iterative Reinforcement Learning on Real Robots
Authors:
Thomas Lampe,
Abbas Abdolmaleki,
Sarah Bechtle,
Sandy H. Huang,
Jost Tobias Springenberg,
Michael Bloesch,
Oliver Groth,
Roland Hafner,
Tim Hertweck,
Michael Neunert,
Markus Wulfmeier,
Jingwei Zhang,
Francesco Nori,
Nicolas Heess,
Martin Riedmiller
Abstract:
Reinforcement learning solely from an agent's self-generated data is often believed to be infeasible for learning on real robots, due to the amount of data needed. However, if done right, agents learning from real data can be surprisingly efficient through re-using previously collected sub-optimal data. In this paper we demonstrate how the increased understanding of off-policy learning methods and…
▽ More
Reinforcement learning solely from an agent's self-generated data is often believed to be infeasible for learning on real robots, due to the amount of data needed. However, if done right, agents learning from real data can be surprisingly efficient through re-using previously collected sub-optimal data. In this paper we demonstrate how the increased understanding of off-policy learning methods and their embedding in an iterative online/offline scheme (``collect and infer'') can drastically improve data-efficiency by using all the collected experience, which empowers learning from real robot experience only. Moreover, the resulting policy improves significantly over the state of the art on a recently proposed real robot manipulation benchmark. Our approach learns end-to-end, directly from pixels, and does not rely on additional human domain knowledge such as a simulator or demonstrations.
△ Less
Submitted 18 December, 2023;
originally announced December 2023.
-
Replay across Experiments: A Natural Extension of Off-Policy RL
Authors:
Dhruva Tirumala,
Thomas Lampe,
Jose Enrique Chen,
Tuomas Haarnoja,
Sandy Huang,
Guy Lever,
Ben Moran,
Tim Hertweck,
Leonard Hasenclever,
Martin Riedmiller,
Nicolas Heess,
Markus Wulfmeier
Abstract:
Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL). We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times. At its core, Replay Across Experiments (RaE) involve…
▽ More
Replaying data is a principal mechanism underlying the stability and data efficiency of off-policy reinforcement learning (RL). We present an effective yet simple framework to extend the use of replays across multiple experiments, minimally adapting the RL workflow for sizeable improvements in controller performance and research iteration times. At its core, Replay Across Experiments (RaE) involves reusing experience from previous experiments to improve exploration and bootstrap learning while reducing required changes to a minimum in comparison to prior work. We empirically show benefits across a number of RL algorithms and challenging control domains spanning both locomotion and manipulation, including hard exploration tasks from egocentric vision. Through comprehensive ablations, we demonstrate robustness to the quality and amount of data available and various hyperparameter choices. Finally, we discuss how our approach can be applied more broadly across research life cycles and can increase resilience by reloading data across random seeds or hyperparameter variations.
△ Less
Submitted 28 November, 2023; v1 submitted 27 November, 2023;
originally announced November 2023.
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Authors:
Konstantinos Bousmalis,
Giulia Vezzani,
Dushyant Rao,
Coline Devin,
Alex X. Lee,
Maria Bauza,
Todor Davchev,
Yuxiang Zhou,
Agrim Gupta,
Akhil Raju,
Antoine Laurens,
Claudio Fantacci,
Valentin Dalibard,
Martina Zambelli,
Murilo Martins,
Rugile Pevceviciute,
Michiel Blokzijl,
Misha Denil,
Nathan Batchelor,
Thomas Lampe,
Emilio Parisotto,
Konrad Żołna,
Scott Reed,
Sergio Gómez Colmenarejo,
Jon Scholz
, et al. (14 additional authors not shown)
Abstract:
The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned de…
▽ More
The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.
△ Less
Submitted 22 December, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration
Authors:
Giulia Vezzani,
Dhruva Tirumala,
Markus Wulfmeier,
Dushyant Rao,
Abbas Abdolmaleki,
Ben Moran,
Tuomas Haarnoja,
Jan Humplik,
Roland Hafner,
Michael Neunert,
Claudio Fantacci,
Tim Hertweck,
Thomas Lampe,
Fereshteh Sadeghi,
Nicolas Heess,
Martin Riedmiller
Abstract:
The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations.For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert be…
▽ More
The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations.For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution.It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of differentc omponents of our method.
△ Less
Submitted 11 January, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
How to Spend Your Robot Time: Bridging Kickstarting and Offline Reinforcement Learning for Vision-based Robotic Manipulation
Authors:
Alex X. Lee,
Coline Devin,
Jost Tobias Springenberg,
Yuxiang Zhou,
Thomas Lampe,
Abbas Abdolmaleki,
Konstantinos Bousmalis
Abstract:
Reinforcement learning (RL) has been shown to be effective at learning control from experience. However, RL typically requires a large amount of online interaction with the environment. This limits its applicability to real-world settings, such as in robotics, where such interaction is expensive. In this work we investigate ways to minimize online interactions in a target task, by reusing a subopt…
▽ More
Reinforcement learning (RL) has been shown to be effective at learning control from experience. However, RL typically requires a large amount of online interaction with the environment. This limits its applicability to real-world settings, such as in robotics, where such interaction is expensive. In this work we investigate ways to minimize online interactions in a target task, by reusing a suboptimal policy we might have access to, for example from training on related prior tasks, or in simulation. To this end, we develop two RL algorithms that can speed up training by using not only the action distributions of teacher policies, but also data collected by such policies on the task at hand. We conduct a thorough experimental study of how to use suboptimal teachers on a challenging robotic manipulation benchmark on vision-based stacking with diverse objects. We compare our methods to offline, online, offline-to-online, and kickstarting RL algorithms. By doing so, we find that training on data from both the teacher and student, enables the best performance for limited data budgets. We examine how to best allocate a limited data budget -- on the target task -- between the teacher and the student policy, and report experiments using varying budgets, two teachers with different degrees of suboptimality, and five stacking tasks that require a diverse set of behaviors. Our analysis, both in simulation and in the real world, shows that our approach is the best across data budgets, while standard offline RL from teacher rollouts is surprisingly effective when enough data is given.
△ Less
Submitted 6 May, 2022;
originally announced May 2022.
-
Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes
Authors:
Alex X. Lee,
Coline Devin,
Yuxiang Zhou,
Thomas Lampe,
Konstantinos Bousmalis,
Jost Tobias Springenberg,
Arunkumar Byravan,
Abbas Abdolmaleki,
Nimrod Gileadi,
David Khosid,
Claudio Fantacci,
Jose Enrique Chen,
Akhil Raju,
Rae Jeong,
Michael Neunert,
Antoine Laurens,
Stefano Saliceti,
Federico Casarini,
Martin Riedmiller,
Raia Hadsell,
Francesco Nori
Abstract:
We study the problem of robotic stacking with objects of complex geometry. We propose a challenging and diverse set of such objects that was carefully designed to require strategies beyond a simple "pick-and-place" solution. Our method is a reinforcement learning (RL) approach combined with vision-based interactive policy distillation and simulation-to-reality transfer. Our learned policies can ef…
▽ More
We study the problem of robotic stacking with objects of complex geometry. We propose a challenging and diverse set of such objects that was carefully designed to require strategies beyond a simple "pick-and-place" solution. Our method is a reinforcement learning (RL) approach combined with vision-based interactive policy distillation and simulation-to-reality transfer. Our learned policies can efficiently handle multiple object combinations in the real world and exhibit a large variety of stacking skills. In a large experimental study, we investigate what choices matter for learning such general vision-based agents in simulation, and what affects optimal transfer to the real robot. We then leverage data collected by such policies and improve upon them with offline RL. A video and a blog post of our work are provided as supplementary material.
△ Less
Submitted 3 November, 2021; v1 submitted 12 October, 2021;
originally announced October 2021.
-
Representation Matters: Improving Perception and Exploration for Robotics
Authors:
Markus Wulfmeier,
Arunkumar Byravan,
Tim Hertweck,
Irina Higgins,
Ankush Gupta,
Tejas Kulkarni,
Malcolm Reynolds,
Denis Teplyashin,
Roland Hafner,
Thomas Lampe,
Martin Riedmiller
Abstract:
Projecting high-dimensional environment observations into lower-dimensional structured representations can considerably improve data-efficiency for reinforcement learning in domains with limited data such as robotics. Can a single generally useful representation be found? In order to answer this question, it is important to understand how the representation will be used by the agent and what prope…
▽ More
Projecting high-dimensional environment observations into lower-dimensional structured representations can considerably improve data-efficiency for reinforcement learning in domains with limited data such as robotics. Can a single generally useful representation be found? In order to answer this question, it is important to understand how the representation will be used by the agent and what properties such a 'good' representation should have. In this paper we systematically evaluate a number of common learnt and hand-engineered representations in the context of three robotics tasks: lifting, stacking and pushing of 3D blocks. The representations are evaluated in two use-cases: as input to the agent, or as a source of auxiliary tasks. Furthermore, the value of each representation is evaluated in terms of three properties: dimensionality, observability and disentanglement. We can significantly improve performance in both use-cases and demonstrate that some representations can perform commensurate to simulator states as agent inputs. Finally, our results challenge common intuitions by demonstrating that: 1) dimensionality strongly matters for task generation, but is negligible for inputs, 2) observability of task-relevant aspects mostly affects the input representation use-case, and 3) disentanglement leads to better auxiliary tasks, but has only limited benefits for input representations. This work serves as a step towards a more systematic understanding of what makes a 'good' representation for control in robotics, enabling practitioners to make more informed choices for developing new learned or hand-engineered representations.
△ Less
Submitted 21 March, 2021; v1 submitted 3 November, 2020;
originally announced November 2020.
-
"What, not how": Solving an under-actuated insertion task from scratch
Authors:
Giulia Vezzani,
Michael Neunert,
Markus Wulfmeier,
Rae Jeong,
Thomas Lampe,
Noah Siegel,
Roland Hafner,
Abbas Abdolmaleki,
Martin Riedmiller,
Francesco Nori
Abstract:
Robot manipulation requires a complex set of skills that need to be carefully combined and coordinated to solve a task. Yet, most ReinforcementLearning (RL) approaches in robotics study tasks which actually consist only of a single manipulation skill, such as grasping an object or inserting a pre-grasped object. As a result the skill ('how' to solve the task) but not the actual goal of a complete…
▽ More
Robot manipulation requires a complex set of skills that need to be carefully combined and coordinated to solve a task. Yet, most ReinforcementLearning (RL) approaches in robotics study tasks which actually consist only of a single manipulation skill, such as grasping an object or inserting a pre-grasped object. As a result the skill ('how' to solve the task) but not the actual goal of a complete manipulation ('what' to solve) is specified. In contrast, we study a complex manipulation goal that requires an agent to learn and combine diverse manipulation skills. We propose a challenging, highly under-actuated peg-in-hole task with a free, rotational asymmetrical peg, requiring a broad range of manipulation skills. While correct peg (re-)orientation is a requirement for successful insertion, there is no reward associated with it. Hence an agent needs to understand this pre-condition and learn the skill to fulfil it. The final insertion reward is sparse, allowing freedom in the solution and leading to complex emerging behaviour not envisioned during the task design. We tackle the problem in a multi-task RL framework using Scheduled Auxiliary Control (SAC-X) combined with Regularized Hierarchical Policy Optimization (RHPO) which successfully solves the task in simulation and from scratch on a single robot where data is severely limited.
△ Less
Submitted 30 October, 2020; v1 submitted 29 October, 2020;
originally announced October 2020.
-
Data-efficient Hindsight Off-policy Option Learning
Authors:
Markus Wulfmeier,
Dushyant Rao,
Roland Hafner,
Thomas Lampe,
Abbas Abdolmaleki,
Tim Hertweck,
Michael Neunert,
Dhruva Tirumala,
Noah Siegel,
Nicolas Heess,
Martin Riedmiller
Abstract:
We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option fr…
▽ More
We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. The results highlight the importance of both types of abstraction as well as off-policy training and trust-region constraints, particularly in challenging, simulated 3D robot manipulation tasks from raw pixel inputs. Finally, we intuitively adapt the inference step to investigate the effect of increased temporal abstraction on training with pre-trained options and from scratch.
△ Less
Submitted 15 June, 2021; v1 submitted 30 July, 2020;
originally announced July 2020.
-
Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning
Authors:
Noah Y. Siegel,
Jost Tobias Springenberg,
Felix Berkenkamp,
Abbas Abdolmaleki,
Michael Neunert,
Thomas Lampe,
Roland Hafner,
Nicolas Heess,
Martin Riedmiller
Abstract:
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In th…
▽ More
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed data-set (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a learned prior -- the advantage-weighted behavior model (ABM) -- to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch-RL that enables stable learning from conflicting data-sources. We find improvements on competitive baselines in a variety of RL tasks -- including standard continuous control benchmarks and multi-task learning for simulated and real-world robots.
△ Less
Submitted 17 June, 2020; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Continuous-Discrete Reinforcement Learning for Hybrid Control in Robotics
Authors:
Michael Neunert,
Abbas Abdolmaleki,
Markus Wulfmeier,
Thomas Lampe,
Jost Tobias Springenberg,
Roland Hafner,
Francesco Romano,
Jonas Buchli,
Nicolas Heess,
Martin Riedmiller
Abstract:
Many real-world control problems involve both discrete decision variables - such as the choice of control modes, gear switching or digital outputs - as well as continuous decision variables - such as velocity setpoints, control gains or analogue outputs. However, when defining the corresponding optimal control or reinforcement learning problem, it is commonly approximated with fully continuous or…
▽ More
Many real-world control problems involve both discrete decision variables - such as the choice of control modes, gear switching or digital outputs - as well as continuous decision variables - such as velocity setpoints, control gains or analogue outputs. However, when defining the corresponding optimal control or reinforcement learning problem, it is commonly approximated with fully continuous or fully discrete action spaces. These simplifications aim at tailoring the problem to a particular algorithm or solver which may only support one type of action space. Alternatively, expert heuristics are used to remove discrete actions from an otherwise continuous space. In contrast, we propose to treat hybrid problems in their 'native' form by solving them with hybrid reinforcement learning, which optimizes for discrete and continuous actions simultaneously. In our experiments, we first demonstrate that the proposed approach efficiently solves such natively hybrid reinforcement learning problems. We then show, both in simulation and on robotic hardware, the benefits of removing possibly imperfect expert-designed heuristics. Lastly, hybrid reinforcement learning encourages us to rethink problem definitions. We propose reformulating control problems, e.g. by adding meta actions, to improve exploration or reduce mechanical wear and tear.
△ Less
Submitted 2 January, 2020;
originally announced January 2020.
-
Modelling Generalized Forces with Reinforcement Learning for Sim-to-Real Transfer
Authors:
Rae Jeong,
Jackie Kay,
Francesco Romano,
Thomas Lampe,
Tom Rothorl,
Abbas Abdolmaleki,
Tom Erez,
Yuval Tassa,
Francesco Nori
Abstract:
Learning robotic control policies in the real world gives rise to challenges in data efficiency, safety, and controlling the initial condition of the system. On the other hand, simulations are a useful alternative as they provide an abundant source of data without the restrictions of the real world. Unfortunately, simulations often fail to accurately model complex real-world phenomena. Traditional…
▽ More
Learning robotic control policies in the real world gives rise to challenges in data efficiency, safety, and controlling the initial condition of the system. On the other hand, simulations are a useful alternative as they provide an abundant source of data without the restrictions of the real world. Unfortunately, simulations often fail to accurately model complex real-world phenomena. Traditional system identification techniques are limited in expressiveness by the analytical model parameters, and usually are not sufficient to capture such phenomena. In this paper we propose a general framework for improving the analytical model by optimizing state dependent generalized forces. State dependent generalized forces are expressive enough to model constraints in the equations of motion, while maintaining a clear physical meaning and intuition. We use reinforcement learning to efficiently optimize the mapping from states to generalized forces over a discounted infinite horizon. We show that using only minutes of real world data improves the sim-to-real control policy transfer. We demonstrate the feasibility of our approach by validating it on a nonprehensile manipulation task on the Sawyer robot.
△ Less
Submitted 21 October, 2019;
originally announced October 2019.
-
Self-Supervised Sim-to-Real Adaptation for Visual Robotic Manipulation
Authors:
Rae Jeong,
Yusuf Aytar,
David Khosid,
Yuxiang Zhou,
Jackie Kay,
Thomas Lampe,
Konstantinos Bousmalis,
Francesco Nori
Abstract:
Collecting and automatically obtaining reward signals from real robotic visual data for the purposes of training reinforcement learning algorithms can be quite challenging and time-consuming. Methods for utilizing unlabeled data can have a huge potential to further accelerate robotic learning. We consider here the problem of performing manipulation tasks from pixels. In such tasks, choosing an app…
▽ More
Collecting and automatically obtaining reward signals from real robotic visual data for the purposes of training reinforcement learning algorithms can be quite challenging and time-consuming. Methods for utilizing unlabeled data can have a huge potential to further accelerate robotic learning. We consider here the problem of performing manipulation tasks from pixels. In such tasks, choosing an appropriate state representation is crucial for planning and control. This is even more relevant with real images where noise, occlusions and resolution affect the accuracy and reliability of state estimation. In this work, we learn a latent state representation implicitly with deep reinforcement learning in simulation, and then adapt it to the real domain using unlabeled real robot data. We propose to do so by optimizing sequence-based self supervised objectives. These exploit the temporal nature of robot experience, and can be common in both the simulated and real domains, without assuming any alignment of underlying states in simulated and unlabeled real images. We propose Contrastive Forward Dynamics loss, which combines dynamics model learning with time-contrastive techniques. The learned state representation that results from our methods can be used to robustly solve a manipulation task in simulation and to successfully transfer the learned skill on a real system. We demonstrate the effectiveness of our approaches by training a vision-based reinforcement learning agent for cube stacking. Agents trained with our method, using only 5 hours of unlabeled real robot data for adaptation, shows a clear improvement over domain randomization, and standard visual domain adaptation techniques for sim-to-real transfer.
△ Less
Submitted 21 October, 2019;
originally announced October 2019.
-
Imagined Value Gradients: Model-Based Policy Optimization with Transferable Latent Dynamics Models
Authors:
Arunkumar Byravan,
Jost Tobias Springenberg,
Abbas Abdolmaleki,
Roland Hafner,
Michael Neunert,
Thomas Lampe,
Noah Siegel,
Nicolas Heess,
Martin Riedmiller
Abstract:
Humans are masters at quickly learning many complex tasks, relying on an approximate understanding of the dynamics of their environments. In much the same way, we would like our learning agents to quickly adapt to new tasks. In this paper, we explore how model-based Reinforcement Learning (RL) can facilitate transfer to new tasks. We develop an algorithm that learns an action-conditional, predicti…
▽ More
Humans are masters at quickly learning many complex tasks, relying on an approximate understanding of the dynamics of their environments. In much the same way, we would like our learning agents to quickly adapt to new tasks. In this paper, we explore how model-based Reinforcement Learning (RL) can facilitate transfer to new tasks. We develop an algorithm that learns an action-conditional, predictive model of expected future observations, rewards and values from which a policy can be derived by following the gradient of the estimated value along imagined trajectories. We show how robust policy optimization can be achieved in robot manipulation tasks even with approximate models that are learned directly from vision and proprioception. We evaluate the efficacy of our approach in a transfer learning scenario, re-using previously learned models on tasks with different reward structures and visual distractors, and show a significant improvement in learning speed compared to strong off-policy baselines. Videos with results can be found at https://sites.google.com/view/ivg-corl19
△ Less
Submitted 9 October, 2019;
originally announced October 2019.
-
Compositional Transfer in Hierarchical Reinforcement Learning
Authors:
Markus Wulfmeier,
Abbas Abdolmaleki,
Roland Hafner,
Jost Tobias Springenberg,
Michael Neunert,
Tim Hertweck,
Thomas Lampe,
Noah Siegel,
Nicolas Heess,
Martin Riedmiller
Abstract:
The successful application of general reinforcement learning algorithms to real-world robotics applications is often limited by their high data requirements. We introduce Regularized Hierarchical Policy Optimization (RHPO) to improve data-efficiency for domains with multiple dominant tasks and ultimately reduce required platform time. To this end, we employ compositional inductive biases on multip…
▽ More
The successful application of general reinforcement learning algorithms to real-world robotics applications is often limited by their high data requirements. We introduce Regularized Hierarchical Policy Optimization (RHPO) to improve data-efficiency for domains with multiple dominant tasks and ultimately reduce required platform time. To this end, we employ compositional inductive biases on multiple levels and corresponding mechanisms for sharing off-policy transition data across low-level controllers and tasks as well as scheduling of tasks. The presented algorithm enables stable and fast learning for complex, real-world domains in the parallel multitask and sequential transfer case. We show that the investigated types of hierarchy enable positive transfer while partially mitigating negative interference and evaluate the benefits of additional incentives for efficient, compositional task solutions in single task domains. Finally, we demonstrate substantial data-efficiency and final performance gains over competitive baselines in a week-long, physical robot stacking experiment.
△ Less
Submitted 19 May, 2020; v1 submitted 26 June, 2019;
originally announced June 2019.
-
Simultaneously Learning Vision and Feature-based Control Policies for Real-world Ball-in-a-Cup
Authors:
Devin Schwab,
Tobias Springenberg,
Murilo F. Martins,
Thomas Lampe,
Michael Neunert,
Abbas Abdolmaleki,
Tim Hertweck,
Roland Hafner,
Francesco Nori,
Martin Riedmiller
Abstract:
We present a method for fast training of vision based control policies on real robots. The key idea behind our method is to perform multi-task Reinforcement Learning with auxiliary tasks that differ not only in the reward to be optimized but also in the state-space in which they operate. In particular, we allow auxiliary task policies to utilize task features that are available only at training-ti…
▽ More
We present a method for fast training of vision based control policies on real robots. The key idea behind our method is to perform multi-task Reinforcement Learning with auxiliary tasks that differ not only in the reward to be optimized but also in the state-space in which they operate. In particular, we allow auxiliary task policies to utilize task features that are available only at training-time. This allows for fast learning of auxiliary policies, which subsequently generate good data for training the main, vision-based control policies. This method can be seen as an extension of the Scheduled Auxiliary Control (SAC-X) framework. We demonstrate the efficacy of our method by using both a simulated and real-world Ball-in-a-Cup game controlled by a robot arm. In simulation, our approach leads to significant learning speed-ups when compared to standard SAC-X. On the real robot we show that the task can be learned from-scratch, i.e., with no transfer from simulation and no imitation learning. Videos of our learned policies running on the real robot can be found at https://sites.google.com/view/rss-2019-sawyer-bic/.
△ Less
Submitted 18 February, 2019; v1 submitted 12 February, 2019;
originally announced February 2019.
-
Learning by Playing - Solving Sparse Reward Tasks from Scratch
Authors:
Martin Riedmiller,
Roland Hafner,
Thomas Lampe,
Michael Neunert,
Jonas Degrave,
Tom Van de Wiele,
Volodymyr Mnih,
Nicolas Heess,
Jost Tobias Springenberg
Abstract:
We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is t…
▽ More
We propose Scheduled Auxiliary Control (SAC-X), a new learning paradigm in the context of Reinforcement Learning (RL). SAC-X enables learning of complex behaviors - from scratch - in the presence of multiple sparse reward signals. To this end, the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment - enabling it to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach.
△ Less
Submitted 28 February, 2018;
originally announced February 2018.
-
Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards
Authors:
Mel Vecerik,
Todd Hester,
Jonathan Scholz,
Fumin Wang,
Olivier Pietquin,
Bilal Piot,
Nicolas Heess,
Thomas Rothörl,
Thomas Lampe,
Martin Riedmiller
Abstract:
We propose a general and model-free approach for Reinforcement Learning (RL) on real robotics with sparse rewards. We build upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations. Both demonstrations and actual interactions are used to fill a replay buffer and the sampling ratio between demonstrations and transitions is automatically tuned via a prioritized replay mecha…
▽ More
We propose a general and model-free approach for Reinforcement Learning (RL) on real robotics with sparse rewards. We build upon the Deep Deterministic Policy Gradient (DDPG) algorithm to use demonstrations. Both demonstrations and actual interactions are used to fill a replay buffer and the sampling ratio between demonstrations and transitions is automatically tuned via a prioritized replay mechanism. Typically, carefully engineered shaping rewards are required to enable the agents to efficiently explore on high dimensional control problems such as robotics. They are also required for model-based acceleration methods relying on local solvers such as iLQG (e.g. Guided Policy Search and Normalized Advantage Function). The demonstrations replace the need for carefully engineered rewards, and reduce the exploration problem encountered by classical RL approaches in these domains. Demonstrations are collected by a robot kinesthetically force-controlled by a human demonstrator. Results on four simulated insertion tasks show that DDPG from demonstrations out-performs DDPG, and does not require engineered rewards. Finally, we demonstrate the method on a real robotics task consisting of inserting a clip (flexible object) into a rigid object.
△ Less
Submitted 8 October, 2018; v1 submitted 27 July, 2017;
originally announced July 2017.
-
Data-efficient Deep Reinforcement Learning for Dexterous Manipulation
Authors:
Ivaylo Popov,
Nicolas Heess,
Timothy Lillicrap,
Roland Hafner,
Gabriel Barth-Maron,
Matej Vecerik,
Thomas Lampe,
Yuval Tassa,
Tom Erez,
Martin Riedmiller
Abstract:
Deep learning and reinforcement learning methods have recently been used to solve a variety of problems in continuous control domains. An obvious application of these techniques is dexterous manipulation tasks in robotics which are difficult to solve using traditional control theory or hand-engineered approaches. One example of such a task is to grasp an object and precisely stack it on another. S…
▽ More
Deep learning and reinforcement learning methods have recently been used to solve a variety of problems in continuous control domains. An obvious application of these techniques is dexterous manipulation tasks in robotics which are difficult to solve using traditional control theory or hand-engineered approaches. One example of such a task is to grasp an object and precisely stack it on another. Solving this difficult and practically relevant problem in the real world is an important long-term goal for the field of robotics. Here we take a step towards this goal by examining the problem in simulation and providing models and techniques aimed at solving it. We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them. Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots.
△ Less
Submitted 10 April, 2017;
originally announced April 2017.
-
TACTICS: TACTICal Service Oriented Architecture
Authors:
Alessandro Aloisio,
Marco Autili,
Alfredo D'Angelo,
Antti Viidanoja,
Jérémie Leguay,
Tobias Ginzler,
Thorsten Lampe,
Luca Spagnolo,
Stephen Wolthusen,
Adam Flizikowski,
Joanna Sliwa
Abstract:
Due to the increasing complexity and heterogeneity of contemporary Command, Control, Communications, Computers, & Intelligence systems at all levels within military organizations, the adoption of the Service Oriented Architectures (SOA) principles and concepts is becoming essential. SOA provides flexibility and interoperability of services enabling the realization of efficient and modular informat…
▽ More
Due to the increasing complexity and heterogeneity of contemporary Command, Control, Communications, Computers, & Intelligence systems at all levels within military organizations, the adoption of the Service Oriented Architectures (SOA) principles and concepts is becoming essential. SOA provides flexibility and interoperability of services enabling the realization of efficient and modular information infrastructure for command and control systems. However, within a tactical domain, the presence of potentially highly mobile actors equipped with constrained communications media (i.e., unreliable radio networks with limited bandwidth) limits the applicability of traditional SOA technologies. The TACTICS project aims at the definition and experimental demonstration of a Tactical Services Infrastructure enabling tactical radio networks (without any modifications of the radio part of those networks) to participate in SOA infrastructures and provide, as well as consume, services to and from the strategic domain independently of the user's location.
△ Less
Submitted 28 April, 2015;
originally announced April 2015.