-
Gemini Robotics: Bringing AI into the Physical World
Authors:
Gemini Robotics Team,
Saminda Abeyruwan,
Joshua Ainslie,
Jean-Baptiste Alayrac,
Montserrat Gonzalez Arenas,
Travis Armstrong,
Ashwin Balakrishna,
Robert Baruch,
Maria Bauza,
Michiel Blokzijl,
Steven Bohez,
Konstantinos Bousmalis,
Anthony Brohan,
Thomas Buschmann,
Arunkumar Byravan,
Serkan Cabi,
Ken Caluwaerts,
Federico Casarini,
Oscar Chang,
Jose Enrique Chen,
Xi Chen,
Hao-Tien Lewis Chiang,
Krzysztof Choromanski,
David D'Ambrosio,
Sudeep Dasari
, et al. (93 additional authors not shown)
Abstract:
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open-vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realize AI's potential in the physical world.
Submitted 25 March, 2025;
originally announced March 2025.
-
GATS: Gather-Attend-Scatter
Authors:
Konrad Zolna,
Serkan Cabi,
Yutian Chen,
Eric Lau,
Claudio Fantacci,
Jurgis Pasukonis,
Jost Tobias Springenberg,
Sergio Gomez Colmenarejo
Abstract:
As the AI community increasingly adopts large-scale models, it is crucial to develop general and flexible tools to integrate them. We introduce Gather-Attend-Scatter (GATS), a novel module that enables seamless combination of pretrained foundation models, both trainable and frozen, into larger multimodal networks. GATS empowers AI systems to process and generate information across multiple modalities at different rates. In contrast to traditional fine-tuning, GATS allows for the original component models to remain frozen, avoiding the risk of them losing important knowledge acquired during the pretraining phase. We demonstrate the utility and versatility of GATS with a few experiments across games, robotics, and multimodal input-output systems.
Submitted 16 January, 2024;
originally announced January 2024.
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Authors:
Konstantinos Bousmalis,
Giulia Vezzani,
Dushyant Rao,
Coline Devin,
Alex X. Lee,
Maria Bauza,
Todor Davchev,
Yuxiang Zhou,
Agrim Gupta,
Akhil Raju,
Antoine Laurens,
Claudio Fantacci,
Valentin Dalibard,
Martina Zambelli,
Murilo Martins,
Rugile Pevceviciute,
Michiel Blokzijl,
Misha Denil,
Nathan Batchelor,
Thomas Lampe,
Emilio Parisotto,
Konrad Żołna,
Scott Reed,
Sergio Gómez Colmenarejo,
Jon Scholz
, et al. (14 additional authors not shown)
Abstract:
The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.
Submitted 22 December, 2023; v1 submitted 20 June, 2023;
originally announced June 2023.
-
$\pi2\text{vec}$: Policy Representations with Successor Features
Authors:
Gianluca Scarpellini,
Ksenia Konyushkova,
Claudio Fantacci,
Tom Le Paine,
Yutian Chen,
Misha Denil
Abstract:
This paper describes $\pi2\text{vec}$, a method for representing behaviors of black box policies as feature vectors. The policy representations capture how the statistics of foundation model features change in response to the policy behavior in a task agnostic way, and can be trained from offline data, allowing them to be used in offline policy selection. This work provides a key piece of a recipe for fusing together three modern lines of research: Offline policy evaluation as a counterpart to offline RL, foundation models as generic and powerful state representations, and efficient policy selection in resource constrained environments.
Submitted 24 January, 2024; v1 submitted 16 June, 2023;
originally announced June 2023.
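The successor-feature idea behind the policy representations in the abstract above can be illustrated with a small sketch. This is not the paper's training procedure: it swaps in a plain Monte Carlo estimate (a discounted sum of feature vectors along recorded trajectories), and the toy feature vectors, the function names `successor_features` and `policy_embedding`, and the discount `gamma` are all assumptions made for illustration.

```python
def successor_features(feature_trajectory, gamma=0.9):
    """Monte Carlo successor features for one trajectory.

    feature_trajectory: list of feature vectors phi(s_t), one per time step.
    Returns the discounted sum  sum_t gamma**t * phi(s_t).
    """
    dim = len(feature_trajectory[0])
    psi = [0.0] * dim
    discount = 1.0
    for phi in feature_trajectory:
        for k in range(dim):
            psi[k] += discount * phi[k]
        discount *= gamma
    return psi


def policy_embedding(trajectories, gamma=0.9):
    """Average per-trajectory successor features into one vector per policy."""
    per_trajectory = [successor_features(t, gamma) for t in trajectories]
    dim = len(per_trajectory[0])
    count = len(per_trajectory)
    return [sum(p[k] for p in per_trajectory) / count for k in range(dim)]


# Toy 2-D features standing in for foundation-model features (hypothetical data).
rollouts = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],
]
embedding = policy_embedding(rollouts, gamma=0.5)
```

Two policies can then be compared by the distance between their embedding vectors, which is one simple way such a representation could feed offline policy selection.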
-
Lossless Adaptation of Pretrained Vision Models For Robotic Manipulation
Authors:
Mohit Sharma,
Claudio Fantacci,
Yuxiang Zhou,
Skanda Koppula,
Nicolas Heess,
Jon Scholz,
Yusuf Aytar
Abstract:
Recent works have shown that large models pretrained on common visual learning tasks can provide useful representations for a wide range of specialized perception problems, as well as a variety of robotic manipulation tasks. While prior work on robotic manipulation has predominantly used frozen pretrained features, we demonstrate that in robotics this approach can fail to reach optimal performance, and that fine-tuning of the full model can lead to significantly better results. Unfortunately, fine-tuning disrupts the pretrained visual representation, and causes representational drift towards the fine-tuned task thus leading to a loss of the versatility of the original model. We introduce "lossless adaptation" to address this shortcoming of classical fine-tuning. We demonstrate that appropriate placement of our parameter efficient adapters can significantly reduce the performance gap between frozen pretrained representations and full end-to-end fine-tuning without changes to the original representation and thus preserving original capabilities of the pretrained model. We perform a comprehensive investigation across three major model architectures (ViTs, NFNets, and ResNets), supervised (ImageNet-1K classification) and self-supervised pretrained weights (CLIP, BYOL, Visual MAE) in 3 task domains and 35 individual tasks, and demonstrate that our claims are strongly validated in various settings.
Submitted 13 April, 2023;
originally announced April 2023.
-
SkillS: Adaptive Skill Sequencing for Efficient Temporally-Extended Exploration
Authors:
Giulia Vezzani,
Dhruva Tirumala,
Markus Wulfmeier,
Dushyant Rao,
Abbas Abdolmaleki,
Ben Moran,
Tuomas Haarnoja,
Jan Humplik,
Roland Hafner,
Michael Neunert,
Claudio Fantacci,
Tim Hertweck,
Thomas Lampe,
Fereshteh Sadeghi,
Nicolas Heess,
Martin Riedmiller
Abstract:
The ability to effectively reuse prior knowledge is a key requirement when building general and flexible Reinforcement Learning (RL) agents. Skill reuse is one of the most common approaches, but current methods have considerable limitations. For example, fine-tuning an existing policy frequently fails, as the policy can degrade rapidly early in training. In a similar vein, distillation of expert behavior can lead to poor results when given sub-optimal experts. We compare several common approaches for skill transfer on multiple domains including changes in task and system dynamics. We identify how existing methods can fail and introduce an alternative approach to mitigate these problems. Our approach learns to sequence existing temporally-extended skills for exploration but learns the final policy directly from the raw experience. This conceptual split enables rapid adaptation and thus efficient data collection but without constraining the final solution. It significantly outperforms many classical methods across a suite of evaluation tasks and we use a broad set of ablations to highlight the importance of different components of our method.
Submitted 11 January, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Beyond Pick-and-Place: Tackling Robotic Stacking of Diverse Shapes
Authors:
Alex X. Lee,
Coline Devin,
Yuxiang Zhou,
Thomas Lampe,
Konstantinos Bousmalis,
Jost Tobias Springenberg,
Arunkumar Byravan,
Abbas Abdolmaleki,
Nimrod Gileadi,
David Khosid,
Claudio Fantacci,
Jose Enrique Chen,
Akhil Raju,
Rae Jeong,
Michael Neunert,
Antoine Laurens,
Stefano Saliceti,
Federico Casarini,
Martin Riedmiller,
Raia Hadsell,
Francesco Nori
Abstract:
We study the problem of robotic stacking with objects of complex geometry. We propose a challenging and diverse set of such objects that was carefully designed to require strategies beyond a simple "pick-and-place" solution. Our method is a reinforcement learning (RL) approach combined with vision-based interactive policy distillation and simulation-to-reality transfer. Our learned policies can efficiently handle multiple object combinations in the real world and exhibit a large variety of stacking skills. In a large experimental study, we investigate what choices matter for learning such general vision-based agents in simulation, and what affects optimal transfer to the real robot. We then leverage data collected by such policies and improve upon them with offline RL. A video and a blog post of our work are provided as supplementary material.
Submitted 3 November, 2021; v1 submitted 12 October, 2021;
originally announced October 2021.
-
In Situ Translational Hand-Eye Calibration of Laser Profile Sensors using Arbitrary Objects
Authors:
Prajval Kumar Murali,
Ines Sorrentino,
Angelo Rendiniello,
Claudio Fantacci,
Enrico Villagrossi,
Andrea Polo,
Alessandro Ardesi,
Marco Maggiali,
Lorenzo Natale,
Daniele Pucci,
Silvio Traversaro
Abstract:
Hand-eye calibration of laser profile sensors is the process of extracting the homogeneous transformation between the laser profile sensor frame and the end-effector frame of a robot in order to express the data extracted by the sensor in the robot's global coordinate system. For laser profile scanners this is a challenging procedure, as they provide data only in two dimensions and state-of-the-art calibration procedures require the use of specialised calibration targets. This paper presents a novel method to extract the translation-part of the hand-eye calibration matrix with rotation-part known a priori in a target-agnostic way. Our methodology is applicable to any 2D image or 3D object as a calibration target and can also be performed in situ in the final application. The method is experimentally validated on a real robot-sensor setup with 2D and 3D targets.
Submitted 22 March, 2021;
originally announced March 2021.
-
Markerless visual servoing on unknown objects for humanoid robot platforms
Authors:
Claudio Fantacci,
Giulia Vezzani,
Ugo Pattacini,
Vadim Tikhanoff,
Lorenzo Natale
Abstract:
To precisely reach for an object with a humanoid robot, it is of central importance to have good knowledge of the end-effector pose as well as the object's pose and shape. In this work we propose a framework for markerless visual servoing on unknown objects, which is divided into four main parts: I) a least-squares minimization problem is formulated to find the volume of the object graspable by the robot's hand using its stereo vision; II) a recursive Bayesian filtering technique, based on Sequential Monte Carlo (SMC) filtering, estimates the 6D pose (position and orientation) of the robot's end-effector without the use of markers; III) a nonlinear constrained optimization problem is formulated to compute the desired graspable pose about the object; IV) an image-based visual servo control commands the robot's end-effector toward the desired pose. We demonstrate the effectiveness and robustness of our approach with extensive experiments on the iCub humanoid robot platform, achieving real-time computation, smooth trajectories and sub-pixel precision.
Submitted 12 October, 2017;
originally announced October 2017.
-
Visual end-effector tracking using a 3D model-aided particle filter for humanoid robot platforms
Authors:
Claudio Fantacci,
Ugo Pattacini,
Vadim Tikhanoff,
Lorenzo Natale
Abstract:
This paper addresses recursive markerless estimation of a robot's end-effector using visual observations from its cameras. The problem is formulated in the Bayesian framework and addressed using Sequential Monte Carlo (SMC) filtering. We use a 3D rendering engine and Computer Aided Design (CAD) schematics of the robot to virtually create images from the robot's camera viewpoints. These images are then used to extract information and estimate the pose of the end-effector. To this end, we developed a particle filter for estimating the position and orientation of the robot's end-effector using Histogram of Oriented Gradients (HOG) descriptors to capture robust characteristic features of shapes in both camera and rendered images. We implemented the algorithm on the iCub humanoid robot and employed it in a closed-loop reaching scenario. We demonstrate that the tracking is robust to clutter and allows compensating for errors in the robot kinematics and servoing the arm in closed loop using vision.
Submitted 4 August, 2017; v1 submitted 14 March, 2017;
originally announced March 2017.
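The Sequential Monte Carlo recursion mentioned in the abstract above follows a predict, update, resample cycle, which can be sketched in a few lines. This is a generic bootstrap particle filter on a one-dimensional toy state, not the paper's tracker: the Gaussian likelihood stands in for the HOG comparison between camera and rendered images, the random-walk motion model replaces the 6D pose dynamics, and every name and noise level below is an assumption chosen for illustration.

```python
import math
import random


def gaussian_likelihood(z, x, sigma):
    """Unnormalised Gaussian likelihood of measurement z given state x."""
    return math.exp(-0.5 * ((z - x) / sigma) ** 2)


def particle_filter_step(particles, z, rng, process_sigma=0.1, meas_sigma=0.5):
    """One predict / update / resample cycle of a bootstrap particle filter."""
    # Predict: propagate each particle through a random-walk motion model.
    predicted = [p + rng.gauss(0.0, process_sigma) for p in particles]
    # Update: weight each particle by the measurement likelihood.
    weights = [gaussian_likelihood(z, p, meas_sigma) for p in predicted]
    total = sum(weights)
    if total == 0.0:
        # All likelihoods underflowed; fall back to uniform weights.
        weights = [1.0 / len(predicted)] * len(predicted)
    else:
        weights = [w / total for w in weights]
    # Estimate: weighted mean of the particle set.
    estimate = sum(w * p for w, p in zip(weights, predicted))
    # Resample: multinomial resampling in proportion to the weights.
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, estimate


def run_demo(true_state=2.0, steps=30, n_particles=500, seed=0):
    """Track a static scalar state from noisy measurements."""
    rng = random.Random(seed)
    particles = [rng.uniform(-5.0, 5.0) for _ in range(n_particles)]
    estimate = None
    for _ in range(steps):
        z = true_state + rng.gauss(0.0, 0.5)  # simulated noisy measurement
        particles, estimate = particle_filter_step(particles, z, rng)
    return estimate
```

In a vision-based tracker the likelihood function is where the image comparison would plug in; the rest of the cycle stays the same.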
-
An Overview of Particle Methods for Random Finite Set Models
Authors:
Branko Ristic,
Michael Beard,
Claudio Fantacci
Abstract:
This overview paper describes the particle methods developed for the implementation of a class of Bayes filters formulated using the random finite set formalism. It is primarily intended for readers already familiar with particle methods in the context of the standard Bayes filter. The focus is on the Bernoulli particle filter, the probability hypothesis density (PHD) particle filter and the generalised labelled multi-Bernoulli (GLMB) particle filter. The performance of the described filters is demonstrated in the context of a bearings-only target tracking application.
Submitted 11 February, 2016;
originally announced February 2016.
-
Distributed multi-object tracking over sensor networks: a random finite set approach
Authors:
Claudio Fantacci
Abstract:
The aim of the present dissertation is to address distributed tracking over a network of heterogeneous and geographically dispersed nodes (or agents) with sensing, communication and processing capabilities. Tracking is carried out in the Bayesian framework and its extension to a distributed context is made possible via an information-theoretic approach to data fusion which exploits consensus algorithms and the notion of Kullback-Leibler Average (KLA) of the Probability Density Functions (PDFs) to be fused. The first step toward distributed tracking considers a single moving object. Consensus takes place in each agent for spreading information over the network so that each node can track the object. To achieve such a goal, consensus is carried out on the local single-object posterior distribution, which is the result of local data processing, in the Bayesian setting, exploiting the last available measurement about the object. The next step is in the direction of distributed estimation of multiple moving objects. In order to model, in a rigorous and elegant way, a possibly time-varying number of objects present in a given area of interest, the Random Finite Set (RFS) formulation is adopted since it provides the notion of probability density for multi-object states that allows existing tools in distributed estimation to be extended directly to multi-object tracking. The last theoretical step of the present dissertation is toward distributed filtering with the further requirement of unique object identities. To this end the labeled RFS framework is adopted as it provides a tractable approach to the multi-object Bayesian recursion. A generalization of the KLA to the labeled RFS framework enables the development of novel consensus multi-object tracking filters which are fully distributed, scalable and computationally efficient.
Submitted 12 July, 2015;
originally announced August 2015.
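The consensus-on-KLA fusion described in the abstract above can be illustrated for the simplest case, scalar Gaussian posteriors. For Gaussians the Kullback-Leibler Average is a normalised weighted geometric mean, which reduces to a weighted average of the inverse variances and of the precision-weighted means. The sketch below assumes a 4-node ring network with uniform consensus weights (chosen so that the iteration converges to the network-wide average); the function names, the topology and the numbers are hypothetical and are not taken from the dissertation.

```python
def kla_fuse(gaussians, weights):
    """Weighted KLA of scalar Gaussians given as (mean, variance) pairs."""
    assert abs(sum(weights) - 1.0) < 1e-9
    # Work in information form: precision and precision-weighted mean.
    omega = sum(w / var for w, (_, var) in zip(weights, gaussians))
    q = sum(w * mean / var for w, (mean, var) in zip(weights, gaussians))
    return q / omega, 1.0 / omega


def consensus_kla(local, neighbours, iterations=20):
    """Each node repeatedly fuses its estimate with those of its neighbours."""
    estimates = list(local)
    for _ in range(iterations):
        updated = []
        for i, own in enumerate(estimates):
            group = [own] + [estimates[j] for j in neighbours[i]]
            uniform = [1.0 / len(group)] * len(group)
            updated.append(kla_fuse(group, uniform))
        estimates = updated
    return estimates


# Local posteriors (mean, variance) at four nodes connected in a ring.
local_posteriors = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5), (3.0, 1.0)]
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}

centralised = kla_fuse(local_posteriors, [0.25] * 4)
distributed = consensus_kla(local_posteriors, ring, iterations=30)
```

After enough iterations every entry of `distributed` approaches `centralised`, so each node ends up holding the network-wide KLA using only exchanges with its neighbours.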
-
Consensus Labeled Random Finite Set Filtering for Distributed Multi-Object Tracking
Authors:
C. Fantacci,
B. -N. Vo,
B. -T. Vo,
G. Battistelli,
L. Chisci
Abstract:
This paper addresses distributed multi-object tracking over a network of heterogeneous and geographically dispersed nodes with sensing, communication and processing capabilities. The main contribution is an approach to distributed multi-object estimation based on labeled Random Finite Sets (RFSs) and dynamic Bayesian inference, which enables the development of two novel consensus tracking filters, namely a Consensus Marginalized $δ$-Generalized Labeled Multi-Bernoulli and Consensus Labeled Multi-Bernoulli tracking filter. The proposed algorithms provide fully distributed, scalable and computationally efficient solutions for multi-object tracking. Simulation experiments via Gaussian mixture implementations confirm the effectiveness of the proposed approach on challenging scenarios.
Submitted 9 June, 2016; v1 submitted 7 January, 2015;
originally announced January 2015.
-
The Marginalized $δ$-GLMB Filter
Authors:
C. Fantacci,
B. -T. Vo,
F. Papi,
B. -N. Vo
Abstract:
The multi-target Bayes filter proposed by Mahler is a principled solution to recursive Bayesian tracking based on RFS or FISST. The $δ$-GLMB filter is an exact closed form solution to the multi-target Bayes recursion which yields joint state and label or trajectory estimates in the presence of clutter, missed detections and association uncertainty. Due to the presence of explicit data associations in the $δ$-GLMB filter, the number of components in the posterior grows without bound in time. In this work we propose an efficient approximation to the $δ$-GLMB filter which preserves both the PHD and cardinality distribution of the labeled posterior. This approximation also facilitates efficient multi-sensor tracking with detection-based measurements. Simulation results are presented to verify the proposed approach.
Submitted 6 April, 2017; v1 submitted 5 January, 2015;
originally announced January 2015.
-
Generalized Labeled Multi-Bernoulli Approximation of Multi-Object Densities
Authors:
Francesco Papi,
Ba-Ngu Vo,
Ba-Tuong Vo,
Claudio Fantacci,
Michael Beard
Abstract:
In multi-object inference, the multi-object probability density captures the uncertainty in the number and the states of the objects as well as the statistical dependence between the objects. Exact computation of the multi-object density is generally intractable and tractable implementations usually require statistical independence assumptions between objects. In this paper we propose a tractable multi-object density approximation that can capture statistical dependence between objects. In particular, we derive a tractable Generalized Labeled Multi-Bernoulli (GLMB) density that matches the cardinality distribution and the first moment of the labeled multi-object distribution of interest. It is also shown that the proposed approximation minimizes the Kullback-Leibler divergence over a special tractable class of GLMB densities. Based on the proposed GLMB approximation we further demonstrate a tractable multi-object tracking algorithm for generic measurement models. Simulation results for a multi-object Track-Before-Detect example using radar measurements in low signal-to-noise ratio (SNR) scenarios verify the applicability of the proposed approach.
Submitted 6 July, 2015; v1 submitted 17 December, 2014;
originally announced December 2014.