-
Gemini Robotics: Bringing AI into the Physical World
Authors:
Gemini Robotics Team,
Saminda Abeyruwan,
Joshua Ainslie,
Jean-Baptiste Alayrac,
Montserrat Gonzalez Arenas,
Travis Armstrong,
Ashwin Balakrishna,
Robert Baruch,
Maria Bauza,
Michiel Blokzijl,
Steven Bohez,
Konstantinos Bousmalis,
Anthony Brohan,
Thomas Buschmann,
Arunkumar Byravan,
Serkan Cabi,
Ken Caluwaerts,
Federico Casarini,
Oscar Chang,
Jose Enrique Chen,
Xi Chen,
Hao-Tien Lewis Chiang,
Krzysztof Choromanski,
David D'Ambrosio,
Sudeep Dasari
, et al. (93 additional authors not shown)
Abstract:
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Lang…
▽ More
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realizes AI's potential in the physical world.
△ Less
Submitted 25 March, 2025;
originally announced March 2025.
-
Achieving Human Level Competitive Robot Table Tennis
Authors:
David B. D'Ambrosio,
Saminda Abeyruwan,
Laura Graesser,
Atil Iscen,
Heni Ben Amor,
Alex Bewley,
Barney J. Reed,
Krista Reymann,
Leila Takayama,
Yuval Tassa,
Krzysztof Choromanski,
Erwin Coumans,
Deepali Jain,
Navdeep Jaitly,
Natasha Jaques,
Satoshi Kataoka,
Yuheng Kuang,
Nevena Lazic,
Reza Mahjourian,
Sherry Moore,
Kenneth Oslund,
Anish Shankar,
Vikas Sindhwani,
Vincent Vanhoucke,
Grace Vesom
, et al. (2 additional authors not shown)
Abstract:
Achieving human-level speed and performance on real world tasks is a north star for the robotics research community. This work takes a step towards that goal and presents the first learned robot agent that reaches amateur human-level performance in competitive table tennis. Table tennis is a physically demanding sport which requires human players to undergo years of training to achieve an advanced…
▽ More
Achieving human-level speed and performance on real world tasks is a north star for the robotics research community. This work takes a step towards that goal and presents the first learned robot agent that reaches amateur human-level performance in competitive table tennis. Table tennis is a physically demanding sport which requires human players to undergo years of training to achieve an advanced level of proficiency. In this paper, we contribute (1) a hierarchical and modular policy architecture consisting of (i) low level controllers with their detailed skill descriptors which model the agent's capabilities and help to bridge the sim-to-real gap and (ii) a high level controller that chooses the low level skills, (2) techniques for enabling zero-shot sim-to-real including an iterative approach to defining the task distribution that is grounded in the real-world and defines an automatic curriculum, and (3) real time adaptation to unseen opponents. Policy performance was assessed through 29 robot vs. human matches of which the robot won 45% (13/29). All humans were unseen players and their skill level varied from beginner to tournament level. Whilst the robot lost all matches vs. the most advanced players it won 100% matches vs. beginners and 55% matches vs. intermediate players, demonstrating solidly amateur human-level performance. Videos of the matches can be viewed at https://sites.google.com/view/competitive-robot-table-tennis
△ Less
Submitted 1 May, 2025; v1 submitted 7 August, 2024;
originally announced August 2024.
-
Learning to Learn Faster from Human Feedback with Language Model Predictive Control
Authors:
Jacky Liang,
Fei Xia,
Wenhao Yu,
Andy Zeng,
Montserrat Gonzalez Arenas,
Maria Attarian,
Maria Bauza,
Matthew Bennice,
Alex Bewley,
Adil Dostmohamed,
Chuyuan Kelly Fu,
Nimrod Gileadi,
Marissa Giustina,
Keerthana Gopalakrishnan,
Leonard Hasenclever,
Jan Humplik,
Jasmine Hsu,
Nikhil Joshi,
Ben Jyenis,
Chase Kew,
Sean Kirmani,
Tsang-Wei Edward Lee,
Kuang-Huei Lee,
Assaf Hurwitz Michaely,
Joss Moore
, et al. (25 additional authors not shown)
Abstract:
Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for o…
▽ More
Large language models (LLMs) have been shown to exhibit a wide range of capabilities, such as writing robot code from language commands -- enabling non-experts to direct robot behaviors, modify them based on feedback, or compose them to perform new tasks. However, these capabilities (driven by in-context learning) are limited to short-term interactions, where users' feedback remains relevant for only as long as it fits within the context size of the LLM, and can be forgotten over longer interactions. In this work, we investigate fine-tuning the robot code-writing LLMs, to remember their in-context interactions and improve their teachability i.e., how efficiently they adapt to human inputs (measured by average number of corrections before the user considers the task successful). Our key observation is that when human-robot interactions are viewed as a partially observable Markov decision process (in which human language inputs are observations, and robot code outputs are actions), then training an LLM to complete previous interactions is training a transition dynamics model -- that can be combined with classic robotics techniques such as model predictive control (MPC) to discover shorter paths to success. This gives rise to Language Model Predictive Control (LMPC), a framework that fine-tunes PaLM 2 to improve its teachability on 78 tasks across 5 robot embodiments -- improving non-expert teaching success rates of unseen tasks by 26.9% while reducing the average number of human corrections from 2.4 to 1.9. Experiments show that LMPC also produces strong meta-learners, improving the success rate of in-context learning new tasks on unseen robot embodiments and APIs by 31.5%. See videos, code, and demos at: https://robot-teaching.github.io/.
△ Less
Submitted 31 May, 2024; v1 submitted 17 February, 2024;
originally announced February 2024.
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Authors:
Open X-Embodiment Collaboration,
Abby O'Neill,
Abdul Rehman,
Abhinav Gupta,
Abhiram Maddukuri,
Abhishek Gupta,
Abhishek Padalkar,
Abraham Lee,
Acorn Pooley,
Agrim Gupta,
Ajay Mandlekar,
Ajinkya Jain,
Albert Tung,
Alex Bewley,
Alex Herzog,
Alex Irpan,
Alexander Khazatsky,
Anant Rai,
Anchit Gupta,
Andrew Wang,
Andrey Kolobov,
Anikait Singh,
Animesh Garg,
Aniruddha Kembhavi,
Annie Xie
, et al. (269 additional authors not shown)
Abstract:
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…
▽ More
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
△ Less
Submitted 14 May, 2025; v1 submitted 13 October, 2023;
originally announced October 2023.
-
Robotic Table Tennis: A Case Study into a High Speed Learning System
Authors:
David B. D'Ambrosio,
Jonathan Abelian,
Saminda Abeyruwan,
Michael Ahn,
Alex Bewley,
Justin Boyd,
Krzysztof Choromanski,
Omar Cortes,
Erwin Coumans,
Tianli Ding,
Wenbo Gao,
Laura Graesser,
Atil Iscen,
Navdeep Jaitly,
Deepali Jain,
Juhana Kangaspunta,
Satoshi Kataoka,
Gus Kouretas,
Yuheng Kuang,
Nevena Lazic,
Corey Lynch,
Reza Mahjourian,
Sherry Q. Moore,
Thinh Nguyen,
Ken Oslund
, et al. (10 additional authors not shown)
Abstract:
We present a deep-dive into a real-world robotic learning system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real w…
▽ More
We present a deep-dive into a real-world robotic learning system that, in previous work, was shown to be capable of hundreds of table tennis rallies with a human and has the ability to precisely return the ball to desired targets. This system puts together a highly optimized perception subsystem, a high-speed low-latency robot controller, a simulation paradigm that can prevent damage in the real world and also train policies for zero-shot transfer, and automated real world environment resets that enable autonomous training and evaluation on physical robots. We complement a complete system description, including numerous design decisions that are typically not widely disseminated, with a collection of studies that clarify the importance of mitigating various sources of latency, accounting for training and deployment distribution shifts, robustness of the perception system, sensitivity to policy hyper-parameters, and choice of action space. A video demonstrating the components of the system and details of experimental results can be found at https://youtu.be/uFcnWjB42I0.
△ Less
Submitted 19 February, 2025; v1 submitted 6 September, 2023;
originally announced September 2023.
-
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Authors:
Anthony Brohan,
Noah Brown,
Justice Carbajal,
Yevgen Chebotar,
Xi Chen,
Krzysztof Choromanski,
Tianli Ding,
Danny Driess,
Avinava Dubey,
Chelsea Finn,
Pete Florence,
Chuyuan Fu,
Montse Gonzalez Arenas,
Keerthana Gopalakrishnan,
Kehang Han,
Karol Hausman,
Alexander Herzog,
Jasmine Hsu,
Brian Ichter,
Alex Irpan,
Nikhil Joshi,
Ryan Julian,
Dmitry Kalashnikov,
Yuheng Kuang,
Isabel Leal
, et al. (29 additional authors not shown)
Abstract:
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web.…
▽ More
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control to boost generalization and enable emergent semantic reasoning. Our goal is to enable a single end-to-end trained model to both learn to map robot observations to actions and enjoy the benefits of large-scale pretraining on language and vision-language data from the web. To this end, we propose to co-fine-tune state-of-the-art vision-language models on both robotic trajectory data and Internet-scale vision-language tasks, such as visual question answering. In contrast to other approaches, we propose a simple, general recipe to achieve this goal: in order to fit both natural language responses and robotic actions into the same format, we express the actions as text tokens and incorporate them directly into the training set of the model in the same way as natural language tokens. We refer to such category of models as vision-language-action models (VLA) and instantiate an example of such a model, which we call RT-2. Our extensive evaluation (6k evaluation trials) shows that our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training. This includes significantly improved generalization to novel objects, the ability to interpret commands not present in the robot training data (such as placing an object onto a particular number or icon), and the ability to perform rudimentary reasoning in response to user commands (such as picking up the smallest or largest object, or the one closest to another object). We further show that incorporating chain of thought reasoning allows RT-2 to perform multi-stage semantic reasoning, for example figuring out which object to pick up for use as an improvised hammer (a rock), or which type of drink is best suited for someone who is tired (an energy drink).
△ Less
Submitted 28 July, 2023;
originally announced July 2023.
-
Barkour: Benchmarking Animal-level Agility with Quadruped Robots
Authors:
Ken Caluwaerts,
Atil Iscen,
J. Chase Kew,
Wenhao Yu,
Tingnan Zhang,
Daniel Freeman,
Kuang-Huei Lee,
Lisa Lee,
Stefano Saliceti,
Vincent Zhuang,
Nathan Batchelor,
Steven Bohez,
Federico Casarini,
Jose Enrique Chen,
Omar Cortes,
Erwin Coumans,
Adil Dostmohamed,
Gabriel Dulac-Arnold,
Alejandro Escontrela,
Erik Frey,
Roland Hafner,
Deepali Jain,
Bauyrjan Jyenis,
Yuheng Kuang,
Edward Lee
, et al. (19 additional authors not shown)
Abstract:
Animals have evolved various agile locomotion strategies, such as sprinting, leaping, and jumping. There is a growing interest in developing legged robots that move like their biological counterparts and show various agile skills to navigate complex environments quickly. Despite the interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agili…
▽ More
Animals have evolved various agile locomotion strategies, such as sprinting, leaping, and jumping. There is a growing interest in developing legged robots that move like their biological counterparts and show various agile skills to navigate complex environments quickly. Despite the interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agility. We introduce the Barkour benchmark, an obstacle course to quantify agility for legged robots. Inspired by dog agility competitions, it consists of diverse obstacles and a time based scoring mechanism. This encourages researchers to develop controllers that not only move fast, but do so in a controllable and versatile way. To set strong baselines, we present two methods for tackling the benchmark. In the first approach, we train specialist locomotion skills using on-policy reinforcement learning methods and combine them with a high-level navigation controller. In the second approach, we distill the specialist skills into a Transformer-based generalist locomotion policy, named Locomotion-Transformer, that can handle various terrains and adjust the robot's gait based on the perceived environment and robot states. Using a custom-built quadruped robot, we demonstrate that our method can complete the course at half the speed of a dog. We hope that our work represents a step towards creating controllers that enable robots to reach animal-level agility.
△ Less
Submitted 23 May, 2023;
originally announced May 2023.
-
PaLM-E: An Embodied Multimodal Language Model
Authors:
Danny Driess,
Fei Xia,
Mehdi S. M. Sajjadi,
Corey Lynch,
Aakanksha Chowdhery,
Brian Ichter,
Ayzaan Wahid,
Jonathan Tompson,
Quan Vuong,
Tianhe Yu,
Wenlong Huang,
Yevgen Chebotar,
Pierre Sermanet,
Daniel Duckworth,
Sergey Levine,
Vincent Vanhoucke,
Karol Hausman,
Marc Toussaint,
Klaus Greff,
Andy Zeng,
Igor Mordatch,
Pete Florence
Abstract:
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model ar…
▽ More
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Input to our embodied language model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks including sequential robotic manipulation planning, visual question answering, and captioning. Our evaluations show that PaLM-E, a single large embodied multimodal model, can address a variety of embodied reasoning tasks, from a variety of observation modalities, on multiple embodiments, and further, exhibits positive transfer: the model benefits from diverse joint training across internet-scale language, vision, and visual-language domains. Our largest model, PaLM-E-562B with 562B parameters, in addition to being trained on robotics tasks, is a visual-language generalist with state-of-the-art performance on OK-VQA, and retains generalist language capabilities with increasing scale.
△ Less
Submitted 6 March, 2023;
originally announced March 2023.
-
RT-1: Robotics Transformer for Real-World Control at Scale
Authors:
Anthony Brohan,
Noah Brown,
Justice Carbajal,
Yevgen Chebotar,
Joseph Dabis,
Chelsea Finn,
Keerthana Gopalakrishnan,
Karol Hausman,
Alex Herzog,
Jasmine Hsu,
Julian Ibarz,
Brian Ichter,
Alex Irpan,
Tomas Jackson,
Sally Jesmonth,
Nikhil J Joshi,
Ryan Julian,
Dmitry Kalashnikov,
Yuheng Kuang,
Isabel Leal,
Kuang-Huei Lee,
Sergey Levine,
Yao Lu,
Utsav Malla,
Deeksha Manjunath
, et al. (26 additional authors not shown)
Abstract:
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, wher…
▽ More
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io
△ Less
Submitted 11 August, 2023; v1 submitted 13 December, 2022;
originally announced December 2022.
-
Google Scanned Objects: A High-Quality Dataset of 3D Scanned Household Items
Authors:
Laura Downs,
Anthony Francis,
Nate Koenig,
Brandon Kinman,
Ryan Hickman,
Krista Reymann,
Thomas B. McHugh,
Vincent Vanhoucke
Abstract:
Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but simulating the broad diversity of environments needed for deep learning requires large corpora of photo-realistic 3D object models. To address this need, we present Google Scanned Objects, an open-source collection of over one thousand 3D-scanned household items released under a Creative Commons license; the…
▽ More
Interactive 3D simulations have enabled breakthroughs in robotics and computer vision, but simulating the broad diversity of environments needed for deep learning requires large corpora of photo-realistic 3D object models. To address this need, we present Google Scanned Objects, an open-source collection of over one thousand 3D-scanned household items released under a Creative Commons license; these models are preprocessed for use in Ignition Gazebo and the Bullet simulation platforms, but are easily adaptable to other simulators. We describe our object scanning and curation pipeline, then provide statistics about the contents of the dataset and its usage. We hope that the diversity, quality, and flexibility of Google Scanned Objects will lead to advances in interactive simulation, synthetic perception, and robotic learning.
△ Less
Submitted 25 April, 2022;
originally announced April 2022.
-
Learning to Fold Real Garments with One Arm: A Case Study in Cloud-Based Robotics Research
Authors:
Ryan Hoque,
Kaushik Shivakumar,
Shrey Aeron,
Gabriel Deza,
Aditya Ganapathi,
Adrian Wong,
Johnny Lee,
Andy Zeng,
Vincent Vanhoucke,
Ken Goldberg
Abstract:
Autonomous fabric manipulation is a longstanding challenge in robotics, but evaluating progress is difficult due to the cost and diversity of robot hardware. Using Reach, a cloud robotics platform that enables low-latency remote execution of control policies on physical robots, we present the first systematic benchmarking of fabric manipulation algorithms on physical hardware. We develop 4 novel l…
▽ More
Autonomous fabric manipulation is a longstanding challenge in robotics, but evaluating progress is difficult due to the cost and diversity of robot hardware. Using Reach, a cloud robotics platform that enables low-latency remote execution of control policies on physical robots, we present the first systematic benchmarking of fabric manipulation algorithms on physical hardware. We develop 4 novel learning-based algorithms that model expert actions, keypoints, reward functions, and dynamic motions, and we compare these against 4 learning-free and inverse dynamics algorithms on the task of folding a crumpled T-shirt with a single robot arm. The entire lifecycle of data collection, model training, and policy evaluation is performed remotely without physical access to the robot workcell. Results suggest a new algorithm combining imitation learning with analytic methods achieves 84% of human-level performance on the folding task. See https://sites.google.com/berkeley.edu/cloudfolding for all data, code, models, and supplemental material.
△ Less
Submitted 21 April, 2022;
originally announced April 2022.
-
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Authors:
Michael Ahn,
Anthony Brohan,
Noah Brown,
Yevgen Chebotar,
Omar Cortes,
Byron David,
Chelsea Finn,
Chuyuan Fu,
Keerthana Gopalakrishnan,
Karol Hausman,
Alex Herzog,
Daniel Ho,
Jasmine Hsu,
Julian Ibarz,
Brian Ichter,
Alex Irpan,
Eric Jang,
Rosario Jauregui Ruano,
Kyle Jeffrey,
Sally Jesmonth,
Nikhil J Joshi,
Ryan Julian,
Dmitry Kalashnikov,
Yuheng Kuang,
Kuang-Huei Lee
, et al. (20 additional authors not shown)
Abstract:
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embo…
▽ More
Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.
△ Less
Submitted 16 August, 2022; v1 submitted 4 April, 2022;
originally announced April 2022.
-
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Authors:
Andy Zeng,
Maria Attarian,
Brian Ichter,
Krzysztof Choromanski,
Adrian Wong,
Stefan Welker,
Federico Tombari,
Aveek Purohit,
Michael Ryoo,
Vikas Sindhwani,
Johnny Lee,
Vincent Vanhoucke,
Pete Florence
Abstract:
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT quest…
▽ More
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through Socratic Models (SMs): a modular framework in which multiple pretrained models may be composed zero-shot i.e., via multimodal-informed prompting, to exchange information with each other and capture new multimodal capabilities, without requiring finetuning. With minimal engineering, SMs are not only competitive with state-of-the-art zero-shot image captioning and video-to-text retrieval, but also enable new applications such as (i) answering free-form questions about egocentric video, (ii) engaging in multimodal assistive dialogue with people (e.g., for cooking recipes) by interfacing with external APIs and databases (e.g., web search), and (iii) robot perception and planning.
△ Less
Submitted 27 May, 2022; v1 submitted 1 April, 2022;
originally announced April 2022.
-
Mechanical Search on Shelves using Lateral Access X-RAY
Authors:
Huang Huang,
Marcus Dominguez-Kuhne,
Jeffrey Ichnowski,
Vishal Satish,
Michael Danielczuk,
Kate Sanders,
Andrew Lee,
Anelia Angelova,
Vincent Vanhoucke,
Ken Goldberg
Abstract:
Efficiently finding an occluded object with lateral access arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. We introduce LAX-RAY (Lateral Access maXimal Reduction of occupancY support Area), a system to automate the mechanical search for occluded objects on shelves. For such lateral access environments, LAX-RAY couples a perception pipeline predicting a target o…
▽ More
Efficiently finding an occluded object with lateral access arises in many contexts such as warehouses, retail, healthcare, shipping, and homes. We introduce LAX-RAY (Lateral Access maXimal Reduction of occupancY support Area), a system to automate the mechanical search for occluded objects on shelves. For such lateral access environments, LAX-RAY couples a perception pipeline predicting a target object occupancy support distribution with a mechanical search policy that sequentially selects occluding objects to push to the side to reveal the target as efficiently as possible. Within the context of extruded polygonal objects and a stationary target with a known aspect ratio, we explore three lateral access search policies: Distribution Area Reduction (DAR), Distribution Entropy Reduction (DER), and Distribution Entropy Reduction over Multiple Time Steps (DER-MT) utilizing the support distribution and prior information. We evaluate these policies using the First-Order Shelf Simulator (FOSS) in which we simulate 800 random shelf environments of varying difficulty, and in a physical shelf environment with a Fetch robot and an embedded PrimeSense RGBD Camera. Average simulation results of 87.3% success rate demonstrate better performance of DER-MT with 2 prediction steps. When deployed on the robot, results show a success rate of at least 80% for all policies, suggesting that LAX-RAY can efficiently reveal the target object in reality. Both results show significantly better performance of the three proposed policies compared to a baseline policy with uniform probability distribution assumption in non-trivial cases, showing the importance of distribution prediction. Code, videos, and supplementary material can be found at https://sites.google.com/berkeley.edu/lax-ray.
△ Less
Submitted 23 November, 2020;
originally announced November 2020.
-
Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization
Authors:
Peter Karkus,
Anelia Angelova,
Vincent Vanhoucke,
Rico Jonschkowski
Abstract:
Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localizatio…
▽ More
Mapping and localization, preferably from a small number of observations, are fundamental tasks in robotics. We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN). The DMN constructs a spatially structured view-embedding map and uses it for subsequent visual localization with a particle filter. Since the DMN architecture is end-to-end differentiable, we can jointly learn the map representation and localization using gradient descent. We apply the DMN to sparse visual localization, where a robot needs to localize in a new environment with respect to a small number of images from known viewpoints. We evaluate the DMN using simulated environments and a challenging real-world Street View dataset. We find that the DMN learns effective map representations for visual localization. The benefit of spatial structure increases with larger environments, more viewpoints for mapping, and when training data is scarce. Project website: http://sites.google.com/view/differentiable-mapping
△ Less
Submitted 19 May, 2020;
originally announced May 2020.
-
X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions
Authors:
Michael Danielczuk,
Anelia Angelova,
Vincent Vanhoucke,
Ken Goldberg
Abstract:
For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with vary…
▽ More
For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object. For mechanical search, we introduce X-Ray, an algorithm based on learned occupancy distributions. We train a neural network using a synthetic dataset of RGBD heap images labeled for a set of standard bounding box targets with varying aspect ratios. X-Ray minimizes support of the learned distribution as part of a mechanical search policy in both simulated and real environments. We benchmark these policies against two baseline policies on 1,000 heaps of 15 objects in simulation where the target object is partially or fully occluded. Results suggest that X-Ray is significantly more efficient, as it succeeds in extracting the target object 82% of the time, 15% more often than the best-performing baseline. Experiments on an ABB YuMi robot with 20 heaps of 25 household objects suggest that the learned policy transfers easily to a physical system, where it outperforms baseline policies by 15% in success rate with 17% fewer actions. Datasets, videos, and experiments are available at https://sites.google.com/berkeley.edu/x-ray.
△ Less
Submitted 10 October, 2020; v1 submitted 19 April, 2020;
originally announced April 2020.
-
Policies Modulating Trajectory Generators
Authors:
Atil Iscen,
Ken Caluwaerts,
Jie Tan,
Tingnan Zhang,
Erwin Coumans,
Vikas Sindhwani,
Vincent Vanhoucke
Abstract:
We propose an architecture for learning complex controllable behaviors by having simple Policies Modulate Trajectory Generators (PMTG), a powerful combination that can provide both memory and prior knowledge to the controller. The result is a flexible architecture that is applicable to a class of problems with periodic motion for which one has an insight into the class of trajectories that might l…
▽ More
We propose an architecture for learning complex controllable behaviors by having simple Policies Modulate Trajectory Generators (PMTG), a powerful combination that can provide both memory and prior knowledge to the controller. The result is a flexible architecture that is applicable to a class of problems with periodic motion for which one has an insight into the class of trajectories that might lead to a desired behavior. We illustrate the basics of our architecture using a synthetic control problem, then go on to learn speed-controlled locomotion for a quadrupedal robot by using Deep Reinforcement Learning and Evolutionary Strategies. We demonstrate that a simple linear policy, when paired with a parametric Trajectory Generator for quadrupedal gaits, can induce walking behaviors with controllable speed from 4-dimensional IMU observations alone, and can be learned in under 1000 rollouts. We also transfer these policies to a real robot and show locomotion with controllable forward velocity.
△ Less
Submitted 7 October, 2019;
originally announced October 2019.
-
Grasp2Vec: Learning Object Representations from Self-Supervised Grasping
Authors:
Eric Jang,
Coline Devin,
Vincent Vanhoucke,
Sergey Levine
Abstract:
Well structured visual representations can make robot learning faster and can improve generalization. In this paper, we study how we can acquire effective object-centric representations for robotic manipulation tasks without human labeling by using autonomous robot interaction with the environment. Such representation learning methods can benefit from continuous refinement of the representation as…
▽ More
Well structured visual representations can make robot learning faster and can improve generalization. In this paper, we study how we can acquire effective object-centric representations for robotic manipulation tasks without human labeling by using autonomous robot interaction with the environment. Such representation learning methods can benefit from continuous refinement of the representation as the robot collects more experience, allowing them to scale effectively without human intervention. Our representation learning approach is based on object persistence: when a robot removes an object from a scene, the representation of that scene should change according to the features of the object that was removed. We formulate an arithmetic relationship between feature vectors from this observation, and use it to learn a representation of scenes and objects that can then be used to identify object instances, localize them in the scene, and perform goal-directed grasping tasks where the robot must retrieve commanded objects from a bin. The same grasping procedure can also be used to automatically collect training data for our method, by recording images of scenes, grasping and removing an object, and recording the outcome. Our experiments demonstrate that this self-supervised approach for tasked grasping substantially outperforms direct reinforcement learning from images and prior representation learning methods.
△ Less
Submitted 19 November, 2018; v1 submitted 16 November, 2018;
originally announced November 2018.
-
QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation
Authors:
Dmitry Kalashnikov,
Alex Irpan,
Peter Pastor,
Julian Ibarz,
Alexander Herzog,
Eric Jang,
Deirdre Quillen,
Ethan Holly,
Mrinal Kalakrishnan,
Vincent Vanhoucke,
Sergey Levine
Abstract:
In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, where…
▽ More
In this paper, we study the problem of learning vision-based dynamic manipulation skills using a scalable reinforcement learning approach. We study this problem in the context of grasping, a longstanding challenge in robotic manipulation. In contrast to static learning behaviors that choose a grasp point and then execute the desired grasp, our method enables closed-loop vision-based control, whereby the robot continuously updates its grasp strategy based on the most recent observations to optimize long-horizon grasp success. To that end, we introduce QT-Opt, a scalable self-supervised vision-based reinforcement learning framework that can leverage over 580k real-world grasp attempts to train a deep neural network Q-function with over 1.2M parameters to perform closed-loop, real-world grasping that generalizes to 96% grasp success on unseen objects. Aside from attaining a very high success rate, our method exhibits behaviors that are quite distinct from more standard grasping systems: using only RGB vision-based perception from an over-the-shoulder camera, our method automatically learns regrasping strategies, probes objects to find the most effective grasps, learns to reposition objects and perform other non-prehensile pre-grasp manipulations, and responds dynamically to disturbances and perturbations.
△ Less
Submitted 27 November, 2018; v1 submitted 27 June, 2018;
originally announced June 2018.
-
Sim-to-Real: Learning Agile Locomotion For Quadruped Robots
Authors:
Jie Tan,
Tingnan Zhang,
Erwin Coumans,
Atil Iscen,
Yunfei Bai,
Danijar Hafner,
Steven Bohez,
Vincent Vanhoucke
Abstract:
Designing agile locomotion for quadruped robots often requires extensive expertise and tedious manual tuning. In this paper, we present a system to automate this process by leveraging deep reinforcement learning techniques. Our system can learn quadruped locomotion from scratch using simple reward signals. In addition, users can provide an open loop reference to guide the learning process when mor…
▽ More
Designing agile locomotion for quadruped robots often requires extensive expertise and tedious manual tuning. In this paper, we present a system to automate this process by leveraging deep reinforcement learning techniques. Our system can learn quadruped locomotion from scratch using simple reward signals. In addition, users can provide an open loop reference to guide the learning process when more control over the learned gait is needed. The control policies are learned in a physics simulator and then deployed on real robots. In robotics, policies trained in simulation often do not transfer to the real world. We narrow this reality gap by improving the physics simulator and learning robust policies. We improve the simulation using system identification, developing an accurate actuator model and simulating latency. We learn robust controllers by randomizing the physical environments, adding perturbations and designing a compact observation space. We evaluate our system on two agile locomotion gaits: trotting and galloping. After learning in simulation, a quadruped robot can successfully perform both gaits in the real world.
△ Less
Submitted 16 May, 2018; v1 submitted 26 April, 2018;
originally announced April 2018.
-
Classification of crystallization outcomes using deep convolutional neural networks
Authors:
Andrew E. Bruno,
Patrick Charbonneau,
Janet Newman,
Edward H. Snell,
David R. So,
Vincent Vanhoucke,
Christopher J. Watkins,
Shawn Williams,
Julie Wilson
Abstract:
The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly half a million annotated images of macromolecular crystallization experiments from various sources and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different parts of this data set. We find that more than 94% of the test images can be correctly labeled, irrespective…
▽ More
The Machine Recognition of Crystallization Outcomes (MARCO) initiative has assembled roughly half a million annotated images of macromolecular crystallization experiments from various sources and setups. Here, state-of-the-art machine learning algorithms are trained and tested on different parts of this data set. We find that more than 94% of the test images can be correctly labeled, irrespective of their experimental origin. Because crystal recognition is key to high-density screening and the systematic analysis of crystallization experiments, this approach opens the door to both industrial and fundamental research applications.
△ Less
Submitted 25 May, 2018; v1 submitted 27 March, 2018;
originally announced March 2018.
-
Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping
Authors:
Konstantinos Bousmalis,
Alex Irpan,
Paul Wohlhart,
Yunfei Bai,
Matthew Kelcey,
Mrinal Kalakrishnan,
Laura Downs,
Julian Ibarz,
Peter Pastor,
Kurt Konolige,
Sergey Levine,
Vincent Vanhoucke
Abstract:
Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms can be extremely time-consuming and expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real worl…
▽ More
Instrumenting and collecting annotated visual grasping datasets to train modern machine learning algorithms can be extremely time-consuming and expensive. An appealing alternative is to use off-the-shelf simulators to render synthetic data for which ground-truth annotations are generated automatically. Unfortunately, models trained purely on simulated data often fail to generalize to the real world. We study how randomized simulated environments and domain adaptation methods can be extended to train a grasping system to grasp novel objects from raw monocular RGB images. We extensively evaluate our approaches with a total of more than 25,000 physical test grasps, studying a range of simulation conditions and domain adaptation methods, including a novel extension of pixel-level domain adaptation that we term the GraspGAN. We show that, by using synthetic data and domain adaptation, we are able to reduce the number of real-world samples needed to achieve a given level of performance by up to 50 times, using only randomly generated simulated objects. We also show that by using only unlabeled real-world data and our GraspGAN methodology, we obtain real-world grasping performance without any real-world labels that is similar to that achieved with 939,777 labeled real-world samples.
△ Less
Submitted 25 September, 2017; v1 submitted 22 September, 2017;
originally announced September 2017.
-
TensorFlow Agents: Efficient Batched Reinforcement Learning in TensorFlow
Authors:
Danijar Hafner,
James Davidson,
Vincent Vanhoucke
Abstract:
We introduce TensorFlow Agents, an efficient infrastructure paradigm for building parallel reinforcement learning algorithms in TensorFlow. We simulate multiple environments in parallel, and group them to perform the neural network computation on a batch rather than individual observations. This allows the TensorFlow execution engine to parallelize computation, without the need for manual synchron…
▽ More
We introduce TensorFlow Agents, an efficient infrastructure paradigm for building parallel reinforcement learning algorithms in TensorFlow. We simulate multiple environments in parallel, and group them to perform the neural network computation on a batch rather than individual observations. This allows the TensorFlow execution engine to parallelize computation, without the need for manual synchronization. Environments are stepped in separate Python processes to progress them in parallel without interference of the global interpreter lock. As part of this project, we introduce BatchPPO, an efficient implementation of the proximal policy optimization algorithm. By open sourcing TensorFlow Agents, we hope to provide a flexible starting point for future projects that accelerates future research in the field.
△ Less
Submitted 31 October, 2018; v1 submitted 8 September, 2017;
originally announced September 2017.
-
YouTube-BoundingBoxes: A Large High-Precision Human-Annotated Data Set for Object Detection in Video
Authors:
Esteban Real,
Jonathon Shlens,
Stefano Mazzocchi,
Xin Pan,
Vincent Vanhoucke
Abstract:
We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. T…
▽ More
We introduce a new large-scale data set of video URLs with densely-sampled object bounding box annotations called YouTube-BoundingBoxes (YT-BB). The data set consists of approximately 380,000 video segments about 19s long, automatically selected to feature objects in natural settings without editing or post-processing, with a recording quality often akin to that of a hand-held cell phone camera. The objects represent a subset of the MS COCO label set. All video segments were human-annotated with high-precision classification labels and bounding boxes at 1 frame per second. The use of a cascade of increasingly precise human annotations ensures a label accuracy above 95% for every class and tight bounding boxes. Finally, we train and evaluate well-known deep network architectures and report baseline figures for per-frame classification and localization to provide a point of comparison for future work. We also demonstrate how the temporal contiguity of video can potentially be used to improve such inferences. Please see the PDF file to find the URL to download the data. We hope the availability of such large curated corpus will spur new advances in video object detection and tracking.
△ Less
Submitted 24 March, 2017; v1 submitted 2 February, 2017;
originally announced February 2017.
-
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
Authors:
Martín Abadi,
Ashish Agarwal,
Paul Barham,
Eugene Brevdo,
Zhifeng Chen,
Craig Citro,
Greg S. Corrado,
Andy Davis,
Jeffrey Dean,
Matthieu Devin,
Sanjay Ghemawat,
Ian Goodfellow,
Andrew Harp,
Geoffrey Irving,
Michael Isard,
Yangqing Jia,
Rafal Jozefowicz,
Lukasz Kaiser,
Manjunath Kudlur,
Josh Levenberg,
Dan Mane,
Rajat Monga,
Sherry Moore,
Derek Murray,
Chris Olah
, et al. (15 additional authors not shown)
Abstract:
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational de…
▽ More
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms. A computation expressed using TensorFlow can be executed with little or no change on a wide variety of heterogeneous systems, ranging from mobile devices such as phones and tablets up to large-scale distributed systems of hundreds of machines and thousands of computational devices such as GPU cards. The system is flexible and can be used to express a wide variety of algorithms, including training and inference algorithms for deep neural network models, and it has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields, including speech recognition, computer vision, robotics, information retrieval, natural language processing, geographic information extraction, and computational drug discovery. This paper describes the TensorFlow interface and an implementation of that interface that we have built at Google. The TensorFlow API and a reference implementation were released as an open-source package under the Apache 2.0 license in November, 2015 and are available at www.tensorflow.org.
△ Less
Submitted 16 March, 2016; v1 submitted 14 March, 2016;
originally announced March 2016.
-
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Authors:
Christian Szegedy,
Sergey Ioffe,
Vincent Vanhoucke,
Alex Alemi
Abstract:
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performanc…
▽ More
Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge
△ Less
Submitted 23 August, 2016; v1 submitted 23 February, 2016;
originally announced February 2016.
-
Rethinking the Inception Architecture for Computer Vision
Authors:
Christian Szegedy,
Vincent Vanhoucke,
Sergey Ioffe,
Jonathon Shlens,
Zbigniew Wojna
Abstract:
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided…
▽ More
Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set.
△ Less
Submitted 11 December, 2015; v1 submitted 1 December, 2015;
originally announced December 2015.
-
Going Deeper with Convolutions
Authors:
Christian Szegedy,
Wei Liu,
Yangqing Jia,
Pierre Sermanet,
Scott Reed,
Dragomir Anguelov,
Dumitru Erhan,
Vincent Vanhoucke,
Andrew Rabinovich
Abstract:
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully c…
▽ More
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
△ Less
Submitted 16 September, 2014;
originally announced September 2014.