-
Gemini Robotics: Bringing AI into the Physical World
Authors:
Gemini Robotics Team,
Saminda Abeyruwan,
Joshua Ainslie,
Jean-Baptiste Alayrac,
Montserrat Gonzalez Arenas,
Travis Armstrong,
Ashwin Balakrishna,
Robert Baruch,
Maria Bauza,
Michiel Blokzijl,
Steven Bohez,
Konstantinos Bousmalis,
Anthony Brohan,
Thomas Buschmann,
Arunkumar Byravan,
Serkan Cabi,
Ken Caluwaerts,
Federico Casarini,
Oscar Chang,
Jose Enrique Chen,
Xi Chen,
Hao-Tien Lewis Chiang,
Krzysztof Choromanski,
David D'Ambrosio,
Sudeep Dasari,
et al. (93 additional authors not shown)
Abstract:
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) generalist model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range of complex manipulation tasks while also being robust to variations in object types and positions, handling unseen environments as well as following diverse, open-vocabulary instructions. We show that with additional fine-tuning, Gemini Robotics can be specialized to new capabilities including solving long-horizon, highly dexterous tasks, learning new short-horizon tasks from as few as 100 demonstrations and adapting to completely novel robot embodiments. This is made possible because Gemini Robotics builds on top of the Gemini Robotics-ER model, the second model we introduce in this work. Gemini Robotics-ER (Embodied Reasoning) extends Gemini's multimodal reasoning capabilities into the physical world, with enhanced spatial and temporal understanding. This enables capabilities relevant to robotics including object detection, pointing, trajectory and grasp prediction, as well as multi-view correspondence and 3D bounding box predictions. We show how this novel combination can support a variety of robotics applications. We also discuss and address important safety considerations related to this new class of robotics foundation models. The Gemini Robotics family marks a substantial step towards developing general-purpose robots that realize AI's potential in the physical world.
Submitted 25 March, 2025;
originally announced March 2025.
-
A Taxonomy for Evaluating Generalist Robot Policies
Authors:
Jensen Gao,
Suneel Belkhale,
Sudeep Dasari,
Ashwin Balakrishna,
Dhruv Shah,
Dorsa Sadigh
Abstract:
Machine learning for robotics promises to unlock generalization to novel tasks and environments. Guided by this promise, many recent works have focused on scaling up robot data collection and developing larger, more expressive policies to achieve this. But how do we measure progress towards this goal of policy generalization in practice? Evaluating and quantifying generalization is the Wild West of modern robotics, with each work proposing and measuring different types of generalization in their own, often difficult to reproduce, settings. In this work, our goal is (1) to outline the forms of generalization we believe are important in robot manipulation in a comprehensive and fine-grained manner, and (2) to provide reproducible guidelines for measuring these notions of generalization. We first propose STAR-Gen, a taxonomy of generalization for robot manipulation structured around visual, semantic, and behavioral generalization. We discuss how our taxonomy encompasses most prior notions of generalization in robotics. Next, we instantiate STAR-Gen with a concrete real-world benchmark based on the widely-used Bridge V2 dataset. We evaluate a variety of state-of-the-art models on this benchmark to demonstrate the utility of our taxonomy in practice. Our taxonomy of generalization can yield many interesting insights into existing models: for example, we observe that current vision-language-action models struggle with various types of semantic generalization, despite the promise of pre-training on internet-scale language datasets. We believe STAR-Gen and our guidelines can improve the dissemination and evaluation of progress towards generalization in robotics, which we hope will guide model design and future data collection efforts. We provide videos and demos at our website stargen-taxonomy.github.io.
Submitted 3 March, 2025;
originally announced March 2025.
-
Enhancing SQL Injection Detection and Prevention Using Generative Models
Authors:
Naga Sai Dasari,
Atta Badii,
Armin Moin,
Ahmed Ashlam
Abstract:
SQL Injection (SQLi) continues to pose a significant threat to the security of web applications, enabling attackers to manipulate databases and access sensitive information without authorisation. Although advancements have been made in detection techniques, traditional signature-based methods still struggle to identify sophisticated SQL injection attacks that evade predefined patterns. As SQLi attacks evolve, the need for more adaptive detection systems becomes crucial. This paper introduces an innovative approach that leverages generative models to enhance SQLi detection and prevention mechanisms. By incorporating Variational Autoencoders (VAE), Conditional Wasserstein GAN with Gradient Penalty (CWGAN-GP), and U-Net, synthetic SQL queries were generated to augment training datasets for machine learning models. The proposed method demonstrated improved accuracy in SQLi detection systems by reducing both false positives and false negatives. Extensive empirical testing further illustrated the ability of the system to adapt to evolving SQLi attack patterns, resulting in enhanced precision and robustness.
Submitted 7 February, 2025;
originally announced February 2025.
-
The Ingredients for Robotic Diffusion Transformers
Authors:
Sudeep Dasari,
Oier Mees,
Sebastian Zhao,
Mohan Kumar Srirama,
Sergey Levine
Abstract:
In recent years roboticists have achieved remarkable progress in solving increasingly general tasks on dexterous robotic hardware by leveraging high capacity Transformer network architectures and generative diffusion models. Unfortunately, combining these two orthogonal improvements has proven surprisingly difficult, since there is no clear and well-understood process for making important design choices. In this paper, we identify, study and improve key architectural design decisions for high-capacity diffusion transformer policies. The resulting models can efficiently solve diverse tasks on multiple robot embodiments, without the excruciating pain of per-setup hyper-parameter tuning. By combining the results of our investigation with our improved model components, we are able to present a novel architecture, named \method, that significantly outperforms the state of the art in solving long-horizon (1500+ time-steps) dexterous tasks on a bi-manual ALOHA robot. In addition, we find that our policies show improved scaling performance when trained on 10 hours of highly multi-modal, language annotated ALOHA demonstration data. We hope this work will open the door for future robot learning techniques that leverage the efficiency of generative diffusion modeling with the scalability of large scale transformer architectures. Code, robot dataset, and videos are available at: https://dit-policy.github.io
Submitted 13 October, 2024;
originally announced October 2024.
-
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation
Authors:
Ria Doshi,
Homer Walke,
Oier Mees,
Sudeep Dasari,
Sergey Levine
Abstract:
Modern machine learning systems rely on large datasets to attain broad generalization, and this often poses a challenge in robot learning, where each robotic platform and task might have only a small dataset. By training a single policy across many different kinds of robots, a robot learning method can leverage much broader and more diverse datasets, which in turn can lead to better generalization and robustness. However, training a single policy on multi-robot data is challenging because robots can have widely varying sensors, actuators, and control frequencies. We propose CrossFormer, a scalable and flexible transformer-based policy that can consume data from any embodiment. We train CrossFormer on the largest and most diverse dataset to date, 900K trajectories across 20 different robot embodiments. We demonstrate that the same network weights can control vastly different robots, including single and dual arm manipulation systems, wheeled robots, quadcopters, and quadrupeds. Unlike prior work, our model does not require manual alignment of the observation or action spaces. Extensive experiments in the real world show that our method matches the performance of specialist policies tailored for each embodiment, while also significantly outperforming the prior state of the art in cross-embodiment learning.
Submitted 21 August, 2024;
originally announced August 2024.
-
HRP: Human Affordances for Robotic Pre-Training
Authors:
Mohan Kumar Srirama,
Sudeep Dasari,
Shikhar Bahl,
Abhinav Gupta
Abstract:
In order to *generalize* to various tasks in the wild, robotic agents will need a suitable representation (i.e., vision network) that enables the robot to predict optimal actions given high dimensional vision inputs. However, learning such a representation requires an extreme amount of diverse training data, which is prohibitively expensive to collect on a real robot. How can we overcome this problem? Instead of collecting more robot data, this paper proposes using internet-scale, human videos to extract "affordances," both at the environment and agent level, and distill them into a pre-trained representation. We present a simple framework for pre-training representations on hand, object, and contact "affordance labels" that highlight relevant objects in images and how to interact with them. These affordances are automatically extracted from human video data (with the help of off-the-shelf computer vision modules) and used to fine-tune existing representations. Our approach can efficiently fine-tune *any* existing representation, and results in models with stronger downstream robotic performance across the board. We experimentally demonstrate (using 3000+ robot trials) that this affordance pre-training scheme boosts performance by a minimum of 15% on 5 real-world tasks, which consider three diverse robot morphologies (including a dexterous hand). Unlike prior works in the space, these representations improve performance across 3 different camera views. Quantitatively, we find that our approach leads to higher levels of generalization in out-of-distribution settings. For code, weights, and data check: https://hrp-robot.github.io
Submitted 26 July, 2024;
originally announced July 2024.
-
Octo: An Open-Source Generalist Robot Policy
Authors:
Octo Model Team,
Dibya Ghosh,
Homer Walke,
Karl Pertsch,
Kevin Black,
Oier Mees,
Sudeep Dasari,
Joey Hejna,
Tobias Kreiman,
Charles Xu,
Jianlan Luo,
You Liang Tan,
Lawrence Yunliang Chen,
Pannag Sanketi,
Quan Vuong,
Ted Xiao,
Dorsa Sadigh,
Chelsea Finn,
Sergey Levine
Abstract:
Large policies pretrained on diverse robot datasets have the potential to transform robotic learning: instead of training new policies from scratch, such generalist robot policies may be finetuned with only a little in-domain data, yet generalize broadly. However, to be widely applicable across a range of robotic learning scenarios, environments, and tasks, such policies need to handle diverse sensors and action spaces, accommodate a variety of commonly used robotic platforms, and finetune readily and efficiently to new domains. In this work, we aim to lay the groundwork for developing open-source, widely applicable, generalist policies for robotic manipulation. As a first step, we introduce Octo, a large transformer-based policy trained on 800k trajectories from the Open X-Embodiment dataset, the largest robot manipulation dataset to date. It can be instructed via language commands or goal images and can be effectively finetuned to robot setups with new sensory inputs and action spaces within a few hours on standard consumer GPUs. In experiments across 9 robotic platforms, we demonstrate that Octo serves as a versatile policy initialization that can be effectively finetuned to new observation and action spaces. We also perform detailed ablations of design decisions for the Octo model, from architecture to training data, to guide future research on building generalist robot models.
Submitted 26 May, 2024; v1 submitted 20 May, 2024;
originally announced May 2024.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Authors:
Alexander Khazatsky,
Karl Pertsch,
Suraj Nair,
Ashwin Balakrishna,
Sudeep Dasari,
Siddharth Karamcheti,
Soroush Nasiriany,
Mohan Kumar Srirama,
Lawrence Yunliang Chen,
Kirsty Ellis,
Peter David Fagan,
Joey Hejna,
Masha Itkina,
Marion Lepert,
Yecheng Jason Ma,
Patrick Tree Miller,
Jimmy Wu,
Suneel Belkhale,
Shivin Dass,
Huy Ha,
Arhan Jain,
Abraham Lee,
Youngwoon Lee,
Marius Memmel,
Sungjae Park,
et al. (76 additional authors not shown)
Abstract:
The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.
Submitted 22 April, 2025; v1 submitted 19 March, 2024;
originally announced March 2024.
-
An Unbiased Look at Datasets for Visuo-Motor Pre-Training
Authors:
Sudeep Dasari,
Mohan Kumar Srirama,
Unnat Jain,
Abhinav Gupta
Abstract:
Visual representation learning holds great promise for robotics, but is severely hampered by the scarcity and homogeneity of robotics datasets. Recent works address this problem by pre-training visual representations on large-scale but out-of-domain data (e.g., videos of egocentric interactions) and then transferring them to target robotics tasks. While the field is heavily focused on developing better pre-training algorithms, we find that dataset choice is just as important to this paradigm's success. After all, the representation can only learn the structures or priors present in the pre-training dataset. To this end, we flip the focus on algorithms, and instead conduct a dataset-centric analysis of robotic pre-training. Our findings call into question some common wisdom in the field. We observe that traditional vision datasets (like ImageNet, Kinetics and 100 Days of Hands) are surprisingly competitive options for visuo-motor representation learning, and that the pre-training dataset's image distribution matters more than its size. Finally, we show that common simulation benchmarks are not a reliable proxy for real-world performance and that simple regularization strategies can dramatically improve real-world policy learning. https://data4robotics.github.io
Submitted 13 October, 2023;
originally announced October 2023.
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Authors:
Open X-Embodiment Collaboration,
Abby O'Neill,
Abdul Rehman,
Abhinav Gupta,
Abhiram Maddukuri,
Abhishek Gupta,
Abhishek Padalkar,
Abraham Lee,
Acorn Pooley,
Agrim Gupta,
Ajay Mandlekar,
Ajinkya Jain,
Albert Tung,
Alex Bewley,
Alex Herzog,
Alex Irpan,
Alexander Khazatsky,
Anant Rai,
Anchit Gupta,
Andrew Wang,
Andrey Kolobov,
Anikait Singh,
Animesh Garg,
Aniruddha Kembhavi,
Annie Xie,
et al. (269 additional authors not shown)
Abstract:
Large, high-capacity models trained on diverse datasets have shown remarkable successes in efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
Submitted 14 May, 2025; v1 submitted 13 October, 2023;
originally announced October 2023.
-
MyoDex: A Generalizable Prior for Dexterous Manipulation
Authors:
Vittorio Caggiano,
Sudeep Dasari,
Vikash Kumar
Abstract:
Human dexterity is a hallmark of motor control. Our hands can rapidly synthesize new behaviors despite the complexity (multi-articular and multi-joints, with 23 joints controlled by more than 40 muscles) of musculoskeletal sensory-motor circuits. In this work, we take inspiration from how human dexterity builds on a diversity of prior experiences, instead of being acquired through a single task. Motivated by this observation, we set out to develop agents that can build upon their previous experience to quickly acquire new (previously unattainable) behaviors. Specifically, our approach leverages multi-task learning to implicitly capture task-agnostic behavioral priors (MyoDex) for human-like dexterity, using a physiologically realistic human hand model - MyoHand. We demonstrate MyoDex's effectiveness in few-shot generalization as well as positive transfer to a large repertoire of unseen dexterous manipulation tasks. Agents leveraging MyoDex can solve approximately 3x more tasks, and 4x faster in comparison to a distillation baseline. While prior work has synthesized single musculoskeletal control behaviors, MyoDex is the first generalizable manipulation prior that catalyzes the learning of dexterous physiological control across a large variety of contact-rich behaviors. We also demonstrate the effectiveness of our paradigms beyond musculoskeletal control towards the acquisition of dexterity in 24 DoF Adroit Hand. Website: https://sites.google.com/view/myodex
Submitted 6 September, 2023;
originally announced September 2023.
-
Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations
Authors:
Jianren Wang,
Sudeep Dasari,
Mohan Kumar Srirama,
Shubham Tulsiani,
Abhinav Gupta
Abstract:
The field of visual representation learning has seen explosive growth in the past years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific) robot action policies (e.g., via behavior cloning). While the visual representations do accelerate learning, they are primarily used to encode visual observations. Thus, action information has to be derived purely from robot data, which is expensive to collect! In this work, we present a scalable alternative where the visual representations can help directly infer robot actions. We observe that vision encoders express relationships between image observations as distances (e.g., via embedding dot product) that could be used to efficiently plan robot behavior. We operationalize this insight and develop a simple algorithm for acquiring a distance function and dynamics predictor, by fine-tuning a pre-trained representation on human-collected video sequences. The final method is able to substantially outperform traditional robot learning baselines (e.g., 70% success vs. 50% for behavior cloning on pick-place) on a suite of diverse real-world manipulation tasks. It can also generalize to novel objects, without using any robot demonstrations during train time. For visualizations of the learned policies please check: https://agi-labs.github.io/manipulate-by-seeing/.
Submitted 15 August, 2023; v1 submitted 14 March, 2023;
originally announced March 2023.
-
Learning Dexterous Manipulation from Exemplar Object Trajectories and Pre-Grasps
Authors:
Sudeep Dasari,
Abhinav Gupta,
Vikash Kumar
Abstract:
Learning diverse dexterous manipulation behaviors with assorted objects remains an open grand challenge. While policy learning methods offer a powerful avenue to attack this problem, they require extensive per-task engineering and algorithmic tuning. This paper seeks to escape these constraints, by developing a Pre-Grasp informed Dexterous Manipulation (PGDM) framework that generates diverse dexterous manipulation behaviors, without any task-specific reasoning or hyper-parameter tuning. At the core of PGDM is a well known robotics construct, pre-grasps (i.e. the hand-pose preparing for object interaction). This simple primitive is enough to induce efficient exploration strategies for acquiring complex dexterous manipulation behaviors. To exhaustively verify these claims, we introduce TCDM, a benchmark of 50 diverse manipulation tasks defined over multiple objects and dexterous manipulators. Tasks for TCDM are defined automatically using exemplar object trajectories from various sources (animators, human behaviors, etc.), without any per-task engineering and/or supervision. Our experiments validate that PGDM's exploration strategy, induced by a surprisingly simple ingredient (single pre-grasp pose), matches the performance of prior methods, which require expensive per-task feature/reward engineering, expert supervision, and hyper-parameter tuning. For animated visualizations, trained policies, and project code, please refer to: https://pregrasps.github.io/
Submitted 12 February, 2023; v1 submitted 22 September, 2022;
originally announced September 2022.
-
RB2: Robotic Manipulation Benchmarking with a Twist
Authors:
Sudeep Dasari,
Jianren Wang,
Joyce Hong,
Shikhar Bahl,
Yixin Lin,
Austin Wang,
Abitha Thankaraj,
Karanbir Chahal,
Berk Calli,
Saurabh Gupta,
David Held,
Lerrel Pinto,
Deepak Pathak,
Vikash Kumar,
Abhinav Gupta
Abstract:
Benchmarks offer a scientific way to compare algorithms using objective performance metrics. Good benchmarks have two features: (a) they should be widely useful for many research groups; and (b) they should produce reproducible findings. In robotic manipulation research, there is a trade-off between reproducibility and broad accessibility. If the benchmark is kept restrictive (fixed hardware, objects), the numbers are reproducible but the setup becomes less general. On the other hand, a benchmark could be a loose set of protocols (e.g. object sets) but the underlying variation in setups makes the results non-reproducible. In this paper, we re-imagine benchmarking for robotic manipulation as state-of-the-art algorithmic implementations, alongside the usual set of tasks and experimental protocols. The added baseline implementations will provide a way to easily recreate SOTA numbers in a new local robotic setup, thus providing credible relative rankings between existing approaches and new work. However, these local rankings could vary between different setups. To resolve this issue, we build a mechanism for pooling experimental data between labs, and thus we establish a single global ranking for existing (and proposed) SOTA algorithms. Our benchmark, called Ranking-Based Robotics Benchmark (RB2), is evaluated on tasks that are inspired by the clinically validated Southampton Hand Assessment Procedures. Our benchmark was run across two different labs and reveals several surprising findings. For example, extremely simple baselines like open-loop behavior cloning outperform more complicated models (e.g. closed loop, RNN, Offline-RL, etc.) that are preferred by the field. We hope our fellow researchers will use RB2 to improve their research's quality and rigor.
Submitted 30 October, 2022; v1 submitted 15 March, 2022;
originally announced March 2022.
-
Unified Citizen Identity System Using Blockchain
Authors:
Sri Sai Abhishake Gopal Dasari
Abstract:
The citizenship identities of a nation's occupants enable the state to identify and authenticate them unquestionably. These documents help individuals identify themselves and benefit from the rights and advantages given to them by the legislature or the constitution of the land. There are problems in the traditional way of issuing these identities and many hurdles that impede people from getting their benefits or exercising their rights. These paper-based identities can be forged easily and are hard to authenticate at various civil end points. There are reports of identity thefts. In this paper, we discuss how Blockchain can be employed to overcome these problems and make these identities confidential, immutable, and secured. Blockchain technology can help the governing bodies in maintaining and verifying these identities in a quick manner with less chance for human errors, meaning more reach for government plans and aid.
Submitted 17 January, 2021;
originally announced January 2021.
-
Model-Based Visual Planning with Self-Supervised Functional Distances
Authors:
Stephen Tian,
Suraj Nair,
Frederik Ebert,
Sudeep Dasari,
Benjamin Eysenbach,
Chelsea Finn,
Sergey Levine
Abstract:
A generalist robot must be able to complete a variety of tasks in its environment. One appealing way to specify each task is in terms of a goal observation. However, learning goal-reaching policies with reinforcement learning remains a challenging problem, particularly when hand-engineered reward functions are not available. Learned dynamics models are a promising approach for learning about the environment without rewards or task-directed data, but planning to reach goals with such a model requires a notion of functional similarity between observations and goal states. We present a self-supervised method for model-based visual goal reaching, which uses both a visual dynamics model as well as a dynamical distance function learned using model-free reinforcement learning. Our approach learns entirely using offline, unlabeled data, making it practical to scale to large and diverse datasets. In our experiments, we find that our method can successfully learn models that perform a variety of tasks at test-time, moving objects amid distractors with a simulated robotic arm and even learning to open and close a drawer using a real-world robot. In comparisons, we find that this approach substantially outperforms both model-free and model-based prior methods. Videos and visualizations are available here: http://sites.google.com/berkeley.edu/mbold.
Submitted 30 December, 2020;
originally announced December 2020.
-
Transformers for One-Shot Visual Imitation
Authors:
Sudeep Dasari,
Abhinav Gupta
Abstract:
Humans are able to seamlessly visually imitate others, by inferring their intentions and using past experience to achieve the same end goal. In other words, we can parse complex semantic knowledge from raw video and efficiently translate that into concrete motor control. Is it possible to give a robot this same capability? Prior research in robot imitation learning has created agents which can acquire diverse skills from expert human operators. However, expanding these techniques to work with a single positive example during test time is still an open challenge. Apart from control, the difficulty stems from mismatches between the demonstrator and robot domains. For example, objects may be placed in different locations (e.g. kitchen layouts are different in every house). Additionally, the demonstration may come from an agent with different morphology and physical appearance (e.g. human), so one-to-one action correspondences are not available. This paper investigates techniques which allow robots to partially bridge these domain gaps, using their past experience. A neural network is trained to mimic ground truth robot actions given context video from another agent, and must generalize to unseen task instances when prompted with new videos during test time. We hypothesize that our policy representations must be both context driven and dynamics aware in order to perform these tasks. These assumptions are baked into the neural network using the Transformers attention mechanism and a self-supervised inverse dynamics loss. Finally, we experimentally determine that our method accomplishes a $\sim 2$x improvement in terms of task success rate over prior baselines in a suite of one-shot manipulation tasks.
Submitted 11 November, 2020;
originally announced November 2020.
-
RoboNet: Large-Scale Multi-Robot Learning
Authors:
Sudeep Dasari,
Frederik Ebert,
Stephen Tian,
Suraj Nair,
Bernadette Bucher,
Karl Schmeckpeper,
Siddharth Singh,
Sergey Levine,
Chelsea Finn
Abstract:
Robot learning has emerged as a promising tool for taming the complexity and diversity of the real world. Methods based on high-capacity models, such as deep networks, hold the promise of providing effective generalization to a wide range of open-world environments. However, these same methods typically require large amounts of diverse training data to generalize effectively. In contrast, most robotic learning experiments are small-scale, single-domain, and single-robot. This leads to a frequent tension in robotic learning: how can we learn generalizable robotic controllers without having to collect impractically large amounts of data for each separate experiment? In this paper, we propose RoboNet, an open database for sharing robotic experience, which provides an initial pool of 15 million video frames, from 7 different robot platforms, and study how it can be used to learn generalizable models for vision-based robotic manipulation. We combine the dataset with two different learning algorithms: visual foresight, which uses forward video prediction models, and supervised inverse models. Our experiments test the learned algorithms' ability to work across new objects, new tasks, new scenes, new camera viewpoints, new grippers, or even entirely new robots. In our final experiment, we find that by pre-training on RoboNet and fine-tuning on data from a held-out Franka or Kuka robot, we can exceed the performance of a robot-specific training approach that uses 4x-20x more data. For videos and data, see the project webpage: https://www.robonet.wiki/
Submitted 2 January, 2020; v1 submitted 24 October, 2019;
originally announced October 2019.
-
Finding New Diagnostic Information for Detecting Glaucoma using Neural Networks
Authors:
Erfan Noury,
Suria S. Mannil,
Robert T. Chang,
An Ran Ran,
Carol Y. Cheung,
Suman S. Thapa,
Harsha L. Rao,
Srilakshmi Dasari,
Mohammed Riyazuddin,
Dolly Chang,
Sriharsha Nagaraj,
Clement C. Tham,
Reza Zadeh
Abstract:
We describe a new approach to automated Glaucoma detection in 3D Spectral Domain Optical Coherence Tomography (OCT) optic nerve scans. First, we gathered a unique and diverse multi-ethnic dataset of OCT scans consisting of glaucoma and non-glaucomatous cases obtained from four tertiary care eye hospitals located in four different countries. Using this longitudinal data, we achieved state-of-the-art results for automatically detecting Glaucoma from a single raw OCT using a 3D Deep Learning system. These results are close to human doctors in a variety of settings across heterogeneous datasets and scanning environments. To verify correctness and interpretability of the automated categorization, we used saliency maps to find areas of focus for the model. Matching human doctor behavior, the model predictions indeed correlated with the conventional diagnostic parameters in the OCT printouts, such as the retinal nerve fiber layer. We further used our model to find new areas in the 3D data that are presently not being identified as a diagnostic parameter to detect glaucoma by human doctors. Namely, we found that the Lamina Cribrosa (LC) region can be a valuable source of helpful diagnostic information previously unavailable to doctors during routine clinical care because it lacks a quantitative printout. Our model provides such volumetric quantification of this region. We found that even when a majority of the RNFL is removed, the LC region can distinguish glaucoma. This is clinically relevant in high myopes, when the RNFL is already reduced, and thus the LC region may help differentiate glaucoma in this confounding situation. We further generalize this approach to create a new algorithm called DiagFind that provides a recipe for finding new diagnostic information in medical imagery that may have been previously unusable by doctors.
Submitted 2 September, 2020; v1 submitted 14 October, 2019;
originally announced October 2019.
-
Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control
Authors:
Frederik Ebert,
Chelsea Finn,
Sudeep Dasari,
Annie Xie,
Alex Lee,
Sergey Levine
Abstract:
Deep reinforcement learning (RL) algorithms can learn complex robotic skills from raw sensory inputs, but have yet to achieve the kind of broad generalization and applicability demonstrated by deep learning methods in supervised domains. We present a deep RL method that is practical for real-world robotics tasks, such as robotic manipulation, and generalizes effectively to never-before-seen tasks and objects. In these settings, ground truth reward signals are typically unavailable, and we therefore propose a self-supervised model-based approach, where a predictive model learns to directly predict the future from raw sensory readings, such as camera images. At test time, we explore three distinct goal specification methods: designated pixels, where a user specifies desired object manipulation tasks by selecting particular pixels in an image and corresponding goal positions, goal images, where the desired goal state is specified with an image, and image classifiers, which define spaces of goal states. Our deep predictive models are trained using data collected autonomously and continuously by a robot interacting with hundreds of objects, without human supervision. We demonstrate that visual MPC can generalize to never-before-seen objects---both rigid and deformable---and solve a range of user-defined object manipulation tasks using the same model.
Submitted 3 December, 2018;
originally announced December 2018.
-
Robustness via Retrying: Closed-Loop Robotic Manipulation with Self-Supervised Learning
Authors:
Frederik Ebert,
Sudeep Dasari,
Alex X. Lee,
Sergey Levine,
Chelsea Finn
Abstract:
Prediction is an appealing objective for self-supervised learning of behavioral skills, particularly for autonomous robots. However, effectively utilizing predictive models for control, especially with raw image inputs, poses a number of major challenges. How should the predictions be used? What happens when they are inaccurate? In this paper, we tackle these questions by proposing a method for learning robotic skills from raw image observations, using only autonomously collected experience. We show that even an imperfect model can complete complex tasks if it can continuously retry, but this requires the model to not lose track of the objective (e.g., the object of interest). To enable a robot to continuously retry a task, we devise a self-supervised algorithm for learning image registration, which can keep track of objects of interest for the duration of the trial. We demonstrate that this idea can be combined with a video-prediction based controller to enable complex behaviors to be learned from scratch using only raw visual inputs, including grasping, repositioning objects, and non-prehensile manipulation. Our real-world experiments demonstrate that a model trained with 160 robot hours of autonomously collected, unlabeled data is able to successfully perform complex manipulation tasks with a wide range of objects not seen during training.
Submitted 6 October, 2018;
originally announced October 2018.
-
One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning
Authors:
Tianhe Yu,
Chelsea Finn,
Annie Xie,
Sudeep Dasari,
Tianhao Zhang,
Pieter Abbeel,
Sergey Levine
Abstract:
Humans and animals are capable of learning a new behavior by observing others perform the skill just once. We consider the problem of allowing a robot to do the same -- learning from raw video pixels of a human, even when there is substantial domain shift in the perspective, environment, and embodiment between the robot and the observed human. Prior approaches to this problem have hand-specified how human and robot actions correspond and often relied on explicit human pose detection systems. In this work, we present an approach for one-shot learning from a video of a human by using human and robot demonstration data from a variety of previous tasks to build up prior knowledge through meta-learning. Then, combining this prior knowledge and only a single video demonstration from a human, the robot can perform the task that the human demonstrated. We show experiments on both a PR2 arm and a Sawyer arm, demonstrating that after meta-learning, the robot can learn to place, push, and pick-and-place new objects using just one video of a human performing the manipulation.
Submitted 5 February, 2018;
originally announced February 2018.