-
A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation
Authors:
TRI LBM Team,
Jose Barreiros,
Andrew Beaulieu,
Aditya Bhat,
Rick Cory,
Eric Cousineau,
Hongkai Dai,
Ching-Hsin Fang,
Kunimatsu Hashimoto,
Muhammad Zubair Irshad,
Masha Itkina,
Naveen Kuppuswamy,
Kuan-Hui Lee,
Katherine Liu,
Dale McConachie,
Ian McMahon,
Haruki Nishimura,
Calder Phillips-Grafflin,
Charles Richter,
Paarth Shah,
Krishnan Srinivasan,
Blake Wulfe,
Chen Xu,
Mengchao Zhang,
Alex Alspach
, et al. (57 additional authors not shown)
Abstract:
Robot manipulation has seen tremendous progress in recent years, with imitation learning policies enabling successful performance of dexterous and hard-to-model tasks. Concurrently, scaling data and model size has led to the development of capable language and vision foundation models, motivating large-scale efforts to create general-purpose robot foundation models. While these models have garnere…
▽ More
Robot manipulation has seen tremendous progress in recent years, with imitation learning policies enabling successful performance of dexterous and hard-to-model tasks. Concurrently, scaling data and model size has led to the development of capable language and vision foundation models, motivating large-scale efforts to create general-purpose robot foundation models. While these models have garnered significant enthusiasm and investment, meaningful evaluation of real-world performance remains a challenge, limiting both the pace of development and inhibiting a nuanced understanding of current capabilities. In this paper, we rigorously evaluate multitask robot manipulation policies, referred to as Large Behavior Models (LBMs), by extending the Diffusion Policy paradigm across a corpus of simulated and real-world robot data. We propose and validate an evaluation pipeline to rigorously analyze the capabilities of these models with statistical confidence. We compare against single-task baselines through blind, randomized trials in a controlled setting, using both simulation and real-world experiments. We find that multi-task pretraining makes the policies more successful and robust, and enables teaching complex new tasks more quickly, using a fraction of the data when compared to single-task baselines. Moreover, performance predictably increases as pretraining scale and diversity grows. Project page: https://toyotaresearchinstitute.github.io/lbm1/
△ Less
Submitted 7 July, 2025;
originally announced July 2025.
-
CUPID: Curating Data your Robot Loves with Influence Functions
Authors:
Christopher Agia,
Rohan Sinha,
Jingyun Yang,
Rika Antonova,
Marco Pavone,
Haruki Nishimura,
Masha Itkina,
Jeannette Bohg
Abstract:
In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes - such as closed-loop task success or failure - remains a persistent challenge. We propose CUPID, a robot data curation method based on a novel influence function-the…
▽ More
In robot imitation learning, policy performance is tightly coupled with the quality and composition of the demonstration data. Yet, developing a precise understanding of how individual demonstrations contribute to downstream outcomes - such as closed-loop task success or failure - remains a persistent challenge. We propose CUPID, a robot data curation method based on a novel influence function-theoretic formulation for imitation learning policies. Given a set of evaluation rollouts, CUPID estimates the influence of each training demonstration on the policy's expected return. This enables ranking and selection of demonstrations according to their impact on the policy's closed-loop performance. We use CUPID to curate data by 1) filtering out training demonstrations that harm policy performance and 2) subselecting newly collected trajectories that will most improve the policy. Extensive simulated and hardware experiments show that our approach consistently identifies which data drives test-time performance. For example, training with less than 33% of curated data can yield state-of-the-art diffusion policies on the simulated RoboMimic benchmark, with similar gains observed in hardware. Furthermore, hardware experiments show that our method can identify robust strategies under distribution shift, isolate spurious correlations, and even enhance the post-training of generalist robot policies. Additional materials are made available at: https://cupid-curation.github.io.
△ Less
Submitted 23 June, 2025;
originally announced June 2025.
-
SAFE: Multitask Failure Detection for Vision-Language-Action Models
Authors:
Qiao Gu,
Yuanliang Ju,
Shengxiang Sun,
Igor Gilitschenski,
Haruki Nishimura,
Masha Itkina,
Florian Shkurti
Abstract:
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existi…
▽ More
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out-of-the-box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts, and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $π_0$, and $π_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results can be found at https://vla-safe.github.io/.
△ Less
Submitted 11 June, 2025;
originally announced June 2025.
-
STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation
Authors:
Hossein Goli,
Michael Gimelfarb,
Nathan Samuel de Lara,
Haruki Nishimura,
Masha Itkina,
Florian Shkurti
Abstract:
Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or…
▽ More
Off-policy evaluation (OPE) estimates the performance of a target policy using offline data collected from a behavior policy, and is crucial in domains such as robotics or healthcare where direct interaction with the environment is costly or unsafe. Existing OPE methods are ineffective for high-dimensional, long-horizon problems, due to exponential blow-ups in variance from importance weighting or compounding errors from learned dynamics models. To address these challenges, we propose STITCH-OPE, a model-based generative framework that leverages denoising diffusion for long-horizon OPE in high-dimensional state and action spaces. Starting with a diffusion model pre-trained on the behavior data, STITCH-OPE generates synthetic trajectories from the target policy by guiding the denoising process using the score function of the target policy. STITCH-OPE proposes two technical innovations that make it advantageous for OPE: (1) prevents over-regularization by subtracting the score of the behavior policy during guidance, and (2) generates long-horizon trajectories by stitching partial trajectories together end-to-end. We provide a theoretical guarantee that under mild assumptions, these modifications result in an exponential reduction in variance versus long-horizon trajectory diffusion. Experiments on the D4RL and OpenAI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods.
△ Less
Submitted 27 May, 2025;
originally announced May 2025.
-
Is Your Imitation Learning Policy Better than Mine? Policy Comparison with Near-Optimal Stopping
Authors:
David Snyder,
Asher James Hancock,
Apurva Badithela,
Emma Dixon,
Patrick Miller,
Rares Andrei Ambrus,
Anirudha Majumdar,
Masha Itkina,
Haruki Nishimura
Abstract:
Imitation learning has enabled robots to perform complex, long-horizon tasks in challenging dexterous manipulation settings. As new methods are developed, they must be rigorously evaluated and compared against corresponding baselines through repeated evaluation trials. However, policy comparison is fundamentally constrained by a small feasible sample size (e.g., 10 or 50) due to significant human…
▽ More
Imitation learning has enabled robots to perform complex, long-horizon tasks in challenging dexterous manipulation settings. As new methods are developed, they must be rigorously evaluated and compared against corresponding baselines through repeated evaluation trials. However, policy comparison is fundamentally constrained by a small feasible sample size (e.g., 10 or 50) due to significant human effort and limited inference throughput of policies. This paper proposes a novel statistical framework for rigorously comparing two policies in the small sample size regime. Prior work in statistical policy comparison relies on batch testing, which requires a fixed, pre-determined number of trials and lacks flexibility in adapting the sample size to the observed evaluation data. Furthermore, extending the test with additional trials risks inducing inadvertent p-hacking, undermining statistical assurances. In contrast, our proposed statistical test is sequential, allowing researchers to decide whether or not to run more trials based on intermediate results. This adaptively tailors the number of trials to the difficulty of the underlying comparison, saving significant time and effort without sacrificing probabilistic correctness. Extensive numerical simulation and real-world robot manipulation experiments show that our test achieves near-optimal stopping, letting researchers stop evaluation and make a decision in a near-minimal number of trials. Specifically, it reduces the number of evaluation trials by up to 32% as compared to state-of-the-art baselines, while preserving the probabilistic correctness and statistical power of the comparison. Moreover, our method is strongest in the most challenging comparison instances (requiring the most evaluation trials); in a multi-task comparison scenario, we save the evaluator more than 160 simulation rollouts.
△ Less
Submitted 6 June, 2025; v1 submitted 13 March, 2025;
originally announced March 2025.
-
Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies
Authors:
Chen Xu,
Tony Khuong Nguyen,
Emma Dixon,
Christopher Rodriguez,
Patrick Miller,
Robert Lee,
Paarth Shah,
Rares Ambrus,
Haruki Nishimura,
Masha Itkina
Abstract:
Recent years have witnessed impressive robotic manipulation systems driven by advances in imitation learning and generative modeling, such as diffusion- and flow-based approaches. As robot policy performance increases, so does the complexity and time horizon of achievable tasks, inducing unexpected and diverse failure modes that are difficult to predict a priori. To enable trustworthy policy deplo…
▽ More
Recent years have witnessed impressive robotic manipulation systems driven by advances in imitation learning and generative modeling, such as diffusion- and flow-based approaches. As robot policy performance increases, so does the complexity and time horizon of achievable tasks, inducing unexpected and diverse failure modes that are difficult to predict a priori. To enable trustworthy policy deployment in safety-critical human environments, reliable runtime failure detection becomes important during policy inference. However, most existing failure detection approaches rely on prior knowledge of failure modes and require failure data during training, which imposes a significant challenge in practicality and scalability. In response to these limitations, we present FAIL-Detect, a modular two-stage approach for failure detection in imitation learning-based robotic manipulation. To accurately identify failures from successful training data alone, we frame the problem as sequential out-of-distribution (OOD) detection. We first distill policy inputs and outputs into scalar signals that correlate with policy failures and capture epistemic uncertainty. FAIL-Detect then employs conformal prediction (CP) as a versatile framework for uncertainty quantification with statistical guarantees. Empirically, we thoroughly investigate both learned and post-hoc scalar signal candidates on diverse robotic manipulation tasks. Our experiments show learned signals to be mostly consistently effective, particularly when using our novel flow-based density estimator. Furthermore, our method detects failures more accurately and faster than state-of-the-art (SOTA) failure detection baselines. These results highlight the potential of FAIL-Detect to enhance the safety and reliability of imitation learning-based robotic systems as they progress toward real-world deployment.
△ Less
Submitted 20 June, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
GHIL-Glue: Hierarchical Control with Filtered Subgoal Images
Authors:
Kyle B. Hatch,
Ashwin Balakrishna,
Oier Mees,
Suraj Nair,
Seohong Park,
Blake Wulfe,
Masha Itkina,
Benjamin Eysenbach,
Sergey Levine,
Thomas Kollar,
Benjamin Burchfiel
Abstract:
Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models…
▽ More
Image and video generative models that are pre-trained on Internet-scale data can greatly increase the generalization capacity of robot learning systems. These models can function as high-level planners, generating intermediate subgoals for low-level goal-conditioned policies to reach. However, the performance of these systems can be greatly bottlenecked by the interface between generative models and low-level controllers. For example, generative models may predict photorealistic yet physically infeasible frames that confuse low-level policies. Low-level policies may also be sensitive to subtle visual artifacts in generated goal images. This paper addresses these two facets of generalization, providing an interface to effectively "glue together" language-conditioned image or video prediction models with low-level goal-conditioned policies. Our method, Generative Hierarchical Imitation Learning-Glue (GHIL-Glue), filters out subgoals that do not lead to task progress and improves the robustness of goal-conditioned policies to generated subgoals with harmful visual artifacts. We find in extensive experiments in both simulated and real environments that GHIL-Glue achieves a 25% improvement across several hierarchical models that leverage generative subgoals, achieving a new state-of-the-art on the CALVIN simulation benchmark for policies using observations from a single RGB camera. GHIL-Glue also outperforms other generalist robot policies across 3/4 language-conditioned manipulation tasks testing zero-shot generalization in physical experiments.
△ Less
Submitted 25 October, 2024;
originally announced October 2024.
-
Self-supervised Multi-future Occupancy Forecasting for Autonomous Driving
Authors:
Bernard Lange,
Masha Itkina,
Jiachen Li,
Mykel J. Kochenderfer
Abstract:
Environment prediction frameworks are critical for the safe navigation of autonomous vehicles (AVs) in dynamic settings. LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye view for the scene representation, enabling self-supervised joint scene predictions while exhibiting resilience to partial observability and perception detection failures. Prior approaches have focused on det…
▽ More
Environment prediction frameworks are critical for the safe navigation of autonomous vehicles (AVs) in dynamic settings. LiDAR-generated occupancy grid maps (L-OGMs) offer a robust bird's-eye view for the scene representation, enabling self-supervised joint scene predictions while exhibiting resilience to partial observability and perception detection failures. Prior approaches have focused on deterministic L-OGM prediction architectures within the grid cell space. While these methods have seen some success, they frequently produce unrealistic predictions and fail to capture the stochastic nature of the environment. Additionally, they do not effectively integrate additional sensor modalities present in AVs. Our proposed framework, Latent Occupancy Prediction (LOPR), performs stochastic L-OGM prediction in the latent space of a generative architecture and allows for conditioning on RGB cameras, maps, and planned trajectories. We decode predictions using either a single-step decoder, which provides high-quality predictions in real-time, or a diffusion-based batch decoder, which can further refine the decoded frames to address temporal consistency issues and reduce compression losses. Our experiments on the nuScenes and Waymo Open datasets show that all variants of our approach qualitatively and quantitatively outperform prior approaches.
△ Less
Submitted 22 May, 2025; v1 submitted 30 July, 2024;
originally announced July 2024.
-
How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation
Authors:
Joseph A. Vincent,
Haruki Nishimura,
Masha Itkina,
Paarth Shah,
Mac Schwager,
Thomas Kollar
Abstract:
With the rise of stochastic generative models in robot policy learning, end-to-end visuomotor policies are increasingly successful at solving complex tasks by learning from human demonstrations. Nevertheless, since real-world evaluation costs afford users only a small number of policy rollouts, it remains a challenge to accurately gauge the performance of such policies. This is exacerbated by dist…
▽ More
With the rise of stochastic generative models in robot policy learning, end-to-end visuomotor policies are increasingly successful at solving complex tasks by learning from human demonstrations. Nevertheless, since real-world evaluation costs afford users only a small number of policy rollouts, it remains a challenge to accurately gauge the performance of such policies. This is exacerbated by distribution shifts causing unpredictable changes in performance during deployment. To rigorously evaluate behavior cloning policies, we present a framework that provides a tight lower-bound on robot performance in an arbitrary environment, using a minimal number of experimental policy rollouts. Notably, by applying the standard stochastic ordering to robot performance distributions, we provide a worst-case bound on the entire distribution of performance (via bounds on the cumulative distribution function) for a given task. We build upon established statistical results to ensure that the bounds hold with a user-specified confidence level and tightness, and are constructed from as few policy rollouts as possible. In experiments we evaluate policies for visuomotor manipulation in both simulation and hardware. Specifically, we (i) empirically validate the guarantees of the bounds in simulated manipulation settings, (ii) find the degree to which a learned policy deployed on hardware generalizes to new real-world environments, and (iii) rigorously compare two policies tested in out-of-distribution settings. Our experimental data, code, and implementation of confidence bounds are open-source.
△ Less
Submitted 18 July, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Explore until Confident: Efficient Exploration for Embodied Question Answering
Authors:
Allen Z. Ren,
Jaden Clark,
Anushri Dixit,
Masha Itkina,
Anirudha Majumdar,
Dorsa Sadigh
Abstract:
We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions…
▽ More
We consider the problem of Embodied Question Answering (EQA), which refers to settings where an embodied agent such as a robot needs to actively explore an environment to gather information until it is confident about the answer to a question. In this work, we leverage the strong semantic reasoning capabilities of large vision-language models (VLMs) to efficiently explore and answer such questions. However, there are two main challenges when using VLMs in EQA: they do not have an internal memory for mapping the scene to be able to plan how to explore over time, and their confidence can be miscalibrated and can cause the robot to prematurely stop exploration or over-explore. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM - leveraging its vast knowledge of relevant regions of the scene for exploration. Next, we use conformal prediction to calibrate the VLM's question answering confidence, allowing the robot to know when to stop exploration - leading to a more calibrated and efficient exploration strategy. To test our framework in simulation, we also contribute a new EQA dataset with diverse, realistic human-robot scenarios and scenes built upon the Habitat-Matterport 3D Research Dataset (HM3D). Both simulated and real robot experiments show our proposed approach improves the performance and efficiency over baselines that do no leverage VLM for exploration or do not calibrate its confidence. Webpage with experiment videos and code: https://explore-eqa.github.io/
△ Less
Submitted 7 July, 2024; v1 submitted 23 March, 2024;
originally announced March 2024.
-
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Authors:
Alexander Khazatsky,
Karl Pertsch,
Suraj Nair,
Ashwin Balakrishna,
Sudeep Dasari,
Siddharth Karamcheti,
Soroush Nasiriany,
Mohan Kumar Srirama,
Lawrence Yunliang Chen,
Kirsty Ellis,
Peter David Fagan,
Joey Hejna,
Masha Itkina,
Marion Lepert,
Yecheng Jason Ma,
Patrick Tree Miller,
Jimmy Wu,
Suneel Belkhale,
Shivin Dass,
Huy Ha,
Arhan Jain,
Abraham Lee,
Youngwoon Lee,
Marius Memmel,
Sungjae Park
, et al. (76 additional authors not shown)
Abstract:
The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a resu…
▽ More
The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.
△ Less
Submitted 22 April, 2025; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Authors:
Open X-Embodiment Collaboration,
Abby O'Neill,
Abdul Rehman,
Abhinav Gupta,
Abhiram Maddukuri,
Abhishek Gupta,
Abhishek Padalkar,
Abraham Lee,
Acorn Pooley,
Agrim Gupta,
Ajay Mandlekar,
Ajinkya Jain,
Albert Tung,
Alex Bewley,
Alex Herzog,
Alex Irpan,
Alexander Khazatsky,
Anant Rai,
Anchit Gupta,
Andrew Wang,
Andrey Kolobov,
Anikait Singh,
Animesh Garg,
Aniruddha Kembhavi,
Annie Xie
, et al. (269 additional authors not shown)
Abstract:
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning method…
▽ More
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
△ Less
Submitted 14 May, 2025; v1 submitted 13 October, 2023;
originally announced October 2023.
-
Interpretable Self-Aware Neural Networks for Robust Trajectory Prediction
Authors:
Masha Itkina,
Mykel J. Kochenderfer
Abstract:
Although neural networks have seen tremendous success as predictive models in a variety of domains, they can be overly confident in their predictions on out-of-distribution (OOD) data. To be viable for safety-critical applications, like autonomous vehicles, neural networks must accurately estimate their epistemic or model uncertainty, achieving a level of system self-awareness. Techniques for epis…
▽ More
Although neural networks have seen tremendous success as predictive models in a variety of domains, they can be overly confident in their predictions on out-of-distribution (OOD) data. To be viable for safety-critical applications, like autonomous vehicles, neural networks must accurately estimate their epistemic or model uncertainty, achieving a level of system self-awareness. Techniques for epistemic uncertainty quantification often require OOD data during training or multiple neural network forward passes during inference. These approaches may not be suitable for real-time performance on high-dimensional inputs. Furthermore, existing methods lack interpretability of the estimated uncertainty, which limits their usefulness both to engineers for further system development and to downstream modules in the autonomy stack. We propose the use of evidential deep learning to estimate the epistemic uncertainty over a low-dimensional, interpretable latent space in a trajectory prediction setting. We introduce an interpretable paradigm for trajectory prediction that distributes the uncertainty among the semantic concepts: past agent behavior, road structure, and social context. We validate our approach on real-world autonomous driving data, demonstrating superior performance over state-of-the-art baselines. Our code is available at: https://github.com/sisl/InterpretableSelfAwarePrediction.
△ Less
Submitted 16 November, 2022;
originally announced November 2022.
-
LOPR: Latent Occupancy PRediction using Generative Models
Authors:
Bernard Lange,
Masha Itkina,
Mykel J. Kochenderfer
Abstract:
Environment prediction frameworks are integral for autonomous vehicles, enabling safe navigation in dynamic environments. LiDAR generated occupancy grid maps (L-OGMs) offer a robust bird's eye-view scene representation that facilitates joint scene predictions without relying on manual labeling unlike commonly used trajectory prediction frameworks. Prior approaches have optimized deterministic L-OG…
▽ More
Environment prediction frameworks are integral for autonomous vehicles, enabling safe navigation in dynamic environments. LiDAR generated occupancy grid maps (L-OGMs) offer a robust bird's eye-view scene representation that facilitates joint scene predictions without relying on manual labeling unlike commonly used trajectory prediction frameworks. Prior approaches have optimized deterministic L-OGM prediction architectures directly in grid cell space. While these methods have achieved some degree of success in prediction, they occasionally grapple with unrealistic and incorrect predictions. We claim that the quality and realism of the forecasted occupancy grids can be enhanced with the use of generative models. We propose a framework that decouples occupancy prediction into: representation learning and stochastic prediction within the learned latent space. Our approach allows for conditioning the model on other available sensor modalities such as RGB-cameras and high definition maps. We demonstrate that our approach achieves state-of-the-art performance and is readily transferable between different robotic platforms on the real-world NuScenes, Waymo Open, and a custom dataset we collected on an experimental vehicle platform.
△ Less
Submitted 24 August, 2023; v1 submitted 3 October, 2022;
originally announced October 2022.
-
Occlusion-Aware Crowd Navigation Using People as Sensors
Authors:
Ye-Ji Mun,
Masha Itkina,
Shuijing Liu,
Katherine Driggs-Campbell
Abstract:
Autonomous navigation in crowded spaces poses a challenge for mobile robots due to the highly dynamic, partially observable environment. Occlusions are highly prevalent in such settings due to a limited sensor field of view and obstructing human agents. Previous work has shown that observed interactive behaviors of human agents can be used to estimate potential obstacles despite occlusions. We pro…
▽ More
Autonomous navigation in crowded spaces poses a challenge for mobile robots due to the highly dynamic, partially observable environment. Occlusions are highly prevalent in such settings due to a limited sensor field of view and obstructing human agents. Previous work has shown that observed interactive behaviors of human agents can be used to estimate potential obstacles despite occlusions. We propose integrating such social inference techniques into the planning pipeline. We use a variational autoencoder with a specially designed loss function to learn representations that are meaningful for occlusion inference. This work adopts a deep reinforcement learning approach to incorporate the learned representation for occlusion-aware planning. In simulation, our occlusion-aware policy achieves comparable collision avoidance performance to fully observable navigation by estimating agents in occluded spaces. We demonstrate successful policy transfer from simulation to the real-world Turtlebot 2i. To the best of our knowledge, this work is the first to use social occlusion inference for crowd navigation.
△ Less
Submitted 28 April, 2023; v1 submitted 2 October, 2022;
originally announced October 2022.
-
How Do We Fail? Stress Testing Perception in Autonomous Vehicles
Authors:
Harrison Delecki,
Masha Itkina,
Bernard Lange,
Ransalu Senanayake,
Mykel J. Kochenderfer
Abstract:
Autonomous vehicles (AVs) rely on environment perception and behavior prediction to reason about agents in their surroundings. These perception systems must be robust to adverse weather such as rain, fog, and snow. However, validation of these systems is challenging due to their complexity and dependence on observation histories. This paper presents a method for characterizing failures of LiDAR-ba…
▽ More
Autonomous vehicles (AVs) rely on environment perception and behavior prediction to reason about agents in their surroundings. These perception systems must be robust to adverse weather such as rain, fog, and snow. However, validation of these systems is challenging due to their complexity and dependence on observation histories. This paper presents a method for characterizing failures of LiDAR-based perception systems for AVs in adverse weather conditions. We develop a methodology based in reinforcement learning to find likely failures in object tracking and trajectory prediction due to sequences of disturbances. We apply disturbances using a physics-based data augmentation technique for simulating LiDAR point clouds in adverse weather conditions. Experiments performed across a wide range of driving scenarios from a real-world driving dataset show that our proposed approach finds high likelihood failures with smaller input disturbances compared to baselines while remaining computationally tractable. Identified failures can inform future development of robust perception systems for AVs.
△ Less
Submitted 26 March, 2022;
originally announced March 2022.
-
Evidential Softmax for Sparse Multimodal Distributions in Deep Generative Models
Authors:
Phil Chen,
Masha Itkina,
Ransalu Senanayake,
Mykel J. Kochenderfer
Abstract:
Many applications of generative models rely on the marginalization of their high-dimensional output probability distributions. Normalization functions that yield sparse probability distributions can make exact marginalization more computationally tractable. However, sparse normalization functions usually require alternative loss functions for training since the log-likelihood is undefined for spar…
▽ More
Many applications of generative models rely on the marginalization of their high-dimensional output probability distributions. Normalization functions that yield sparse probability distributions can make exact marginalization more computationally tractable. However, sparse normalization functions usually require alternative loss functions for training since the log-likelihood is undefined for sparse probability distributions. Furthermore, many sparse normalization functions often collapse the multimodality of distributions. In this work, we present $\textit{ev-softmax}$, a sparse normalization function that preserves the multimodality of probability distributions. We derive its properties, including its gradient in closed-form, and introduce a continuous family of approximations to $\textit{ev-softmax}$ that have full support and can be trained with probabilistic loss functions such as negative log-likelihood and Kullback-Leibler divergence. We evaluate our method on a variety of generative models, including variational autoencoders and auto-regressive architectures. Our method outperforms existing dense and sparse normalization techniques in distributional accuracy. We demonstrate that $\textit{ev-softmax}$ successfully reduces the dimensionality of probability distributions while maintaining multimodality.
△ Less
Submitted 27 October, 2021;
originally announced October 2021.
-
Multi-Agent Variational Occlusion Inference Using People as Sensors
Authors:
Masha Itkina,
Ye-Ji Mun,
Katherine Driggs-Campbell,
Mykel J. Kochenderfer
Abstract:
Autonomous vehicles must reason about spatial occlusions in urban environments to ensure safety without being overly cautious. Prior work explored occlusion inference from observed social behaviors of road agents, hence treating people as sensors. Inferring occupancy from agent behaviors is an inherently multimodal problem; a driver may behave similarly for different occupancy patterns ahead of th…
▽ More
Autonomous vehicles must reason about spatial occlusions in urban environments to ensure safety without being overly cautious. Prior work explored occlusion inference from observed social behaviors of road agents, hence treating people as sensors. Inferring occupancy from agent behaviors is an inherently multimodal problem; a driver may behave similarly for different occupancy patterns ahead of them (e.g., a driver may move at constant speed in traffic or on an open road). Past work, however, does not account for this multimodality, thus neglecting to model this source of aleatoric uncertainty in the relationship between driver behaviors and their environment. We propose an occlusion inference method that characterizes observed behaviors of human agents as sensor measurements, and fuses them with those from a standard sensor suite. To capture the aleatoric uncertainty, we train a conditional variational autoencoder with a discrete latent space to learn a multimodal mapping from observed driver trajectories to an occupancy grid representation of the view ahead of the driver. Our method handles multi-agent scenarios, combining measurements from multiple observed drivers using evidential theory to solve the sensor fusion problem. Our approach is validated on a cluttered, real-world intersection, outperforming baselines and demonstrating real-time capable performance. Our code is available at https://github.com/sisl/MultiAgentVariationalOcclusionInference .
△ Less
Submitted 2 March, 2022; v1 submitted 5 September, 2021;
originally announced September 2021.
-
Double-Prong ConvLSTM for Spatiotemporal Occupancy Prediction in Dynamic Environments
Authors:
Maneekwan Toyungyernsub,
Masha Itkina,
Ransalu Senanayake,
Mykel J. Kochenderfer
Abstract:
Predicting the future occupancy state of an environment is important to enable informed decisions for autonomous vehicles. Common challenges in occupancy prediction include vanishing dynamic objects and blurred predictions, especially for long prediction horizons. In this work, we propose a double-prong neural network architecture to predict the spatiotemporal evolution of the occupancy state. One…
▽ More
Predicting the future occupancy state of an environment is important to enable informed decisions for autonomous vehicles. Common challenges in occupancy prediction include vanishing dynamic objects and blurred predictions, especially for long prediction horizons. In this work, we propose a double-prong neural network architecture to predict the spatiotemporal evolution of the occupancy state. One prong is dedicated to predicting how the static environment will be observed by the moving ego vehicle. The other prong predicts how the dynamic objects in the environment will move. Experiments conducted on the real-world Waymo Open Dataset indicate that the fused output of the two prongs is capable of retaining dynamic objects and reducing blurriness in the predictions for longer time horizons than baseline models.
△ Less
Submitted 27 September, 2022; v1 submitted 17 November, 2020;
originally announced November 2020.
-
Out-of-Distribution Detection for Automotive Perception
Authors:
Julia Nitsch,
Masha Itkina,
Ransalu Senanayake,
Juan Nieto,
Max Schmidt,
Roland Siegwart,
Mykel J. Kochenderfer,
Cesar Cadena
Abstract:
Neural networks (NNs) are widely used for object classification in autonomous driving. However, NNs can fail on input data not well represented by the training dataset, known as out-of-distribution (OOD) data. A mechanism to detect OOD samples is important for safety-critical applications, such as automotive perception, to trigger a safe fallback mode. NNs often rely on softmax normalization for c…
▽ More
Neural networks (NNs) are widely used for object classification in autonomous driving. However, NNs can fail on input data not well represented by the training dataset, known as out-of-distribution (OOD) data. A mechanism to detect OOD samples is important for safety-critical applications, such as automotive perception, to trigger a safe fallback mode. NNs often rely on softmax normalization for confidence estimation, which can lead to high confidences being assigned to OOD samples, thus hindering the detection of failures. This paper presents a method for determining whether inputs are OOD, which does not require OOD data during training and does not increase the computational cost of inference. The latter property is especially important in automotive applications with limited computational resources and real-time constraints. Our proposed approach outperforms state-of-the-art methods on real-world automotive datasets.
△ Less
Submitted 5 September, 2021; v1 submitted 2 November, 2020;
originally announced November 2020.
-
Attention Augmented ConvLSTM for Environment Prediction
Authors:
Bernard Lange,
Masha Itkina,
Mykel J. Kochenderfer
Abstract:
Safe and proactive planning in robotic systems generally requires accurate predictions of the environment. Prior work on environment prediction applied video frame prediction techniques to bird's-eye view environment representations, such as occupancy grids. ConvLSTM-based frameworks used previously often result in significant blurring and vanishing of moving objects, thus hindering their applicab…
▽ More
Safe and proactive planning in robotic systems generally requires accurate predictions of the environment. Prior work on environment prediction applied video frame prediction techniques to bird's-eye view environment representations, such as occupancy grids. ConvLSTM-based frameworks used previously often result in significant blurring and vanishing of moving objects, thus hindering their applicability for use in safety-critical applications. In this work, we propose two extensions to the ConvLSTM to address these issues. We present the Temporal Attention Augmented ConvLSTM (TAAConvLSTM) and Self-Attention Augmented ConvLSTM (SAAConvLSTM) frameworks for spatiotemporal occupancy prediction, and demonstrate improved performance over baseline architectures on the real-world KITTI and Waymo datasets.
△ Less
Submitted 10 September, 2021; v1 submitted 19 October, 2020;
originally announced October 2020.
-
Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders
Authors:
Masha Itkina,
Boris Ivanovic,
Ransalu Senanayake,
Mykel J. Kochenderfer,
Marco Pavone
Abstract:
Discrete latent spaces in variational autoencoders have been shown to effectively capture the data distribution for many real-world problems such as natural language understanding, human intent prediction, and visual scene representation. However, discrete latent spaces need to be sufficiently large to capture the complexities of real-world data, rendering downstream tasks computationally challeng…
▽ More
Discrete latent spaces in variational autoencoders have been shown to effectively capture the data distribution for many real-world problems such as natural language understanding, human intent prediction, and visual scene representation. However, discrete latent spaces need to be sufficiently large to capture the complexities of real-world data, rendering downstream tasks computationally challenging. For instance, performing motion planning in a high-dimensional latent representation of the environment could be intractable. We consider the problem of sparsifying the discrete latent space of a trained conditional variational autoencoder, while preserving its learned multimodality. As a post hoc latent space reduction technique, we use evidential theory to identify the latent classes that receive direct evidence from a particular input condition and filter out those that do not. Experiments on diverse tasks, such as image generation and human behavior prediction, demonstrate the effectiveness of our proposed technique at reducing the discrete latent sample space size of a model while maintaining its learned multimodality.
△ Less
Submitted 18 January, 2021; v1 submitted 18 October, 2020;
originally announced October 2020.
-
Dynamic Environment Prediction in Urban Scenes using Recurrent Representation Learning
Authors:
Masha Itkina,
Katherine Driggs-Campbell,
Mykel J. Kochenderfer
Abstract:
A key challenge for autonomous driving is safe trajectory planning in cluttered, urban environments with dynamic obstacles, such as pedestrians, bicyclists, and other vehicles. A reliable prediction of the future environment, including the behavior of dynamic agents, would allow planning algorithms to proactively generate a trajectory in response to a rapidly changing environment. We present a nov…
▽ More
A key challenge for autonomous driving is safe trajectory planning in cluttered, urban environments with dynamic obstacles, such as pedestrians, bicyclists, and other vehicles. A reliable prediction of the future environment, including the behavior of dynamic agents, would allow planning algorithms to proactively generate a trajectory in response to a rapidly changing environment. We present a novel framework that predicts the future occupancy state of the local environment surrounding an autonomous agent by learning a motion model from occupancy grid data using a neural network. We take advantage of the temporal structure of the grid data by utilizing a convolutional long-short term memory network in the form of the PredNet architecture. This method is validated on the KITTI dataset and demonstrates higher accuracy and better predictive power than baseline methods.
△ Less
Submitted 18 August, 2019; v1 submitted 28 April, 2019;
originally announced April 2019.