Search | arXiv e-print repository

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

Authors: Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, Jonathan Tremblay, Kanav Arora, Kirsty Ellis, Luca Macesanu, Matthew Leonard, Meedeum Cho, Ozgur Aslan, Shivin Dass, Jie Wang, Xingfang Yuan, Xuning Yang, Abhishek Gupta, Dinesh Jayaraman, Glen Berseth, Kostas Daniilidis, et al. (5 additional authors not shown)

Abstract: Comprehensive, unbiased, and comparable evaluation of modern generalist policies is uniquely challenging: existing approaches for robot benchmarking typically rely on heavy standardization, either by specifying fixed evaluation tasks and environments, or by hosting centralized "robot challenges", and do not readily scale to evaluating generalist policies across a broad range of tasks and environments. In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world. Instead of standardizing evaluations around fixed tasks, environments, or locations, we propose to crowd-source evaluations across a distributed network of evaluators. Importantly, evaluators can freely choose the tasks and environments they evaluate on, enabling easy scaling of diversity, but they are required to perform double-blind evaluations over pairs of policies. Then, by aggregating preference feedback from pairwise comparisons across diverse tasks and environments, we can derive a ranking of policies. We instantiate our approach across a network of evaluators at seven academic institutions using the DROID robot platform. Through more than 600 pairwise real-robot evaluation episodes across seven generalist policies, we demonstrate that our crowd-sourced approach can more accurately rank the performance of existing generalist policies than conventional, centralized evaluation approaches, while being more scalable, resilient, and trustworthy. We open our evaluation network to the community and hope that it can enable more accessible comparisons of generalist robot policies.

Submitted 22 June, 2025; originally announced June 2025.

Comments: Website: https://robo-arena.github.io/
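
The RoboArena abstract describes deriving a policy ranking by aggregating pairwise preferences. The abstract does not say which aggregation model is used, so the sketch below assumes a standard Bradley-Terry model fitted with the classic minorization-maximization update. The function name, the policy labels, and the comparison data are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Assumption: this is a generic aggregation model chosen for
    illustration, not necessarily the one used by RoboArena.
    """
    policies = sorted({p for pair in comparisons for p in pair})
    wins = defaultdict(int)    # total wins per policy
    games = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        games[tuple(sorted((winner, loser)))] += 1

    strength = {p: 1.0 for p in policies}
    for _ in range(iters):
        updated = {}
        for p in policies:
            denom = 0.0
            for q in policies:
                if q == p:
                    continue
                n = games[tuple(sorted((p, q)))]
                if n:
                    denom += n / (strength[p] + strength[q])
            updated[p] = wins[p] / denom if denom else strength[p]
        # Rescale so the strengths keep a fixed mean of 1.
        total = sum(updated.values())
        strength = {p: s * len(policies) / total for p, s in updated.items()}
    return strength


# Hypothetical double-blind A/B outcomes for three policies.
comparisons = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
strengths = bradley_terry(comparisons)
ranking = sorted(strengths, key=strengths.get, reverse=True)
print(ranking)
```

Because the model only needs win/loss outcomes for pairs, evaluators can use different tasks and environments and still contribute to one shared ranking.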

arXiv:2505.03402 [pdf, ps, other]

Fabrication-tolerant frequency conversion in thin film lithium niobate waveguide with layer-poled modal phase matching

Authors: O. Hefti, J. -E. Tremblay, A. Volpini, Y. Koyaz, I. Prieto, O. Dubochet, M. Despont, H. Zarebidaki, C. Caër, J. Berney, S. Lecomte, H. Sattari, C. -S. Brès, D. Grassani

Abstract: Thanks to its high quadratic nonlinear susceptibility and low propagation losses, thin film lithium niobate (TFLN) on insulator is an ideal platform for laser frequency conversion and generation of quantum states of light. Frequency conversion is usually achieved by quasi-phase matching (QPM) via electric-field poling. However, this scheme shows very high sensitivity to the dimensions of the waveguide, poling period and duty cycle, resulting in a lack of repeatability of the phase matched wavelength and efficiency, which in turn limits the spread of TFLN frequency converters in complex circuits and hinders wafer-scale production. Here we propose a layer-poled modal phase matching (MPM) scheme that is 5 to 10 times more robust towards fabrication uncertainties and theoretically more efficient than conventional QPM. By selectively poling the bottom part of the waveguide all along its length, second harmonic is efficiently generated on a higher-order waveguide mode. We validate this approach by poling TFLN waveguides as a post-process after the fabrication in a foundry process. We perform a tolerance analysis and compare the experimental results with a conventional QPM second harmonic generation process on the same waveguides. Then, we show how MPM can be exploited to obtain efficient intraband frequency conversion processes at telecom wavelengths by leveraging simultaneous second harmonic and difference frequency generation in the same waveguide.

Submitted 6 May, 2025; originally announced May 2025.

Comments: 12, 7, submitted to APL Photonics

arXiv:2504.02812 [pdf, other]

BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation

Authors: Van Nguyen Nguyen, Stephen Tyree, Andrew Guo, Mederic Fourmy, Anas Gouda, Taeyeop Lee, Sungphill Moon, Hyeontae Son, Lukas Ranftl, Jonathan Tremblay, Eric Brachmann, Bertram Drost, Vincent Lepetit, Carsten Rother, Stan Birchfield, Jiri Matas, Yann Labbe, Martin Sundermeyer, Tomas Hodan

Abstract: We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the 6th in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods need to onboard objects just from provided reference videos. Second, we defined a new, more practical 6D object detection task where identities of objects visible in a test image are not provided as input. Third, we introduced new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets, closely resembling real-world scenarios. BOP-H3 include 3D models and onboarding videos to support both model-based and model-free tasks. Participants competed on seven challenge tracks. Notably, the best 2024 method for model-based 6D localization of unseen objects (FreeZeV2.1) achieves 22% higher accuracy on BOP-Classic-Core than the best 2023 method (GenFlow), and is only 4% behind the best 2023 method for seen objects (GPose2023) despite being significantly slower (24.9 vs 2.7s per image). A more practical 2024 method for this task is Co-op which takes only 0.8s per image and is 13% more accurate than GenFlow. Methods have similar rankings on 6D detection as on 6D localization but higher run time. On model-based 2D detection of unseen objects, the best 2024 method (MUSE) achieves 21–29% relative improvement compared to the best 2023 method (CNOS). However, the 2D detection accuracy for unseen objects is still 35% behind the accuracy for seen objects (GDet2023), and the 2D detection stage is consequently the main bottleneck of existing pipelines for 6D localization/detection of unseen objects. The online evaluation system stays open and is available at http://bop.felk.cvut.cz/

Submitted 23 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

Comments: arXiv admin note: text overlap with arXiv:2403.09799

arXiv:2412.08445 [pdf, other]

TapeAgents: a Holistic Framework for Agent Development and Optimization

Authors: Dzmitry Bahdanau, Nicolas Gontier, Gabriel Huang, Ehsan Kamalloo, Rafael Pardinas, Alex Piché, Torsten Scholak, Oleh Shliazhko, Jordan Prince Tremblay, Karam Ghanem, Soham Parikh, Mitul Tiwari, Quaizar Vohra

Abstract: We present TapeAgents, an agent framework built around a granular, structured log tape of the agent session that also plays the role of the session's resumable state. In TapeAgents we leverage tapes to facilitate all stages of the LLM Agent development lifecycle. The agent reasons by processing the tape and the LLM output to produce new thought and action steps and append them to the tape. The environment then reacts to the agent's actions by likewise appending observation steps to the tape. By virtue of this tape-centred design, TapeAgents can provide AI practitioners with holistic end-to-end support. At the development stage, tapes facilitate session persistence, agent auditing, and step-by-step debugging. Post-deployment, one can reuse tapes for evaluation, fine-tuning, and prompt-tuning; crucially, one can adapt tapes from other agents or use revised historical tapes. In this report, we explain the TapeAgents design in detail. We demonstrate possible applications of TapeAgents with several concrete examples of building monolithic agents and multi-agent teams, of optimizing agent prompts and finetuning the agent's LLM. We present tooling prototypes and report a case study where we use TapeAgents to finetune a Llama-3.1-8B form-filling assistant to perform as well as GPT-4o while being orders of magnitude cheaper. Lastly, our comparative analysis shows that TapeAgents's advantages over prior frameworks stem from our novel design of the LLM agent as a resumable, modular state machine with a structured configuration, that generates granular, structured logs and that can transform these logs into training text -- a unique combination of features absent in previous work.

Submitted 11 December, 2024; originally announced December 2024.
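
The TapeAgents abstract describes a tape as an append-only structured log that doubles as the resumable session state. The sketch below illustrates that idea with a minimal data structure. It does not use the TapeAgents library; every class and function name here (`Step`, `Tape`, `agent_turn`, `environment_turn`) is hypothetical, and the "agent" is a stub that stands in for an LLM call.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Step:
    kind: str      # "thought", "action", or "observation"
    content: str


@dataclass
class Tape:
    steps: list = field(default_factory=list)

    def append(self, step):
        self.steps.append(step)

    def to_json(self):
        # Serialising the tape is enough to persist the whole session.
        return json.dumps([asdict(s) for s in self.steps])

    @classmethod
    def from_json(cls, text):
        return cls([Step(**d) for d in json.loads(text)])


def agent_turn(tape):
    """Stub agent: reads the tape, appends a thought and an action."""
    last = tape.steps[-1].content if tape.steps else ""
    tape.append(Step("thought", f"respond to: {last}"))
    tape.append(Step("action", "echo"))


def environment_turn(tape):
    """Stub environment: reacts to the last action with an observation."""
    last_action = tape.steps[-1]
    tape.append(Step("observation", f"result of {last_action.content}"))


tape = Tape()
tape.append(Step("observation", "hello"))
agent_turn(tape)
environment_turn(tape)

# Resuming a session is just reloading the serialised tape.
restored = Tape.from_json(tape.to_json())
print([s.kind for s in restored.steps])
```

Since the tape holds the full state, the same serialised log can in principle be replayed for debugging or reused as evaluation or training data, which is the property the abstract emphasises.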

arXiv:2411.16537 [pdf, other]

RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Authors: Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield

Abstract: Spatial understanding is a crucial capability that enables robots to perceive their surroundings, reason about their environment, and interact with it meaningfully. In modern robotics, these capabilities are increasingly provided by vision-language models. However, these models face significant challenges in spatial reasoning tasks, as their training data are based on general-purpose image datasets that often lack sophisticated spatial understanding. For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5k 3D scans, and 3M annotated spatial relationships, and the pairing of 2D egocentric images with 3D scans makes it both 2D- and 3D-ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.

Submitted 5 April, 2025; v1 submitted 25 November, 2024; originally announced November 2024.

Comments: CVPR 2025 (Oral); Project Website: https://chanh.ee/RoboSpatial

arXiv:2410.20220 [pdf, other]

Neural Fields in Robotics: A Survey

Authors: Muhammad Zubair Irshad, Mauro Comi, Yen-Chen Lin, Nick Heppert, Abhinav Valada, Rares Ambrus, Zsolt Kira, Jonathan Tremblay

Abstract: Neural Fields have emerged as a transformative approach for 3D scene representation in computer vision and robotics, enabling accurate inference of geometry, 3D semantics, and dynamics from posed 2D data. Leveraging differentiable rendering, Neural Fields encompass both continuous implicit and explicit neural representations enabling high-fidelity 3D reconstruction, integration of multi-modal sensor data, and generation of novel viewpoints. This survey explores their applications in robotics, emphasizing their potential to enhance perception, planning, and control. Their compactness, memory efficiency, and differentiability, along with seamless integration with foundation and generative models, make them ideal for real-time applications, improving robot adaptability and decision-making. This paper provides a thorough review of Neural Fields in robotics, categorizing applications across various domains and evaluating their strengths and limitations, based on over 200 papers. First, we present four key Neural Fields frameworks: Occupancy Networks, Signed Distance Fields, Neural Radiance Fields, and Gaussian Splatting. Second, we detail Neural Fields' applications in five major robotics domains: pose estimation, manipulation, navigation, physics, and autonomous driving, highlighting key works and discussing takeaways and open challenges. Finally, we outline the current limitations of Neural Fields in robotics and propose promising directions for future research. Project page: https://robonerf.github.io

Submitted 26 October, 2024; originally announced October 2024.

Comments: 20 pages, 20 figures. Project Page: https://robonerf.github.io

arXiv:2410.15536 [pdf, other]

GRS: Generating Robotic Simulation Tasks from Real-World Images

Authors: Alex Zook, Fan-Yun Sun, Josef Spjut, Valts Blukis, Stan Birchfield, Jonathan Tremblay

Abstract: We introduce GRS (Generating Robotic Simulation tasks), a system addressing real-to-sim for robotic simulations. GRS creates digital twin simulations from single RGB-D observations with solvable tasks for virtual agent training. Using vision-language models (VLMs), our pipeline operates in three stages: 1) scene comprehension with SAM2 for segmentation and object description, 2) matching objects with simulation-ready assets, and 3) generating appropriate tasks. We ensure simulation-task alignment through generated test suites and introduce a router that iteratively refines both simulation and test code. Experiments demonstrate our system's effectiveness in object correspondence and task environment generation through our novel router mechanism.

Submitted 4 April, 2025; v1 submitted 20 October, 2024; originally announced October 2024.

arXiv:2410.01925 [pdf, other]

Topological mapping for traversability-aware long-range navigation in off-road terrain

Authors: Jean-François Tremblay, Julie Alhosh, Louis Petit, Faraz Lotfi, Lara Landauro, David Meger

Abstract: Autonomous robots navigating in off-road terrain like forests open new opportunities for automation. While off-road navigation has been studied, existing work often relies on clearly delineated pathways. We present a method allowing for long-range planning, exploration and low-level control in unknown off-trail forest terrain, using vision and GPS only. We represent outdoor terrain with a topological map, which is a set of panoramic snapshots connected with edges containing traversability information. A novel traversability analysis method is demonstrated, predicting the existence of a safe path towards a target in an image. Navigating between nodes is done using goal-conditioned behavior cloning, leveraging the power of a pretrained vision transformer. An exploration planner is presented, efficiently covering an unknown off-road area with unknown traversability using a frontiers-based approach. The approach is successfully deployed to autonomously explore two 400 meters squared forest sites unseen during training, in difficult conditions for navigation.

Submitted 2 October, 2024; originally announced October 2024.
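
The off-road navigation abstract describes a topological map whose edges carry traversability information. The sketch below shows one way such a map could be used for long-range planning: edges below a traversability threshold are dropped, then Dijkstra's algorithm finds the shortest remaining route. The edge tuple layout, the threshold, and the node names are assumptions for illustration; the abstract does not specify the planner.

```python
import heapq


def plan_route(edges, start, goal, min_traversability=0.5):
    """Shortest path over edges whose traversability clears a threshold.

    edges: iterable of (node_a, node_b, length_m, traversability in [0, 1]).
    Returns (total_length_m, [nodes...]) or None if the goal is unreachable.
    """
    graph = {}
    for a, b, length_m, traversability in edges:
        if traversability < min_traversability:
            continue  # treat low-traversability edges as blocked
        graph.setdefault(a, []).append((b, length_m))
        graph.setdefault(b, []).append((a, length_m))

    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, length_m in graph.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + length_m, neighbour, path + [neighbour]))
    return None


# Hypothetical map: the direct n1-n2 edge is short but barely traversable.
edges = [
    ("n0", "n1", 10, 0.9),
    ("n1", "n2", 10, 0.2),
    ("n1", "n3", 12, 0.8),
    ("n3", "n2", 9, 0.7),
]
route = plan_route(edges, "n0", "n2")
print(route)
```

With the default threshold the planner detours through n3 instead of taking the shorter but poorly traversable n1-n2 edge.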

arXiv:2409.17652 [pdf, other]

FactorSim: Generative Simulation via Factorized Representation

Authors: Fan-Yun Sun, S. I. Harini, Angela Yi, Yihan Zhou, Alex Zook, Jonathan Tremblay, Logan Cross, Jiajun Wu, Nick Haber

Abstract: Generating simulations to train intelligent agents in game-playing and robotics from natural language input, from user input or task documentation, remains an open-ended challenge. Existing approaches focus on parts of this challenge, such as generating reward functions or task hyperparameters. Unlike previous work, we introduce FACTORSIM that generates full simulations in code from language input that can be used to train agents. Exploiting the structural modularity specific to coded simulations, we propose to use a factored partially observable Markov decision process representation that allows us to reduce context dependence during each step of the generation. For evaluation, we introduce a generative simulation benchmark that assesses the generated simulation code's accuracy and effectiveness in facilitating zero-shot transfers in reinforcement learning settings. We show that FACTORSIM outperforms existing methods in generating simulations regarding prompt alignment (e.g., accuracy), zero-shot transfer abilities, and human evaluation. We also demonstrate its effectiveness in generating robotic tasks.

Submitted 11 November, 2024; v1 submitted 26 September, 2024; originally announced September 2024.

Comments: NeurIPS 2024, project website: https://cs.stanford.edu/~sunfanyun/factorsim/

arXiv:2407.03067 [pdf, other]

A quantum mechanical evaluation of the intermediate scattering function

Authors: Oussama Bindech, Roberto Marquardt, Fabien Gatti, Souvik Mandal, Jean Christophe Tremblay

Abstract: The intermediate scattering function is interpreted as a correlation function of thermal wave packets of the scattering centers perturbed by the scattering particles at different times. A proof of concept is given using the example of ballistically moving centers. The ensuing numerical method is then illustrated using the example of CO adsorbed on Cu(100).

Submitted 3 July, 2024; originally announced July 2024.

Comments: 8 pages, 2 figures

MSC Class: 81; 82

arXiv:2406.10543 [pdf, other]

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows

Authors: Zhenggang Tang, Zhongzheng Ren, Xiaoming Zhao, Bowen Wen, Jonathan Tremblay, Stan Birchfield, Alexander Schwing

Abstract: We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigidly transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations of 3D anchor points that are defined on the surface of the scene. In order to identify anchor points, we introduce a novel correspondence algorithm that first matches RGB-based pairs, then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset (https://github.com/nerfdeformer/nerfdeformer) contains 113 synthetic scenes leveraging 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods, and we also explore different methods for filtering correspondences.

Submitted 15 June, 2024; originally announced June 2024.

Comments: 8 pages of main paper, CVPR 2024. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024

arXiv:2404.01440 [pdf, other]

Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects

Authors: Yijia Weng, Bowen Wen, Jonathan Tremblay, Valts Blukis, Dieter Fox, Leonidas Guibas, Stan Birchfield

Abstract: We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. We decompose the problem into two stages, each addressing distinct aspects. Our method first reconstructs object-level shape at each state, then recovers the underlying articulation model including part segmentation and joint articulations that associate the two states. By explicitly modeling point-level correspondences and exploiting cues from images, 3D reconstructions, and kinematics, our method yields more accurate and stable results compared to prior work. It also handles more than one movable part and does not rely on any object shape or structure priors. Project page: https://github.com/NVlabs/DigitalTwinArt

Submitted 6 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: CVPR 2024

arXiv:2403.20275 [pdf, other]

Snap-it, Tap-it, Splat-it: Tactile-Informed 3D Gaussian Splatting for Reconstructing Challenging Surfaces

Authors: Mauro Comi, Alessio Tonioni, Max Yang, Jonathan Tremblay, Valts Blukis, Yijiong Lin, Nathan F. Lepora, Laurence Aitchison

Abstract: Touch and vision go hand in hand, mutually enhancing our ability to understand the world. From a research perspective, the problem of mixing touch and vision is underexplored and presents interesting challenges. To this end, we propose Tactile-Informed 3DGS, a novel approach that incorporates touch data (local depth maps) with multi-view vision data to achieve surface reconstruction and novel view synthesis. Our method optimises 3D Gaussian primitives to accurately model the object's geometry at points of contact. By creating a framework that decreases the transmittance at touch locations, we achieve a refined surface reconstruction, ensuring a uniformly smooth depth map. Touch is particularly useful when considering non-Lambertian objects (e.g. shiny or reflective surfaces) since contemporary methods tend to fail to reconstruct specular highlights with fidelity. By combining vision and tactile sensing, we achieve more accurate geometry reconstructions with fewer images than prior methods. We conduct evaluation on objects with glossy and reflective surfaces and demonstrate the effectiveness of our approach, offering significant improvements in reconstruction quality.

Submitted 29 March, 2024; originally announced March 2024.

Comments: 17 pages

arXiv:2312.00215 [pdf, other]

Learning active tactile perception through belief-space control

Authors: Jean-François Tremblay, David Meger, Francois Hogan, Gregory Dudek

Abstract: Robots operating in an open world will encounter novel objects with unknown physical properties, such as mass, friction, or size. These robots will need to sense these properties through interaction prior to performing downstream tasks with the objects. We propose a method that autonomously learns tactile exploration policies by developing a generative world model that is leveraged to 1) estimate the object's physical parameters using a differentiable Bayesian filtering algorithm and 2) develop an exploration policy using an information-gathering model predictive controller. We evaluate our method on three simulated tasks where the goal is to estimate a desired object property (mass, height or toppling height) through physical interaction. We find that our method is able to discover policies that efficiently gather information about the desired property in an intuitive manner. Finally, we validate our method on a real robot system for the height estimation task, where our method is able to successfully learn and execute an information-gathering policy from scratch.

Submitted 30 November, 2023; originally announced December 2023.

Comments: 10 pages + references, 6 figures
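
The active tactile perception abstract mentions estimating an object's physical parameters with Bayesian filtering. The sketch below shows the core update on a much simpler problem than the paper's: a discrete belief over candidate masses is refined from noisy force readings. The Gaussian sensor model, the candidate grid, and all numbers are assumptions for illustration; the paper's filter is differentiable and learned, which this toy version is not.

```python
import math


def bayes_update(prior, candidates, measurement, predict, noise_std):
    """One Bayes-filter measurement update over a discrete parameter grid."""
    posterior = []
    for p, theta in zip(prior, candidates):
        error = measurement - predict(theta)
        likelihood = math.exp(-0.5 * (error / noise_std) ** 2)  # Gaussian noise
        posterior.append(p * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]


def predict_force(mass_kg):
    """Assumed sensor model: force (N) felt when holding the object still."""
    return 9.81 * mass_kg


candidates = [0.5, 1.0, 1.5, 2.0]   # candidate masses in kg
belief = [0.25] * len(candidates)   # uniform prior

for measured_force in [14.5, 15.0, 14.8]:   # hypothetical noisy readings
    belief = bayes_update(belief, candidates, measured_force, predict_force, noise_std=1.0)

best = candidates[belief.index(max(belief))]
print(best, [round(b, 3) for b in belief])
```

An information-gathering controller, as described in the abstract, would choose the next interaction to make this belief sharpen as quickly as possible; that part is not sketched here.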

arXiv:2310.00463 [pdf, other]

Diff-DOPE: Differentiable Deep Object Pose Estimation

Authors: Jonathan Tremblay, Bowen Wen, Valts Blukis, Balakumar Sundaralingam, Stephen Tyree, Stan Birchfield

Abstract: We introduce Diff-DOPE, a 6-DoF pose refiner that takes as input an image, a 3D textured model of an object, and an initial pose of the object. The method uses differentiable rendering to update the object pose to minimize the visual error between the image and the projection of the model. We show that this simple, yet effective, idea is able to achieve state-of-the-art results on pose estimation datasets. Our approach is a departure from recent methods in which the pose refiner is a deep neural network trained on a large synthetic dataset to map inputs to refinement steps. Rather, our use of differentiable rendering allows us to avoid training altogether. Our approach performs multiple gradient descent optimizations in parallel with different random learning rates to avoid local minima from symmetric objects, similar appearances, or wrong step size. Various modalities can be used, e.g., RGB, depth, intensity edges, and object segmentation masks. We present experiments examining the effect of various choices, showing that the best results are found when the RGB image is accompanied by an object mask and depth image to guide the optimization process.

Submitted 30 September, 2023; originally announced October 2023.

Comments: Submitted to ICRA 2023. Project page is at https://diffdope.github.io
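
The Diff-DOPE abstract describes running several gradient-descent optimizations with different learning rates and keeping the best result. The sketch below illustrates only the "wrong step size" part of that idea on a one-dimensional toy loss. The runs are sequential rather than parallel, the learning rates are a fixed list rather than random, and the quadratic loss stands in for a rendering error; all of these are simplifying assumptions, and `refine` is a hypothetical name.

```python
def refine(loss, grad, x0, learning_rates, steps=100):
    """Run one gradient descent per learning rate and keep the lowest loss."""
    best_x, best_loss = x0, loss(x0)
    for lr in learning_rates:
        x = x0
        for _ in range(steps):
            x -= lr * grad(x)
        final_loss = loss(x)
        if final_loss < best_loss:
            best_x, best_loss = x, final_loss
    return best_x, best_loss


def loss(x):
    # Toy stand-in for the image-vs-render error, minimised at x = 3.
    return (x - 3.0) * (x - 3.0)


def grad(x):
    return 2.0 * (x - 3.0)


# 0.001 is too small to converge in 100 steps, 1.2 diverges, 0.3 converges.
best_x, best_loss = refine(loss, grad, x0=0.0, learning_rates=[0.001, 0.3, 1.2])
print(best_x, best_loss)
```

Selecting by final loss means a badly chosen learning rate costs compute but does not degrade the result, which is why trying several at once can be more robust than tuning a single value.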

arXiv:2308.01477 [pdf, other]

HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions

Authors: Andrew Guo, Bowen Wen, Jianhe Yuan, Jonathan Tremblay, Stephen Tyree, Jeffrey Smith, Stan Birchfield

Abstract: We present the HANDAL dataset for category-level object pose estimation and affordance prediction. Unlike previous datasets, ours is focused on robotics-ready manipulable objects that are of the proper size and shape for functional grasping by robot manipulators, such as pliers, utensils, and screwdrivers. Our annotation process is streamlined, requiring only a single off-the-shelf camera and semi-automated processing, allowing us to produce high-quality 3D annotations without crowd-sourcing. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We focus on hardware and kitchen tool objects to facilitate research in practical scenarios in which a robot manipulator needs to interact with the environment beyond simple pushing or indiscriminate grasping. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks. We also provide 3D reconstructed meshes of all objects, and we outline some of the bottlenecks to be addressed for democratizing the collection of datasets like this one.

Submitted 2 August, 2023; originally announced August 2023.

Comments: IROS 2023. Project page: https://nvlabs.github.io/HANDAL/

arXiv:2304.00673 [pdf, other]

Partial-View Object View Synthesis via Filtered Inversion

Authors: Fan-Yun Sun, Jonathan Tremblay, Valts Blukis, Kevin Lin, Danfei Xu, Boris Ivanovic, Peter Karkus, Stan Birchfield, Dieter Fox, Ruohan Zhang, Yunzhu Li, Jiajun Wu, Marco Pavone, Nick Haber

Abstract: We propose Filtering Inversion (FINV), a learning framework and optimization process that predicts a renderable 3D object representation from one or few partial views. FINV addresses the challenge of synthesizing novel views of objects from partial observations, spanning cases where the object is not entirely in view, is partially occluded, or is only observed from similar views. To achieve this, FINV learns shape priors by training a 3D generative model. At inference, given one or more views of a novel real-world object, FINV first finds a set of latent codes for the object by inverting the generative model from multiple initial seeds. Maintaining the set of latent codes, FINV filters and resamples them after receiving each new observation, akin to particle filtering. The generator is then finetuned for each latent code on the available views in order to adapt to novel objects. We show that FINV successfully synthesizes novel views of real-world objects (e.g., chairs, tables, and cars), even if the generative prior is trained only on synthetic objects. The ability to address the sim-to-real problem allows FINV to be used for object categories without real-world datasets. FINV achieves state-of-the-art performance on multiple real-world datasets, recovers object shape and texture from partial and sparse views, is robust to occlusion, and is able to incrementally improve its representation with more observations.

Submitted 17 August, 2024; v1 submitted 2 April, 2023; originally announced April 2023.

Comments: project website: http://cs.stanford.edu/~sunfanyun/finv

arXiv:2303.16730 [pdf, other]

TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation

Authors: Taeyeop Lee, Jonathan Tremblay, Valts Blukis, Bowen Wen, Byeong-Uk Lee, Inkyu Shin, Stan Birchfield, In So Kweon, Kuk-Jin Yoon

Abstract: Test-time adaptation methods have been gaining attention recently as a practical solution for addressing source-to-target domain gaps by gradually updating the model without requiring labels on the target data. In this paper, we propose a method of test-time adaptation for category-level object pose estimation called TTA-COPE. We design a pose ensemble approach with a self-training loss using pose-aware confidence. Unlike previous unsupervised domain adaptation methods for category-level object pose estimation, our approach processes the test data in a sequential, online manner, and it does not require access to the source domain at runtime. Extensive experimental results demonstrate that the proposed pose ensemble and the self-training loss improve category-level object pose performance during test time under both semi-supervised and unsupervised settings. Project page: https://taeyeop.com/ttacope

Submitted 29 March, 2023; originally announced March 2023.

Comments: Accepted to CVPR 2023, Project page: https://taeyeop.com/ttacope

arXiv:2303.15926 [pdf, ps, other]

doi 10.1140/epjs/s11734-023-00942-1

Charge Migration in Heterocyclic Five-Membered Rings

Authors: Sucharita Giri, Gopal Dixit, Jean Christophe Tremblay

Abstract: This contribution presents numerical simulations of N-electron dynamics in heterocyclic five-membered ring molecules to shed light on the effect of molecular symmetry on charge migration. Laser-driven dynamics is studied using the hybrid time-dependent density functional theory/configuration methodology, and the ensuing field-free charge migration is investigated by means of transient electronic flux density maps. Our results demonstrate that the charge migration in aromatic rings is sensitive to the presence of heteroatoms such as oxygen and nitrogen. Their presence within the ring induces significant modifications of the character in the ground and low-lying electronic states, which is imprinted in the charge migration mechanism.

Submitted 28 March, 2023; originally announced March 2023.

arXiv:2303.14158 [pdf, other]

BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects

Authors: Bowen Wen, Jonathan Tremblay, Valts Blukis, Stephen Tyree, Thomas Muller, Alex Evans, Dieter Fox, Jan Kautz, Stan Birchfield

Abstract: We present a near real-time method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence, while simultaneously performing neural 3D reconstruction of the object. Our method works for arbitrary rigid objects, even when visual texture is largely absent. The object is assumed to be segmented in the first frame only. No additional information is required, and no assumption is made about the interaction agent. Key to our method is a Neural Object Field that is learned concurrently with a pose graph optimization process in order to robustly accumulate information into a consistent 3D representation capturing both geometry and appearance. A dynamic pool of posed memory frames is automatically maintained to facilitate communication between these threads. Our approach handles challenging sequences with large pose changes, partial and full occlusion, untextured surfaces, and specular highlights. We show results on HO3D, YCBInEOAT, and BEHAVE datasets, demonstrating that our method significantly outperforms existing approaches. Project page: https://bundlesdf.github.io

Submitted 24 March, 2023; originally announced March 2023.

Comments: CVPR 2023

arXiv:2212.06870 [pdf, other]

MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare

Authors: Yann Labbé, Lucas Manuelli, Arsalan Mousavian, Stephen Tyree, Stan Birchfield, Jonathan Tremblay, Justin Carpentier, Mathieu Aubry, Dieter Fox, Josef Sivic

Abstract: We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.

Submitted 13 December, 2022; originally announced December 2022.

Comments: CoRL 2022

arXiv:2210.12126 [pdf, other]

One-Shot Neural Fields for 3D Object Understanding

Authors: Valts Blukis, Taeyeop Lee, Jonathan Tremblay, Bowen Wen, In So Kweon, Kuk-Jin Yoon, Dieter Fox, Stan Birchfield

Abstract: We present a unified and compact scene representation for robotics, where each object in the scene is depicted by a latent code capturing geometry and appearance. This representation can be decoded for various tasks such as novel view rendering, 3D reconstruction (e.g. recovering depth, point clouds, or voxel maps), collision checking, and stable grasp prediction. We build our representation from a single RGB input image at test time by leveraging recent advances in Neural Radiance Fields (NeRF) that learn category-level priors on large multiview datasets, then fine-tune on novel objects from one or few views. We expand the NeRF model for additional grasp outputs and explore ways to leverage this representation for robotics. At test-time, we build the representation from a single RGB input image observing the scene from only one viewpoint. We find that the recovered representation allows rendering from novel views, including of occluded object parts, and also for predicting successful stable grasps. Grasp poses can be directly decoded from our latent representation with an implicit grasp decoder. We experimented in both simulation and real world and demonstrated the capability for robust robotic grasping using such compact representation. Website: https://nerfgrasp.github.io

Submitted 8 August, 2023; v1 submitted 21 October, 2022; originally announced October 2022.

Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshop (CVPRW) on XRNeRF: Advances in NeRF for the Metaverse 2023

arXiv:2210.11668 [pdf, other]

RGB-Only Reconstruction of Tabletop Scenes for Collision-Free Manipulator Control

Authors: Zhenggang Tang, Balakumar Sundaralingam, Jonathan Tremblay, Bowen Wen, Ye Yuan, Stephen Tyree, Charles Loop, Alexander Schwing, Stan Birchfield

Abstract: We present a system for collision-free control of a robot manipulator that uses only RGB views of the world. Perceptual input of a tabletop scene is provided by multiple images of an RGB camera (without depth) that is either handheld or mounted on the robot end effector. A NeRF-like process is used to reconstruct the 3D geometry of the scene, from which the Euclidean full signed distance function (ESDF) is computed. A model predictive control algorithm is then used to control the manipulator to reach a desired pose while avoiding obstacles in the ESDF. We show results on a real dataset collected and annotated in our lab.

Submitted 10 March, 2023; v1 submitted 20 October, 2022; originally announced October 2022.

Comments: ICRA 2023. Project page at https://ngp-mpc.github.io/

arXiv:2210.10108 [pdf, other]

Parallel Inversion of Neural Radiance Fields for Robust Pose Estimation

Authors: Yunzhi Lin, Thomas Müller, Jonathan Tremblay, Bowen Wen, Stephen Tyree, Alex Evans, Patricio A. Vela, Stan Birchfield

Abstract: We present a parallelized optimization method based on fast Neural Radiance Fields (NeRF) for estimating 6-DoF pose of a camera with respect to an object or scene. Given a single observed RGB image of the target, we can predict the translation and rotation of the camera by minimizing the residual between pixels rendered from a fast NeRF model and pixels in the observed image. We integrate a momentum-based camera extrinsic optimization procedure into Instant Neural Graphics Primitives, a recent exceptionally fast NeRF implementation. By introducing parallel Monte Carlo sampling into the pose estimation task, our method overcomes local minima and improves efficiency in a more extensive search space. We also show the importance of adopting a more robust pixel-based loss function to reduce error. Experiments demonstrate that our method can achieve improved generalization and robustness on both synthetic and real-world benchmarks.

Submitted 10 March, 2023; v1 submitted 18 October, 2022; originally announced October 2022.

Comments: ICRA 2023. Project page at https://pnerfp.github.io/

arXiv:2209.11302 [pdf, other]

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models

Authors: Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg

Abstract: Task planning can require defining myriad domain knowledge about the world in which a robot needs to act. To ameliorate that effort, large language models (LLMs) can be used to score potential next actions during task planning, and even generate action sequences directly, given an instruction in natural language with no additional domain information. However, such methods either require enumerating all possible next steps for scoring, or generate free-form text that may contain actions not possible on a given robot in its current context. We present a programmatic LLM prompt structure that enables plan generation functional across situated environments, robot capabilities, and tasks. Our key insight is to prompt the LLM with program-like specifications of the available actions and objects in an environment, as well as with example programs that can be executed. We make concrete recommendations about prompt structure and generation constraints through ablation experiments, demonstrate state of the art success rates in VirtualHome household tasks, and deploy our method on a physical robot arm for tabletop tasks. Website at progprompt.github.io

Submitted 22 September, 2022; originally announced September 2022.

arXiv:2206.08241 [pdf, other]

doi 10.1063/5.0105308

Stochastic Multi Configuration Time-Dependent Hartree for Dissipative Quantum Dynamics with Strong Intramolecular Coupling

Authors: Souvik Mandal, Fabien Gatti, Oussama Bindech, Roberto Marquardt, Jean Christophe Tremblay

Abstract: In this article, we explore the dissipation dynamics of a strongly coupled multidimensional system in contact with a Markovian bath following a system-bath approach. We use in this endeavour the recently developed stochastic Multi-Configuration Time-Dependent Hartree approach within the Monte Carlo wave packet formalism [J. Chem. Phys. 156, 094109 (2022)]. The method proved to yield thermalized ensembles of wave packets when intramolecular coupling is weak. To treat strongly coupled systems, new Lindblad dissipative operators are constructed as linear combinations of the system coordinates and associated momenta. These are obtained by a unitary transformation to a normal mode representation, which reduces intermode coupling up to second order. Additionally, we use combinations of generalized raising/lowering operators to enforce the Boltzmann distribution in the dissipation operators, which yield perfect thermalization in the harmonic limit. The two ansätze are tested using a model two-dimensional Hamiltonian parameterized to disentangle the effects of intramolecular potential coupling, of strong mode mixing observed in Fermi resonances, and of anharmonicity.

Submitted 16 June, 2022; originally announced June 2022.

Comments: 32 pages, 11 figures

arXiv:2206.07707 [pdf, other]

Variable Bitrate Neural Fields

Authors: Towaki Takikawa, Alex Evans, Jonathan Tremblay, Thomas Müller, Morgan McGuire, Alec Jacobson, Sanja Fidler

Abstract: Neural approximations of scalar and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids that take on part of the learning task and allow for smaller, more efficient neural networks. Unfortunately, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. We present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100x and permitting a multiresolution representation which can be useful for out-of-core streaming. We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure. Our source code will be available at https://github.com/nv-tlabs/vqad.

Submitted 15 June, 2022; originally announced June 2022.

Comments: SIGGRAPH 2022. Project Page: https://nv-tlabs.github.io/vqad/

arXiv:2205.11047 [pdf, other]

Keypoint-Based Category-Level Object Pose Tracking from an RGB Sequence with Uncertainty Estimation

Authors: Yunzhi Lin, Jonathan Tremblay, Stephen Tyree, Patricio A. Vela, Stan Birchfield

Abstract: We propose a single-stage, category-level 6-DoF pose estimation algorithm that simultaneously detects and tracks instances of objects within a known category. Our method takes as input the previous and current frame from a monocular RGB video, as well as predictions from the previous frame, to predict the bounding cuboid and 6-DoF pose (up to scale). Internally, a deep network predicts distributions over object keypoints (vertices of the bounding cuboid) in image coordinates, after which a novel probabilistic filtering process integrates across estimates before computing the final pose using PnP. Our framework allows the system to take previous uncertainties into consideration when predicting the current frame, resulting in predictions that are more accurate and stable than single frame methods. Extensive experiments show that our method outperforms existing approaches on the challenging Objectron benchmark of annotated object videos. We also demonstrate the usability of our work in an augmented reality setting.

Submitted 23 May, 2022; originally announced May 2022.

Comments: ICRA 2022. Project site is at https://sites.google.com/view/centerposetrack

arXiv:2205.07058 [pdf, other]

RTMV: A Ray-Traced Multi-View Synthetic Dataset for Novel View Synthesis

Authors: Jonathan Tremblay, Moustafa Meshry, Alex Evans, Jan Kautz, Alexander Keller, Sameh Khamis, Thomas Müller, Charles Loop, Nathan Morrical, Koki Nagano, Towaki Takikawa, Stan Birchfield

Abstract: We present a large-scale synthetic dataset for novel view synthesis consisting of ~300k images rendered from nearly 2000 complex scenes using high-quality ray tracing at high resolution (1600 x 1600 pixels). The dataset is orders of magnitude larger than existing synthetic datasets for novel view synthesis, thus providing a large unified benchmark for both training and evaluation. Using 4 distinct sources of high-quality 3D meshes, the scenes of our dataset exhibit challenging variations in camera views, lighting, shape, materials, and textures. Because our dataset is too large for existing methods to process, we propose Sparse Voxel Light Field (SVLF), an efficient voxel-based light field approach for novel view synthesis that achieves comparable performance to NeRF on synthetic data, while being an order of magnitude faster to train and two orders of magnitude faster to render. SVLF achieves this speed by relying on a sparse voxel octree, careful voxel sampling (requiring only a handful of queries per ray), and reduced network structure; as well as ground truth depth maps at training time. Our dataset is generated by NViSII, a Python-based ray tracing renderer, which is designed to be simple for non-experts to use and share, flexible and powerful through its use of scripting, and able to create high-quality and physically-based rendered images. Experiments with a subset of our dataset allow us to compare standard methods like NeRF and mip-NeRF for single-scene modeling, and pixelNeRF for category-level modeling, pointing toward the need for future improvements in this area.

Submitted 24 October, 2022; v1 submitted 14 May, 2022; originally announced May 2022.

Comments: ECCV 2022 Workshop on Learning to Generate 3D Shapes and Scenes. Project page at http://www.cs.umd.edu/~mmeshry/projects/rtmv

arXiv:2205.01530 [pdf, other]

doi 10.1103/PhysRevA.106.033120

Probing the Effect of Molecular Structure Saddling on Ultrafast Charge Migration via Time-Resolved X-ray Diffraction

Authors: Sucharita Giri, Jean Christophe Tremblay, Gopal Dixit

Abstract: Metal-corroles are macrocycle organic molecules with numerous practical applications. In particular, copper corroles exhibit an interesting saddled geometry, which has attracted significant attention from theoreticians and experimentalists over the years. The present work is dedicated to understanding the effect of structural saddling in a copper corrole on potential probe signals via imaging ultrafast coherent electron dynamics. A linearly polarized pulse is used to trigger the electron dynamics and time-resolved x-ray diffraction is employed to image the triggered dynamics. It is found that the symmetry reduction in the time-resolved diffraction signals and electronic flux densities is a signature of the saddling in a copper corrole during ultrafast charge migration. Moreover, analysis of the electronic flux density reveals that the diagonal nitrogen atoms mediate coherent charge migration between them via a central copper atom. Correlation of the flux densities and the diffraction signals indicates that the signature of the charge migration is encoded in time-resolved diffraction signals. A comparison of the static diffraction signals of nonsaddled planar copper porphyrin and saddled nonplanar copper corrole in their ground states is made.

Submitted 29 September, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

Comments: 17 pages, 7 figures

Journal ref: Physical Review A 106, 033120 (2022)

arXiv:2203.05701 [pdf, other]

6-DoF Pose Estimation of Household Objects for Robotic Manipulation: An Accessible Dataset and Benchmark

Authors: Stephen Tyree, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Jeffrey Smith, Stan Birchfield

Abstract: We present a new dataset for 6-DoF pose estimation of known objects, with a focus on robotic manipulation research. We propose a set of toy grocery objects, whose physical instantiations are readily available for purchase and are appropriately sized for robotic grasping and manipulation. We provide 3D scanned textured models of these objects, suitable for generating synthetic training data, as well as RGBD images of the objects in challenging, cluttered scenes exhibiting partial occlusion, extreme lighting variations, multiple instances per image, and a large variety of poses. Using semi-automated RGBD-to-model texture correspondences, the images are annotated with ground truth poses accurate within a few millimeters. We also propose a new pose evaluation metric called ADD-H based on the Hungarian assignment algorithm that is robust to symmetries in object geometry without requiring their explicit enumeration. We share pre-trained pose estimators for all the toy grocery objects, along with their baseline performance on both validation and test sets. We offer this dataset to the community to help connect the efforts of computer vision researchers with the needs of roboticists.

Submitted 15 December, 2022; v1 submitted 10 March, 2022; originally announced March 2022.

Comments: IROS 2022. Project page is at https://github.com/swtyree/hope-dataset

arXiv:2112.11347 [pdf, other]

Watch It Move: Unsupervised Discovery of 3D Joints for Re-Posing of Articulated Objects

Authors: Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya Harada, Orazio Gallo

Abstract: Rendering articulated objects while controlling their poses is critical to applications such as virtual reality or animation for movies. Manipulating the pose of an object, however, requires the understanding of its underlying structure, that is, its joints and how they interact with each other. Unfortunately, assuming the structure to be known, as existing methods do, precludes the ability to work on new object categories. We propose to learn both the appearance and the structure of previously unseen articulated objects by observing them move from multiple views, with no joints annotation supervision, or information about the structure. We observe that 3D points that are static relative to one another should belong to the same part, and that adjacent parts that move relative to each other must be connected by a joint. To leverage this insight, we model the object parts in 3D as ellipsoids, which allows us to identify joints. We combine this explicit representation with an implicit one that compensates for the approximation introduced. We show that our method works for different structures, from quadrupeds, to single-arm robots, to humans.

Submitted 6 April, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

Comments: CVPR2022, 16 pages, Project page: https://nvlabs.github.io/watch-it-move

arXiv:2112.07945 [pdf, other]

Efficient Geometry-aware 3D Generative Adversarial Networks

Authors: Eric R. Chan, Connor Z. Lin, Matthew A. Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas Guibas, Jonathan Tremblay, Sameh Khamis, Tero Karras, Gordon Wetzstein

Abstract: Unsupervised generation of high-quality multi-view-consistent images and 3D shapes using only collections of single-view 2D photographs has been a long-standing challenge. Existing 3D GANs are either compute-intensive or make approximations that are not 3D-consistent; the former limits quality and resolution of the generated images and the latter adversely affects multi-view consistency and shape quality. In this work, we improve the computational efficiency and image quality of 3D GANs without overly relying on these approximations. We introduce an expressive hybrid explicit-implicit network architecture that, together with other design choices, synthesizes not only high-resolution multi-view-consistent images in real time but also produces high-quality 3D geometry. By decoupling feature generation and neural rendering, our framework is able to leverage state-of-the-art 2D CNN generators, such as StyleGAN2, and inherit their efficiency and expressiveness. We demonstrate state-of-the-art 3D-aware synthesis with FFHQ and AFHQ Cats, among other experiments.

Submitted 27 April, 2022; v1 submitted 15 December, 2021; originally announced December 2021.

Comments: Project page: https://matthew-a-chan.github.io/EG3D

arXiv:2109.06161 [pdf, other]

Single-Stage Keypoint-Based Category-Level Object Pose Estimation from an RGB Image

Authors: Yunzhi Lin, Jonathan Tremblay, Stephen Tyree, Patricio A. Vela, Stan Birchfield

Abstract: Prior work on 6-DoF object pose estimation has largely focused on instance-level processing, in which a textured CAD model is available for each object being detected. Category-level 6-DoF pose estimation represents an important step toward developing robotic vision systems that operate in unstructured, real-world scenarios. In this work, we propose a single-stage, keypoint-based approach for category-level object pose estimation that operates on unknown object instances within a known category using a single RGB image as input. The proposed network performs 2D object detection, detects 2D keypoints, estimates 6-DoF pose, and regresses relative bounding cuboid dimensions. These quantities are estimated in a sequential fashion, leveraging the recent idea of convGRU for propagating information from easier tasks to those that are more difficult. We favor simplicity in our design choices: generic cuboid vertex coordinates, single-stage network, and monocular RGB input. We conduct extensive experiments on the challenging Objectron benchmark, outperforming state-of-the-art methods on the 3D IoU metric (27.6% higher than the MobilePose single-stage approach and 7.1% higher than the related two-stage approach).

Submitted 12 May, 2022; v1 submitted 13 September, 2021; originally announced September 2021.

Comments: ICRA 2022. Project page at https://sites.google.com/view/centerpose

arXiv:2107.05658 [pdf, other]

doi 10.3847/1538-4357/ac2577

Comprehensive analysis of a dense sample of FRB 121102 bursts

Authors: Kshitij Aggarwal, Devansh Agarwal, Evan F. Lewis, Reshma Anna-Thomas, Jacob Cardinal Tremblay, Sarah Burke-Spolaor, Maura A. McLaughlin, Duncan R. Lorimer

Abstract: We present an analysis of a densely repeating sample of bursts from the first repeating fast radio burst, FRB 121102. We reanalysed the data used by Gourdji et al. (2019) and detected 93 additional bursts using our single-pulse search pipeline. In total, we detected 133 bursts in three hours of data at a center frequency of 1.4 GHz using the Arecibo telescope, and develop robust modeling strategies to constrain the spectro-temporal properties of all the bursts in the sample. Most of the burst profiles show a scattering tail, and burst spectra are well modeled by a Gaussian with a median width of 230 MHz. We find a lack of emission below 1300 MHz, consistent with previous studies of FRB 121102. We also find that the peak of the log-normal distribution of wait times decreases from 207 s to 75 s using our larger sample of bursts, as compared to that of Gourdji et al. (2019). Our observations do not favor either Poissonian or Weibull distributions for the burst rate distribution. We searched for periodicity in the bursts using multiple techniques but did not detect any significant period. The cumulative burst energy distribution exhibits a broken power-law shape, with the lower and higher-energy slopes of $-0.4\pm0.1$ and $-1.8\pm0.2$, with the break at $(2.3\pm0.2)\times 10^{37}$ ergs. We provide our burst fitting routines as a python package BURSTFIT that can be used to model the spectrogram of any complex FRB or pulsar pulse using robust fitting techniques. All the other analysis scripts and results are publicly available.

Submitted 23 September, 2021; v1 submitted 12 July, 2021; originally announced July 2021.

Comments: 27 pages, 13 figures, 5 Tables; Accepted for publication in ApJ

arXiv:2105.13962 [pdf, other]

NViSII: A Scriptable Tool for Photorealistic Image Generation

Authors: Nathan Morrical, Jonathan Tremblay, Yunzhi Lin, Stephen Tyree, Stan Birchfield, Valerio Pascucci, Ingo Wald

Abstract: We present a Python-based renderer built on NVIDIA's OptiX ray tracing engine and the OptiX AI denoiser, designed to generate high-quality synthetic images for research in computer vision and deep learning. Our tool enables the description and manipulation of complex dynamic 3D scenes containing object meshes, materials, textures, lighting, volumetric data (e.g., smoke), and backgrounds. Metadata, such as 2D/3D bounding boxes, segmentation masks, depth maps, normal maps, material properties, and optical flow vectors, can also be generated. In this work, we discuss design goals, architecture, and performance. We demonstrate the use of data generated by path tracing for training an object detector and pose estimator, showing improved performance in sim-to-real transfer in situations that are difficult for traditional raster-based renderers. We offer this tool as an easy-to-use, performant, high-quality renderer for advancing research in synthetic data generation and deep learning.

Submitted 28 May, 2021; originally announced May 2021.

Comments: SDG Workshop at ICLR 2021. Project page is at https://github.com/owl-project/NVISII

arXiv:2104.04631 [pdf, other]

DexYCB: A Benchmark for Capturing Hand Grasping of Objects

Authors: Yu-Wei Chao, Wei Yang, Yu Xiang, Pavlo Molchanov, Ankur Handa, Jonathan Tremblay, Yashraj S. Narang, Karl Van Wyk, Umar Iqbal, Stan Birchfield, Jan Kautz, Dieter Fox

Abstract: We introduce DexYCB, a new dataset for capturing hand grasping of objects. We first compare DexYCB with a related one through cross-dataset evaluation. We then present a thorough benchmark of state-of-the-art approaches on three relevant tasks: 2D object and keypoint detection, 6D object pose estimation, and 3D hand pose estimation. Finally, we evaluate a new robotics-relevant task: generating safe robot grasps in human-to-robot object handover. Dataset and code are available at https://dex-ycb.github.io.

Submitted 9 April, 2021; originally announced April 2021.

Comments: Accepted to CVPR 2021

arXiv:2104.01760 [pdf, other]

doi 10.1103/PhysRevA.104.053115

Imaging charge-migration in chiral molecules using time-resolved x-ray diffraction

Authors: Sucharita Giri, Jean Christophe Tremblay, Gopal Dixit

Abstract: Four-dimensional imaging of charge migration is crucial to the understanding of several ubiquitous processes in nature. The present work focuses on imaging of charge migration in an oriented epoxypropane: a chiral molecule. A linearly polarized pulse is used to induce the charge migration, which is imaged by time-resolved x-ray diffraction. It is found that the total time-resolved diffraction signals are significantly different for both enantiomers. Furthermore, a connection between time-resolved x-ray diffraction and the electronic continuity equation is discussed by analyzing the time-dependent diffraction signal and the time derivative of the total electron density in the momentum space.

Submitted 23 November, 2021; v1 submitted 4 April, 2021; originally announced April 2021.

Comments: 9 Pages, 5 figures

Journal ref: Physical Review A 104, 053115 (2021)

arXiv:2103.13539 [pdf, other]

Multi-View Fusion for Multi-Level Robotic Scene Understanding

Authors: Yunzhi Lin, Jonathan Tremblay, Stephen Tyree, Patricio A. Vela, Stan Birchfield

Abstract: We present a system for multi-level scene awareness for robotic manipulation. Given a sequence of camera-in-hand RGB images, the system calculates three types of information: 1) a point cloud representation of all the surfaces in the scene, for the purpose of obstacle avoidance; 2) the rough pose of unknown objects from categories corresponding to primitive shapes (e.g., cuboids and cylinders); and 3) full 6-DoF pose of known objects. By developing and fusing recent techniques in these domains, we provide a rich scene representation for robot awareness. We demonstrate the importance of each of these modules, their complementary nature, and the potential benefits of the system in the context of robotic manipulation.

Submitted 14 October, 2021; v1 submitted 24 March, 2021; originally announced March 2021.

Comments: Presented at IROS 2021. Video is at https://youtu.be/FuqMxuODGlw

arXiv:2012.07277 [pdf, other]

Hierarchical Planning for Long-Horizon Manipulation with Geometric and Symbolic Scene Graphs

Authors: Yifeng Zhu, Jonathan Tremblay, Stan Birchfield, Yuke Zhu

Abstract: We present a visually grounded hierarchical planning algorithm for long-horizon manipulation tasks. Our algorithm offers a joint framework of neuro-symbolic task planning and low-level motion generation conditioned on the specified goal. At the core of our approach is a two-level scene graph representation, namely geometric scene graph and symbolic scene graph. This hierarchical representation serves as a structured, object-centric abstraction of manipulation scenes. Our model uses graph neural networks to process these scene graphs for predicting high-level task plans and low-level motions. We demonstrate that our method scales to long-horizon tasks and generalizes well to novel task goals. We validate our method in a kitchen storage task in both physical simulation and the real world. Our experiments show that our method achieved over 70% success rate and nearly 90% of subgoal completion rate on the real robot while being four orders of magnitude faster in computation time compared to standard search-based task-and-motion planner.

Submitted 29 March, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

Comments: Accepted to ICRA 2021

arXiv:2011.11751 [pdf, other]

Multimodal dynamics modeling for off-road autonomous vehicles

Authors: Jean-François Tremblay, Travis Manderson, Aurélio Noca, Gregory Dudek, David Meger

Abstract: Dynamics modeling in outdoor and unstructured environments is difficult because different elements in the environment interact with the robot in ways that can be hard to predict. Leveraging multiple sensors to perceive maximal information about the robot's environment is thus crucial when building a model to perform predictions about the robot's dynamics with the goal of doing motion planning. We design a model capable of long-horizon motion predictions, leveraging vision, lidar and proprioception, which is robust to arbitrarily missing modalities at test time. We demonstrate in simulation that our model is able to leverage vision to predict traction changes. We then test our model using a real-world challenging dataset of a robot navigating through a forest, performing predictions in trajectories unseen during training. We try different modality combinations at test time and show that, while our model performs best when all modalities are present, it is still able to perform better than the baseline even when receiving only raw vision input and no proprioception, as well as when only receiving proprioception. Overall, our study demonstrates the importance of leveraging multiple sensors when doing dynamics modeling in outdoor conditions.

Submitted 29 March, 2021; v1 submitted 23 November, 2020; originally announced November 2020.

arXiv:2011.07748 [pdf, other]

Fast Uncertainty Quantification for Deep Object Pose Estimation

Authors: Guanya Shi, Yifeng Zhu, Jonathan Tremblay, Stan Birchfield, Fabio Ramos, Animashree Anandkumar, Yuke Zhu

Abstract: Deep learning-based object pose estimators are often unreliable and overconfident especially when the input image is outside the training domain, for instance, with sim2real transfer. Efficient and robust uncertainty quantification (UQ) in pose estimators is critically needed in many robotic tasks. In this work, we propose a simple, efficient, and plug-and-play UQ method for 6-DoF object pose estimation. We ensemble 2-3 pre-trained models with different neural network architectures and/or training data sources, and compute their average pairwise disagreement against one another to obtain the uncertainty quantification. We propose four disagreement metrics, including a learned metric, and show that the average distance (ADD) is the best learning-free metric and it is only slightly worse than the learned metric, which requires labeled target data. Our method has several advantages compared to the prior art: 1) our method does not require any modification of the training process or the model inputs; and 2) it needs only one forward pass for each model. We evaluate the proposed UQ method on three tasks where our uncertainty quantification yields much stronger correlations with pose estimation errors than the baselines. Moreover, in a real robot grasping task, our method increases the grasping success rate from 35% to 90%.

Submitted 26 March, 2021; v1 submitted 16 November, 2020; originally announced November 2020.

Comments: Video and code are available at https://sites.google.com/view/fastuq

Journal ref: International Conference on Robotics and Automation (ICRA), 2021

arXiv:2011.06332 [pdf, other]

Joint Space Control via Deep Reinforcement Learning

Authors: Visak Kumar, David Hoeller, Balakumar Sundaralingam, Jonathan Tremblay, Stan Birchfield

Abstract: The dominant way to control a robot manipulator uses hand-crafted differential equations leveraging some form of inverse kinematics / dynamics. We propose a simple, versatile joint-level controller that dispenses with differential equations entirely. A deep neural network, trained via model-free reinforcement learning, is used to map from task space to joint space. Experiments show the method capable of achieving similar error to traditional methods, while greatly simplifying the process by automatically handling redundancy, joint limits, and acceleration / deceleration profiles. The basic technique is extended to avoid obstacles by augmenting the input to the network with information about the nearest obstacles. Results are shown both in simulation and on a real robot via sim-to-real transfer of the learned policy. We show that it is possible to achieve sub-centimeter accuracy, both in simulation and the real world, with a moderate amount of training.

Submitted 20 August, 2021; v1 submitted 12 November, 2020; originally announced November 2020.

Comments: Presented at IROS 2021. Video is at https://youtu.be/ICfve-GTTp8

arXiv:2008.11822 [pdf, other]

Indirect Object-to-Robot Pose Estimation from an External Monocular RGB Camera

Authors: Jonathan Tremblay, Stephen Tyree, Terry Mosier, Stan Birchfield

Abstract: We present a robotic grasping system that uses a single external monocular RGB camera as input. The object-to-robot pose is computed indirectly by combining the output of two neural networks: one that estimates the object-to-camera pose, and another that estimates the robot-to-camera pose. Both networks are trained entirely on synthetic data, relying on domain randomization to bridge the sim-to-real gap. Because the latter network performs online camera calibration, the camera can be moved freely during execution without affecting the quality of the grasp. Experimental results analyze the effect of camera placement, image resolution, and pose refinement in the context of grasping several household objects. We also present results on a new set of 28 textured household toy grocery objects, which have been selected to be accessible to other researchers. To aid reproducibility of the research, we offer 3D scanned textured models, along with pre-trained weights for pose estimation.

Submitted 26 August, 2020; originally announced August 2020.

Comments: IROS 2020. Video at https://youtu.be/E0J91llX-ys

arXiv:2006.16235 [pdf, other]

Vision-Based Goal-Conditioned Policies for Underwater Navigation in the Presence of Obstacles

Authors: Travis Manderson, Juan Camilo Gamboa Higuera, Stefan Wapnick, Jean-François Tremblay, Florian Shkurti, David Meger, Gregory Dudek

Abstract: We present Nav2Goal, a data-efficient and end-to-end learning method for goal-conditioned visual navigation. Our technique is used to train a navigation policy that enables a robot to navigate close to sparse geographic waypoints provided by a user without any prior map, all while avoiding obstacles and choosing paths that cover user-informed regions of interest. Our approach is based on recent advances in conditional imitation learning. General-purpose, safe and informative actions are demonstrated by a human expert. The learned policy is subsequently extended to be goal-conditioned by training with hindsight relabelling, guided by the robot's relative localization system, which requires no additional manual annotation. We deployed our method on an underwater vehicle in the open ocean to collect scientifically relevant data of coral reefs, which allowed our robot to operate safely and autonomously, even at very close proximity to the coral. Our field deployments have demonstrated over a kilometer of autonomous visual navigation, where the robot reaches on the order of 40 waypoints, while collecting scientifically relevant data. This is done while travelling within 0.5 m altitude from sensitive corals and exhibiting significant learned agility to overcome turbulent ocean conditions and to actively avoid collisions.

Submitted 29 June, 2020; originally announced June 2020.

Comments: RSS 2020. Video and project details can be found at http://www.cim.mcgill.ca/mrl/nav2goal/

arXiv:2005.10872 [pdf, other]

Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning

Authors: Michelle A. Lee, Carlos Florensa, Jonathan Tremblay, Nathan Ratliff, Animesh Garg, Fabio Ramos, Dieter Fox

Abstract: Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state. On the other hand, reinforcement learning approaches can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle. In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline, while requiring minimal interactions with the environment. This is achieved by leveraging uncertainty estimates to divide the space in regions where the given model-based policy is reliable, and regions where it may have flaws or not be well defined. In these uncertain regions, we show that a locally learned-policy can be used directly with raw sensory inputs. We test our algorithm, Guided Uncertainty-Aware Policy Optimization (GUAPO), on a real-world robot performing peg insertion. Videos are available at https://sites.google.com/view/guapo-rl

Submitted 26 May, 2020; v1 submitted 21 May, 2020; originally announced May 2020.

Journal ref: International Conference on Robotics and Automation 2020

arXiv:2005.00673 [pdf, other]

PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification Using Highly Randomized Synthetic Data

Authors: Zheng Tang, Milind Naphade, Stan Birchfield, Jonathan Tremblay, William Hodge, Ratnesh Kumar, Shuo Wang, Xiaodong Yang

Abstract: In comparison with person re-identification (ReID), which has been widely studied in the research community, vehicle ReID has received less attention. Vehicle ReID is challenging due to 1) high intra-class variability (caused by the dependency of shape and appearance on viewpoint), and 2) small inter-class variability (caused by the similarity in shape and appearance between vehicles produced by different manufacturers). To address these challenges, we propose a Pose-Aware Multi-Task Re-Identification (PAMTRI) framework. This approach includes two innovations compared with previous methods. First, it overcomes viewpoint-dependency by explicitly reasoning about vehicle pose and shape via keypoints, heatmaps and segments from pose estimation. Second, it jointly classifies semantic vehicle attributes (colors and types) while performing ReID, through multi-task learning with the embedded pose representations. Since manually labeling images with detailed pose and attribute information is prohibitive, we create a large-scale highly randomized synthetic dataset with automatically annotated vehicle attributes for training. Extensive experiments validate the effectiveness of each proposed component, showing that PAMTRI achieves significant improvement over state-of-the-art on two mainstream vehicle ReID benchmarks: VeRi and CityFlow-ReID. Code and models are available at https://github.com/NVlabs/PAMTRI.

Submitted 1 May, 2020; originally announced May 2020.

Comments: Accepted by ICCV 2019

arXiv:1911.09233 [pdf, other]

Contextual Reinforcement Learning of Visuo-tactile Multi-fingered Grasping Policies

Authors: Visak Kumar, Tucker Hermans, Dieter Fox, Stan Birchfield, Jonathan Tremblay

Abstract: Using simulation to train robot manipulation policies holds the promise of an almost unlimited amount of training data, generated safely out of harm's way. One of the key challenges of using simulation, to date, has been to bridge the reality gap, so that policies trained in simulation can be deployed in the real world. We explore the reality gap in the context of learning a contextual policy for multi-fingered robotic grasping. We propose a Grasping Objects Approach for Tactile (GOAT) robotic hands, learning to overcome the reality gap problem. In our approach we use human hand motion demonstration to initialize and reduce the search space for learning. We contextualize our policy with the bounding cuboid dimensions of the object of interest, which allows the policy to work on a more flexible representation than directly using an image or point cloud. Leveraging fingertip touch sensors in the hand allows the policy to overcome the reduction in geometric information introduced by the coarse bounding box, as well as pose estimation uncertainty. We show our learned policy successfully runs on a real robot without any fine tuning, thus bridging the reality gap.

Submitted 24 November, 2019; v1 submitted 20 November, 2019; originally announced November 2019.

arXiv:1911.09231 [pdf, other]

Camera-to-Robot Pose Estimation from a Single Image

Authors: Timothy E. Lee, Jonathan Tremblay, Thang To, Jia Cheng, Terry Mosier, Oliver Kroemer, Dieter Fox, Stan Birchfield

Abstract: We present an approach for estimating the pose of an external camera with respect to a robot using a single RGB image of the robot. The image is processed by a deep neural network to detect 2D projections of keypoints (such as joints) associated with the robot. The network is trained entirely on simulated data using domain randomization to bridge the reality gap. Perspective-n-point (PnP) is then used to recover the camera extrinsics, assuming that the camera intrinsics and joint configuration of the robot manipulator are known. Unlike classic hand-eye calibration systems, our method does not require an off-line calibration step. Rather, it is capable of computing the camera extrinsics from a single frame, thus opening the possibility of on-line calibration. We show experimental results for three different robots and camera sensors, demonstrating that our approach is able to achieve accuracy with a single frame that is comparable to that of classic off-line hand-eye calibration using multiple frames. With additional frames from a static pose, accuracy improves even further. Code, datasets, and pretrained models for three widely-used robot manipulators are made available.

Submitted 23 April, 2020; v1 submitted 20 November, 2019; originally announced November 2019.

Comments: ICRA 2020. Project page is at https://research.nvidia.com/publication/2020-03_DREAM

arXiv:1909.10740 [pdf, other]

doi 10.1103/PhysRevA.102.063103

Probing molecular chirality via laser-induced electronic fluxes

Authors: Sucharita Giri, Alexandra Maxi Dudzinski, Jean Christophe Tremblay, Gopal Dixit

Abstract: Chirality is ubiquitous in nature and of fundamental importance in science. The present work focuses on understanding the conditions required to modify the chirality during ultrafast electronic motion by bringing enantiomers out-of-equilibrium. Different kinds of ultrashort linearly-polarised laser pulses are used to drive an ultrafast charge migration process by the excitation of a small number of low-lying excited states from the ground electronic state of S- and R-epoxypropane. Control over chiral electron dynamics is achieved by choosing the different orientations of the linearly polarised pulse. We find that chirality breaking electric fields are only possible in oriented molecules, and that charge migration remains chiral when the polarisation of the field lies in the mirror plane defining the enantiomer pair, or when it is strictly perpendicular to it. Ultimately, the presence or the absence of a mirror symmetry for the enantiomer pair in the external field determines the chiral properties of the charge migration process.

Submitted 24 September, 2019; originally announced September 2019.

Comments: 16 pages, 7 figures

Journal ref: Phys. Rev. A 102, 063103 (2020)

Showing 1–50 of 66 results for author: Tremblay, J