Search | arXiv e-print repository

Multi-step manipulation task and motion planning guided by video demonstration

Authors: Kateryna Zorina, David Kovar, Mederic Fourmy, Florent Lamiraux, Nicolas Mansard, Justin Carpentier, Josef Sivic, Vladimir Petrik

Abstract: This work aims to leverage instructional video to solve complex multi-step task-and-motion planning tasks in robotics. Towards this goal, we propose an extension of the well-established Rapidly-Exploring Random Tree (RRT) planner, which simultaneously grows multiple trees around grasp and release states extracted from the guiding video. Our key novelty lies in combining contact states and 3D objec… ▽ More This work aims to leverage instructional video to solve complex multi-step task-and-motion planning tasks in robotics. Towards this goal, we propose an extension of the well-established Rapidly-Exploring Random Tree (RRT) planner, which simultaneously grows multiple trees around grasp and release states extracted from the guiding video. Our key novelty lies in combining contact states and 3D object poses extracted from the guiding video with a traditional planning algorithm that allows us to solve tasks with sequential dependencies, for example, if an object needs to be placed at a specific location to be grasped later. We also investigate the generalization capabilities of our approach to go beyond the scene depicted in the instructional video. To demonstrate the benefits of the proposed video-guided planning approach, we design a new benchmark with three challenging tasks: (I) 3D re-arrangement of multiple objects between a table and a shelf, (ii) multi-step transfer of an object through a tunnel, and (iii) transferring objects using a tray similar to a waiter transfers dishes. We demonstrate the effectiveness of our planning algorithm on several robots, including the Franka Emika Panda and the KUKA KMR iiwa. For a seamless transfer of the obtained plans to the real robot, we develop a trajectory refinement approach formulated as an optimal control problem (OCP). △ Less

Submitted 13 May, 2025; originally announced May 2025.

arXiv:2504.02812 [pdf, other]

BOP Challenge 2024 on Model-Based and Model-Free 6D Object Pose Estimation

Authors: Van Nguyen Nguyen, Stephen Tyree, Andrew Guo, Mederic Fourmy, Anas Gouda, Taeyeop Lee, Sungphill Moon, Hyeontae Son, Lukas Ranftl, Jonathan Tremblay, Eric Brachmann, Bertram Drost, Vincent Lepetit, Carsten Rother, Stan Birchfield, Jiri Matas, Yann Labbe, Martin Sundermeyer, Tomas Hodan

Abstract: We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the 6th in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods… ▽ More We present the evaluation methodology, datasets and results of the BOP Challenge 2024, the 6th in a series of public competitions organized to capture the state of the art in 6D object pose estimation and related tasks. In 2024, our goal was to transition BOP from lab-like setups to real-world scenarios. First, we introduced new model-free tasks, where no 3D object models are available and methods need to onboard objects just from provided reference videos. Second, we defined a new, more practical 6D object detection task where identities of objects visible in a test image are not provided as input. Third, we introduced new BOP-H3 datasets recorded with high-resolution sensors and AR/VR headsets, closely resembling real-world scenarios. BOP-H3 include 3D models and onboarding videos to support both model-based and model-free tasks. Participants competed on seven challenge tracks. Notably, the best 2024 method for model-based 6D localization of unseen objects (FreeZeV2.1) achieves 22% higher accuracy on BOP-Classic-Core than the best 2023 method (GenFlow), and is only 4% behind the best 2023 method for seen objects (GPose2023) although being significantly slower (24.9 vs 2.7s per image). A more practical 2024 method for this task is Co-op which takes only 0.8s per image and is 13% more accurate than GenFlow. Methods have similar rankings on 6D detection as on 6D localization but higher run time. On model-based 2D detection of unseen objects, the best 2024 method (MUSE) achieves 21--29% relative improvement compared to the best 2023 method (CNOS). However, the 2D detection accuracy for unseen objects is still -35% behind the accuracy for seen objects (GDet2023), and the 2D detection stage is consequently the main bottleneck of existing pipelines for 6D localization/detection of unseen objects. The online evaluation system stays open and is available at http://bop.felk.cvut.cz/ △ Less

Submitted 23 April, 2025; v1 submitted 3 April, 2025; originally announced April 2025.

Comments: arXiv admin note: text overlap with arXiv:2403.09799

arXiv:2503.23587 [pdf, other]

PhysPose: Refining 6D Object Poses with Physical Constraints

Authors: Martin Malenický, Martin Cífka, Médéric Fourmy, Louis Montaut, Justin Carpentier, Josef Sivic, Vladimir Petrik

Abstract: Accurate 6D object pose estimation from images is a key problem in object-centric scene understanding, enabling applications in robotics, augmented reality, and scene reconstruction. Despite recent advances, existing methods often produce physically inconsistent pose estimates, hindering their deployment in real-world scenarios. We introduce PhysPose, a novel approach that integrates physical reas… ▽ More Accurate 6D object pose estimation from images is a key problem in object-centric scene understanding, enabling applications in robotics, augmented reality, and scene reconstruction. Despite recent advances, existing methods often produce physically inconsistent pose estimates, hindering their deployment in real-world scenarios. We introduce PhysPose, a novel approach that integrates physical reasoning into pose estimation through a postprocessing optimization enforcing non-penetration and gravitational constraints. By leveraging scene geometry, PhysPose refines pose estimates to ensure physical plausibility. Our approach achieves state-of-the-art accuracy on the YCB-Video dataset from the BOP benchmark and improves over the state-of-the-art pose estimation methods on the HOPE-Video dataset. Furthermore, we demonstrate its impact in robotics by significantly improving success rates in a challenging pick-and-place task, highlighting the importance of physical consistency in real-world applications. △ Less

Submitted 30 March, 2025; originally announced March 2025.

Comments: Project page: https://data.ciirc.cvut.cz/public/projects/2025PhysPose

arXiv:2503.10307 [pdf, other]

6D Object Pose Tracking in Internet Videos for Robotic Manipulation

Authors: Georgy Ponimatkin, Martin Cífka, Tomáš Souček, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic

Abstract: We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, subtle but dynamic object motions, and the fact that the exact mesh of the manipulated object is not known. To address these challenges, we present the following cont… ▽ More We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, subtle but dynamic object motions, and the fact that the exact mesh of the manipulated object is not known. To address these challenges, we present the following contributions. First, we develop a new method that estimates the 6D pose of any object in the input image without prior knowledge of the object itself. The method proceeds by (i) retrieving a CAD model similar to the depicted object from a large-scale model database, (ii) 6D aligning the retrieved CAD model with the input image, and (iii) grounding the absolute scale of the object with respect to the scene. Second, we extract smooth 6D object trajectories from Internet videos by carefully tracking the detected objects across video frames. The extracted object trajectories are then retargeted via trajectory optimization into the configuration space of a robotic manipulator. Third, we thoroughly evaluate and ablate our 6D pose estimation method on YCB-V and HOPE-Video datasets as well as a new dataset of instructional videos manually annotated with approximate 6D object trajectories. We demonstrate significant improvements over existing state-of-the-art RGB 6D pose estimation methods. Finally, we show that the 6D object motion estimated from Internet videos can be transferred to a 7-axis robotic manipulator both in a virtual simulator as well as in a real world set-up. We also successfully apply our method to egocentric videos taken from the EPIC-KITCHENS dataset, demonstrating potential for Embodied AI applications. △ Less

Submitted 13 March, 2025; originally announced March 2025.

Comments: Accepted to ICLR 2025. Project page available at https://ponimatkin.github.io/wildpose/

arXiv:2311.05344 [pdf, other]

Visually Guided Model Predictive Robot Control via 6D Object Pose Localization and Tracking

Authors: Mederic Fourmy, Vojtech Priban, Jan Kristof Behrens, Nicolas Mansard, Josef Sivic, Vladimir Petrik

Abstract: The objective of this work is to enable manipulation tasks with respect to the 6D pose of a dynamically moving object using a camera mounted on a robot. Examples include maintaining a constant relative 6D pose of the robot arm with respect to the object, grasping the dynamically moving object, or co-manipulating the object together with a human. Fast and accurate 6D pose estimation is crucial to a… ▽ More The objective of this work is to enable manipulation tasks with respect to the 6D pose of a dynamically moving object using a camera mounted on a robot. Examples include maintaining a constant relative 6D pose of the robot arm with respect to the object, grasping the dynamically moving object, or co-manipulating the object together with a human. Fast and accurate 6D pose estimation is crucial to achieve smooth and stable robot control in such situations. The contributions of this work are three fold. First, we propose a new visual perception module that asynchronously combines accurate learning-based 6D object pose localizer and a high-rate model-based 6D pose tracker. The outcome is a low-latency accurate and temporally consistent 6D object pose estimation from the input video stream at up to 120 Hz. Second, we develop a visually guided robot arm controller that combines the new visual perception module with a torque-based model predictive control algorithm. Asynchronous combination of the visual and robot proprioception signals at their corresponding frequencies results in stable and robust 6D object pose guided robot arm control. Third, we experimentally validate the proposed approach on a challenging 6D pose estimation benchmark and demonstrate 6D object pose-guided control with dynamically moving objects on a real 7 DoF Franka Emika Panda robot. △ Less

Submitted 9 November, 2023; originally announced November 2023.

arXiv:2110.12919 [pdf, other]

WOLF: A modular estimation framework for robotics based on factor graphs

Authors: Joan Sola, Joan Vallve, Joaquim Casals, Jeremie Deray, Mederic Fourmy, Dinesh Atchuthan, Andreu Corominas-Murtra, Juan Andrade-Cetto

Abstract: This paper introduces WOLF, a C++ estimation framework based on factor graphs and targeted at mobile robotics. WOLF can be used beyond SLAM to handle self-calibration, model identification, or the observation of dynamic quantities other than localization. The architecture of WOLF allows for a modular yet tightly-coupled estimator. Modularity is enhanced via reusable plugins that are loaded at runt… ▽ More This paper introduces WOLF, a C++ estimation framework based on factor graphs and targeted at mobile robotics. WOLF can be used beyond SLAM to handle self-calibration, model identification, or the observation of dynamic quantities other than localization. The architecture of WOLF allows for a modular yet tightly-coupled estimator. Modularity is enhanced via reusable plugins that are loaded at runtime depending on application setup. This setup is achieved conveniently through YAML files, allowing users to configure a wide range of applications without the need of writing or compiling code. Most procedures are coded as abstract algorithms in base classes with varying levels of specialization. Overall, all these assets allow for coherent processing and favor code re-usability and scalability. WOLF can be used with ROS, and is made publicly available and open to collaboration. △ Less

Submitted 16 February, 2022; v1 submitted 25 October, 2021; originally announced October 2021.

Comments: 8 pages, 12 figures. v1: removed repository link. v2: add TERRINET thanks. v3: full review, authors addition and fix. v4: Fix one citation

Showing 1–6 of 6 results for author: Fourmy, M