Search | arXiv e-print repository

Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking

Authors: Bastian Pätzold, Jan Nogga, Sven Behnke

Abstract: This paper introduces a novel approach that leverages the capabilities of vision-language models (VLMs) by integrating them with established approaches for open-vocabulary detection (OVD), instance segmentation, and tracking. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extr… ▽ More This paper introduces a novel approach that leverages the capabilities of vision-language models (VLMs) by integrating them with established approaches for open-vocabulary detection (OVD), instance segmentation, and tracking. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing precise segmentation masks and tracking capabilities. Once initialized, this model can then directly extract segmentation masks, allowing processing of image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and corresponding open-vocabulary detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments. △ Less

Submitted 18 March, 2025; originally announced March 2025.

Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)

arXiv:2412.14989 [pdf, other]

RoboCup@Home 2024 OPL Winner NimbRo: Anthropomorphic Service Robots using Foundation Models for Perception and Planning

Authors: Raphael Memmesheimer, Jan Nogga, Bastian Pätzold, Evgenii Kruzhkov, Simon Bultmann, Michael Schreiber, Jonas Bode, Bertan Karacora, Juhui Park, Alena Savinykh, Sven Behnke

Abstract: We present the approaches and contributions of the winning team NimbRo@Home at the RoboCup@Home 2024 competition in the Open Platform League held in Eindhoven, NL. Further, we describe our hardware setup and give an overview of the results for the task stages and the final demonstration. For this year's competition, we put a special emphasis on open-vocabulary object segmentation and grasping appr… ▽ More We present the approaches and contributions of the winning team NimbRo@Home at the RoboCup@Home 2024 competition in the Open Platform League held in Eindhoven, NL. Further, we describe our hardware setup and give an overview of the results for the task stages and the final demonstration. For this year's competition, we put a special emphasis on open-vocabulary object segmentation and grasping approaches that overcome the labeling overhead of supervised vision approaches, commonly used in RoboCup@Home. We successfully demonstrated that we can segment and grasp non-labeled objects by text descriptions. Further, we extensively employed LLMs for natural language understanding and task planning. Throughout the competition, our approaches showed robustness and generalization capabilities. A video of our performance can be found online. △ Less

Submitted 19 December, 2024; originally announced December 2024.

Comments: 12 pages, 8 figures, RoboCup 2024 Champion Paper

arXiv:2410.22997 [pdf, other]

A Comparison of Prompt Engineering Techniques for Task Planning and Execution in Service Robotics

Authors: Jonas Bode, Bastian Pätzold, Raphael Memmesheimer, Sven Behnke

Abstract: Recent advances in LLM have been instrumental in autonomous robot control and human-robot interaction by leveraging their vast general knowledge and capabilities to understand and reason across a wide range of tasks and scenarios. Previous works have investigated various prompt engineering techniques for improving the performance of LLM to accomplish tasks, while others have proposed methods that… ▽ More Recent advances in LLM have been instrumental in autonomous robot control and human-robot interaction by leveraging their vast general knowledge and capabilities to understand and reason across a wide range of tasks and scenarios. Previous works have investigated various prompt engineering techniques for improving the performance of LLM to accomplish tasks, while others have proposed methods that utilize LLMs to plan and execute tasks based on the available functionalities of a given robot platform. In this work, we consider both lines of research by comparing prompt engineering techniques and combinations thereof within the application of high-level task planning and execution in service robotics. We define a diverse set of tasks and a simple set of functionalities in simulation, and measure task completion accuracy and execution time for several state-of-the-art models. △ Less

Submitted 6 November, 2024; v1 submitted 30 October, 2024; originally announced October 2024.

Comments: 6 pages, 3 figures, 2 tables, to be published in the 2024 IEEE-RAS International Conference on Humanoid Robots, We make our code, including all prompts, available at https://github.com/AIS-Bonn/Prompt_Engineering

arXiv:2308.12238 [pdf, other]

NimbRo wins ANA Avatar XPRIZE Immersive Telepresence Competition: Human-Centric Evaluation and Lessons Learned

Authors: Christian Lenz, Max Schwarz, Andre Rochow, Bastian Pätzold, Raphael Memmesheimer, Michael Schreiber, Sven Behnke

Abstract: Robotic avatar systems can enable immersive telepresence with locomotion, manipulation, and communication capabilities. We present such an avatar system, based on the key components of immersive 3D visualization and transparent force-feedback telemanipulation. Our avatar robot features an anthropomorphic upper body with dexterous hands. The remote human operator drives the arms and fingers through… ▽ More Robotic avatar systems can enable immersive telepresence with locomotion, manipulation, and communication capabilities. We present such an avatar system, based on the key components of immersive 3D visualization and transparent force-feedback telemanipulation. Our avatar robot features an anthropomorphic upper body with dexterous hands. The remote human operator drives the arms and fingers through an exoskeleton-based operator station, which provides force feedback both at the wrist and for each finger. The robot torso is mounted on a holonomic base, providing omnidirectional locomotion on flat floors, controlled using a 3D rudder device. Finally, the robot features a 6D movable head with stereo cameras, which stream images to a VR display worn by the operator. Movement latency is hidden using spherical rendering. The head also carries a telepresence screen displaying an animated image of the operator's face, enabling direct interaction with remote persons. Our system won the \$10M ANA Avatar XPRIZE competition, which challenged teams to develop intuitive and immersive avatar systems that could be operated by briefly trained judges. We analyze our successful participation in the semifinals and finals and provide insight into our operator training and lessons learned. In addition, we evaluate our system in a user study that demonstrates its intuitive and easy usability. △ Less

Submitted 28 August, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

Comments: C. Lenz and M. Schwarz contributed equally. Accepted for International Journal of Social Robotics (SORO), Springer, to appear 2023

arXiv:2303.07186 [pdf, other]

Audio-based Roughness Sensing and Tactile Feedback for Haptic Perception in Telepresence

Authors: Bastian Pätzold, Andre Rochow, Michael Schreiber, Raphael Memmesheimer, Christian Lenz, Max Schwarz, Sven Behnke

Abstract: Haptic perception is highly important for immersive teleoperation of robots, especially for accomplishing manipulation tasks. We propose a low-cost haptic sensing and rendering system, which is capable of detecting and displaying surface roughness. As the robot fingertip moves across a surface of interest, two microphones capture sound coupled directly through the fingertip and through the air, re… ▽ More Haptic perception is highly important for immersive teleoperation of robots, especially for accomplishing manipulation tasks. We propose a low-cost haptic sensing and rendering system, which is capable of detecting and displaying surface roughness. As the robot fingertip moves across a surface of interest, two microphones capture sound coupled directly through the fingertip and through the air, respectively. A learning-based detector system analyzes the data in real time and gives roughness estimates with both high temporal resolution and low latency. Finally, an audio-based vibrational actuator displays the result to the human operator. We demonstrate the effectiveness of our system through lab experiments and our winning entry in the ANA Avatar XPRIZE competition finals, where briefly trained judges solved a roughness-based selection task even without additional vision feedback. We publish our dataset used for training and evaluation together with our trained models to enable reproducibility of results. △ Less

Submitted 16 October, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

Comments: IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, Hawaii, USA, October 2023

arXiv:2303.03297 [pdf, other]

Robust Immersive Telepresence and Mobile Telemanipulation: NimbRo wins ANA Avatar XPRIZE Finals

Authors: Max Schwarz, Christian Lenz, Raphael Memmesheimer, Bastian Pätzold, Andre Rochow, Michael Schreiber, Sven Behnke

Abstract: Robotic avatar systems promise to bridge distances and reduce the need for travel. We present the updated NimbRo avatar system, winner of the $5M grand prize at the international ANA Avatar XPRIZE competition, which required participants to build intuitive and immersive robotic telepresence systems that could be operated by briefly trained operators. We describe key improvements for the finals, co… ▽ More Robotic avatar systems promise to bridge distances and reduce the need for travel. We present the updated NimbRo avatar system, winner of the $5M grand prize at the international ANA Avatar XPRIZE competition, which required participants to build intuitive and immersive robotic telepresence systems that could be operated by briefly trained operators. We describe key improvements for the finals, compared to the system used in the semifinals: To operate without a power- and communications tether, we integrated a battery and a robust redundant wireless communication system. Video and audio data are compressed using low-latency HEVC and Opus codecs. We propose a new locomotion control device with tunable resistance force. To increase flexibility, the robot's upper-body height can be adjusted by the operator. We describe essential monitoring and robustness tools which enabled the success at the competition. Finally, we analyze our performance at the competition finals and discuss lessons learned. △ Less

Submitted 6 December, 2023; v1 submitted 6 March, 2023; originally announced March 2023.

Comments: Accepted for IEEE-RAS International Conference on Humanoid Robots (HUMANOIDS) 2023. M. Schwarz and C. Lenz contributed equally

arXiv:2209.07393 [pdf, other]

Online Marker-free Extrinsic Camera Calibration using Person Keypoint Detections

Authors: Bastian Pätzold, Simon Bultmann, Sven Behnke

Abstract: Calibration of multi-camera systems, i.e. determining the relative poses between the cameras, is a prerequisite for many tasks in computer vision and robotics. Camera calibration is typically achieved using offline methods that use checkerboard calibration targets. These methods, however, often are cumbersome and lengthy, considering that a new calibration is required each time any camera pose cha… ▽ More Calibration of multi-camera systems, i.e. determining the relative poses between the cameras, is a prerequisite for many tasks in computer vision and robotics. Camera calibration is typically achieved using offline methods that use checkerboard calibration targets. These methods, however, often are cumbersome and lengthy, considering that a new calibration is required each time any camera pose changes. In this work, we propose a novel, marker-free online method for the extrinsic calibration of multiple smart edge sensors, relying solely on 2D human keypoint detections that are computed locally on the sensor boards from RGB camera images. Our method assumes the intrinsic camera parameters to be known and requires priming with a rough initial estimate of the camera poses. The person keypoint detections from multiple views are received at a central backend where they are synchronized, filtered, and assigned to person hypotheses. We use these person hypotheses to repeatedly solve optimization problems in the form of factor graphs. Given suitable observations of one or multiple persons traversing the scene, the estimated camera poses converge towards a coherent extrinsic calibration within a few minutes. We evaluate our approach in real-world settings and show that the calibration with our method achieves lower reprojection errors compared to a reference calibration generated by an offline method using a traditional calibration target. △ Less

Submitted 15 September, 2022; originally announced September 2022.

Comments: DAGM German Conference on Pattern Recognition (GCPR), Konstanz, September 2022

Showing 1–7 of 7 results for author: Pätzold, B