-
NVP-HRI: Zero Shot Natural Voice and Posture-based Human-Robot Interaction via Large Language Model
Authors:
Yuzhi Lai,
Shenghai Yuan,
Youssef Nassar,
Mingyu Fan,
Thomas Weber,
Matthias Rätsch
Abstract:
Effective Human-Robot Interaction (HRI) is crucial for future service robots in aging societies. Existing solutions are biased toward only well-trained objects, creating a gap when dealing with new objects. Currently, HRI systems using predefined gestures or language tokens for pretrained objects pose challenges for all individuals, especially elderly ones. These challenges include difficulties in recalling commands, memorizing hand gestures, and learning new names. This paper introduces NVP-HRI, an intuitive multi-modal HRI paradigm that combines voice commands and deictic posture. NVP-HRI utilizes the Segment Anything Model (SAM) to analyze visual cues and depth data, enabling precise structural object representation. Through a pre-trained SAM network, NVP-HRI allows interaction with new objects via zero-shot prediction, even without prior knowledge. NVP-HRI also integrates with a large language model (LLM) for multimodal commands, coordinating them with object selection and scene distribution in real time for collision-free trajectory solutions. We also regulate the action sequence with the essential control syntax to reduce LLM hallucination risks. The evaluation of diverse real-world tasks using a Universal Robot showcased up to 59.2% efficiency improvement over traditional gesture control, as illustrated in the video https://youtu.be/EbC7al2wiAc. Our code and design will be openly available at https://github.com/laiyuzhi/NVP-HRI.git.
Submitted 12 March, 2025;
originally announced March 2025.
-
Natural Multimodal Fusion-Based Human-Robot Interaction: Application With Voice and Deictic Posture via Large Language Model
Authors:
Yuzhi Lai,
Shenghai Yuan,
Youssef Nassar,
Mingyu Fan,
Atmaraaj Gopal,
Arihiro Yorita,
Naoyuki Kubota,
Matthias Rätsch
Abstract:
Translating human intent into robot commands is crucial for the future of service robots in an aging society. Existing Human-Robot Interaction (HRI) systems relying on gestures or verbal commands are impractical for the elderly due to difficulties with complex syntax or sign language. To address the challenge, this paper introduces a multi-modal interaction framework that combines voice and deictic posture information to create a more natural HRI system. The visual cues are first processed by the object detection model to gain a global understanding of the environment, and then bounding boxes are estimated based on depth information. By using a large language model (LLM) with voice-to-text commands and temporally aligned selected bounding boxes, robot action sequences can be generated, while key control syntax constraints are applied to avoid potential LLM hallucination issues. The system is evaluated on real-world tasks with varying levels of complexity using a Universal Robots UR3e manipulator. Our method demonstrates significantly better performance in HRI in terms of accuracy and robustness. To benefit the research community and the general public, we will make our code and design open-source.
Submitted 4 April, 2025; v1 submitted 1 January, 2025;
originally announced January 2025.
-
Large problems are not necessarily hard: A case study on distributed NMPC paying off
Authors:
Gösta Stomberg,
Maurice Raetsch,
Alexander Engelmann,
Timm Faulwasser
Abstract:
A key motivation in the development of Distributed Model Predictive Control (DMPC) is to accelerate centralized Model Predictive Control (MPC) for large-scale systems. DMPC has the prospect of scaling well by parallelizing computations among subsystems. However, communication delays may deteriorate the performance of decentralized optimization, if excessively many iterations are required per control step. Moreover, centralized solvers often exhibit faster asymptotic convergence rates and, by parallelizing costly linear algebra operations, they can also benefit from modern multicore computing architectures. On this canvas, we study the computational performance of cooperative DMPC for linear and nonlinear systems. To this end, we apply a tailored decentralized real-time iteration scheme to frequency control for power systems. DMPC scales well for the considered linear and nonlinear benchmarks, as the iteration number does not depend on the number of subsystems. Comparisons with multi-threaded centralized solvers demonstrate competitive performance of the proposed decentralized optimization algorithms.
Submitted 15 April, 2025; v1 submitted 8 November, 2024;
originally announced November 2024.
-
Ethically aligned Deep Learning: Unbiased Facial Aesthetic Prediction
Authors:
Michael Danner,
Thomas Weber,
Leping Peng,
Tobias Gerlach,
Xueping Su,
Matthias Rätsch
Abstract:
Facial beauty prediction (FBP) aims to develop a machine that automatically assesses facial attractiveness. In the past, those results were highly correlated with human ratings, and therefore also with the annotators' bias. As artificial intelligence can have racist and discriminatory tendencies, the cause of skews in the data must be identified. Development of training data and AI algorithms that are robust against biased information is a new challenge for scientists. As aesthetic judgement usually is biased, we want to take it one step further and propose an Unbiased Convolutional Neural Network for FBP. While it is possible to create network models that can rate attractiveness of faces on a high level, from an ethical point of view, it is equally important to make sure the model is unbiased. In this work, we introduce AestheticNet, a state-of-the-art attractiveness prediction network, which significantly outperforms competitors with a Pearson Correlation of 0.9601. Additionally, we propose a new approach for generating a bias-free CNN to improve fairness in machine learning.
Submitted 9 November, 2021;
originally announced November 2021.
-
Evaluation of Dense 3D Reconstruction from 2D Face Images in the Wild
Authors:
Zhen-Hua Feng,
Patrik Huber,
Josef Kittler,
Peter JB Hancock,
Xiao-Jun Wu,
Qijun Zhao,
Paul Koppen,
Matthias Rätsch
Abstract:
This paper investigates the evaluation of dense 3D face reconstruction from a single 2D image in the wild. To this end, we organise a competition that provides a new benchmark dataset that contains 2000 2D facial images of 135 subjects as well as their 3D ground truth face scans. In contrast to previous competitions or challenges, the aim of this new benchmark dataset is to evaluate the accuracy of a 3D dense face reconstruction algorithm using real, accurate and high-resolution 3D ground truth face scans. In addition to the dataset, we provide a standard protocol as well as a Python script for the evaluation. Last, we report the results obtained by three state-of-the-art 3D face reconstruction systems on the new benchmark dataset. The competition is organised along with the 2018 13th IEEE Conference on Automatic Face & Gesture Recognition.
Submitted 20 April, 2018; v1 submitted 14 March, 2018;
originally announced March 2018.
-
Methodology to analyze the accuracy of 3D objects reconstructed with collaborative robot based monocular LSD-SLAM
Authors:
Sergey Triputen,
Atmaraaj Gopal,
Thomas Weber,
Christian Hofert,
Kristiaan Schreve,
Matthias Ratsch
Abstract:
SLAM systems are mainly applied for robot navigation, while research on the feasibility of motion planning with SLAM for tasks like bin-picking is scarce. Accurate 3D reconstruction of objects and environments is important for planning motion and computing the optimal gripper pose to grasp objects. In this work, we propose methods to analyze the accuracy of a 3D environment reconstructed using an LSD-SLAM system with a monocular camera mounted onto the gripper of a collaborative robot. We discuss and propose a solution to the pose space conversion problem. Finally, we present several criteria to analyze the 3D reconstruction accuracy. These could be used as guidelines to improve the accuracy of 3D reconstructions with monocular LSD-SLAM and other SLAM based solutions.
Submitted 6 March, 2018;
originally announced March 2018.
-
Closed-form Solution for IMU based LSD-SLAM Point Cloud Conversion into the Scaled 3D World Environment
Authors:
Sergey Triputen,
Kristiaan Schreve,
Viktor Tkachev,
Matthias Ratsch
Abstract:
SLAM is a very popular research stream in computer vision and robotics nowadays. For more effective SLAM implementation it is necessary to have reliable information about the environment; the data should also be aligned and scaled according to the real world coordinate system. Monocular SLAM research is an attractive sub-stream, because of the low equipment cost, size and weight. In this paper we present a way to build a conversion from LSD-SLAM coordinate space to the real world coordinates using a true metric scale with IMU sensor data implementation. The causes of differences between the real and calculated spaces are explained and the possibility of conversions between the spaces is proved. Additionally, a closed-form solution for inter space transformation calculation is presented. The synthetic method of generating high level accurate and well controlled input data for the LSD-SLAM algorithm is presented. Finally, the reconstructed 3D environment representation is delivered as an output of the implemented conversion.
Submitted 19 July, 2017;
originally announced July 2017.
-
A 3D Face Modelling Approach for Pose-Invariant Face Recognition in a Human-Robot Environment
Authors:
Michael Grupp,
Philipp Kopp,
Patrik Huber,
Matthias Rätsch
Abstract:
Face analysis techniques have become a crucial component of human-machine interaction in the fields of assistive and humanoid robotics. However, the variations in head-pose that arise naturally in these environments are still a great challenge. In this paper, we present a real-time capable 3D face modelling framework for 2D in-the-wild images that is applicable for robotics. The fitting of the 3D Morphable Model is based exclusively on automatically detected landmarks. After fitting, the face can be corrected in pose and transformed back to a frontal 2D representation that is more suitable for face recognition. We conduct face recognition experiments with non-frontal images from the MUCT database and uncontrolled, in the wild images from the PaSC database, the most challenging face recognition database to date, showing an improved performance. Finally, we present our SCITOS G5 robot system, which incorporates our framework as a means of image pre-processing for face analysis.
Submitted 1 June, 2016;
originally announced June 2016.
-
3D Face Tracking and Texture Fusion in the Wild
Authors:
Patrik Huber,
Philipp Kopp,
Matthias Rätsch,
William Christmas,
Josef Kittler
Abstract:
We present a fully automatic approach to real-time 3D face reconstruction from monocular in-the-wild videos. With the use of a cascaded-regressor based face tracking and a 3D Morphable Face Model shape fitting, we obtain a semi-dense 3D face shape. We further use the texture information from multiple frames to build a holistic 3D face representation from the video frames. Our system is able to capture facial expressions and does not require any person-specific training. We demonstrate the robustness of our approach on the challenging 300 Videos in the Wild (300-VW) dataset. Our real-time fitting framework is available as an open source library at http://4dface.org.
Submitted 22 May, 2016;
originally announced May 2016.
-
Fitting 3D Morphable Models using Local Features
Authors:
Patrik Huber,
Zhen-Hua Feng,
William Christmas,
Josef Kittler,
Matthias Rätsch
Abstract:
In this paper, we propose a novel fitting method that uses local image features to fit a 3D Morphable Model to 2D images. To overcome the obstacle of optimising a cost function that contains a non-differentiable feature extraction operator, we use a learning-based cascaded regression method that learns the gradient direction from data. The method allows shape and pose parameters to be solved for simultaneously. Our method is thoroughly evaluated on Morphable Model generated data and first results on real data are presented. Compared to traditional fitting methods, which use simple raw features like pixel colour or edge maps, local features have been shown to be much more robust against variations in imaging conditions. Our approach is unique in that we are the first to use local features to fit a Morphable Model.
Because of the speed of our method, it is applicable for real-time applications. Our cascaded regression framework is available as an open source library (https://github.com/patrikhuber).
Submitted 8 March, 2015;
originally announced March 2015.