-
Human Motion Capture from Loose and Sparse Inertial Sensors with Garment-aware Diffusion Models
Authors:
Andela Ilic,
Jiaxi Jiang,
Paul Streli,
Xintong Liu,
Christian Holz
Abstract:
Motion capture using sparse inertial sensors has shown great promise due to its portability and lack of occlusion issues compared to camera-based tracking. Existing approaches typically assume that IMU sensors are tightly attached to the human body. However, this assumption often does not hold in real-world scenarios. In this paper, we present a new task of full-body human pose estimation using sp…
▽ More
Motion capture using sparse inertial sensors has shown great promise due to its portability and lack of occlusion issues compared to camera-based tracking. Existing approaches typically assume that IMU sensors are tightly attached to the human body. However, this assumption often does not hold in real-world scenarios. In this paper, we present a new task of full-body human pose estimation using sparse, loosely attached IMU sensors. To solve this task, we simulate IMU recordings from an existing garment-aware human motion dataset. We developed transformer-based diffusion models to synthesize loose IMU data and estimate human poses based on this challenging loose IMU data. In addition, we show that incorporating garment-related parameters while training the model on simulated loose data effectively maintains expressiveness and enhances the ability to capture variations introduced by looser or tighter garments. Experiments show that our proposed diffusion methods trained on simulated and synthetic data outperformed the state-of-the-art methods quantitatively and qualitatively, opening up a promising direction for future research.
△ Less
Submitted 18 June, 2025;
originally announced June 2025.
-
Continual Human-in-the-Loop Optimization
Authors:
Yi-Chi Liao,
Paul Streli,
Zhipeng Li,
Christoph Gebhardt,
Christian Holz
Abstract:
Optimal input settings vary across users due to differences in motor abilities and personal preferences, which are typically addressed by manual tuning or calibration. Although human-in-the-loop optimization has the potential to identify optimal settings during use, it is rarely applied due to its long optimization process. A more efficient approach would continually leverage data from previous us…
▽ More
Optimal input settings vary across users due to differences in motor abilities and personal preferences, which are typically addressed by manual tuning or calibration. Although human-in-the-loop optimization has the potential to identify optimal settings during use, it is rarely applied due to its long optimization process. A more efficient approach would continually leverage data from previous users to accelerate optimization, exploiting shared traits while adapting to individual characteristics. We introduce the concept of Continual Human-in-the-Loop Optimization and a Bayesian optimization-based method that leverages a Bayesian-neural-network surrogate model to capture population-level characteristics while adapting to new users. We propose a generative replay strategy to mitigate catastrophic forgetting. We demonstrate our method by optimizing virtual reality keyboard parameters for text entry using direct touch, showing reduced adaptation times with a growing user base. Our method opens the door for next-generation personalized input systems that improve with accumulated experience.
△ Less
Submitted 7 March, 2025;
originally announced March 2025.
-
EgoSim: An Egocentric Multi-view Simulator and Real Dataset for Body-worn Cameras during Motion and Activity
Authors:
Dominik Hollidt,
Paul Streli,
Jiaxi Jiang,
Yasaman Haghighi,
Changlin Qian,
Xintong Liu,
Christian Holz
Abstract:
Research on egocentric tasks in computer vision has mostly focused on head-mounted cameras, such as fisheye cameras or embedded cameras inside immersive headsets. We argue that the increasing miniaturization of optical sensors will lead to the prolific integration of cameras into many more body-worn devices at various locations. This will bring fresh perspectives to established tasks in computer v…
▽ More
Research on egocentric tasks in computer vision has mostly focused on head-mounted cameras, such as fisheye cameras or embedded cameras inside immersive headsets. We argue that the increasing miniaturization of optical sensors will lead to the prolific integration of cameras into many more body-worn devices at various locations. This will bring fresh perspectives to established tasks in computer vision and benefit key areas such as human motion tracking, body pose estimation, or action recognition -- particularly for the lower body, which is typically occluded.
In this paper, we introduce EgoSim, a novel simulator of body-worn cameras that generates realistic egocentric renderings from multiple perspectives across a wearer's body. A key feature of EgoSim is its use of real motion capture data to render motion artifacts, which are especially noticeable with arm- or leg-worn cameras. In addition, we introduce MultiEgoView, a dataset of egocentric footage from six body-worn cameras and ground-truth full-body 3D poses during several activities: 119 hours of data are derived from AMASS motion sequences in four high-fidelity virtual environments, which we augment with 5 hours of real-world motion data from 13 participants using six GoPro cameras and 3D body pose references from an Xsens motion capture suit.
We demonstrate EgoSim's effectiveness by training an end-to-end video-only 3D pose estimation network. Analyzing its domain gap, we show that our dataset and simulator substantially aid training for inference on real-world data.
EgoSim code & MultiEgoView dataset: https://siplab.org/projects/EgoSim
△ Less
Submitted 25 February, 2025;
originally announced February 2025.
-
SympCam: Remote Optical Measurement of Sympathetic Arousal
Authors:
Björn Braun,
Daniel McDuff,
Tadas Baltrusaitis,
Paul Streli,
Max Moebus,
Christian Holz
Abstract:
Recent work has shown that a person's sympathetic arousal can be estimated from facial videos alone using basic signal processing. This opens up new possibilities in the field of telehealth and stress management, providing a non-invasive method to measure stress only using a regular RGB camera. In this paper, we present SympCam, a new 3D convolutional architecture tailored to the task of remote sy…
▽ More
Recent work has shown that a person's sympathetic arousal can be estimated from facial videos alone using basic signal processing. This opens up new possibilities in the field of telehealth and stress management, providing a non-invasive method to measure stress only using a regular RGB camera. In this paper, we present SympCam, a new 3D convolutional architecture tailored to the task of remote sympathetic arousal prediction. Our model incorporates a temporal attention module (TAM) to enhance the temporal coherence of our sequential data processing capabilities. The predictions from our method improve accuracy metrics of sympathetic arousal in prior work by 48% to a mean correlation of 0.77. We additionally compare our method with common remote photoplethysmography (rPPG) networks and show that they alone cannot accurately predict sympathetic arousal "out-of-the-box". Furthermore, we show that the sympathetic arousal predicted by our method allows detecting physical stress with a balanced accuracy of 90% - an improvement of 61% compared to the rPPG method commonly used in related work, demonstrating the limitations of using rPPG alone. Finally, we contribute a dataset designed explicitly for the task of remote sympathetic arousal prediction. Our dataset contains synchronized face and hand videos of 20 participants from two cameras synchronized with electrodermal activity (EDA) and photoplethysmography (PPG) measurements. We will make this dataset available to the community and use it to evaluate the methods in this paper. To the best of our knowledge, this is the first dataset available to other researchers designed for remote sympathetic arousal prediction.
△ Less
Submitted 27 October, 2024;
originally announced October 2024.
-
TapType: Ten-finger text entry on everyday surfaces via Bayesian inference
Authors:
Paul Streli,
Jiaxi Jiang,
Andreas Fender,
Manuel Meier,
Hugo Romat,
Christian Holz
Abstract:
Despite the advent of touchscreens, typing on physical keyboards remains most efficient for entering text, because users can leverage all fingers across a full-size keyboard for convenient typing. As users increasingly type on the go, text input on mobile and wearable devices has had to compromise on full-size typing. In this paper, we present TapType, a mobile text entry system for full-size typi…
▽ More
Despite the advent of touchscreens, typing on physical keyboards remains most efficient for entering text, because users can leverage all fingers across a full-size keyboard for convenient typing. As users increasingly type on the go, text input on mobile and wearable devices has had to compromise on full-size typing. In this paper, we present TapType, a mobile text entry system for full-size typing on passive surfaces--without an actual keyboard. From the inertial sensors inside a band on either wrist, TapType decodes and relates surface taps to a traditional QWERTY keyboard layout. The key novelty of our method is to predict the most likely character sequences by fusing the finger probabilities from our Bayesian neural network classifier with the characters' prior probabilities from an n-gram language model. In our online evaluation, participants on average typed 19 words per minute with a character error rate of 0.6% after 30 minutes of training. Expert typists thereby consistently achieved more than 25 WPM at a similar error rate. We demonstrate applications of TapType in mobile use around smartphones and tablets, as a complement to interaction in situated Mixed Reality outside visual control, and as an eyes-free mobile text input method using an audio feedback-only interface.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
TouchInsight: Uncertainty-aware Rapid Touch and Text Input for Mixed Reality from Egocentric Vision
Authors:
Paul Streli,
Mark Richardson,
Fadi Botros,
Shugao Ma,
Robert Wang,
Christian Holz
Abstract:
While passive surfaces offer numerous benefits for interaction in mixed reality, reliably detecting touch input solely from head-mounted cameras has been a long-standing challenge. Camera specifics, hand self-occlusion, and rapid movements of both head and fingers introduce considerable uncertainty about the exact location of touch events. Existing methods have thus not been capable of achieving t…
▽ More
While passive surfaces offer numerous benefits for interaction in mixed reality, reliably detecting touch input solely from head-mounted cameras has been a long-standing challenge. Camera specifics, hand self-occlusion, and rapid movements of both head and fingers introduce considerable uncertainty about the exact location of touch events. Existing methods have thus not been capable of achieving the performance needed for robust interaction. In this paper, we present a real-time pipeline that detects touch input from all ten fingers on any physical surface, purely based on egocentric hand tracking. Our method TouchInsight comprises a neural network to predict the moment of a touch event, the finger making contact, and the touch location. TouchInsight represents locations through a bivariate Gaussian distribution to account for uncertainties due to sensing inaccuracies, which we resolve through contextual priors to accurately infer intended user input. We first evaluated our method offline and found that it locates input events with a mean error of 6.3 mm, and accurately detects touch events (F1=0.99) and identifies the finger used (F1=0.96). In an online evaluation, we then demonstrate the effectiveness of our approach for a core application of dexterous touch input: two-handed text entry. In our study, participants typed 37.0 words per minute with an uncorrected error rate of 2.9% on average.
△ Less
Submitted 8 October, 2024;
originally announced October 2024.
-
SituationAdapt: Contextual UI Optimization in Mixed Reality with Situation Awareness via LLM Reasoning
Authors:
Zhipeng Li,
Christoph Gebhardt,
Yves Inglin,
Nicolas Steck,
Paul Streli,
Christian Holz
Abstract:
Mixed Reality is increasingly used in mobile settings beyond controlled home and office spaces. This mobility introduces the need for user interface layouts that adapt to varying contexts. However, existing adaptive systems are designed only for static environments. In this paper, we introduce SituationAdapt, a system that adjusts Mixed Reality UIs to real-world surroundings by considering environ…
▽ More
Mixed Reality is increasingly used in mobile settings beyond controlled home and office spaces. This mobility introduces the need for user interface layouts that adapt to varying contexts. However, existing adaptive systems are designed only for static environments. In this paper, we introduce SituationAdapt, a system that adjusts Mixed Reality UIs to real-world surroundings by considering environmental and social cues in shared settings. Our system consists of perception, reasoning, and optimization modules for UI adaptation. Our perception module identifies objects and individuals around the user, while our reasoning module leverages a Vision-and-Language Model to assess the placement of interactive UI elements. This ensures that adapted layouts do not obstruct relevant environmental cues or interfere with social norms. Our optimization module then generates Mixed Reality interfaces that account for these considerations as well as temporal constraints. For evaluation, we first validate our reasoning module's capability of assessing UI contexts in comparison to human expert users. In an online user study, we then establish SituationAdapt's capability of producing context-aware layouts for Mixed Reality, where it outperformed previous adaptive layout methods. We conclude with a series of applications and scenarios to demonstrate SituationAdapt's versatility.
△ Less
Submitted 19 September, 2024;
originally announced September 2024.
-
EgoPressure: A Dataset for Hand Pressure and Pose Estimation in Egocentric Vision
Authors:
Yiming Zhao,
Taein Kwon,
Paul Streli,
Marc Pollefeys,
Christian Holz
Abstract:
Touch contact and pressure are essential for understanding how humans interact with and manipulate objects, insights which can significantly benefit applications in mixed reality and robotics. However, estimating these interactions from an egocentric camera perspective is challenging, largely due to the lack of comprehensive datasets that provide both accurate hand poses on contacting surfaces and…
▽ More
Touch contact and pressure are essential for understanding how humans interact with and manipulate objects, insights which can significantly benefit applications in mixed reality and robotics. However, estimating these interactions from an egocentric camera perspective is challenging, largely due to the lack of comprehensive datasets that provide both accurate hand poses on contacting surfaces and detailed annotations of pressure information. In this paper, we introduce EgoPressure, a novel egocentric dataset that captures detailed touch contact and pressure interactions. EgoPressure provides high-resolution pressure intensity annotations for each contact point and includes accurate hand pose meshes obtained through our proposed multi-view, sequence-based optimization method processing data from an 8-camera capture rig. Our dataset comprises 5 hours of recorded interactions from 21 participants captured simultaneously by one head-mounted and seven stationary Kinect cameras, which acquire RGB images and depth maps at 30 Hz. To support future research and benchmarking, we present several baseline models for estimating applied pressure on external surfaces from RGB images, with and without hand pose information. We further explore the joint estimation of the hand mesh and applied pressure. Our experiments demonstrate that pressure and hand pose are complementary for understanding hand-object interactions. ng of hand-object interactions in AR/VR and robotics research. Project page: \url{https://yiming-zhao.github.io/EgoPressure/}.
△ Less
Submitted 4 December, 2024; v1 submitted 3 September, 2024;
originally announced September 2024.
-
EgoPoser: Robust Real-Time Egocentric Pose Estimation from Sparse and Intermittent Observations Everywhere
Authors:
Jiaxi Jiang,
Paul Streli,
Manuel Meier,
Christian Holz
Abstract:
Full-body egocentric pose estimation from head and hand poses alone has become an active area of research to power articulate avatar representations on headset-based platforms. However, existing methods over-rely on the indoor motion-capture spaces in which datasets were recorded, while simultaneously assuming continuous joint motion capture and uniform body dimensions. We propose EgoPoser to over…
▽ More
Full-body egocentric pose estimation from head and hand poses alone has become an active area of research to power articulate avatar representations on headset-based platforms. However, existing methods over-rely on the indoor motion-capture spaces in which datasets were recorded, while simultaneously assuming continuous joint motion capture and uniform body dimensions. We propose EgoPoser to overcome these limitations with four main contributions. 1) EgoPoser robustly models body pose from intermittent hand position and orientation tracking only when inside a headset's field of view. 2) We rethink input representations for headset-based ego-pose estimation and introduce a novel global motion decomposition method that predicts full-body pose independent of global positions. 3) We enhance pose estimation by capturing longer motion time series through an efficient SlowFast module design that maintains computational efficiency. 4) EgoPoser generalizes across various body shapes for different users. We experimentally evaluate our method and show that it outperforms state-of-the-art methods both qualitatively and quantitatively while maintaining a high inference speed of over 600fps. EgoPoser establishes a robust baseline for future work where full-body pose estimation no longer needs to rely on outside-in capture and can scale to large-scale and unseen environments.
△ Less
Submitted 6 September, 2024; v1 submitted 12 August, 2023;
originally announced August 2023.
-
BeliefPPG: Uncertainty-aware Heart Rate Estimation from PPG signals via Belief Propagation
Authors:
Valentin Bieri,
Paul Streli,
Berken Utku Demirel,
Christian Holz
Abstract:
We present a novel learning-based method that achieves state-of-the-art performance on several heart rate estimation benchmarks extracted from photoplethysmography signals (PPG). We consider the evolution of the heart rate in the context of a discrete-time stochastic process that we represent as a hidden Markov model. We derive a distribution over possible heart rate values for a given PPG signal…
▽ More
We present a novel learning-based method that achieves state-of-the-art performance on several heart rate estimation benchmarks extracted from photoplethysmography signals (PPG). We consider the evolution of the heart rate in the context of a discrete-time stochastic process that we represent as a hidden Markov model. We derive a distribution over possible heart rate values for a given PPG signal window through a trained neural network. Using belief propagation, we incorporate the statistical distribution of heart rate changes to refine these estimates in a temporal context. From this, we obtain a quantized probability distribution over the range of possible heart rate values that captures a meaningful and well-calibrated estimate of the inherent predictive uncertainty. We show the robustness of our method on eight public datasets with three different cross-validation experiments.
△ Less
Submitted 14 June, 2023; v1 submitted 13 June, 2023;
originally announced June 2023.
-
HOOV: Hand Out-Of-View Tracking for Proprioceptive Interaction using Inertial Sensing
Authors:
Paul Streli,
Rayan Armani,
Yi Fei Cheng,
Christian Holz
Abstract:
Current Virtual Reality systems are designed for interaction under visual control. Using built-in cameras, headsets track the user's hands or hand-held controllers while they are inside the field of view. Current systems thus ignore the user's interaction with off-screen content -- virtual objects that the user could quickly access through proprioception without requiring laborious head motions to…
▽ More
Current Virtual Reality systems are designed for interaction under visual control. Using built-in cameras, headsets track the user's hands or hand-held controllers while they are inside the field of view. Current systems thus ignore the user's interaction with off-screen content -- virtual objects that the user could quickly access through proprioception without requiring laborious head motions to bring them into focus. In this paper, we present HOOV, a wrist-worn sensing method that allows VR users to interact with objects outside their field of view. Based on the signals of a single wrist-worn inertial sensor, HOOV continuously estimates the user's hand position in 3-space to complement the headset's tracking as the hands leave the tracking range. Our novel data-driven method predicts hand positions and trajectories from just the continuous estimation of hand orientation, which by itself is stable based solely on inertial observations. Our inertial sensing simultaneously detects finger pinching to register off-screen selection events, confirms them using a haptic actuator inside our wrist device, and thus allows users to select, grab, and drop virtual content. We compared HOOV's performance with a camera-based optical motion capture system in two folds. In the first evaluation, participants interacted based on tracking information from the motion capture system to assess the accuracy of their proprioceptive input, whereas in the second, they interacted based on HOOV's real-time estimations. We found that HOOV's target-agnostic estimations had a mean tracking error of 7.7 cm, which allowed participants to reliably access virtual objects around their body without first bringing them into focus. We demonstrate several applications that leverage the larger input space HOOV opens up for quick proprioceptive interaction, and conclude by discussing the potential of our technique.
△ Less
Submitted 30 April, 2023; v1 submitted 13 March, 2023;
originally announced March 2023.
-
AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing
Authors:
Jiaxi Jiang,
Paul Streli,
Huajian Qiu,
Andreas Fender,
Larissa Laich,
Patrick Snape,
Christian Holz
Abstract:
Today's Mixed Reality head-mounted displays track the user's head pose in world space as well as the user's hands for interaction in both Augmented Reality and Virtual Reality scenarios. While this is adequate to support user input, it unfortunately limits users' virtual representations to just their upper bodies. Current systems thus resort to floating avatars, whose limitation is particularly ev…
▽ More
Today's Mixed Reality head-mounted displays track the user's head pose in world space as well as the user's hands for interaction in both Augmented Reality and Virtual Reality scenarios. While this is adequate to support user input, it unfortunately limits users' virtual representations to just their upper bodies. Current systems thus resort to floating avatars, whose limitation is particularly evident in collaborative settings. To estimate full-body poses from the sparse input sources, prior work has incorporated additional trackers and sensors at the pelvis or lower body, which increases setup complexity and limits practical application in mobile settings. In this paper, we present AvatarPoser, the first learning-based method that predicts full-body poses in world coordinates using only motion input from the user's head and hands. Our method builds on a Transformer encoder to extract deep features from the input signals and decouples global motion from the learned local joint orientations to guide pose estimation. To obtain accurate full-body motions that resemble motion capture animations, we refine the arm joints' positions using an optimization routine with inverse kinematics to match the original tracking input. In our evaluation, AvatarPoser achieved new state-of-the-art results in evaluations on large motion capture datasets (AMASS). At the same time, our method's inference speed supports real-time operation, providing a practical interface to support holistic avatar control and representation for Metaverse applications.
△ Less
Submitted 27 July, 2022;
originally announced July 2022.