-
VoluMe -- Authentic 3D Video Calls from Live Gaussian Splat Prediction
Authors:
Martin de La Gorce,
Charlie Hewitt,
Tibor Takacs,
Robert Gerdisch,
Zafiirah Hosenie,
Givi Meishvili,
Marek Kowalski,
Thomas J. Cashman,
Antonio Criminisi
Abstract:
Virtual 3D meetings offer the potential to enhance copresence, increase engagement and thus improve effectiveness of remote meetings compared to standard 2D video calls. However, representing people in 3D meetings remains a challenge; existing solutions achieve high quality by using complex hardware, making use of fixed appearance via enrolment, or by inverting a pre-trained generative model. These approaches lead to constraints that are unwelcome and ill-fitting for videoconferencing applications. We present the first method to predict 3D Gaussian reconstructions in real time from a single 2D webcam feed, where the 3D representation is not only live and realistic, but also authentic to the input video. By conditioning the 3D representation on each video frame independently, our reconstruction faithfully recreates the input video from the captured viewpoint (a property we call authenticity), while generalizing realistically to novel viewpoints. Additionally, we introduce a stability loss to obtain reconstructions that are temporally stable on video sequences. We show that our method delivers state-of-the-art accuracy in visual quality and stability metrics compared to existing methods, and demonstrate our approach in live one-to-one 3D meetings using only a standard 2D camera and display. This demonstrates that our approach can allow anyone to communicate volumetrically, via a method for 3D videoconferencing that is not only highly accessible, but also realistic and authentic.
Submitted 28 July, 2025;
originally announced July 2025.
-
DAViD: Data-efficient and Accurate Vision Models from Synthetic Data
Authors:
Fatemeh Saleh,
Sadegh Aliakbarian,
Charlie Hewitt,
Lohit Petikam,
Xiao-Xian,
Antonio Criminisi,
Thomas J. Cashman,
Tadas Baltrušaitis
Abstract:
The state of the art in human-centric computer vision achieves high accuracy and robustness across a diverse range of tasks. The most effective models in this domain have billions of parameters, thus requiring extremely large datasets, expensive training regimes, and compute-intensive inference. In this paper, we demonstrate that it is possible to train models on much smaller but high-fidelity synthetic datasets, with no loss in accuracy and higher efficiency. Using synthetic training data provides us with excellent levels of detail and perfect labels, while providing strong guarantees for data provenance, usage rights, and user consent. Procedural data synthesis also provides us with explicit control over data diversity, which we can use to address unfairness in the models we train. Extensive quantitative assessment on real input images demonstrates accuracy of our models on three dense prediction tasks: depth estimation, surface normal estimation, and soft foreground segmentation. Our models require only a fraction of the cost of training and inference when compared with foundational models of similar accuracy. Our human-centric synthetic dataset and trained models are available at https://aka.ms/DAViD.
Submitted 21 July, 2025;
originally announced July 2025.
-
Look Ma, no markers: holistic performance capture without the hassle
Authors:
Charlie Hewitt,
Fatemeh Saleh,
Sadegh Aliakbarian,
Lohit Petikam,
Shideh Rezaeifar,
Louis Florentin,
Zafiirah Hosenie,
Thomas J Cashman,
Julien Valentin,
Darren Cosker,
Tadas Baltrusaitis
Abstract:
We tackle the problem of highly-accurate, holistic performance capture for the face, body and hands simultaneously. Motion-capture technologies used in film and game production typically focus only on face, body or hand capture independently, involve complex and expensive hardware and a high degree of manual intervention from skilled operators. While machine-learning-based approaches exist to overcome these problems, they usually only support a single camera, often operate on a single part of the body, do not produce precise world-space results, and rarely generalize outside specific contexts. In this work, we introduce the first technique for marker-free, high-quality reconstruction of the complete human body, including eyes and tongue, without requiring any calibration, manual intervention or custom hardware. Our approach produces stable world-space results from arbitrary camera rigs as well as supporting varied capture environments and clothing. We achieve this through a hybrid approach that leverages machine learning models trained exclusively on synthetic data and powerful parametric models of human shape and motion. We evaluate our method on a number of body, face and hand reconstruction benchmarks and demonstrate state-of-the-art results that generalize on diverse datasets.
Submitted 15 October, 2024;
originally announced October 2024.
-
3D face reconstruction with dense landmarks
Authors:
Erroll Wood,
Tadas Baltrusaitis,
Charlie Hewitt,
Matthew Johnson,
Jingjing Shen,
Nikola Milosavljevic,
Daniel Wilde,
Stephan Garbin,
Chirag Raman,
Jamie Shotton,
Toby Sharp,
Ivan Stojiljkovic,
Tom Cashman,
Julien Valentin
Abstract:
Landmarks often play a key role in face analysis, but many aspects of identity or expression cannot be represented by sparse landmarks alone. Thus, in order to reconstruct faces more accurately, landmarks are often combined with additional signals like depth images or techniques like differentiable rendering. Can we keep things simple by just using more landmarks? In answer, we present the first method that accurately predicts 10x as many landmarks as usual, covering the whole head, including the eyes and teeth. This is accomplished using synthetic training data, which guarantees perfect landmark annotations. By fitting a morphable model to these dense landmarks, we achieve state-of-the-art results for monocular 3D face reconstruction in the wild. We show that dense landmarks are an ideal signal for integrating face shape information across frames by demonstrating accurate and expressive facial performance capture in both monocular and multi-view scenarios. This approach is also highly efficient: we can predict dense landmarks and fit our 3D face model at over 150FPS on a single CPU thread. Please see our website: https://microsoft.github.io/DenseLandmarks/.
Submitted 20 July, 2022; v1 submitted 6 April, 2022;
originally announced April 2022.
-
FLAG: Flow-based 3D Avatar Generation from Sparse Observations
Authors:
Sadegh Aliakbarian,
Pashmina Cameron,
Federica Bogo,
Andrew Fitzgibbon,
Thomas J. Cashman
Abstract:
To represent people in mixed reality applications for collaboration and communication, we need to generate realistic and faithful avatar poses. However, the signal streams that can be applied for this task from head-mounted devices (HMDs) are typically limited to head pose and hand pose estimates. While these signals are valuable, they are an incomplete representation of the human body, making it challenging to generate a faithful full-body avatar. We address this challenge by developing a flow-based generative model of the 3D human body from sparse observations, wherein we learn not only a conditional distribution of 3D human pose, but also a probabilistic mapping from observations to the latent space from which we can generate a plausible pose along with uncertainty estimates for the joints. We show that our approach is not only a strong predictive model, but can also act as an efficient pose prior in different optimization settings where a good initial latent code plays a major role.
Submitted 11 March, 2022;
originally announced March 2022.
-
Fake It Till You Make It: Face analysis in the wild using synthetic data alone
Authors:
Erroll Wood,
Tadas Baltrušaitis,
Charlie Hewitt,
Sebastian Dziadzio,
Matthew Johnson,
Virginia Estellers,
Thomas J. Cashman,
Jamie Shotton
Abstract:
We demonstrate that it is possible to perform face-related computer vision in the wild using synthetic data alone. The community has long enjoyed the benefits of synthesizing training data with graphics, but the domain gap between real and synthetic data has remained a problem, especially for human faces. Researchers have tried to bridge this gap with data mixing, domain adaptation, and domain-adversarial training, but we show that it is possible to synthesize data with minimal domain gap, so that models trained on synthetic data generalize to real in-the-wild datasets. We describe how to combine a procedurally-generated parametric 3D face model with a comprehensive library of hand-crafted assets to render training images with unprecedented realism and diversity. We train machine learning systems for face-related tasks such as landmark localization and face parsing, showing that synthetic data can both match real data in accuracy and open up new approaches where manual labelling would be impossible.
Submitted 5 October, 2021; v1 submitted 30 September, 2021;
originally announced September 2021.
-
HoloLens 2 Research Mode as a Tool for Computer Vision Research
Authors:
Dorin Ungureanu,
Federica Bogo,
Silvano Galliani,
Pooja Sama,
Xin Duan,
Casey Meekhof,
Jan Stühmer,
Thomas J. Cashman,
Bugra Tekin,
Johannes L. Schönberger,
Pawel Olszta,
Marc Pollefeys
Abstract:
Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes them ideal platforms for computer vision research. In this technical report, we present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed reality applications based on processing sensor data. We also show how to combine the Research Mode sensor data with the built-in eye and hand tracking capabilities provided by HoloLens 2. By releasing the Research Mode API and a set of open-source tools, we aim to foster further research in the fields of computer vision as well as robotics and encourage contributions from the research community.
Submitted 25 August, 2020;
originally announced August 2020.
-
A high fidelity synthetic face framework for computer vision
Authors:
Tadas Baltrusaitis,
Erroll Wood,
Virginia Estellers,
Charlie Hewitt,
Sebastian Dziadzio,
Marek Kowalski,
Matthew Johnson,
Thomas J. Cashman,
Jamie Shotton
Abstract:
Analysis of faces is one of the core applications of computer vision, with tasks including landmark alignment, head pose estimation, expression recognition, and face recognition, among others. However, building reliable methods requires time-consuming data collection and often even more time-consuming manual annotation, which can be unreliable. In our work we propose synthesizing such facial data, including ground truth annotations that would be almost impossible to acquire through manual annotation at the consistency and scale possible through use of synthetic data. We use a parametric face model together with hand-crafted assets, which enable us to generate training data with unprecedented quality and diversity (varying shape, texture, expression, pose, lighting, and hair).
Submitted 16 July, 2020;
originally announced July 2020.
-
The Phong Surface: Efficient 3D Model Fitting using Lifted Optimization
Authors:
Jingjing Shen,
Thomas J. Cashman,
Qi Ye,
Tim Hutton,
Toby Sharp,
Federica Bogo,
Andrew William Fitzgibbon,
Jamie Shotton
Abstract:
Realtime perceptual and interaction capabilities in mixed reality require a range of 3D tracking problems to be solved at low latency on resource-constrained hardware such as head-mounted devices. Indeed, for devices such as HoloLens 2 where the CPU and GPU are left available for applications, multiple tracking subsystems are required to run on a continuous, real-time basis while sharing a single Digital Signal Processor. To solve model-fitting problems for HoloLens 2 hand tracking, where the computational budget is approximately 100 times smaller than an iPhone 7, we introduce a new surface model: the 'Phong surface'. Using ideas from computer graphics, the Phong surface describes the same 3D shape as a triangulated mesh model, but with continuous surface normals which enable the use of lifting-based optimization, providing significant efficiency gains over ICP-based methods. We show that Phong surfaces retain the convergence benefits of smoother surface models, while triangle meshes do not.
Submitted 9 July, 2020;
originally announced July 2020.
-
QRkit: Sparse, Composable QR Decompositions for Efficient and Stable Solutions to Problems in Computer Vision
Authors:
Jan Svoboda,
Thomas Cashman,
Andrew Fitzgibbon
Abstract:
Embedded computer vision applications increasingly require the speed and power benefits of single-precision (32 bit) floating point. However, applications which make use of Levenberg-like optimization can lose significant accuracy when reducing to single precision, sometimes unrecoverably so. This accuracy can be regained using solvers based on QR rather than Cholesky decomposition, but the absence of sparse QR solvers for common sparsity patterns found in computer vision means that many applications cannot benefit. We introduce an open-source suite of solvers for Eigen, which efficiently compute the QR decomposition for matrices with some common sparsity patterns (block diagonal, horizontal and vertical concatenation, and banded). For problems with very particular sparsity structures, these elements can be composed together in 'kit' form, hence the name QRkit. We apply our methods to several computer vision problems, showing competitive performance and suitability especially in single precision arithmetic.
Submitted 11 February, 2018;
originally announced February 2018.