-
PanSt3R: Multi-view Consistent Panoptic Segmentation
Authors:
Lojze Zust,
Yohann Cabon,
Juliette Marrie,
Leonid Antsfeld,
Boris Chidlovskii,
Jerome Revaud,
Gabriela Csurka
Abstract:
Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on…
▽ More
Panoptic segmentation of 3D scenes, involving the segmentation and classification of object instances in a dense 3D reconstruction of a scene, is a challenging problem, especially when relying solely on unposed 2D images. Existing approaches typically leverage off-the-shelf models to extract per-frame 2D panoptic segmentations, before optimizing an implicit geometric representation (often based on NeRF) to integrate and fuse the 2D predictions. We argue that relying on 2D panoptic segmentation for a problem inherently 3D and multi-view is likely suboptimal as it fails to leverage the full potential of spatial relationships across views. In addition to requiring camera parameters, these approaches also necessitate computationally expensive test-time optimization for each scene. Instead, in this work, we propose a unified and integrated approach PanSt3R, which eliminates the need for test-time optimization by jointly predicting 3D geometry and multi-view panoptic segmentation in a single forward pass. Our approach builds upon recent advances in 3D reconstruction, specifically upon MUSt3R, a scalable multi-view version of DUSt3R, and enhances it with semantic awareness and multi-view panoptic segmentation capabilities. We additionally revisit the standard post-processing mask merging procedure and introduce a more principled approach for multi-view segmentation. We also introduce a simple method for generating novel-view predictions based on the predictions of PanSt3R and vanilla 3DGS. Overall, the proposed PanSt3R is conceptually simple, yet fast and scalable, and achieves state-of-the-art performance on several benchmarks, while being orders of magnitude faster than existing methods.
△ Less
Submitted 26 June, 2025;
originally announced June 2025.
-
Reasoning in visual navigation of end-to-end trained agents: a dynamical systems approach
Authors:
Steeven Janny,
Hervé Poirier,
Leonid Antsfeld,
Guillaume Bono,
Gianluca Monaci,
Boris Chidlovskii,
Francesco Giuliari,
Alessio Del Bue,
Christian Wolf
Abstract:
Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving \numepisodes{} navigati…
▽ More
Progress in Embodied AI has made it possible for end-to-end-trained agents to navigate in photo-realistic environments with high-level reasoning and zero-shot or language-conditioned behavior, but benchmarks are still dominated by simulation. In this work, we focus on the fine-grained behavior of fast-moving real robots and present a large-scale experimental study involving \numepisodes{} navigation episodes in a real environment with a physical robot, where we analyze the type of reasoning emerging from end-to-end training. In particular, we study the presence of realistic dynamics which the agent learned for open-loop forecasting, and their interplay with sensing. We analyze the way the agent uses latent memory to hold elements of the scene structure and information gathered during exploration. We probe the planning capabilities of the agent, and find in its memory evidence for somewhat precise plans over a limited horizon. Furthermore, we show in a post-hoc analysis that the value function learned by the agent relates to long-term planning. Put together, our experiments paint a new picture on how using tools from computer vision and sequential decision making have led to new capabilities in robotics and control. An interactive tool is available at europe.naverlabs.com/research/publications/reasoning-in-visual-navigation-of-end-to-end-trained-agents.
△ Less
Submitted 15 April, 2025; v1 submitted 11 March, 2025;
originally announced March 2025.
-
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Authors:
Yohann Cabon,
Lucas Stoffl,
Leonid Antsfeld,
Gabriela Csurka,
Boris Chidlovskii,
Jerome Revaud,
Vincent Leroy
Abstract:
DUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained Stereo 3D Reconstruction of arbitrary image collections with no prior information about camera calibration nor viewpoint poses. Under the hood, however, DUSt3R processes image pairs, regressing local 3D reconstructions that need to be aligned in a global coordinate system.…
▽ More
DUSt3R introduced a novel paradigm in geometric computer vision by proposing a model that can provide dense and unconstrained Stereo 3D Reconstruction of arbitrary image collections with no prior information about camera calibration nor viewpoint poses. Under the hood, however, DUSt3R processes image pairs, regressing local 3D reconstructions that need to be aligned in a global coordinate system. The number of pairs, growing quadratically, is an inherent limitation that becomes especially concerning for robust and fast optimization in the case of large image collections. In this paper, we propose an extension of DUSt3R from pairs to multiple views, that addresses all aforementioned concerns. Indeed, we propose a Multi-view Network for Stereo 3D Reconstruction, or MUSt3R, that modifies the DUSt3R architecture by making it symmetric and extending it to directly predict 3D structure for all views in a common coordinate frame. Second, we entail the model with a multi-layer memory mechanism which allows to reduce the computational complexity and to scale the reconstruction to large collections, inferring thousands of 3D pointmaps at high frame-rates with limited added complexity. The framework is designed to perform 3D reconstruction both offline and online, and hence can be seamlessly applied to SfM and visual SLAM scenarios showing state-of-the-art performance on various 3D downstream tasks, including uncalibrated Visual Odometry, relative camera pose, scale and focal estimation, 3D reconstruction and multi-view depth estimation.
△ Less
Submitted 3 March, 2025;
originally announced March 2025.
-
3D-Consistent Image Inpainting with Diffusion Models
Authors:
Leonid Antsfeld,
Boris Chidlovskii
Abstract:
We address the problem of 3D inconsistency of image inpainting based on diffusion models. We propose a generative model using image pairs that belong to the same scene. To achieve the 3D-consistent and semantically coherent inpainting, we modify the generative diffusion model by incorporating an alternative point of view of the scene into the denoising process. This creates an inductive bias that…
▽ More
We address the problem of 3D inconsistency of image inpainting based on diffusion models. We propose a generative model using image pairs that belong to the same scene. To achieve the 3D-consistent and semantically coherent inpainting, we modify the generative diffusion model by incorporating an alternative point of view of the scene into the denoising process. This creates an inductive bias that allows to recover 3D priors while training to denoise in 2D, without explicit 3D supervision. Training unconditional diffusion models with additional images as in-context guidance allows to harmonize the masked and non-masked regions while repainting and ensures the 3D consistency. We evaluate our method on one synthetic and three real-world datasets and show that it generates semantically coherent and 3D-consistent inpaintings and outperforms the state-of-art methods.
△ Less
Submitted 8 December, 2024;
originally announced December 2024.
-
Self-supervised Pretraining and Finetuning for Monocular Depth and Visual Odometry
Authors:
Boris Chidlovskii,
Leonid Antsfeld
Abstract:
For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. Our first step consists in a generic pretraining to learn 3D geometry, using cross-view completion objective (CroCo), followed by self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art…
▽ More
For the task of simultaneous monocular depth and visual odometry estimation, we propose learning self-supervised transformer-based models in two steps. Our first step consists in a generic pretraining to learn 3D geometry, using cross-view completion objective (CroCo), followed by self-supervised finetuning on non-annotated videos. We show that our self-supervised models can reach state-of-the-art performance 'without bells and whistles' using standard components such as visual transformers, dense prediction transformers and adapters. We demonstrate the effectiveness of our proposed method by running evaluations on six benchmark datasets, both static and dynamic, indoor and outdoor, with synthetic and real images. For all datasets, our method outperforms state-of-the-art methods, in particular for depth prediction task.
△ Less
Submitted 16 June, 2024;
originally announced June 2024.
-
Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps
Authors:
Gianluca Monaci,
Leonid Antsfeld,
Boris Chidlovskii,
Christian Wolf
Abstract:
Bird's-eye view (BEV) maps are an important geometrically structured representation widely used in robotics, in particular self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map visual first-person observations to BEV represent…
▽ More
Bird's-eye view (BEV) maps are an important geometrically structured representation widely used in robotics, in particular self-driving vehicles and terrestrial robots. Existing algorithms either require depth information for the geometric projection, which is not always reliably available, or are trained end-to-end in a fully supervised way to map visual first-person observations to BEV representation, and are therefore restricted to the output modality they have been trained for. In contrast, we propose a new model capable of performing zero-shot projections of any modality available in a first person view to the corresponding BEV map. This is achieved by disentangling the geometric inverse perspective projection from the modality transformation, eg. RGB to occupancy. The method is general and we showcase experiments projecting to BEV three different modalities: semantic segmentation, motion vectors and object bounding boxes detected in first person. We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
△ Less
Submitted 25 March, 2024; v1 submitted 21 February, 2024;
originally announced February 2024.
-
Learning to navigate efficiently and precisely in real environments
Authors:
Guillaume Bono,
Hervé Poirier,
Leonid Antsfeld,
Gianluca Monaci,
Boris Chidlovskii,
Christian Wolf
Abstract:
In the context of autonomous navigation of terrestrial robots, the creation of realistic models for agent dynamics and sensing is a widespread habit in the robotics literature and in commercial applications, where they are used for model based control and/or for localization and mapping. The more recent Embodied AI literature, on the other hand, focuses on modular or end-to-end agents trained in s…
▽ More
In the context of autonomous navigation of terrestrial robots, the creation of realistic models for agent dynamics and sensing is a widespread habit in the robotics literature and in commercial applications, where they are used for model based control and/or for localization and mapping. The more recent Embodied AI literature, on the other hand, focuses on modular or end-to-end agents trained in simulators like Habitat or AI-Thor, where the emphasis is put on photo-realistic rendering and scene diversity, but high-fidelity robot motion is assigned a less privileged role. The resulting sim2real gap significantly impacts transfer of the trained models to real robotic platforms. In this work we explore end-to-end training of agents in simulation in settings which minimize the sim2real gap both, in sensing and in actuation. Our agent directly predicts (discretized) velocity commands, which are maintained through closed-loop control in the real robot. The behavior of the real robot (including the underlying low-level controller) is identified and simulated in a modified Habitat simulator. Noise models for odometry and localization further contribute in lowering the sim2real gap. We evaluate on real navigation scenarios, explore different localization and point goal calculation methods and report significant gains in performance and robustness compared to prior work.
△ Less
Submitted 25 January, 2024;
originally announced January 2024.
-
End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon
Authors:
Guillaume Bono,
Leonid Antsfeld,
Boris Chidlovskii,
Philippe Weinzaepfel,
Christian Wolf
Abstract:
Most recent work in goal oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectN…
▽ More
Most recent work in goal oriented visual navigation resorts to large-scale machine learning in simulated environments. The main challenge lies in learning compact representations generalizable to unseen environments and in learning high-capacity perception modules capable of reasoning on high-dimensional input. The latter is particularly difficult when the goal is not given as a category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception module needs to learn a comparison strategy requiring to solve an underlying visual correspondence problem. This has been shown to be difficult from reward alone or with standard auxiliary tasks. We address this problem through a sequence of two pretext tasks, which serve as a prior for what we argue is one of the main bottleneck in perception, extremely wide-baseline relative pose estimation and visibility prediction in complex scenes. The first pretext task, cross-view completion is a proxy for the underlying visual correspondence problem, while the second task addresses goal detection and finding directly. We propose a new dual encoder with a large-capacity binocular ViT model and show that correspondence solutions naturally emerge from the training signals. Experiments show significant improvements and SOTA performance on the two benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics and height differ between observation and goal.
△ Less
Submitted 28 September, 2023;
originally announced September 2023.
-
Learning with a Mole: Transferable latent spatial representations for navigation without reconstruction
Authors:
Guillaume Bono,
Leonid Antsfeld,
Assem Sadek,
Gianluca Monaci,
Christian Wolf
Abstract:
Agents navigating in 3D environments require some form of memory, which should hold a compact and actionable representation of the history of observations useful for decision taking and planning. In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation, whereas classical robotics addresses this with scene reconstruction resultin…
▽ More
Agents navigating in 3D environments require some form of memory, which should hold a compact and actionable representation of the history of observations useful for decision taking and planning. In most end-to-end learning approaches the representation is latent and usually does not have a clearly defined interpretation, whereas classical robotics addresses this with scene reconstruction resulting in some form of map, usually estimated with geometry and sensor models and/or learning. In this work we propose to learn an actionable representation of the scene independently of the targeted downstream task and without explicitly optimizing reconstruction. The learned representation is optimized by a blind auxiliary agent trained to navigate with it on multiple short sub episodes branching out from a waypoint and, most importantly, without any direct visual observation. We argue and show that the blindness property is important and forces the (trained) latent representation to be the only means for planning. With probing experiments we show that the learned representation optimizes navigability and not reconstruction. On downstream tasks we show that it is robust to changes in distribution, in particular the sim2real gap, which we evaluate with a real physical robot in a real office building, significantly improving performance.
△ Less
Submitted 29 September, 2023; v1 submitted 6 June, 2023;
originally announced June 2023.
-
CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow
Authors:
Philippe Weinzaepfel,
Thomas Lucas,
Vincent Leroy,
Yohann Cabon,
Vaibhav Arora,
Romain Brégier,
Gabriela Csurka,
Leonid Antsfeld,
Boris Chidlovskii,
Jérôme Revaud
Abstract:
Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-v…
▽ More
Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision transformer based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volume, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models.
△ Less
Submitted 18 August, 2023; v1 submitted 18 November, 2022;
originally announced November 2022.
-
CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion
Authors:
Philippe Weinzaepfel,
Vincent Leroy,
Thomas Lucas,
Romain Brégier,
Yohann Cabon,
Vaibhav Arora,
Leonid Antsfeld,
Boris Chidlovskii,
Gabriela Csurka,
Jérôme Revaud
Abstract:
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm. A pretext task is constructed by masking patches in an input image, and this masked content is then predicted by a neural network using visible patches as sole input. This pre-training leads to state-of-the-art performance when finetuned for high-level semantic tasks, e.g. image classification and object d…
▽ More
Masked Image Modeling (MIM) has recently been established as a potent pre-training paradigm. A pretext task is constructed by masking patches in an input image, and this masked content is then predicted by a neural network using visible patches as sole input. This pre-training leads to state-of-the-art performance when finetuned for high-level semantic tasks, e.g. image classification and object detection. In this paper we instead seek to learn representations that transfer well to a wide variety of 3D vision and lower-level geometric downstream tasks, such as depth prediction or optical flow estimation. Inspired by MIM, we propose an unsupervised representation learning task trained from pairs of images showing the same scene from different viewpoints. More precisely, we propose the pretext task of cross-view completion where the first input image is partially masked, and this masked content has to be reconstructed from the visible content and the second image. In single-view MIM, the masked content often cannot be inferred precisely from the visible portion only, so the model learns to act as a prior influenced by high-level semantics. In contrast, this ambiguity can be resolved with cross-view completion from the second unmasked image, on the condition that the model is able to understand the spatial relationship between the two images. Our experiments show that our pretext task leads to significantly improved performance for monocular 3D vision downstream tasks such as depth estimation. In addition, our model can be directly applied to binocular downstream tasks like optical flow or relative camera pose estimation, for which we obtain competitive results without bells and whistles, i.e., using a generic architecture without any task-specific design.
△ Less
Submitted 12 January, 2023; v1 submitted 19 October, 2022;
originally announced October 2022.
-
Magnetic Field Sensing for Pedestrian and Robot Indoor Positioning
Authors:
Leonid Antsfeld,
Boris Chidlovskii
Abstract:
In this paper we address the problem of indoor localization using magnetic field data in two setups, when data is collected by (i) human-held mobile phone and (ii) by localization robots that perturb magnetic data with their own electromagnetic field. For the first setup, we revise the state of the art approaches and propose a novel extended pipeline to benefit from the presence of magnetic anomal…
▽ More
In this paper we address the problem of indoor localization using magnetic field data in two setups, when data is collected by (i) human-held mobile phone and (ii) by localization robots that perturb magnetic data with their own electromagnetic field. For the first setup, we revise the state of the art approaches and propose a novel extended pipeline to benefit from the presence of magnetic anomalies in indoor environment created by different ferromagnetic objects. We capture changes of the Earth's magnetic field due to indoor magnetic anomalies and transform them in multi-variate times series. We then convert temporal patterns into visual ones. We use methods of Recurrence Plots, Gramian Angular Fields and Markov Transition Fields to represent magnetic field time series as image sequences. We regress the continuous values of user position in a deep neural network that combines convolutional and recurrent layers. For the second setup, we analyze how magnetic field data get perturbed by robots' electromagnetic field. We add an alignment step to the main pipeline, in order to compensate the mismatch between train and test sets obtained by different robots. We test our methods on two public (MagPie and IPIN'20) and one proprietary (Hyundai department store) datasets. We report evaluation results and show that our methods outperform the state of the art methods by a large margin.
△ Less
Submitted 26 August, 2021;
originally announced August 2021.
-
Deep Smartphone Sensors-WiFi Fusion for Indoor Positioning and Tracking
Authors:
Leonid Antsfeld,
Boris Chidlovskii,
Emilio Sansano-Sansano
Abstract:
We address the indoor localization problem, where the goal is to predict user's trajectory from the data collected by their smartphone, using inertial sensors such as accelerometer, gyroscope and magnetometer, as well as other environment and network sensors such as barometer and WiFi. Our system implements a deep learning based pedestrian dead reckoning (deep PDR) model that provides a high-rate…
▽ More
We address the indoor localization problem, where the goal is to predict user's trajectory from the data collected by their smartphone, using inertial sensors such as accelerometer, gyroscope and magnetometer, as well as other environment and network sensors such as barometer and WiFi. Our system implements a deep learning based pedestrian dead reckoning (deep PDR) model that provides a high-rate estimation of the relative position of the user. Using Kalman Filter, we correct the PDR's drift using WiFi that provides a prediction of the user's absolute position each time a WiFi scan is received. Finally, we adjust Kalman Filter results with a map-free projection method that takes into account the physical constraints of the environment (corridors, doors, etc.) and projects the prediction on the possible walkable paths. We test our pipeline on IPIN'19 Indoor Localization challenge dataset and demonstrate that it improves the winner's results by 20\% using the challenge evaluation protocol.
△ Less
Submitted 21 November, 2020;
originally announced November 2020.