-
HOFT: Householder Orthogonal Fine-tuning
Authors:
Alejandro Moreno Arcas,
Albert Sanchis,
Jorge Civera,
Alfons Juan
Abstract:
Adaptation of foundation models using low-rank methods is a widespread approach. Another way to adapt these models is to employ orthogonal fine-tuning methods, which are less time and memory efficient despite their good generalization properties. In this work, we propose Householder Orthogonal Fine-tuning (HOFT), a novel orthogonal fine-tuning method that aims to alleviate time and space complexity. Moreover, some theoretical properties of the orthogonal fine-tuning paradigm are explored. From this exploration, Scaled Householder Orthogonal Fine-tuning (SHOFT) is proposed. Both HOFT and SHOFT are evaluated in downstream tasks, namely commonsense reasoning, machine translation, subject-driven generation and mathematical reasoning. Compared with state-of-the-art adaptation methods, HOFT and SHOFT show comparable or better results.
Submitted 22 May, 2025;
originally announced May 2025.
-
SLAM&Render: A Benchmark for the Intersection Between Neural Rendering, Gaussian Splatting and SLAM
Authors:
Samuel Cerezo,
Gaetano Meli,
Tomás Berriel Martins,
Kirill Safronov,
Javier Civera
Abstract:
Models and methods originally developed for novel view synthesis and scene rendering, such as Neural Radiance Fields (NeRF) and Gaussian Splatting, are increasingly being adopted as representations in Simultaneous Localization and Mapping (SLAM). However, existing datasets fail to include the specific challenges of both fields, such as multimodality and sequentiality in SLAM or generalization across viewpoints and illumination conditions in neural rendering. To bridge this gap, we introduce SLAM&Render, a novel dataset designed to benchmark methods in the intersection between SLAM and novel view rendering. It consists of 40 sequences with synchronized RGB, depth, IMU, robot kinematic data, and ground-truth pose streams. By releasing robot kinematic data, the dataset also enables the assessment of novel SLAM strategies when applied to robot manipulators. The dataset sequences span five different setups featuring consumer and industrial objects under four different lighting conditions, with separate training and test trajectories per scene, as well as object rearrangements. Our experimental results, obtained with several baselines from the literature, validate SLAM&Render as a relevant benchmark for this emerging research area.
Submitted 21 April, 2025; v1 submitted 18 April, 2025;
originally announced April 2025.
-
VSLAM-LAB: A Comprehensive Framework for Visual SLAM Methods and Datasets
Authors:
Alejandro Fontan,
Tobias Fischer,
Javier Civera,
Michael Milford
Abstract:
Visual Simultaneous Localization and Mapping (VSLAM) research faces significant challenges due to fragmented toolchains, complex system configurations, and inconsistent evaluation methodologies. To address these issues, we present VSLAM-LAB, a unified framework designed to streamline the development, evaluation, and deployment of VSLAM systems. VSLAM-LAB simplifies the entire workflow by enabling seamless compilation and configuration of VSLAM algorithms, automated dataset downloading and preprocessing, and standardized experiment design, execution, and evaluation--all accessible through a single command-line interface. The framework supports a wide range of VSLAM systems and datasets, offering broad compatibility and extendability while promoting reproducibility through consistent evaluation metrics and analysis tools. By reducing implementation complexity and minimizing configuration overhead, VSLAM-LAB empowers researchers to focus on advancing VSLAM methodologies and accelerates progress toward scalable, real-world solutions. We demonstrate the ease with which user-relevant benchmarks can be created: here, we introduce difficulty-level-based categories, but one could envision environment-specific or condition-specific categories.
Submitted 6 April, 2025;
originally announced April 2025.
-
MVSAnywhere: Zero-Shot Multi-View Stereo
Authors:
Sergio Izquierdo,
Mohamed Sayed,
Michael Firman,
Guillermo Garcia-Hernando,
Daniyar Turmukhambetov,
Javier Civera,
Oisin Mac Aodha,
Gabriel Brostow,
Jamie Watson
Abstract:
Computing accurate depth from multiple views is a fundamental and longstanding challenge in computer vision. However, most existing approaches do not generalize well across different domains and scene types (e.g. indoor vs. outdoor). Training a general-purpose multi-view stereo model is challenging and raises several questions, e.g. how to best make use of transformer-based architectures, how to incorporate additional metadata when there is a variable number of input views, and how to estimate the range of valid depths which can vary considerably across different scenes and is typically not known a priori? To address these issues, we introduce MVSA, a novel and versatile Multi-View Stereo architecture that aims to work Anywhere by generalizing across diverse domains and depth ranges. MVSA combines monocular and multi-view cues with an adaptive cost volume to deal with scale-related issues. We demonstrate state-of-the-art zero-shot depth estimation on the Robust Multi-View Depth Benchmark, surpassing existing multi-view stereo and monocular baselines.
Submitted 28 March, 2025;
originally announced March 2025.
-
AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
Authors:
Javier Tirado-Garín,
Javier Civera
Abstract:
We present AnyCalib, a method for calibrating the intrinsic parameters of a camera from a single in-the-wild image, that is agnostic to the camera model. Current methods are predominantly tailored to specific camera models and/or require extrinsic cues, such as the direction of gravity, to be visible in the image. In contrast, we argue that the perspective and distortion cues inherent in images are sufficient for model-agnostic camera calibration. To demonstrate this, we frame the calibration process as the regression of the rays corresponding to each pixel. We show, for the first time, that this intermediate representation allows for a closed-form recovery of the intrinsics for a wide range of camera models, including but not limited to: pinhole, Brown-Conrady and Kannala-Brandt. Our approach also applies to edited -- cropped and stretched -- images. Experimentally, we demonstrate that AnyCalib consistently outperforms alternative methods, including 3D foundation models, despite being trained on orders of magnitude less data. Code is available at https://github.com/javrtg/AnyCalib.
Submitted 16 March, 2025;
originally announced March 2025.
-
S-Graphs 2.0 -- A Hierarchical-Semantic Optimization and Loop Closure for SLAM
Authors:
Hriday Bavle,
Jose Luis Sanchez-Lopez,
Muhammad Shaheer,
Javier Civera,
Holger Voos
Abstract:
The hierarchical structure of 3D scene graphs is highly relevant for representation purposes, as it fits common patterns in man-made environments. Additionally, the semantic and geometric information in such hierarchical representations can be leveraged to speed up the optimization and management of map elements and robot poses.
In this direction, we present our work Situational Graphs 2.0 (S-Graphs 2.0), which leverages the hierarchical structure of indoor scenes for efficient data management and optimization. Our algorithm begins by constructing a situational graph that represents the environment in four layers: Keyframes, Walls, Rooms, and Floors. Our first novelty lies in the front-end, which includes a floor detection module capable of identifying stairways and assigning floor-level semantic relations to the underlying layers. Floor-level semantics allow us to propose a floor-based loop closure strategy that effectively rejects false positive closures that typically appear due to aliasing between different floors of a building. Our second novelty lies in leveraging our representation hierarchy in the optimization. Our proposal consists of: (1) local optimization over a window of recent keyframes and their connected components across the four representation layers, (2) floor-level global optimization, which focuses only on keyframes and their connections within the current floor during loop closures, and (3) room-level local optimization, marginalizing redundant keyframes that share observations within the room, which reduces the computational footprint. We validate our algorithm extensively in different real multi-floor environments. Our approach shows state-of-the-art accuracy metrics in large-scale multi-floor environments, estimating hierarchical representations up to 10x faster, on average, than competing baselines.
Submitted 28 February, 2025; v1 submitted 25 February, 2025;
originally announced February 2025.
-
Single-Shot Metric Depth from Focused Plenoptic Cameras
Authors:
Blanca Lasheras-Hernandez,
Klaus H. Strobl,
Sergio Izquierdo,
Tim Bodenmüller,
Rudolph Triebel,
Javier Civera
Abstract:
Metric depth estimation from visual sensors is crucial for robots to perceive, navigate, and interact with their environment. Traditional range imaging setups, such as stereo or structured light cameras, face hassles including calibration, occlusions, and hardware demands, with accuracy limited by the baseline between cameras. Single- and multi-view monocular depth offers a more compact alternative, but is constrained by the unobservability of the metric scale. Light field imaging provides a promising solution for estimating metric depth by using a unique lens configuration through a single device. However, its application to single-view dense metric depth is under-addressed mainly due to the technology's high cost, the lack of public benchmarks, and proprietary geometrical models and software. Our work explores the potential of focused plenoptic cameras for dense metric depth. We propose a novel pipeline that predicts metric depth from a single plenoptic camera shot by first generating a sparse metric point cloud using machine learning, which is then used to scale and align a dense relative depth map regressed by a foundation depth model, resulting in dense metric depth. To validate it, we curated the Light Field & Stereo Image Dataset (LFS) of real-world light field images with stereo depth labels, filling a current gap in existing resources. Experimental results show that our pipeline produces accurate metric depth predictions, laying a solid groundwork for future research in this field.
Submitted 12 March, 2025; v1 submitted 3 December, 2024;
originally announced December 2024.
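The abstract above describes using a sparse metric point cloud to scale and align a dense relative depth map. A minimal sketch of one common way to do this follows, assuming an affine (scale plus shift) least-squares fit; the abstract does not specify the alignment model, so the function names `fit_scale_shift` and `to_metric` and the sample values are hypothetical illustrations, not the authors' implementation.

```python
def fit_scale_shift(relative, metric):
    """Closed-form least squares for s, t minimizing sum((s*r + t - m)^2)."""
    n = len(relative)
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(dense_relative, scale, shift):
    """Apply the fitted affine map to every pixel of a relative depth map."""
    return [[scale * v + shift for v in row] for row in dense_relative]


# Hypothetical sparse samples: relative depth at pixels where metric depth is known.
sparse_relative = [0.1, 0.4, 0.7, 1.0]
sparse_metric = [2.0 * r + 0.5 for r in sparse_relative]  # synthetic ground truth

scale, shift = fit_scale_shift(sparse_relative, sparse_metric)
dense_metric = to_metric([[0.2, 0.5], [0.8, 0.9]], scale, shift)
```

With noisy sparse points, a robust variant (for example, fitting inside a RANSAC loop) would likely be preferable to the plain least-squares fit shown here.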
-
Look Ma, No Ground Truth! Ground-Truth-Free Tuning of Structure from Motion and Visual SLAM
Authors:
Alejandro Fontan,
Javier Civera,
Tobias Fischer,
Michael Milford
Abstract:
Evaluation is critical to both developing and tuning Structure from Motion (SfM) and Visual SLAM (VSLAM) systems, but is universally reliant on high-quality geometric ground truth -- a resource that is not only costly and time-intensive but, in many cases, entirely unobtainable. This dependency on ground truth restricts SfM and SLAM applications across diverse environments and limits scalability to real-world scenarios. In this work, we propose a novel ground-truth-free (GTF) evaluation methodology that eliminates the need for geometric ground truth, instead using sensitivity estimation via sampling from both original and noisy versions of input images. Our approach shows strong correlation with traditional ground-truth-based benchmarks and supports GTF hyperparameter tuning. Removing the need for ground truth opens up new opportunities to leverage a much larger number of dataset sources, and for self-supervised and online tuning, with the potential for a data-driven breakthrough analogous to what has occurred in generative AI.
Submitted 1 December, 2024;
originally announced December 2024.
-
Open-Vocabulary Online Semantic Mapping for SLAM
Authors:
Tomas Berriel Martins,
Martin R. Oswald,
Javier Civera
Abstract:
This paper presents an Open-Vocabulary Online 3D semantic mapping pipeline, that we denote by its acronym OVO. Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. These are computed from the viewpoints where they are observed by a novel CLIP merging method. Notably, our OVO has a significantly lower computational and memory footprint than offline baselines, while also showing better segmentation metrics than them. Along with superior segmentation performance, we also show experimental results of our mapping contributions integrated with two different SLAM backbones (Gaussian-SLAM and ORB-SLAM2), being the first ones demonstrating end-to-end open-vocabulary online 3D reconstructions without relying on ground-truth camera poses or scene geometry.
Submitted 10 March, 2025; v1 submitted 22 November, 2024;
originally announced November 2024.
-
Addressing the challenges of loop detection in agricultural environments
Authors:
Nicolás Soncini,
Javier Civera,
Taihú Pire
Abstract:
While visual SLAM systems are well studied and achieve impressive results in indoor and urban settings, natural, outdoor and open-field environments are much less explored and still present relevant research challenges. Visual navigation and local mapping have shown a relatively good performance in open-field environments. However, globally consistent mapping and long-term localization still depend on the robustness of loop detection and closure, for which the literature is scarce. In this work we propose a novel method to pave the way towards robust loop detection in open fields, particularly in agricultural settings, based on local feature search and stereo geometric refinement, with a final stage of relative pose estimation. Our method consistently achieves good loop detections, with a median error of 15cm. We aim to characterize open fields as a novel environment for loop detection, understanding the limitations and problems that arise when dealing with them.
Submitted 30 August, 2024; v1 submitted 28 August, 2024;
originally announced August 2024.
-
Tightly Coupled SLAM with Imprecise Architectural Plans
Authors:
Muhammad Shaheer,
Jose Andres Millan-Romera,
Hriday Bavle,
Marco Giberna,
Jose Luis Sanchez-Lopez,
Javier Civera,
Holger Voos
Abstract:
Robots navigating indoor environments often have access to architectural plans, which can serve as prior knowledge to enhance their localization and mapping capabilities. While some SLAM algorithms leverage these plans for global localization in real-world environments, they typically overlook a critical challenge: the "as-planned" architectural designs frequently deviate from the "as-built" real-world environments. To address this gap, we present a novel algorithm that tightly couples LIDAR-based simultaneous localization and mapping with architectural plans in the presence of deviations. Our method utilizes a multi-layered semantic representation to not only localize the robot, but also to estimate global alignment and structural deviations between "as-planned" and "as-built" environments in real time. To validate our approach, we performed experiments on simulated and real datasets, demonstrating robustness to structural deviations up to 35 cm and 15 degrees. On average, our method achieves 43% less localization error than baselines in simulated environments, while in real environments, the as-built 3D maps show 7% lower average alignment error.
Submitted 12 June, 2025; v1 submitted 3 August, 2024;
originally announced August 2024.
-
Alignment Scores: Robust Metrics for Multiview Pose Accuracy Evaluation
Authors:
Seong Hun Lee,
Javier Civera
Abstract:
We propose three novel metrics for evaluating the accuracy of a set of estimated camera poses given the ground truth: Translation Alignment Score (TAS), Rotation Alignment Score (RAS), and Pose Alignment Score (PAS). The TAS evaluates the translation accuracy independently of the rotations, and the RAS evaluates the rotation accuracy independently of the translations. The PAS is the average of the two scores, evaluating the combined accuracy of both translations and rotations. The TAS is computed in four steps: (1) Find the upper quartile of the closest-pair-distances, $d$. (2) Align the estimated trajectory to the ground truth using a robust registration method. (3) Collect all distance errors and obtain the cumulative frequencies for multiple thresholds ranging from $0.01d$ to $d$ with a resolution $0.01d$. (4) Add up these cumulative frequencies and normalize them such that the theoretical maximum is 1. The TAS has practical advantages over the existing metrics in that (1) it is robust to outliers and collinear motion, and (2) there is no need to adjust parameters on different datasets. The RAS is computed in a similar manner to the TAS and is also shown to be more robust against outliers than the existing rotation metrics. We verify our claims through extensive simulations and provide in-depth discussion of the strengths and weaknesses of the proposed metrics.
Submitted 2 August, 2024; v1 submitted 29 July, 2024;
originally announced July 2024.
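The four TAS steps listed in the abstract above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' reference code: it assumes the closest-pair distances are taken over the ground-truth positions, uses Python's default quantile definition for the upper quartile, and skips step (2) by expecting an estimate that is already aligned to the ground truth. The function names are hypothetical.

```python
import math
import statistics


def closest_pair_distances(points):
    """For each point, the distance to its nearest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def translation_alignment_score(gt, est_aligned, steps=100):
    """Sketch of TAS for positions that are ALREADY aligned (step 2 omitted)."""
    # Step 1: upper quartile d of the closest-pair distances (assumed on gt).
    d = statistics.quantiles(closest_pair_distances(gt), n=4)[2]
    # Step 3: distance errors and cumulative frequencies at 0.01d, 0.02d, ..., d.
    errors = [math.dist(g, e) for g, e in zip(gt, est_aligned)]
    total = 0.0
    for k in range(1, steps + 1):
        threshold = k * d / steps
        total += sum(err <= threshold for err in errors) / len(errors)
    # Step 4: normalize so that the theoretical maximum is 1.
    return total / steps


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
perfect = translation_alignment_score(gt, gt)
shifted = translation_alignment_score(gt, [(x, y + 10.0, z) for x, y, z in gt])
```

A perfectly aligned estimate passes every threshold and scores 1, while an estimate whose errors all exceed `d` scores 0.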
-
Close, But Not There: Boosting Geographic Distance Sensitivity in Visual Place Recognition
Authors:
Sergio Izquierdo,
Javier Civera
Abstract:
Visual Place Recognition (VPR) plays a critical role in many localization and mapping pipelines. It consists of retrieving the closest sample to a query image, in a certain embedding space, from a database of geotagged references. The image embedding is learned to effectively describe a place despite variations in visual appearance, viewpoint, and geometric changes. In this work, we formulate how limitations in the Geographic Distance Sensitivity of current VPR embeddings result in a high probability of incorrectly sorting the top-k retrievals, negatively impacting the recall. In order to address this issue in single-stage VPR, we propose a novel mining strategy, CliqueMining, that selects positive and negative examples by sampling cliques from a graph of visually similar images. Our approach boosts the sensitivity of VPR embeddings at small distance ranges, significantly improving the state of the art on relevant benchmarks. In particular, we raise recall@1 from 75% to 82% in MSLS Challenge, and from 76% to 90% in Nordland. Models and code are available at https://github.com/serizba/cliquemining.
Submitted 2 July, 2024;
originally announced July 2024.
-
Feature Splatting for Better Novel View Synthesis with Low Overlap
Authors:
T. Berriel Martins,
Javier Civera
Abstract:
3D Gaussian Splatting has emerged as a very promising scene representation, achieving state-of-the-art quality in novel view synthesis significantly faster than competing alternatives. However, its use of spherical harmonics to represent scene colors limits the expressivity of 3D Gaussians and, as a consequence, the capability of the representation to generalize as we move away from the training views. In this paper, we propose to encode the color information of 3D Gaussians into per-Gaussian feature vectors, which we denote as Feature Splatting (FeatSplat). To synthesize a novel view, Gaussians are first "splatted" into the image plane, then the corresponding feature vectors are alpha-blended, and finally the blended vector is decoded by a small MLP to render the RGB pixel values. To further inform the model, we concatenate a camera embedding to the blended feature vector, to condition the decoding also on the viewpoint information. Our experiments show that this novel model for encoding the radiance considerably improves novel view synthesis for low overlap views that are distant from the training views. Finally, we also show the capacity and convenience of our feature vector representation, demonstrating its capability not only to generate RGB values for novel views, but also their per-pixel semantic labels. Code available at https://github.com/tberriel/FeatSplat .
Keywords: Gaussian Splatting, Novel View Synthesis, Feature Splatting
Submitted 30 September, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
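The per-pixel blending step described in the FeatSplat abstract above can be illustrated with standard front-to-back alpha compositing applied to feature vectors. This is a sketch under assumptions: the Gaussians are taken to be already splatted and depth-sorted for one pixel, the MLP decoder is omitted, and the names `blend_features` and `with_camera_embedding` are hypothetical rather than taken from the released code.

```python
def blend_features(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian feature vectors.

    features: list of equal-length vectors, sorted nearest first.
    alphas:   per-Gaussian opacity at this pixel, each in [0, 1].
    """
    dim = len(features[0])
    blended = [0.0] * dim
    transmittance = 1.0  # fraction of light not yet absorbed
    for feat, alpha in zip(features, alphas):
        weight = alpha * transmittance
        for k in range(dim):
            blended[k] += weight * feat[k]
        transmittance *= 1.0 - alpha
    return blended


def with_camera_embedding(blended, camera_embedding):
    """Concatenate a camera embedding; the result would feed a small MLP."""
    return blended + camera_embedding


pixel_features = blend_features([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
decoder_input = with_camera_embedding(pixel_features, [0.1, 0.2, 0.3])
```

In this toy case the nearer Gaussian contributes weight 0.5 and the farther one 0.25, so the blended vector is `[0.5, 0.25]` before the camera embedding is appended.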
-
Camera Motion Estimation from RGB-D-Inertial Scene Flow
Authors:
Samuel Cerezo,
Javier Civera
Abstract:
In this paper, we introduce a novel formulation for camera motion estimation that integrates RGB-D images and inertial data through scene flow. Our goal is to accurately estimate the camera motion in a rigid 3D environment, along with the state of the inertial measurement unit (IMU). Our proposed method offers the flexibility to operate as a multi-frame optimization or to marginalize older data, thus effectively utilizing past measurements. To assess the performance of our method, we conducted evaluations using both synthetic data from the ICL-NUIM dataset and real data sequences from the OpenLORIS-Scene dataset. Our results show that the fusion of these two sensors enhances the accuracy of camera motion estimation when compared to using only visual data.
Submitted 26 April, 2024;
originally announced April 2024.
-
Unifying Local and Global Multimodal Features for Place Recognition in Aliased and Low-Texture Environments
Authors:
Alberto García-Hernández,
Riccardo Giubilato,
Klaus H. Strobl,
Javier Civera,
Rudolph Triebel
Abstract:
Perceptual aliasing and weak textures pose significant challenges to the task of place recognition, hindering the performance of Simultaneous Localization and Mapping (SLAM) systems. This paper presents a novel model, called UMF (standing for Unifying Local and Global Multimodal Features), that 1) leverages multi-modality through cross-attention blocks between vision and LiDAR features, and 2) includes a re-ranking stage that re-orders the top-k candidates retrieved using a global representation, based on local feature matching. Our experiments, particularly on sequences captured in a planetary-analogous environment, show that UMF significantly outperforms previous baselines in those challenging aliased environments. Since our work aims to enhance the reliability of SLAM in all situations, we also explore its performance on the widely used RobotCar dataset, for broader applicability. Code and models are available at https://github.com/DLR-RM/UMF
Submitted 20 March, 2024;
originally announced March 2024.
-
PCR-99: A Practical Method for Point Cloud Registration with 99 Percent Outliers
Authors:
Seong Hun Lee,
Javier Civera,
Patrick Vandewalle
Abstract:
We propose a robust method for point cloud registration that can handle both unknown scales and extreme outlier ratios. Our method, dubbed PCR-99, uses a deterministic 3-point sampling approach with two novel mechanisms that significantly boost the speed: (1) an improved ordering of the samples based on pairwise scale consistency, prioritizing the point correspondences that are more likely to be inliers, and (2) an efficient outlier rejection scheme based on triplet scale consistency, prescreening bad samples and reducing the number of hypotheses to be tested. Our evaluation shows that, up to 98% outlier ratio, the proposed method achieves comparable performance to the state of the art. At 99% outlier ratio, however, it outperforms the state of the art for both known-scale and unknown-scale problems. Especially for the latter, we observe a clear superiority in terms of robustness and speed.
Submitted 9 September, 2024; v1 submitted 26 February, 2024;
originally announced February 2024.
-
From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation
Authors:
Javier Tirado-Garín,
Javier Civera
Abstract:
Estimating the relative camera pose from $n \geq 5$ correspondences between two calibrated views is a fundamental task in computer vision. This process typically involves two stages: 1) estimating the essential matrix between the views, and 2) disambiguating among the four candidate relative poses that satisfy the epipolar geometry. In this paper, we demonstrate a novel approach that, for the first time, bypasses the second stage. Specifically, we show that it is possible to directly estimate the correct relative camera pose from correspondences without needing a post-processing step to enforce the cheirality constraint on the correspondences. Building on recent advances in certifiable non-minimal optimization, we frame the relative pose estimation as a Quadratically Constrained Quadratic Program (QCQP). By applying the appropriate constraints, we ensure the estimation of a camera pose that corresponds to a valid 3D geometry and that is globally optimal when certified. We validate our method through exhaustive synthetic and real-world experiments, confirming the efficacy, efficiency and accuracy of the proposed approach. Code is available at https://github.com/javrtg/C2P.
Submitted 27 March, 2024; v1 submitted 10 December, 2023;
originally announced December 2023.
-
Optimal Transport Aggregation for Visual Place Recognition
Authors:
Sergio Izquierdo,
Javier Civera
Abstract:
The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.
Submitted 27 June, 2024; v1 submitted 27 November, 2023;
originally announced November 2023.
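The optimal-transport view of soft assignment described in the abstract above can be sketched with plain Sinkhorn iterations over a score matrix extended by one dustbin column. This is only an illustration: the uniform marginals, the fixed `dustbin_score`, and the iteration count are assumptions made here, not details of the SALAD code.

```python
import math


def sinkhorn_with_dustbin(scores, dustbin_score=0.0, n_iters=100):
    """Soft-assign features (rows) to clusters plus one dustbin (columns).

    scores: list of rows, one row of feature-to-cluster scores per feature.
    Returns a matrix whose last column holds the mass sent to the dustbin.
    Assumed marginals: every feature carries mass 1, and the total mass is
    spread uniformly over the clusters and the dustbin.
    """
    n_feat = len(scores)
    n_cols = len(scores[0]) + 1
    kernel = [
        [math.exp(s) for s in row] + [math.exp(dustbin_score)] for row in scores
    ]
    row_mass = 1.0
    col_mass = n_feat / n_cols
    u = [1.0] * n_feat
    v = [1.0] * n_cols
    for _ in range(n_iters):
        # Rescale columns, then rows, to match the target marginals.
        v = [
            col_mass / sum(kernel[i][j] * u[i] for i in range(n_feat))
            for j in range(n_cols)
        ]
        u = [
            row_mass / sum(kernel[i][j] * v[j] for j in range(n_cols))
            for i in range(n_feat)
        ]
    return [
        [u[i] * kernel[i][j] * v[j] for j in range(n_cols)] for i in range(n_feat)
    ]


assignment = sinkhorn_with_dustbin(
    [[2.0, 0.1], [0.3, 1.5], [0.2, 0.1]], dustbin_score=1.0
)
for row in assignment:
    print([round(x, 3) for x in row])
```

Dropping the last column before aggregation is what would discard the mass assigned to the dustbin.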
-
Segmentation-Free Streaming Machine Translation
Authors:
Javier Iranzo-Sánchez,
Jorge Iranzo-Sánchez,
Adrià Giménez,
Jorge Civera,
Alfons Juan
Abstract:
Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate segmentation step which splits the transcription stream into sentence-like units. However, the incorporation of a hard segmentation constrains the MT system and is a source of errors. This paper proposes a Segmentation-Free framework that enables the model to translate an unsegmented source stream by delaying the segmentation decision until the translation has been generated. Extensive experiments show how the proposed Segmentation-Free framework has better quality-latency trade-off than competing approaches that use an independent segmentation model. Software, data and models will be released upon paper acceptance.
Submitted 25 May, 2024; v1 submitted 26 September, 2023;
originally announced September 2023.
-
Motion-Bias-Free Feature-Based SLAM
Authors:
Alejandro Fontan,
Javier Civera,
Michael Milford
Abstract:
For SLAM to be safely deployed in unstructured real world environments, it must possess several key properties that are not encompassed by conventional benchmarks. In this paper we show that SLAM commutativity, that is, consistency in trajectory estimates on forward and reverse traverses of the same route, is a significant issue for the state of the art. Current pipelines show a significant bias between forward and reverse directions of travel, and this bias is in addition inconsistent regarding which direction of travel exhibits better performance. In this paper we propose several contributions to feature-based SLAM pipelines that remedy the motion bias problem. In a comprehensive evaluation across four datasets, we show that our contributions implemented in ORB-SLAM2 substantially reduce the bias between forward and backward motion and additionally improve the aggregated trajectory error. Removing the SLAM motion bias has significant relevance for the wide range of robotics and computer vision applications where performance consistency is important.
Submitted 13 September, 2023;
originally announced September 2023.
-
Robust Single Rotation Averaging Revisited
Authors:
Seong Hun Lee,
Javier Civera
Abstract:
In this work, we propose a novel method for robust single rotation averaging that can efficiently handle an extremely large fraction of outliers. Our approach is to minimize the total truncated least unsquared deviations (TLUD) cost of geodesic distances. The proposed algorithm consists of three steps: First, we consider each input rotation as a potential initial solution and choose the one that yields the least sum of truncated chordal deviations. Next, we obtain the inlier set using the initial solution and compute its chordal $L_2$-mean. Finally, starting from this estimate, we iteratively compute the geodesic $L_1$-mean of the inliers using the Weiszfeld algorithm on $SO(3)$. An extensive evaluation shows that our method is robust against up to 99% outliers given a sufficient number of accurate inliers, outperforming the current state of the art.
Submitted 10 September, 2024; v1 submitted 11 September, 2023;
originally announced September 2023.
-
Faster Optimization in S-Graphs Exploiting Hierarchy
Authors:
Hriday Bavle,
Jose Luis Sanchez-Lopez,
Javier Civera,
Holger Voos
Abstract:
3D scene graphs hierarchically represent the environment, appropriately organizing different environmental entities in various layers. Our previous work on situational graphs extends the concept of 3D scene graph to SLAM by tightly coupling the robot poses with the scene graph entities, achieving state-of-the-art results. However, one of the limitations of S-Graphs is scalability in very large environments, since the graph size grows over time and increases the computational complexity.
To overcome this limitation, in this work we present initial research on an improved version of S-Graphs that exploits the hierarchy to reduce the graph size by marginalizing redundant robot poses and their connections to the observations of the same structural entities. Firstly, we propose the generation and optimization of room-local graphs encompassing all graph entities within a room-like structure. These room-local graphs are used to compress the S-Graphs, marginalizing the redundant robot keyframes within the given room. We then perform windowed local optimization of the compressed graph at regular time-distance intervals. A global optimization of the compressed graph is performed every time a loop closure is detected. We show similar accuracy compared to the baseline, together with a 39.81% reduction in computation time.
Submitted 22 August, 2023;
originally announced August 2023.
-
LightDepth: Single-View Depth Self-Supervision from Illumination Decline
Authors:
Javier Rodríguez-Puigvert,
Víctor M. Batlle,
J. M. M. Montiel,
Ruben Martinez-Cantin,
Pascal Fua,
Juan D. Tardós,
Javier Civera
Abstract:
Single-view depth estimation can be remarkably effective if there is enough ground-truth depth data for supervised training. However, there are scenarios, especially in medicine in the case of endoscopies, where such data cannot be obtained. In such cases, multi-view self-supervision and synthetic-to-real transfer serve as alternative approaches, although with a considerable performance reduction in comparison to the supervised case. Instead, we propose a single-view self-supervised method that achieves a performance similar to the supervised case. In some medical devices, such as endoscopes, the camera and light sources are co-located at a small distance from the target surfaces. Thus, we can exploit that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance to the surface, providing a strong single-view self-supervisory signal. In our experiments, our self-supervised models deliver accuracies comparable to those of fully supervised ones, while being applicable without depth ground-truth data.
Submitted 19 September, 2023; v1 submitted 21 August, 2023;
originally announced August 2023.
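The inverse-square relation stated in the abstract above can be made concrete with a small sketch. It assumes a simple shading model, brightness = gain * albedo * cos(incidence) / distance^2, with a co-located light and camera; this model, the function name and the `light_gain` parameter are illustrative assumptions and not the LightDepth formulation itself.

```python
import math


def depth_from_brightness(brightness, albedo, cos_incidence, light_gain=1.0):
    """Invert brightness = light_gain * albedo * cos_incidence / depth**2.

    All inputs are assumed positive; light_gain is a hypothetical constant
    lumping together light power and camera response.
    """
    return math.sqrt(light_gain * albedo * cos_incidence / brightness)


# With albedo and orientation fixed, a pixel four times darker is twice as far.
near = depth_from_brightness(brightness=1.0, albedo=1.0, cos_incidence=1.0)
far = depth_from_brightness(brightness=0.25, albedo=1.0, cos_incidence=1.0)
print(near, far)
```

In a self-supervised setting the same relation would be used in the forward direction: predicted depth, albedo and orientation render a brightness that is compared against the observed image.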
-
GNSS-stereo-inertial SLAM for arable farming
Authors:
Javier Cremona,
Javier Civera,
Ernesto Kofman,
Taihú Pire
Abstract:
The accelerating pace in the automation of agricultural tasks demands highly accurate and robust localization systems for field robots. Simultaneous Localization and Mapping (SLAM) methods inevitably accumulate drift on exploratory trajectories and primarily rely on place revisiting and loop closing to keep a bounded global localization error. Loop closure techniques are significantly challenging in agricultural fields, as the local visual appearance of different views is very similar and might change easily due to weather effects. A suitable alternative in practice is to employ global sensor positioning systems jointly with the rest of the robot sensors. In this paper we propose and implement the fusion of global navigation satellite system (GNSS), stereo views, and inertial measurements for localization purposes. Specifically, we incorporate, in a tightly coupled manner, GNSS measurements into the stereo-inertial ORB-SLAM3 pipeline. We thoroughly evaluate our implementation in the sequences of the Rosario data set, recorded by an autonomous robot in soybean fields, and our own in-house data. Our data includes measurements from a conventional GNSS, rarely included in evaluations of state-of-the-art approaches. We characterize the performance of GNSS-stereo-inertial SLAM in this application case, reporting pose error reductions between 10% and 30% compared to visual-inertial and loosely coupled GNSS-stereo-inertial baselines. In addition to such analysis, we also release the code of our implementation as open source.
Submitted 24 July, 2023;
originally announced July 2023.
-
The Drunkard's Odometry: Estimating Camera Motion in Deforming Scenes
Authors:
David Recasens,
Martin R. Oswald,
Marc Pollefeys,
Javier Civera
Abstract:
Estimating camera motion in deformable scenes poses a complex and open research challenge. Most existing non-rigid structure from motion techniques assume that static scene parts are observed alongside deforming scene parts in order to establish an anchoring reference. However, this assumption does not hold true in certain relevant application cases such as endoscopies. Deformable odometry and SLAM pipelines, which tackle the most challenging scenario of exploratory trajectories, suffer from a lack of robustness and proper quantitative evaluation methodologies. To tackle this issue with a common benchmark, we introduce the Drunkard's Dataset, a challenging collection of synthetic data targeting visual navigation and reconstruction in deformable environments. This dataset is the first large set of exploratory camera trajectories with ground truth inside 3D scenes where every surface exhibits non-rigid deformations over time. Simulations in realistic 3D buildings let us obtain a vast amount of data and ground truth labels, including camera poses, RGB images and depth, optical flow and normal maps at high resolution and quality. We further present a novel deformable odometry method, dubbed the Drunkard's Odometry, which decomposes optical flow estimates into rigid-body camera motion and non-rigid scene deformations. In order to validate our data, our work contains an evaluation of several baselines as well as a novel tracking error metric which does not require ground truth data. Dataset and code: https://davidrecasens.github.io/TheDrunkard'sOdometry/
Submitted 29 June, 2023;
originally announced June 2023.
-
DAC: Detector-Agnostic Spatial Covariances for Deep Local Features
Authors:
Javier Tirado-Garín,
Frederik Warburg,
Javier Civera
Abstract:
Current deep visual local feature detectors do not model the spatial uncertainty of detected features, producing suboptimal results in downstream applications. In this work, we propose two post-hoc covariance estimates that can be plugged into any pretrained deep feature detector: a simple, isotropic covariance estimate that uses the predicted score at a given pixel location, and a full covariance estimate via the local structure tensor of the learned score maps. Both methods are easy to implement and can be applied to any deep feature detector. We show that these covariances are directly related to errors in feature matching, leading to improvements in downstream tasks, including solving the perspective-n-point problem and motion-only bundle adjustment. Code is available at https://github.com/javrtg/DAC
Submitted 15 August, 2023; v1 submitted 20 May, 2023;
originally announced May 2023.
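As a rough illustration of the full covariance estimate mentioned in the abstract above, the sketch below builds a local structure tensor from finite-difference gradients of a score map and takes its inverse as a 2x2 spatial covariance. Using the plain inverse, the window radius and the function name are assumptions made here for illustration and are not taken from the DAC code.

```python
def structure_tensor_covariance(score_map, row, col, radius=1):
    """Return a 2x2 covariance [[xx, xy], [xy, yy]] for the pixel (row, col).

    The structure tensor sums outer products of central-difference gradients
    over a (2 * radius + 1)^2 window; the covariance is assumed to be its
    inverse. The pixel must lie at least radius + 1 pixels from the border.
    """
    sxx = sxy = syy = 0.0
    for r in range(row - radius, row + radius + 1):
        for c in range(col - radius, col + radius + 1):
            gx = 0.5 * (score_map[r][c + 1] - score_map[r][c - 1])
            gy = 0.5 * (score_map[r + 1][c] - score_map[r - 1][c])
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    det = sxx * syy - sxy * sxy
    if det <= 1e-12:
        raise ValueError("structure tensor is (near-)singular at this pixel")
    return [[syy / det, -sxy / det], [-sxy / det, sxx / det]]


# Toy score map with a peak at (3, 3) that is sharper vertically than
# horizontally, so the vertical uncertainty should come out smaller.
scores = [[-((c - 3) ** 2 + 4 * (r - 3) ** 2) for c in range(7)] for r in range(7)]
cov = structure_tensor_covariance(scores, 3, 3)
print(cov)
```

A sharply peaked score map yields large gradients and therefore a small covariance, which matches the intuition that well-localized detections are more certain.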
-
Ray-Patch: An Efficient Querying for Light Field Transformers
Authors:
T. Berriel Martins,
Javier Civera
Abstract:
In this paper we propose the Ray-Patch querying, a novel model to efficiently query transformers to decode implicit representations into target views. Our Ray-Patch decoding reduces the computational footprint and increases inference speed by up to one order of magnitude compared to previous models, without losing global attention, and hence maintaining specific task metrics. The key idea of our novel querying is to split the target image into a set of patches, then query the transformer for each patch to extract a set of feature vectors, which are finally decoded into the target image using convolutional layers. Our experimental results, implementing Ray-Patch in 3 different architectures and evaluating it in 2 different tasks and datasets, demonstrate and quantify the effectiveness of our method, specifically a notable boost in rendering speed for the same task metrics.
Submitted 17 August, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
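The saving from querying per patch rather than per pixel, as described in the abstract above, comes down to simple arithmetic, sketched here. The image size and patch size are arbitrary example values, and the function is a hypothetical helper rather than part of the Ray-Patch code.

```python
import math


def query_counts(height, width, patch_size):
    """Number of decoder queries per target view: one per pixel vs. one per patch."""
    per_pixel = height * width
    per_patch = math.ceil(height / patch_size) * math.ceil(width / patch_size)
    return per_pixel, per_patch


per_pixel, per_patch = query_counts(height=240, width=320, patch_size=8)
print(per_pixel, per_patch, per_pixel / per_patch)
```

The number of queries shrinks by roughly the square of the patch size; the actual speed-up also depends on the cost of the convolutional decoder that turns patch features back into pixels.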
-
Graph-based Global Robot Simultaneous Localization and Mapping using Architectural Plans
Authors:
Muhammad Shaheer,
Jose Andres Millan-Romera,
Hriday Bavle,
Jose Luis Sanchez-Lopez,
Javier Civera,
Holger Voos
Abstract:
In this paper, we propose a solution for graph-based global robot simultaneous localization and mapping (SLAM) using architectural plans. Before the start of the robot operation, the previously available architectural plan of the building is converted into our proposed architectural graph (A-Graph). When the robot starts its operation, it uses its onboard LIDAR and odometry to carry out an online SLAM relying on our situational graph (S-Graph), which includes both a representation of the environment with multiple levels of abstraction, such as walls or rooms, and their relationships, and the robot poses with their associated keyframes. Our novel graph-to-graph matching method is used to relate the aforementioned S-Graph and A-Graph, which are aligned and merged, resulting in our novel informed Situational Graph (iS-Graph). Our iS-Graph not only provides graph-based global robot localization, but also extends the graph-based SLAM capabilities of the S-Graph by incorporating into it the prior knowledge of the environment existing in the architectural plan.
Submitted 16 May, 2023;
originally announced May 2023.
-
Graph-based Global Robot Localization Informing Situational Graphs with Architectural Graphs
Authors:
Muhammad Shaheer,
Jose Andres Millan-Romera,
Hriday Bavle,
Jose Luis Sanchez-Lopez,
Javier Civera,
Holger Voos
Abstract:
In this paper, we propose a solution for legged robot localization using architectural plans. Our specific contributions towards this goal are several. Firstly, we develop a method for converting the plan of a building into what we denote as an architectural graph (A-Graph). When the robot starts moving in an environment, we assume it has no knowledge about it, and it estimates an online situational graph representation (S-Graph) of its surroundings. We develop a novel graph-to-graph matching method, in order to relate the S-Graph estimated online from the robot sensors and the A-Graph extracted from the building plans. Note the challenge in this, as the S-Graph may show a partial view of the full A-Graph, their nodes are heterogeneous and their reference frames are different. After the matching, both graphs are aligned and merged, resulting in what we denote as an informed Situational Graph (iS-Graph), with which we achieve global robot localization and exploitation of prior knowledge from the building plans. Our experiments show that our pipeline achieves higher robustness and a significantly lower pose error than several LiDAR localization baselines.
Submitted 3 March, 2023;
originally announced March 2023.
-
S-Graphs+: Real-time Localization and Mapping leveraging Hierarchical Representations
Authors:
Hriday Bavle,
Jose Luis Sanchez-Lopez,
Muhammad Shaheer,
Javier Civera,
Holger Voos
Abstract:
In this paper, we present an evolved version of Situational Graphs, which jointly models in a single optimizable factor graph (1) a pose graph, as a set of robot keyframes comprising associated measurements and robot poses, and (2) a 3D scene graph, as a high-level representation of the environment that encodes its different geometric elements with semantic attributes and the relational information between them.
Specifically, our S-Graphs+ is a novel four-layered factor graph that includes: (1) a keyframes layer with robot pose estimates, (2) a walls layer representing wall surfaces, (3) a rooms layer encompassing sets of wall planes, and (4) a floors layer gathering the rooms within a given floor level. The above graph is optimized in real-time to obtain a robust and accurate estimate of the robot's pose and its map, simultaneously constructing and leveraging high-level information of the environment. To extract this high-level information, we present novel room and floor segmentation algorithms utilizing the mapped wall planes and free-space clusters.
We tested S-Graphs+ on multiple datasets, including simulated and real data of indoor environments from varying construction sites, and on a real public dataset of several indoor office areas. On average over our datasets, S-Graphs+ outperforms the accuracy of the second-best method by a margin of 10.67%, while extending the robot's situational awareness with a richer scene model. Moreover, we make the software available as a docker file.
Submitted 26 May, 2023; v1 submitted 22 December, 2022;
originally announced December 2022.
-
What's Wrong with the Absolute Trajectory Error?
Authors:
Seong Hun Lee,
Javier Civera
Abstract:
One of the limitations of the commonly used Absolute Trajectory Error (ATE) is that it is highly sensitive to outliers. As a result, in the presence of just a few outliers, it often fails to reflect the varying accuracy as the inlier trajectory error or the number of outliers varies. In this work, we propose an alternative error metric for evaluating the accuracy of the reconstructed camera trajectory. Our metric, named Discernible Trajectory Error (DTE), is computed in five steps: (1) Shift the ground-truth and estimated trajectories such that both of their geometric medians are located at the origin. (2) Rotate the estimated trajectory such that it minimizes the sum of geodesic distances between the corresponding camera orientations. (3) Scale the estimated trajectory such that the median distance of the cameras to their geometric median is the same as that of the ground truth. (4) Compute, winsorize and normalize the distances between the corresponding cameras. (5) Obtain the DTE by taking the average of the mean and the root-mean-square (RMS) of the resulting distances. This metric is an attractive alternative to the ATE, in that it is capable of discerning the varying trajectory accuracy as the inlier trajectory error or the number of outliers varies. Using a similar idea, we also propose a novel rotation error metric, named Discernible Rotation Error (DRE), which has similar advantages to the DTE. Furthermore, we propose a simple yet effective method for calibrating the camera-to-marker rotation, which is needed for the computation of our metrics. Our methods are verified through extensive simulations.
Submitted 9 September, 2024; v1 submitted 10 December, 2022;
originally announced December 2022.
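The five-step DTE recipe in the abstract above can be approximated with a simplified sketch. Several things here are assumptions rather than the authors' definitions: step (2), the rotation alignment, is skipped (the trajectories are assumed to be rotationally aligned already), the geometric median is computed with a fixed number of Weiszfeld iterations, and winsorizing and normalizing are implemented as clipping at a hypothetical `cap` and dividing by it.

```python
import math
from statistics import median


def geometric_median(points, n_iters=200, eps=1e-12):
    """Approximate the geometric median with Weiszfeld iterations from the centroid."""
    dim = len(points[0])
    est = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(n_iters):
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        est = [
            sum(w * p[k] for w, p in zip(weights, points)) / total
            for k in range(dim)
        ]
    return est


def centered(points):
    """Step 1: shift the trajectory so its geometric median is at the origin."""
    med = geometric_median(points)
    return [[x - m for x, m in zip(p, med)] for p in points]


def simplified_dte(gt, est, cap=1.0):
    """Simplified DTE without rotation alignment; cap is a hypothetical threshold."""
    gt_c, est_c = centered(gt), centered(est)
    # Step 3: match the median distance to the geometric median (assumed non-zero).
    gt_spread = median(math.hypot(*p) for p in gt_c)
    est_spread = median(math.hypot(*p) for p in est_c)
    scale = gt_spread / est_spread
    est_c = [[scale * x for x in p] for p in est_c]
    # Step 4: distances between corresponding cameras, clipped and normalized.
    dists = [min(math.dist(g, e), cap) / cap for g, e in zip(gt_c, est_c)]
    # Step 5: average of the mean and the RMS of the resulting distances.
    mean = sum(dists) / len(dists)
    rms = math.sqrt(sum(d * d for d in dists) / len(dists))
    return 0.5 * (mean + rms)


gt = [(0, 0, 0), (1, 0, 0), (2, 1, 0), (3, 1, 1), (4, 3, 1)]
shifted_scaled = [(2 * x + 5, 2 * y - 1, 2 * z + 3) for x, y, z in gt]
print(simplified_dte(gt, shifted_scaled))
```

Because the distances are clipped before averaging, a single badly wrong camera can contribute at most a bounded amount, which is the property that separates this kind of metric from the ATE.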
-
SfM-TTR: Using Structure from Motion for Test-Time Refinement of Single-View Depth Networks
Authors:
Sergio Izquierdo,
Javier Civera
Abstract:
Estimating a dense depth map from a single view is geometrically ill-posed, and state-of-the-art methods rely on learning depth's relation with visual appearance using deep neural networks. On the other hand, Structure from Motion (SfM) leverages multi-view constraints to produce very accurate but sparse maps, as matching across images is typically limited by locally discriminative texture. In this work, we combine the strengths of both approaches by proposing a novel test-time refinement (TTR) method, denoted as SfM-TTR, that boosts the performance of single-view depth networks at test time using SfM multi-view cues. Specifically, and differently from the state of the art, we use sparse SfM point clouds as a test-time self-supervisory signal, fine-tuning the network encoder to learn a better representation of the test scene. Our results show that adding SfM-TTR to several state-of-the-art self-supervised and supervised networks significantly improves their performance, outperforming previous TTR baselines mainly based on photometric multi-view consistency. The code is available at https://github.com/serizba/SfM-TTR.
Submitted 31 March, 2023; v1 submitted 24 November, 2022;
originally announced November 2022.
-
Advanced Situational Graphs for Robot Navigation in Structured Indoor Environments
Authors:
Hriday Bavle,
Jose Luis Sanchez-Lopez,
Muhammad Shaheer,
Javier Civera,
Holger Voos
Abstract:
Mobile robots extract information from their environment to understand their current situation and enable intelligent decision making and autonomous task execution. In our previous work, we introduced the concept of Situation Graphs (S-Graphs), which combine, in a single optimizable graph, the robot keyframes and the representation of the environment with geometric, semantic and topological abstractions. Although S-Graphs were built and optimized in real-time and demonstrated state-of-the-art results, they are limited to specific structured environments with specific hand-tuned dimensions of rooms and corridors.
In this work, we present an advanced version of the Situational Graphs (S-Graphs+), consisting of a five-layered optimizable graph that includes (1) a metric layer along with the graph of free-space clusters, (2) a keyframe layer where the robot poses are registered, (3) a metric-semantic layer consisting of the extracted planar walls, (4) a novel rooms layer constraining the extracted planar walls, and (5) a novel floors layer encompassing the rooms within a given floor level. S-Graphs+ demonstrates improved performance over S-Graphs, efficiently extracting the room information while simultaneously improving the pose estimate of the robot, thus extending the robot's situational awareness in the form of a five-layered environmental model.
Submitted 16 November, 2022;
originally announced November 2022.
-
RODIAN: Robustified Median
Authors:
Seong Hun Lee,
Javier Civera
Abstract:
We propose a robust method for averaging numbers contaminated by a large proportion of outliers. Our method, dubbed RODIAN, is inspired by the key idea of MINPRAN [1]: We assume that the outliers are uniformly distributed within the range of the data and we search for the region that is least likely to contain outliers only. The median of the data within this region is then taken as RODIAN. Our approach can accurately estimate the true mean of data with more than 50% outliers and runs in time $O(n\log n)$. Unlike other robust techniques, it is completely deterministic and does not rely on a known inlier error bound. Our extensive evaluation shows that RODIAN is much more robust than the median and the least-median-of-squares. This result also holds in the case of non-uniform outlier distributions.
Submitted 18 November, 2022; v1 submitted 3 June, 2022;
originally announced June 2022.
-
EndoMapper dataset of complete calibrated endoscopy procedures
Authors:
Pablo Azagra,
Carlos Sostres,
Ángel Ferrandez,
Luis Riazuelo,
Clara Tomasini,
Oscar León Barbed,
Javier Morlana,
David Recasens,
Victor M. Batlle,
Juan J. Gómez-Rodríguez,
Richard Elvira,
Julia López,
Cristina Oriol,
Javier Civera,
Juan D. Tardós,
Ana Cristina Murillo,
Angel Lanas,
José M. M. Montiel
Abstract:
Computer-assisted systems are becoming broadly used in medicine. In endoscopy, most research focuses on the automatic detection of polyps or other pathologies, but localization and navigation of the endoscope are completely performed manually by physicians. To broaden this research and bring spatial Artificial Intelligence to endoscopies, data from complete procedures is needed. This paper introduces the Endomapper dataset, the first collection of complete endoscopy sequences acquired during regular medical practice, making secondary use of medical data. Its main purpose is to facilitate the development and evaluation of Visual Simultaneous Localization and Mapping (VSLAM) methods in real endoscopy data. The dataset contains more than 24 hours of video. It is the first endoscopic dataset that includes endoscope calibration as well as the original calibration videos. Meta-data and annotations associated with the dataset include anatomical landmarks, procedure labeling, segmentations, reconstructions, simulated sequences with ground truth and same-patient procedures. The software used in this paper is publicly available.
Submitted 10 October, 2023; v1 submitted 29 April, 2022;
originally announced April 2022.
-
From Simultaneous to Streaming Machine Translation by Leveraging Streaming History
Authors:
Javier Iranzo-Sánchez,
Jorge Civera,
Alfons Juan
Abstract:
Simultaneous Machine Translation is the task of incrementally translating an input sentence before it is fully available. Currently, simultaneous translation is carried out by translating each sentence independently of the previously translated text. More generally, Streaming MT can be understood as an extension of Simultaneous MT to the incremental translation of a continuous input text stream. In this work, a state-of-the-art simultaneous sentence-level MT system is extended to the streaming setup by leveraging the streaming history. Extensive empirical results are reported on IWSLT Translation Tasks, showing that leveraging the streaming history leads to significant quality gains. In particular, the proposed system proves to compare favorably to the best performing systems.
Submitted 31 March, 2022; v1 submitted 4 March, 2022;
originally announced March 2022.
-
Situational Graphs for Robot Navigation in Structured Indoor Environments
Authors:
Hriday Bavle,
Jose Luis Sanchez-Lopez,
Muhammad Shaheer,
Javier Civera,
Holger Voos
Abstract:
Mobile robots should be aware of their situation, comprising the deep understanding of their surrounding environment along with the estimation of its own state, to successfully make intelligent decisions and execute tasks autonomously in real environments. 3D scene graphs are an emerging field of research that propose to represent the environment in a joint model comprising geometric, semantic and relational/topological dimensions. Although 3D scene graphs have already been combined with SLAM techniques to provide robots with situational understanding, further research is still required to effectively deploy them on-board mobile robots.
To this end, we present in this paper a novel, real-time, online built Situational Graph (S-Graph), which combines in a single optimizable graph, the representation of the environment with the aforementioned three dimensions, together with the robot pose. Our method utilizes odometry readings and planar surfaces extracted from 3D LiDAR scans, to construct and optimize in real-time a three-layered S-Graph that includes (1) a robot tracking layer where the robot poses are registered, (2) a metric-semantic layer with features such as planar walls and (3) our novel topological layer constraining the planar walls using higher-level features such as corridors and rooms. Our proposal does not only demonstrate state-of-the-art results for pose estimation of the robot, but also contributes with a metric-semantic-topological model of the environment.
Submitted 1 July, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
-
Danish Airs and Grounds: A Dataset for Aerial-to-Street-Level Place Recognition and Localization
Authors:
Andrea Vallone,
Frederik Warburg,
Hans Hansen,
Søren Hauberg,
Javier Civera
Abstract:
Place recognition and visual localization are particularly challenging in wide baseline configurations. In this paper, we contribute with the Danish Airs and Grounds (DAG) dataset, a large collection of street-level and aerial images targeting such cases. Its main challenge lies in the extreme viewing-angle difference between query and reference images with consequent changes in illumination and perspective. The dataset is larger and more diverse than current publicly available data, including more than 50 km of road in urban, suburban and rural areas. All images are associated with accurate 6-DoF metadata that allows the benchmarking of visual localization methods.
We also propose a map-to-image re-localization pipeline, that first estimates a dense 3D reconstruction from the aerial images and then matches query street-level images to street-level renderings of the 3D model. The dataset can be downloaded at: https://frederikwarburg.github.io/DAG
Submitted 3 February, 2022;
originally announced February 2022.
-
A Model for Multi-View Residual Covariances based on Perspective Deformation
Authors:
Alejandro Fontan,
Laura Oliva,
Javier Civera,
Rudolph Triebel
Abstract:
In this work, we derive a model for the covariance of the visual residuals in multi-view SfM, odometry and SLAM setups. The core of our approach is the formulation of the residual covariances as a combination of geometric and photometric noise sources. And our key novel contribution is the derivation of a term modelling how local 2D patches suffer from perspective deformation when imaging 3D surfaces around a point. Together, these add up to an efficient and general formulation which not only improves the accuracy of both feature-based and direct methods, but can also be used to estimate more accurate measures of the state entropy and hence better founded point visibility thresholds. We validate our model with synthetic and real data and integrate it into photometric and feature-based Bundle Adjustment, improving their accuracy with a negligible overhead.
Submitted 1 February, 2022;
originally announced February 2022.
-
Jacobian Computation for Cumulative B-Splines on SE(3) and Application to Continuous-Time Object Tracking
Authors:
Javier Tirado,
Javier Civera
Abstract:
In this paper we propose a method that estimates the $SE(3)$ continuous trajectories (orientation and translation) of the dynamic rigid objects present in a scene, from multiple RGB-D views. Specifically, we fit the object trajectories to cumulative B-Splines curves, which allow us to interpolate, at any intermediate time stamp, not only their poses but also their linear and angular velocities and accelerations. Additionally, we derive in this work the analytical $SE(3)$ Jacobians needed by the optimization, being applicable to any other approach that uses this type of curves. To the best of our knowledge this is the first work that proposes 6-DoF continuous-time object tracking, which we endorse with significant computational cost reduction thanks to our analytical derivations. We evaluate our proposal in synthetic data and in a public benchmark, showing competitive results in localization and significant improvements in velocity estimation in comparison to discrete-time approaches.
Submitted 24 May, 2022; v1 submitted 25 January, 2022;
originally announced January 2022.
-
On the Uncertain Single-View Depths in Colonoscopies
Authors:
Javier Rodríguez-Puigvert,
David Recasens,
Javier Civera,
Rubén Martínez-Cantín
Abstract:
Estimating depth information from endoscopic images is a prerequisite for a wide set of AI-assisted technologies, such as accurate localization and measurement of tumors, or identification of non-inspected areas. As the domain specificity of colonoscopies -- deformable low-texture environments with fluids, poor lighting conditions and abrupt sensor motions -- pose challenges to multi-view 3D reconstructions, single-view depth learning stands out as a promising line of research. Depth learning can be extended in a Bayesian setting, which enables continual learning, improves decision making and can be used to compute confidence intervals or quantify uncertainty for in-body measurements. In this paper, we explore for the first time Bayesian deep networks for single-view depth estimation in colonoscopies. Our specific contribution is two-fold: 1) an exhaustive analysis of scalable Bayesian networks for depth learning in different datasets, highlighting challenges and conclusions regarding synthetic-to-real domain changes and supervised vs. self-supervised methods; and 2) a novel teacher-student approach to deep depth learning that takes into account the teacher uncertainty.
Submitted 20 July, 2022; v1 submitted 16 December, 2021;
originally announced December 2021.
-
HARA: A Hierarchical Approach for Robust Rotation Averaging
Authors:
Seong Hun Lee,
Javier Civera
Abstract:
We propose a novel hierarchical approach for multiple rotation averaging, dubbed HARA. Our method incrementally initializes the rotation graph based on a hierarchy of triplet support. The key idea is to build a spanning tree by prioritizing the edges with many strong triplet supports and gradually adding those with weaker and fewer supports. This reduces the risk of adding outliers in the spanning tree. As a result, we obtain a robust initial solution that enables us to filter outliers prior to nonlinear optimization. With minimal modification, our approach can also integrate the knowledge of the number of valid 2D-2D correspondences. We perform extensive evaluations on both synthetic and real datasets, demonstrating state-of-the-art results.
Submitted 29 March, 2022; v1 submitted 16 November, 2021;
originally announced November 2021.
-
Bayesian Deep Neural Networks for Supervised Learning of Single-View Depth
Authors:
Javier Rodríguez-Puigvert,
Rubén Martínez-Cantín,
Javier Civera
Abstract:
Uncertainty quantification is essential for robotic perception, as overconfident or point estimators can lead to collisions and damages to the environment and the robot. In this paper, we evaluate scalable approaches to uncertainty quantification in single-view supervised depth learning, specifically MC dropout and deep ensembles. For MC dropout, in particular, we explore the effect of the dropout at different levels in the architecture. We show that adding dropout in all layers of the encoder brings better results than other variations found in the literature. This configuration performs similarly to deep ensembles with a much lower memory footprint, which is relevant for applications. Finally, we explore the use of depth uncertainty for pseudo-RGBD ICP and demonstrate its potential to estimate accurate two-view relative motion with the real scale.
Submitted 15 December, 2021; v1 submitted 29 April, 2021;
originally announced April 2021.
-
Stream-level Latency Evaluation for Simultaneous Machine Translation
Authors:
Javier Iranzo-Sánchez,
Jorge Civera,
Alfons Juan
Abstract:
Simultaneous machine translation has recently gained traction thanks to significant quality improvements and the advent of streaming applications. Simultaneous translation systems need to find a trade-off between translation quality and response time, and with this purpose multiple latency measures have been proposed. However, latency evaluations for simultaneous translation are estimated at the sentence level, not taking into account the sequential nature of a streaming scenario. Indeed, these sentence-level latency measures are not well suited for continuous stream translation resulting in figures that are not coherent with the simultaneous translation policy of the system being assessed. This work proposes a stream-level adaptation of the current latency measures based on a re-segmentation approach applied to the output translation, that is successfully evaluated on streaming conditions for a reference IWSLT task.
Submitted 8 September, 2021; v1 submitted 18 April, 2021;
originally announced April 2021.
-
Endo-Depth-and-Motion: Reconstruction and Tracking in Endoscopic Videos using Depth Networks and Photometric Constraints
Authors:
David Recasens,
José Lamarca,
José M. Fácil,
J. M. M. Montiel,
Javier Civera
Abstract:
Estimating a scene reconstruction and the camera motion from in-body videos is challenging due to several factors, e.g. the deformation of in-body cavities or the lack of texture. In this paper we present Endo-Depth-and-Motion, a pipeline that estimates the 6-degrees-of-freedom camera pose and dense 3D scene models from monocular endoscopic videos. Our approach leverages recent advances in self-supervised depth networks to generate pseudo-RGBD frames, then tracks the camera pose using photometric residuals and fuses the registered depth maps in a volumetric representation. We present an extensive experimental evaluation in the public dataset Hamlyn, showing high-quality results and comparisons against relevant baselines. We also release all models and code for future comparisons.
Submitted 3 July, 2021; v1 submitted 30 March, 2021;
originally announced March 2021.
-
Bayesian Triplet Loss: Uncertainty Quantification in Image Retrieval
Authors:
Frederik Warburg,
Martin Jørgensen,
Javier Civera,
Søren Hauberg
Abstract:
Uncertainty quantification in image retrieval is crucial for downstream decisions, yet it remains a challenging and largely unexplored problem. Current methods for estimating uncertainties are poorly calibrated, computationally expensive, or based on heuristics. We present a new method that views image embeddings as stochastic features rather than deterministic features. Our two main contributions are (1) a likelihood that matches the triplet constraint and that evaluates the probability of an anchor being closer to a positive than a negative; and (2) a prior over the feature space that justifies the conventional l2 normalization. To ensure computational efficiency, we derive a variational approximation of the posterior, called the Bayesian triplet loss, that produces state-of-the-art uncertainty estimates and matches the predictive performance of current state-of-the-art methods.
Submitted 17 September, 2021; v1 submitted 25 November, 2020;
originally announced November 2020.
-
Rotation-Only Bundle Adjustment
Authors:
Seong Hun Lee,
Javier Civera
Abstract:
We propose a novel method for estimating the global rotations of the cameras independently of their positions and the scene structure. When two calibrated cameras observe five or more of the same points, their relative rotation can be recovered independently of the translation. We extend this idea to multiple views, thereby decoupling the rotation estimation from the translation and structure estimation. Our approach provides several benefits such as complete immunity to inaccurate translations and structure, and the accuracy improvement when used with rotation averaging. We perform extensive evaluations on both synthetic and real datasets, demonstrating consistent and significant gains in accuracy when used with the state-of-the-art rotation averaging method.
Submitted 27 March, 2021; v1 submitted 23 November, 2020;
originally announced November 2020.
-
DOT: Dynamic Object Tracking for Visual SLAM
Authors:
Irene Ballester,
Alejandro Fontan,
Javier Civera,
Klaus H. Strobl,
Rudolph Triebel
Abstract:
In this paper we present DOT (Dynamic Object Tracking), a front-end that added to existing SLAM systems can significantly improve their robustness and accuracy in highly dynamic environments. DOT combines instance segmentation and multi-view geometry to generate masks for dynamic objects in order to allow SLAM systems based on rigid scene models to avoid such image areas in their optimizations.
To determine which objects are actually moving, DOT segments first instances of potentially dynamic objects and then, with the estimated camera motion, tracks such objects by minimizing the photometric reprojection error. This short-term tracking improves the accuracy of the segmentation with respect to other approaches. In the end, only actually dynamic masks are generated. We have evaluated DOT with ORB-SLAM 2 in three public datasets. Our results show that our approach improves significantly the accuracy and robustness of ORB-SLAM 2, especially in highly dynamic scenes.
Submitted 30 September, 2020;
originally announced October 2020.
-
The ABC130 barrel module prototyping programme for the ATLAS strip tracker
Authors:
Luise Poley,
Craig Sawyer,
Sagar Addepalli,
Anthony Affolder,
Bruno Allongue,
Phil Allport,
Eric Anderssen,
Francis Anghinolfi,
Jean-François Arguin,
Jan-Hendrik Arling,
Olivier Arnaez,
Nedaa Alexandra Asbah,
Joe Ashby,
Eleni Myrto Asimakopoulou,
Naim Bora Atlay,
Ludwig Bartsch,
Matthew J. Basso,
James Beacham,
Scott L. Beaupré,
Graham Beck,
Carl Beichert,
Laura Bergsten,
Jose Bernabeu,
Prajita Bhattarai,
Ingo Bloch
, et al. (224 additional authors not shown)
Abstract:
For the Phase-II Upgrade of the ATLAS Detector, its Inner Detector, consisting of silicon pixel, silicon strip and transition radiation sub-detectors, will be replaced with an all new 100 % silicon tracker, composed of a pixel tracker at inner radii and a strip tracker at outer radii. The future ATLAS strip tracker will include 11,000 silicon sensor modules in the central region (barrel) and 7,000 modules in the forward region (end-caps), which are foreseen to be constructed over a period of 3.5 years. The construction of each module consists of a series of assembly and quality control steps, which were engineered to be identical for all production sites. In order to develop the tooling and procedures for assembly and testing of these modules, two series of major prototyping programs were conducted: an early program using readout chips designed using a 250 nm fabrication process (ABCN-25) and a subsequent program using a follow-up chip set made using 130 nm processing (ABC130 and HCC130 chips). This second generation of readout chips was used for an extensive prototyping program that produced around 100 barrel-type modules and contributed significantly to the development of the final module layout. This paper gives an overview of the components used in ABC130 barrel modules, their assembly procedure and findings resulting from their tests.
Submitted 7 September, 2020;
originally announced September 2020.