-
A Conic Transformation Approach for Solving the Perspective-Three-Point Problem
Authors:
Haidong Wu,
Snehal Bhayani,
Janne Heikkilä
Abstract:
We propose a conic transformation method to solve the Perspective-Three-Point (P3P) problem. In contrast to the current state-of-the-art solvers, which formulate the P3P problem by intersecting two conics and constructing a degenerate conic to find the intersection, our approach builds upon a new formulation based on a transformation that maps the two conics to a new coordinate system, where one o…
▽ More
We propose a conic transformation method to solve the Perspective-Three-Point (P3P) problem. In contrast to the current state-of-the-art solvers, which formulate the P3P problem by intersecting two conics and constructing a degenerate conic to find the intersection, our approach builds upon a new formulation based on a transformation that maps the two conics to a new coordinate system, where one of the conics becomes a standard parabola in a canonical form. This enables expressing one variable in terms of the other variable, and as a consequence, substantially simplifies the problem of finding the conic intersection. Moreover, the polynomial coefficients are fast to compute, and we only need to determine the real-valued intersection points, which avoids the requirement of using computationally expensive complex arithmetic. While the current state-of-the-art methods reduce the conic intersection problem to solving a univariate cubic equation, our approach, despite resulting in a quartic equation, is still faster thanks to this new simplified formulation. Extensive evaluations demonstrate that our method achieves higher speed while maintaining robustness and stability comparable to state-of-the-art methods.
△ Less
Submitted 2 April, 2025;
originally announced April 2025.
-
A Causal Adjustment Module for Debiasing Scene Graph Generation
Authors:
Li Liu,
Shuzhou Sun,
Shuaifeng Zhi,
Fan Shi,
Zhen Liu,
Janne Heikkilä,
Yongxiang Liu
Abstract:
While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distr…
▽ More
While recent debiasing methods for Scene Graph Generation (SGG) have shown impressive performance, these efforts often attribute model bias solely to the long-tail distribution of relationships, overlooking the more profound causes stemming from skewed object and object pair distributions. In this paper, we employ causal inference techniques to model the causality among these observed skewed distributions. Our insight lies in the ability of causal inference to capture the unobservable causal effects between complex distributions, which is crucial for tracing the roots of model bias. Specifically, we introduce the Mediator-based Causal Chain Model (MCCM), which, in addition to modeling causality among objects, object pairs, and relationships, incorporates mediator variables, i.e., cooccurrence distribution, for complementing the causality. Following this, we propose the Causal Adjustment Module (CAModule) to estimate the modeled causal structure, using variables from MCCM as inputs to produce a set of adjustment factors aimed at correcting biased model predictions. Moreover, our method enables the composition of zero-shot relationships, thereby enhancing the model's ability to recognize such relationships. Experiments conducted across various SGG backbones and popular benchmarks demonstrate that CAModule achieves state-of-the-art mean recall rates, with significant improvements also observed on the challenging zero-shot recall rate metric.
△ Less
Submitted 22 March, 2025;
originally announced March 2025.
-
International AI Safety Report
Authors:
Yoshua Bengio,
Sören Mindermann,
Daniel Privitera,
Tamay Besiroglu,
Rishi Bommasani,
Stephen Casper,
Yejin Choi,
Philip Fox,
Ben Garfinkel,
Danielle Goldfarb,
Hoda Heidari,
Anson Ho,
Sayash Kapoor,
Leila Khalatbari,
Shayne Longpre,
Sam Manning,
Vasilios Mavroudis,
Mantas Mazeika,
Julian Michael,
Jessica Newman,
Kwan Yee Ng,
Chinasa T. Okolo,
Deborah Raji,
Girish Sastry,
Elizabeth Seger
, et al. (71 additional authors not shown)
Abstract:
The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. A total of 100 AI experts contributed, repr…
▽ More
The first International AI Safety Report comprehensively synthesizes the current evidence on the capabilities, risks, and safety of advanced AI systems. The report was mandated by the nations attending the AI Safety Summit in Bletchley, UK. Thirty nations, the UN, the OECD, and the EU each nominated a representative to the report's Expert Advisory Panel. A total of 100 AI experts contributed, representing diverse perspectives and disciplines. Led by the report's Chair, these independent experts collectively had full discretion over the report's content.
△ Less
Submitted 29 January, 2025;
originally announced January 2025.
-
Uncovering Bias in Foundation Models: Impact, Testing, Harm, and Mitigation
Authors:
Shuzhou Sun,
Li Liu,
Yongxiang Liu,
Zhen Liu,
Shuanghui Zhang,
Janne Heikkilä,
Xiang Li
Abstract:
Bias in Foundation Models (FMs) - trained on vast datasets spanning societal and historical knowledge - poses significant challenges for fairness and equity across fields such as healthcare, education, and finance. These biases, rooted in the overrepresentation of stereotypes and societal inequalities in training data, exacerbate real-world discrimination, reinforce harmful stereotypes, and erode…
▽ More
Bias in Foundation Models (FMs) - trained on vast datasets spanning societal and historical knowledge - poses significant challenges for fairness and equity across fields such as healthcare, education, and finance. These biases, rooted in the overrepresentation of stereotypes and societal inequalities in training data, exacerbate real-world discrimination, reinforce harmful stereotypes, and erode trust in AI systems. To address this, we introduce Trident Probe Testing (TriProTesting), a systematic testing method that detects explicit and implicit biases using semantically designed probes. Here we show that FMs, including CLIP, ALIGN, BridgeTower, and OWLv2, demonstrate pervasive biases across single and mixed social attributes (gender, race, age, and occupation). Notably, we uncover mixed biases when social attributes are combined, such as gender x race, gender x age, and gender x occupation, revealing deeper layers of discrimination. We further propose Adaptive Logit Adjustment (AdaLogAdjustment), a post-processing technique that dynamically redistributes probability power to mitigate these biases effectively, achieving significant improvements in fairness without retraining models. These findings highlight the urgent need for ethical AI practices and interdisciplinary solutions to address biases not only at the model level but also in societal structures. Our work provides a scalable and interpretable solution that advances fairness in AI systems while offering practical insights for future research on fair AI technologies.
△ Less
Submitted 14 January, 2025;
originally announced January 2025.
-
An End-to-End Depth-Based Pipeline for Selfie Image Rectification
Authors:
Ahmed Alhawwary,
Phong Nguyen-Ha,
Janne Mustaniemi,
Janne Heikkilä
Abstract:
Portraits or selfie images taken from a close distance typically suffer from perspective distortion. In this paper, we propose an end-to-end deep learning-based rectification pipeline to mitigate the effects of perspective distortion. We learn to predict the facial depth by training a deep CNN. The estimated depth is utilized to adjust the camera-to-subject distance by moving the camera farther, i…
▽ More
Portraits or selfie images taken from a close distance typically suffer from perspective distortion. In this paper, we propose an end-to-end deep learning-based rectification pipeline to mitigate the effects of perspective distortion. We learn to predict the facial depth by training a deep CNN. The estimated depth is utilized to adjust the camera-to-subject distance by moving the camera farther, increasing the camera focal length, and reprojecting the 3D image features to the new perspective. The reprojected features are then fed to an inpainting module to fill in the missing pixels. We leverage a differentiable renderer to enable end-to-end training of our depth estimation and feature extraction nets to improve the rectified outputs. To boost the results of the inpainting module, we incorporate an auxiliary module to predict the horizontal movement of the camera which decreases the area that requires hallucination of challenging face parts such as ears. Unlike previous works, we process the full-frame input image at once without cropping the subject's face and processing it separately from the rest of the body, eliminating the need for complex post-processing steps to attach the face back to the subject's body. To train our network, we utilize the popular game engine Unreal Engine to generate a large synthetic face dataset containing various subjects, head poses, expressions, eyewear, clothes, and lighting. Quantitative and qualitative results show that our rectification pipeline outperforms previous methods, and produces comparable results with a time-consuming 3D GAN-based method while being more than 260 times faster.
△ Less
Submitted 26 December, 2024;
originally announced December 2024.
-
GS-Pose: Generalizable Segmentation-based 6D Object Pose Estimation with 3D Gaussian Splatting
Authors:
Dingding Cai,
Janne Heikkilä,
Esa Rahtu
Abstract:
This paper introduces GS-Pose, a unified framework for localizing and estimating the 6D pose of novel objects. GS-Pose begins with a set of posed RGB images of a previously unseen object and builds three distinct representations stored in a database. At inference, GS-Pose operates sequentially by locating the object in the input image, estimating its initial 6D pose using a retrieval approach, and…
▽ More
This paper introduces GS-Pose, a unified framework for localizing and estimating the 6D pose of novel objects. GS-Pose begins with a set of posed RGB images of a previously unseen object and builds three distinct representations stored in a database. At inference, GS-Pose operates sequentially by locating the object in the input image, estimating its initial 6D pose using a retrieval approach, and refining the pose with a render-and-compare method. The key insight is the application of the appropriate object representation at each stage of the process. In particular, for the refinement step, we leverage 3D Gaussian splatting, a novel differentiable rendering technique that offers high rendering speed and relatively low optimization time. Off-the-shelf toolchains and commodity hardware, such as mobile phones, can be used to capture new objects to be added to the database. Extensive evaluations on the LINEMOD and OnePose-LowTexture datasets demonstrate excellent performance, establishing the new state-of-the-art. Project page: https://dingdingcai.github.io/gs-pose.
△ Less
Submitted 14 August, 2024; v1 submitted 15 March, 2024;
originally announced March 2024.
-
A Novel Application of Polynomial Solvers in mmWave Analog Radio Beamforming
Authors:
Snehal Bhayani,
Praneeth Susarla,
S. S. Krishna Chaitanya Bulusu,
Olli Silven,
Markku Juntti,
Janne Heikkila
Abstract:
Beamforming is a signal processing technique where an array of antenna elements can be steered to transmit and receive radio signals in a specific direction. The usage of millimeter wave (mmWave) frequencies and multiple input multiple output (MIMO) beamforming are considered as the key innovations of 5th Generation (5G) and beyond communication systems. The technique initially performs a beam ali…
▽ More
Beamforming is a signal processing technique where an array of antenna elements can be steered to transmit and receive radio signals in a specific direction. The usage of millimeter wave (mmWave) frequencies and multiple input multiple output (MIMO) beamforming are considered as the key innovations of 5th Generation (5G) and beyond communication systems. The technique initially performs a beam alignment procedure, followed by data transfer in the aligned directions between the transmitter and the receiver. Traditionally, beam alignment involves periodical and exhaustive beam sweeping at both transmitter and the receiver, which is a slow process causing extra communication overhead with MIMO and massive MIMO radio units. In applications such as beam tracking, angular velocity, beam steering etc., the beam alignment procedure is optimized by estimating the beam directions using first order polynomial approximations. Recent learning-based SOTA strategies for fast mmWave beam alignment also require exploration over exhaustive beam pairs during the training procedure, causing overhead to learning strategies for higher antenna configurations. In this work, we first optimize the beam alignment cost functions e.g. the data rate, to reduce the beam sweeping overhead by applying polynomial approximations of its partial derivatives which can then be solved as a system of polynomial equations using well-known tools from algebraic geometry. At this point, a question arises: 'what is a good polynomial approximation?' In this work, we attempt to obtain a 'good polynomial approximation'. Preliminary experiments indicate that our estimated polynomial approximations attain a so-called sweet-spot in terms of the solver speed and accuracy, when evaluated on test beamforming problems.
△ Less
Submitted 27 October, 2023;
originally announced October 2023.
-
Unbiased Scene Graph Generation via Two-stage Causal Modeling
Authors:
Shuzhou Sun,
Shuaifeng Zhi,
Qing Liao,
Janne Heikkilä,
Li Liu
Abstract:
Despite the impressive performance of recent unbiased Scene Graph Generation (SGG) methods, the current debiasing literature mainly focuses on the long-tailed distribution problem, whereas it overlooks another source of bias, i.e., semantic confusion, which makes the SGG model prone to yield false predictions for similar relationships. In this paper, we explore a debiasing procedure for the SGG ta…
▽ More
Despite the impressive performance of recent unbiased Scene Graph Generation (SGG) methods, the current debiasing literature mainly focuses on the long-tailed distribution problem, whereas it overlooks another source of bias, i.e., semantic confusion, which makes the SGG model prone to yield false predictions for similar relationships. In this paper, we explore a debiasing procedure for the SGG task leveraging causal inference. Our central insight is that the Sparse Mechanism Shift (SMS) in causality allows independent intervention on multiple biases, thereby potentially preserving head category performance while pursuing the prediction of high-informative tail relationships. However, the noisy datasets lead to unobserved confounders for the SGG task, and thus the constructed causal models are always causal-insufficient to benefit from SMS. To remedy this, we propose Two-stage Causal Modeling (TsCM) for the SGG task, which takes the long-tailed distribution and semantic confusion as confounders to the Structural Causal Model (SCM) and then decouples the causal intervention into two stages. The first stage is causal representation learning, where we use a novel Population Loss (P-Loss) to intervene in the semantic confusion confounder. The second stage introduces the Adaptive Logit Adjustment (AL-Adjustment) to eliminate the long-tailed distribution confounder to complete causal calibration learning. These two stages are model agnostic and thus can be used in any SGG model that seeks unbiased predictions. Comprehensive experiments conducted on the popular SGG backbones and benchmarks show that our TsCM can achieve state-of-the-art performance in terms of mean recall rate. Furthermore, TsCM can maintain a higher recall rate than other debiasing methods, which indicates that our method can achieve a better tradeoff between head and tail relationships.
△ Less
Submitted 11 July, 2023;
originally announced July 2023.
-
Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation
Authors:
Mayu Otani,
Riku Togashi,
Yu Sawai,
Ryosuke Ishigami,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä,
Shin'ichi Satoh
Abstract:
Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standard…
▽ More
Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations.
△ Less
Submitted 4 April, 2023;
originally announced April 2023.
-
MSDA: Monocular Self-supervised Domain Adaptation for 6D Object Pose Estimation
Authors:
Dingding Cai,
Janne Heikkilä,
Esa Rahtu
Abstract:
Acquiring labeled 6D poses from real images is an expensive and time-consuming task. Though massive amounts of synthetic RGB images are easy to obtain, the models trained on them suffer from noticeable performance degradation due to the synthetic-to-real domain gap. To mitigate this degradation, we propose a practical self-supervised domain adaptation approach that takes advantage of real RGB(-D)…
▽ More
Acquiring labeled 6D poses from real images is an expensive and time-consuming task. Though massive amounts of synthetic RGB images are easy to obtain, the models trained on them suffer from noticeable performance degradation due to the synthetic-to-real domain gap. To mitigate this degradation, we propose a practical self-supervised domain adaptation approach that takes advantage of real RGB(-D) data without needing real pose labels. We first pre-train the model with synthetic RGB images and then utilize real RGB(-D) images to fine-tune the pre-trained model. The fine-tuning process is self-supervised by the RGB-based pose-aware consistency and the depth-guided object distance pseudo-label, which does not require the time-consuming online differentiable rendering. We build our domain adaptation method based on the recent pose estimator SC6D and evaluate it on the YCB-Video dataset. We experimentally demonstrate that our method achieves comparable performance against its fully-supervised counterpart while outperforming existing state-of-the-art approaches.
△ Less
Submitted 14 February, 2023;
originally announced February 2023.
-
Sparse resultant based minimal solvers in computer vision and their connection with the action matrix
Authors:
Snehal Bhayani,
Janne Heikkilä,
Zuzana Kukelova
Abstract:
Many computer vision applications require robust and efficient estimation of camera geometry from a minimal number of input data measurements, i.e., solving minimal problems in a RANSAC framework. Minimal problems are usually formulated as complex systems of sparse polynomials. The systems usually are overdetermined and consist of polynomials with algebraically constrained coefficients. Most state…
▽ More
Many computer vision applications require robust and efficient estimation of camera geometry from a minimal number of input data measurements, i.e., solving minimal problems in a RANSAC framework. Minimal problems are usually formulated as complex systems of sparse polynomials. The systems usually are overdetermined and consist of polynomials with algebraically constrained coefficients. Most state-of-the-art efficient polynomial solvers are based on the action matrix method that has been automated and highly optimized in recent years. On the other hand, the alternative theory of sparse resultants and Newton polytopes has been less successful for generating efficient solvers, primarily because the polytopes do not respect the constraints on the coefficients. Therefore, in this paper, we propose a simple iterative scheme to test various subsets of the Newton polytopes and search for the most efficient solver. Moreover, we propose to use an extra polynomial with a special form to further improve the solver efficiency via a Schur complement computation. We show that for some camera geometry problems our extra polynomial-based method leads to smaller and more stable solvers than the state-of-the-art Grobner basis-based solvers. The proposed method can be fully automated and incorporated into existing tools for automatic generation of efficient polynomial solvers. It provides a competitive alternative to popular Grobner basis-based methods for minimal problems in computer vision. We also study the conditions under which the minimal solvers generated by the state-of-the-art action matrix-based methods and the proposed extra polynomial resultant-based method, are equivalent. Specifically we consider a step-by-step comparison between the approaches based on the action matrix and the sparse resultant, followed by a set of substitutions, which would lead to equivalent minimal solvers.
△ Less
Submitted 1 September, 2023; v1 submitted 16 January, 2023;
originally announced January 2023.
-
BS3D: Building-scale 3D Reconstruction from RGB-D Images
Authors:
Janne Mustaniemi,
Juho Kannala,
Esa Rahtu,
Li Liu,
Janne Heikkilä
Abstract:
Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstruction using a consumer depth camera. Unlike complex and expensive ac…
▽ More
Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstruction using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.
△ Less
Submitted 3 January, 2023;
originally announced January 2023.
-
Partially calibrated semi-generalized pose from hybrid point correspondences
Authors:
Snehal Bhayani,
Viktor Larsson,
Torsten Sattler,
Janne Heikkila,
Zuzana Kukelova
Abstract:
In this paper we study the problem of estimating the semi-generalized pose of a partially calibrated camera, i.e., the pose of a perspective camera with unknown focal length w.r.t. a generalized camera, from a hybrid set of 2D-2D and 2D-3D point correspondences. We study all possible camera configurations within the generalized camera system. To derive practical solvers to previously unsolved chal…
▽ More
In this paper we study the problem of estimating the semi-generalized pose of a partially calibrated camera, i.e., the pose of a perspective camera with unknown focal length w.r.t. a generalized camera, from a hybrid set of 2D-2D and 2D-3D point correspondences. We study all possible camera configurations within the generalized camera system. To derive practical solvers to previously unsolved challenging configurations, we test different parameterizations as well as different solving strategies based on the state-of-the-art methods for generating efficient polynomial solvers. We evaluate the three most promising solvers, i.e., the H51f solver with five 2D-2D correspondences and one 2D-3D correspondence viewed by the same camera inside generalized camera, the H32f solver with three 2D-2D and two 2D-3D correspondences, and the H13f solver with one 2D-2D and three 2D-3D correspondences, on synthetic and real data. We show that in the presence of noise in the 3D points these solvers provide better estimates than the corresponding absolute pose solvers.
△ Less
Submitted 29 September, 2022;
originally announced September 2022.
-
Cascaded and Generalizable Neural Radiance Fields for Fast View Synthesis
Authors:
Phong Nguyen-Ha,
Lam Huynh,
Esa Rahtu,
Jiri Matas,
Janne Heikkila
Abstract:
We present CG-NeRF, a cascade and generalizable neural radiance fields method for view synthesis. Recent generalizing view synthesis methods can render high-quality novel views using a set of nearby input views. However, the rendering speed is still slow due to the nature of uniformly-point sampling of neural radiance fields. Existing scene-specific methods can train and render novel views efficie…
▽ More
We present CG-NeRF, a cascade and generalizable neural radiance fields method for view synthesis. Recent generalizing view synthesis methods can render high-quality novel views using a set of nearby input views. However, the rendering speed is still slow due to the nature of uniformly-point sampling of neural radiance fields. Existing scene-specific methods can train and render novel views efficiently but can not generalize to unseen data. Our approach addresses the problems of fast and generalizing view synthesis by proposing two novel modules: a coarse radiance fields predictor and a convolutional-based neural renderer. This architecture infers consistent scene geometry based on the implicit neural fields and renders new views efficiently using a single GPU. We first train CG-NeRF on multiple 3D scenes of the DTU dataset, and the network can produce high-quality and accurate novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations and still maintains the high-speed rendering of the pre-trained model. Experimental results show that CG-NeRF outperforms state-of-the-art generalizable neural rendering methods on various synthetic and real datasets.
△ Less
Submitted 19 November, 2023; v1 submitted 9 August, 2022;
originally announced August 2022.
-
SC6D: Symmetry-agnostic and Correspondence-free 6D Object Pose Estimation
Authors:
Dingding Cai,
Janne Heikkilä,
Esa Rahtu
Abstract:
This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither the 3D CAD model of the object nor any prior knowledge of the symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of…
▽ More
This paper presents an efficient symmetry-agnostic and correspondence-free framework, referred to as SC6D, for 6D object pose estimation from a single monocular RGB image. SC6D requires neither the 3D CAD model of the object nor any prior knowledge of the symmetries. The pose estimation is decomposed into three sub-tasks: a) object 3D rotation representation learning and matching; b) estimation of the 2D location of the object center; and c) scale-invariant distance estimation (the translation along the z-axis) via classification. SC6D is evaluated on three benchmark datasets, T-LESS, YCB-V, and ITODD, and results in state-of-the-art performance on the T-LESS dataset. Moreover, SC6D is computationally much more efficient than the previous state-of-the-art method SurfEmb. The implementation and pre-trained models are publicly available at https://github.com/dingdingcai/SC6D-pose.
△ Less
Submitted 18 September, 2022; v1 submitted 3 August, 2022;
originally announced August 2022.
-
AxIoU: An Axiomatically Justified Measure for Video Moment Retrieval
Authors:
Riku Togashi,
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkila,
Tetsuya Sakai
Abstract:
Evaluation measures have a crucial impact on the direction of research. Therefore, it is of utmost importance to develop appropriate and reliable evaluation measures for new applications where conventional measures are not well suited. Video Moment Retrieval (VMR) is one such application, and the current practice is to use R@$K,θ$ for evaluating VMR systems. However, this measure has two disadvant…
▽ More
Evaluation measures have a crucial impact on the direction of research. Therefore, it is of utmost importance to develop appropriate and reliable evaluation measures for new applications where conventional measures are not well suited. Video Moment Retrieval (VMR) is one such application, and the current practice is to use R@$K,θ$ for evaluating VMR systems. However, this measure has two disadvantages. First, it is rank-insensitive: It ignores the rank positions of successfully localised moments in the top-$K$ ranked list by treating the list as a set. Second, it binarizes the Intersection over Union (IoU) of each retrieved video moment using the threshold $θ$ and thereby ignoring fine-grained localisation quality of ranked moments.
We propose an alternative measure for evaluating VMR, called Average Max IoU (AxIoU), which is free from the above two problems. We show that AxIoU satisfies two important axioms for VMR evaluation, namely, \textbf{Invariance against Redundant Moments} and \textbf{Monotonicity with respect to the Best Moment}, and also that R@$K,θ$ satisfies the first axiom only. We also empirically examine how AxIoU agrees with R@$K,θ$, as well as its stability with respect to change in the test data and human-annotated temporal boundaries.
△ Less
Submitted 30 March, 2022;
originally announced March 2022.
-
Optimal Correction Cost for Object Detection Evaluation
Authors:
Mayu Otani,
Riku Togashi,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä,
Shin'ichi Satoh
Abstract:
Mean Average Precision (mAP) is the primary evaluation measure for object detection. Although object detection has a broad range of applications, mAP evaluates detectors in terms of the performance of ranked instance retrieval. Such the assumption for the evaluation task does not suit some downstream tasks. To alleviate the gap between downstream tasks and the evaluation scenario, we propose Optim…
▽ More
Mean Average Precision (mAP) is the primary evaluation measure for object detection. Although object detection has a broad range of applications, mAP evaluates detectors in terms of the performance of ranked instance retrieval. Such the assumption for the evaluation task does not suit some downstream tasks. To alleviate the gap between downstream tasks and the evaluation scenario, we propose Optimal Correction Cost (OC-cost), which assesses detection accuracy at image level. OC-cost computes the cost of correcting detections to ground truths as a measure of accuracy. The cost is obtained by solving an optimal transportation problem between the detections and the ground truths. Unlike mAP, OC-cost is designed to penalize false positive and false negative detections properly, and every image in a dataset is treated equally. Our experimental result validates that OC-cost has better agreement with human preference than a ranking-based measure, i.e., mAP for a single image. We also show that detectors' rankings by OC-cost are more consistent on different data splits than mAP. Our goal is not to replace mAP with OC-cost but provide an additional tool to evaluate detectors from another aspect. To help future researchers and developers choose a target measure, we provide a series of experiments to clarify how mAP and OC-cost differ.
△ Less
Submitted 27 March, 2022;
originally announced March 2022.
-
Fast Neural Architecture Search for Lightweight Dense Prediction Networks
Authors:
Lam Huynh,
Esa Rahtu,
Jiri Matas,
Janne Heikkila
Abstract:
We present LDP, a lightweight dense prediction neural architecture search (NAS) framework. Starting from a pre-defined generic backbone, LDP applies the novel Assisted Tabu Search for efficient architecture exploration. LDP is fast and suitable for various dense estimation problems, unlike previous NAS methods that are either computational demanding or deployed only for a single subtask. The perfo…
▽ More
We present LDP, a lightweight dense prediction neural architecture search (NAS) framework. Starting from a pre-defined generic backbone, LDP applies the novel Assisted Tabu Search for efficient architecture exploration. LDP is fast and suitable for various dense estimation problems, unlike previous NAS methods that are either computational demanding or deployed only for a single subtask. The performance of LPD is evaluated on monocular depth estimation, semantic segmentation, and image super-resolution tasks on diverse datasets, including NYU-Depth-v2, KITTI, Cityscapes, COCO-stuff, DIV2K, Set5, Set14, BSD100, Urban100. Experiments show that the proposed framework yields consistent improvements on all tested dense prediction tasks, while being $5\%-315\%$ more compact in terms of the number of model parameters than prior arts.
△ Less
Submitted 9 March, 2022; v1 submitted 3 March, 2022;
originally announced March 2022.
-
OVE6D: Object Viewpoint Encoding for Depth-based 6D Object Pose Estimation
Authors:
Dingding Cai,
Janne Heikkilä,
Esa Rahtu
Abstract:
This paper proposes a universal framework, called OVE6D, for model-based 6D object pose estimation from a single depth image and a target object mask. Our model is trained using purely synthetic data rendered from ShapeNet, and, unlike most of the existing methods, it generalizes well on new real-world objects without any fine-tuning. We achieve this by decomposing the 6D pose into viewpoint, in-p…
▽ More
This paper proposes a universal framework, called OVE6D, for model-based 6D object pose estimation from a single depth image and a target object mask. Our model is trained using purely synthetic data rendered from ShapeNet, and, unlike most of the existing methods, it generalizes well on new real-world objects without any fine-tuning. We achieve this by decomposing the 6D pose into viewpoint, in-plane rotation around the camera optical axis and translation, and introducing novel lightweight modules for estimating each component in a cascaded manner. The resulting network contains less than 4M parameters while demonstrating excellent performance on the challenging T-LESS and Occluded LINEMOD datasets without any dataset-specific training. We show that OVE6D outperforms some contemporary deep learning-based pose estimation methods specifically trained for individual objects or datasets with real-world training data.
The implementation and the pre-trained model will be made publicly available.
△ Less
Submitted 7 April, 2022; v1 submitted 2 March, 2022;
originally announced March 2022.
-
Free-Viewpoint RGB-D Human Performance Capture and Rendering
Authors:
Phong Nguyen-Ha,
Nikolaos Sarafianos,
Christoph Lassner,
Janne Heikkila,
Tony Tung
Abstract:
Capturing and faithfully rendering photo-realistic humans from novel views is a fundamental problem for AR/VR applications. While prior work has shown impressive performance capture results in laboratory settings, it is non-trivial to achieve casual free-viewpoint human capture and rendering for unseen identities with high fidelity, especially for facial expressions, hands, and clothes. To tackle…
▽ More
Capturing and faithfully rendering photo-realistic humans from novel views is a fundamental problem for AR/VR applications. While prior work has shown impressive performance capture results in laboratory settings, it is non-trivial to achieve casual free-viewpoint human capture and rendering for unseen identities with high fidelity, especially for facial expressions, hands, and clothes. To tackle these challenges we introduce a novel view synthesis framework that generates realistic renders from unseen views of any human captured from a single-view and sparse RGB-D sensor, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture to create dense feature maps in novel views obtained by sphere-based neural rendering, and create complete renders using a global context inpainting model. Additionally, an enhancer network leverages the overall fidelity, even in occluded areas from the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single-stream, sparse RGB-D input. It generalizes to unseen identities, and new poses and faithfully reconstructs facial expressions. Our approach outperforms prior view synthesis methods and is robust to different levels of depth sparsity.
△ Less
Submitted 2 August, 2022; v1 submitted 27 December, 2021;
originally announced December 2021.
-
Lightweight Monocular Depth with a Novel Neural Architecture Search Method
Authors:
Lam Huynh,
Phong Nguyen,
Jiri Matas,
Esa Rahtu,
Janne Heikkila
Abstract:
This paper presents a novel neural architecture search method, called LiDNAS, for generating lightweight monocular depth estimation models. Unlike previous neural architecture search (NAS) approaches, where finding optimized networks are computationally highly demanding, the introduced novel Assisted Tabu Search leads to efficient architecture exploration. Moreover, we construct the search space o…
▽ More
This paper presents a novel neural architecture search method, called LiDNAS, for generating lightweight monocular depth estimation models. Unlike previous neural architecture search (NAS) approaches, where finding optimized networks are computationally highly demanding, the introduced novel Assisted Tabu Search leads to efficient architecture exploration. Moreover, we construct the search space on a pre-defined backbone network to balance layer diversity and search space size. The LiDNAS method outperforms the state-of-the-art NAS approach, proposed for disparity and depth estimation, in terms of search efficiency and output model performance. The LiDNAS optimized models achieve results superior to compact depth estimation state-of-the-art on NYU-Depth-v2, KITTI, and ScanNet, while being 7%-500% more compact in size, i.e the number of model parameters.
△ Less
Submitted 25 August, 2021;
originally announced August 2021.
-
Monocular Depth Estimation Primed by Salient Point Detection and Normalized Hessian Loss
Authors:
Lam Huynh,
Matteo Pedone,
Phong Nguyen,
Jiri Matas,
Esa Rahtu,
Janne Heikkila
Abstract:
Deep neural networks have recently thrived on single image depth estimation. That being said, current developments on this topic highlight an apparent compromise between accuracy and network size. This work proposes an accurate and lightweight framework for monocular depth estimation based on a self-attention mechanism stemming from salient point detection. Specifically, we utilize a sparse set of…
▽ More
Deep neural networks have recently thrived on single image depth estimation. That being said, current developments on this topic highlight an apparent compromise between accuracy and network size. This work proposes an accurate and lightweight framework for monocular depth estimation based on a self-attention mechanism stemming from salient point detection. Specifically, we utilize a sparse set of keypoints to train a FuSaNet model that consists of two major components: Fusion-Net and Saliency-Net. In addition, we introduce a normalized Hessian loss term invariant to scaling and shear along the depth direction, which is shown to substantially improve the accuracy. The proposed method achieves state-of-the-art results on NYU-Depth-v2 and KITTI while using 3.1-38.4 times smaller model in terms of the number of parameters than baseline approaches. Experiments on the SUN-RGBD further demonstrate the generalizability of the proposed method.
△ Less
Submitted 25 August, 2021;
originally announced August 2021.
-
Calibrated and Partially Calibrated Semi-Generalized Homographies
Authors:
Snehal Bhayani,
Torsten Sattler,
Daniel Barath,
Patrik Beliansky,
Janne Heikkila,
Zuzana Kukelova
Abstract:
In this paper, we propose the first minimal solutions for estimating the semi-generalized homography given a perspective and a generalized camera. The proposed solvers use five 2D-2D image point correspondences induced by a scene plane. One of them assumes the perspective camera to be fully calibrated, while the other solver estimates the unknown focal length together with the absolute pose parame…
▽ More
In this paper, we propose the first minimal solutions for estimating the semi-generalized homography given a perspective and a generalized camera. The proposed solvers use five 2D-2D image point correspondences induced by a scene plane. One of them assumes the perspective camera to be fully calibrated, while the other solver estimates the unknown focal length together with the absolute pose parameters. This setup is particularly important in structure-from-motion and image-based localization pipelines, where a new camera is localized in each step with respect to a set of known cameras and 2D-3D correspondences might not be available. As a consequence of a clever parametrization and the elimination ideal method, our approach only needs to solve a univariate polynomial of degree five or three. The proposed solvers are stable and efficient as demonstrated by a number of synthetic and real-world experiments.
△ Less
Submitted 11 October, 2021; v1 submitted 11 March, 2021;
originally announced March 2021.
-
Boosting Monocular Depth Estimation with Lightweight 3D Point Fusion
Authors:
Lam Huynh,
Phong Nguyen-Ha,
Jiri Matas,
Esa Rahtu,
Janne Heikkila
Abstract:
In this paper, we propose enhancing monocular depth estimation by adding 3D points as depth guidance. Unlike existing depth completion methods, our approach performs well on extremely sparse and unevenly distributed point clouds, which makes it agnostic to the source of the 3D points. We achieve this by introducing a novel multi-scale 3D point fusion network that is both lightweight and efficient.…
▽ More
In this paper, we propose enhancing monocular depth estimation by adding 3D points as depth guidance. Unlike existing depth completion methods, our approach performs well on extremely sparse and unevenly distributed point clouds, which makes it agnostic to the source of the 3D points. We achieve this by introducing a novel multi-scale 3D point fusion network that is both lightweight and efficient. We demonstrate its versatility on two different depth estimation problems where the 3D points have been acquired with conventional structure-from-motion and LiDAR. In both cases, our network performs on par with state-of-the-art depth completion methods and achieves significantly higher accuracy when only a small number of points is used while being more compact in terms of the number of parameters. We show that our method outperforms some contemporary deep learning based multi-view stereo and structure-from-motion methods both in accuracy and in compactness.
△ Less
Submitted 25 August, 2021; v1 submitted 18 December, 2020;
originally announced December 2020.
-
RGBD-Net: Predicting color and depth images for novel views synthesis
Authors:
Phong Nguyen-Ha,
Animesh Karnewar,
Lam Huynh,
Esa Rahtu,
Jiri Matas,
Janne Heikkila
Abstract:
We propose a new cascaded architecture for novel view synthesis, called RGBD-Net, which consists of two core components: a hierarchical depth regression network and a depth-aware generator network. The former one predicts depth maps of the target views by using adaptive depth scaling, while the latter one leverages the predicted depths and renders spatially and temporally consistent target images.…
▽ More
We propose a new cascaded architecture for novel view synthesis, called RGBD-Net, which consists of two core components: a hierarchical depth regression network and a depth-aware generator network. The former one predicts depth maps of the target views by using adaptive depth scaling, while the latter one leverages the predicted depths and renders spatially and temporally consistent target images. In the experimental evaluation on standard datasets, RGBD-Net not only outperforms the state-of-the-art by a clear margin, but it also generalizes well to new scenes without per-scene optimization. Moreover, we show that RGBD-Net can be optionally trained without depth supervision while still retaining high-quality rendering. Thanks to the depth regression network, RGBD-Net can be also used for creating dense 3D point clouds that are more accurate than those produced by some state-of-the-art multi-view stereo methods.
△ Less
Submitted 9 July, 2021; v1 submitted 29 November, 2020;
originally announced November 2020.
-
Uncovering Hidden Challenges in Query-Based Video Moment Retrieval
Authors:
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä
Abstract:
The query-based moment retrieval is a problem of localising a specific clip from an untrimmed video according a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. Like in many other areas in computer vision and machine learning, the progress in query-based moment retrieval is heavily driven by the benchmark datasets and…
▽ More
The query-based moment retrieval is a problem of localising a specific clip from an untrimmed video according a query sentence. This is a challenging task that requires interpretation of both the natural language query and the video content. Like in many other areas in computer vision and machine learning, the progress in query-based moment retrieval is heavily driven by the benchmark datasets and, therefore, their quality has significant impact on the field. In this paper, we present a series of experiments assessing how well the benchmark results reflect the true progress in solving the moment retrieval task. Our results indicate substantial biases in the popular datasets and unexpected behaviour of the state-of-the-art models. Moreover, we present new sanity check experiments and approaches for visualising the results. Finally, we suggest possible directions to improve the temporal sentence grounding in the future. Our code for this paper is available at https://mayu-ot.github.io/hidden-challenges-MR .
△ Less
Submitted 7 October, 2020; v1 submitted 1 September, 2020;
originally announced September 2020.
-
Computing stable resultant-based minimal solvers by hiding a variable
Authors:
Snehal Bhayani,
Zuzana Kukelova,
Janne Heikkilä
Abstract:
Many computer vision applications require robust and efficient estimation of camera geometry. The robust estimation is usually based on solving camera geometry problems from a minimal number of input data measurements, i.e., solving minimal problems, in a RANSAC-style framework. Minimal problems often result in complex systems of polynomial equations. The existing state-of-the-art methods for solv…
▽ More
Many computer vision applications require robust and efficient estimation of camera geometry. The robust estimation is usually based on solving camera geometry problems from a minimal number of input data measurements, i.e., solving minimal problems, in a RANSAC-style framework. Minimal problems often result in complex systems of polynomial equations. The existing state-of-the-art methods for solving such systems are either based on Gröbner bases and the action matrix method, which have been extensively studied and optimized in the recent years or recently proposed approach based on a sparse resultant computation using an extra variable.
In this paper, we study an interesting alternative sparse resultant-based method for solving sparse systems of polynomial equations by hiding one variable. This approach results in a larger eigenvalue problem than the action matrix and extra variable sparse resultant-based methods; however, it does not need to compute an inverse or elimination of large matrices that may be numerically unstable. The proposed approach includes several improvements to the standard sparse resultant algorithms, which significantly improves the efficiency and stability of the hidden variable resultant-based solvers as we demonstrate on several interesting computer vision problems. We show that for the studied problems, our sparse resultant based approach leads to more stable solvers than the state-of-the-art Gröbner bases-based solvers as well as existing sparse resultant-based solvers, especially in close to critical configurations. Our new method can be fully automated and incorporated into existing tools for the automatic generation of efficient minimal solvers.
△ Less
Submitted 17 July, 2020;
originally announced July 2020.
-
Learning non-rigid surface reconstruction from spatio-temporal image patches
Authors:
Matteo Pedone,
Abdelrahman Mostafa,
Janne heikkilä
Abstract:
We present a method to reconstruct a dense spatio-temporal depth map of a non-rigidly deformable object directly from a video sequence. The estimation of depth is performed locally on spatio-temporal patches of the video, and then the full depth video of the entire shape is recovered by combining them together. Since the geometric complexity of a local spatio-temporal patch of a deforming non-rigi…
▽ More
We present a method to reconstruct a dense spatio-temporal depth map of a non-rigidly deformable object directly from a video sequence. The estimation of depth is performed locally on spatio-temporal patches of the video, and then the full depth video of the entire shape is recovered by combining them together. Since the geometric complexity of a local spatio-temporal patch of a deforming non-rigid object is often simple enough to be faithfully represented with a parametric model, we artificially generate a database of small deforming rectangular meshes rendered with different material properties and light conditions, along with their corresponding depth videos, and use such data to train a convolutional neural network. We tested our method on both synthetic and Kinect data and experimentally observed that the reconstruction error is significantly lower than the one obtained using other approaches like conventional non-rigid structure from motion.
△ Less
Submitted 18 June, 2020;
originally announced June 2020.
-
Sequential View Synthesis with Transformer
Authors:
Phong Nguyen-Ha,
Lam Huynh,
Esa Rahtu,
Janne Heikkila
Abstract:
This paper addresses the problem of novel view synthesis by means of neural rendering, where we are interested in predicting the novel view at an arbitrary camera pose based on a given set of input images from other viewpoints. Using the known query pose and input poses, we create an ordered set of observations that leads to the target view. Thus, the problem of single novel view synthesis is refo…
▽ More
This paper addresses the problem of novel view synthesis by means of neural rendering, where we are interested in predicting the novel view at an arbitrary camera pose based on a given set of input images from other viewpoints. Using the known query pose and input poses, we create an ordered set of observations that leads to the target view. Thus, the problem of single novel view synthesis is reformulated as a sequential view prediction task. In this paper, the proposed Transformer-based Generative Query Network (T-GQN) extends the neural-rendering methods by adding two new concepts. First, we use multi-view attention learning between context images to obtain multiple implicit scene representations. Second, we introduce a sequential rendering decoder to predict an image sequence, including the target view, based on the learned representations. Finally, we evaluate our model on various challenging datasets and demonstrate that our model not only gives consistent predictions but also doesn't require any retraining for finetuning.
△ Less
Submitted 22 September, 2020; v1 submitted 9 April, 2020;
originally announced April 2020.
-
Guiding Monocular Depth Estimation Using Depth-Attention Volume
Authors:
Lam Huynh,
Phong Nguyen-Ha,
Jiri Matas,
Esa Rahtu,
Janne Heikkila
Abstract:
Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations. In recent works, those priors have been learned in an end-to-end manner from large datasets by using deep neural networks. In this paper, we propose guiding depth estimation to favor planar structures that a…
▽ More
Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations. In recent works, those priors have been learned in an end-to-end manner from large datasets by using deep neural networks. In this paper, we propose guiding depth estimation to favor planar structures that are ubiquitous especially in indoor environments. This is achieved by incorporating a non-local coplanarity constraint to the network with a novel attention mechanism called depth-attention volume (DAV). Experiments on two popular indoor datasets, namely NYU-Depth-v2 and ScanNet, show that our method achieves state-of-the-art depth estimation results while using only a fraction of the number of parameters needed by the competing methods.
△ Less
Submitted 16 August, 2020; v1 submitted 6 April, 2020;
originally announced April 2020.
-
A sparse resultant based method for efficient minimal solvers
Authors:
Snehal Bhayani,
Zuzana Kukelova,
Janne Heikkilä
Abstract:
Many computer vision applications require robust and efficient estimation of camera geometry. The robust estimation is usually based on solving camera geometry problems from a minimal number of input data measurements, i.e. solving minimal problems in a RANSAC framework. Minimal problems often result in complex systems of polynomial equations. Many state-of-the-art efficient polynomial solvers to…
▽ More
Many computer vision applications require robust and efficient estimation of camera geometry. The robust estimation is usually based on solving camera geometry problems from a minimal number of input data measurements, i.e. solving minimal problems in a RANSAC framework. Minimal problems often result in complex systems of polynomial equations. Many state-of-the-art efficient polynomial solvers to these problems are based on Gröbner bases and the action-matrix method that has been automatized and highly optimized in recent years. In this paper we study an alternative algebraic method for solving systems of polynomial equations, i.e., the sparse resultant-based method and propose a novel approach to convert the resultant constraint to an eigenvalue problem. This technique can significantly improve the efficiency and stability of existing resultant-based solvers. We applied our new resultant-based method to a large variety of computer vision problems and show that for most of the considered problems, the new method leads to solvers that are the same size as the the best available Gröbner basis solvers and of similar accuracy. For some problems the new sparse-resultant based method leads to even smaller and more stable solvers than the state-of-the-art Gröbner basis solvers. Our new method can be fully automatized and incorporated into existing tools for automatic generation of efficient polynomial solvers and as such it represents a competitive alternative to popular Gröbner basis methods for minimal problems in computer vision.
△ Less
Submitted 21 December, 2019;
originally announced December 2019.
-
Improving land cover segmentation across satellites using domain adaptation
Authors:
Nadir Bengana,
Janne Heikkilä
Abstract:
Land use and land cover mapping are essential to various fields of study, including forestry, agriculture, and urban management. Using earth observation satellites both facilitate and accelerate the task. Lately, deep learning methods have proven to be excellent at automating the mapping via semantic image segmentation. However, because deep neural networks require large amounts of labeled data, i…
▽ More
Land use and land cover mapping are essential to various fields of study, including forestry, agriculture, and urban management. Using earth observation satellites both facilitate and accelerate the task. Lately, deep learning methods have proven to be excellent at automating the mapping via semantic image segmentation. However, because deep neural networks require large amounts of labeled data, it is not easy to exploit the full potential of satellite imagery. Additionally, the land cover tends to differ in appearance from one region to another; therefore, having labeled data from one location does not necessarily help in mapping others. Furthermore, satellite images come in various multispectral bands (the bands could range from RGB to over twelve bands). In this paper, we aim at using domain adaptation to solve the aforementioned problems. We applied a well-performing domain adaptation approach on datasets we have built using RGB images from Sentinel-2, WorldView-2, and Pleiades-1 satellites with Corine Land Cover as ground-truth labels. We have also used the DeepGlobe land cover dataset. Experiments show a significant improvement over results obtained without the use of domain adaptation. In some cases, an improvement of over 20% MIoU. At times it even manages to correct errors in the ground-truth labels.
△ Less
Submitted 1 April, 2020; v1 submitted 25 November, 2019;
originally announced December 2019.
-
Predicting Novel Views Using Generative Adversarial Query Network
Authors:
Phong Nguyen-Ha,
Lam Huynh,
Esa Rahtu,
Janne Heikkila
Abstract:
The problem of predicting a novel view of the scene using an arbitrary number of observations is a challenging problem for computers as well as for humans. This paper introduces the Generative Adversarial Query Network (GAQN), a general learning framework for novel view synthesis that combines Generative Query Network (GQN) and Generative Adversarial Networks (GANs). The conventional GQN encodes i…
▽ More
The problem of predicting a novel view of the scene using an arbitrary number of observations is a challenging problem for computers as well as for humans. This paper introduces the Generative Adversarial Query Network (GAQN), a general learning framework for novel view synthesis that combines Generative Query Network (GQN) and Generative Adversarial Networks (GANs). The conventional GQN encodes input views into a latent representation that is used to generate a new view through a recurrent variational decoder. The proposed GAQN builds on this work by adding two novel aspects: First, we extend the current GQN architecture with an adversarial loss function for improving the visual quality and convergence speed. Second, we introduce a feature-matching loss function for stabilizing the training procedure. The experiments demonstrate that GAQN is able to produce high-quality results and faster convergence compared to the conventional approach.
△ Less
Submitted 10 April, 2019;
originally announced April 2019.
-
Rethinking the Evaluation of Video Summaries
Authors:
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä
Abstract:
Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There exists a substantial interest in automatizing this process due to the rapid growth of the available material. The recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently the established evaluation pro…
▽ More
Video summarization is a technique to create a short skim of the original video while preserving the main stories/content. There exists a substantial interest in automatizing this process due to the rapid growth of the available material. The recent progress has been facilitated by public benchmark datasets, which enable easy and fair comparison of methods. Currently the established evaluation protocol is to compare the generated summary with respect to a set of reference summaries provided by the dataset. In this paper, we will provide in-depth assessment of this pipeline using two popular benchmark datasets. Surprisingly, we observe that randomly generated summaries achieve comparable or better performance to the state-of-the-art. In some cases, the random summaries outperform even the human generated summaries in leave-one-out experiments. Moreover, it turns out that the video segmentation, which is often considered as a fixed pre-processing method, has the most significant impact on the performance measure. Based on our observations, we propose alternative approaches for assessing the importance scores as well as an intuitive visualization of correlation between the estimated scoring and human annotations.
△ Less
Submitted 11 April, 2019; v1 submitted 27 March, 2019;
originally announced March 2019.
-
An efficient solution for semantic segmentation: ShuffleNet V2 with atrous separable convolutions
Authors:
Sercan Türkmen,
Janne Heikkilä
Abstract:
Assigning a label to each pixel in an image, namely semantic segmentation, has been an important task in computer vision, and has applications in autonomous driving, robotic navigation, localization, and scene understanding. Fully convolutional neural networks have proved to be a successful solution for the task over the years but most of the work being done focuses primarily on accuracy. In this…
▽ More
Assigning a label to each pixel in an image, namely semantic segmentation, has been an important task in computer vision, and has applications in autonomous driving, robotic navigation, localization, and scene understanding. Fully convolutional neural networks have proved to be a successful solution for the task over the years but most of the work being done focuses primarily on accuracy. In this paper, we present a computationally efficient approach to semantic segmentation, while achieving a high mean intersection over union (mIOU), 70.33% on Cityscapes challenge. The network proposed is capable of running real-time on mobile devices. In addition, we make our code and model weights publicly available.
△ Less
Submitted 3 April, 2019; v1 submitted 20 February, 2019;
originally announced February 2019.
-
LSD$_2$ -- Joint Denoising and Deblurring of Short and Long Exposure Images with CNNs
Authors:
Janne Mustaniemi,
Juho Kannala,
Jiri Matas,
Simo Särkkä,
Janne Heikkilä
Abstract:
The paper addresses the problem of acquiring high-quality photographs with handheld smartphone cameras in low-light imaging conditions. We propose an approach based on capturing pairs of short and long exposure images in rapid succession and fusing them into a single high-quality photograph. Unlike existing methods, we take advantage of both images simultaneously and perform a joint denoising and…
▽ More
The paper addresses the problem of acquiring high-quality photographs with handheld smartphone cameras in low-light imaging conditions. We propose an approach based on capturing pairs of short and long exposure images in rapid succession and fusing them into a single high-quality photograph. Unlike existing methods, we take advantage of both images simultaneously and perform a joint denoising and deblurring using a convolutional neural network. A novel approach is introduced to generate realistic short-long exposure image pairs. The method produces good images in extremely challenging conditions and outperforms existing denoising and deblurring methods. It also enables exposure fusion in the presence of motion blur.
△ Less
Submitted 1 September, 2020; v1 submitted 23 November, 2018;
originally announced November 2018.
-
Gyroscope-Aided Motion Deblurring with Deep Networks
Authors:
Janne Mustaniemi,
Juho Kannala,
Simo Särkkä,
Jiri Matas,
Janne Heikkilä
Abstract:
We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic trainin…
▽ More
We propose a deblurring method that incorporates gyroscope measurements into a convolutional neural network (CNN). With the help of such measurements, it can handle extremely strong and spatially-variant motion blur. At the same time, the image data is used to overcome the limitations of gyro-based blur estimation. To train our network, we also introduce a novel way of generating realistic training data using the gyroscope. The evaluation shows a clear improvement in visual quality over the state-of-the-art while achieving real-time performance. Furthermore, the method is shown to improve the performance of existing feature detectors and descriptors against the motion blur.
△ Less
Submitted 23 November, 2018; v1 submitted 1 October, 2018;
originally announced October 2018.
-
Leveraging Unlabeled Whole-Slide-Images for Mitosis Detection
Authors:
Saad Ullah Akram,
Talha Qaiser,
Simon Graham,
Juho Kannala,
Janne Heikkilä,
Nasir Rajpoot
Abstract:
Mitosis count is an important biomarker for prognosis of various cancers. At present, pathologists typically perform manual counting on a few selected regions of interest in breast whole-slide-images (WSIs) of patient biopsies. This task is very time-consuming, tedious and subjective. Automated mitosis detection methods have made great advances in recent years. However, these methods require exhau…
▽ More
Mitosis count is an important biomarker for prognosis of various cancers. At present, pathologists typically perform manual counting on a few selected regions of interest in breast whole-slide-images (WSIs) of patient biopsies. This task is very time-consuming, tedious and subjective. Automated mitosis detection methods have made great advances in recent years. However, these methods require exhaustive labeling of a large number of selected regions of interest. This task is very expensive because expert pathologists are needed for reliable and accurate annotations. In this paper, we present a semi-supervised mitosis detection method which is designed to leverage a large number of unlabeled breast cancer WSIs. As a result, our method capitalizes on the growing number of digitized histology images, without relying on exhaustive annotations, subsequently improving mitosis detection. Our method first learns a mitosis detector from labeled data, uses this detector to mine additional mitosis samples from unlabeled WSIs, and then trains the final model using this larger and diverse set of mitosis samples. The use of unlabeled data improves F1-score by $\sim$5\% compared to our best performing fully-supervised model on the TUPAC validation set. Our submission (single model) to TUPAC challenge ranks highly on the leaderboard with an F1-score of 0.64.
△ Less
Submitted 31 July, 2018;
originally announced July 2018.
-
Fast Motion Deblurring for Feature Detection and Matching Using Inertial Measurements
Authors:
Janne Mustaniemi,
Juho Kannala,
Simo Särkkä,
Jiri Matas,
Janne Heikkilä
Abstract:
Many computer vision and image processing applications rely on local features. It is well-known that motion blur decreases the performance of traditional feature detectors and descriptors. We propose an inertial-based deblurring method for improving the robustness of existing feature detectors and descriptors against the motion blur. Unlike most deblurring algorithms, the method can handle spatial…
▽ More
Many computer vision and image processing applications rely on local features. It is well-known that motion blur decreases the performance of traditional feature detectors and descriptors. We propose an inertial-based deblurring method for improving the robustness of existing feature detectors and descriptors against the motion blur. Unlike most deblurring algorithms, the method can handle spatially-variant blur and rolling shutter distortion. Furthermore, it is capable of running in real-time contrary to state-of-the-art algorithms. The limitations of inertial-based blur estimation are taken into account by validating the blur estimates using image data. The evaluation shows that when the method is used with traditional feature detector and descriptor, it increases the number of detected keypoints, provides higher repeatability and improves the localization accuracy. We also demonstrate that such features will lead to more accurate and complete reconstructions when used in the application of 3D visual reconstruction.
△ Less
Submitted 22 May, 2018;
originally announced May 2018.
-
Accurate 3-D Reconstruction with RGB-D Cameras using Depth Map Fusion and Pose Refinement
Authors:
Markus Ylimäki,
Juho Kannala,
Janne Heikkilä
Abstract:
Depth map fusion is an essential part in both stereo and RGB-D based 3-D reconstruction pipelines. Whether produced with a passive stereo reconstruction or using an active depth sensor, such as Microsoft Kinect, the depth maps have noise and may have poor initial registration. In this paper, we introduce a method which is capable of handling outliers, and especially, even significant registration…
▽ More
Depth map fusion is an essential part in both stereo and RGB-D based 3-D reconstruction pipelines. Whether produced with a passive stereo reconstruction or using an active depth sensor, such as Microsoft Kinect, the depth maps have noise and may have poor initial registration. In this paper, we introduce a method which is capable of handling outliers, and especially, even significant registration errors. The proposed method first fuses a sequence of depth maps into a single non-redundant point cloud so that the redundant points are merged together by giving more weight to more certain measurements. Then, the original depth maps are re-registered to the fused point cloud to refine the original camera extrinsic parameters. The fusion is then performed again with the refined extrinsic parameters. This procedure is repeated until the result is satisfying or no significant changes happen between iterations. The method is robust to outliers and erroneous depth measurements as well as even significant depth map registration errors due to inaccurate initial camera poses.
△ Less
Submitted 24 April, 2018;
originally announced April 2018.
-
Cell Tracking via Proposal Generation and Selection
Authors:
Saad Ullah Akram,
Juho Kannala,
Lauri Eklund,
Janne Heikkilä
Abstract:
Microscopy imaging plays a vital role in understanding many biological processes in development and disease. The recent advances in automation of microscopes and development of methods and markers for live cell imaging has led to rapid growth in the amount of image data being captured. To efficiently and reliably extract useful insights from these captured sequences, automated cell tracking is ess…
▽ More
Microscopy imaging plays a vital role in understanding many biological processes in development and disease. The recent advances in automation of microscopes and development of methods and markers for live cell imaging has led to rapid growth in the amount of image data being captured. To efficiently and reliably extract useful insights from these captured sequences, automated cell tracking is essential. This is a challenging problem due to large variation in the appearance and shapes of cells depending on many factors including imaging methodology, biological characteristics of cells, cell matrix composition, labeling methodology, etc. Often cell tracking methods require a sequence-specific segmentation method and manual tuning of many tracking parameters, which limits their applicability to sequences other than those they are designed for. In this paper, we propose 1) a deep learning based cell proposal method, which proposes candidates for cells along with their scores, and 2) a cell tracking method, which links proposals in adjacent frames in a graphical model using edges representing different cellular events and poses joint cell detection and tracking as the selection of a subset of cell and edge proposals. Our method is completely automated and given enough training data can be applied to a wide variety of microscopy sequences. We evaluate our method on multiple fluorescence and phase contrast microscopy sequences containing cells of various shapes and appearances from ISBI cell tracking challenge, and show that our method outperforms existing cell tracking methods.
Code is available at: https://github.com/SaadUllahAkram/CellTracker
△ Less
Submitted 9 May, 2017;
originally announced May 2017.
-
Inertial-Based Scale Estimation for Structure from Motion on Mobile Devices
Authors:
Janne Mustaniemi,
Juho Kannala,
Simo Särkkä,
Jiri Matas,
Janne Heikkilä
Abstract:
Structure from motion algorithms have an inherent limitation that the reconstruction can only be determined up to the unknown scale factor. Modern mobile devices are equipped with an inertial measurement unit (IMU), which can be used for estimating the scale of the reconstruction. We propose a method that recovers the metric scale given inertial measurements and camera poses. In the process, we al…
▽ More
Structure from motion algorithms have an inherent limitation that the reconstruction can only be determined up to the unknown scale factor. Modern mobile devices are equipped with an inertial measurement unit (IMU), which can be used for estimating the scale of the reconstruction. We propose a method that recovers the metric scale given inertial measurements and camera poses. In the process, we also perform a temporal and spatial alignment of the camera and the IMU. Therefore, our solution can be easily combined with any existing visual reconstruction software. The method can cope with noisy camera pose estimates, typically caused by motion blur or rolling shutter artifacts, via utilizing a Rauch-Tung-Striebel (RTS) smoother. Furthermore, the scale estimation is performed in the frequency domain, which provides more robustness to inaccurate sensor time stamps and noisy IMU samples than the previously used time domain representation. In contrast to previous methods, our approach has no parameters that need to be tuned for achieving a good performance. In the experiments, we show that the algorithm outperforms the state-of-the-art in both accuracy and convergence speed of the scale estimate. The accuracy of the scale is around $1\%$ from the ground truth depending on the recording. We also demonstrate that our method can improve the scale accuracy of the Project Tango's build-in motion tracking.
△ Less
Submitted 11 August, 2017; v1 submitted 29 November, 2016;
originally announced November 2016.
-
Video Summarization using Deep Semantic Features
Authors:
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä,
Naokazu Yokoya
Abstract:
This paper presents a video summarization technique for an Internet video to provide a quick way to overview its content. This is a challenging problem because finding important or informative parts of the original video requires to understand its content. Furthermore the content of Internet videos is very diverse, ranging from home videos to documentaries, which makes video summarization much mor…
▽ More
This paper presents a video summarization technique for an Internet video to provide a quick way to overview its content. This is a challenging problem because finding important or informative parts of the original video requires to understand its content. Furthermore the content of Internet videos is very diverse, ranging from home videos to documentaries, which makes video summarization much more tough as prior knowledge is almost not available. To tackle this problem, we propose to use deep video features that can encode various levels of content semantics, including objects, actions, and scenes, improving the efficiency of standard video summarization techniques. For this, we design a deep neural network that maps videos as well as descriptions to a common semantic space and jointly trained it with associated pairs of videos and descriptions. To generate a video summary, we extract the deep features from each segment of the original video and apply a clustering-based summarization technique to them. We evaluate our video summaries using the SumMe dataset as well as baseline approaches. The results demonstrated the advantages of incorporating our deep semantic features in a video summarization technique.
△ Less
Submitted 27 September, 2016;
originally announced September 2016.
-
Learning Joint Representations of Videos and Sentences with Web Image Search
Authors:
Mayu Otani,
Yuta Nakashima,
Esa Rahtu,
Janne Heikkilä,
Naokazu Yokoya
Abstract:
Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate to distances. We also adopt the embedding approach, and make the following co…
▽ More
Our objective is video retrieval based on natural language queries. In addition, we consider the analogous problem of retrieving sentences or generating descriptions given an input video. Recent work has addressed the problem by embedding visual and textual inputs into a common space where semantic similarities correlate to distances. We also adopt the embedding approach, and make the following contributions: First, we utilize web image search in sentence embedding process to disambiguate fine-grained visual concepts. Second, we propose embedding models for sentence, image, and video inputs whose parameters are learned simultaneously. Finally, we show how the proposed model can be applied to description generation. Overall, we observe a clear improvement over the state-of-the-art methods in the video and sentence retrieval tasks. In description generation, the performance level is comparable to the current state-of-the-art, although our embeddings were trained for the retrieval tasks.
△ Less
Submitted 8 August, 2016;
originally announced August 2016.