-
LDPoly: Latent Diffusion for Polygonal Road Outline Extraction in Large-Scale Topographic Mapping
Authors:
Weiqin Jiao,
Hao Cheng,
George Vosselman,
Claudio Persello
Abstract:
Polygonal road outline extraction from high-resolution aerial images is an important task in large-scale topographic mapping, where roads are represented as vectorized polygons, capturing essential geometric features with minimal vertex redundancy. Despite its importance, no existing method has been explicitly designed for this task. While polygonal building outline extraction has been extensively…
▽ More
Polygonal road outline extraction from high-resolution aerial images is an important task in large-scale topographic mapping, where roads are represented as vectorized polygons, capturing essential geometric features with minimal vertex redundancy. Despite its importance, no existing method has been explicitly designed for this task. While polygonal building outline extraction has been extensively studied, the unique characteristics of roads, such as branching structures and topological connectivity, pose challenges to these methods. To address this gap, we introduce LDPoly, the first dedicated framework for extracting polygonal road outlines from high-resolution aerial images. Our method leverages a novel Dual-Latent Diffusion Model with a Channel-Embedded Fusion Module, enabling the model to simultaneously generate road masks and vertex heatmaps. A tailored polygonization method is then applied to obtain accurate vectorized road polygons with minimal vertex redundancy. We evaluate LDPoly on a new benchmark dataset, Map2ImLas, which contains detailed polygonal annotations for various topographic objects in several Dutch regions. Our experiments include both in-region and cross-region evaluations, with the latter designed to assess the model's generalization performance on unseen regions. Quantitative and qualitative results demonstrate that LDPoly outperforms state-of-the-art polygon extraction methods across various metrics, including pixel-level coverage, vertex efficiency, polygon regularity, and road connectivity. We also design two new metrics to assess polygon simplicity and boundary smoothness. Moreover, this work represents the first application of diffusion models for extracting precise vectorized object outlines without redundant vertices from remote-sensing imagery, paving the way for future advancements in this field.
△ Less
Submitted 29 April, 2025;
originally announced April 2025.
-
Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation
Authors:
Kun Li,
George Vosselman,
Michael Ying Yang
Abstract:
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. Recent advancements, particularly Transformer-based fusion designs, have demonstrated remarkable progress in this domain. However, existing methods primarily focus on refining visual features using language-aware guidance during the cr…
▽ More
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. Recent advancements, particularly Transformer-based fusion designs, have demonstrated remarkable progress in this domain. However, existing methods primarily focus on refining visual features using language-aware guidance during the cross-modal fusion stage, neglecting the complementary vision-to-language flow. This limitation often leads to irrelevant or suboptimal representations. In addition, the diverse spatial scales of ground objects in aerial images pose significant challenges to the visual perception capabilities of existing models when conditioned on textual inputs. In this paper, we propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges for RRSIS. Specifically, we design a Bidirectional Alignment Module (BAM) with learnable query tokens to selectively and effectively represent visual and linguistic features, emphasizing regions associated with key tokens. BAM is further enhanced with a dynamic feature selection block, designed to provide both macro- and micro-level visual features, preserving global context and local details to facilitate more effective cross-modal interaction. Furthermore, SBANet incorporates a text-conditioned channel and spatial aggregator to bridge the gap between the encoder and decoder, enhancing cross-scale information exchange in complex aerial scenarios. Extensive experiments demonstrate that our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets, both quantitatively and qualitatively. The code will be released after publication.
△ Less
Submitted 6 January, 2025; v1 submitted 1 January, 2025;
originally announced January 2025.
-
RoIPoly: Vectorized Building Outline Extraction Using Vertex and Logit Embeddings
Authors:
Weiqin Jiao,
Hao Cheng,
Claudio Persello,
George Vosselman
Abstract:
Polygonal building outlines are crucial for geographic and cartographic applications. The existing approaches for outline extraction from aerial or satellite imagery are typically decomposed into subtasks, e.g., building masking and vectorization, or treat this task as a sequence-to-sequence prediction of ordered vertices. The former lacks efficiency, and the latter often generates redundant verti…
▽ More
Polygonal building outlines are crucial for geographic and cartographic applications. The existing approaches for outline extraction from aerial or satellite imagery are typically decomposed into subtasks, e.g., building masking and vectorization, or treat this task as a sequence-to-sequence prediction of ordered vertices. The former lacks efficiency, and the latter often generates redundant vertices, both resulting in suboptimal performance. To handle these issues, we propose a novel Region-of-Interest (RoI) query-based approach called RoIPoly. Specifically, we formulate each vertex as a query and constrain the query attention on the most relevant regions of a potential building, yielding reduced computational overhead and more efficient vertex level interaction. Moreover, we introduce a novel learnable logit embedding to facilitate vertex classification on the attention map; thus, no post-processing is needed for redundant vertex removal. We evaluated our method on the vectorized building outline extraction dataset CrowdAI and the 2D floorplan reconstruction dataset Structured3D. On the CrowdAI dataset, RoIPoly with a ResNet50 backbone outperforms existing methods with the same or better backbones on most MS-COCO metrics, especially on small buildings, and achieves competitive results in polygon quality and vertex redundancy without any post-processing. On the Structured3D dataset, our method achieves the second-best performance on most metrics among existing methods dedicated to 2D floorplan reconstruction, demonstrating our cross-domain generalization capability. The code will be released upon acceptance of this paper.
△ Less
Submitted 20 July, 2024;
originally announced July 2024.
-
PolyR-CNN: R-CNN for end-to-end polygonal building outline extraction
Authors:
Weiqin Jiao,
Claudio Persello,
George Vosselman
Abstract:
Polygonal building outline extraction has been a research focus in recent years. Most existing methods have addressed this challenging task by decomposing it into several subtasks and employing carefully designed architectures. Despite their accuracy, such pipelines often introduce inefficiencies during training and inference. This paper presents an end-to-end framework, denoted as PolyR-CNN, whic…
▽ More
Polygonal building outline extraction has been a research focus in recent years. Most existing methods have addressed this challenging task by decomposing it into several subtasks and employing carefully designed architectures. Despite their accuracy, such pipelines often introduce inefficiencies during training and inference. This paper presents an end-to-end framework, denoted as PolyR-CNN, which offers an efficient and fully integrated approach to predict vectorized building polygons and bounding boxes directly from remotely sensed images. Notably, PolyR-CNN leverages solely the features of the Region of Interest (RoI) for the prediction, thereby mitigating the necessity for complex designs. Furthermore, we propose a novel scheme with PolyR-CNN to extract detailed outline information from polygon vertex coordinates, termed vertex proposal feature, to guide the RoI features to predict more regular buildings. PolyR-CNN demonstrates the capacity to deal with buildings with holes through a simple post-processing method on the Inria dataset. Comprehensive experiments conducted on the CrowdAI dataset show that PolyR-CNN achieves competitive accuracy compared to state-of-the-art methods while significantly improving computational efficiency, i.e., achieving 79.2 Average Precision (AP), exhibiting a 15.9 AP gain and operating 2.5 times faster and four times lighter than the well-established end-to-end method PolyWorld. Replacing the backbone with a simple ResNet-50, PolyR-CNN maintains a 71.1 AP while running four times faster than PolyWorld.
△ Less
Submitted 20 July, 2024;
originally announced July 2024.
-
Learning from Exemplars for Interactive Image Segmentation
Authors:
Kun Li,
Hao Cheng,
George Vosselman,
Michael Ying Yang
Abstract:
Interactive image segmentation enables users to interact minimally with a machine, facilitating the gradual refinement of the segmentation mask for a target of interest. Previous studies have demonstrated impressive performance in extracting a single target mask through interactive segmentation. However, the information cues of previously interacted objects have been overlooked in the existing met…
▽ More
Interactive image segmentation enables users to interact minimally with a machine, facilitating the gradual refinement of the segmentation mask for a target of interest. Previous studies have demonstrated impressive performance in extracting a single target mask through interactive segmentation. However, the information cues of previously interacted objects have been overlooked in the existing methods, which can be further explored to speed up interactive segmentation for multiple targets in the same category. To this end, we introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category. Specifically, our model leverages transformer backbones to extract interaction-focused visual features from the image and the interactions to obtain a satisfactory mask of a target as an exemplar. For multiple objects, we propose an exemplar-informed module to enhance the learning of similarities among the objects of the target category. To combine attended features from different modules, we incorporate cross-attention blocks followed by a feature fusion module. Experiments conducted on mainstream benchmarks demonstrate that our models achieve superior performance compared to previous methods. Particularly, our model reduces users' labor by around 15\%, requiring two fewer clicks to achieve target IoUs 85\% and 90\%. The results highlight our models' potential as a flexible and practical annotation tool. The source code will be released after publication.
△ Less
Submitted 17 June, 2024;
originally announced June 2024.
-
Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization
Authors:
Yao Wei,
Martin Renqiang Min,
George Vosselman,
Li Erran Li,
Michael Ying Yang
Abstract:
Compositional 3D scene synthesis has diverse applications across a spectrum of industries such as robotics, films, and video games, as it closely mirrors the complexity of real-world multi-object environments. Conventional works typically employ shape retrieval based frameworks which naturally suffer from limited shape diversity. Recent progresses have been made in object shape generation with gen…
▽ More
Compositional 3D scene synthesis has diverse applications across a spectrum of industries such as robotics, films, and video games, as it closely mirrors the complexity of real-world multi-object environments. Conventional works typically employ shape retrieval based frameworks which naturally suffer from limited shape diversity. Recent progresses have been made in object shape generation with generative models such as diffusion models, which increases the shape fidelity. However, these approaches separately treat 3D shape generation and layout generation. The synthesized scenes are usually hampered by layout collision, which suggests that the scene-level fidelity is still under-explored. In this paper, we aim at generating realistic and reasonable 3D indoor scenes from scene graph. To enrich the priors of the given scene graph inputs, large language model is utilized to aggregate the global-wise features with local node-wise and edge-wise features. With a unified graph encoder, graph features are extracted to guide joint layout-shape generation. Additional regularization is introduced to explicitly constrain the produced 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.
△ Less
Submitted 26 August, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Multimodal Rationales for Explainable Visual Question Answering
Authors:
Kun Li,
George Vosselman,
Michael Ying Yang
Abstract:
Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However, the inner reasoning behind the predictions is disregarded in such a "black box" system, and we cannot ascertain the trustworthiness of the predictions. Even mo…
▽ More
Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However, the inner reasoning behind the predictions is disregarded in such a "black box" system, and we cannot ascertain the trustworthiness of the predictions. Even more concerning, in some cases, these models predict correct answers despite focusing on irrelevant visual regions or textual tokens. To develop an explainable and trustworthy answering system, we propose a novel model termed MRVQA (Multimodal Rationales for VQA), which provides visual and textual rationales to support its predicted answers. To measure the quality of generated rationales, a new metric vtS (visual-textual Similarity) score is introduced from both visual and textual perspectives. Considering the extra annotations distinct from standard VQA, MRVQA is trained and evaluated using samples synthesized from some existing datasets. Extensive experiments across three EVQA datasets demonstrate that MRVQA achieves new state-of-the-art results through additional rationale generation, enhancing the trustworthiness of the explainable VQA model. The code and the synthesized dataset are released under https://github.com/lik1996/MRVQA2025.
△ Less
Submitted 10 June, 2025; v1 submitted 6 February, 2024;
originally announced February 2024.
-
BuilDiff: 3D Building Shape Generation using Single-Image Conditional Point Cloud Diffusion Models
Authors:
Yao Wei,
George Vosselman,
Michael Ying Yang
Abstract:
3D building generation with low data acquisition costs, such as single image-to-3D, becomes increasingly important. However, most of the existing single image-to-3D building creation works are restricted to those images with specific viewing angles, hence they are difficult to scale to general-view images that commonly appear in practical cases. To fill this gap, we propose a novel 3D building sha…
▽ More
3D building generation with low data acquisition costs, such as single image-to-3D, becomes increasingly important. However, most of the existing single image-to-3D building creation works are restricted to those images with specific viewing angles, hence they are difficult to scale to general-view images that commonly appear in practical cases. To fill this gap, we propose a novel 3D building shape generation method exploiting point cloud diffusion models with image conditioning schemes, which demonstrates flexibility to the input images. By cooperating two conditional diffusion models and introducing a regularization strategy during denoising process, our method is able to synthesize building roofs while maintaining the overall structures. We validate our framework on two newly built datasets and extensive experiments show that our method outperforms previous works in terms of building generation quality.
△ Less
Submitted 31 August, 2023;
originally announced September 2023.
-
Mono-hydra: Real-time 3D scene graph construction from monocular camera input with IMU
Authors:
U. V. B. L. Udugama,
G. Vosselman,
F. Nex
Abstract:
The ability of robots to autonomously navigate through 3D environments depends on their comprehension of spatial concepts, ranging from low-level geometry to high-level semantics, such as objects, places, and buildings. To enable such comprehension, 3D scene graphs have emerged as a robust tool for representing the environment as a layered graph of concepts and their relationships. However, buildi…
▽ More
The ability of robots to autonomously navigate through 3D environments depends on their comprehension of spatial concepts, ranging from low-level geometry to high-level semantics, such as objects, places, and buildings. To enable such comprehension, 3D scene graphs have emerged as a robust tool for representing the environment as a layered graph of concepts and their relationships. However, building these representations using monocular vision systems in real-time remains a difficult task that has not been explored in depth. This paper puts forth a real-time spatial perception system Mono-Hydra, combining a monocular camera and an IMU sensor setup, focusing on indoor scenarios. However, the proposed approach is adaptable to outdoor applications, offering flexibility in its potential uses. The system employs a suite of deep learning algorithms to derive depth and semantics. It uses a robocentric visual-inertial odometry (VIO) algorithm based on square-root information, thereby ensuring consistent visual odometry with an IMU and a monocular camera. This system achieves sub-20 cm error in real-time processing at 15 fps, enabling real-time 3D scene graph construction using a laptop GPU (NVIDIA 3080). This enhances decision-making efficiency and effectiveness in simple camera setups, augmenting robotic system agility. We make Mono-Hydra publicly available at: https://github.com/UAV-Centre-ITC/Mono_Hydra
△ Less
Submitted 10 August, 2023;
originally announced August 2023.
-
Interactive Image Segmentation with Cross-Modality Vision Transformers
Authors:
Kun Li,
George Vosselman,
Michael Ying Yang
Abstract:
Interactive image segmentation aims to segment the target from the background with the manual guidance, which takes as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved a great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to interactive segmentation task. Howev…
▽ More
Interactive image segmentation aims to segment the target from the background with the manual guidance, which takes as input multimodal data such as images, clicks, scribbles, and bounding boxes. Recently, vision transformers have achieved a great success in several downstream visual tasks, and a few efforts have been made to bring this powerful architecture to interactive segmentation task. However, the previous works neglect the relations between two modalities and directly mock the way of processing purely visual information with self-attentions. In this paper, we propose a simple yet effective network for click-based interactive segmentation with cross-modality vision transformers. Cross-modality transformers exploits mutual information to better guide the learning process. The experiments on several benchmarks show that the proposed method achieves superior performance in comparison to the previous state-of-the-art models. The stability of our method in term of avoiding failure cases shows its potential to be a practical annotation tool. The code and pretrained models will be released under https://github.com/lik1996/iCMFormer.
△ Less
Submitted 5 July, 2023;
originally announced July 2023.
-
Channel-Aware Distillation Transformer for Depth Estimation on Nano Drones
Authors:
Ning Zhang,
Francesco Nex,
George Vosselman,
Norman Kerle
Abstract:
Autonomous navigation of drones using computer vision has achieved promising performance. Nano-sized drones based on edge computing platforms are lightweight, flexible, and cheap, thus suitable for exploring narrow spaces. However, due to their extremely limited computing power and storage, vision algorithms designed for high-performance GPU platforms cannot be used for nano drones. To address thi…
▽ More
Autonomous navigation of drones using computer vision has achieved promising performance. Nano-sized drones based on edge computing platforms are lightweight, flexible, and cheap, thus suitable for exploring narrow spaces. However, due to their extremely limited computing power and storage, vision algorithms designed for high-performance GPU platforms cannot be used for nano drones. To address this issue this paper presents a lightweight CNN depth estimation network deployed on nano drones for obstacle avoidance. Inspired by Knowledge Distillation (KD), a Channel-Aware Distillation Transformer (CADiT) is proposed to facilitate the small network to learn knowledge from a larger network. The proposed method is validated on the KITTI dataset and tested on a nano drone Crazyflie, with an ultra-low power microprocessor GAP8.
△ Less
Submitted 18 March, 2023;
originally announced March 2023.
-
HRVQA: A Visual Question Answering Benchmark for High-Resolution Aerial Images
Authors:
Kun Li,
George Vosselman,
Michael Ying Yang
Abstract:
Visual question answering (VQA) is an important and challenging multimodal task in computer vision. Recently, a few efforts have been made to bring VQA task to aerial images, due to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, not only the huge variation in the appearance, scale and orientation of the concepts in aeria…
▽ More
Visual question answering (VQA) is an important and challenging multimodal task in computer vision. Recently, a few efforts have been made to bring VQA task to aerial images, due to its potential real-world applications in disaster monitoring, urban planning, and digital earth product generation. However, not only the huge variation in the appearance, scale and orientation of the concepts in aerial images, but also the scarcity of the well-annotated datasets restricts the development of VQA in this domain. In this paper, we introduce a new dataset, HRVQA, which provides collected 53512 aerial images of 1024*1024 pixels and semi-automatically generated 1070240 QA pairs. To benchmark the understanding capability of VQA models for aerial images, we evaluate the relevant methods on HRVQA. Moreover, we propose a novel model, GFTransformer, with gated attention modules and a mutual fusion module. The experiments show that the proposed dataset is quite challenging, especially the specific attribute related questions. Our method achieves superior performance in comparison to the previous state-of-the-art approaches. The dataset and the source code will be released at https://hrvqa.nl/.
△ Less
Submitted 23 January, 2023;
originally announced January 2023.
-
Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation
Authors:
Ning Zhang,
Francesco Nex,
George Vosselman,
Norman Kerle
Abstract:
Self-supervised monocular depth estimation that does not require ground truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. This paper achieves comparable results with a lightweight ar…
▽ More
Self-supervised monocular depth estimation that does not require ground truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. This paper achieves comparable results with a lightweight architecture. Specifically, the efficient combination of CNNs and Transformers is investigated, and a hybrid architecture called Lite-Mono is presented. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters.
△ Less
Submitted 15 March, 2023; v1 submitted 23 November, 2022;
originally announced November 2022.
-
Flow-based GAN for 3D Point Cloud Generation from a Single Image
Authors:
Yao Wei,
George Vosselman,
Michael Ying Yang
Abstract:
Generating a 3D point cloud from a single 2D image is of great importance for 3D scene understanding applications. To reconstruct the whole 3D shape of the object shown in the image, the existing deep learning based approaches use either explicit or implicit generative modeling of point clouds, which, however, suffer from limited quality. In this work, we aim to alleviate this issue by introducing…
▽ More
Generating a 3D point cloud from a single 2D image is of great importance for 3D scene understanding applications. To reconstruct the whole 3D shape of the object shown in the image, the existing deep learning based approaches use either explicit or implicit generative modeling of point clouds, which, however, suffer from limited quality. In this work, we aim to alleviate this issue by introducing a hybrid explicit-implicit generative modeling scheme, which inherits the flow-based explicit generative models for sampling point clouds with arbitrary resolutions while improving the detailed 3D structures of point clouds by leveraging the implicit generative adversarial networks (GANs). We evaluate on the large-scale synthetic dataset ShapeNet, with the experimental results demonstrating the superior performance of the proposed method. In addition, the generalization ability of our method is demonstrated by performing on cross-category synthetic images as well as by testing on real images from PASCAL3D+ dataset.
△ Less
Submitted 8 October, 2022;
originally announced October 2022.
-
Bidirectional Multi-scale Attention Networks for Semantic Segmentation of Oblique UAV Imagery
Authors:
Ye Lyu,
George Vosselman,
Gui-Song Xia,
Michael Ying Yang
Abstract:
Semantic segmentation for aerial platforms has been one of the fundamental scene understanding task for the earth observation. Most of the semantic segmentation research focused on scenes captured in nadir view, in which objects have relatively smaller scale variation compared with scenes captured in oblique view. The huge scale variation of objects in oblique images limits the performance of deep…
▽ More
Semantic segmentation for aerial platforms has been one of the fundamental scene understanding task for the earth observation. Most of the semantic segmentation research focused on scenes captured in nadir view, in which objects have relatively smaller scale variation compared with scenes captured in oblique view. The huge scale variation of objects in oblique images limits the performance of deep neural networks (DNN) that process images in a single scale fashion. In order to tackle the scale variation issue, in this paper, we propose the novel bidirectional multi-scale attention networks, which fuse features from multiple scales bidirectionally for more adaptive and effective feature extraction. The experiments are conducted on the UAVid2020 dataset and have shown the effectiveness of our method. Our model achieved the state-of-the-art (SOTA) result with a mean intersection over union (mIoU) score of 70.80%.
△ Less
Submitted 5 February, 2021;
originally announced February 2021.
-
LGENet: Local and Global Encoder Network for Semantic Segmentation of Airborne Laser Scanning Point Clouds
Authors:
Yaping Lin,
George Vosselman,
Yanpeng Cao,
Michael Ying Yang
Abstract:
Interpretation of Airborne Laser Scanning (ALS) point clouds is a critical procedure for producing various geo-information products like 3D city models, digital terrain models and land use maps. In this paper, we present a local and global encoder network (LGENet) for semantic segmentation of ALS point clouds. Adapting the KPConv network, we first extract features by both 2D and 3D point convoluti…
▽ More
Interpretation of Airborne Laser Scanning (ALS) point clouds is a critical procedure for producing various geo-information products like 3D city models, digital terrain models and land use maps. In this paper, we present a local and global encoder network (LGENet) for semantic segmentation of ALS point clouds. Adapting the KPConv network, we first extract features by both 2D and 3D point convolutions to allow the network to learn more representative local geometry. Then global encoders are used in the network to exploit contextual information at the object and point level. We design a segment-based Edge Conditioned Convolution to encode the global context between segments. We apply a spatial-channel attention module at the end of the network, which not only captures the global interdependencies between points but also models interactions between channels. We evaluate our method on two ALS datasets namely, the ISPRS benchmark dataset and DCF2019 dataset. For the ISPRS benchmark dataset, our model achieves state-of-the-art results with an overall accuracy of 0.845 and an average F1 score of 0.737. With regards to the DFC2019 dataset, our proposed network achieves an overall accuracy of 0.984 and an average F1 score of 0.834.
△ Less
Submitted 18 December, 2020;
originally announced December 2020.
-
Plug & Play Convolutional Regression Tracker for Video Object Detection
Authors:
Ye Lyu,
Michael Ying Yang,
George Vosselman,
Gui-Song Xia
Abstract:
Video object detection targets to simultaneously localize the bounding boxes of the objects and identify their classes in a given video. One challenge for video object detection is to consistently detect all objects across the whole video. As the appearance of objects may deteriorate in some frames, features or detections from the other frames are commonly used to enhance the prediction. In this p…
▽ More
Video object detection targets to simultaneously localize the bounding boxes of the objects and identify their classes in a given video. One challenge for video object detection is to consistently detect all objects across the whole video. As the appearance of objects may deteriorate in some frames, features or detections from the other frames are commonly used to enhance the prediction. In this paper, we propose a Plug & Play scale-adaptive convolutional regression tracker for the video object detection task, which could be easily and compatibly implanted into the current state-of-the-art detection networks. As the tracker reuses the features from the detector, it is a very light-weighted increment to the detection network. The whole network performs at the speed close to a standard object detector. With our new video object detection pipeline design, image object detectors can be easily turned into efficient video object detectors without modifying any parameters. The performance is evaluated on the large-scale ImageNet VID dataset. Our Plug & Play design improves mAP score for the image detector by around 5% with only little speed drop.
△ Less
Submitted 2 March, 2020;
originally announced March 2020.
-
LIP: Learning Instance Propagation for Video Object Segmentation
Authors:
Ye Lyu,
George Vosselman,
Gui-Song Xia,
Michael Ying Yang
Abstract:
In recent years, the task of segmenting foreground objects from background in a video, i.e. video object segmentation (VOS), has received considerable attention. In this paper, we propose a single end-to-end trainable deep neural network, convolutional gated recurrent Mask-RCNN, for tackling the semi-supervised VOS task. We take advantage of both the instance segmentation network (Mask-RCNN) and t…
▽ More
In recent years, the task of segmenting foreground objects from background in a video, i.e. video object segmentation (VOS), has received considerable attention. In this paper, we propose a single end-to-end trainable deep neural network, convolutional gated recurrent Mask-RCNN, for tackling the semi-supervised VOS task. We take advantage of both the instance segmentation network (Mask-RCNN) and the visual memory module (Conv-GRU) to tackle the VOS task. The instance segmentation network predicts masks for instances, while the visual memory module learns to selectively propagate information for multiple instances simultaneously, which handles the appearance change, the variation of scale and pose and the occlusions between objects. After offline and online training under purely instance segmentation losses, our approach is able to achieve satisfactory results without any post-processing or synthetic video data augmentation. Experimental results on DAVIS 2016 dataset and DAVIS 2017 dataset have demonstrated the effectiveness of our method for video object segmentation task.
△ Less
Submitted 30 September, 2019;
originally announced October 2019.
-
Robust object extraction from remote sensing data
Authors:
Sophie Crommelinck,
Mila Koeva,
Michael Ying Yang,
George Vosselman
Abstract:
The extraction of object outlines has been a research topic during the last decades. In spite of advances in photogrammetry, remote sensing and computer vision, this task remains challenging due to object and data complexity. The development of object extraction approaches is promoted through publically available benchmark datasets and evaluation frameworks. Many aspects of performance evaluation…
▽ More
The extraction of object outlines has been a research topic during the last decades. In spite of advances in photogrammetry, remote sensing and computer vision, this task remains challenging due to object and data complexity. The development of object extraction approaches is promoted through publically available benchmark datasets and evaluation frameworks. Many aspects of performance evaluation have already been studied. This study collects the best practices from literature, puts the various aspects in one evaluation framework, and demonstrates its usefulness to a case study on mapping object outlines. The evaluation framework includes five dimensions: the robustness to changes in resolution, input, location, parameters, and application. Examples for investigating these dimensions are provided, as well as accuracy measures for their qualitative analysis. The measures consist of time efficiency and a procedure for line-based accuracy assessment regarding quantitative completeness and spatial correctness. The delineation approach to which the evaluation framework is applied, was previously introduced and is substantially improved in this study.
△ Less
Submitted 3 April, 2019;
originally announced April 2019.
-
Unsupervised Domain Adaptation for Multispectral Pedestrian Detection
Authors:
Dayan Guan,
Xing Luo,
Yanpeng Cao,
Jiangxin Yang,
Yanlong Cao,
George Vosselman,
Michael Ying Yang
Abstract:
Multimodal information (e.g., visible and thermal) can generate robust pedestrian detections to facilitate around-the-clock computer vision applications, such as autonomous driving and video surveillance. However, it still remains a crucial challenge to train a reliable detector working well in different multispectral pedestrian datasets without manual annotations. In this paper, we propose a nove…
▽ More
Multimodal information (e.g., visible and thermal) can generate robust pedestrian detections to facilitate around-the-clock computer vision applications, such as autonomous driving and video surveillance. However, it still remains a crucial challenge to train a reliable detector working well in different multispectral pedestrian datasets without manual annotations. In this paper, we propose a novel unsupervised domain adaptation framework for multispectral pedestrian detection, by iteratively generating pseudo annotations and updating the parameters of our designed multispectral pedestrian detector on target domain. Pseudo annotations are generated using the detector trained on source domain, and then updated by fixing the parameters of detector and minimizing the cross entropy loss without back-propagation. Training labels are generated using the pseudo annotations by considering the characteristics of similarity and complementarity between well-aligned visible and infrared image pairs. The parameters of detector are updated using the generated labels by minimizing our defined multi-detection loss function with back-propagation. The optimal parameters of detector can be obtained after iteratively updating the pseudo annotations and parameters. Experimental results show that our proposed unsupervised multimodal domain adaptation method achieves significantly higher detection performance than the approach without domain adaptation, and is competitive with the supervised multispectral pedestrian detectors.
△ Less
Submitted 7 April, 2019;
originally announced April 2019.
-
UAVid: A Semantic Segmentation Dataset for UAV Imagery
Authors:
Ye Lyu,
George Vosselman,
Guisong Xia,
Alper Yilmaz,
Michael Ying Yang
Abstract:
Semantic segmentation has been one of the leading research interests in computer vision recently. It serves as a perception foundation for many fields, such as robotics and autonomous driving. The fast development of semantic segmentation attributes enormously to the large scale datasets, especially for the deep learning related methods. There already exist several semantic segmentation datasets f…
▽ More
Semantic segmentation has been one of the leading research interests in computer vision recently. It serves as a perception foundation for many fields, such as robotics and autonomous driving. The fast development of semantic segmentation attributes enormously to the large scale datasets, especially for the deep learning related methods. There already exist several semantic segmentation datasets for comparison among semantic segmentation methods in complex urban scenes, such as the Cityscapes and CamVid datasets, where the side views of the objects are captured with a camera mounted on the driving car. There also exist semantic labeling datasets for the airborne images and the satellite images, where the top views of the objects are captured. However, only a few datasets capture urban scenes from an oblique Unmanned Aerial Vehicle (UAV) perspective, where both of the top view and the side view of the objects can be observed, providing more information for object recognition. In this paper, we introduce our UAVid dataset, a new high-resolution UAV semantic segmentation dataset as a complement, which brings new challenges, including large scale variation, moving object recognition and temporal consistency preservation. Our UAV dataset consists of 30 video sequences capturing 4K high-resolution images in slanted views. In total, 300 images have been densely labeled with 8 classes for the semantic labeling task. We have provided several deep learning baseline methods with pre-training, among which the proposed Multi-Scale-Dilation net performs the best via multi-scale feature extraction. Our UAVid website and the labeling tool have been published https://uavid.nl/.
△ Less
Submitted 18 May, 2020; v1 submitted 24 October, 2018;
originally announced October 2018.
-
Change Detection between Multimodal Remote Sensing Data Using Siamese CNN
Authors:
Zhenchao Zhang,
George Vosselman,
Markus Gerke,
Devis Tuia,
Michael Ying Yang
Abstract:
Detecting topographic changes in the urban environment has always been an important task for urban planning and monitoring. In practice, remote sensing data are often available in different modalities and at different time epochs. Change detection between multimodal data can be very challenging since the data show different characteristics. Given 3D laser scanning point clouds and 2D imagery from…
▽ More
Detecting topographic changes in the urban environment has always been an important task for urban planning and monitoring. In practice, remote sensing data are often available in different modalities and at different time epochs. Change detection between multimodal data can be very challenging since the data show different characteristics. Given 3D laser scanning point clouds and 2D imagery from different epochs, this paper presents a framework to detect building and tree changes. First, the 2D and 3D data are transformed to image patches, respectively. A Siamese CNN is then employed to detect candidate changes between the two epochs. Finally, the candidate patch-based changes are grouped and verified as individual object changes. Experiments on the urban data show that 86.4\% of patch pairs can be correctly classified by the model.
△ Less
Submitted 25 July, 2018;
originally announced July 2018.
-
Patch-based Evaluation of Dense Image Matching Quality
Authors:
Zhenchao Zhang,
Markus Gerke,
George Vosselman,
Michael Ying Yang
Abstract:
Airborne laser scanning and photogrammetry are two main techniques to obtain 3D data representing the object surface. Due to the high cost of laser scanning, we want to explore the potential of using point clouds derived by dense image matching (DIM), as effective alternatives to laser scanning data. We present a framework to evaluate point clouds from dense image matching and derived Digital Surf…
▽ More
Airborne laser scanning and photogrammetry are two main techniques to obtain 3D data representing the object surface. Due to the high cost of laser scanning, we want to explore the potential of using point clouds derived by dense image matching (DIM), as effective alternatives to laser scanning data. We present a framework to evaluate point clouds from dense image matching and derived Digital Surface Models (DSM) based on automatically extracted sample patches. Dense matching error and noise level are evaluated quantitatively at both the local level and whole block level. Experiments show that the optimal vertical accuracy achieved by dense matching is as follows: the mean offset to the reference data is 0.1 Ground Sampling Distance (GSD); the maximum offset goes up to 1.0 GSD. When additional oblique images are used in dense matching, the mean deviation, the variation of mean deviation and the level of random noise all get improved. We also detect a bias between the point cloud and DSM from a single photogrammetric workflow. This framework also allows to reveal inhomogeneity in the distribution of the dense matching errors due to over-fitted BBA network. Meanwhile, suggestions are given on the photogrammetric quality control.
△ Less
Submitted 25 July, 2018;
originally announced July 2018.
-
Towards Automated Cadastral Boundary Delineation from UAV Data
Authors:
Sophie Crommelinck,
Michael Ying Yang,
Mila Koeva,
Markus Gerke,
Rohan Bennett,
George Vosselman
Abstract:
Unmanned aerial vehicles (UAV) are evolving as an alternative tool to acquire land tenure data. UAVs can capture geospatial data at high quality and resolution in a cost-effective, transparent and flexible manner, from which visible land parcel boundaries, i.e., cadastral boundaries are delineable. This delineation is to no extent automated, even though physical objects automatically retrievable t…
▽ More
Unmanned aerial vehicles (UAV) are evolving as an alternative tool to acquire land tenure data. UAVs can capture geospatial data at high quality and resolution in a cost-effective, transparent and flexible manner, from which visible land parcel boundaries, i.e., cadastral boundaries are delineable. This delineation is to no extent automated, even though physical objects automatically retrievable through image analysis methods mark a large portion of cadastral boundaries. This study proposes (i) a workflow that automatically extracts candidate cadastral boundaries from UAV orthoimages and (ii) a tool for their semi-automatic processing to delineate final cadastral boundaries. The workflow consists of two state-of-the-art computer vision methods, namely gPb contour detection and SLIC superpixels that are transferred to remote sensing in this study. The tool combines the two methods, allows a semi-automatic final delineation and is implemented as a publicly available QGIS plugin. The approach does not yet aim to provide a comparable alternative to manual cadastral mapping procedures. However, the methodological development of the tool towards this goal is developed in this paper. A study with 13 volunteers investigates the design and implementation of the approach and gathers initial qualitative as well as quantitate results. The study revealed points for improvement, which are prioritized based on the study results and which will be addressed in future work.
△ Less
Submitted 6 September, 2017;
originally announced September 2017.