Search | arXiv e-print repository

Object-level Cross-view Geo-localization with Location Enhancement and Multi-Head Cross Attention

Authors: Zheyang Huang, Jagannath Aryal, Saeid Nahavandi, Xuequan Lu, Chee Peng Lim, Lei Wei, Hailing Zhou

Abstract: Cross-view geo-localization determines the location of a query image, captured by a drone or ground-based camera, by matching it to a geo-referenced satellite image. While traditional approaches focus on image-level localization, many applications, such as search-and-rescue, infrastructure inspection, and precision delivery, demand object-level accuracy. This enables users to prompt a specific obj… ▽ More Cross-view geo-localization determines the location of a query image, captured by a drone or ground-based camera, by matching it to a geo-referenced satellite image. While traditional approaches focus on image-level localization, many applications, such as search-and-rescue, infrastructure inspection, and precision delivery, demand object-level accuracy. This enables users to prompt a specific object with a single click on a drone image to retrieve precise geo-tagged information of the object. However, variations in viewpoints, timing, and imaging conditions pose significant challenges, especially when identifying visually similar objects in extensive satellite imagery. To address these challenges, we propose an Object-level Cross-view Geo-localization Network (OCGNet). It integrates user-specified click locations using Gaussian Kernel Transfer (GKT) to preserve location information throughout the network. This cue is dually embedded into the feature encoder and feature matching blocks, ensuring robust object-specific localization. Additionally, OCGNet incorporates a Location Enhancement (LE) module and a Multi-Head Cross Attention (MHCA) module to adaptively emphasize object-specific features or expand focus to relevant contextual regions when necessary. OCGNet achieves state-of-the-art performance on a public dataset, CVOGL. It also demonstrates few-shot learning capabilities, effectively generalizing from limited examples, making it suitable for diverse applications (https://github.com/ZheyangH/OCGNet). △ Less

Submitted 23 May, 2025; originally announced May 2025.

arXiv:2404.14955 [pdf, other]

doi 10.1016/j.neucom.2025.130428

A Comprehensive Survey for Hyperspectral Image Classification: The Evolution from Conventional to Transformers and Mamba Models

Authors: Muhammad Ahmad, Salvatore Distifano, Adil Mehmood Khan, Manuel Mazzara, Chenyu Li, Hao Li, Jagannath Aryal, Yao Ding, Gemine Vivone, Danfeng Hong

Abstract: Hyperspectral Image Classification (HSC) presents significant challenges owing to the high dimensionality and intricate nature of Hyperspectral (HS) data. While traditional Machine Learning (TML) approaches have demonstrated effectiveness, they often encounter substantial obstacles in real-world applications, including the variability of optimal feature sets, subjectivity in human-driven design, i… ▽ More Hyperspectral Image Classification (HSC) presents significant challenges owing to the high dimensionality and intricate nature of Hyperspectral (HS) data. While traditional Machine Learning (TML) approaches have demonstrated effectiveness, they often encounter substantial obstacles in real-world applications, including the variability of optimal feature sets, subjectivity in human-driven design, inherent biases, and methodological limitations. Specifically, TML suffers from the curse of dimensionality, difficulties in feature selection and extraction, insufficient consideration of spatial information, limited robustness against noise, scalability issues, and inadequate adaptability to complex data distributions. In recent years, Deep Learning (DL) techniques have emerged as robust solutions to address these challenges. This survey offers a comprehensive overview of current trends and future prospects in HSC, emphasizing advancements from DL models to the increasing adoption of Transformer and Mamba Model architectures. We systematically review key concepts, methodologies, and state-of-the-art approaches in DL for HSC. Furthermore, we investigate the potential of Transformer-based models and the Mamba Model in HSC, detailing their advantages and challenges. Emerging trends in HSC are explored, including in-depth discussions on Explainable AI and Interoperability concepts, alongside Diffusion Models for image denoising, feature extraction, and image fusion. Comprehensive experimental results were conducted on three HS datasets to substantiate the efficacy of various conventional DL models and Transformers. Additionally, we identify several open challenges and pertinent research questions in the field of HSC. Finally, we outline future research directions and potential applications aimed at enhancing the accuracy and efficiency of HSC. △ Less

Submitted 14 November, 2024; v1 submitted 23 April, 2024; originally announced April 2024.

Report number: https://doi.org/10.1016/j.neucom.2025.130428

arXiv:2311.03867 [pdf, other]

Supervised domain adaptation for building extraction from off-nadir aerial images

Authors: Bipul Neupane, Jagannath Aryal, Abbas Rajabifard

Abstract: Building extraction $-$ needed for inventory management and planning of urban environment $-$ is affected by the misalignment between labels and off-nadir source imagery in training data. Teacher-Student learning of noise-tolerant convolutional neural networks (CNNs) is the existing solution, but the Student networks typically have lower accuracy and cannot surpass the Teacher's performance. This… ▽ More Building extraction $-$ needed for inventory management and planning of urban environment $-$ is affected by the misalignment between labels and off-nadir source imagery in training data. Teacher-Student learning of noise-tolerant convolutional neural networks (CNNs) is the existing solution, but the Student networks typically have lower accuracy and cannot surpass the Teacher's performance. This paper proposes a supervised domain adaptation (SDA) of encoder-decoder networks (EDNs) between noisy and clean datasets to tackle the problem. EDNs are configured with high-performing lightweight encoders such as EfficientNet, ResNeSt, and MobileViT. The proposed method is compared against the existing Teacher-Student learning methods like knowledge distillation (KD) and deep mutual learning (DML) with three newly developed datasets. The methods are evaluated for different urban buildings (low-rise, mid-rise, high-rise, and skyscrapers), where misalignment increases with the increase in building height and spatial resolution. For a robust experimental design, 43 lightweight CNNs, five optimisers, nine loss functions, and seven EDNs are benchmarked to obtain the best-performing EDN for SDA. The SDA of the best-performing EDN from our study significantly outperformed KD and DML with up to 0.943, 0.868, 0.912, and 0.697 F1 scores in the low-rise, mid-rise, high-rise, and skyscrapers respectively. The proposed method and the experimental findings will be beneficial in training robust CNNs for building extraction. △ Less

Submitted 6 August, 2024; v1 submitted 7 November, 2023; originally announced November 2023.

Comments: This work has been submitted to Elsevier for possible publication

arXiv:2308.14076 [pdf, other]

doi 10.1109/LGRS.2023.3312643

A Novel Multi-scale Attention Feature Extraction Block for Aerial Remote Sensing Image Classification

Authors: Chiranjibi Sitaula, Jagannath Aryal, Avik Bhattacharya

Abstract: Classification of very high-resolution (VHR) aerial remote sensing (RS) images is a well-established research area in the remote sensing community as it provides valuable spatial information for decision-making. Existing works on VHR aerial RS image classification produce an excellent classification performance; nevertheless, they have a limited capability to well-represent VHR RS images having co… ▽ More Classification of very high-resolution (VHR) aerial remote sensing (RS) images is a well-established research area in the remote sensing community as it provides valuable spatial information for decision-making. Existing works on VHR aerial RS image classification produce an excellent classification performance; nevertheless, they have a limited capability to well-represent VHR RS images having complex and small objects, thereby leading to performance instability. As such, we propose a novel plug-and-play multi-scale attention feature extraction block (MSAFEB) based on multi-scale convolution at two levels with skip connection, producing discriminative/salient information at a deeper/finer level. The experimental study on two benchmark VHR aerial RS image datasets (AID and NWPU) demonstrates that our proposal achieves a stable/consistent performance (minimum standard deviation of $0.002$) and competent overall classification performance (AID: 95.85\% and NWPU: 94.09\%). △ Less

Submitted 27 August, 2023; originally announced August 2023.

Comments: The paper is under review in IEEE Geoscience and Remote Sensing Letters Journal (IEEE-GRSL). This version may be deleted and/or updated based on the journal's policy

Journal ref: IEEE Geoscience and Remote Sensing Letters, 2023

arXiv:2306.10274 [pdf, other]

doi 10.1109/TGRS.2024.3381976

Benchmarking Deep Learning Architectures for Urban Vegetation Point Cloud Semantic Segmentation from MLS

Authors: Aditya Aditya, Bharat Lohani, Jagannath Aryal, Stephan Winter

Abstract: Vegetation is crucial for sustainable and resilient cities providing various ecosystem services and well-being of humans. However, vegetation is under critical stress with rapid urbanization and expanding infrastructure footprints. Consequently, mapping of this vegetation is essential in the urban environment. Recently, deep learning for point cloud semantic segmentation has shown significant prog… ▽ More Vegetation is crucial for sustainable and resilient cities providing various ecosystem services and well-being of humans. However, vegetation is under critical stress with rapid urbanization and expanding infrastructure footprints. Consequently, mapping of this vegetation is essential in the urban environment. Recently, deep learning for point cloud semantic segmentation has shown significant progress. Advanced models attempt to obtain state-of-the-art performance on benchmark datasets, comprising multiple classes and representing real world scenarios. However, class specific segmentation with respect to vegetation points has not been explored. Therefore, selection of a deep learning model for vegetation points segmentation is ambiguous. To address this problem, we provide a comprehensive assessment of point-based deep learning models for semantic segmentation of vegetation class. We have selected seven representative point-based models, namely PointCNN, KPConv (omni-supervised), RandLANet, SCFNet, PointNeXt, SPoTr and PointMetaBase. These models are investigated on three different datasets, specifically Chandigarh, Toronto3D and Kerala, which are characterized by diverse nature of vegetation and varying scene complexity combined with changing per-point features and class-wise composition. PointMetaBase and KPConv (omni-supervised) achieve the highest mIoU on the Chandigarh (95.24%) and Toronto3D datasets (91.26%), respectively while PointCNN provides the highest mIoU on the Kerala dataset (85.68%). The paper develops a deeper insight, hitherto not reported, into the working of these models for vegetation segmentation and outlines the ingredients that should be included in a model specifically for vegetation segmentation. This paper is a step towards the development of a novel architecture for vegetation points segmentation. △ Less

Submitted 1 May, 2024; v1 submitted 17 June, 2023; originally announced June 2023.

Comments: The paper has been accepted for publication in IEEE Transactions on Geoscience and Remote Sensing. DOI: 10.1109/TGRS.2024.3381976

Journal ref: in IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1-14, 2024

arXiv:2305.04766 [pdf]

OSTA: One-shot Task-adaptive Channel Selection for Semantic Segmentation of Multichannel Images

Authors: Yuanzhi Cai, Jagannath Aryal, Yuan Fang, Hong Huang, Lei Fan

Abstract: Semantic segmentation of multichannel images is a fundamental task for many applications. Selecting an appropriate channel combination from the original multichannel image can improve the accuracy of semantic segmentation and reduce the cost of data storage, processing and future acquisition. Existing channel selection methods typically use a reasonable selection procedure to determine a desirable… ▽ More Semantic segmentation of multichannel images is a fundamental task for many applications. Selecting an appropriate channel combination from the original multichannel image can improve the accuracy of semantic segmentation and reduce the cost of data storage, processing and future acquisition. Existing channel selection methods typically use a reasonable selection procedure to determine a desirable channel combination, and then train a semantic segmentation network using that combination. In this study, the concept of pruning from a supernet is used for the first time to integrate the selection of channel combination and the training of a semantic segmentation network. Based on this concept, a One-Shot Task-Adaptive (OSTA) channel selection method is proposed for the semantic segmentation of multichannel images. OSTA has three stages, namely the supernet training stage, the pruning stage and the fine-tuning stage. The outcomes of six groups of experiments (L7Irish3C, L7Irish2C, L8Biome3C, L8Biome2C, RIT-18 and Semantic3D) demonstrated the effectiveness and efficiency of OSTA. OSTA achieved the highest segmentation accuracies in all tests (62.49% (mIoU), 75.40% (mIoU), 68.38% (mIoU), 87.63% (mIoU), 66.53% (mA) and 70.86% (mIoU), respectively). It even exceeded the highest accuracies of exhaustive tests (61.54% (mIoU), 74.91% (mIoU), 67.94% (mIoU), 87.32% (mIoU), 65.32% (mA) and 70.27% (mIoU), respectively), where all possible channel combinations were tested. All of this can be accomplished within a predictable and relatively efficient timeframe, ranging from 101.71% to 298.1% times the time required to train the segmentation network alone. In addition, there were interesting findings that were deemed valuable for several fields. △ Less

Submitted 8 May, 2023; originally announced May 2023.

arXiv:2305.00679 [pdf, other]

doi 10.1007/s00521-024-09446-y

Enhanced Multi-level Features for Very High Resolution Remote Sensing Scene Classification

Authors: Chiranjibi Sitaula, Sumesh KC, Jagannath Aryal

Abstract: Very high-resolution (VHR) remote sensing (RS) scene classification is a challenging task due to the higher inter-class similarity and intra-class variability problems. Recently, the existing deep learning (DL)-based methods have shown great promise in VHR RS scene classification. However, they still provide an unstable classification performance. To address such a problem, we, in this letter, pro… ▽ More Very high-resolution (VHR) remote sensing (RS) scene classification is a challenging task due to the higher inter-class similarity and intra-class variability problems. Recently, the existing deep learning (DL)-based methods have shown great promise in VHR RS scene classification. However, they still provide an unstable classification performance. To address such a problem, we, in this letter, propose a novel DL-based approach. For this, we devise an enhanced VHR attention module (EAM), followed by the atrous spatial pyramid pooling (ASPP) and global average pooling (GAP). This procedure imparts the enhanced features from the corresponding level. Then, the multi-level feature fusion is performed. Experimental results on two widely-used VHR RS datasets show that the proposed approach yields a competitive and stable/robust classification performance with the least standard deviation of 0.001. Further, the highest overall accuracies on the AID and the NWPU datasets are 95.39% and 93.04%, respectively. △ Less

Submitted 3 August, 2023; v1 submitted 1 May, 2023; originally announced May 2023.

Comments: This paper has been submitted to the journal for peer review. Based on the journal's policy and restrictions, this version may be updated or deleted

Journal ref: Neural Computing and Applications, 2024

arXiv:2303.09064 [pdf, other]

Dual skip connections in U-Net, ResUnet and U-Net3+ for remote extraction of buildings

Authors: Bipul Neupane, Jagannath Aryal, Abbas Rajabifard

Abstract: Urban buildings are extracted from high-resolution Earth observation (EO) images using semantic segmentation networks like U-Net and its successors. Each re-iteration aims to improve performance by employing a denser skip connection mechanism that harnesses multi-scale features for accurate object mapping. However, denser connections increase network parameters and do not necessarily contribute to… ▽ More Urban buildings are extracted from high-resolution Earth observation (EO) images using semantic segmentation networks like U-Net and its successors. Each re-iteration aims to improve performance by employing a denser skip connection mechanism that harnesses multi-scale features for accurate object mapping. However, denser connections increase network parameters and do not necessarily contribute to precise segmentation. In this paper, we develop three dual skip connection mechanisms for three networks (U-Net, ResUnet, and U-Net3+) to selectively deepen the essential feature maps for improved performance. The three mechanisms are evaluated on feature maps of different scales, producing nine new network configurations. They are evaluated against their original vanilla configurations on four building footprint datasets of different spatial resolutions, including a multi-resolution (0.3+0.6+1.2m) dataset that we develop for complex urban environments. The evaluation revealed that densifying the large- and small-scale features in U-Net and U-Net3+ produce up to 0.905 F1, more than TransUnet (0.903) and Swin-Unet (0.882) in our new dataset with up to 19x fewer parameters. The results conclude that selectively densifying feature maps and skip connections enhances network performance without a substantial increase in parameters. The findings and the new dataset will contribute to the computer vision domain and urban planning decision processes. △ Less

Submitted 22 October, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: This work has been submitted to Springer for possible publication

arXiv:2206.07326 [pdf, other]

doi 10.1007/s11042-023-15005-9

Recent Advances in Scene Image Representation and Classification

Authors: Chiranjibi Sitaula, Tej Bahadur Shahi, Faezeh Marzbanrad, Jagannath Aryal

Abstract: With the rise of deep learning algorithms nowadays, scene image representation methods have achieved a significant performance boost in classification. However, the performance is still limited because the scene images are mostly complex having higher intra-class dissimilarity and inter-class similarity problems. To deal with such problems, there have been several methods proposed in the literatur… ▽ More With the rise of deep learning algorithms nowadays, scene image representation methods have achieved a significant performance boost in classification. However, the performance is still limited because the scene images are mostly complex having higher intra-class dissimilarity and inter-class similarity problems. To deal with such problems, there have been several methods proposed in the literature with their advantages and limitations. A detailed study of previous works is necessary to understand their advantages and disadvantages in image representation and classification problems. In this paper, we review the existing scene image representation methods that are being widely used for image classification. For this, we, first, devise the taxonomy using the seminal existing methods proposed in the literature to this date {using deep learning (DL)-based, computer vision (CV)-based, and search engine (SE)-based methods}. Next, we compare their performance both qualitatively (e.g., quality of outputs, pros/cons, etc.) and quantitatively (e.g., accuracy). Last, we speculate on the prominent research directions in scene image representation tasks using {keyword growth and timeline analysis.} Overall, this survey provides in-depth insights and applications of recent scene image representation methods under three different methods. △ Less

Submitted 21 December, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

Comments: This paper is under review in Multimedia Tools and Applications (Springer) journal. This article may be deleted or updated based on the policies of the journal

Journal ref: Multimedia Tools and Applications, 2023

arXiv:2203.05161 [pdf, other]

Container Orchestration in Edge and Fog Computing Environments for Real-Time IoT Applications

Authors: Zhiyu Wang, Mohammad Goudarzi, Jagannath Aryal, Rajkumar Buyya

Abstract: Resource management is the principal factor to fully utilize the potential of Edge/Fog computing to execute real-time and critical IoT applications. Although some resource management frameworks exist, the majority are not designed based on distributed containerized components. Hence, they are not suitable for highly distributed and heterogeneous computing environments. Containerized resource manag… ▽ More Resource management is the principal factor to fully utilize the potential of Edge/Fog computing to execute real-time and critical IoT applications. Although some resource management frameworks exist, the majority are not designed based on distributed containerized components. Hence, they are not suitable for highly distributed and heterogeneous computing environments. Containerized resource management frameworks such as FogBus2 enable efficient distribution of framework's components alongside IoT applications' components. However, the management, deployment, health-check, and scalability of a large number of containers are challenging issues. To orchestrate a multitude of containers, several orchestration tools are developed. But, many of these orchestration tools are heavy-weight and have a high overhead, especially for resource-limited Edge/Fog nodes. Thus, for hybrid computing environments, consisting of heterogeneous Edge/Fog and/or Cloud nodes, lightweight container orchestration tools are required to support both resource-limited resources at the Edge/Fog and resource-rich resources at the Cloud. Thus, in this paper, we propose a feasible approach to build a hybrid and lightweight cluster based on K3s, for the FogBus2 framework that offers containerized resource management framework. This work addresses the challenge of creating lightweight computing clusters in hybrid computing environments. It also proposes three design patterns for the deployment of the FogBus2 framework in hybrid environments, including 1) Host Network, 2) Proxy Server, and 3) Environment Variable. The performance evaluation shows that the proposed approach improves the response time of real-time IoT applications up to 29% with acceptable and low overhead. △ Less

Submitted 10 March, 2022; originally announced March 2022.

Comments: 20 pages, 10 figures

Showing 1–10 of 10 results for author: Aryal, J