Search | arXiv e-print repository

arXiv:2507.20776 [pdf, ps, other]

RingMo-Agent: A Unified Remote Sensing Foundation Model for Multi-Platform and Multi-Modal Reasoning

Authors: Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, Kun Fu, Xian Sun

Abstract: Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, they still remain limited to conventional visual perception tasks such as classification or captioning. As a result, these methods fai… ▽ More Remote sensing (RS) images from multiple modalities and platforms exhibit diverse details due to differences in sensor characteristics and imaging perspectives. Existing vision-language research in RS largely relies on relatively homogeneous data sources. Moreover, they still remain limited to conventional visual perception tasks such as classification or captioning. As a result, these methods fail to serve as a unified and standalone framework capable of effectively handling RS imagery from diverse sources in real-world applications. To address these issues, we propose RingMo-Agent, a model designed to handle multi-modal and multi-platform data that performs perception and reasoning tasks based on user textual instructions. Compared with existing models, RingMo-Agent 1) is supported by a large-scale vision-language dataset named RS-VL3M, comprising over 3 million image-text pairs, spanning optical, SAR, and infrared (IR) modalities collected from both satellite and UAV platforms, covering perception and challenging reasoning tasks; 2) learns modality adaptive representations by incorporating separated embedding layers to construct isolated features for heterogeneous modalities and reduce cross-modal interference; 3) unifies task modeling by introducing task-specific tokens and employing a token-based high-dimensional hidden state decoding mechanism designed for long-horizon spatial tasks. Extensive experiments on various RS vision-language tasks demonstrate that RingMo-Agent not only proves effective in both visual understanding and sophisticated analytical tasks, but also exhibits strong generalizability across different platforms and sensing modalities. △ Less

Submitted 28 July, 2025; originally announced July 2025.

Comments: 21 pages, 6 figures, 20 tables

arXiv:2507.02294 [pdf, ps, other]

ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation

Authors: Hanbo Bi, Yulong Xu, Ya Li, Yongqiang Mao, Boyuan Tong, Chongyang Li, Chunbo Lang, Wenhui Diao, Hongqi Wang, Yingchao Feng, Xian Sun

Abstract: The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially… ▽ More The Segment Anything Model (SAM), with its prompt-driven paradigm, exhibits strong generalization in generic segmentation tasks. However, applying SAM to remote sensing (RS) images still faces two major challenges. First, manually constructing precise prompts for each image (e.g., points or boxes) is labor-intensive and inefficient, especially in RS scenarios with dense small objects or spatially fragmented distributions. Second, SAM lacks domain adaptability, as it is pre-trained primarily on natural images and struggles to capture RS-specific semantics and spatial characteristics, especially when segmenting novel or unseen classes. To address these issues, inspired by few-shot learning, we propose ViRefSAM, a novel framework that guides SAM utilizing only a few annotated reference images that contain class-specific objects. Without requiring manual prompts, ViRefSAM enables automatic segmentation of class-consistent objects across RS images. Specifically, ViRefSAM introduces two key components while keeping SAM's original architecture intact: (1) a Visual Contextual Prompt Encoder that extracts class-specific semantic clues from reference images and generates object-aware prompts via contextual interaction with target images; and (2) a Dynamic Target Alignment Adapter, integrated into SAM's image encoder, which mitigates the domain gap by injecting class-specific semantics into target image features, enabling SAM to dynamically focus on task-relevant regions. Extensive experiments on three few-shot segmentation benchmarks, including iSAID-5$^i$, LoveDA-2$^i$, and COCO-20$^i$, demonstrate that ViRefSAM enables accurate and automatic segmentation of unseen classes by leveraging only a few reference images and consistently outperforms existing few-shot segmentation methods across diverse datasets. △ Less

Submitted 3 July, 2025; originally announced July 2025.

arXiv:2504.11999 [pdf, other]

A Complex-valued SAR Foundation Model Based on Physically Inspired Representation Learning

Authors: Mengyu Wang, Hanbo Bi, Yingchao Feng, Linlin Xin, Shuo Gong, Tianqi Wang, Zhiyuan Yan, Peijin Wang, Wenhui Diao, Xian Sun

Abstract: Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient i… ▽ More Vision foundation models in remote sensing have been extensively studied due to their superior generalization on various downstream tasks. Synthetic Aperture Radar (SAR) offers all-day, all-weather imaging capabilities, providing significant advantages for Earth observation. However, establishing a foundation model for SAR image interpretation inevitably encounters the challenges of insufficient information utilization and poor interpretability. In this paper, we propose a remote sensing foundation model based on complex-valued SAR data, which simulates the polarimetric decomposition process for pre-training, i.e., characterizing pixel scattering intensity as a weighted combination of scattering bases and scattering coefficients, thereby endowing the foundation model with physical interpretability. Specifically, we construct a series of scattering queries, each representing an independent and meaningful scattering basis, which interact with SAR features in the scattering query decoder and output the corresponding scattering coefficient. To guide the pre-training process, polarimetric decomposition loss and power self-supervision loss are constructed. The former aligns the predicted coefficients with Yamaguchi coefficients, while the latter reconstructs power from the predicted coefficients and compares it to the input image's power. The performance of our foundation model is validated on six typical downstream tasks, achieving state-of-the-art results. Notably, the foundation model can extract stable feature representations and exhibits strong generalization, even in data-scarce conditions. △ Less

Submitted 16 April, 2025; originally announced April 2025.

arXiv:2504.08991 [pdf, other]

CMS RPC Non-Physics Event Data Automation Ideology

Authors: A. Dimitrov, M. Tytgat, K. Mota Amarilo, A. Samalan, K. Skovpen, G. A. Alves, E. Alves Coelho, F. Marujo da Silva, M. Barroso Ferreira Filho, E. M. Da Costa, D. De Jesus Damiao, S. Fonseca De Souza, R. Gomes De Souza, L. Mundim, H. Nogima, J. P. Pinheiro, A. Santoro, M. Thiel, A. Aleksandrov, R. Hadjiiska, P. Iaydjiev, M. Shopova, G. Sultanov, L. Litov, B. Pavlov , et al. (79 additional authors not shown)

Abstract: This paper presents a streamlined framework for real-time processing and analysis of condition data from the CMS experiment Resistive Plate Chambers (RPC). Leveraging data streaming, it uncovers correlations between RPC performance metrics, like currents and rates, and LHC luminosity or environmental conditions. The Java-based framework automates data handling and predictive modeling, integrating… ▽ More This paper presents a streamlined framework for real-time processing and analysis of condition data from the CMS experiment Resistive Plate Chambers (RPC). Leveraging data streaming, it uncovers correlations between RPC performance metrics, like currents and rates, and LHC luminosity or environmental conditions. The Java-based framework automates data handling and predictive modeling, integrating extensive datasets into synchronized, query-optimized tables. By segmenting LHC operations and analyzing larger virtual detector objects, the automation enhances monitoring precision, accelerates visualization, and provides predictive insights, revolutionizing RPC performance evaluation and future behavior modeling. △ Less

Submitted 11 April, 2025; originally announced April 2025.

Comments: CMS RPC Condition Data Automation, Java framework, 12 pages, 23 figures, CMS Condition database

arXiv:2504.03166 [pdf, other]

RingMoE: Mixture-of-Modality-Experts Multi-Modal Foundation Models for Universal Remote Sensing Image Interpretation

Authors: Hanbo Bi, Yingchao Feng, Boyuan Tong, Mengyu Wang, Haichen Yu, Yongqiang Mao, Hao Chang, Wenhui Diao, Peijin Wang, Yue Yu, Hanyang Peng, Yehong Zhang, Kun Fu, Xian Sun

Abstract: The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-sp… ▽ More The rapid advancement of foundation models has revolutionized visual representation learning in a self-supervised manner. However, their application in remote sensing (RS) remains constrained by a fundamental gap: existing models predominantly handle single or limited modalities, overlooking the inherently multi-modal nature of RS observations. Optical, synthetic aperture radar (SAR), and multi-spectral data offer complementary insights that significantly reduce the inherent ambiguity and uncertainty in single-source analysis. To bridge this gap, we introduce RingMoE, a unified multi-modal RS foundation model with 14.7 billion parameters, pre-trained on 400 million multi-modal RS images from nine satellites. RingMoE incorporates three key innovations: (1) A hierarchical Mixture-of-Experts (MoE) architecture comprising modal-specialized, collaborative, and shared experts, effectively modeling intra-modal knowledge while capturing cross-modal dependencies to mitigate conflicts between modal representations; (2) Physics-informed self-supervised learning, explicitly embedding sensor-specific radiometric characteristics into the pre-training objectives; (3) Dynamic expert pruning, enabling adaptive model compression from 14.7B to 1B parameters while maintaining performance, facilitating efficient deployment in Earth observation applications. Evaluated across 23 benchmarks spanning six key RS tasks (i.e., classification, detection, segmentation, tracking, change detection, and depth estimation), RingMoE outperforms existing foundation models and sets new SOTAs, demonstrating remarkable adaptability from single-modal to multi-modal scenarios. Beyond theoretical progress, it has been deployed and trialed in multiple sectors, including emergency response, land management, marine sciences, and urban planning. △ Less

Submitted 4 April, 2025; originally announced April 2025.

arXiv:2412.06369 [pdf]

Magnomechanically induced transparency in the ferrimagnetic bridge crystal of atom opto-magnomechanical system

Authors: Wenting Diao, Xi Wang, Ke Di, Yu Liu, Anyu Cheng, Chunxiao Cai, Wenhai Yang, Jiajia Du

Abstract: We investigate the absorption and transmission properties of a weak probe field in an atom opto-magnomechanics system. The system comprises an assembly of two-level atoms and a magnon mode within a ferrimagnetic crystal, which directly interacts with an optical cavity mode through the crystal's deformation displacement. We observe optomechanically induced transparency (OMIT) via radiation pressure… ▽ More We investigate the absorption and transmission properties of a weak probe field in an atom opto-magnomechanics system. The system comprises an assembly of two-level atoms and a magnon mode within a ferrimagnetic crystal, which directly interacts with an optical cavity mode through the crystal's deformation displacement. We observe optomechanically induced transparency (OMIT) via radiation pressure and a magnomechanically induced transparency (MMIT) due to the nonlinear magnon-phonon interaction. In addition, due to the coupling of the atom to the detected and signal light, the system's width transparency window is divided into two narrow windows. Additionally, we demonstrate that the group delay is contingent upon the tunability of the magnon-phonon coupling strength. Our solution possesses significant in the field of quantum precision measurement. △ Less

Submitted 9 December, 2024; originally announced December 2024.

arXiv:2411.17984 [pdf, ps, other]

RS-vHeat: Heat Conduction Guided Efficient Remote Sensing Foundation Model

Authors: Huiyang Hu, Peijin Wang, Hanbo Bi, Boyuan Tong, Zhaozhi Wang, Wenhui Diao, Hao Chang, Yingchao Feng, Ziqi Zhang, Yaowei Wang, Qixiang Ye, Kun Fu, Xian Sun

Abstract: Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with large-scale remote sensing images. To overcome these, we draw inspiration from heat conduction, a physica… ▽ More Remote sensing foundation models largely break away from the traditional paradigm of designing task-specific models, offering greater scalability across multiple tasks. However, they face challenges such as low computational efficiency and limited interpretability, especially when dealing with large-scale remote sensing images. To overcome these, we draw inspiration from heat conduction, a physical process modeling local heat diffusion. Building on this idea, we are the first to explore the potential of using the parallel computing model of heat conduction to simulate the local region correlations in high-resolution remote sensing images, and introduce RS-vHeat, an efficient multi-modal remote sensing foundation model. Specifically, RS-vHeat 1) applies the Heat Conduction Operator (HCO) with a complexity of $O(N^{1.5})$ and a global receptive field, reducing computational overhead while capturing remote sensing object structure information to guide heat diffusion; 2) learns the frequency distribution representations of various scenes through a self-supervised strategy based on frequency domain hierarchical masking and multi-domain reconstruction; 3) significantly improves efficiency and performance over state-of-the-art techniques across 4 tasks and 10 datasets. Compared to attention-based remote sensing foundation models, we reduce memory usage by 84\%, FLOPs by 24\% and improves throughput by 2.7 times. The code will be made publicly available. △ Less

Submitted 25 June, 2025; v1 submitted 26 November, 2024; originally announced November 2024.

Comments: 19 pages, 8 figures and 10 tables

arXiv:2409.17453 [pdf, other]

AgMTR: Agent Mining Transformer for Few-shot Segmentation in Remote Sensing

Authors: Hanbo Bi, Yingchao Feng, Yongqiang Mao, Jianning Pei, Wenhui Diao, Hongqi Wang, Xian Sun

Abstract: Few-shot Segmentation (FSS) aims to segment the interested objects in the query image with just a handful of labeled samples (i.e., support images). Previous schemes would leverage the similarity between support-query pixel pairs to construct the pixel-level semantic correlation. However, in remote sensing scenarios with extreme intra-class variations and cluttered backgrounds, such pixel-level co… ▽ More Few-shot Segmentation (FSS) aims to segment the interested objects in the query image with just a handful of labeled samples (i.e., support images). Previous schemes would leverage the similarity between support-query pixel pairs to construct the pixel-level semantic correlation. However, in remote sensing scenarios with extreme intra-class variations and cluttered backgrounds, such pixel-level correlations may produce tremendous mismatches, resulting in semantic ambiguity between the query foreground (FG) and background (BG) pixels. To tackle this problem, we propose a novel Agent Mining Transformer (AgMTR), which adaptively mines a set of local-aware agents to construct agent-level semantic correlation. Compared with pixel-level semantics, the given agents are equipped with local-contextual information and possess a broader receptive field. At this point, different query pixels can selectively aggregate the fine-grained local semantics of different agents, thereby enhancing the semantic clarity between query FG and BG pixels. Concretely, the Agent Learning Encoder (ALE) is first proposed to erect the optimal transport plan that arranges different agents to aggregate support semantics under different local regions. Then, for further optimizing the agents, the Agent Aggregation Decoder (AAD) and the Semantic Alignment Decoder (SAD) are constructed to break through the limited support set for mining valuable class-specific semantics from unlabeled data sources and the query image itself, respectively. Extensive experiments on the remote sensing benchmark iSAID indicate that the proposed method achieves state-of-the-art performance. Surprisingly, our method remains quite competitive when extended to more common natural scenarios, i.e., PASCAL-5i and COCO-20i. △ Less

Submitted 25 September, 2024; originally announced September 2024.

Comments: accepted to IJCV

arXiv:2409.13366 [pdf, ps, other]

RingMo-Aerial: An Aerial Remote Sensing Foundation Model With Affine Transformation Contrastive Learning

Authors: Wenhui Diao, Haichen Yu, Kaiyue Kang, Tong Ling, Di Liu, Yingchao Feng, Hanbo Bi, Libo Ren, Xuexue Li, Yongqiang Mao, Xian Sun

Abstract: Aerial Remote Sensing (ARS) vision tasks present significant challenges due to the unique viewing angle characteristics. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes RingMo-Aerial, aiming to fill the gap in foundation model research in the field of ARS vision. A Frequency-… ▽ More Aerial Remote Sensing (ARS) vision tasks present significant challenges due to the unique viewing angle characteristics. Existing research has primarily focused on algorithms for specific tasks, which have limited applicability in a broad range of ARS vision applications. This paper proposes RingMo-Aerial, aiming to fill the gap in foundation model research in the field of ARS vision. A Frequency-Enhanced Multi-Head Self-Attention (FE-MSA) mechanism is introduced to strengthen the model's capacity for small-object representation. Complementarily, an affine transformation-based contrastive learning method improves its adaptability to the tilted viewing angles inherent in ARS tasks. Furthermore, the ARS-Adapter, an efficient parameter fine-tuning method, is proposed to improve the model's adaptability and performance in various ARS vision tasks. Experimental results demonstrate that RingMo-Aerial achieves SOTA performance on multiple downstream tasks. This indicates the practicality and efficacy of RingMo-Aerial in enhancing the performance of ARS vision tasks. △ Less

Submitted 16 September, 2025; v1 submitted 20 September, 2024; originally announced September 2024.

arXiv:2409.10389 [pdf, other]

Prompt-and-Transfer: Dynamic Class-aware Enhancement for Few-shot Segmentation

Authors: Hanbo Bi, Yingchao Feng, Wenhui Diao, Peijin Wang, Yongqiang Mao, Kun Fu, Hongqi Wang, Xian Sun

Abstract: For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) would directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on… ▽ More For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) would directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called ``Prompt and Transfer" (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder for focusing on the interested object (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) Cross-modal linguistic information is introduced to initialize prompts for each task. 2) Semantic Prompt Transfer (SPT) that precisely transfers the class-specific semantics within the images to prompts. 3) Part Mask Generator (PMG) that works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-arts on 11 benchmarks. △ Less

Submitted 16 September, 2024; originally announced September 2024.

arXiv:2406.00523 [pdf, other]

Stealing Trust: Unraveling Blind Message Attacks in Web3 Authentication

Authors: Kailun Yan, Xiaokuan Zhang, Wenrui Diao

Abstract: As the field of Web3 continues its rapid expansion, the security of Web3 authentication, often the gateway to various Web3 applications, becomes increasingly crucial. Despite its widespread use as a login method by numerous Web3 applications, the security risks of Web3 authentication have not received much attention. This paper investigates the vulnerabilities in the Web3 authentication process an… ▽ More As the field of Web3 continues its rapid expansion, the security of Web3 authentication, often the gateway to various Web3 applications, becomes increasingly crucial. Despite its widespread use as a login method by numerous Web3 applications, the security risks of Web3 authentication have not received much attention. This paper investigates the vulnerabilities in the Web3 authentication process and proposes a new type of attack, dubbed blind message attacks. In blind message attacks, attackers trick users into blindly signing messages from target applications by exploiting users' inability to verify the source of messages, thereby achieving unauthorized access to the target application. We have developed Web3AuthChecker, a dynamic detection tool that interacts with Web3 authentication-related APIs to identify vulnerabilities. Our evaluation of real-world Web3 applications shows that a staggering 75.8% (22/29) of Web3 authentication deployments are at risk of blind message attacks. In response to this alarming situation, we implemented Web3AuthGuard on the open-source wallet MetaMask to alert users of potential attacks. Our evaluation results show that Web3AuthGuard can successfully raise alerts in 80% of the tested Web3 authentications. We have responsibly reported our findings to vulnerable websites and have been assigned two CVE IDs. △ Less

Submitted 30 September, 2024; v1 submitted 1 June, 2024; originally announced June 2024.

arXiv:2310.13913 [pdf, other]

Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models

Authors: Lihang Liu, Shanzhuo Zhang, Donglong He, Xianbin Ye, Jingbo Zhou, Xiaonan Zhang, Yaoyao Jiang, Weiming Diao, Hang Yin, Hua Chai, Fan Wang, Jingzhou He, Liang Zheng, Yonghui Li, Xiaomin Fang

Abstract: Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, it raises conce… ▽ More Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Recent advances have incorporated deep learning techniques to improve the accuracy of protein-ligand structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, it raises concerns regarding the generalizability of these deep learning-based methods due to the limited training data. In this work, we show that by pre-training on a large-scale docking conformation generated by traditional physics-based docking tools and then fine-tuning with a limited set of experimentally validated receptor-ligand complexes, we can obtain a protein-ligand structure prediction model with outstanding performance. Specifically, this process involved the generation of 100 million docking conformations for protein-ligand pairings, an endeavor consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been rigorously benchmarked against both physics-based and deep learning-based baselines, demonstrating its exceptional precision and robust transferability in predicting binding confirmation. In addition, our investigation reveals the scaling laws governing pre-trained protein-ligand structure prediction models, indicating a consistent enhancement in performance with increases in model parameters and the volume of pre-training data. Moreover, we applied HelixDock to several drug discovery-related tasks to validate its practical utility. HelixDock demonstrates outstanding capabilities on both cross-docking and structure-based virtual screening benchmarks. △ Less

Submitted 22 May, 2024; v1 submitted 21 October, 2023; originally announced October 2023.

arXiv:2310.12452 [pdf, other]

Not Just Learning from Others but Relying on Yourself: A New Perspective on Few-Shot Segmentation in Remote Sensing

Authors: Hanbo Bi, Yingchao Feng, Zhiyuan Yan, Yongqiang Mao, Wenhui Diao, Hongqi Wang, Xian Sun

Abstract: Few-shot segmentation (FSS) is proposed to segment unknown class targets with just a few annotated samples. Most current FSS methods follow the paradigm of mining the semantics from the support images to guide the query image segmentation. However, such a pattern of `learning from others' struggles to handle the extreme intra-class variation, preventing FSS from being directly generalized to remot… ▽ More Few-shot segmentation (FSS) is proposed to segment unknown class targets with just a few annotated samples. Most current FSS methods follow the paradigm of mining the semantics from the support images to guide the query image segmentation. However, such a pattern of `learning from others' struggles to handle the extreme intra-class variation, preventing FSS from being directly generalized to remote sensing scenes. To bridge the gap of intra-class variance, we develop a Dual-Mining network named DMNet for cross-image mining and self-mining, meaning that it no longer focuses solely on support images but pays more attention to the query image itself. Specifically, we propose a Class-public Region Mining (CPRM) module to effectively suppress irrelevant feature pollution by capturing the common semantics between the support-query image pair. The Class-specific Region Mining (CSRM) module is then proposed to continuously mine the class-specific semantics of the query image itself in a `filtering' and `purifying' manner. In addition, to prevent the co-existence of multiple classes in remote sensing scenes from exacerbating the collapse of FSS generalization, we also propose a new Known-class Meta Suppressor (KMS) module to suppress the activation of known-class objects in the sample. Extensive experiments on the iSAID and LoveDA remote sensing datasets have demonstrated that our method sets the state-of-the-art with a minimum number of model parameters. Significantly, our model with the backbone of Resnet-50 achieves the mIoU of 49.58% and 51.34% on iSAID under 1-shot and 5-shot settings, outperforming the state-of-the-art method by 1.8% and 1.12%, respectively. The code is publicly available at https://github.com/HanboBizl/DMNet. △ Less

Submitted 19 October, 2023; originally announced October 2023.

Comments: accepted to IEEE TGRS

arXiv:2304.11805 [pdf, other]

doi 10.1016/j.isprsjprs.2023.04.009

OGMN: Occlusion-guided Multi-task Network for Object Detection in UAV Images

Authors: Xuexue Li, Wenhui Diao, Yongqiang Mao, Peng Gao, Xiuhua Mao, Xinming Li, Xian Sun

Abstract: Occlusion between objects is one of the overlooked challenges for object detection in UAV images. Due to the variable altitude and angle of UAVs, occlusion in UAV images happens more frequently than that in natural scenes. Compared to occlusion in natural scene images, occlusion in UAV images happens with feature confusion problem and local aggregation characteristic. And we found that extracting… ▽ More Occlusion between objects is one of the overlooked challenges for object detection in UAV images. Due to the variable altitude and angle of UAVs, occlusion in UAV images happens more frequently than that in natural scenes. Compared to occlusion in natural scene images, occlusion in UAV images happens with feature confusion problem and local aggregation characteristic. And we found that extracting or localizing occlusion between objects is beneficial for the detector to address this challenge. According to this finding, the occlusion localization task is introduced, which together with the object detection task constitutes our occlusion-guided multi-task network (OGMN). The OGMN contains the localization of occlusion and two occlusion-guided multi-task interactions. In detail, an occlusion estimation module (OEM) is proposed to precisely localize occlusion. Then the OGMN utilizes the occlusion localization results to implement occlusion-guided detection with two multi-task interactions. One interaction for the guide is between two task decoders to address the feature confusion problem, and an occlusion decoupling head (ODH) is proposed to replace the general detection head. Another interaction for guide is designed in the detection process according to local aggregation characteristic, and a two-phase progressive refinement process (TPP) is proposed to optimize the detection process. Extensive experiments demonstrate the effectiveness of our OGMN on the Visdrone and UAVDT datasets. In particular, our OGMN achieves 35.0% mAP on the Visdrone dataset and outperforms the baseline by 5.3%. And our OGMN provides a new insight for accurate occlusion localization and achieves competitive detection performance. △ Less

Submitted 23 April, 2023; originally announced April 2023.

Comments: 20 pages, 15 figures

arXiv:2303.12304 [pdf, other]

SiamTHN: Siamese Target Highlight Network for Visual Tracking

Authors: Jiahao Bao, Kaiqiang Chen, Xian Sun, Liangjin Zhao, Wenhui Diao, Menglong Yan

Abstract: Siamese network based trackers develop rapidly in the field of visual object tracking in recent years. The majority of siamese network based trackers now in use treat each channel in the feature maps generated by the backbone network equally, making the similarity response map sensitive to background influence and hence challenging to focus on the target region. Additionally, there are no structur… ▽ More Siamese network based trackers develop rapidly in the field of visual object tracking in recent years. The majority of siamese network based trackers now in use treat each channel in the feature maps generated by the backbone network equally, making the similarity response map sensitive to background influence and hence challenging to focus on the target region. Additionally, there are no structural links between the classification and regression branches in these trackers, and the two branches are optimized separately during training. Therefore, there is a misalignment between the classification and regression branches, which results in less accurate tracking results. In this paper, a Target Highlight Module is proposed to help the generated similarity response maps to be more focused on the target region. To reduce the misalignment and produce more precise tracking results, we propose a corrective loss to train the model. The two branches of the model are jointly tuned with the use of corrective loss to produce more reliable prediction results. Experiments on 5 challenging benchmark datasets reveal that the method outperforms current models in terms of performance, and runs at 38 fps, proving its effectiveness and efficiency. △ Less

Submitted 22 March, 2023; originally announced March 2023.

arXiv:2302.02764 [pdf, other]

doi 10.1016/j.nima.2023.168449

Machine Learning based tool for CMS RPC currents quality monitoring

Authors: E. Shumka, A. Samalan, M. Tytgat, M. El Sawy, G. A. Alves, F. Marujo, E. A. Coelho, E. M. Da Costa, H. Nogima, A. Santoro, S. Fonseca De Souza, D. De Jesus Damiao, M. Thiel, K. Mota Amarilo, M. Barroso Ferreira Filho, A. Aleksandrov, R. Hadjiiska, P. Iaydjiev, M. Rodozov, M. Shopova, G. Soultanov, A. Dimitrov, L. Litov, B. Pavlov, P. Petkov , et al. (83 additional authors not shown)

Abstract: The muon system of the CERN Compact Muon Solenoid (CMS) experiment includes more than a thousand Resistive Plate Chambers (RPC). They are gaseous detectors operated in the hostile environment of the CMS underground cavern on the Large Hadron Collider where pp luminosities of up to $2\times 10^{34}$ $\text{cm}^{-2}\text{s}^{-1}$ are routinely achieved. The CMS RPC system performance is constantly m… ▽ More The muon system of the CERN Compact Muon Solenoid (CMS) experiment includes more than a thousand Resistive Plate Chambers (RPC). They are gaseous detectors operated in the hostile environment of the CMS underground cavern on the Large Hadron Collider where pp luminosities of up to $2\times 10^{34}$ $\text{cm}^{-2}\text{s}^{-1}$ are routinely achieved. The CMS RPC system performance is constantly monitored and the detector is regularly maintained to ensure stable operation. The main monitorable characteristics are dark current, efficiency for muon detection, noise rate etc. Herein we describe an automated tool for CMS RPC current monitoring which uses Machine Learning techniques. We further elaborate on the dedicated generalized linear model proposed already and add autoencoder models for self-consistent predictions as well as hybrid models to allow for RPC current predictions in a distant future. △ Less

Submitted 6 February, 2023; originally announced February 2023.

arXiv:2301.04581 [pdf, other]

doi 10.1109/TGRS.2023.3266477

Elevation Estimation-Driven Building 3D Reconstruction from Single-View Remote Sensing Imagery

Authors: Yongqiang Mao, Kaiqiang Chen, Liangjin Zhao, Wei Chen, Deke Tang, Wenjie Liu, Zhirui Wang, Wenhui Diao, Xian Sun, Kun Fu

Abstract: Building 3D reconstruction from remote sensing images has a wide range of applications in smart cities, photogrammetry and other fields. Methods for automatic 3D urban building modeling typically employ multi-view images as input to algorithms to recover point clouds and 3D models of buildings. However, such models rely heavily on multi-view images of buildings, which are time-intensive and limit… ▽ More Building 3D reconstruction from remote sensing images has a wide range of applications in smart cities, photogrammetry and other fields. Methods for automatic 3D urban building modeling typically employ multi-view images as input to algorithms to recover point clouds and 3D models of buildings. However, such models rely heavily on multi-view images of buildings, which are time-intensive and limit the applicability and practicality of the models. To solve these issues, we focus on designing an efficient DSM estimation-driven reconstruction framework (Building3D), which aims to reconstruct 3D building models from the input single-view remote sensing image. First, we propose a Semantic Flow Field-guided DSM Estimation (SFFDE) network, which utilizes the proposed concept of elevation semantic flow to achieve the registration of local and global features. Specifically, in order to make the network semantics globally aware, we propose an Elevation Semantic Globalization (ESG) module to realize the semantic globalization of instances. Further, in order to alleviate the semantic span of global features and original local features, we propose a Local-to-Global Elevation Semantic Registration (L2G-ESR) module based on elevation semantic flow. Our Building3D is rooted in the SFFDE network for building elevation prediction, synchronized with a building extraction network for building masks, and then sequentially performs point cloud reconstruction, surface reconstruction (or CityGML model reconstruction). On this basis, our Building3D can optionally generate CityGML models or surface mesh models of the buildings. Extensive experiments on ISPRS Vaihingen and DFC2019 datasets on the DSM estimation task show that our SFFDE significantly improves upon state-of-the-arts. Furthermore, our Building3D achieves impressive results in the 3D point cloud and 3D model reconstruction process. △ Less

Submitted 11 January, 2023; originally announced January 2023.

arXiv:2211.16591 [pdf, other]

doi 10.1016/j.nima.2023.168271

RPC based tracking system at CERN GIF++ facility

Authors: K. Mota Amarilo, A. Samalan, M. Tytgat, M. El Sawy, G. A. Alves, F. Marujo, E. A. Coelho, E. M. Da Costa, H. Nogima, A. Santoro, S. Fonseca De Souza, D. De Jesus Damiao, M. Thiel, M. Barroso Ferreira Filho, A. Aleksandrov, R. Hadjiiska, P. Iaydjiev, M. Rodozov, M. Shopova, G. Soultanov, A. Dimitrov, L. Litov, B. Pavlov, P. Petkov, A. Petrov , et al. (83 additional authors not shown)

Abstract: With the HL-LHC upgrade of the LHC machine, an increase of the instantaneous luminosity by a factor of five is expected and the current detection systems need to be validated for such working conditions to ensure stable data taking. At the CERN Gamma Irradiation Facility (GIF++) many muon detectors undergo such studies, but the high gamma background can pose a challenge to the muon trigger system… ▽ More With the HL-LHC upgrade of the LHC machine, an increase of the instantaneous luminosity by a factor of five is expected and the current detection systems need to be validated for such working conditions to ensure stable data taking. At the CERN Gamma Irradiation Facility (GIF++) many muon detectors undergo such studies, but the high gamma background can pose a challenge to the muon trigger system which is exposed to many fake hits from the gamma background. A tracking system using RPCs is implemented to clean the fake hits, taking profit of the high muon efficiency of these chambers. This work will present the tracking system configuration, used detector analysis algorithm and results. △ Less

Submitted 29 November, 2022; originally announced November 2022.

Comments: 12 pages, 9 figures. Contribution to XVI Workshop on Resistive Plate Chambers and Related Detectors (RPC2022), September 26-30 2022. Submitted to Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment

arXiv:2211.14782 [pdf, other]

Breaking Immutable: Information-Coupled Prototype Elaboration for Few-Shot Object Detection

Authors: Xiaonan Lu, Wenhui Diao, Yongqiang Mao, Junxi Li, Peijin Wang, Xian Sun, Kun Fu

Abstract: Few-shot object detection, expecting detectors to detect novel classes with a few instances, has made conspicuous progress. However, the prototypes extracted by existing meta-learning based methods still suffer from insufficient representative information and lack awareness of query images, which cannot be adaptively tailored to different query images. Firstly, only the support images are involved… ▽ More Few-shot object detection, expecting detectors to detect novel classes with a few instances, has made conspicuous progress. However, the prototypes extracted by existing meta-learning based methods still suffer from insufficient representative information and lack awareness of query images, which cannot be adaptively tailored to different query images. Firstly, only the support images are involved for extracting prototypes, resulting in scarce perceptual information of query images. Secondly, all pixels of all support images are treated equally when aggregating features into prototype vectors, thus the salient objects are overwhelmed by the cluttered background. In this paper, we propose an Information-Coupled Prototype Elaboration (ICPE) method to generate specific and representative prototypes for each query image. Concretely, a conditional information coupling module is introduced to couple information from the query branch to the support branch, strengthening the query-perceptual information in support features. Besides, we design a prototype dynamic aggregation module that dynamically adjusts intra-image and inter-image aggregation weights to highlight the salient information useful for detecting query images. Experimental results on both Pascal VOC and MS COCO demonstrate that our method achieves state-of-the-art performance in almost all settings. △ Less

Submitted 27 November, 2022; originally announced November 2022.

Comments: Accepted by AAAI2023

arXiv:2207.10278 [pdf, other]

doi 10.1016/j.isprsjprs.2022.03.019

Beyond single receptive field: A receptive field fusion-and-stratification network for airborne laser scanning point cloud classification

Authors: Yongqiang Mao, Kaiqiang Chen, Wenhui Diao, Xian Sun, Xiaonan Lu, Kun Fu, Martin Weinmann

Abstract: The classification of airborne laser scanning (ALS) point clouds is a critical task of remote sensing and photogrammetry fields. Although recent deep learning-based methods have achieved satisfactory performance, they have ignored the unicity of the receptive field, which makes the ALS point cloud classification remain challenging for the distinguishment of the areas with complex structures and ex… ▽ More The classification of airborne laser scanning (ALS) point clouds is a critical task of remote sensing and photogrammetry fields. Although recent deep learning-based methods have achieved satisfactory performance, they have ignored the unicity of the receptive field, which makes the ALS point cloud classification remain challenging for the distinguishment of the areas with complex structures and extreme scale variations. In this article, for the objective of configuring multi-receptive field features, we propose a novel receptive field fusion-and-stratification network (RFFS-Net). With a novel dilated graph convolution (DGConv) and its extension annular dilated convolution (ADConv) as basic building blocks, the receptive field fusion process is implemented with the dilated and annular graph fusion (DAGFusion) module, which obtains multi-receptive field feature representation through capturing dilated and annular graphs with various receptive regions. The stratification of the receptive fields with point sets of different resolutions as the calculation bases is performed with Multi-level Decoders nested in RFFS-Net and driven by the multi-level receptive field aggregation loss (MRFALoss) to drive the network to learn in the direction of the supervision labels with different resolutions. With receptive field fusion-and-stratification, RFFS-Net is more adaptable to the classification of regions with complex structures and extreme scale variations in large-scale ALS point clouds. Evaluated on the ISPRS Vaihingen 3D dataset, our RFFS-Net significantly outperforms the baseline approach by 5.3% on mF1 and 5.4% on mIoU, accomplishing an overall accuracy of 82.1%, an mF1 of 71.6%, and an mIoU of 58.2%. Furthermore, experiments on the LASDU dataset and the 2019 IEEE-GRSS Data Fusion Contest dataset show that RFFS-Net achieves a new state-of-the-art classification performance. △ Less

Submitted 20 July, 2022; originally announced July 2022.

Comments: accepted to ISPRS Journal

arXiv:2204.04944 [pdf, other]

Semantic Segmentation for Point Cloud Scenes via Dilated Graph Feature Aggregation and Pyramid Decoders

Authors: Yongqiang Mao, Xian Sun, Kaiqiang Chen, Wenhui Diao, Zonghao Guo, Xiaonan Lu, Kun Fu

Abstract: Semantic segmentation of point clouds generates comprehensive understanding of scenes through densely predicting the category for each point. Due to the unicity of receptive field, semantic segmentation of point clouds remains challenging for the expression of multi-receptive field features, which brings about the misclassification of instances with similar spatial structures. In this paper, we pr… ▽ More Semantic segmentation of point clouds generates comprehensive understanding of scenes through densely predicting the category for each point. Due to the unicity of receptive field, semantic segmentation of point clouds remains challenging for the expression of multi-receptive field features, which brings about the misclassification of instances with similar spatial structures. In this paper, we propose a graph convolutional network DGFA-Net rooted in dilated graph feature aggregation (DGFA), guided by multi-basis aggregation loss (MALoss) calculated through Pyramid Decoders. To configure multi-receptive field features, DGFA which takes the proposed dilated graph convolution (DGConv) as its basic building block, is designed to aggregate multi-scale feature representation by capturing dilated graphs with various receptive regions. By simultaneously considering penalizing the receptive field information with point sets of different resolutions as calculation bases, we introduce Pyramid Decoders driven by MALoss for the diversity of receptive field bases. Combining these two aspects, DGFA-Net significantly improves the segmentation performance of instances with similar spatial structures. Experiments on S3DIS, ShapeNetPart and Toronto-3D show that DGFA-Net outperforms the baseline approach, achieving a new state-of-the-art segmentation performance. △ Less

Submitted 3 November, 2022; v1 submitted 11 April, 2022; originally announced April 2022.

Comments: AAAI Workshop 2022

arXiv:2107.08591 [pdf, other]

doi 10.1109/TIP.2021.3083113

Double Similarity Distillation for Semantic Image Segmentation

Authors: Yingchao Feng, Xian Sun, Wenhui Diao, Jihao Li, Xin Gao

Abstract: The balance between high accuracy and high speed has always been a challenging task in semantic image segmentation. Compact segmentation networks are more widely used in the case of limited resources, while their performances are constrained. In this paper, motivated by the residual learning and global aggregation, we propose a simple yet general and effective knowledge distillation framework call… ▽ More The balance between high accuracy and high speed has always been a challenging task in semantic image segmentation. Compact segmentation networks are more widely used in the case of limited resources, while their performances are constrained. In this paper, motivated by the residual learning and global aggregation, we propose a simple yet general and effective knowledge distillation framework called double similarity distillation (DSD) to improve the classification accuracy of all existing compact networks by capturing the similarity knowledge in pixel and category dimensions, respectively. Specifically, we propose a pixel-wise similarity distillation (PSD) module that utilizes residual attention maps to capture more detailed spatial dependencies across multiple layers. Compared with exiting methods, the PSD module greatly reduces the amount of calculation and is easy to expand. Furthermore, considering the differences in characteristics between semantic segmentation task and other computer vision tasks, we propose a category-wise similarity distillation (CSD) module, which can help the compact segmentation network strengthen the global category correlation by constructing the correlation matrix. Combining these two modules, DSD framework has no extra parameters and only a minimal increase in FLOPs. Extensive experiments on four challenging datasets, including Cityscapes, CamVid, ADE20K, and Pascal VOC 2012, show that DSD outperforms current state-of-the-art methods, proving its effectiveness and generality. The code and models will be publicly available. △ Less

Submitted 18 July, 2021; originally announced July 2021.

Comments: Published in IEEE Transaction on Image Processing (TIP) 2021, Volume: 30

arXiv:2103.05569 [pdf, other]

FAIR1M: A Benchmark Dataset for Fine-grained Object Recognition in High-Resolution Remote Sensing Imagery

Authors: Xian Sun, Peijin Wang, Zhiyuan Yan, Feng Xu, Ruiping Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, Martin Weinmann, Stefan Hinz, Cheng Wang, Kun Fu

Abstract: With the rapid development of deep learning, many deep learning-based approaches have made great achievements in object detection task. It is generally known that deep learning is a data-driven method. Data directly impact the performance of object detectors to some extent. Although existing datasets have included common objects in remote sensing images, they still have some limitations in terms o… ▽ More With the rapid development of deep learning, many deep learning-based approaches have made great achievements in object detection task. It is generally known that deep learning is a data-driven method. Data directly impact the performance of object detectors to some extent. Although existing datasets have included common objects in remote sensing images, they still have some limitations in terms of scale, categories, and images. Therefore, there is a strong requirement for establishing a large-scale benchmark on object detection in high-resolution remote sensing images. In this paper, we propose a novel benchmark dataset with more than 1 million instances and more than 15,000 images for Fine-grAined object recognItion in high-Resolution remote sensing imagery which is named as FAIR1M. All objects in the FAIR1M dataset are annotated with respect to 5 categories and 37 sub-categories by oriented bounding boxes. Compared with existing detection datasets dedicated to object detection, the FAIR1M dataset has 4 particular characteristics: (1) it is much larger than other existing object detection datasets both in terms of the quantity of instances and the quantity of images, (2) it provides more rich fine-grained category information for objects in remote sensing images, (3) it contains geographic information such as latitude, longitude and resolution, (4) it provides better image quality owing to a careful data cleaning procedure. To establish a baseline for fine-grained object recognition, we propose a novel evaluation method and benchmark fine-grained object detection tasks and a visual classification task using several State-Of-The-Art (SOTA) deep learning-based models on our FAIR1M dataset. Experimental results strongly indicate that the FAIR1M dataset is closer to practical application and it is considerably more challenging than existing datasets. △ Less

Submitted 24 March, 2021; v1 submitted 9 March, 2021; originally announced March 2021.

Comments: 19 pages, 13 figures

arXiv:2001.02870 [pdf, other]

Hybrid Multiple Attention Network for Semantic Segmentation in Aerial Images

Authors: Ruigang Niu, Xian Sun, Yu Tian, Wenhui Diao, Kaiqiang Chen, Kun Fu

Abstract: Semantic segmentation in very high resolution (VHR) aerial images is one of the most challenging tasks in remote sensing image understanding. Most of the current approaches are based on deep convolutional neural networks (DCNNs). However, standard convolution with local receptive fields fails in modeling global dependencies. Prior researches have indicated that attention-based methods can capture… ▽ More Semantic segmentation in very high resolution (VHR) aerial images is one of the most challenging tasks in remote sensing image understanding. Most of the current approaches are based on deep convolutional neural networks (DCNNs). However, standard convolution with local receptive fields fails in modeling global dependencies. Prior researches have indicated that attention-based methods can capture long-range dependencies and further reconstruct the feature maps for better representation. Nevertheless, limited by the mere perspective of spacial and channel attention and huge computation complexity of self-attention mechanism, it is unlikely to model the effective semantic interdependencies between each pixel-pair of remote sensing data of complex spectra. In this work, we propose a novel attention-based framework named Hybrid Multiple Attention Network (HMANet) to adaptively capture global correlations from the perspective of space, channel and category in a more effective and efficient manner. Concretely, a class augmented attention (CAA) module embedded with a class channel attention (CCA) module can be used to compute category-based correlation and recalibrate the class-level information. Additionally, we introduce a simple yet effective region shuffle attention (RSA) module to reduce feature redundant and improve the efficiency of self-attention mechanism via region-wise representations. Extensive experimental results on the ISPRS Vaihingen and Potsdam benchmark demonstrate the effectiveness and efficiency of our HMANet over other state-of-the-art methods. △ Less

Submitted 14 September, 2020; v1 submitted 9 January, 2020; originally announced January 2020.

arXiv:1912.07781 [pdf, other]

Efficient Identifying the Orientation of Single NV Centers in Diamond and Using them to Detect Near Field Microwave

Authors: Xuerui Song, Fupan Feng, Chunxiao Cai, Guanzhong Wang, Wei Zhu, Wenting Diao, Chongdi Duan

Abstract: Arrays of NV centers in the diamond have the potential in the fields of chip-scale quantum information processing and nanoscale quantum sensing. However, determining their orientations one by one is resource intensive and time consuming. Here, in this paper, by combining scanning confocal fluorescence images and optical detected magnetic resonance, we realized a method of identifying single NV cen… ▽ More Arrays of NV centers in the diamond have the potential in the fields of chip-scale quantum information processing and nanoscale quantum sensing. However, determining their orientations one by one is resource intensive and time consuming. Here, in this paper, by combining scanning confocal fluorescence images and optical detected magnetic resonance, we realized a method of identifying single NV centers with the same orientation, which is practicable and high efficiency. In the proof of principle experiment, five single NV centers with the same orientation in a NV center array were identified. After that, using the five single NV centers, microwave near field generated by a 20 μm-diameter Cu antenna was also measured by reading the fluourescence intensity change and Rabi frequency at different microwave source power. The gradient of near field microwave at sub-microscale can be resoluted by using arry of NV centers in our work. This work promotes the quantum sensing using arrays of NV centers. △ Less

Submitted 16 December, 2019; originally announced December 2019.

Comments: 4 pages, 3 figures

arXiv:1904.09823 [pdf, other]

Ship Instance Segmentation From Remote Sensing Images Using Sequence Local Context Module

Authors: Yingchao Feng, Wenhui Diao, Zhonghan Chang, Menglong Yan, Xian Sun, Xin Gao

Abstract: The performance of object instance segmentation in remote sensing images has been greatly improved through the introduction of many landmark frameworks based on convolutional neural network. However, the object densely issue still affects the accuracy of such segmentation frameworks. Objects of the same class are easily confused, which is most likely due to the close docking between objects. We th… ▽ More The performance of object instance segmentation in remote sensing images has been greatly improved through the introduction of many landmark frameworks based on convolutional neural network. However, the object densely issue still affects the accuracy of such segmentation frameworks. Objects of the same class are easily confused, which is most likely due to the close docking between objects. We think context information is critical to address this issue. So, we propose a novel framework called SLCMASK-Net, in which a sequence local context module (SLC) is introduced to avoid confusion between objects of the same class. The SLC module applies a sequence of dilation convolution blocks to progressively learn multi-scale context information in the mask branch. Besides, we try to add SLC module to different locations in our framework and experiment with the effect of different parameter settings. Comparative experiments are conducted on remote sensing images acquired by QuickBird with a resolution of $0.5m-1m$ and the results show that the proposed method achieves state-of-the-art performance. △ Less

Submitted 22 April, 2019; originally announced April 2019.

Comments: 4 pages, 5 figures, IEEE Geoscience and Remote Sensing Society 2019

arXiv:1808.05444 [pdf, other]

DRLGENCERT: Deep Learning-based Automated Testing of Certificate Verification in SSL/TLS Implementations

Authors: Chao Chen, Wenrui Diao, Yingpei Zeng, Shanqing Guo, Chengyu Hu

Abstract: The Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols are the foundation of network security. The certificate verification in SSL/TLS implementations is vital and may become the weak link in the whole network ecosystem. In previous works, some research focused on the automated testing of certificate verification, and the main approaches rely on generating massive certificates… ▽ More The Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols are the foundation of network security. The certificate verification in SSL/TLS implementations is vital and may become the weak link in the whole network ecosystem. In previous works, some research focused on the automated testing of certificate verification, and the main approaches rely on generating massive certificates through randomly combining parts of seed certificates for fuzzing. Although the generated certificates could meet the semantic constraints, the cost is quite heavy, and the performance is limited due to the randomness. To fill this gap, in this paper, we propose DRLGENCERT, the first framework of applying deep reinforcement learning to the automated testing of certificate verification in SSL/TLS implementations. DRLGENCERT accepts ordinary certificates as input and outputs newly generated certificates which could trigger discrepancies with high efficiency. Benefited by the deep reinforcement learning, when generating certificates, our framework could choose the best next action according to the result of a previous modification, instead of simple random combinations. At the same time, we developed a set of new techniques to support the overall design, like new feature extraction method for X.509 certificates, fine-grained differential testing, and so forth. Also, we implemented a prototype of DRLGENCERT and carried out a series of real-world experiments. The results show DRLGENCERT is quite efficient, and we obtained 84,661 discrepancy-triggering certificates from 181,900 certificate seeds, say around 46.5% effectiveness. Also, we evaluated six popular SSL/TLS implementations, including GnuTLS, MatrixSSL, MbedTLS, NSS, OpenSSL, and wolfSSL. DRLGENCERT successfully discovered 23 serious certificate verification flaws, and most of them were previously unknown. △ Less

Submitted 16 August, 2018; originally announced August 2018.

arXiv:1801.01633 [pdf, other]

Understanding Android Obfuscation Techniques: A Large-Scale Investigation in the Wild

Authors: Shuaike Dong, Menghao Li, Wenrui Diao, Xiangyu Liu, Jian Liu, Zhou Li, Fenghao Xu, Kai Chen, Xiaofeng Wang, Kehuan Zhang

Abstract: In this paper, we seek to better understand Android obfuscation and depict a holistic view of the usage of obfuscation through a large-scale investigation in the wild. In particular, we focus on four popular obfuscation approaches: identifier renaming, string encryption, Java reflection, and packing. To obtain the meaningful statistical results, we designed efficient and lightweight detection mode… ▽ More In this paper, we seek to better understand Android obfuscation and depict a holistic view of the usage of obfuscation through a large-scale investigation in the wild. In particular, we focus on four popular obfuscation approaches: identifier renaming, string encryption, Java reflection, and packing. To obtain the meaningful statistical results, we designed efficient and lightweight detection models for each obfuscation technique and applied them to our massive APK datasets (collected from Google Play, multiple third-party markets, and malware databases). We have learned several interesting facts from the result. For example, malware authors use string encryption more frequently, and more apps on third-party markets than Google Play are packed. We are also interested in the explanation of each finding. Therefore we carry out in-depth code analysis on some Android apps after sampling. We believe our study will help developers select the most suitable obfuscation approach, and in the meantime help researchers improve code analysis systems in the right direction. △ Less

Submitted 5 January, 2018; originally announced January 2018.

arXiv:1703.09809 [pdf, other]

Understanding IoT Security Through the Data Crystal Ball: Where We Are Now and Where We Are Going to Be

Authors: Nan Zhang, Soteris Demetriou, Xianghang Mi, Wenrui Diao, Kan Yuan, Peiyuan Zong, Feng Qian, XiaoFeng Wang, Kai Chen, Yuan Tian, Carl A. Gunter, Kehuan Zhang, Patrick Tague, Yue-Hsun Lin

Abstract: Inspired by the boom of the consumer IoT market, many device manufacturers, start-up companies and technology giants have jumped into the space. Unfortunately, the exciting utility and rapid marketization of IoT, come at the expense of privacy and security. Industry reports and academic work have revealed many attacks on IoT systems, resulting in privacy leakage, property loss and large-scale avai… ▽ More Inspired by the boom of the consumer IoT market, many device manufacturers, start-up companies and technology giants have jumped into the space. Unfortunately, the exciting utility and rapid marketization of IoT, come at the expense of privacy and security. Industry reports and academic work have revealed many attacks on IoT systems, resulting in privacy leakage, property loss and large-scale availability problems. To mitigate such threats, a few solutions have been proposed. However, it is still less clear what are the impacts they can have on the IoT ecosystem. In this work, we aim to perform a comprehensive study on reported attacks and defenses in the realm of IoT aiming to find out what we know, where the current studies fall short and how to move forward. To this end, we first build a toolkit that searches through massive amount of online data using semantic analysis to identify over 3000 IoT-related articles. Further, by clustering such collected data using machine learning technologies, we are able to compare academic views with the findings from industry and other sources, in an attempt to understand the gaps between them, the trend of the IoT security risks and new problems that need further attention. We systemize this process, by proposing a taxonomy for the IoT ecosystem and organizing IoT security into five problem areas. We use this taxonomy as a beacon to assess each IoT work across a number of properties we define. Our assessment reveals that relevant security and privacy problems are far from solved. We discuss how each proposed solution can be applied to a problem area and highlight their strengths, assumptions and constraints. We stress the need for a security framework for IoT vendors and discuss the trend of shifting security liability to external or centralized entities. We also identify open research problems and provide suggestions towards a secure IoT ecosystem. △ Less

Submitted 28 March, 2017; originally announced March 2017.

arXiv:1605.06610 [pdf, other]

Vulnerable GPU Memory Management: Towards Recovering Raw Data from GPU

Authors: Zhe Zhou, Wenrui Diao, Xiangyu Liu, Zhou Li, Kehuan Zhang, Rui Liu

Abstract: In this paper, we present that security threats coming with existing GPU memory management strategy are overlooked, which opens a back door for adversaries to freely break the memory isolation: they enable adversaries without any privilege in a computer to recover the raw memory data left by previous processes directly. More importantly, such attacks can work on not only normal multi-user operatin… ▽ More In this paper, we present that security threats coming with existing GPU memory management strategy are overlooked, which opens a back door for adversaries to freely break the memory isolation: they enable adversaries without any privilege in a computer to recover the raw memory data left by previous processes directly. More importantly, such attacks can work on not only normal multi-user operating systems, but also cloud computing platforms. To demonstrate the seriousness of such attacks, we recovered original data directly from GPU memory residues left by exited commodity applications, including Google Chrome, Adobe Reader, GIMP, Matlab. The results show that, because of the vulnerable memory management strategy, commodity applications in our experiments are all affected. △ Less

Submitted 21 May, 2016; originally announced May 2016.

arXiv:1407.5410 [pdf, other]

An Empirical Study on Android for Saving Non-shared Data on Public Storage

Authors: Xiangyu Liu, Zhe Zhou, Wenrui Diao, Zhou Li, Kehuan Zhang

Abstract: With millions of apps that can be downloaded from official or third-party market, Android has become one of the most popular mobile platforms today. These apps help people in all kinds of ways and thus have access to lots of user's data that in general fall into three categories: sensitive data, data to be shared with other apps, and non-sensitive data not to be shared with others. For the first a… ▽ More With millions of apps that can be downloaded from official or third-party market, Android has become one of the most popular mobile platforms today. These apps help people in all kinds of ways and thus have access to lots of user's data that in general fall into three categories: sensitive data, data to be shared with other apps, and non-sensitive data not to be shared with others. For the first and second type of data, Android has provided very good storage models: an app's private sensitive data are saved to its private folder that can only be access by the app itself, and the data to be shared are saved to public storage (either the external SD card or the emulated SD card area on internal FLASH memory). But for the last type, i.e., an app's non-sensitive and non-shared data, there is a big problem in Android's current storage model which essentially encourages an app to save its non-sensitive data to shared public storage that can be accessed by other apps. At first glance, it seems no problem to do so, as those data are non-sensitive after all, but it implicitly assumes that app developers could correctly identify all sensitive data and prevent all possible information leakage from private-but-non-sensitive data. In this paper, we will demonstrate that this is an invalid assumption with a thorough survey on information leaks of those apps that had followed Android's recommended storage model for non-sensitive data. Our studies showed that highly sensitive information from billions of users can be easily hacked by exploiting the mentioned problematic storage model. Although our empirical studies are based on a limited set of apps, the identified problems are never isolated or accidental bugs of those apps being investigated. On the contrary, the problem is rooted from the vulnerable storage model recommended by Android. To mitigate the threat, we also propose a defense framework. △ Less

Submitted 10 November, 2014; v1 submitted 21 July, 2014; originally announced July 2014.

arXiv:1407.4923 [pdf, other]

Your Voice Assistant is Mine: How to Abuse Speakers to Steal Information and Control Your Phone

Authors: Wenrui Diao, Xiangyu Liu, Zhe Zhou, Kehuan Zhang

Abstract: Previous research about sensor based attacks on Android platform focused mainly on accessing or controlling over sensitive device components, such as camera, microphone and GPS. These approaches get data from sensors directly and need corresponding sensor invoking permissions. This paper presents a novel approach (GVS-Attack) to launch permission bypassing attacks from a zero permission Android… ▽ More Previous research about sensor based attacks on Android platform focused mainly on accessing or controlling over sensitive device components, such as camera, microphone and GPS. These approaches get data from sensors directly and need corresponding sensor invoking permissions. This paper presents a novel approach (GVS-Attack) to launch permission bypassing attacks from a zero permission Android application (VoicEmployer) through the speaker. The idea of GVS-Attack utilizes an Android system built-in voice assistant module -- Google Voice Search. Through Android Intent mechanism, VoicEmployer triggers Google Voice Search to the foreground, and then plays prepared audio files (like "call number 1234 5678") in the background. Google Voice Search can recognize this voice command and execute corresponding operations. With ingenious designs, our GVS-Attack can forge SMS/Email, access privacy information, transmit sensitive data and achieve remote control without any permission. Also we found a vulnerability of status checking in Google Search app, which can be utilized by GVS-Attack to dial arbitrary numbers even when the phone is securely locked with password. A prototype of VoicEmployer has been implemented to demonstrate the feasibility of GVS-Attack in real world. In theory, nearly all Android devices equipped with Google Services Framework can be affected by GVS-Attack. This study may inspire application developers and researchers rethink that zero permission doesn't mean safety and the speaker can be treated as a new attack surface. △ Less

Submitted 18 July, 2014; originally announced July 2014.

arXiv:1407.0803 [pdf, other]

Acoustic Fingerprinting Revisited: Generate Stable Device ID Stealthy with Inaudible Sound

Authors: Zhe Zhou, Wenrui Diao, Xiangyu Liu, Kehuan Zhang

Abstract: The popularity of mobile device has made people's lives more convenient, but threatened people's privacy at the same time. As end users are becoming more and more concerned on the protection of their private information, it is even harder to track a specific user using conventional technologies. For example, cookies might be cleared by users regularly. Apple has stopped apps accessing UDIDs, and A… ▽ More The popularity of mobile device has made people's lives more convenient, but threatened people's privacy at the same time. As end users are becoming more and more concerned on the protection of their private information, it is even harder to track a specific user using conventional technologies. For example, cookies might be cleared by users regularly. Apple has stopped apps accessing UDIDs, and Android phones use some special permission to protect IMEI code. To address this challenge, some recent studies have worked on tracing smart phones using the hardware features resulted from the imperfect manufacturing process. These works have demonstrated that different devices can be differentiated to each other. However, it still has a long way to go in order to replace cookie and be deployed in real world scenarios, especially in terms of properties like uniqueness, robustness, etc. In this paper, we presented a novel method to generate stable and unique device ID stealthy for smartphones by exploiting the frequency response of the speaker. With carefully selected audio frequencies and special sound wave patterns, we can reduce the impacts of non-linear effects and noises, and keep our feature extraction process un-noticeable to users. The extracted feature is not only very stable for a given smart phone speaker, but also unique to that phone. The feature contains rich information that is equivalent to around 40 bits of entropy, which is enough to identify billions of different smart phones of the same model. We have built a prototype to evaluate our method, and the results show that the generated device ID can be used as a replacement of cookie. △ Less

Submitted 3 July, 2014; originally announced July 2014.

Comments: 12 pages

Showing 1–33 of 33 results for author: Diao, W