Search | arXiv e-print repository

Cracking Instance Jigsaw Puzzles: An Alternative to Multiple Instance Learning for Whole Slide Image Analysis

Authors: Xiwen Chen, Peijie Qiu, Wenhui Zhu, Hao Wang, Huayu Li, Xuanzhao Dong, Xiaotong Sun, Xiaobing Yu, Yalin Wang, Abolfazl Razi, Aristeidis Sotiras

Abstract: While multiple instance learning (MIL) has shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but… ▽ More While multiple instance learning (MIL) has shown to be a promising approach for histopathological whole slide image (WSI) analysis, its reliance on permutation invariance significantly limits its capacity to effectively uncover semantic correlations between instances within WSIs. Based on our empirical and theoretical investigations, we argue that approaches that are not permutation-invariant but better capture spatial correlations between instances can offer more effective solutions. In light of these findings, we propose a novel alternative to existing MIL for WSI analysis by learning to restore the order of instances from their randomly shuffled arrangement. We term this task as cracking an instance jigsaw puzzle problem, where semantic correlations between instances are uncovered. To tackle the instance jigsaw puzzles, we propose a novel Siamese network solution, which is theoretically justified by optimal transport theory. We validate the proposed method on WSI classification and survival prediction tasks, where the proposed method outperforms the recent state-of-the-art MIL competitors. The code is available at https://github.com/xiwenc1/MIL-JigsawPuzzles. △ Less

Submitted 10 July, 2025; originally announced July 2025.

Comments: Accepted by ICCV2025

arXiv:2507.03729 [pdf, ps, other]

Improving SAGIN Resilience to Jamming with Reconfigurable Intelligent Surfaces

Authors: Leila Marandi, Khaled Humadi, Gunes Karabulut Kurt, Wessam Ajib, Wei-Ping Zhu

Abstract: This study investigates the anti-jamming space-air-ground integrated network (SAGIN) scenario wherein a reconfigurable intelligent surface (RIS) is deployed on a fixed Unmanned Aerial Vehicle (UAV) to counteract malevolent jamming attacks. In contrast to existing research, in this paper, we consider that a Low Earth Orbit (LEO) satellite is sending the signal to the user on the ground in the prese… ▽ More This study investigates the anti-jamming space-air-ground integrated network (SAGIN) scenario wherein a reconfigurable intelligent surface (RIS) is deployed on a fixed Unmanned Aerial Vehicle (UAV) to counteract malevolent jamming attacks. In contrast to existing research, in this paper, we consider that a Low Earth Orbit (LEO) satellite is sending the signal to the user on the ground in the presence of jamming from a Geostationary Equatorial Orbit (GEO) satellite side. We aim to maximize the signal-to-jamming plus noise ratio (SJNR) by optimizing the RIS beamforming and transmit power of the LEO satellite. Assuming the availability of global channel state information (CSI) at the RIS, we propose alternating optimization (AO) and semidefinite relaxation (SDR) techniques to address the complexity. Simulation results show that the optimization schemes lead to considerable performance improvements. The results also indicate that, given the high jamming power and the relatively small number of RIS elements, deploying the RIS on UAVs near the user is more effective in mitigating the impact of jamming interferers. △ Less

Submitted 4 July, 2025; originally announced July 2025.

Comments: Accepted at IEEE VTC-Fall 2025. 7 pages, 4 figures

arXiv:2506.23203 [pdf, ps, other]

Multi-Branch DNN and CRLB-Ratio-Weight Fusion for Enhanced DOA Sensing via a Massive H$^2$AD MIMO Receiver

Authors: Feng Shu, Jiatong Bai, Di Wu, Wei Zhu, Bin Deng, Fuhui Zhou, Jiangzhou Wang

Abstract: As a green MIMO structure, massive H$^2$AD is viewed as a potential technology for the future 6G wireless network. For such a structure, it is a challenging task to design a low-complexity and high-performance fusion of target direction values sensed by different sub-array groups with fewer use of prior knowledge. To address this issue, a lightweight Cramer-Rao lower bound (CRLB)-ratio-weight fusi… ▽ More As a green MIMO structure, massive H$^2$AD is viewed as a potential technology for the future 6G wireless network. For such a structure, it is a challenging task to design a low-complexity and high-performance fusion of target direction values sensed by different sub-array groups with fewer use of prior knowledge. To address this issue, a lightweight Cramer-Rao lower bound (CRLB)-ratio-weight fusion (WF) method is proposed, which approximates inverse CRLB of each subarray using antenna number reciprocals to eliminate real-time CRLB computation. This reduces complexity and prior knowledge dependence while preserving fusion performance. Moreover, a multi-branch deep neural network (MBDNN) is constructed to further enhance direction-of-arrival (DOA) sensing by leveraging candidate angles from multiple subarrays. The subarray-specific branch networks are integrated with a shared regression module to effectively eliminate pseudo-solutions and fuse true angles. Simulation results show that the proposed CRLB-ratio-WF method achieves DOA sensing performance comparable to CRLB-based methods, while significantly reducing the reliance on prior knowledge. More notably, the proposed MBDNN has superior performance in low-SNR ranges. At SNR $= -15$ dB, it achieves an order-of-magnitude improvement in estimation accuracy compared to CRLB-ratio-WF method. △ Less

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.21349 [pdf, ps, other]

Generalizable Neural Electromagnetic Inverse Scattering

Authors: Yizhe Cheng, Chunxun Tian, Haoru Wang, Wentao Zhu, Xiaoxuan Ma, Yizhou Wang

Abstract: Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by l… ▽ More Solving Electromagnetic Inverse Scattering Problems (EISP) is fundamental in applications such as medical imaging, where the goal is to reconstruct the relative permittivity from scattered electromagnetic field. This inverse process is inherently ill-posed and highly nonlinear, making it particularly challenging. A recent machine learning-based approach, Img-Interiors, shows promising results by leveraging continuous implicit functions. However, it requires case-specific optimization, lacks generalization to unseen data, and fails under sparse transmitter setups (e.g., with only one transmitter). To address these limitations, we revisit EISP from a physics-informed perspective, reformulating it as a two stage inverse transmission-scattering process. This formulation reveals the induced current as a generalizable intermediate representation, effectively decoupling the nonlinear scattering process from the ill-posed inverse problem. Built on this insight, we propose the first generalizable physics-driven framework for EISP, comprising a current estimator and a permittivity solver, working in an end-to-end manner. The current estimator explicitly learns the induced current as a physical bridge between the incident and scattered field, while the permittivity solver computes the relative permittivity directly from the estimated induced current. This design enables data-driven training and generalizable feed-forward prediction of relative permittivity on unseen data while maintaining strong robustness to transmitter sparsity. Extensive experiments show that our method outperforms state-of-the-art approaches in reconstruction accuracy, generalization, and robustness. This work offers a fundamentally new perspective on electromagnetic inverse scattering and represents a major step toward cost-effective practical solutions for electromagnetic imaging. △ Less

Submitted 1 July, 2025; v1 submitted 26 June, 2025; originally announced June 2025.

arXiv:2506.18671 [pdf, ps, other]

TCDiff++: An End-to-end Trajectory-Controllable Diffusion Model for Harmonious Music-Driven Group Choreography

Authors: Yuqin Dai, Wanlu Zhu, Ronghui Li, Xiu Li, Zhenyu Zhang, Jun Li, Jian Yang

Abstract: Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. During the group dance generation process, however, most existing methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding and abrupt swapping in the generation of long group dance. In this paper,… ▽ More Music-driven dance generation has garnered significant attention due to its wide range of industrial applications, particularly in the creation of group choreography. During the group dance generation process, however, most existing methods still face three primary issues: multi-dancer collisions, single-dancer foot sliding and abrupt swapping in the generation of long group dance. In this paper, we propose TCDiff++, a music-driven end-to-end framework designed to generate harmonious group dance. Specifically, to mitigate multi-dancer collisions, we utilize a dancer positioning embedding to better maintain the relative positioning among dancers. Additionally, we incorporate a distance-consistency loss to ensure that inter-dancer distances remain within plausible ranges. To address the issue of single-dancer foot sliding, we introduce a swap mode embedding to indicate dancer swapping patterns and design a Footwork Adaptor to refine raw motion, thereby minimizing foot sliding. For long group dance generation, we present a long group diffusion sampling strategy that reduces abrupt position shifts by injecting positional information into the noisy input. Furthermore, we integrate a Sequence Decoder layer to enhance the model's ability to selectively process long sequences. Extensive experiments demonstrate that our TCDiff++ achieves state-of-the-art performance, particularly in long-duration scenarios, ensuring high-quality and coherent group dance generation. △ Less

Submitted 26 June, 2025; v1 submitted 23 June, 2025; originally announced June 2025.

arXiv:2506.12908 [pdf, ps, other]

Low-Latency Terrestrial Interference Detection for Satellite-to-Device Communications

Authors: Runnan Liu, Weifeng Zhu, Shu Sun, Wenjun Zhang

Abstract: Direct satellite-to-device communication is a promising future direction due to its lower latency and enhanced efficiency. However, intermittent and unpredictable terrestrial interference significantly affects system reliability and performance. Continuously employing sophisticated interference mitigation techniques is practically inefficient. Motivated by the periodic idle intervals characteristi… ▽ More Direct satellite-to-device communication is a promising future direction due to its lower latency and enhanced efficiency. However, intermittent and unpredictable terrestrial interference significantly affects system reliability and performance. Continuously employing sophisticated interference mitigation techniques is practically inefficient. Motivated by the periodic idle intervals characteristic of burst-mode satellite transmissions, this paper investigates online interference detection frameworks specifically tailored for satellite-to-device scenarios. We first rigorously formulate interference detection as a binary hypothesis testing problem, leveraging differences between Rayleigh (no interference) and Rice (interference present) distributions. Then, we propose a cumulative sum (CUSUM)-based online detector for scenarios with known interference directions, explicitly characterizing the trade-off between detection latency and false alarm rate, and establish its asymptotic optimality. For practical scenarios involving unknown interference direction, we further propose a generalized likelihood ratio (GLR)-based detection method, jointly estimating interference direction via the Root-MUSIC algorithm. Numerical results validate our theoretical findings and demonstrate that our proposed methods achieve high detection accuracy with remarkably low latency, highlighting their practical applicability in future satellite-to-device communication systems. △ Less

Submitted 15 June, 2025; originally announced June 2025.

Comments: 6 pages

arXiv:2504.14783 [pdf, other]

How Effective Can Dropout Be in Multiple Instance Learning ?

Authors: Wenhui Zhu, Peijie Qiu, Xiwen Chen, Zhangsihao Yang, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang

Abstract: Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is… ▽ More Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. However, the most commonly used technique (i.e., dropout) for mitigating this issue has yet to be explored in MIL. In this paper, we empirically explore how effective the dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization even under noise attack. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with a negligible computational cost. The code is available at https://github.com/ChongQingNoSubway/MILDropout. △ Less

Submitted 20 May, 2025; v1 submitted 20 April, 2025; originally announced April 2025.

Comments: Accepted by ICML2025

arXiv:2504.11162 [pdf, ps, other]

Scalable Transceiver Design for Multi-User Communication in FDD Massive MIMO Systems via Deep Learning

Authors: Lin Zhu, Weifeng Zhu, Shuowen Zhang, Shuguang Cui, Liang Liu

Abstract: This paper addresses the joint transceiver design, including pilot transmission, channel feature extraction and feedback, as well as precoding, for low-overhead downlink massive multiple-input multiple-output (MIMO) communication in frequency-division duplex (FDD) systems. Although deep learning (DL) has shown great potential in tackling this problem, existing methods often suffer from poor scalab… ▽ More This paper addresses the joint transceiver design, including pilot transmission, channel feature extraction and feedback, as well as precoding, for low-overhead downlink massive multiple-input multiple-output (MIMO) communication in frequency-division duplex (FDD) systems. Although deep learning (DL) has shown great potential in tackling this problem, existing methods often suffer from poor scalability in practical systems, as the solution obtained in the training phase merely works for a fixed feedback capacity and a fixed number of users in the deployment phase. To address this limitation, we propose a novel DL-based framework comprised of choreographed neural networks, which can utilize one training phase to generate all the transceiver solutions used in the deployment phase with varying sizes of feedback codebooks and numbers of users. The proposed framework includes a residual vector-quantized variational autoencoder (RVQ-VAE) for efficient channel feedback and an edge graph attention network (EGAT) for robust multiuser precoding. It can adapt to different feedback capacities by flexibly adjusting the RVQ codebook sizes using the hierarchical codebook structure, and scale with the number of users through a feedback module sharing scheme and the inherent scalability of EGAT. Moreover, a progressive training strategy is proposed to further enhance data transmission performance and generalization capability. Numerical results on a real-world dataset demonstrate the superior scalability and performance of our approach over existing methods. △ Less

Submitted 15 April, 2025; originally announced April 2025.

arXiv:2504.10181 [pdf]

A New Paradigm in IBR Modeling for Power Flow and Short Circuit Analysis

Authors: Zahid Javid, Firdous Ul Nazir, Wentao Zhu, Diptargha Chakravorty, Ahmed Aboushady, Mohamed Galeela

Abstract: The fault characteristics of inverter-based resources (IBRs) are different from conventional synchronous generators. The fault response of IBRs is non-linear due to saturation states and mainly determined by fault ride through (FRT) strategies of the associated voltage source converter (VSC). This results in prohibitively large solution times for power flows considering these short circuit charact… ▽ More The fault characteristics of inverter-based resources (IBRs) are different from conventional synchronous generators. The fault response of IBRs is non-linear due to saturation states and mainly determined by fault ride through (FRT) strategies of the associated voltage source converter (VSC). This results in prohibitively large solution times for power flows considering these short circuit characteristics, especially when the power system states change fast due to uncertainty in IBR generations. To overcome this, a phasor-domain steady state (SS) short circuit (SC) solver for IBR dominated power systems is proposed in this paper, and subsequently the developed IBR models are incorporated with a novel Jacobian-based Power Flow (PF) solver. In this multiphase PF solver, any power system components can be modeled by considering their original non-linear or linear mathematical representations. Moreover, two novel FRT strategies are proposed to fully utilize the converter capacity and to comply with IEEE-2800 2022 std and German grid code. The results are compared with the Electromagnetic Transient (EMT) simulation on the IEEE 34 test network and the 120 kV EPRI benchmark system. The developed IBR sequence domain PF model demonstrates more accurate behavior compared to the classical IBR generator model. The error in calculating the short circuit current with the proposed SC solver is less than 3%, while achieving significant speed improvements of three order of magnitudes. △ Less

Submitted 14 April, 2025; originally announced April 2025.

Comments: 12 Pages, First Revision Submitted

arXiv:2502.14260 [pdf, other]

EyeBench: A Call for More Rigorous Evaluation of Retinal Image Enhancement

Authors: Wenhui Zhu, Xuanzhao Dong, Xin Li, Yujian Xiong, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Zhangsihao Yang, Yi Su, Oana Dumitrascu, Yalin Wang

Abstract: Over the past decade, generative models have achieved significant success in enhancement fundus images.However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hardly to extend to downstream real-world clinical r… ▽ More Over the past decade, generative models have achieved significant success in enhancement fundus images.However, the evaluation of these models still presents a considerable challenge. A comprehensive evaluation benchmark for fundus image enhancement is indispensable for three main reasons: 1) The existing denoising metrics (e.g., PSNR, SSIM) are hardly to extend to downstream real-world clinical research (e.g., Vessel morphology consistency). 2) There is a lack of comprehensive evaluation for both paired and unpaired enhancement methods, along with the need for expert protocols to accurately assess clinical value. 3) An ideal evaluation system should provide insights to inform future developments of fundus image enhancement. To this end, we propose a novel comprehensive benchmark, EyeBench, to provide insights that align enhancement models with clinical needs, offering a foundation for future work to improve the clinical relevance and applicability of generative models for fundus image enhancement. EyeBench has three appealing properties: 1) multi-dimensional clinical alignment downstream evaluation: In addition to evaluating the enhancement task, we provide several clinically significant downstream tasks for fundus images, including vessel segmentation, DR grading, denoising generalization, and lesion segmentation. 2) Medical expert-guided evaluation design: We introduce a novel dataset that promote comprehensive and fair comparisons between paired and unpaired methods and includes a manual evaluation protocol by medical experts. 3) Valuable insights: Our benchmark study provides a comprehensive and rigorous evaluation of existing methods across different downstream tasks, assisting medical experts in making informed choices. Additionally, we offer further analysis of the challenges faced by existing methods. The code is available at \url{https://github.com/Retinal-Research/EyeBench} △ Less

Submitted 19 February, 2025; originally announced February 2025.

arXiv:2502.02295 [pdf, ps, other]

Intelligent Reflecting Surface Based Localization of Mixed Near-Field and Far-Field Targets

Authors: Weifeng Zhu, Qipeng Wang, Shuowen Zhang, Boya Di, Liang Liu, Yonina C. Eldar

Abstract: This paper considers an intelligent reflecting surface (IRS)-assisted bi-static localization architecture for the sixth-generation (6G) integrated sensing and communication (ISAC) network. The system consists of a transmit user, a receive base station (BS), an IRS, and multiple targets in either the far-field or near-field region of the IRS. In particular, we focus on the challenging scenario wher… ▽ More This paper considers an intelligent reflecting surface (IRS)-assisted bi-static localization architecture for the sixth-generation (6G) integrated sensing and communication (ISAC) network. The system consists of a transmit user, a receive base station (BS), an IRS, and multiple targets in either the far-field or near-field region of the IRS. In particular, we focus on the challenging scenario where the line-of-sight (LOS) paths between targets and the BS are blocked, such that the emitted orthogonal frequency division multiplexing (OFDM) signals from the user reach the BS merely via the user-target-IRS-BS path. Based on the signals received by the BS, our goal is to localize the targets by estimating their relative positions to the IRS, instead of to the BS. We show that subspace-based methods, such as the multiple signal classification (MUSIC) algorithm, can be applied onto the BS's received signals to estimate the relative states from the targets to the IRS. To this end, we create a virtual signal via combining user-target-IRS-BS channels over various time slots. By applying MUSIC on such a virtual signal, we are able to detect the far-field targets and the near-field targets, and estimate the angle-of-arrivals (AOAs) and/or ranges from the targets to the IRS. Furthermore, we theoretically verify that the proposed method can perfectly estimate the relative states from the targets to the IRS in the ideal case with infinite coherence blocks. Numerical results verify the effectiveness of our proposed IRS-assisted localization scheme. Our paper demonstrates the potential of employing passive anchors, i.e., IRSs, to improve the sensing coverage of the active anchors, i.e., BSs. △ Less

Submitted 4 February, 2025; originally announced February 2025.

arXiv:2501.08809 [pdf, other]

XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework

Authors: Sida Tian, Can Zhang, Wei Yuan, Wei Tan, Wenjie Zhu

Abstract: In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quali… ▽ More In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine Highlights of Collectibles at WAIC 2023. The project homepage of XMusic is https://xmusic-project.github.io. △ Less

Submitted 15 January, 2025; originally announced January 2025.

Comments: accepted by TMM

arXiv:2412.19071 [pdf, other]

Movable Intelligent Surface (MIS) for Wireless Communications: Architecture, Modeling, Algorithm, and Prototyping

Authors: Ziyuan Zheng, Qingqing Wu, Wen Chen, Xiangming Wu, Weiren Zhu

Abstract: Reconfigurable intelligent surfaces enhance wireless systems by reshaping propagation environments. However, dynamic metasurfaces (MSs) with numerous phase-shift elements incur undesired control and hardware costs. In contrast, static MSs (SMSs), configured with static phase shifts pre-designed for specific communication demands, offer a cost-effective alternative by eliminating element-wise tunin… ▽ More Reconfigurable intelligent surfaces enhance wireless systems by reshaping propagation environments. However, dynamic metasurfaces (MSs) with numerous phase-shift elements incur undesired control and hardware costs. In contrast, static MSs (SMSs), configured with static phase shifts pre-designed for specific communication demands, offer a cost-effective alternative by eliminating element-wise tuning. Nevertheless, SMSs typically support a single beam pattern with limited flexibility. In this paper, we propose a novel Movable Intelligent Surface (MIS) technology that enables dynamic beamforming while maintaining static phase shifts. Specifically, we design a MIS architecture comprising two closely stacked transmissive MSs: a larger fixed-position MS 1 and a smaller movable MS 2. By differentially shifting MS 2's position relative to MS 1, the MIS synthesizes distinct beam patterns. Then, we model the interaction between MS 2 and MS 1 using binary selection matrices and padding vectors and formulate a new optimization problem that jointly designs the MIS phase shifts and selects shifting positions for worst-case signal-to-noise ratio maximization. This position selection, equal to beam pattern scheduling, offers a new degree of freedom for RIS-aided systems. To solve the intractable problem, we develop an efficient algorithm that handles unit-modulus and binary constraints and employs manifold optimization methods. Finally, extensive validation results are provided. We implement a MIS prototype and perform proof-of-concept experiments, demonstrating the MIS's ability to synthesize desired beam patterns that achieve one-dimensional beam steering. Numerical results show that by introducing MS 2 with a few elements, MIS effectively offers beamforming flexibility for significantly improved performance. We also draw insights into the optimal MIS configuration and element allocation strategy. △ Less

Submitted 26 December, 2024; originally announced December 2024.

Comments: 13 pages,10 figures, submitted to IEEE Transactions for possible publications

arXiv:2412.07040 [pdf, other]

Beyond Idle Channels: Unlocking Idle Space with Signal Alignment in Massive MIMO Cognitive Radio Networks

Authors: Weidong Zhu, Xueqian Li, Longwei Wang, Zheng Zhang

Abstract: Cognitive radio networks (CRNs) have traditionally focused on utilizing idle channels to enhance spectrum efficiency. However, as wireless networks grow denser, channel-centric strategies face increasing limitations. This paper introduces a paradigm shift by exploring the underutilized potential of idle spatial dimensions, termed idle space, in co-channel transmissions. By integrating massive mult… ▽ More Cognitive radio networks (CRNs) have traditionally focused on utilizing idle channels to enhance spectrum efficiency. However, as wireless networks grow denser, channel-centric strategies face increasing limitations. This paper introduces a paradigm shift by exploring the underutilized potential of idle spatial dimensions, termed idle space, in co-channel transmissions. By integrating massive multiple-input multiple-output (MIMO) systems with signal alignment techniques, we enable secondary users to transmit without causing interference to primary users by aligning their signals within the null spaces of primary receivers. We propose a comprehensive framework that synergizes spatial spectrum sensing, signal alignment, and resource allocation, specifically designed for secondary users in CRNs. Theoretical analyses and extensive simulations validate the framework, demonstrating substantial gains in spectrum efficiency, throughput, and interference mitigation. The results show that the proposed approach not only ensures interference-free coexistence with primary users but also unlocks untapped spatial resources for secondary transmissions. △ Less

Submitted 9 December, 2024; originally announced December 2024.

Comments: 9 pages, 7 figures

arXiv:2411.01403 [pdf, other]

TPOT: Topology Preserving Optimal Transport in Retinal Fundus Image Enhancement

Authors: Xuanzhao Dong, Wenhui Zhu, Xin Li, Guoxin Sun, Yi Su, Oana M. Dumitrascu, Yalin Wang

Abstract: Retinal fundus photography enhancement is important for diagnosing and monitoring retinal diseases. However, early approaches to retinal image enhancement, such as those based on Generative Adversarial Networks (GANs), often struggle to preserve the complex topological information of blood vessels, resulting in spurious or missing vessel structures. The persistence diagram, which captures topologi… ▽ More Retinal fundus photography enhancement is important for diagnosing and monitoring retinal diseases. However, early approaches to retinal image enhancement, such as those based on Generative Adversarial Networks (GANs), often struggle to preserve the complex topological information of blood vessels, resulting in spurious or missing vessel structures. The persistence diagram, which captures topological features based on the persistence of topological structures under different filtrations, provides a promising way to represent the structure information. In this work, we propose a topology-preserving training paradigm that regularizes blood vessel structures by minimizing the differences of persistence diagrams. We call the resulting framework Topology Preserving Optimal Transport (TPOT). Experimental results on a large-scale dataset demonstrate the superiority of the proposed method compared to several state-of-the-art supervised and unsupervised techniques, both in terms of image quality and performance in the downstream blood vessel segmentation task. The code is available at https://github.com/Retinal-Research/TPOT. △ Less

Submitted 2 November, 2024; originally announced November 2024.

arXiv:2410.22674 [pdf]

Dynamic PET Image Prediction Using a Network Combining Reversible and Irreversible Modules

Authors: Jie Sun, Qian Xia, Chuanfu Sun, Yumei Chen, Huafeng Liu, Wentao Zhu, Qiegen Liu

Abstract: Dynamic positron emission tomography (PET) images can reveal the distribution of tracers in the organism and the dynamic processes involved in biochemical reactions, and it is widely used in clinical practice. Despite the high effectiveness of dynamic PET imaging in studying the kinetics and metabolic processes of radiotracers. Pro-longed scan times can cause discomfort for both patients and medic… ▽ More Dynamic positron emission tomography (PET) images can reveal the distribution of tracers in the organism and the dynamic processes involved in biochemical reactions, and it is widely used in clinical practice. Despite the high effectiveness of dynamic PET imaging in studying the kinetics and metabolic processes of radiotracers. Pro-longed scan times can cause discomfort for both patients and medical personnel. This study proposes a dynamic frame prediction method for dynamic PET imaging, reduc-ing dynamic PET scanning time by applying a multi-module deep learning framework composed of reversible and irreversible modules. The network can predict kinetic parameter images based on the early frames of dynamic PET images, and then generate complete dynamic PET images. In validation experiments with simulated data, our network demonstrated good predictive performance for kinetic parameters and was able to reconstruct high-quality dynamic PET images. Additionally, in clinical data experiments, the network exhibited good generalization performance and attached that the proposed method has promising clinical application prospects. △ Less

Submitted 29 October, 2024; originally announced October 2024.

arXiv:2410.15036 [pdf, other]

EViT-Unet: U-Net Like Efficient Vision Transformer for Medical Image Segmentation on Mobile and Edge Devices

Authors: Xin Li, Wenhui Zhu, Xuanzhao Dong, Oana M. Dumitrascu, Yalin Wang

Abstract: With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied for various tasks. However, their limitations in capturing global features hinder their performance in complex segmentation tasks. The rise of Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT… ▽ More With the rapid development of deep learning, CNN-based U-shaped networks have succeeded in medical image segmentation and are widely applied for various tasks. However, their limitations in capturing global features hinder their performance in complex segmentation tasks. The rise of Vision Transformer (ViT) has effectively compensated for this deficiency of CNNs and promoted the application of ViT-based U-networks in medical image segmentation. However, the high computational demands of ViT make it unsuitable for many medical devices and mobile platforms with limited resources, restricting its deployment on resource-constrained and edge devices. To address this, we propose EViT-UNet, an efficient ViT-based segmentation network that reduces computational complexity while maintaining accuracy, making it ideal for resource-constrained medical devices. EViT-UNet is built on a U-shaped architecture, comprising an encoder, decoder, bottleneck layer, and skip connections, combining convolutional operations with self-attention mechanisms to optimize efficiency. Experimental results demonstrate that EViT-UNet achieves high accuracy in medical image segmentation while significantly reducing computational complexity. △ Less

Submitted 19 October, 2024; originally announced October 2024.

Comments: 5 pages, 3 figures

arXiv:2410.11578 [pdf, other]

STA-Unet: Rethink the semantic redundant for Medical Imaging Segmentation

Authors: Vamsi Krishna Vasa, Wenhui Zhu, Xiwen Chen, Peijie Qiu, Xuanzhao Dong, Yalin Wang

Abstract: In recent years, significant progress has been made in the medical image analysis domain using convolutional neural networks (CNNs). In particular, deep neural networks based on a U-shaped architecture (UNet) with skip connections have been adopted for several medical imaging tasks, including organ segmentation. Despite their great success, CNNs are not good at learning global or semantic features… ▽ More In recent years, significant progress has been made in the medical image analysis domain using convolutional neural networks (CNNs). In particular, deep neural networks based on a U-shaped architecture (UNet) with skip connections have been adopted for several medical imaging tasks, including organ segmentation. Despite their great success, CNNs are not good at learning global or semantic features. Especially ones that require human-like reasoning to understand the context. Many UNet architectures attempted to adjust with the introduction of Transformer-based self-attention mechanisms, and notable gains in performance have been noted. However, the transformers are inherently flawed with redundancy to learn at shallow layers, which often leads to an increase in the computation of attention from the nearby pixels offering limited information. The recently introduced Super Token Attention (STA) mechanism adapts the concept of superpixels from pixel space to token space, using super tokens as compact visual representations. This approach tackles the redundancy by learning efficient global representations in vision transformers, especially for the shallow layers. In this work, we introduce the STA module in the UNet architecture (STA-UNet), to limit redundancy without losing rich information. Experimental results on four publicly available datasets demonstrate the superiority of STA-UNet over existing state-of-the-art architectures in terms of Dice score and IOU for organ segmentation tasks. The code is available at \url{https://github.com/Retinal-Research/STA-UNet}. △ Less

Submitted 13 October, 2024; originally announced October 2024.

arXiv:2409.19595 [pdf, other]

Solution for Temporal Sound Localisation Task of ECCV Second Perception Test Challenge 2024

Authors: Haowei Gu, Weihao Zhu, Yang Yang

Abstract: This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in the video according to a predefined set of sound classes. The champion solution from last year's first competition has explored the TSL by fusing audio and video modalities with the same weight. Considering the TSL task aims to localize sound events,… ▽ More This report proposes an improved method for the Temporal Sound Localisation (TSL) task, which localizes and classifies the sound events occurring in the video according to a predefined set of sound classes. The champion solution from last year's first competition has explored the TSL by fusing audio and video modalities with the same weight. Considering the TSL task aims to localize sound events, we conduct relevant experiments that demonstrated the superiority of sound features (Section 3). Based on our findings, to enhance audio modality features, we employ various models to extract audio features, such as InterVideo, CaVMAE, and VideoMAE models. Our approach ranks first in the final test with a score of 0.4925. △ Less

Submitted 29 September, 2024; originally announced September 2024.

arXiv:2409.10966 [pdf, other]

CUNSB-RFIE: Context-aware Unpaired Neural Schrödinger Bridge in Retinal Fundus Image Enhancement

Authors: Xuanzhao Dong, Vamsi Krishna Vasa, Wenhui Zhu, Peijie Qiu, Xiwen Chen, Yi Su, Yujian Xiong, Zhangsihao Yang, Yanxi Chen, Yalin Wang

Abstract: Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinge… ▽ More Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinger Bridge (SB), offers a more stable solution by utilizing Optimal Transport (OT) theory to model a stochastic differential equation (SDE) between two arbitrary distributions. This allows SB to effectively transform low-quality retinal images into their high-quality counterparts. In this work, we leverage the SB framework to propose an image-to-image translation pipeline for retinal image enhancement. Additionally, previous methods often fail to capture fine structural details, such as blood vessels. To address this, we enhance our pipeline by introducing Dynamic Snake Convolution, whose tortuous receptive field can better preserve tubular structures. We name the resulting retinal fundus image enhancement framework the Context-aware Unpaired Neural Schrödinger Bridge (CUNSB-RFIE). To the best of our knowledge, this is the first endeavor to use the SB approach for retinal image enhancement. Experimental results on a large-scale dataset demonstrate the advantage of the proposed method compared to several state-of-the-art supervised and unsupervised methods in terms of image quality and performance on downstream tasks.The code is available at https://github.com/Retinal-Research/CUNSB-RFIE . △ Less

Submitted 17 September, 2024; originally announced September 2024.

arXiv:2409.07862 [pdf, other]

Context-Aware Optimal Transport Learning for Retinal Fundus Image Enhancement

Authors: Vamsi Krishna Vasa, Peijie Qiu, Wenhui Zhu, Yujian Xiong, Oana Dumitrascu, Yalin Wang

Abstract: Retinal fundus photography offers a non-invasive way to diagnose and monitor a variety of retinal diseases, but is prone to inherent quality glitches arising from systemic imperfections or operator/patient-related factors. However, high-quality retinal images are crucial for carrying out accurate diagnoses and automated analyses. The fundus image enhancement is typically formulated as a distributi… ▽ More Retinal fundus photography offers a non-invasive way to diagnose and monitor a variety of retinal diseases, but is prone to inherent quality glitches arising from systemic imperfections or operator/patient-related factors. However, high-quality retinal images are crucial for carrying out accurate diagnoses and automated analyses. The fundus image enhancement is typically formulated as a distribution alignment problem, by finding a one-to-one mapping between a low-quality image and its high-quality counterpart. This paper proposes a context-informed optimal transport (OT) learning framework for tackling unpaired fundus image enhancement. In contrast to standard generative image enhancement methods, which struggle with handling contextual information (e.g., over-tampered local structures and unwanted artifacts), the proposed context-aware OT learning paradigm better preserves local structures and minimizes unwanted artifacts. Leveraging deep contextual features, we derive the proposed context-aware OT using the earth mover's distance and show that the proposed context-OT has a solid theoretical guarantee. Experimental results on a large-scale dataset demonstrate the superiority of the proposed method over several state-of-the-art supervised and unsupervised methods in terms of signal-to-noise ratio, structural similarity index, as well as two downstream tasks. The code is available at \url{https://github.com/Retinal-Research/Contextual-OT}. △ Less

Submitted 12 September, 2024; originally announced September 2024.

arXiv:2408.15667 [pdf, other]

Towards reliable respiratory disease diagnosis based on cough sounds and vision transformers

Authors: Qian Wang, Zhaoyang Bu, Jiaxuan Mao, Wenyu Zhu, Jingya Zhao, Wei Du, Guochao Shi, Min Zhou, Si Chen, Jieming Qu

Abstract: Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or dee… ▽ More Recent advancements in deep learning techniques have sparked performance boosts in various real-world applications including disease diagnosis based on multi-modal medical data. Cough sound data-based respiratory disease (e.g., COVID-19 and Chronic Obstructive Pulmonary Disease) diagnosis has also attracted much attention. However, existing works usually utilise traditional machine learning or deep models of moderate scales. On the other hand, the developed approaches are trained and evaluated on small-scale data due to the difficulty of curating and annotating clinical data on scale. To address these issues in prior works, we create a unified framework to evaluate various deep models from lightweight Convolutional Neural Networks (e.g., ResNet18) to modern vision transformers and compare their performance in respiratory disease classification. Based on the observations from such an extensive empirical study, we propose a novel approach to cough-based disease classification based on both self-supervised and supervised learning on a large-scale cough data set. Experimental results demonstrate our proposed approach outperforms prior arts consistently on two benchmark datasets for COVID-19 diagnosis and a proprietary dataset for COPD/non-COPD classification with an AUROC of 92.5%. △ Less

Submitted 2 September, 2024; v1 submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.03174 [pdf, ps, other]

Joint Transmission and Compression Optimization for Networked Sensing with Limited-Capacity Fronthaul Links

Authors: Weifeng Zhu, Shuowen Zhang, Liang Liu

Abstract: This paper considers networked sensing in cellular network, where multiple base stations (BSs) first compress their received echo signals from multiple targets and then forward the quantized signals to the central unit (CU) via limited-capacity fronthaul links, such that the CU can leverage all useful echo signals to perform high-resolution localization. Under this setup, we manage to characterize… ▽ More This paper considers networked sensing in cellular network, where multiple base stations (BSs) first compress their received echo signals from multiple targets and then forward the quantized signals to the central unit (CU) via limited-capacity fronthaul links, such that the CU can leverage all useful echo signals to perform high-resolution localization. Under this setup, we manage to characterize the posterior Cramer-Rao Bound (PCRB) for localizing all the targets with random positions as a function of the transmit covariance matrix and the compression noise covariance matrix of each BS. Then, a PCRB minimization problem subject to the transmit power constraints and the fronthaul capacity constraints is formulated to jointly design the BSs' transmission and compression strategies. We propose an efficient algorithm to solve this problem based on the alternating optimization technique. Specifically, it is shown that when either the transmit covariance matrices or the compression noise covariance matrices are fixed, the successive convex approximation (SCA) technique can be leveraged to optimize the other type of covariance matrices locally optimally. Moreover, we also propose a novel estimate-then-beamform-then-compress strategy for the massive receive antenna scenario, under which each BS first estimates targets' angle-of-arrivals (AOAs) locally, then beamforms its high-dimension received signals into low-dimension ones based on the estimated AOAs, and last compresses the beamformed signals for fronthaul transmission. An efficient beamforming and compression design method is devised under this strategy. Numerical results are provided to verify the effectiveness of our proposed algorithms. △ Less

Submitted 6 September, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

Comments: submitted to IEEE TWC; conference paper accepted by IEEE Globecom 2024

arXiv:2407.13092 [pdf, other]

CC-DCNet: Dynamic Convolutional Neural Network with Contrastive Constraints for Identifying Lung Cancer Subtypes on Multi-modality Images

Authors: Yuan Jin, Gege Ma, Geng Chen, Tianling Lyu, Jan Egger, Junhui Lyu, Shaoting Zhang, Wentao Zhu

Abstract: The accurate diagnosis of pathological subtypes of lung cancer is of paramount importance for follow-up treatments and prognosis managements. Assessment methods utilizing deep learning technologies have introduced novel approaches for clinical diagnosis. However, the majority of existing models rely solely on single-modality image input, leading to limited diagnostic accuracy. To this end, we prop… ▽ More The accurate diagnosis of pathological subtypes of lung cancer is of paramount importance for follow-up treatments and prognosis managements. Assessment methods utilizing deep learning technologies have introduced novel approaches for clinical diagnosis. However, the majority of existing models rely solely on single-modality image input, leading to limited diagnostic accuracy. To this end, we propose a novel deep learning network designed to accurately classify lung cancer subtype with multi-dimensional and multi-modality images, i.e., CT and pathological images. The strength of the proposed model lies in its ability to dynamically process both paired CT-pathological image sets as well as independent CT image sets, and consequently optimize the pathology-related feature extractions from CT images. This adaptive learning approach enhances the flexibility in processing multi-dimensional and multi-modality datasets and results in performance elevating in the model testing phase. We also develop a contrastive constraint module, which quantitatively maps the cross-modality associations through network training, and thereby helps to explore the "gold standard" pathological information from the corresponding CT scans. To evaluate the effectiveness, adaptability, and generalization ability of our model, we conducted extensive experiments on a large-scale multi-center dataset and compared our model with a series of state-of-the-art classification models. The experimental results demonstrated the superiority of our model for lung cancer subtype classification, showcasing significant improvements in accuracy metrics such as ACC, AUC, and F1-score. △ Less

Submitted 17 July, 2024; originally announced July 2024.

arXiv:2407.12271 [pdf, other]

RBAD: A Dataset and Benchmark for Retinal Vessels Branching Angle Detection

Authors: Hao Wang, Wenhui Zhu, Jiayou Qin, Xin Li, Oana Dumitrascu, Xiwen Chen, Peijie Qiu, Abolfazl Razi

Abstract: Detecting retinal image analysis, particularly the geometrical features of branching points, plays an essential role in diagnosing eye diseases. However, existing methods used for this purpose often are coarse-level and lack fine-grained analysis for efficient annotation. To mitigate these issues, this paper proposes a novel method for detecting retinal branching angles using a self-configured ima… ▽ More Detecting retinal image analysis, particularly the geometrical features of branching points, plays an essential role in diagnosing eye diseases. However, existing methods used for this purpose often are coarse-level and lack fine-grained analysis for efficient annotation. To mitigate these issues, this paper proposes a novel method for detecting retinal branching angles using a self-configured image processing technique. Additionally, we offer an open-source annotation tool and a benchmark dataset comprising 40 images annotated with retinal branching angles. Our methodology for retinal branching angle detection and calculation is detailed, followed by a benchmark analysis comparing our method with previous approaches. The results indicate that our method is robust under various conditions with high accuracy and efficiency, which offers a valuable instrument for ophthalmic research and clinical applications. △ Less

Submitted 16 July, 2024; originally announced July 2024.

arXiv:2407.06612 [pdf]

AI-based Automatic Segmentation of Prostate on Multi-modality Images: A Review

Authors: Rui Jin, Derun Li, Dehui Xiang, Lei Zhang, Hailing Zhou, Fei Shi, Weifang Zhu, Jing Cai, Tao Peng, Xinjian Chen

Abstract: Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The ad… ▽ More Prostate cancer represents a major threat to health. Early detection is vital in reducing the mortality rate among prostate cancer patients. One approach involves using multi-modality (CT, MRI, US, etc.) computer-aided diagnosis (CAD) systems for the prostate region. However, prostate segmentation is challenging due to imperfections in the images and the prostate's complex tissue structure. The advent of precision medicine and a significant increase in clinical capacity have spurred the need for various data-driven tasks in the field of medical imaging. Recently, numerous machine learning and data mining tools have been integrated into various medical areas, including image segmentation. This article proposes a new classification method that differentiates supervision types, either in number or kind, during the training phase. Subsequently, we conducted a survey on artificial intelligence (AI)-based automatic prostate segmentation methods, examining the advantages and limitations of each. Additionally, we introduce variants of evaluation metrics for the verification and performance assessment of the segmentation method and summarize the current challenges. Finally, future research directions and development trends are discussed, reflecting the outcomes of our literature survey, suggesting high-precision detection and treatment of prostate cancer as a promising avenue. △ Less

Submitted 9 July, 2024; originally announced July 2024.

arXiv:2407.03575 [pdf, other]

DGR-MIL: Exploring Diverse Global Representation in Multiple Instance Learning for Whole Slide Image Classification

Authors: Wenhui Zhu, Xiwen Chen, Peijie Qiu, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang

Abstract: Multiple instance learning (MIL) stands as a powerful approach in weakly supervised learning, regularly employed in histological whole slide image (WSI) classification for detecting tumorous lesions. However, existing mainstream MIL methods focus on modeling correlation between instances while overlooking the inherent diversity among instances. However, few MIL methods have aimed at diversity mode… ▽ More Multiple instance learning (MIL) stands as a powerful approach in weakly supervised learning, regularly employed in histological whole slide image (WSI) classification for detecting tumorous lesions. However, existing mainstream MIL methods focus on modeling correlation between instances while overlooking the inherent diversity among instances. However, few MIL methods have aimed at diversity modeling, which empirically show inferior performance but with a high computational cost. To bridge this gap, we propose a novel MIL aggregation method based on diverse global representation (DGR-MIL), by modeling diversity among instances through a set of global vectors that serve as a summary of all instances. First, we turn the instance correlation into the similarity between instance embeddings and the predefined global vectors through a cross-attention mechanism. This stems from the fact that similar instance embeddings typically would result in a higher correlation with a certain global vector. Second, we propose two mechanisms to enforce the diversity among the global vectors to be more descriptive of the entire bag: (i) positive instance alignment and (ii) a novel, efficient, and theoretically guaranteed diversification learning paradigm. Specifically, the positive instance alignment module encourages the global vectors to align with the center of positive instances (e.g., instances containing tumors in WSI). To further diversify the global representations, we propose a novel diversification learning paradigm leveraging the determinantal point process. The proposed model outperforms the state-of-the-art MIL aggregation models by a substantial margin on the CAMELYON-16 and the TCGA-lung cancer datasets. The code is available at \url{https://github.com/ChongQingNoSubway/DGR-MIL}. △ Less

Submitted 3 July, 2024; originally announced July 2024.

Comments: Accepted to ECCV 2024

arXiv:2406.14896 [pdf, other]

SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation

Authors: Wenhui Zhu, Xiwen Chen, Peijie Qiu, Mohammad Farazi, Aristeidis Sotiras, Abolfazl Razi, Yalin Wang

Abstract: Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important facto… ▽ More Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important factors that potentially affect its performance: (i) irrelative feature learned caused by asymmetric supervision; (ii) feature redundancy in the feature map. To this end, we propose to balance the supervision between encoder and decoder and reduce the redundant information in the UNet. Specifically, we use the feature map that contains the most semantic information (i.e., the last layer of the decoder) to provide additional supervision to other blocks to provide additional supervision and reduce feature redundancy by leveraging feature distillation. The proposed method can be easily integrated into existing UNet architecture in a plug-and-play fashion with negligible computational cost. The experimental results suggest that the proposed method consistently improves the performance of standard UNets on four medical image segmentation datasets. The code is available at \url{https://github.com/ChongQingNoSubway/SelfReg-UNet} △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: Accepted as a conference paper to 2024 MICCAI

arXiv:2406.10856 [pdf, other]

LEO Satellite Networks Assisted Geo-distributed Data Processing

Authors: Zhiyuan Zhao, Zhe Chen, Zheng Lin, Wenjun Zhu, Kun Qiu, Chaoqun You, Yue Gao

Abstract: Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective soluti… ▽ More Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective solution for increasing edge clouds, and hence large volumes of data in edge clouds can be transferred to a core cloud via those networks for time-sensitive big data tasks processing, such as attack detection. However, the state-of-the-art satellite selection algorithms bring poor performance for those processing via our measurements. Therefore, we propose a novel data volume aware satellite selection algorithm, named DVA, to support such big data processing tasks. DVA first takes into account both the data size in edge clouds and satellite capacity to finalize the selection, thereby preventing congestion in the access network and reducing transmitting duration. Extensive simulations validate that DVA has a significantly lower average access network duration than the state-of-the-art satellite selection algorithms in a LEO satellite emulation platform. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: 6 pages, 5 figures

arXiv:2405.19665 [pdf]

A novel fault localization with data refinement for hydroelectric units

Authors: Jialong Huang, Junlin Song, Penglong Lian, Mengjie Gan, Zhiheng Su, Benhao Wang, Wenji Zhu, Xiaomin Pu, Jianxiao Zou, Shicai Fan

Abstract: Due to the scarcity of fault samples and the complexity of non-linear and non-smooth characteristics data in hydroelectric units, most of the traditional hydroelectric unit fault localization methods are difficult to carry out accurate localization. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)- manifold-boosted deep learni… ▽ More Due to the scarcity of fault samples and the complexity of non-linear and non-smooth characteristics data in hydroelectric units, most of the traditional hydroelectric unit fault localization methods are difficult to carry out accurate localization. To address these problems, a sparse autoencoder (SAE)-generative adversarial network (GAN)-wavelet noise reduction (WNR)- manifold-boosted deep learning (SG-WMBDL) based fault localization method for hydroelectric units is proposed. To overcome the data scarcity, a SAE is embedded into the GAN to generate more high-quality samples in the data generation module. Considering the signals involving non-linear and non-smooth characteristics, the improved WNR which combining both soft and hard thresholding and local linear embedding (LLE) are utilized to the data preprocessing module in order to reduce the noise and effectively capture the local features. In addition, to seek higher performance, the novel Adaptive Boost (AdaBoost) combined with multi deep learning is proposed to achieve accurate fault localization. The experimental results show that the SG-WMBDL can locate faults for hydroelectric units under a small number of fault samples with non-linear and non-smooth characteristics on higher precision and accuracy compared to other frontier methods, which verifies the effectiveness and practicality of the proposed method. △ Less

Submitted 29 May, 2024; originally announced May 2024.

Comments: 6pages,4 figures,Conference on Decision and Control(CDC) conference

arXiv:2403.12425 [pdf, other]

Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation

Authors: Jun Yu, Gongpeng Zhao, Yongqi Wang, Zhihong Wei, Yang Zheng, Zerui Zhang, Zhongpeng Cai, Guochen Xie, Jichao Zhu, Wangyuan Zhu

Abstract: This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we… ▽ More This paper presents our approach for the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Through the utilization of Temporal Convolutional Network (TCN) modules, we effectively captured the temporal and spatial correlations between these features. Subsequently, we employed a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability. Our method leverages a multimodal data fusion approach, integrating pre-trained audio and video backbones for feature extraction, followed by TCN-based spatiotemporal encoding and Transformer-based temporal information capture. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in VA estimation on the AffWild2 dataset. △ Less

Submitted 20 March, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: 8 pages,3 figures

arXiv:2403.11757 [pdf, other]

Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation

Authors: Jun Yu, Wangyuan Zhu, Jichao Zhu

Abstract: In this paper, we present the solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of 6th Affective Behavior Analysis in-the-wild (ABAW) Competition.The EMI Estimation challenge task aims to evaluate the emotional intensity of seed videos by assessing them from a set of predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain"… ▽ More In this paper, we present the solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, which is part of 6th Affective Behavior Analysis in-the-wild (ABAW) Competition.The EMI Estimation challenge task aims to evaluate the emotional intensity of seed videos by assessing them from a set of predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain", "Excitement" and "Joy"). To tackle this challenge, we extracted rich dual-channel visual features based on ResNet18 and AUs for the video modality and effective single-channel features based on Wav2Vec2.0 for the audio modality. This allowed us to obtain comprehensive emotional features for the audiovisual modality. Additionally, leveraging a late fusion strategy, we averaged the predictions of the visual and acoustic models, resulting in a more accurate estimation of audiovisual emotional mimicry intensity. Experimental results validate the effectiveness of our approach, with the average Pearson's correlation Coefficient($ρ$) across the 6 emotion dimensionson the validation set achieving 0.3288. △ Less

Submitted 19 March, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

arXiv:2402.11834 [pdf, ps, other]

Terahertz User-Centric Clustering in the Presence of Beam Misalignment

Authors: Khaled Humadi, Imene Trigui, Wei-Ping Zhu, Wessam Ajib

Abstract: Beam misalignment is one of the main challenges for the design of reliable wireless systems in terahertz (THz) bands. This paper investigates how to apply user-centric base station (BS) clustering as a valuable add-on in THz networks. In particular, to reduce the impact of beam misalignment, a user-centric BS clustering design that provides multi-connectivity via BS cooperation is investigated. Th… ▽ More Beam misalignment is one of the main challenges for the design of reliable wireless systems in terahertz (THz) bands. This paper investigates how to apply user-centric base station (BS) clustering as a valuable add-on in THz networks. In particular, to reduce the impact of beam misalignment, a user-centric BS clustering design that provides multi-connectivity via BS cooperation is investigated. The coverage probability is derived by leveraging an accurate approximation of the aggregate interference distribution that captures the effect of beam misalignment and THz fading. The numerical results reveal the impact of beam misalignment with respect to crucial link parameters, such as the transmitter's beam width and the serving cluster size, demonstrating that user-centric BS clustering is a promising enabler of THz networks. △ Less

Submitted 18 February, 2024; originally announced February 2024.

arXiv:2402.10388 [pdf]

Improvising Age Verification Technologies in Canada: Technical, Regulatory and Social Dynamics

Authors: Azfar Adib, Wei-Ping Zhu, M. Omair Ahmad

Abstract: Age verification, which is a mandatory legal requirement for delivering certain age-appropriate services or products, has recently been emphasized around the globe to ensure online safety for children. The rapid advancement of artificial intelligence has facilitated the recent development of some cutting-edge age-verification technologies, particularly using biometrics. However, successful deploym… ▽ More Age verification, which is a mandatory legal requirement for delivering certain age-appropriate services or products, has recently been emphasized around the globe to ensure online safety for children. The rapid advancement of artificial intelligence has facilitated the recent development of some cutting-edge age-verification technologies, particularly using biometrics. However, successful deployment and mass acceptance of these technologies are significantly dependent on the corresponding socio-economic and regulatory context. This paper reviews such key dynamics for improvising age-verification technologies in Canada. It is particularly essential for such technologies to be inclusive, transparent, adaptable, privacy-preserving, and secure. Effective collaboration between academia, government, and industry entities can help to meet the growing demands for age-verification services in Canada while maintaining a user-centric approach. △ Less

Submitted 15 February, 2024; originally announced February 2024.

Comments: Presented and accepted for publication in the 2023 IEEE International Humanitarian Technologies Conference (IEEE IHTC 2023), November 1 to 3, 2023, Cartagena, Colombia

arXiv:2401.04154 [pdf]

Efficient Selective Audio Masked Multimodal Bottleneck Transformer for Audio-Video Classification

Authors: Wentao Zhu

Abstract: Audio and video are two most common modalities in the mainstream media platforms, e.g., YouTube. To learn from multimodal videos effectively, in this work, we propose a novel audio-video recognition approach termed audio video Transformer, AVT, leveraging the effective spatio-temporal representation by the video Transformer to improve action recognition accuracy. For multimodal fusion, simply conc… ▽ More Audio and video are two most common modalities in the mainstream media platforms, e.g., YouTube. To learn from multimodal videos effectively, in this work, we propose a novel audio-video recognition approach termed audio video Transformer, AVT, leveraging the effective spatio-temporal representation by the video Transformer to improve action recognition accuracy. For multimodal fusion, simply concatenating multimodal tokens in a cross-modal Transformer requires large computational and memory resources, instead we reduce the cross-modality complexity through an audio-video bottleneck Transformer. To improve the learning efficiency of multimodal Transformer, we integrate self-supervised objectives, i.e., audio-video contrastive learning, audio-video matching, and masked audio and video learning, into AVT training, which maps diverse audio and video representations into a common multimodal representation space. We further propose a masked audio segment loss to learn semantic audio activities in AVT. Extensive experiments and ablation studies on three public datasets and two in-house datasets consistently demonstrate the effectiveness of the proposed AVT. Specifically, AVT outperforms its previous state-of-the-art counterparts on Kinetics-Sounds by 8%. AVT also surpasses one of the previous state-of-the-art video Transformers [25] by 10% on VGGSound by leveraging the audio signal. Compared to one of the previous state-of-the-art multimodal methods, MBT [32], AVT is 1.3% more efficient in terms of FLOPs and improves the accuracy by 3.8% on Epic-Kitchens-100. △ Less

Submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted by WACV 2024; well-formatted PDF is in https://drive.google.com/file/d/1qvW52lamsvNGMCqPS7q8g8L4NaR_LlbR/view?usp=sharing. arXiv admin note: text overlap with arXiv:2401.04023

arXiv:2401.04023 [pdf]

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification

Authors: Wentao Zhu

Abstract: In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development. In this work, we develop a multiscale multimodal Transformer (MMT) that leverages hierarchical representation learning. Particularly, MMT is composed of a nove… ▽ More In recent years, researchers combine both audio and video signals to deal with challenges where actions are not well represented or captured by visual cues. However, how to effectively leverage the two modalities is still under development. In this work, we develop a multiscale multimodal Transformer (MMT) that leverages hierarchical representation learning. Particularly, MMT is composed of a novel multiscale audio Transformer (MAT) and a multiscale video Transformer [43]. To learn a discriminative cross-modality fusion, we further design multimodal supervised contrastive objectives called audio-video contrastive loss (AVC) and intra-modal contrastive loss (IMC) that robustly align the two modalities. MMT surpasses previous state-of-the-art approaches by 7.3% and 2.1% on Kinetics-Sounds and VGGSound in terms of the top-1 accuracy without external training data. Moreover, the proposed MAT significantly outperforms AST [28] by 22.2%, 4.4% and 4.7% on three public benchmark datasets, and is about 3% more efficient based on the number of FLOPs and 9.8% more efficient based on GPU memory usage. △ Less

Submitted 8 January, 2024; originally announced January 2024.

Comments: Accepted by WACV 2024; well-formatted PDF is in https://drive.google.com/file/d/10Zo_ydJZFAm7YsxHDgTjhyc4dEJbW_dk/view?usp=sharing

arXiv:2312.16228 [pdf, other]

Deformable Audio Transformer for Audio Event Detection

Authors: Wentao Zhu

Abstract: Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity in self-attention computation has limited the applications, especially in low-resource settings and mobile or edge devices. Existing works have proposed to exploit hand-crafted attention patterns to reduce computation complexity. However, such hand-crafted patterns are data-agnostic and may not be… ▽ More Transformers have achieved promising results on a variety of tasks. However, the quadratic complexity in self-attention computation has limited the applications, especially in low-resource settings and mobile or edge devices. Existing works have proposed to exploit hand-crafted attention patterns to reduce computation complexity. However, such hand-crafted patterns are data-agnostic and may not be optimal. Hence, it is likely that relevant keys or values are being reduced, while less important ones are still preserved. Based on this key insight, we propose a novel deformable audio Transformer for audio recognition, named DATAR, where a deformable attention equipping with a pyramid transformer backbone is constructed and learnable. Such an architecture has been proven effective in prediction tasks,~\textit{e.g.}, event classification. Moreover, we identify that the deformable attention map computation may over-simplify the input feature, which can be further enhanced. Hence, we introduce a learnable input adaptor to alleviate this issue, and DATAR achieves state-of-the-art performance. △ Less

Submitted 7 January, 2024; v1 submitted 24 December, 2023; originally announced December 2023.

Comments: ICASSP 2024. arXiv admin note: substantial text overlap with arXiv:2201.00520 by other authors

arXiv:2312.05786 [pdf, other]

Deep Learning for Joint Design of Pilot, Channel Feedback, and Hybrid Beamforming in FDD Massive MIMO-OFDM Systems

Authors: Junyi Yang, Weifeng Zhu, Shu Sun, Xiaofeng Li, Xingqin Lin, Meixia Tao

Abstract: This letter considers the transceiver design in frequency division duplex (FDD) massive multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems for high-quality data transmission. We propose a novel deep learning based framework where the procedures of pilot design, channel feedback, and hybrid beamforming are realized by carefully crafted deep neural networ… ▽ More This letter considers the transceiver design in frequency division duplex (FDD) massive multiple-input multiple-output (MIMO) orthogonal frequency division multiplexing (OFDM) systems for high-quality data transmission. We propose a novel deep learning based framework where the procedures of pilot design, channel feedback, and hybrid beamforming are realized by carefully crafted deep neural networks. All the considered modules are jointly learned in an end-to-end manner, and a graph neural network is adopted to effectively capture interactions between beamformers based on the built graphical representation. Numerical results validate the effectiveness of our method. △ Less

Submitted 10 December, 2023; originally announced December 2023.

Comments: 5 pages, 4 figures, acccpted by IEEE Communication Letters

arXiv:2312.05557 [pdf, ps, other]

Long-Term Rate-Fairness-Aware Beamforming Based Massive MIMO Systems

Authors: W. Zhu, H. D. Tuan, E. Dutkiewicz, Y. Fang, H. V. Poor, L. Hanzo

Abstract: This is the first treatise on multi-user (MU) beamforming designed for achieving long-term rate-fairness in fulldimensional MU massive multi-input multi-output (m-MIMO) systems. Explicitly, based on the channel covariances, which can be assumed to be known beforehand, we address this problem by optimizing the following objective functions: the users' signal-toleakage-noise ratios (SLNRs) using SLN… ▽ More This is the first treatise on multi-user (MU) beamforming designed for achieving long-term rate-fairness in fulldimensional MU massive multi-input multi-output (m-MIMO) systems. Explicitly, based on the channel covariances, which can be assumed to be known beforehand, we address this problem by optimizing the following objective functions: the users' signal-toleakage-noise ratios (SLNRs) using SLNR max-min optimization, geometric mean of SLNRs (GM-SLNR) based optimization, and SLNR soft max-min optimization. We develop a convex-solver based algorithm, which invokes a convex subproblem of cubic time-complexity at each iteration for solving the SLNR maxmin problem. We then develop closed-form expression based algorithms of scalable complexity for the solution of the GMSLNR and of the SLNR soft max-min problem. The simulations provided confirm the users' improved-fairness ergodic rate distributions. △ Less

Submitted 9 December, 2023; originally announced December 2023.

arXiv:2311.14264 [pdf, ps, other]

An ADMM-Based Geometric Configuration Optimization in RSSD-Based Source Localization By UAVs with Spread Angle Constraint

Authors: Xin Cheng, Guangjie Han, Jinlin Peng, Jinfang Jiang, Yu He, Weiqiang Zhu, Feng Shu, Jiangzhou Wang

Abstract: Deploying multiple unmanned aerial vehicles (UAVs) to locate a signal-emitting source covers a wide range of military and civilian applications like rescue and target tracking. It is well known that the UAVs-source (sensors-target) geometry, namely geometric configuration, significantly affects the final localization accuracy. This paper focuses on the geometric configuration optimization for rece… ▽ More Deploying multiple unmanned aerial vehicles (UAVs) to locate a signal-emitting source covers a wide range of military and civilian applications like rescue and target tracking. It is well known that the UAVs-source (sensors-target) geometry, namely geometric configuration, significantly affects the final localization accuracy. This paper focuses on the geometric configuration optimization for received signal strength difference (RSSD)-based passive source localization by drone swarm. Different from prior works, this paper considers a general measuring condition where the spread angle of drone swarm centered on the source is constrained. Subject to this constraint, a geometric configuration optimization problem with the aim of maximizing the determinant of Fisher information matrix (FIM) is formulated. After transforming this problem using matrix theory, an alternating direction method of multipliers (ADMM)-based optimization framework is proposed. To solve the subproblems in this framework, two global optimal solutions based on the Von Neumann matrix trace inequality theorem and majorize-minimize (MM) algorithm are proposed respectively. Finally, the effectiveness as well as the practicality of the proposed ADMM-based optimization algorithm are demonstrated by extensive simulations. △ Less

Submitted 17 July, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

arXiv:2310.17155 [pdf, ps, other]

Max-min Rate Optimization of Low-Complexity Hybrid Multi-User Beamforming Maintaining Rate-Fairness

Authors: W. Zhu, H. D. Tuan, E. Dutkiewicz, H. V. Poor, L. Hanzo

Abstract: A wireless network serving multiple users in the millimeter-wave or the sub-terahertz band by a base station is considered. High-throughput multi-user hybrid-transmit beamforming is conceived by maximizing the minimum rate of the users. For the sake of energy-efficient signal transmission, the array-of-subarrays structure is used for analog beamforming relying on low-resolution phase shifters. We… ▽ More A wireless network serving multiple users in the millimeter-wave or the sub-terahertz band by a base station is considered. High-throughput multi-user hybrid-transmit beamforming is conceived by maximizing the minimum rate of the users. For the sake of energy-efficient signal transmission, the array-of-subarrays structure is used for analog beamforming relying on low-resolution phase shifters. We develop a convexsolver based algorithm, which iteratively invokes a convex problem of the same beamformer size for its solution. We then introduce the soft max-min rate objective function and develop a scalable algorithm for its optimization. Our simulation results demonstrate the striking fact that soft max-min rate optimization not only approaches the minimum user rate obtained by max-min rate optimization but it also achieves a sum rate similar to that of sum-rate maximization. Thus, the soft max-min rate optimization based beamforming design conceived offers a new technique of simultaneously achieving a high individual quality-of-service for all users and a high total network throughput. △ Less

Submitted 26 October, 2023; originally announced October 2023.

arXiv:2310.10095 [pdf, other]

A Multi-Scale Spatial Transformer U-Net for Simultaneously Automatic Reorientation and Segmentation of 3D Nuclear Cardiac Images

Authors: Yangfan Ni, Duo Zhang, Gege Ma, Lijun Lu, Zhongke Huang, Wentao Zhu

Abstract: Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of i… ▽ More Accurate reorientation and segmentation of the left ventricular (LV) is essential for the quantitative analysis of myocardial perfusion imaging (MPI), in which one critical step is to reorient the reconstructed transaxial nuclear cardiac images into standard short-axis slices for subsequent image processing. Small-scale LV myocardium (LV-MY) region detection and the diverse cardiac structures of individual patients pose challenges to LV segmentation operation. To mitigate these issues, we propose an end-to-end model, named as multi-scale spatial transformer UNet (MS-ST-UNet), that involves the multi-scale spatial transformer network (MSSTN) and multi-scale UNet (MSUNet) modules to perform simultaneous reorientation and segmentation of LV region from nuclear cardiac images. The proposed method is trained and tested using two different nuclear cardiac image modalities: 13N-ammonia PET and 99mTc-sestamibi SPECT. We use a multi-scale strategy to generate and extract image features with different scales. Our experimental results demonstrate that the proposed method significantly improves the reorientation and segmentation performance. This joint learning framework promotes mutual enhancement between reorientation and segmentation tasks, leading to cutting edge performance and an efficient image processing workflow. The proposed end-to-end deep network has the potential to reduce the burden of manual delineation for cardiac images, thereby providing multimodal quantitative analysis assistance for physicists. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 17 pages, 7 figures

arXiv:2308.12198 [pdf, other]

Hierarchical Beam Alignment for Millimeter-Wave Communication Systems: A Deep Learning Approach

Authors: Junyi Yang, Weifeng Zhu, Meixia Tao, Shu Sun

Abstract: Fast and precise beam alignment is crucial for high-quality data transmission in millimeter-wave (mmWave) communication systems, where large-scale antenna arrays are utilized to overcome the severe propagation loss. To tackle the challenging problem, we propose a novel deep learning-based hierarchical beam alignment method for both multiple-input single-output (MISO) and multiple-input multiple-ou… ▽ More Fast and precise beam alignment is crucial for high-quality data transmission in millimeter-wave (mmWave) communication systems, where large-scale antenna arrays are utilized to overcome the severe propagation loss. To tackle the challenging problem, we propose a novel deep learning-based hierarchical beam alignment method for both multiple-input single-output (MISO) and multiple-input multiple-output (MIMO) systems, which learns two tiers of probing codebooks (PCs) and uses their measurements to predict the optimal beam in a coarse-to-fine search manner. Specifically, a hierarchical beam alignment network (HBAN) is developed for MISO systems, which first performs coarse channel measurement using a tier-1 PC, then selects a tier-2 PC for fine channel measurement, and finally predicts the optimal beam based on both coarse and fine measurements. The propounded HBAN is trained in two steps: the tier-1 PC and the tier-2 PC selector are first trained jointly, followed by the joint training of all the tier-2 PCs and beam predictors. Furthermore, an HBAN for MIMO systems is proposed to directly predict the optimal beam pair without performing beam alignment individually at the transmitter and receiver. Numerical results demonstrate that the proposed HBANs are superior to the state-of-art methods in both alignment accuracy and signaling overhead reduction. △ Less

Submitted 23 August, 2023; originally announced August 2023.

Comments: 15 pages, 16 figures, to appear in Transactions on Wireless Communications. arXiv admin note: text overlap with arXiv:2209.03643

arXiv:2308.04663 [pdf, other]

doi 10.1016/j.media.2024.103199

Classification of lung cancer subtypes on CT images with synthetic pathological priors

Authors: Wentao Zhu, Yuan Jin, Gege Ma, Geng Chen, Jan Egger, Shaoting Zhang, Dimitris N. Metaxas

Abstract: The accurate diagnosis on pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis managements. In this paper, we propose self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns betwe… ▽ More The accurate diagnosis on pathological subtypes for lung cancer is of significant importance for the follow-up treatments and prognosis managements. In this paper, we propose self-generating hybrid feature network (SGHF-Net) for accurately classifying lung cancer subtypes on computed tomography (CT) images. Inspired by studies stating that cross-scale associations exist in the image patterns between the same case's CT images and its pathological images, we innovatively developed a pathological feature synthetic module (PFSM), which quantitatively maps cross-modality associations through deep neural networks, to derive the "gold standard" information contained in the corresponding pathological images from CT images. Additionally, we designed a radiological feature extraction module (RFEM) to directly acquire CT image information and integrated it with the pathological priors under an effective feature fusion framework, enabling the entire classification model to generate more indicative and specific pathologically related features and eventually output more accurate predictions. The superiority of the proposed model lies in its ability to self-generate hybrid features that contain multi-modality image information based on a single-modality input. To evaluate the effectiveness, adaptability, and generalization ability of our model, we performed extensive experiments on a large-scale multi-center dataset (i.e., 829 cases from three hospitals) to compare our model and a series of state-of-the-art (SOTA) classification models. The experimental results demonstrated the superiority of our model for lung cancer subtypes classification with significant accuracy improvements in terms of accuracy (ACC), area under the curve (AUC), and F1 score. △ Less

Submitted 8 August, 2023; originally announced August 2023.

Comments: 16 pages, 7 figures

Journal ref: Medical Image Analysis 95, July 2024, 103199

arXiv:2306.15942 [pdf, other]

Enhanced Neural Beamformer with Spatial Information for Target Speech Extraction

Authors: Aoqi Guo, Junnan Wu, Peng Gao, Wenbo Zhu, Qinwen Guo, Dazhi Gao, Yujun Wang

Abstract: Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of neural beamformer. To achieve this, we first use the UNet-TCN structure to model input fea… ▽ More Recently, deep learning-based beamforming algorithms have shown promising performance in target speech extraction tasks. However, most systems do not fully utilize spatial information. In this paper, we propose a target speech extraction network that utilizes spatial information to enhance the performance of neural beamformer. To achieve this, we first use the UNet-TCN structure to model input features and improve the estimation accuracy of the speech pre-separation module by avoiding information loss caused by direct dimensionality reduction in other models. Furthermore, we introduce a multi-head cross-attention mechanism that enhances the neural beamformer's perception of spatial information by making full use of the spatial information received by the array. Experimental results demonstrate that our approach, which incorporates a more reasonable target mask estimation network and a spatial information-based cross-attention mechanism into the neural beamformer, effectively improves speech separation performance. △ Less

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2306.11958 [pdf, other]

doi 10.1088/1361-6560/ad00fc

PDS-MAR: a fine-grained Projection-Domain Segmentation-based Metal Artifact Reduction method for intraoperative CBCT images with guidewires

Authors: Tianling Lyu, Zhan Wu, Gege Ma, Chen Jiang, Xinyun Zhong, Yan Xi, Yang Chen, Wentao Zhu

Abstract: Since the invention of modern CT systems, metal artifacts have been a persistent problem. Due to increased scattering, amplified noise, and insufficient data collection, it is more difficult to suppress metal artifacts in cone-beam CT, limiting its use in human- and robot-assisted spine surgeries where metallic guidewires and screws are commonly used. In this paper, we demonstrate that conventiona… ▽ More Since the invention of modern CT systems, metal artifacts have been a persistent problem. Due to increased scattering, amplified noise, and insufficient data collection, it is more difficult to suppress metal artifacts in cone-beam CT, limiting its use in human- and robot-assisted spine surgeries where metallic guidewires and screws are commonly used. In this paper, we demonstrate that conventional image-domain segmentation-based MAR methods are unable to eliminate metal artifacts for intraoperative CBCT images with guidewires. To solve this problem, we present a fine-grained projection-domain segmentation-based MAR method termed PDS-MAR, in which metal traces are augmented and segmented in the projection domain before being inpainted using triangular interpolation. In addition, a metal reconstruction phase is proposed to restore metal areas in the image domain. The digital phantom study and real CBCT data study demonstrate that the proposed algorithm achieves significantly better artifact suppression than other comparing methods and has the potential to advance the use of intraoperative CBCT imaging in clinical spine surgeries. △ Less

Submitted 20 June, 2023; originally announced June 2023.

Comments: 19 Pages

Journal ref: Phys. Med. Biol. 68 215007 (2023)

arXiv:2306.01289 [pdf, other]

nnMobileNet: Rethinking CNN for Retinopathy Research

Authors: Wenhui Zhu, Peijie Qiu, Xiwen Chen, Xin Li, Natasha Lepore, Oana M. Dumitrascu, Yalin Wang

Abstract: Over the past few decades, convolutional neural networks (CNNs) have been at the forefront of the detection and tracking of various retinal diseases (RD). Despite their success, the emergence of vision transformers (ViT) in the 2020s has shifted the trajectory of RD model development. The leading-edge performance of ViT-based models in RD can be largely credited to their scalability-their ability… ▽ More Over the past few decades, convolutional neural networks (CNNs) have been at the forefront of the detection and tracking of various retinal diseases (RD). Despite their success, the emergence of vision transformers (ViT) in the 2020s has shifted the trajectory of RD model development. The leading-edge performance of ViT-based models in RD can be largely credited to their scalability-their ability to improve as more parameters are added. As a result, ViT-based models tend to outshine traditional CNNs in RD applications, albeit at the cost of increased data and computational demands. ViTs also differ from CNNs in their approach to processing images, working with patches rather than local regions, which can complicate the precise localization of small, variably presented lesions in RD. In our study, we revisited and updated the architecture of a CNN model, specifically MobileNet, to enhance its utility in RD diagnostics. We found that an optimized MobileNet, through selective modifications, can surpass ViT-based models in various RD benchmarks, including diabetic retinopathy grading, detection of multiple fundus diseases, and classification of diabetic macular edema. The code is available at https://github.com/Retinal-Research/NN-MOBILENET △ Less

Submitted 15 April, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

Comments: Accepted as a conference paper to 2024 CVPRW

arXiv:2305.08014 [pdf]

Surface EMG-Based Inter-Session/Inter-Subject Gesture Recognition by Leveraging Lightweight All-ConvNet and Transfer Learning

Authors: Md. Rabiul Islam, Daniel Massicotte, Philippe Y. Massicotte, Wei-Ping Zhu

Abstract: Gesture recognition using low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces. However, the data variability between inter-session and inter-subject scenarios presents a great challenge. The existing approaches employed very large and complex deep ConvNet or 2SRNN-based domain adaptation methods to approximate th… ▽ More Gesture recognition using low-resolution instantaneous HD-sEMG images opens up new avenues for the development of more fluid and natural muscle-computer interfaces. However, the data variability between inter-session and inter-subject scenarios presents a great challenge. The existing approaches employed very large and complex deep ConvNet or 2SRNN-based domain adaptation methods to approximate the distribution shift caused by these inter-session and inter-subject data variability. Hence, these methods also require learning over millions of training parameters and a large pre-trained and target domain dataset in both the pre-training and adaptation stages. As a result, it makes high-end resource-bounded and computationally very expensive for deployment in real-time applications. To overcome this problem, we propose a lightweight All-ConvNet+TL model that leverages lightweight All-ConvNet and transfer learning (TL) for the enhancement of inter-session and inter-subject gesture recognition performance. The All-ConvNet+TL model consists solely of convolutional layers, a simple yet efficient framework for learning invariant and discriminative representations to address the distribution shifts caused by inter-session and inter-subject data variability. Experiments on four datasets demonstrate that our proposed methods outperform the most complex existing approaches by a large margin and achieve state-of-the-art results on inter-session and inter-subject scenarios and perform on par or competitively on intra-session gesture recognition. These performance gaps increase even more when a tiny amount (e.g., a single trial) of data is available on the target domain for adaptation. These outstanding experimental results provide evidence that the current state-of-the-art models may be overparameterized for sEMG-based inter-session and inter-subject gesture recognition tasks. △ Less

Submitted 19 February, 2024; v1 submitted 13 May, 2023; originally announced May 2023.

arXiv:2304.09727 [pdf, other]

Cooperative Multi-Cell Massive Access with Temporally Correlated Activity

Authors: Weifeng Zhu, Meixia Tao, Xiaojun Yuan, Fan Xu, Yunfeng Guan

Abstract: This paper investigates the problem of activity detection and channel estimation in cooperative multi-cell massive access systems with temporally correlated activity, where all access points (APs) are connected to a central unit via fronthaul links. We propose to perform user-centric AP cooperation for computation burden alleviation and introduce a generalized sliding-window detection strategy for… ▽ More This paper investigates the problem of activity detection and channel estimation in cooperative multi-cell massive access systems with temporally correlated activity, where all access points (APs) are connected to a central unit via fronthaul links. We propose to perform user-centric AP cooperation for computation burden alleviation and introduce a generalized sliding-window detection strategy for fully exploiting the temporal correlation in activity. By establishing the probabilistic model associated with the factor graph representation, we propose a scalable Dynamic Compressed Sensing-based Multiple Measurement Vector Generalized Approximate Message Passing (DCS-MMV-GAMP) algorithm from the perspective of Bayesian inference. Therein, the activity likelihood is refined by performing standard message passing among the activities in the spatial-temporal domain and GAMP is employed for efficient channel estimation. Furthermore, we develop two schemes of quantize-and-forward (QF) and detect-and-forward (DF) based on DCS-MMV-GAMP for the finite-fronthaul-capacity scenario, which are extensively evaluated under various system limits. Numerical results verify the significant superiority of the proposed approach over the benchmarks. Moreover, it is revealed that QF can usually realize superior performance when the antenna number is small, whereas DF shifts to be preferable with limited fronthaul capacity if the large-scale antenna arrays are equipped. △ Less

Submitted 19 April, 2023; originally announced April 2023.

Comments: 16 pages, 17 figures, minor revision

arXiv:2303.10757 [pdf, other]

Multiscale Audio Spectrogram Transformer for Efficient Audio Classification

Authors: Wentao Zhu, Mohamed Omar

Abstract: Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along… ▽ More Audio event has a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency domains) in different stages, and progressively reduces the number of tokens and increases the feature dimensions. MAST significantly outperforms AST~\cite{gong2021ast} by 22.2\%, 4.4\% and 4.7\% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of the top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20\% missing audios, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs) with 42\% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST can learn semantically more separable feature representations from audio signals. △ Less

Submitted 19 March, 2023; originally announced March 2023.

Comments: ICASSP 2023

Showing 1–50 of 120 results for author: Zhu, W