Search | arXiv e-print repository

DMF2Mel: A Dynamic Multiscale Fusion Network for EEG-Driven Mel Spectrogram Reconstruction

Authors: Cunhang Fan, Sheng Zhang, Jingjing Zhang, Enrui Liu, Xinhui Li, Minggang Zhao, Zhao Lv

Abstract: Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and… ▽ More Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework,achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline).Code is available at: https://github.com/fchest/DMF2Mel. △ Less

Submitted 10 July, 2025; originally announced July 2025.

Comments: Accepted by ACM MM 2025

arXiv:2506.23301 [pdf, ps, other]

Parallax QAMA: Novel Downlink Multiple Access for MISO Systems with Simple Receivers

Authors: Jie Huang, Ming Zhao, Shengli Zhou, Ling Qiu, Jinkang Zhu

Abstract: In this paper, we propose a novel downlink multiple access system with a multi-antenna transmitter and two single-antenna receivers, inspired by the underlying principles of hierarchical quadrature amplitude modulation (H-QAM) based multiple access (QAMA) and space-division multiple access (SDMA). In the proposed scheme, coded bits from two users are split and assigned to one shared symbol and two… ▽ More In this paper, we propose a novel downlink multiple access system with a multi-antenna transmitter and two single-antenna receivers, inspired by the underlying principles of hierarchical quadrature amplitude modulation (H-QAM) based multiple access (QAMA) and space-division multiple access (SDMA). In the proposed scheme, coded bits from two users are split and assigned to one shared symbol and two private symbols carried by different beams. Based on joint symbol mapping of H-QAM constellations and phase-aligned precoding at the transmitter, each receiver observes a different H-QAM constellation with Gray mapping, a unique parallax feature not shared by existing schemes. In addition to avoiding successive interference cancellation (SIC), each user independently demodulates its own bits on separate I and Q branches with calculations based on closed-form expressions. Hence the receiver complexity is on par with that of orthogonal multiple access (OMA), which is much lower than that in other competing alternatives such as non-orthogonal multiple access (NOMA) and rate-splitting multiple access (RSMA). We carry out system optimization and determine the achievable rate region. Numerical results show that the proposed system has a larger rate region relative to other benchmark schemes with receivers not using SIC, and even achieves a comparable rate region to those benchmark schemes with SIC receivers. △ Less

Submitted 29 June, 2025; originally announced June 2025.

arXiv:2506.22216 [pdf, ps, other]

ReF-LLE: Personalized Low-Light Enhancement via Reference-Guided Deep Reinforcement Learning

Authors: Ming Zhao, Pingping Liu, Tongshun Zhang, Zhe Zhang

Abstract: Low-light image enhancement presents two primary challenges: 1) Significant variations in low-light images across different conditions, and 2) Enhancement levels influenced by subjective preferences and user intent. To address these issues, we propose ReF-LLE, a novel personalized low-light image enhancement method that operates in the Fourier frequency domain and incorporates deep reinforcement l… ▽ More Low-light image enhancement presents two primary challenges: 1) Significant variations in low-light images across different conditions, and 2) Enhancement levels influenced by subjective preferences and user intent. To address these issues, we propose ReF-LLE, a novel personalized low-light image enhancement method that operates in the Fourier frequency domain and incorporates deep reinforcement learning. ReF-LLE is the first to integrate deep reinforcement learning into this domain. During training, a zero-reference image evaluation strategy is introduced to score enhanced images, providing reward signals that guide the model to handle varying degrees of low-light conditions effectively. In the inference phase, ReF-LLE employs a personalized adaptive iterative strategy, guided by the zero-frequency component in the Fourier domain, which represents the overall illumination level. This strategy enables the model to adaptively adjust low-light images to align with the illumination distribution of a user-provided reference image, ensuring personalized enhancement results. Extensive experiments on benchmark datasets demonstrate that ReF-LLE outperforms state-of-the-art methods, achieving superior perceptual quality and adaptability in personalized low-light image enhancement. △ Less

Submitted 27 June, 2025; originally announced June 2025.

Comments: 6 pages, 8 figures, accepted by ICME2025

arXiv:2506.20158 [pdf, ps, other]

Efficient Channel Estimation for Rotatable Antenna-Enabled Wireless Communication

Authors: Xue Xiong, Beixiong Zheng, Wen Wu, Xiaodan Shao, Liang Dai, Ming-Min Zhao, Jie Tang

Abstract: Non-fixed flexible antenna architectures, such as fluid antenna system (FAS), movable antenna (MA), and pinching antenna, have garnered significant interest in recent years. Among them, rotatable antenna (RA) is a promising antenna architecture that exploits additional spatial degrees of freedom (DoFs) to enhance the communication performance. To fully obtain the performance gain provided by RAs,… ▽ More Non-fixed flexible antenna architectures, such as fluid antenna system (FAS), movable antenna (MA), and pinching antenna, have garnered significant interest in recent years. Among them, rotatable antenna (RA) is a promising antenna architecture that exploits additional spatial degrees of freedom (DoFs) to enhance the communication performance. To fully obtain the performance gain provided by RAs, accurate channel state information (CSI) is essential for adjusting the orientation/boresight of each antenna. In this letter, we propose an efficient channel estimation scheme for RA communication systems, where the base station (BS) can sequentially and adaptively adjust the orientations of RAs to enrich the environmental observations from diverse angular perspectives, thereby enhancing the channel estimation accuracy. The proposed scheme includes two main procedures that are conducted alternately during each channel training period. Specifically, the first procedure is to estimate the CSI with given RAs' orientations, involving the angle-of-arrivals (AoAs) information and path gains. Then, based on the estimated CSI, the second procedure adjusts the RAs' orientations to maximize the effective channel gain. Simulation results demonstrate that the proposed channel estimation method outperforms other benchmark schemes. △ Less

Submitted 29 June, 2025; v1 submitted 25 June, 2025; originally announced June 2025.

Comments: 5 pages, 4 figures

arXiv:2504.20653 [pdf, other]

ComplexVCoder: An LLM-Driven Framework for Systematic Generation of Complex Verilog Code

Authors: Jian Zuo, Junzhe Liu, Xianyong Wang, Yicheng Liu, Navya Goli, Tong Xu, Hao Zhang, Umamaheswara Rao Tida, Zhenge Jia, Mengying Zhao

Abstract: Recent advances have demonstrated the promising capabilities of large language models (LLMs) in generating register-transfer level (RTL) code, such as Verilog. However, existing LLM-based frameworks still face significant challenges in accurately handling the complexity of real-world RTL designs, particularly those that are large-scale and involve multi-level module instantiations. To address this… ▽ More Recent advances have demonstrated the promising capabilities of large language models (LLMs) in generating register-transfer level (RTL) code, such as Verilog. However, existing LLM-based frameworks still face significant challenges in accurately handling the complexity of real-world RTL designs, particularly those that are large-scale and involve multi-level module instantiations. To address this issue, we present ComplexVCoder, an open-source LLM-driven framework that enhances both the generation quality and efficiency of complex Verilog code. Specifically, we introduce a two-stage generation mechanism, which leverages an intermediate representation to enable a more accurate and structured transition from natural language descriptions to intricate Verilog designs. In addition, we introduce a rule-based alignment method and a domain-specific retrieval-augmented generation (RAG) to further improve the correctness of the synthesized code by incorporating relevant design knowledge during generation. To evaluate our approach, we construct a comprehensive dataset comprising 55 complex Verilog designs derived from real-world implementations. We also release an open-source benchmark suite for systematically assessing the quality of auto-generated RTL code together with the ComplexVCoder framework. Experimental results show that ComplexVCoder outperforms SOTA frameworks such as CodeV and RTLCoder by 14.6% and 22.2%, respectively, in terms of function correctness on complex Verilog benchmarks. Furthermore, ComplexVcoder achieves comparable generation performances in terms of functionality correctness using a lightweight 32B model (Qwen2.5), rivaling larger-scale models such as GPT-3.5 and DeepSeek-V3. △ Less

Submitted 29 April, 2025; originally announced April 2025.

arXiv:2504.13741 [pdf, ps, other]

Sensing-Then-Beamforming: Robust Transmission Design for RIS-Empowered Integrated Sensing and Covert Communication

Authors: Xingyu Zhao, Min Li, Ming-Min Zhao, Shihao Yan, Min-Jian Zhao

Abstract: Traditional covert communication often relies on the knowledge of the warden's channel state information, which is inherently challenging to obtain due to the non-cooperative nature and potential mobility of the warden. The integration of sensing and communication technology provides a promising solution by enabling the legitimate transmitter to sense and track the warden, thereby enhancing transm… ▽ More Traditional covert communication often relies on the knowledge of the warden's channel state information, which is inherently challenging to obtain due to the non-cooperative nature and potential mobility of the warden. The integration of sensing and communication technology provides a promising solution by enabling the legitimate transmitter to sense and track the warden, thereby enhancing transmission covertness. In this paper, we develop a framework for sensing-then-beamforming in reconfigurable intelligent surface (RIS)-empowered integrated sensing and covert communication (ISCC) systems, where the transmitter (Alice) estimates and tracks the mobile aerial warden's channel using sensing echo signals while simultaneously sending covert information to multiple legitimate users (Bobs) with the assistance of RIS, under the surveillance of the warden (Willie). Considering channel estimation errors, we formulate a robust non-convex optimization problem that jointly designs the communication beamformers, the sensing signal covariance matrix at Alice, and the phase shifts at the RIS to maximize the covert sum rate of Bobs while satisfying the constraints related to covert communication, sensing, transmitter power, and the unit modulus of the RIS elements. To solve this complex problem, we develop an efficient algorithm using alternating optimization, successive convex approximation, S-procedure, sequential rank-one constraint relaxation, and semidefinite relaxation techniques. Numerical results confirm the convergence of the proposed algorithm and demonstrate its effectiveness in tracking the warden's channel while ensuring robust covert transmission. Furthermore, the results highlight the advantages of using RIS to enhance the covert transmission rate compared to baseline schemes, and also illustrate the intricate trade-off between communication and sensing in ISCC systems. △ Less

Submitted 18 April, 2025; originally announced April 2025.

Comments: 13 pages; submitted for possible publication

arXiv:2503.22486 [pdf, other]

Movable Antenna Enhanced Downlink Multi-User Integrated Sensing and Communication System

Authors: Yanze Han, Min Li, Xingyu Zhao, Ming-Min Zhao, Min-Jian Zhao

Abstract: This work investigates the potential of exploiting movable antennas (MAs) to enhance the performance of a multi-user downlink integrated sensing and communication (ISAC) system. Specifically, we formulate an optimization problem to maximize the transmit beampattern gain for sensing while simultaneously meeting each user's communication requirement by jointly optimizing antenna positions and beamfo… ▽ More This work investigates the potential of exploiting movable antennas (MAs) to enhance the performance of a multi-user downlink integrated sensing and communication (ISAC) system. Specifically, we formulate an optimization problem to maximize the transmit beampattern gain for sensing while simultaneously meeting each user's communication requirement by jointly optimizing antenna positions and beamforming design. The problem formulated is highly non-convex and involves multivariate-coupled constraints. To address these challenges, we introduce a series of auxiliary random variables and transform the original problem into an augmented Lagrangian problem. A double-loop algorithm based on a penalty dual decomposition framework is then developed to solve the problem. Numerical results validate the effectiveness of the proposed design, demonstrating its superiority over MA designs based on successive convex approximation optimization and other baseline approaches in ISAC systems. The results also highlight the advantages of MAs in achieving better sensing performance and improved beam control, especially for sparse arrays with large apertures. △ Less

Submitted 28 March, 2025; originally announced March 2025.

Comments: accepted and to appear in IEEE VTC2025-Spring

arXiv:2503.21818 [pdf]

Deep Learning-Based Quantitative Assessment of Renal Chronicity Indices in Lupus Nephritis

Authors: Tianqi Tu, Hui Wang, Jiangbo Pei, Xiaojuan Yu, Aidong Men, Suxia Wang, Qingchao Chen, Ying Tan, Feng Yu, Minghui Zhao

Abstract: Background: Renal chronicity indices (CI) have been identified as strong predictors of long-term outcomes in lupus nephritis (LN) patients. However, assessment by pathologists is hindered by challenges such as substantial time requirements, high interobserver variation, and susceptibility to fatigue. This study aims to develop an effective deep learning (DL) pipeline that automates the assessment… ▽ More Background: Renal chronicity indices (CI) have been identified as strong predictors of long-term outcomes in lupus nephritis (LN) patients. However, assessment by pathologists is hindered by challenges such as substantial time requirements, high interobserver variation, and susceptibility to fatigue. This study aims to develop an effective deep learning (DL) pipeline that automates the assessment of CI and provides valuable prognostic insights from a disease-specific perspective. Methods: We curated a dataset comprising 282 slides obtained from 141 patients across two independent cohorts with a complete 10-years follow-up. Our DL pipeline was developed on 60 slides (22,410 patch images) from 30 patients in the training cohort and evaluated on both an internal testing set (148 slides, 77,605 patch images) and an external testing set (74 slides, 27,522 patch images). Results: The study included two cohorts with slight demographic differences, particularly in age and hemoglobin levels. The DL pipeline showed high segmentation performance across tissue compartments and histopathologic lesions, outperforming state-of-the-art methods. The DL pipeline also demonstrated a strong correlation with pathologists in assessing CI, significantly improving interobserver agreement. Additionally, the DL pipeline enhanced prognostic accuracy, particularly in outcome prediction, when combined with clinical parameters and pathologist-assessed CIs Conclusions: The DL pipeline demonstrated accuracy and efficiency in assessing CI in LN, showing promise in improving interobserver agreement among pathologists. It also exhibited significant value in prognostic analysis and enhancing outcome prediction in LN patients, offering a valuable tool for clinical decision-making. △ Less

Submitted 26 March, 2025; originally announced March 2025.

arXiv:2503.16149 [pdf, other]

Selective Complementary Feature Fusion and Modal Feature Compression Interaction for Brain Tumor Segmentation

Authors: Dong Chen, Boyue Zhao, Yi Zhang, Meng Zhao

Abstract: Efficient modal feature fusion strategy is the key to achieve accurate segmentation of brain glioma. However, due to the specificity of different MRI modes, it is difficult to carry out cross-modal fusion with large differences in modal features, resulting in the model ignoring rich feature information. On the other hand, the problem of multi-modal feature redundancy interaction occurs in parallel… ▽ More Efficient modal feature fusion strategy is the key to achieve accurate segmentation of brain glioma. However, due to the specificity of different MRI modes, it is difficult to carry out cross-modal fusion with large differences in modal features, resulting in the model ignoring rich feature information. On the other hand, the problem of multi-modal feature redundancy interaction occurs in parallel networks due to the proliferation of feature dimensions, further increase the difficulty of multi-modal feature fusion at the bottom end. In order to solve the above problems, we propose a noval complementary feature compression interaction network (CFCI-Net), which realizes the complementary fusion and compression interaction of multi-modal feature information with an efficient mode fusion strategy. Firstly, we propose a selective complementary feature fusion (SCFF) module, which adaptively fuses rich cross-modal feature information by complementary soft selection weights. Secondly, a modal feature compression interaction (MFCI) transformer is proposed to deal with the multi-mode fusion redundancy problem when the feature dimension surges. The MFCI transformer is composed of modal feature compression (MFC) and modal feature interaction (MFI) to realize redundancy feature compression and multi-mode feature interactive learning. %In MFI, we propose a hierarchical interactive attention mechanism based on multi-head attention. Evaluations on the BraTS2019 and BraTS2020 datasets demonstrate that CFCI-Net achieves superior results compared to state-of-the-art models. Code: https://github.com/CDmm0/CFCI-Net △ Less

Submitted 20 March, 2025; originally announced March 2025.

arXiv:2503.11551 [pdf, ps, other]

Vectorable Thrust Control for Multimodal Locomotion of Quadruped Robot SPIDAR

Authors: Moju Zhao

Abstract: In this paper, I present vectorable thrust control for different locomotion modes by a novel quadruped robot, SPIDAR, equipped with vectoring rotor in each link. First, the robot's unique mechanical design, the dynamics model, and the basic control framework for terrestrial/aerial locomotion are briefly introduced. Second, a vectorable thrust control method derived from the basic control framework… ▽ More In this paper, I present vectorable thrust control for different locomotion modes by a novel quadruped robot, SPIDAR, equipped with vectoring rotor in each link. First, the robot's unique mechanical design, the dynamics model, and the basic control framework for terrestrial/aerial locomotion are briefly introduced. Second, a vectorable thrust control method derived from the basic control framework for aerial locomotion is presented. A key feature of this extended flight control is its ability to avoid interrotor aerodynamics interference under specific joint configuration. Third, another extended thrust control method and a fundamental gait strategy is proposed for special terrestrial locomotion called crawling that requires all legs to be lifted at the same time. Finally, the experimental results of the flight with a complex joint motion and the repeatable crawling motion are explained, which demonstrate the feasibility of the proposed thrust control methods for different locomotion modes. △ Less

Submitted 14 March, 2025; originally announced March 2025.

Comments: 16 Pages. Presented in International Symposium of Robotics Research (ISRR) 2024, Long Beach, USA

arXiv:2503.11190 [pdf, other]

Cross-Modal Learning for Music-to-Music-Video Description Generation

Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV descri… ▽ More Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV description directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV description and highlight specific musical attributes that warrant greater focus for improved MV description generation. △ Less

Submitted 14 March, 2025; originally announced March 2025.

Comments: Accepted by RepL4NLP 2025 @ NAACL 2025

arXiv:2503.04407 [pdf, ps, other]

Ambiguity Function Analysis and Optimization of Frequency-Hopping MIMO Radar with Movable Antennas

Authors: Xiang Chen, Ming-Min Zhao, Min Li, Liyan Li, Min-Jian Zhao, Jiangzhou Wang

Abstract: In this paper, we propose a movable antenna (MA)-enabled frequency-hopping (FH) multiple-input multiple-output (MIMO) radar system and investigate its sensing resolution. Specifically, we derive the expression of the ambiguity function and analyze the relationship between its main lobe width and the transmit antenna positions. In particular, the optimal antenna distribution to achieve the minimum… ▽ More In this paper, we propose a movable antenna (MA)-enabled frequency-hopping (FH) multiple-input multiple-output (MIMO) radar system and investigate its sensing resolution. Specifically, we derive the expression of the ambiguity function and analyze the relationship between its main lobe width and the transmit antenna positions. In particular, the optimal antenna distribution to achieve the minimum main lobe width in the angular domain is characterized. We discover that this minimum width is related to the antenna size, the antenna number, and the target angle. Meanwhile, we present lower bounds of the ambiguity function in the Doppler and delay domains, and show that the impact of the antenna size on the radar performance in these two domains is very different from that in the angular domain. Moreover, the performance enhancement brought by MAs exhibits a certain trade-off between the main lobe width and the side lobe peak levels. Therefore, we propose to balance between minimizing the side lobe levels and narrowing the main lobe of the ambiguity function by optimizing the antenna positions. To achieve this goal, we propose a low-complexity algorithm based on the Rosen's gradient projection method, and show that its performance is very close to the baseline. Simulation results are presented to validate the theoretical analysis on the properties of the ambiguity function, and demonstrate that MAs can reduce the main lobe width and suppress the side lobe levels of the ambiguity function, thereby enhancing radar performance. △ Less

Submitted 6 March, 2025; originally announced March 2025.

Comments: 15 pages, 13 figures

arXiv:2502.20690 [pdf, ps, other]

Multi-model Stochastic Particle-based Variational Bayesian Inference for Multiband Delay Estimation

Authors: Zhixiang Hu, An Liu, Minjian Zhao

Abstract: Joint utilization of multiple discrete frequency bands can enhance the accuracy of delay estimation. Although some unique challenges of multiband fusion, such as phase distortion, oscillation phenomena, and high-dimensional search, have been partially addressed, further challenges remain. Specifically, under conditions of low signal-to-noise ratio (SNR), insufficient data, and closely spaced delay… ▽ More Joint utilization of multiple discrete frequency bands can enhance the accuracy of delay estimation. Although some unique challenges of multiband fusion, such as phase distortion, oscillation phenomena, and high-dimensional search, have been partially addressed, further challenges remain. Specifically, under conditions of low signal-to-noise ratio (SNR), insufficient data, and closely spaced delay paths, accurately determining the model order-the number of delay paths-becomes difficult. Misestimating the model order can significantly degrade the estimation performance of traditional methods. To address joint model selection and parameter estimation under such harsh conditions, we propose a multi-model stochastic particle-based variational Bayesian inference (MM-SPVBI) framework, capable of exploring multiple high-dimensional parameter spaces. Initially, we split potential overlapping primary delay paths based on coarse estimates, generating several parallel candidate models. Then, an auto-focusing sampling strategy is employed to quickly identify the optimal model. Additionally, we introduce a hybrid posterior approximation to improve the original single-model SPVBI, ensuring overall complexity does not increase significantly with parallelism. Simulations demonstrate that our algorithm offers substantial advantages over existing methods. △ Less

Submitted 8 July, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

arXiv:2502.20054 [pdf, other]

Night-Voyager: Consistent and Efficient Nocturnal Vision-Aided State Estimation in Object Maps

Authors: Tianxiao Gao, Mingle Zhao, Chengzhong Xu, Hui Kong

Abstract: Accurate and robust state estimation at nighttime is essential for autonomous robotic navigation to achieve nocturnal or round-the-clock tasks. An intuitive question arises: Can low-cost standard cameras be exploited for nocturnal state estimation? Regrettably, most existing visual methods may fail under adverse illumination conditions, even with active lighting or image enhancement. A pivotal ins… ▽ More Accurate and robust state estimation at nighttime is essential for autonomous robotic navigation to achieve nocturnal or round-the-clock tasks. An intuitive question arises: Can low-cost standard cameras be exploited for nocturnal state estimation? Regrettably, most existing visual methods may fail under adverse illumination conditions, even with active lighting or image enhancement. A pivotal insight, however, is that streetlights in most urban scenarios act as stable and salient prior visual cues at night, reminiscent of stars in deep space aiding spacecraft voyage in interstellar navigation. Inspired by this, we propose Night-Voyager, an object-level nocturnal vision-aided state estimation framework that leverages prior object maps and keypoints for versatile localization. We also find that the primary limitation of conventional visual methods under poor lighting conditions stems from the reliance on pixel-level metrics. In contrast, metric-agnostic, non-pixel-level object detection serves as a bridge between pixel-level and object-level spaces, enabling effective propagation and utilization of object map information within the system. Night-Voyager begins with a fast initialization to solve the global localization problem. By employing an effective two-stage cross-modal data association, the system delivers globally consistent state updates using map-based observations. To address the challenge of significant uncertainties in visual observations at night, a novel matrix Lie group formulation and a feature-decoupled multi-state invariant filter are introduced, ensuring consistent and efficient estimation. Through comprehensive experiments in both simulation and diverse real-world scenarios (spanning approximately 12.3 km), Night-Voyager showcases its efficacy, robustness, and efficiency, filling a critical gap in nocturnal vision-aided state estimation. △ Less

Submitted 4 March, 2025; v1 submitted 27 February, 2025; originally announced February 2025.

Comments: IEEE Transactions on Robotics (T-RO), 2025

arXiv:2502.16709 [pdf]

FedDA-TSformer: Federated Domain Adaptation with Vision TimeSformer for Left Ventricle Segmentation on Gated Myocardial Perfusion SPECT Image

Authors: Yehong Huang, Chen Zhao, Rochak Dhakal, Min Zhao, Guang-Uei Hung, Zhixin Jiang, Weihua Zhou

Abstract: Background and Purpose: Functional assessment of the left ventricle using gated myocardial perfusion (MPS) single-photon emission computed tomography relies on the precise extraction of the left ventricular contours while simultaneously ensuring the security of patient data. Methods: In this paper, we introduce the integration of Federated Domain Adaptation with TimeSformer, named 'FedDA-TSformer'… ▽ More Background and Purpose: Functional assessment of the left ventricle using gated myocardial perfusion (MPS) single-photon emission computed tomography relies on the precise extraction of the left ventricular contours while simultaneously ensuring the security of patient data. Methods: In this paper, we introduce the integration of Federated Domain Adaptation with TimeSformer, named 'FedDA-TSformer' for left ventricle segmentation using MPS. FedDA-TSformer captures spatial and temporal features in gated MPS images, leveraging spatial attention, temporal attention, and federated learning for improved domain adaptation while ensuring patient data security. In detail, we employed Divide-Space-Time-Attention mechanism to extract spatio-temporal correlations from the multi-centered MPS datasets, ensuring that predictions are spatio-temporally consistent. To achieve domain adaptation, we align the model output on MPS from three different centers using local maximum mean discrepancy (LMMD) loss. This approach effectively addresses the dual requirements of federated learning and domain adaptation, enhancing the model's performance during training with multi-site datasets while ensuring the protection of data from different hospitals. Results: Our FedDA-TSformer was trained and evaluated using MPS datasets collected from three hospitals, comprising a total of 150 subjects. Each subject's cardiac cycle was divided into eight gates. The model achieved Dice Similarity Coefficients (DSC) of 0.842 and 0.907 for left ventricular (LV) endocardium and epicardium segmentation, respectively. Conclusion: Our proposed FedDA-TSformer model addresses the challenge of multi-center generalization, ensures patient data privacy protection, and demonstrates effectiveness in left ventricular (LV) segmentation. △ Less

Submitted 23 February, 2025; originally announced February 2025.

arXiv:2502.14198 [pdf, ps, other]

Antenna Position and Beamforming Optimization for Movable Antenna Enabled ISAC: Optimal Solutions and Efficient Algorithms

Authors: Lebin Chen, Ming-Min Zhao, Min-Jian Zhao, Rui Zhang

Abstract: In this paper, we propose an integrated sensing and communication (ISAC) system enabled by movable antennas (MAs), which can dynamically adjust antenna positions to enhance both sensing and communication performance for future wireless networks. To characterize the benefits of MA-enabled ISAC systems, we first derive the Cramér-Rao bound (CRB) for angle estimation error, which is then minimized fo… ▽ More In this paper, we propose an integrated sensing and communication (ISAC) system enabled by movable antennas (MAs), which can dynamically adjust antenna positions to enhance both sensing and communication performance for future wireless networks. To characterize the benefits of MA-enabled ISAC systems, we first derive the Cramér-Rao bound (CRB) for angle estimation error, which is then minimized for optimizing the antenna position vector (APV) and beamforming design, subject to a pre-defined signal-to-noise ratio (SNR) constraint to ensure the communication performance. In particular, for the case with receive MAs only, we provide a closed-form optimal antenna position solution, and show that employing MAs over conventional fixed-position antennas (FPAs) can achieve a sensing performance gain upper-bounded by 4.77 dB. On the other hand, for the case with transmit MAs only, we develop a boundary traversal breadth-first search (BT-BFS) algorithm to obtain the global optimal solution in the line-of-sight (LoS) channel scenario, along with a lower-complexity boundary traversal depth-first search (BT-DFS) algorithm to find a local optimal solution efficiently. While in the scenario with non-LoS (NLoS) channels, a majorization-minimization (MM) based Rosen's gradient projection (RGP) algorithm with an efficient initialization method is proposed to obtain stationary solutions for the considered problem, which can be extended to the general case with both transmit and receive MAs. Extensive numerical results are presented to verify the effectiveness of the proposed algorithms, and demonstrate the superiority of the considered MA-enabled ISAC system over conventional ISAC systems with FPAs in terms of sensing and communication performance trade-off. △ Less

Submitted 19 February, 2025; originally announced February 2025.

Comments: 13 pages, 7 figures

arXiv:2502.12623 [pdf, other]

DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning

Authors: Zhuoyuan Mao, Mengjie Zhao, Qiyu Wu, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance… ▽ More Recent advancements in music large language models (LLMs) have significantly improved music understanding tasks, which involve the model's ability to analyze and interpret various musical elements. These improvements primarily focused on integrating both music and text inputs. However, the potential of incorporating additional modalities such as images, videos and textual music features to enhance music understanding remains unexplored. To bridge this gap, we propose DeepResonance, a multimodal music understanding LLM fine-tuned via multi-way instruction tuning with multi-way aligned music, text, image, and video data. To this end, we construct Music4way-MI2T, Music4way-MV2T, and Music4way-Any2T, three 4-way training and evaluation datasets designed to enable DeepResonance to integrate both visual and textual music feature content. We also introduce multi-sampled ImageBind embeddings and a pre-LLM fusion Transformer to enhance modality fusion prior to input into text LLMs, tailoring DeepResonance for multi-way instruction tuning. Our model achieves state-of-the-art performances across six music understanding tasks, highlighting the benefits of the auxiliary modalities and the structural superiority of DeepResonance. We plan to open-source the models and the newly constructed datasets. △ Less

Submitted 20 May, 2025; v1 submitted 18 February, 2025; originally announced February 2025.

arXiv:2501.13130 [pdf, other]

A Novel Scene Coupling Semantic Mask Network for Remote Sensing Image Segmentation

Authors: Xiaowen Ma, Rongrong Lian, Zhenkai Wu, Renxiang Guan, Tingfeng Hong, Mengjiao Zhao, Mengting Ma, Jiangtao Nie, Zhenhong Du, Siyang Song, Wei Zhang

Abstract: As a common method in the field of computer vision, spatial attention mechanism has been widely used in semantic segmentation of remote sensing images due to its outstanding long-range dependency modeling capability. However, remote sensing images are usually characterized by complex backgrounds and large intra-class variance that would degrade their analysis performance. While vanilla spatial att… ▽ More As a common method in the field of computer vision, spatial attention mechanism has been widely used in semantic segmentation of remote sensing images due to its outstanding long-range dependency modeling capability. However, remote sensing images are usually characterized by complex backgrounds and large intra-class variance that would degrade their analysis performance. While vanilla spatial attention mechanisms are based on dense affine operations, they tend to introduce a large amount of background contextual information and lack of consideration for intrinsic spatial correlation. To deal with such limitations, this paper proposes a novel scene-Coupling semantic mask network, which reconstructs the vanilla attention with scene coupling and local global semantic masks strategies. Specifically, scene coupling module decomposes scene information into global representations and object distributions, which are then embedded in the attention affinity processes. This Strategy effectively utilizes the intrinsic spatial correlation between features so that improve the process of attention modeling. Meanwhile, local global semantic masks module indirectly correlate pixels with the global semantic masks by using the local semantic mask as an intermediate sensory element, which reduces the background contextual interference and mitigates the effect of intra-class variance. By combining the above two strategies, we propose the model SCSM, which not only can efficiently segment various geospatial objects in complex scenarios, but also possesses inter-clean and elegant mathematical representations. Experimental results on four benchmark datasets demonstrate the the effectiveness of the above two strategies for improving the attention modeling of remote sensing images. The dataset and code are available at https://github.com/xwmaxwma/rssegmentation △ Less

Submitted 21 January, 2025; originally announced January 2025.

Comments: Accepted by ISPRS Journal of Photogrammetry and Remote Sensing

arXiv:2501.11869 [pdf, other]

Saturation in Snapshot Compressive Imaging

Authors: Mengyu Zhao, Shirin Jalali

Abstract: Snapshot Compressive Imaging (SCI) maps three-dimensional (3D) data cubes, such as videos or hyperspectral images, into two-dimensional (2D) measurements via optical modulation, enabling efficient data acquisition and reconstruction. Recent advances have shown the potential of mask optimization to enhance SCI performance, but most studies overlook nonlinear distortions caused by saturation in prac… ▽ More Snapshot Compressive Imaging (SCI) maps three-dimensional (3D) data cubes, such as videos or hyperspectral images, into two-dimensional (2D) measurements via optical modulation, enabling efficient data acquisition and reconstruction. Recent advances have shown the potential of mask optimization to enhance SCI performance, but most studies overlook nonlinear distortions caused by saturation in practical systems. Saturation occurs when high-intensity measurements exceed the sensor's dynamic range, leading to information loss that standard reconstruction algorithms cannot fully recover. This paper addresses the challenge of optimizing binary masks in SCI under saturation. We theoretically characterize the performance of compression-based SCI recovery in the presence of saturation and leverage these insights to optimize masks for such conditions. Our analysis reveals trade-offs between mask statistics and reconstruction quality in saturated systems. Experimental results using a Plug-and-Play (PnP) style network validate the theory, demonstrating improved recovery performance and robustness to saturation with our optimized binary masks. △ Less

Submitted 20 January, 2025; originally announced January 2025.

Comments: 13 pages

arXiv:2501.06653 [pdf, other]

Theoretical Characterization of Effect of Masks in Snapshot Compressive Imaging

Authors: Mengyu Zhao, Shirin Jalali

Abstract: Snapshot compressive imaging (SCI) refers to the recovery of three-dimensional data cubes-such as videos or hyperspectral images-from their two-dimensional projections, which are generated by a special encoding of the data with a mask. SCI systems commonly use binary-valued masks that follow certain physical constraints. Optimizing these masks subject to these constraints is expected to improve sy… ▽ More Snapshot compressive imaging (SCI) refers to the recovery of three-dimensional data cubes-such as videos or hyperspectral images-from their two-dimensional projections, which are generated by a special encoding of the data with a mask. SCI systems commonly use binary-valued masks that follow certain physical constraints. Optimizing these masks subject to these constraints is expected to improve system performance. However, prior theoretical work on SCI systems focuses solely on independently and identically distributed (i.i.d.) Gaussian masks, which do not permit such optimization. On the other hand, existing practical mask optimizations rely on computationally intensive joint optimizations that provide limited insight into the role of masks and are expected to be sub-optimal due to the non-convexity and complexity of the optimization. In this paper, we analytically characterize the performance of SCI systems employing binary masks and leverage our analysis to optimize hardware parameters. Our findings provide a comprehensive and fundamental understanding of the role of binary masks - with both independent and dependent elements - and their optimization. We also present simulation results that confirm our theoretical findings and further illuminate different aspects of mask design. △ Less

Submitted 11 January, 2025; originally announced January 2025.

Comments: 27 pages. arXiv admin note: substantial text overlap with arXiv:2307.07796

arXiv:2501.04285 [pdf, other]

doi 10.12142/ZTECOM.202501005

Separate Source Channel Coding Is Still What You Need: An LLM-based Rethinking

Authors: Tianqi Ren, Rongpeng Li, Ming-min Zhao, Xianfu Chen, Guangyi Liu, Yang Yang, Zhifeng Zhao, Honggang Zhang

Abstract: Along with the proliferating research interest in Semantic Communication (SemCom), Joint Source Channel Coding (JSCC) has dominated the attention due to the widely assumed existence in efficiently delivering information semantics. Nevertheless, this paper challenges the conventional JSCC paradigm, and advocates for adoption of Separate Source Channel Coding (SSCC) to enjoy the underlying more degr… ▽ More Along with the proliferating research interest in Semantic Communication (SemCom), Joint Source Channel Coding (JSCC) has dominated the attention due to the widely assumed existence in efficiently delivering information semantics. Nevertheless, this paper challenges the conventional JSCC paradigm, and advocates for adoption of Separate Source Channel Coding (SSCC) to enjoy the underlying more degree of freedom for optimization. We demonstrate that SSCC, after leveraging the strengths of Large Language Model (LLM) for source coding and Error Correction Code Transformer (ECCT) complemented for channel decoding, offers superior performance over JSCC. Our proposed framework also effectively highlights the compatibility challenges between SemCom approaches and digital communication systems, particularly concerning the resource costs associated with the transmission of high precision floating point numbers. Through comprehensive evaluations, we establish that empowered by LLM-based compression and ECCT-enhanced error correction, SSCC remains a viable and effective solution for modern communication systems. In other words, separate source and channel coding is still what we need! △ Less

Submitted 26 May, 2025; v1 submitted 8 January, 2025; originally announced January 2025.

Journal ref: ZTE Communications, vol. 23, no. 1, pp. 30-44, Mar. 2025

arXiv:2412.19475 [pdf, other]

Exploiting Dynamic Sparsity for Near-Field Spatial Non-Stationary XL-MIMO Channel Tracking

Authors: Wenkang Xu, An Liu, Min-jian Zhao, Giuseppe Caire, Yik-Chung Wu

Abstract: This work considers a spatial non-stationary channel tracking problem in broadband extremely large-scale multiple-input-multiple-output (XL-MIMO) systems. In the case of spatial non-stationary, each scatterer has a certain visibility region (VR) over antennas and power change may occur among visible antennas. Concentrating on the temporal correlation of XL-MIMO channels, we design a three-layer Ma… ▽ More This work considers a spatial non-stationary channel tracking problem in broadband extremely large-scale multiple-input-multiple-output (XL-MIMO) systems. In the case of spatial non-stationary, each scatterer has a certain visibility region (VR) over antennas and power change may occur among visible antennas. Concentrating on the temporal correlation of XL-MIMO channels, we design a three-layer Markov prior model and hierarchical two-dimensional (2D) Markov model to exploit the dynamic sparsity of sparse channel vectors and VRs, respectively. Then, we formulate the channel tracking problem as a bilinear measurement process, and a novel dynamic alternating maximum a posteriori (DA-MAP) framework is developed to solve the problem. The DA-MAP contains four basic modules: channel estimation module, VR detection module, grid update module, and temporal correlated module. Specifically, the first module is an inverse-free variational Bayesian inference (IF-VBI) estimator that avoids computational intensive matrix inverse each iteration; the second module is a turbo compressive sensing (Turbo-CS) algorithm that only needs small-scale matrix operations in a parallel fashion; the third module refines the polar-delay domain grid; and the fourth module can process the temporal prior information to ensure high-efficiency channel tracking. Simulations show that the proposed method can achieve a significant channel tracking performance while achieving low computational overhead. △ Less

Submitted 31 March, 2025; v1 submitted 27 December, 2024; originally announced December 2024.

Comments: 13 pages, 11 figures,Submitted to IEEE TSP

arXiv:2411.11192 [pdf]

Robot Metabolism: Towards machines that can grow by consuming other machines

Authors: Philippe Martin Wyder, Riyaan Bakhda, Meiqi Zhao, Quinn A. Booth, Matthew E. Modi, Andrew Song, Simon Kang, Jiahao Wu, Priya Patel, Robert T. Kasumi, David Yi, Nihar Niraj Garg, Pranav Jhunjhunwala, Siddharth Bhutoria, Evan H. Tong, Yuhang Hu, Judah Goldfeder, Omer Mustel, Donghan Kim, Hod Lipson

Abstract: Biological lifeforms can heal, grow, adapt, and reproduce -- abilities essential for sustained survival and development. In contrast, robots today are primarily monolithic machines with limited ability to self-repair, physically develop, or incorporate material from their environments. A key challenge to such physical adaptation has been that while robot minds are rapidly evolving new behaviors th… ▽ More Biological lifeforms can heal, grow, adapt, and reproduce -- abilities essential for sustained survival and development. In contrast, robots today are primarily monolithic machines with limited ability to self-repair, physically develop, or incorporate material from their environments. A key challenge to such physical adaptation has been that while robot minds are rapidly evolving new behaviors through AI, their bodies remain closed systems, unable to systematically integrate new material to grow or heal. We argue that open-ended physical adaptation is only possible when robots are designed using only a small repertoire of simple modules. This allows machines to mechanically adapt by consuming parts from other machines or their surroundings and shedding broken components. We demonstrate this principle using a truss modular robot platform composed of one-dimensional actuated bars. We show how robots in this space can grow bigger, faster, and more capable by consuming materials from their environment and from other robots. We suggest that machine metabolic processes akin to the one demonstrated here will be an essential part of any sustained future robot ecology. △ Less

Submitted 17 November, 2024; originally announced November 2024.

Comments: Manuscript combined with Supplementary Materials File for arXiv submission. Submitting to Journal and will update external DOI once available

MSC Class: 70-01; 68-02 ACM Class: I.6; H.4; H.m; I.m; B.m

arXiv:2411.06307 [pdf, other]

Acoustic Volume Rendering for Neural Impulse Response Fields

Authors: Zitong Lan, Chenhao Zheng, Zhiwei Zheng, Mingmin Zhao

Abstract: Realistic audio synthesis that captures accurate acoustic phenomena is essential for creating immersive experiences in virtual and augmented reality. Synthesizing the sound received at any position relies on the estimation of impulse response (IR), which characterizes how sound propagates in one scene along different paths before arriving at the listener's position. In this paper, we present Acous… ▽ More Realistic audio synthesis that captures accurate acoustic phenomena is essential for creating immersive experiences in virtual and augmented reality. Synthesizing the sound received at any position relies on the estimation of impulse response (IR), which characterizes how sound propagates in one scene along different paths before arriving at the listener's position. In this paper, we present Acoustic Volume Rendering (AVR), a novel approach that adapts volume rendering techniques to model acoustic impulse responses. While volume rendering has been successful in modeling radiance fields for images and neural scene representations, IRs present unique challenges as time-series signals. To address these challenges, we introduce frequency-domain volume rendering and use spherical integration to fit the IR measurements. Our method constructs an impulse response field that inherently encodes wave propagation principles and achieves state-of-the-art performance in synthesizing impulse responses for novel poses. Experiments show that AVR surpasses current leading methods by a substantial margin. Additionally, we develop an acoustic simulation platform, AcoustiX, which provides more accurate and realistic IR simulations than existing simulators. Code for AVR and AcoustiX are available at https://zitonglan.github.io/avr. △ Less

Submitted 9 November, 2024; originally announced November 2024.

Comments: NeurIPS 2024 Spotlight

arXiv:2410.15573 [pdf, other]

OpenMU: Your Swiss Army Knife for Music Understanding

Authors: Mengjie Zhao, Zhi Zhong, Zhuoyuan Mao, Shiqi Yang, Wei-Hsiang Liao, Shusuke Takahashi, Hiromi Wakaki, Yuki Mitsufuji

Abstract: We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our mus… ▽ More We present OpenMU-Bench, a large-scale benchmark suite for addressing the data scarcity issue in training multimodal language models to understand music. To construct OpenMU-Bench, we leveraged existing datasets and bootstrapped new annotations. OpenMU-Bench also broadens the scope of music understanding by including lyrics understanding and music tool usage. Using OpenMU-Bench, we trained our music understanding model, OpenMU, with extensive ablations, demonstrating that OpenMU outperforms baseline models such as MU-Llama. Both OpenMU and OpenMU-Bench are open-sourced to facilitate future research in music understanding and to enhance creative music production efficiency. △ Less

Submitted 27 November, 2024; v1 submitted 20 October, 2024; originally announced October 2024.

Comments: Resources: https://github.com/sony/openmu

arXiv:2410.05151 [pdf, other]

Editing Music with Melody and Text: Using ControlNet for Diffusion Transformer

Authors: Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, Chao Zhang

Abstract: Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for lon… ▽ More Despite the significant progress in controllable music generation and editing, challenges remain in the quality and length of generated music due to the use of Mel-spectrogram representations and UNet-based model structures. To address these limitations, we propose a novel approach using a Diffusion Transformer (DiT) augmented with an additional control branch using ControlNet. This allows for long-form and variable-length music generation and editing controlled by text and melody prompts. For more precise and fine-grained melody control, we introduce a novel top-$k$ constant-Q Transform representation as the melody prompt, reducing ambiguity compared to previous representations (e.g., chroma), particularly for music with multiple tracks or a wide range of pitch values. To effectively balance the control signals from text and melody prompts, we adopt a curriculum learning strategy that progressively masks the melody prompt, resulting in a more stable training process. Experiments have been performed on text-to-music generation and music-style transfer tasks using open-source instrumental recording data. The results demonstrate that by extending StableAudio, a pre-trained text-controlled DiT model, our approach enables superior melody-controlled editing while retaining good text-to-music generation performance. These results outperform a strong MusicGen baseline in terms of both text-based generation and melody preservation for editing. Audio examples can be found at https://stable-audio-control.github.io. △ Less

Submitted 16 January, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

Comments: Accepted for publication at ICASSP 2025

arXiv:2410.00404 [pdf, other]

3DGR-CAR: Coronary artery reconstruction from ultra-sparse 2D X-ray views with a 3D Gaussians representation

Authors: Xueming Fu, Yingtai Li, Fenghe Tang, Jun Li, Mingyue Zhao, Gao-Jun Teng, S. Kevin Zhou

Abstract: Reconstructing 3D coronary arteries is important for coronary artery disease diagnosis, treatment planning and operation navigation. Traditional reconstruction techniques often require many projections, while reconstruction from sparse-view X-ray projections is a potential way of reducing radiation dose. However, the extreme sparsity of coronary arteries in a 3D volume and ultra-limited number of… ▽ More Reconstructing 3D coronary arteries is important for coronary artery disease diagnosis, treatment planning and operation navigation. Traditional reconstruction techniques often require many projections, while reconstruction from sparse-view X-ray projections is a potential way of reducing radiation dose. However, the extreme sparsity of coronary arteries in a 3D volume and ultra-limited number of projections pose significant challenges for efficient and accurate 3D reconstruction. To this end, we propose 3DGR-CAR, a 3D Gaussian Representation for Coronary Artery Reconstruction from ultra-sparse X-ray projections. We leverage 3D Gaussian representation to avoid the inefficiency caused by the extreme sparsity of coronary artery data and propose a Gaussian center predictor to overcome the noisy Gaussian initialization from ultra-sparse view projections. The proposed scheme enables fast and accurate 3D coronary artery reconstruction with only 2 views. Experimental results on two datasets indicate that the proposed approach significantly outperforms other methods in terms of voxel accuracy and visual quality of coronary arteries. The code will be available in https://github.com/windrise/3DGR-CAR. △ Less

Submitted 1 October, 2024; originally announced October 2024.

Comments: 10 pages, 5 figures, Accepted at MICCAI 2024

arXiv:2409.07488 [pdf, other]

Contrastive Learning-based User Identification with Limited Data on Smart Textiles

Authors: Yunkang Zhang, Ziyu Wu, Zhen Liang, Fangting Xie, Quan Wan, Mingjie Zhao, Xiaohui Cai

Abstract: Pressure-sensitive smart textiles are widely applied in the fields of healthcare, sports monitoring, and intelligent homes. The integration of devices embedded with pressure sensing arrays is expected to enable comprehensive scene coverage and multi-device integration. However, the implementation of identity recognition, a fundamental function in this context, relies on extensive device-specific d… ▽ More Pressure-sensitive smart textiles are widely applied in the fields of healthcare, sports monitoring, and intelligent homes. The integration of devices embedded with pressure sensing arrays is expected to enable comprehensive scene coverage and multi-device integration. However, the implementation of identity recognition, a fundamental function in this context, relies on extensive device-specific datasets due to variations in pressure distribution across different devices. To address this challenge, we propose a novel user identification method based on contrastive learning. We design two parallel branches to facilitate user identification on both new and existing devices respectively, employing supervised contrastive learning in the feature space to promote domain unification. When encountering new devices, extensive data collection efforts are not required; instead, user identification can be achieved using limited data consisting of only a few simple postures. Through experimentation with two 8-subject pressure datasets (BedPressure and ChrPressure), our proposed method demonstrates the capability to achieve user identification across 12 sitting scenarios using only a dataset containing 2 postures. Our average recognition accuracy reaches 79.05%, representing an improvement of 2.62% over the best baseline model. △ Less

Submitted 6 September, 2024; originally announced September 2024.

arXiv:2409.04719 [pdf, other]

Unrolling Plug-and-Play Network for Hyperspectral Unmixing

Authors: Min Zhao, Linruize Tang, Jie Chen

Abstract: Deep learning based unmixing methods have received great attention in recent years and achieve remarkable performance. These methods employ a data-driven approach to extract structure features from hyperspectral image, however, they tend to be less physical interpretable. Conventional unmixing methods are with much more interpretability, whereas they require manually designing regularization and c… ▽ More Deep learning based unmixing methods have received great attention in recent years and achieve remarkable performance. These methods employ a data-driven approach to extract structure features from hyperspectral image, however, they tend to be less physical interpretable. Conventional unmixing methods are with much more interpretability, whereas they require manually designing regularization and choosing penalty parameters. To overcome these limitations, we propose a novel unmixing method by unrolling the plug-and-play unmixing algorithm to conduct the deep architecture. Our method integrates both inner and outer priors. The carefully designed unfolding deep architecture is used to learn the spectral and spatial information from the hyperspectral image, which we refer to as inner priors. Additionally, our approach incorporates deep denoisers that have been pretrained on a large volume of image data to leverage the outer priors. Secondly, we design a dynamic convolution to model the multiscale information. Different scales are fused using an attention module. Experimental results of both synthetic and real datasets demonstrate that our method outperforms compared methods. △ Less

Submitted 7 September, 2024; originally announced September 2024.

arXiv:2408.02943 [pdf, other]

Recent Advances in Data-driven Intelligent Control for Wireless Communication: A Comprehensive Survey

Authors: Wei Huo, Huiwen Yang, Nachuan Yang, Zhaohua Yang, Jiuzhou Zhang, Fuhai Nan, Xingzhou Chen, Yifan Mao, Suyang Hu, Pengyu Wang, Xuanyu Zheng, Mingming Zhao, Ling Shi

Abstract: The advent of next-generation wireless communication systems heralds an era characterized by high data rates, low latency, massive connectivity, and superior energy efficiency. These systems necessitate innovative and adaptive strategies for resource allocation and device behavior control in wireless networks. Traditional optimization-based methods have been found inadequate in meeting the complex… ▽ More The advent of next-generation wireless communication systems heralds an era characterized by high data rates, low latency, massive connectivity, and superior energy efficiency. These systems necessitate innovative and adaptive strategies for resource allocation and device behavior control in wireless networks. Traditional optimization-based methods have been found inadequate in meeting the complex demands of these emerging systems. As the volume of data continues to escalate, the integration of data-driven methods has become indispensable for enabling adaptive and intelligent control mechanisms in future wireless communication systems. This comprehensive survey explores recent advancements in data-driven methodologies applied to wireless communication networks. It focuses on developments over the past five years and their application to various control objectives within wireless cyber-physical systems. It encompasses critical areas such as link adaptation, user scheduling, spectrum allocation, beam management, power control, and the co-design of communication and control systems. We provide an in-depth exploration of the technical underpinnings that support these data-driven approaches, including the algorithms, models, and frameworks developed to enhance network performance and efficiency. We also examine the challenges that current data-driven algorithms face, particularly in the context of the dynamic and heterogeneous nature of next-generation wireless networks. The paper provides a critical analysis of these challenges and offers insights into potential solutions and future research directions. This includes discussing the adaptability, integration with 6G, and security of data-driven methods in the face of increasing network complexity and data volume. △ Less

Submitted 6 August, 2024; originally announced August 2024.

arXiv:2407.21394 [pdf, other]

Force Sensing Guided Artery-Vein Segmentation via Sequential Ultrasound Images

Authors: Yimeng Geng, Gaofeng Meng, Mingcong Chen, Guanglin Cao, Mingyang Zhao, Jianbo Zhao, Hongbin Liu

Abstract: Accurate identification of arteries and veins in ultrasound images is crucial for vascular examinations and interventions in robotics-assisted surgeries. However, current methods for ultrasound vessel segmentation face challenges in distinguishing between arteries and veins due to their morphological similarities. To address this challenge, this study introduces a novel force sensing guided segmen… ▽ More Accurate identification of arteries and veins in ultrasound images is crucial for vascular examinations and interventions in robotics-assisted surgeries. However, current methods for ultrasound vessel segmentation face challenges in distinguishing between arteries and veins due to their morphological similarities. To address this challenge, this study introduces a novel force sensing guided segmentation approach to enhance artery-vein segmentation accuracy by leveraging their distinct deformability. Our proposed method utilizes force magnitude to identify key frames with the most significant vascular deformation in a sequence of ultrasound images. These key frames are then integrated with the current frame through attention mechanisms, with weights assigned in accordance with force magnitude. Our proposed force sensing guided framework can be seamlessly integrated into various segmentation networks and achieves significant performance improvements in multiple U-shaped networks such as U-Net, Swin-unet and Transunet. Furthermore, we contribute the first multimodal ultrasound artery-vein segmentation dataset, Mus-V, which encompasses both force and image data simultaneously. The dataset comprises 3114 ultrasound images of carotid and femoral vessels extracted from 105 videos, with corresponding force data recorded by the force sensor mounted on the US probe. Our code and dataset will be publicly available. △ Less

Submitted 31 July, 2024; originally announced July 2024.

arXiv:2407.05619 [pdf, other]

AIRA: A Low-cost IR-based Approach Towards Autonomous Precision Drone Landing and NLOS Indoor Navigation

Authors: Yanchen Liu, Minghui Zhao, Kaiyuan Hou, Junxi Xia, Charlie Carver, Stephen Xia, Xia Zhou, Xiaofan Jiang

Abstract: Automatic drone landing is an important step for achieving fully autonomous drones. Although there are many works that leverage GPS, video, wireless signals, and active acoustic sensing to perform precise landing, autonomous drone landing remains an unsolved challenge for palm-sized microdrones that may not be able to support the high computational requirements of vision, wireless, or active audio… ▽ More Automatic drone landing is an important step for achieving fully autonomous drones. Although there are many works that leverage GPS, video, wireless signals, and active acoustic sensing to perform precise landing, autonomous drone landing remains an unsolved challenge for palm-sized microdrones that may not be able to support the high computational requirements of vision, wireless, or active audio sensing. We propose AIRA, a low-cost infrared light-based platform that targets precise and efficient landing of low-resource microdrones. AIRA consists of an infrared light bulb at the landing station along with an energy efficient hardware photodiode (PD) sensing platform at the bottom of the drone. AIRA costs under 83 USD, while achieving comparable performance to existing vision-based methods at a fraction of the energy cost. AIRA requires only three PDs without any complex pattern recognition models to accurately land the drone, under $10$cm of error, from up to $11.1$ meters away, compared to camera-based methods that require recognizing complex markers using high resolution images with a range of only up to $1.2$ meters from the same height. Moreover, we demonstrate that AIRA can accurately guide drones in low light and partial non line of sight scenarios, which are difficult for traditional vision-based approaches. △ Less

Submitted 8 July, 2024; originally announced July 2024.

arXiv:2406.17672 [pdf, other]

SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

Authors: Marco Comunità, Zhi Zhong, Akira Takahashi, Shiqi Yang, Mengjie Zhao, Koichi Saito, Yukara Ikemiya, Takashi Shibuya, Shusuke Takahashi, Yuki Mitsufuji

Abstract: Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of mod… ▽ More Recent advances in generative models that iteratively synthesize audio clips sparked great success to text-to-audio synthesis (TTA), but with the cost of slow synthesis speed and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient due to hundreds of iterations required in the inference phase and large amount of model parameters. To address the challenges, we propose SpecMaskGIT, a light-weighted, efficient yet effective TTA model based on the masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10s audio clip by less than 16 iterations, an order-of-magnitude less than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models in the TTA benchmark, while being real-time with only 4 CPU cores or even 30x faster with a GPU. Next, built upon a latent space of Mel-spectrogram, SpecMaskGIT has a wider range of applications (e.g., the zero-shot bandwidth extension) than similar methods built on the latent wave domain. Moreover, we interpret SpecMaskGIT as a generative extension to previous discriminative audio masked Transformers, and shed light on its audio representation learning potential. We hope our work inspires the exploration of masked audio modeling toward further diverse scenarios. △ Less

Submitted 26 June, 2024; v1 submitted 25 June, 2024; originally announced June 2024.

Comments: 6 pages, 8 figures, 8 tables. Audio samples: https://zzaudio.github.io/SpecMaskGIT/index.html

arXiv:2406.08305 [pdf, other]

Large Language Model(LLM) assisted End-to-End Network Health Management based on Multi-Scale Semanticization

Authors: Fengxiao Tang, Xiaonan Wang, Xun Yuan, Linfeng Luo, Ming Zhao, Tianchi Huang, Nei Kato

Abstract: Network device and system health management is the foundation of modern network operations and maintenance. Traditional health management methods, relying on expert identification or simple rule-based algorithms, struggle to cope with the dynamic heterogeneous networks (DHNs) environment. Moreover, current state-of-the-art distributed anomaly detection methods, which utilize specific machine learn… ▽ More Network device and system health management is the foundation of modern network operations and maintenance. Traditional health management methods, relying on expert identification or simple rule-based algorithms, struggle to cope with the dynamic heterogeneous networks (DHNs) environment. Moreover, current state-of-the-art distributed anomaly detection methods, which utilize specific machine learning techniques, lack multi-scale adaptivity for heterogeneous device information, resulting in unsatisfactory diagnostic accuracy for DHNs. In this paper, we develop an LLM-assisted end-to-end intelligent network health management framework. The framework first proposes a Multi-Scale Semanticized Anomaly Detection Model (MSADM), incorporating semantic rule trees with an attention mechanism to address the multi-scale anomaly detection problem in DHNs. Secondly, a chain-of-thought-based large language model is embedded in downstream to adaptively analyze the fault detection results and produce an analysis report with detailed fault information and optimization strategies. Experimental results show that the accuracy of our proposed MSADM for heterogeneous network entity anomaly detection is as high as 91.31\%. △ Less

Submitted 2 March, 2025; v1 submitted 12 June, 2024; originally announced June 2024.

arXiv:2405.19516 [pdf, other]

Enabling Visual Recognition at Radio Frequency

Authors: Haowen Lai, Gaoxiang Luo, Yifei Liu, Mingmin Zhao

Abstract: This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. Pano… ▽ More This paper introduces PanoRadar, a novel RF imaging system that brings RF resolution close to that of LiDAR, while providing resilience against conditions challenging for optical signals. Our LiDAR-comparable 3D imaging results enable, for the first time, a variety of visual recognition tasks at radio frequency, including surface normal estimation, semantic segmentation, and object detection. PanoRadar utilizes a rotating single-chip mmWave radar, along with a combination of novel signal processing and machine learning algorithms, to create high-resolution 3D images of the surroundings. Our system accurately estimates robot motion, allowing for coherent imaging through a dense grid of synthetic antennas. It also exploits the high azimuth resolution to enhance elevation resolution using learning-based methods. Furthermore, PanoRadar tackles 3D learning via 2D convolutions and addresses challenges due to the unique characteristics of RF signals. Our results demonstrate PanoRadar's robust performance across 12 buildings. △ Less

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.16791 [pdf, ps, other]

Joint Node Selection and Resource Allocation Optimization for Cooperative Sensing with a Shared Wireless Backhaul

Authors: Mingxin Chen, Ming-Min Zhao, An Liu, Min Li, Qingjiang Shi

Abstract: In this paper, we consider a cooperative sensing framework in the context of future multi-functional network with both communication and sensing ability, where one base station (BS) serves as a sensing transmitter and several nearby BSs serve as sensing receivers. Each receiver receives the sensing signal reflected by the target and communicates with the fusion center (FC) through a wireless multi… ▽ More In this paper, we consider a cooperative sensing framework in the context of future multi-functional network with both communication and sensing ability, where one base station (BS) serves as a sensing transmitter and several nearby BSs serve as sensing receivers. Each receiver receives the sensing signal reflected by the target and communicates with the fusion center (FC) through a wireless multiple access channel (MAC) for cooperative target localization. To improve the localization performance, we present a hybrid information-signal domain cooperative sensing (HISDCS) design, where each sensing receiver transmits both the estimated time delay/effective reflecting coefficient and the received sensing signal sampled around the estimated time delay to the FC. Then, we propose to minimize the number of channel uses by utilizing an efficient Karhunen-Loéve transformation (KLT) encoding scheme for signal quantization and proper node selection, under the Cramér-Rao lower bound (CRLB) constraint and the capacity limits of MAC. A novel matrix-inequality constrained successive convex approximation (MCSCA) algorithm is proposed to optimize the wireless backhaul resource allocation, together with a greedy strategy for node selection. Despite the high non-convexness of the considered problem, we prove that the proposed MCSCA algorithm is able to converge to the set of Karush-Kuhn-Tucker (KKT) solutions of a relaxed problem obtained by relaxing the discrete variables. Besides, a low-complexity quantization bit reallocation algorithm is designed, which does not perform explicit node selection, and is able to harvest most of the performance gain brought by HISDCS. Finally, numerical simulations are presented to show that the proposed HISDCS design is able to significantly outperform the baseline schemes. △ Less

Submitted 16 December, 2024; v1 submitted 26 May, 2024; originally announced May 2024.

Comments: 16 pages, 12 figures

arXiv:2405.14598 [pdf, other]

Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Authors: Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

Abstract: In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation method… ▽ More In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation methods usually resort to huge large language model or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back showing a simple and lightweight generative transformer, which is not fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in the mask denoising manner. After training, the classifier-free guidance could be deployed off-the-shelf achieving better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it could also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/ △ Less

Submitted 24 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

Comments: 10 pages

arXiv:2405.12125 [pdf, other]

doi 10.1109/TRO.2023.3327634

Design, Control, and Motion-Planning for a Root-Perching Rotor-Distributed Manipulator

Authors: Takuzumi Nishio, Moju Zhao, Kei Okada, Masayuki Inaba

Abstract: Manipulation performance improvement is crucial for aerial robots. For aerial manipulators, the baselink position and attitude errors directly affect the precision at the end effector. To address this stability problem, fixed-body approaches such as perching on the environment using the rotor suction force are useful. Additionally, conventional arm-equipped multirotors, called rotor-concentrated m… ▽ More Manipulation performance improvement is crucial for aerial robots. For aerial manipulators, the baselink position and attitude errors directly affect the precision at the end effector. To address this stability problem, fixed-body approaches such as perching on the environment using the rotor suction force are useful. Additionally, conventional arm-equipped multirotors, called rotor-concentrated manipulators (RCMs), find it difficult to generate a large wrench at the end effector due to joint torque limitations. Using distributed rotors to each link, the thrust can support each link weight, decreasing the arm joints' torque. Based on this approach, rotor-distributed manipulators (RDMs) can increase feasible wrench and reachability of the end-effector. This paper introduces a minimal configuration of a rotor-distributed manipulator that can perch on surfaces, especially ceilings, using a part of their body. First, we design a minimal rotor-distributed arm considering the flight and end-effector performance. Second, a flight controller is proposed for this minimal RDM along with a perching controller adaptable for various types of aerial robots. Third, we propose a motion planning method based on inverse kinematics (IK), considering specific constraints to the proposed RDMs such as perching force. Finally, we evaluate flight and perching motions and \revise{confirm} that the proposed manipulator can significantly improve the manipulation performance. △ Less

Submitted 20 May, 2024; originally announced May 2024.

Comments: IEEE Transactions on Robotics (2023)

arXiv:2405.04027 [pdf, other]

Joint Visibility Region Detection and Channel Estimation for XL-MIMO Systems via Alternating MAP

Authors: Wenkang Xu, An Liu, Min-jian Zhao

Abstract: We investigate a joint visibility region (VR) detection and channel estimation problem in extremely large-scale multiple-input-multiple-output (XL-MIMO) systems, where near-field propagation and spatial non-stationary effects exist. In this case, each scatterer can only see a subset of antennas, i.e., it has a certain VR over the antennas. Because of the spatial correlation among adjacent sub-arra… ▽ More We investigate a joint visibility region (VR) detection and channel estimation problem in extremely large-scale multiple-input-multiple-output (XL-MIMO) systems, where near-field propagation and spatial non-stationary effects exist. In this case, each scatterer can only see a subset of antennas, i.e., it has a certain VR over the antennas. Because of the spatial correlation among adjacent sub-arrays, VR of scatterers exhibits a two-dimensional (2D) clustered sparsity. We design a 2D Markov prior model to capture such a structured sparsity. Based on this, a novel alternating maximum a posteriori (MAP) framework is developed for high-accuracy VR detection and channel estimation. The alternating MAP framework consists of three basic modules: a channel estimation module, a VR detection module, and a grid update module. Specifically, the first module is a low-complexity inverse-free variational Bayesian inference (IF-VBI) algorithm that avoids the matrix inverse via minimizing a relaxed Kullback-Leibler (KL) divergence. The second module is a structured expectation propagation (EP) algorithm which has the ability to deal with complicated prior information. And the third module refines polar-domain grid parameters via gradient ascent. Simulations demonstrate the superiority of the proposed algorithm in both VR detection and channel estimation. △ Less

Submitted 21 May, 2024; v1 submitted 7 May, 2024; originally announced May 2024.

Comments: 13 pages, 14 figures, submitted to IEEE TSP

arXiv:2405.01242 [pdf, other]

TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

Authors: Yueyuan Sui, Minghui Zhao, Junxi Xia, Xiaofan Jiang, Stephen Xia

Abstract: We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state of-art m… ▽ More We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state of-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speed up of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB. △ Less

Submitted 29 May, 2024; v1 submitted 2 May, 2024; originally announced May 2024.

arXiv:2403.10873 [pdf, other]

CSI Transfer From Sub-6G to mmWave: Reduced-Overhead Multi-User Hybrid Beamforming

Authors: Weicao Deng, Min Li, Ming-Min Zhao, Min-Jian Zhao, Osvaldo Simeone

Abstract: Hybrid beamforming is vital in modern wireless systems, especially for massive MIMO and millimeter-wave (mmWave) deployments, offering efficient directional transmission with reduced hardware complexity. However, effective beamforming in multi-user scenarios relies heavily on accurate channel state information, the acquisition of which often requires significant pilot overhead, degrading system pe… ▽ More Hybrid beamforming is vital in modern wireless systems, especially for massive MIMO and millimeter-wave (mmWave) deployments, offering efficient directional transmission with reduced hardware complexity. However, effective beamforming in multi-user scenarios relies heavily on accurate channel state information, the acquisition of which often requires significant pilot overhead, degrading system performance. To address this and inspired by the spatial congruence between sub-6GHz (sub-6G) and mmWave channels, we propose a Sub-6G information Aided Multi-User Hybrid Beamforming (SA-MUHBF) framework, avoiding excessive use of pilots at mmWave. SA-MUHBF employs a convolutional neural network to predict mmWave beamspace from sub-6G channel estimate, followed by a novel multi-layer graph neural network for analog beam selection and a linear minimum mean-square error algorithm for digital beamforming. Numerical results demonstrate that SA-MUHBF efficiently predicts the mmWave beamspace representation and achieves superior spectrum efficiency over state-of-the-art benchmarks. Moreover, SA-MUHBF demonstrates robust performance across varied sub-6G system configurations and exhibits strong generalization to unseen scenarios. △ Less

Submitted 14 November, 2024; v1 submitted 16 March, 2024; originally announced March 2024.

Comments: Accepted by IEEE JSAC NGAT

arXiv:2402.03042 [pdf, other]

Semi-Passive Intelligent Reflecting Surface Enabled Sensing Systems

Authors: Qiaoyan Peng, Qingqing Wu, Wen Chen, Shaodan Ma, Ming-Min Zhao, Octavia A. Dobre

Abstract: Intelligent reflecting surface (IRS) has garnered growing interest and attention due to its potential for facilitating and supporting wireless communications and sensing. This paper studies a semi-passive IRS-enabled sensing system, where an IRS consists of both passive reflecting elements and active sensors. Our goal is to minimize the Cramér-Rao bound (CRB) for parameter estimation under both po… ▽ More Intelligent reflecting surface (IRS) has garnered growing interest and attention due to its potential for facilitating and supporting wireless communications and sensing. This paper studies a semi-passive IRS-enabled sensing system, where an IRS consists of both passive reflecting elements and active sensors. Our goal is to minimize the Cramér-Rao bound (CRB) for parameter estimation under both point and extended target cases. Towards this goal, we begin by deriving the CRB for the direction-of-arrival (DoA) estimation in closed-form and then theoretically analyze the IRS reflecting elements and sensors allocation design based on the CRB under the point target case with a single-antenna base station (BS). To efficiently solve the corresponding optimization problem for the case with a multi-antenna BS, we propose an efficient algorithm by jointly optimizing the IRS phase shifts and the BS beamformers. Under the extended target case, the CRB for the target response matrix (TRM) estimation is minimized via the optimization of the BS transmit beamformers. Moreover, we explore the influence of various system parameters on the CRB and compare these effects to those observed under the point target case. Simulation results show the effectiveness of the semi-passive IRS and our proposed beamforming design for improving the performance of the sensing system. △ Less

Submitted 5 February, 2024; originally announced February 2024.

arXiv:2311.12745 [pdf, other]

Learn to Augment Network Simulators Towards Digital Network Twins

Authors: Yuru Zhang, Ming Zhao, Qiang Liu

Abstract: Digital network twin (DNT) is a promising paradigm to replicate real-world cellular networks toward continual assessment, proactive management, and what-if analysis. Existing discussions have been focusing on using only deep learning techniques to build DNTs, which raises widespread concerns regarding their generalization, explainability, and transparency. In this paper, we explore an alternative… ▽ More Digital network twin (DNT) is a promising paradigm to replicate real-world cellular networks toward continual assessment, proactive management, and what-if analysis. Existing discussions have been focusing on using only deep learning techniques to build DNTs, which raises widespread concerns regarding their generalization, explainability, and transparency. In this paper, we explore an alternative approach to augment network simulators with context-aware neural agents. The main challenge lies in the non-trivial simulation-to-reality (sim-to-real) discrepancy between offline simulators and real-world networks. To solve the challenge, we propose a new learn-to-bridge algorithm to cost-efficiently bridge the sim-to-real discrepancy in two alternative stages. In the first stage, we select states to query performances in real-world networks by using newly-designed cost-aware Bayesian optimization. In the second stage, we train the neural agent to learn the state context and bridge the probabilistic discrepancy based on Bayesian neural networks (BNN). In addition, we build a small-scale end-to-end network testbed based on OpenAirInterface RAN and Core with USRP B210 and a smartphone, and replicate the network in NS-3. The evaluation results show that, our proposed solution substantially outperforms existing methods, with more than 92\% reduction in the sim-to-real discrepancy. △ Less

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.08201 [pdf, other]

Joint Location Sensing and Channel Estimation for IRS-Aided mmWave ISAC Systems

Authors: Zijian Chen, Ming-Min Zhao, Min Li, Fan Xu, Qingqing Wu, Min-Jian Zhao

Abstract: In this paper, we investigate a self-sensing intelligent reflecting surface (IRS) aided millimeter wave (mmWave) integrated sensing and communication (ISAC) system. Unlike the conventional purely passive IRS, the self-sensing IRS can effectively reduce the path loss of sensing-related links, thus rendering it advantageous in ISAC systems. Aiming to jointly sense the target/scatterer/user positions… ▽ More In this paper, we investigate a self-sensing intelligent reflecting surface (IRS) aided millimeter wave (mmWave) integrated sensing and communication (ISAC) system. Unlike the conventional purely passive IRS, the self-sensing IRS can effectively reduce the path loss of sensing-related links, thus rendering it advantageous in ISAC systems. Aiming to jointly sense the target/scatterer/user positions as well as estimate the sensing and communication (SAC) channels in the considered system, we propose a two-phase transmission scheme, where the coarse and refined sensing/channel estimation (CE) results are respectively obtained in the first phase (using scanning-based IRS reflection coefficients) and second phase (using optimized IRS reflection coefficients). For each phase, an angle-based sensing turbo variational Bayesian inference (AS-TVBI) algorithm, which combines the VBI, messaging passing and expectation-maximization (EM) methods, is developed to solve the considered joint location sensing and CE problem. The proposed algorithm effectively exploits the partial overlapping structured (POS) sparsity and 2-dimensional (2D) block sparsity inherent in the SAC channels to enhance the overall performance. Based on the estimation results from the first phase, we formulate a Cramér-Rao bound (CRB) minimization problem for optimizing IRS reflection coefficients, and through proper reformulations, a low-complexity manifold-based optimization algorithm is proposed to solve this problem. Simulation results are provided to verify the superiority of the proposed transmission scheme and associated algorithms. △ Less

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2311.08188 [pdf, ps, other]

Fast List Decoding of High-Rate Polar Codes

Authors: Yang Lu, Ming-Min Zhao, Ming Lei, Min-Jian Zhao

Abstract: Due to the ability to provide superior error-correction performance, the successive cancellation list (SCL) algorithm is widely regarded as one of the most promising decoding algorithms for polar codes with short-to-moderate code lengths. However, the application of SCL decoding in low-latency communication scenarios is limited due to its sequential nature. To reduce the decoding latency, developi… ▽ More Due to the ability to provide superior error-correction performance, the successive cancellation list (SCL) algorithm is widely regarded as one of the most promising decoding algorithms for polar codes with short-to-moderate code lengths. However, the application of SCL decoding in low-latency communication scenarios is limited due to its sequential nature. To reduce the decoding latency, developing tailored fast and efficient list decoding algorithms of specific polar substituent codes (special nodes) is a promising solution. Recently, fast list decoding algorithms are proposed by considering special nodes with low code rates. Aiming to further speedup the SCL decoding, this paper presents fast list decoding algorithms for two types of high-rate special nodes, namely single-parity-check (SPC) nodes and sequence rate one or single-parity-check (SR1/SPC) nodes. In particular, we develop two classes of fast list decoding algorithms for these nodes, where the first class uses a sequential decoding procedure to yield decoding latency that is linear with the list size, and the second further parallelizes the decoding process by pre-determining the redundant candidate paths offline. Simulation results show that the proposed list decoding algorithms are able to achieve up to 70.7\% lower decoding latency than state-of-the-art fast SCL decoders, while exhibiting the same error-correction performance. △ Less

Submitted 14 November, 2023; originally announced November 2023.

Comments: 13 pages, 8 figures

arXiv:2310.13267 [pdf, other]

On the Language Encoder of Contrastive Cross-modal Models

Authors: Mengjie Zhao, Junya Ono, Zhi Zhong, Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Wei-Hsiang Liao, Takashi Shibuya, Hiromi Wakaki, Yuki Mitsufuji

Abstract: Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding… ▽ More Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder, which is the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. In contrast, AL pretraining benefits less from sentence embedding training, which may result from the limited amount of pretraining data. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment. △ Less

Submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.05382 [pdf, other]

A Stochastic Particle Variational Bayesian Inference Inspired Deep-Unfolding Network for Non-Convex Parameter Estimation

Authors: Zhixiang Hu, An Liu, Minjian Zhao

Abstract: Future wireless networks are envisioned to provide ubiquitous sensing services, which also gives rise to a substantial demand for high-dimensional non-convex parameter estimation, i.e., the associated likelihood function is non-convex and contains numerous local optima. Variational Bayesian inference (VBI) provides a powerful tool for modeling complex estimation problems and reasoning with prior i… ▽ More Future wireless networks are envisioned to provide ubiquitous sensing services, which also gives rise to a substantial demand for high-dimensional non-convex parameter estimation, i.e., the associated likelihood function is non-convex and contains numerous local optima. Variational Bayesian inference (VBI) provides a powerful tool for modeling complex estimation problems and reasoning with prior information, but poses a long-standing challenge on computing intractable posteriori distributions. Most existing variational methods generally rely on assumptions about specific distribution families to derive closed-form solutions, and are difficult to apply in high-dimensional, non-convex scenarios. Given these challenges, firstly, we propose a parallel stochastic particle variational Bayesian inference (PSPVBI) algorithm. Thanks to innovations such as particle approximation, additional updates of particle positions, and parallel stochastic successive convex approximation (PSSCA), PSPVBI can flexibly drive particles to fit the posteriori distribution with acceptable complexity, yielding high-precision estimates of the target parameters. Furthermore, additional speedup can be obtained by deep-unfolding (DU) the PSPVBI algorithm. Specifically, superior hyperparameters are learned to dramatically reduce the number of algorithmic iterations. In this PSPVBI-induced Deep-Unfolding Networks, some techniques related to gradient computation, data sub-sampling, differentiable sampling, and generalization ability are also employed to facilitate the practical deployment. Finally, we apply the LPSPVBI to solve several important parameter estimation problems in wireless sensing scenarios. Simulations indicate that the LPSPVBI algorithm outperforms existing solutions. △ Less

Submitted 8 October, 2023; originally announced October 2023.

arXiv:2309.04508 [pdf, other]

Spatial-Temporal Graph Attention Fuser for Calibration in IoT Air Pollution Monitoring Systems

Authors: Keivan Faghih Niresi, Mengjie Zhao, Hugo Bissig, Henri Baumann, Olga Fink

Abstract: The use of Internet of Things (IoT) sensors for air pollution monitoring has significantly increased, resulting in the deployment of low-cost sensors. Despite this advancement, accurately calibrating these sensors in uncontrolled environmental conditions remains a challenge. To address this, we propose a novel approach that leverages graph neural networks, specifically the graph attention network… ▽ More The use of Internet of Things (IoT) sensors for air pollution monitoring has significantly increased, resulting in the deployment of low-cost sensors. Despite this advancement, accurately calibrating these sensors in uncontrolled environmental conditions remains a challenge. To address this, we propose a novel approach that leverages graph neural networks, specifically the graph attention network module, to enhance the calibration process by fusing data from sensor arrays. Through our experiments, we demonstrate the effectiveness of our approach in significantly improving the calibration accuracy of sensors in IoT air pollution monitoring platforms. △ Less

Submitted 8 September, 2023; originally announced September 2023.

arXiv:2309.03114 [pdf, ps, other]

NUV-DoA: NUV Prior-based Bayesian Sparse Reconstruction with Spatial Filtering for Super-Resolution DoA Estimation

Authors: Mengyuan Zhao, Guy Revach, Tirza Routtenberg, Nir Shlezinger

Abstract: Achieving high-resolution Direction of Arrival (DoA) recovery typically requires high Signal to Noise Ratio (SNR) and a sufficiently large number of snapshots. This paper presents NUV-DoA algorithm, that augments Bayesian sparse reconstruction with spatial filtering for super-resolution DoA estimation. By modeling each direction on the azimuth's grid with the sparsity-promoting normal with unknown… ▽ More Achieving high-resolution Direction of Arrival (DoA) recovery typically requires high Signal to Noise Ratio (SNR) and a sufficiently large number of snapshots. This paper presents NUV-DoA algorithm, that augments Bayesian sparse reconstruction with spatial filtering for super-resolution DoA estimation. By modeling each direction on the azimuth's grid with the sparsity-promoting normal with unknown variance (NUV) prior, the non-convex optimization problem is reduced to iteratively reweighted least-squares under Gaussian distribution, where the mean of the snapshots is a sufficient statistic. This approach not only simplifies our solution but also accurately detects the DoAs. We utilize a hierarchical approach for interference cancellation in multi-source scenarios. Empirical evaluations show the superiority of NUV-DoA, especially in low SNRs, compared to alternative DoA estimators. △ Less

Submitted 25 December, 2023; v1 submitted 6 September, 2023; originally announced September 2023.

Comments: This paper has 5 pages including reference, 11 figures. This paper has been accepted to ICASSP 2024 - 2024 International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

arXiv:2308.13996 [pdf]

Improve in-situ life prediction and classification performance by capturing both the present state and evolution rate of battery aging

Authors: Mingyuan Zhao, Yongzhi Zhang

Abstract: This study develops a methodology by capturing both the battery aging state and degradation rate for improved life prediction performance. The aging state is indicated by six physical features of an equivalent circuit model that are extracted from the voltage relaxation data. And the degradation rate is captured by two features extracted from the differences between the voltage relaxation curves w… ▽ More This study develops a methodology by capturing both the battery aging state and degradation rate for improved life prediction performance. The aging state is indicated by six physical features of an equivalent circuit model that are extracted from the voltage relaxation data. And the degradation rate is captured by two features extracted from the differences between the voltage relaxation curves within a moving window (for life prediction), or the differences between the capacity vs. voltage curves at different cycles (for life classification). Two machine learning models, which are constructed based on Gaussian Processes, are used to describe the relationships between these physical features and battery lifetimes for the life prediction and classification, respectively. The methodology is validated with the aging data of 74 battery cells of three different types. Experimental results show that based on only 3-12 minutes' sampling data, the method with novel features predicts accurate battery lifetimes, with the prediction accuracy improved by up to 67.09% compared with the benchmark method. And the batteries are classified into three groups (long, medium, and short) with an overall accuracy larger than 90% based on only two adjacent cycles' information, enabling the highly efficient regrouping of retired batteries. △ Less

Submitted 26 August, 2023; originally announced August 2023.

Showing 1–50 of 155 results for author: Zhao, M