-
Networked pointing system: Bearing-only target localization and pointing control
Authors:
Shiyao Li,
Bo Zhu,
Yining Zhou,
Jie Ma,
Baoqing Yang,
Fenghua He
Abstract:
In the paper, we formulate the target-pointing consensus problem where the headings of agents are required to point at a common target. Only a few agents in the network can measure the bearing information of the target. A two-step solution consisting of a bearing-only estimator for target localization and a control law for target pointing is constructed to address this problem. Compared to the strong assumptions of existing works, we only require two agents not collinear with the target to ensure localizability. By introducing the concept of virtual fusion node, we prove that both the estimation error and the tracking error converge asymptotically to the origin. The video demonstration of the verification can be found at https://youtu.be/S9- eyofk1DY.
Submitted 23 June, 2025;
originally announced June 2025.
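The localizability condition in the abstract above (two agents not collinear with the target) can be illustrated with a planar triangulation. This is only a minimal sketch, not the estimator from the paper: the function name `triangulate`, the 2D setting, and the noise-free bearings are all assumptions made for illustration.

```python
import math

def triangulate(p1, theta1, p2, theta2):
    """Intersect two bearing rays in the plane (hypothetical helper).

    p1, p2 are agent positions (x, y); theta1, theta2 are bearings to the
    target in radians, measured from the x-axis. Noise-free bearings assumed.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    # Solve p1 + t1*d1 = p2 + t2*d2 for t1 using the 2D cross product.
    denom = d1[0] * d2[1] - d1[1] * d2[0]
    if abs(denom) < 1e-9:
        # Parallel bearing lines: the two agents are collinear with the target.
        raise ValueError("agents are collinear with the target")
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (rx * d2[1] - ry * d2[0]) / denom
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])

# Two agents on the x-axis observing a target at (2, 3).
estimate = triangulate((0.0, 0.0), math.atan2(3, 2), (4.0, 0.0), math.atan2(3, -2))
```

When both agents lie on one line through the target, the two bearing lines coincide and the intersection is undefined, which is why the sketch raises an error in that case.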
-
Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Intermediate Feature Distance
Authors:
Chun Liu,
Bingqian Zhu,
Tao Xu,
Zheng Zheng,
Zheng Li,
Wei Yang,
Zhigang Han,
Jiayao Wang
Abstract:
Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose security challenges to hyperspectral image (HSI) classification technologies based on DNNs. In the domain of natural images, numerous transfer-based adversarial attack methods have been studied. However, HSIs differ from natural images due to their high-dimensional and rich spectral information. Current research on HSI adversarial examples remains limited and faces challenges in fully utilizing the structural and feature information of images. To address these issues, this paper proposes a novel method to enhance the transferability of the adversarial examples for HSI classification models. First, while keeping the image structure unchanged, the proposed method randomly divides the image into blocks in both spatial and spectral dimensions. Then, various transformations are applied on a block by block basis to increase input diversity and mitigate overfitting. Second, a feature distancing loss targeting intermediate layers is designed, which measures the distance between the amplified features of the original examples and the features of the adversarial examples as the primary loss, while the output layer prediction serves as the auxiliary loss. This guides the perturbation to disrupt the features of the true class in adversarial examples, effectively enhancing transferability. Extensive experiments demonstrate that the adversarial examples generated by the proposed method achieve effective transferability to black-box models on two public HSI datasets. Furthermore, the method maintains robust attack performance even under defense strategies.
Submitted 12 June, 2025;
originally announced June 2025.
-
LymphAtlas- A Unified Multimodal Lymphoma Imaging Repository Delivering AI-Enhanced Diagnostic Insight
Authors:
Jiajun Ding,
Beiyao Zhu,
Xiaosheng Liu,
Lishen Zhang,
Zhao Liu
Abstract:
This study integrates PET metabolic information with CT anatomical structures to establish a 3D multimodal segmentation dataset for lymphoma based on whole-body FDG PET/CT examinations, which bridges the gap of the lack of standardised multimodal segmentation datasets in the field of haematological malignancies. We retrospectively collected 483 examination datasets acquired between March 2011 and May 2024, involving 220 patients (106 non-Hodgkin lymphoma, 42 Hodgkin lymphoma); all data underwent ethical review and were rigorously de-identified. Complete 3D structural information was preserved during data acquisition, preprocessing and annotation, and a high-quality dataset was constructed based on the nnUNet format. By systematic technical validation and evaluation of the preprocessing process, annotation quality and automatic segmentation algorithm, the deep learning model trained based on this dataset is verified to achieve accurate segmentation of lymphoma lesions in PET/CT images with high accuracy, good robustness and reproducibility, which proves the applicability and stability of this dataset in accurate segmentation and quantitative analysis. The deep fusion of PET/CT images achieved with this dataset not only significantly improves the accurate portrayal of the morphology, location and metabolic features of tumour lesions, but also provides solid data support for early diagnosis, clinical staging and personalized treatment, and promotes the development of automated image segmentation and precision medicine based on deep learning. The dataset and related resources are available at https://github.com/SuperD0122/LymphAtlas-.
Submitted 29 April, 2025;
originally announced April 2025.
-
PID-GM: PID Control with Gain Mapping
Authors:
Bo Zhu,
Wei Yu,
Hugh H. T. Liu
Abstract:
Proportional-Integral-Differential (PID) control is widely used in industrial control systems. However, up to now there are at least two open problems related to PID control. One is to have a comprehensive understanding of its robustness with respect to model uncertainties and disturbances. The other is to build intuitive, explicit and mathematically provable guidelines for PID gain tuning. In this paper, we introduce a simple nonlinear mapping to determine PID gains from three auxiliary parameters. By the mapping, PID control is shown to be equivalent to a new PD control (serving as a nominal control) plus an uncertainty and disturbance compensator (to recover the nominal performance). Then PID control can be understood, designed and tuned in a Two-Degree-of-Freedom (2-DoF) control framework. We discuss some basic properties of the mapping, including the existence, uniqueness and invertibility. Taking as an example the PID control applied to a general uncertain second-order plant, we prove by the singular perturbation theory that the closed-loop steady-state and transient performance depends explicitly on one auxiliary parameter which can be viewed as the virtual singular perturbation parameter (SPP) of PID control. All the three PID gains are monotonically decreasing functions of the SPP, indicating that the smaller the SPP is, the higher the PID gains are, and the better the robustness of PID control is. Simulation and experimental examples are provided to demonstrate the properties of the mapping as well as the effectiveness of the mapping-based PID gain tuning.
Submitted 21 April, 2025;
originally announced April 2025.
-
Bridging the Gap between Continuous and Informative Discrete Representations by Random Product Quantization
Authors:
Xueqing Li,
Zehan Li,
Boyu Zhu,
Ruihao Jing,
Jian Kang,
Jie Li,
Xiao-Lei Zhang,
Xuelong Li
Abstract:
Self-supervised learning has become a core technique in speech processing, but the high dimensionality of its representations makes discretization essential for improving efficiency. However, existing discretization methods still suffer from significant information loss, resulting in a notable performance gap compared to continuous representations. To overcome these limitations, we propose two quantization-based discretization methods: Product Quantization (PQ) and Random Product Quantization (RPQ). PQ partitions the original feature space into multiple subspaces and independently quantizes each sub-vector, producing a fused set of discrete units that retain diverse information from different subspaces, thus mitigating the loss associated with single-cluster quantization. RPQ further enhances representation diversity by randomly sampling a fixed proportion of feature dimensions multiple times to construct sub-vectors, thereby better capturing the variability in the data distribution. Theoretical analysis shows that RPQ reduces the correlation coefficient rho (where 0 <= rho <= 1) between sub-quantizers. Its quantization error is lower-bounded by the product of rho and epsilon-kms, where epsilon-kms denotes the quantization error of a single K-means quantizer. Experimental results on a combined dataset built from LibriSpeech and ML-SUPERB show that PQ and RPQ outperform standard K-means discretization, achieving relative improvements of 21.8 percent and 20.0 percent in WER on LibriSpeech, and 24.1 percent and 19.6 percent in CER on ML-SUPERB, respectively. Moreover, their performance is competitive with, and in some cases even surpasses, that of continuous SSL representations.
Submitted 7 April, 2025;
originally announced April 2025.
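The two discretization ideas described in the abstract above can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the pre-trained `codebooks` argument, contiguous equal-size sub-vectors for PQ, and sampling dimensions without replacement for RPQ are all assumptions.

```python
import random

def pq_encode(x, codebooks):
    """Product quantization: one nearest-centroid index per sub-vector.

    codebooks[j] is a list of centroids for the j-th subspace (assumed to be
    trained beforehand, e.g. by K-means, which is not shown here).
    """
    m = len(codebooks)
    if len(x) % m != 0:
        raise ValueError("feature dimension must be divisible by the number of subspaces")
    d = len(x) // m
    codes = []
    for j, book in enumerate(codebooks):
        sub = x[j * d:(j + 1) * d]
        # Squared Euclidean distance from the sub-vector to every centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in book]
        codes.append(dists.index(min(dists)))
    return codes

def rpq_subvectors(x, num_quantizers, proportion, seed=0):
    """Random PQ: draw a fixed proportion of dimensions once per sub-quantizer."""
    rng = random.Random(seed)
    k = max(1, int(len(x) * proportion))
    return [[x[i] for i in sorted(rng.sample(range(len(x)), k))]
            for _ in range(num_quantizers)]

toy_codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [5.0, 5.0]]]
codes = pq_encode([0.9, 1.2, 4.0, 4.5], toy_codebooks)
```

Each feature vector becomes a short tuple of indices rather than a single cluster index, which is the sense in which PQ keeps information from several subspaces.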
-
AudioSpa: Spatializing Sound Events with Text
Authors:
Linfeng Feng,
Lei Zhao,
Boyu Zhu,
Xiao-Lei Zhang,
Xuelong Li
Abstract:
Text-to-audio (TTA) systems have recently demonstrated strong performance in synthesizing monaural audio from text. However, the task of generating binaural spatial audio from text, which provides a more immersive auditory experience by incorporating the sense of spatiality, has not been explored yet. In this work, we introduce text-guided binaural audio generation. As an early effort, we focus on the scenario where a monaural reference audio is given additionally. The core problem is to associate specific sound events with their directions, thereby creating binaural spatial audio. The challenge lies in the complexity of textual descriptions and the limited availability of single-source sound event datasets. To address this, we propose AudioSpa, an end-to-end model that applies large language models to process both acoustic and textual information. We employ fusion multi-head attention (FMHA) to integrate text tokens, which enhances the generation capability of the multimodal learning. Additionally, we propose a binaural source localization model to assess the quality of the generated audio. Finally, we design a data augmentation strategy to generate diverse datasets, which enables the model to spatialize sound events across various spatial positions. Experimental results demonstrate that our model is able to put sounds at the specified locations accurately. It achieves competitive performance in both localization accuracy and signal distortion. Our demonstrations are available at https://linfeng-feng.github.io/AudioSpa-demo.
Submitted 16 February, 2025;
originally announced February 2025.
-
Line Spectral Analysis Using the G-Filter: An Atomic Norm Minimization Approach
Authors:
Bin Zhu
Abstract:
The area of spectral analysis has a traditional dichotomy between continuous spectra (spectral densities) which correspond to purely nondeterministic processes, and line spectra (Dirac impulses) which represent sinusoids. While the former case is important in the identification of discrete-time linear stochastic systems, the latter case is essential for the analysis and modeling of time series with notable applications in radar systems. In this paper, we develop a novel approach for line spectral estimation which combines ideas of Georgiou's filter banks (G-filters) and atomic norm minimization (ANM), a mainstream method for line spectral analysis in the last decade following the theory of compressed sensing. Such a combination is only possible because a Carathéodory–Fejér-type decomposition is available for the covariance matrix of the filter output. The ANM problem can be characterized via semidefinite programming which can be solved efficiently. As a consequence, our optimization theory can be seen as a substantial generalization of the standard ANM for line spectral estimation. Moreover, our ANM approach with a G-filter has significant advantages over subspace methods because it can work with just one output vector and without a priori knowledge about the number of sinusoids in the input. Simulation results show that our approach performs reasonably well under different signal-to-noise ratios when the G-filter is suitably designed.
Submitted 16 October, 2024;
originally announced October 2024.
-
When atomic norm meets the G-filter: A general framework for line spectral estimation
Authors:
Bin Zhu,
Jiale Tang
Abstract:
This paper proposes a novel approach for line spectral estimation which combines Georgiou's filter bank (G-filter) with atomic norm minimization (ANM). A key ingredient is a Carathéodory–Fejér-type decomposition for the covariance matrix of the filter output. The resulting optimization problem can be characterized via semidefinite programming and contains the standard ANM for line spectral estimation as a special case. Simulations show that our approach outperforms the standard ANM in terms of recovering the number of spectral lines when the signal-to-noise ratio is no lower than 0 dB and the G-filter is suitably designed.
Submitted 16 October, 2024;
originally announced October 2024.
-
OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
Authors:
Liuhan Chen,
Zongjian Li,
Bin Lin,
Bin Zhu,
Qian Wang,
Shenghai Yuan,
Xing Zhou,
Xinhua Cheng,
Li Yuan
Abstract:
Variational Autoencoder (VAE), compressing videos into latent representations, is a crucial preceding component of Latent Video Diffusion Models (LVDMs). With the same reconstruction quality, the more sufficient the VAE's compression for videos is, the more efficient the LVDMs are. However, most LVDMs utilize 2D image VAE, whose compression for videos is only in the spatial dimension and often ignored in the temporal dimension. How to conduct temporal compression for videos in a VAE to obtain more concise latent representations while promising accurate reconstruction is seldom explored. To fill this gap, we propose an omni-dimension compression VAE, named OD-VAE, which can temporally and spatially compress videos. Although OD-VAE's more sufficient compression brings a great challenge to video reconstruction, it can still achieve high reconstructed accuracy by our fine design. To obtain a better trade-off between video reconstruction quality and compression speed, four variants of OD-VAE are introduced and analyzed. In addition, a novel tail initialization is designed to train OD-VAE more efficiently, and a novel inference strategy is proposed to enable OD-VAE to handle videos of arbitrary length with limited GPU memory. Comprehensive experiments on video reconstruction and LVDM-based video generation demonstrate the effectiveness and efficiency of our proposed methods.
Submitted 9 September, 2024; v1 submitted 2 September, 2024;
originally announced September 2024.
-
Hypergraph-Aided Task-Resource Matching for Maximizing Value of Task Completion in Collaborative IoT Systems
Authors:
Botao Zhu,
Xianbin Wang
Abstract:
With the growing scale and intrinsic heterogeneity of Internet of Things (IoT) systems, distributed device collaboration becomes essential for effective task completion by dynamically utilizing limited communication and computing resources. However, the separated design and situation-agnostic operation of computing, communication and application layers create a fundamental challenge for rapid task-resource matching, which further deteriorates the overall task completion effectiveness. To overcome this challenge, we utilize hypergraph as a new tool to vertically unify computing, communication, and task aspects of IoT systems for an effective matching by accurately capturing the relationships between tasks and communication and computing resources. Specifically, a state-of-the-art task-resource matching hypergraph (TRM-hypergraph) model is proposed in this paper, which is used to effectively transform the process of allocating complex heterogeneous resources to convoluted tasks into a hypergraph matching problem. Taking into account computational complexity and storage, a game-theoretic hypergraph matching algorithm is proposed by treating the hypergraph matching problem as a non-cooperative multi-player clustering game. Numerical results demonstrate that the proposed TRM-hypergraph model achieves superior performance in matching tasks and resources compared with baseline algorithms.
Submitted 30 May, 2024;
originally announced May 2024.
-
Improved Soft-k-Means Clustering Algorithm for Balancing Energy Consumption in Wireless Sensor Networks
Authors:
Botao Zhu,
Ebrahim Bedeer,
Ha H. Nguyen,
Robert Barton,
Jerome Henry
Abstract:
Energy load balancing is an essential issue in designing wireless sensor networks (WSNs). Clustering techniques are utilized as energy-efficient methods to balance the network energy and prolong its lifetime. In this paper, we propose an improved soft-k-means (IS-k-means) clustering algorithm to balance the energy consumption of nodes in WSNs. First, we use the idea of "clustering by fast search and find of density peaks" (CFSFDP) and kernel density estimation (KDE) to improve the selection of the initial cluster centers of the soft-k-means clustering algorithm. Then, we utilize the flexibility of the soft-k-means and reassign member nodes considering their membership probabilities at the boundary of clusters to balance the number of nodes per cluster. Furthermore, the concept of multi-cluster heads is employed to balance the energy consumption within clusters. Extensive simulation results under different network scenarios demonstrate that for small-scale WSNs with single-hop transmission, the proposed algorithm can postpone the first node death, the death of half of the nodes, and the last node death on average when compared to various clustering algorithms from the literature.
Submitted 22 March, 2024;
originally announced March 2024.
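The membership probabilities mentioned in the abstract above are the core quantity of soft k-means in general. The following is a minimal sketch of the textbook soft assignment, not the IS-k-means algorithm itself: the function name and the stiffness parameter `beta` are assumptions, and the density-peak initialization and multi-cluster-head steps are not shown.

```python
import math

def soft_memberships(point, centers, beta=1.0):
    """Return the membership probability of `point` for each cluster center.

    Uses the standard soft-k-means weighting exp(-beta * squared distance),
    normalized to sum to 1. `beta` (stiffness) is a hypothetical parameter.
    """
    d2 = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    # Subtract the smallest distance before exponentiating for numerical stability.
    m = min(d2)
    weights = [math.exp(-beta * (d - m)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

# A node halfway between two centers belongs to both equally, so it is a
# natural candidate for reassignment when balancing cluster sizes.
boundary = soft_memberships((0.0, 0.0), [(1.0, 0.0), (-1.0, 0.0)])
```

Nodes whose largest membership is close to the others sit near a cluster boundary, and those are the ones that can be moved between clusters at little cost.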
-
LiDAR Point Cloud-based Multiple Vehicle Tracking with Probabilistic Measurement-Region Association
Authors:
Guanhua Ding,
Jianan Liu,
Yuxuan Xia,
Tao Huang,
Bing Zhu,
Jinping Sun
Abstract:
Multiple extended target tracking (ETT) has gained increasing attention due to the development of high-precision LiDAR and radar sensors in automotive applications. For LiDAR point cloud-based vehicle tracking, this paper presents a probabilistic measurement-region association (PMRA) ETT model, which can describe the complex measurement distribution by partitioning the target extent into different regions. The PMRA model overcomes the drawbacks of previous data-region association (DRA) models by eliminating the approximation error of constrained estimation and using continuous integrals to more reliably calculate the association probabilities. Furthermore, the PMRA model is integrated with the Poisson multi-Bernoulli mixture (PMBM) filter for tracking multiple vehicles. Simulation results illustrate the superior estimation accuracy of the proposed PMRA-PMBM filter in terms of both positions and extents of the vehicles compared with PMBM filters using the gamma Gaussian inverse Wishart and DRA implementations.
Submitted 18 May, 2024; v1 submitted 11 March, 2024;
originally announced March 2024.
-
ByteComposer: a Human-like Melody Composition Method based on Language Model Agent
Authors:
Xia Liang,
Xingjian Du,
Jiaju Lin,
Pei Zou,
Yuan Wan,
Bilei Zhu
Abstract:
Large Language Models (LLMs) have shown encouraging progress in multimodal understanding and generation tasks. However, how to design a human-aligned and interpretable melody composition system is still under-explored. To solve this problem, we propose ByteComposer, an agent framework emulating a human's creative pipeline in four separate steps: "Conception Analysis - Draft Composition - Self-Evaluation and Modification - Aesthetic Selection". This framework seamlessly blends the interactive and knowledge-understanding features of LLMs with existing symbolic music generation models, thereby achieving a melody composition agent comparable to human creators. We conduct extensive experiments on GPT4 and several open-source large language models, which substantiate our framework's effectiveness. Furthermore, professional music composers were engaged in multi-dimensional evaluations; the final results demonstrated that across various facets of music composition, the ByteComposer agent attains the level of a novice melody composer.
Submitted 6 March, 2024; v1 submitted 23 February, 2024;
originally announced February 2024.
-
MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
Authors:
Hang Zhao,
Yifei Xin,
Zhesong Yu,
Bilei Zhu,
Lu Lu,
Zejun Ma
Abstract:
In the realm of audio-language pre-training (ALP), the challenge of achieving cross-modal alignment is significant. Moreover, the integration of audio inputs with diverse distributions and task variations poses challenges in developing generic audio-language models. In this study, we present MINT, a novel ALP framework boosting audio-language models through multi-target pre-training and instruction tuning. MINT leverages the strength of frozen pre-trained audio encoders and large language models (LLMs) to improve audio-language pre-training, enabling effective transferability to both audio-text understanding and generation tasks. To address the modality gap, we introduce Bridge-Net, a trainable module that enhances cross-modality alignment and the model's ability to follow instructions for a variety of audio-text tasks. Bridge-Net is pivotal within MINT, initially enhancing audio-language representation learning through a multi-target pre-training approach. Subsequently, Bridge-Net further boosts audio-to-language generative learning by integrating a frozen language model with instruction tuning. This integration empowers MINT to extract features in a flexible and effective manner, specifically tailored to the provided instructions for diverse tasks. Experimental results demonstrate that MINT attains superior performance across various audio-language understanding and generation tasks, highlighting its robust generalization capabilities even in zero-shot scenarios.
Submitted 11 June, 2024; v1 submitted 12 February, 2024;
originally announced February 2024.
-
Enhancing Epileptic Seizure Detection with EEG Feature Embeddings
Authors:
Arman Zarei,
Bingzhao Zhu,
Mahsa Shoaran
Abstract:
Epilepsy is one of the most prevalent brain disorders that disrupts the lives of millions worldwide. For patients with drug-resistant seizures, there exist implantable devices capable of monitoring neural activity, promptly triggering neurostimulation to regulate seizures, or alerting patients to potential episodes. Next-generation seizure detection systems heavily rely on high-accuracy machine learning-based classifiers to detect the seizure onset. Here, we propose to enhance the seizure detection performance by learning informative embeddings of the EEG signal. We empirically demonstrate, for the first time, that converting raw EEG signals to appropriate embeddings can significantly boost the performance of seizure detection algorithms. Importantly, we show that feature embedding, which converts the raw EEG into an alternative representation, is beneficial for various machine learning models such as Logistic Regression, Multi-Layer Perceptron, Support Vector Machines, and Gradient Boosted Trees. The experiments were conducted on the CHB-MIT scalp EEG dataset. With the proposed EEG feature embeddings, we achieve significant improvements in sensitivity, specificity, and AUC score across multiple models. By employing this approach alongside an SVM classifier, we were able to attain state-of-the-art classification performance with a sensitivity of 100% and specificity of 99%, setting a new benchmark in the field.
Submitted 28 October, 2023;
originally announced October 2023.
-
Joint Music and Language Attention Models for Zero-shot Music Tagging
Authors:
Xingjian Du,
Zhesong Yu,
Jiaju Lin,
Bilei Zhu,
Qiuqiang Kong
Abstract:
Music tagging is a task to predict the tags of music recordings. However, previous music tagging research primarily focuses on closed-set music tagging tasks which cannot be generalized to new tags. In this work, we propose a zero-shot music tagging system modeled by a joint music and language attention (JMLA) model to address the open-set music tagging problem. The JMLA model consists of an audio encoder modeled by a pretrained masked autoencoder and a decoder modeled by a Falcon7B. We introduce a perceiver resampler to convert arbitrary-length audio into fixed-length embeddings. We introduce dense attention connections between encoder and decoder layers to improve the information flow between them. We collect a large-scale music and description dataset from the internet. We propose to use ChatGPT to convert the raw descriptions into formalized and diverse descriptions to train the JMLA models. Our proposed JMLA system achieves a zero-shot audio tagging accuracy of 64.82% on the GTZAN dataset, outperforming previous zero-shot systems, and achieves comparable results to previous systems on the FMA and the MagnaTagATune datasets.
Submitted 16 October, 2023;
originally announced October 2023.
-
Which Framework is Suitable for Online 3D Multi-Object Tracking for Autonomous Driving with Automotive 4D Imaging Radar?
Authors:
Jianan Liu,
Guanhua Ding,
Yuxuan Xia,
Jinping Sun,
Tao Huang,
Lihua Xie,
Bing Zhu
Abstract:
Online 3D multi-object tracking (MOT) has recently received significant research interest due to the expanding demand for 3D perception in advanced driver assistance systems (ADAS) and autonomous driving (AD). Among the existing 3D MOT frameworks for ADAS and AD, the conventional point object tracking (POT) framework using the tracking-by-detection (TBD) strategy has been well studied and accepted for LiDAR and 4D imaging radar point clouds. In contrast, extended object tracking (EOT), another important framework which accepts the joint-detection-and-tracking (JDT) strategy, has rarely been explored for online 3D MOT applications. This paper provides the first systematic investigation of the EOT framework for online 3D MOT in real-world ADAS and AD scenarios. Specifically, the widely accepted TBD-POT framework, the recently investigated JDT-EOT framework, and our proposed TBD-EOT framework are compared via extensive evaluations on two open source 4D imaging radar datasets: View-of-Delft and TJ4DRadSet. Experimental results demonstrate that the conventional TBD-POT framework remains preferable for online 3D MOT with high tracking performance and low computational complexity, while the proposed TBD-EOT framework has the potential to outperform it in certain situations. However, the results also show that the JDT-EOT framework encounters multiple problems and performs inadequately in evaluation scenarios. After analyzing the causes of these phenomena based on various evaluation metrics and visualizations, we provide possible guidelines to improve the performance of these MOT frameworks on real-world data. These provide the first benchmark and important insights for the future development of 4D imaging radar-based online 3D MOT algorithms.
Submitted 25 May, 2024; v1 submitted 12 September, 2023;
originally announced September 2023.
-
Motion Planning for Aerial Pick-and-Place based on Geometric Feasibility Constraints
Authors:
Huazi Cao,
Jiahao Shen,
Cunjia Liu,
Bo Zhu,
Shiyu Zhao
Abstract:
This paper studies the motion planning problem of the pick-and-place of an aerial manipulator that consists of a quadcopter flying base and a Delta arm. We propose a novel partially decoupled motion planning framework to solve this problem. Compared to the state-of-the-art approaches, the proposed one has two novel features. First, it does not suffer from increased computation in high-dimensional configuration spaces. That is because it calculates the trajectories of the quadcopter base and the end-effector separately in the Cartesian space based on proposed geometric feasibility constraints. The geometric feasibility constraints can ensure the resulting trajectories satisfy the aerial manipulator's geometry. Second, collision avoidance for the Delta arm is achieved through an iterative approach based on a pinhole mapping method, so that the feasible trajectory can be found in an efficient manner. The proposed approach is verified by three experiments on a real aerial manipulation platform. The experimental results show the effectiveness of the proposed method for the aerial pick-and-place task.
Submitted 8 June, 2023;
originally announced June 2023.
-
Fine-Tuning Language Models with Advantage-Induced Policy Alignment
Authors:
Banghua Zhu,
Hiteshi Sharma,
Felipe Vieira Frujeri,
Shi Dong,
Chenguang Zhu,
Michael I. Jordan,
Jiantao Jiao
Abstract:
Reinforcement learning from human feedback (RLHF) has emerged as a reliable approach to aligning large language models (LLMs) to human preferences. Among the plethora of RLHF techniques, proximal policy optimization (PPO) is one of the most widely used methods. Despite its popularity, however, PPO may suffer from mode collapse, instability, and poor sample efficiency. We show that these issues can be alleviated by a novel algorithm that we refer to as Advantage-Induced Policy Alignment (APA), which leverages a squared error loss function based on the estimated advantages. We demonstrate empirically that APA consistently outperforms PPO in language tasks by a large margin when a separate reward model is employed as the evaluator. In addition, compared with PPO, APA offers a more stable form of control over the deviation from the model's initial policy, ensuring that the model improves its performance without collapsing to deterministic output. In addition to empirical results, we also provide a theoretical justification supporting the design of our loss function.
Submitted 2 November, 2023; v1 submitted 3 June, 2023;
originally announced June 2023.
-
On Optimal Caching and Model Multiplexing for Large Model Inference
Authors:
Banghua Zhu,
Ying Sheng,
Lianmin Zheng,
Clark Barrett,
Michael I. Jordan,
Jiantao Jiao
Abstract:
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges. In particular, the large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we achieve optimal rates in both offline and online settings. Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to $50\times$ improvement over the baseline when the ratio between the maximum cost and minimum cost is $100$. Experiments on real datasets show a $4.3\times$ improvement in FLOPs over the baseline when the ratio for FLOPs is $10$, and a $1.8\times$ improvement in latency when the ratio for average latency is $1.85$.
Submitted 28 August, 2023; v1 submitted 3 June, 2023;
originally announced June 2023.
-
Doubly Robust Self-Training
Authors:
Banghua Zhu,
Mingyu Ding,
Philip Jacobson,
Ming Wu,
Wei Zhan,
Michael Jordan,
Jiantao Jiao
Abstract:
Self-training is an important technique for solving semi-supervised learning problems. It leverages unlabeled data by generating pseudo-labels and combining them with a limited labeled dataset for training. The effectiveness of self-training heavily relies on the accuracy of these pseudo-labels. In this paper, we introduce doubly robust self-training, a novel semi-supervised algorithm that provably balances between two extremes. When the pseudo-labels are entirely incorrect, our method reduces to a training process solely using labeled data. Conversely, when the pseudo-labels are completely accurate, our method transforms into a training process utilizing all pseudo-labeled data and labeled data, thus increasing the effective sample size. Through empirical evaluations on both the ImageNet dataset for image classification and the nuScenes autonomous driving dataset for 3D object detection, we demonstrate the superiority of the doubly robust loss over the standard self-training baseline.
Submitted 2 November, 2023; v1 submitted 31 May, 2023;
originally announced June 2023.
-
A Fast and Robust Camera-IMU Online Calibration Method For Localization System
Authors:
Xiaowen Tao,
Pengxiang Meng,
Bing Zhu,
Jian Zhao
Abstract:
Autonomous driving has spurred the development of sensor fusion techniques, which combine data from multiple sensors to improve system performance. In particular, a localization system based on sensor fusion, such as Visual Simultaneous Localization and Mapping (VSLAM), is an important component of environment perception and is the basis of decision-making and motion control for intelligent vehicles. The accuracy of the extrinsic calibration parameters between the camera and the IMU has a significant effect on positioning precision when running a VSLAM system. Existing methods are time-consuming because they rely on complex optimization, and they are sensitive to noise and outliers due to off-calibration, which can negatively impact system performance. To address these problems, this paper presents a fast and robust camera-IMU online calibration method based on space coordinate transformation constraints and singular value decomposition (SVD) techniques. First, constraint equations are constructed based on the equality of rotation and transformation matrices between camera frames and IMU coordinates at different moments. Second, the extrinsic parameters of the camera-IMU pair are solved using quaternion transformation and SVD techniques. Finally, the proposed method is validated on the ROS platform, where images from the camera and velocity, acceleration, and angular velocity data from the IMU are recorded in a ROS bag file. The results show that the proposed method achieves robust and reliable camera-IMU online calibration results with less time consumption and less uncertainty.
Submitted 14 May, 2023;
originally announced May 2023.
-
Uncertainty Estimation and Out-of-Distribution Detection for Deep Learning-Based Image Reconstruction using the Local Lipschitz
Authors:
Danyal F. Bhutto,
Bo Zhu,
Jeremiah Z. Liu,
Neha Koonjoo,
Hongwei B. Li,
Bruce R. Rosen,
Matthew S. Rosen
Abstract:
Accurate image reconstruction is at the heart of diagnostics in medical imaging. Supervised deep learning-based approaches have been investigated for solving inverse problems including image reconstruction. However, these trained models encounter unseen data distributions that are widely shifted from training data during deployment. Therefore, it is essential to assess whether a given input falls within the training data distribution for diagnostic purposes. Uncertainty estimation approaches exist but focus on providing an uncertainty map to radiologists, rather than assessing the training distribution fit. In this work, we propose a method based on the local Lipschitz-based metric to distinguish out-of-distribution images from in-distribution with an area under the curve of 99.94%. Empirically, we demonstrate a very strong relationship between the local Lipschitz value and mean absolute error (MAE), supported by a high Spearman's rank correlation coefficient of 0.8475, which determines the uncertainty estimation threshold for optimal model performance. Through the identification of false positives, the local Lipschitz and MAE relationship was used to guide data augmentation and reduce model uncertainty. Our study was validated using the AUTOMAP architecture for sensor-to-image Magnetic Resonance Imaging (MRI) reconstruction. We compare our proposed approach with baseline methods: Monte-Carlo dropout and deep ensembles, and further analysis included MRI denoising and Computed Tomography (CT) sparse-to-full view reconstruction using UNET architectures. We show that our approach is applicable to various architectures and learned functions, especially in the realm of medical image reconstruction, where preserving the diagnostic accuracy of reconstructed images remains paramount.
Submitted 1 December, 2023; v1 submitted 12 May, 2023;
originally announced May 2023.
-
ByteCover3: Accurate Cover Song Identification on Short Queries
Authors:
Xingjian Du,
Zijie Wang,
Xia Liang,
Huidong Liang,
Bilei Zhu,
Zejun Ma
Abstract:
Deep learning based methods have become a paradigm for cover song identification (CSI) in recent years, where the ByteCover systems have achieved state-of-the-art results on all the mainstream datasets of CSI. However, with the rise of short videos, many real-world applications require matching short music excerpts to full-length music tracks in the database, which is still under-explored and awaiting an industrial-level solution. In this paper, we upgrade the previous ByteCover systems to ByteCover3, which utilizes local features to further improve the identification performance for short music queries. ByteCover3 is designed with a local alignment loss (LAL) module and a two-stage feature retrieval pipeline, allowing the system to perform CSI in a more precise and efficient way. We evaluated ByteCover3 on multiple datasets with different benchmark settings, where ByteCover3 beat all the compared methods, including its previous versions.
Submitted 21 March, 2023;
originally announced March 2023.
-
On the Statistical Consistency of a Generalized Cepstral Estimator
Authors:
Bin Zhu,
Mattia Zorzi
Abstract:
We consider the problem of estimating the generalized cepstral coefficients of a stationary stochastic process or stationary multidimensional random field. It turns out that a naive version of the periodogram-based estimator for the generalized cepstral coefficients is not consistent. We propose a consistent estimator for those coefficients. Moreover, we show that the latter can be used to build a consistent estimator for a particular class of cascade linear stochastic systems.
Submitted 17 January, 2023;
originally announced January 2023.
-
Leveraging Modality-specific Representations for Audio-visual Speech Recognition via Reinforcement Learning
Authors:
Chen Chen,
Yuchen Hu,
Qiang Zhang,
Heqing Zou,
Beier Zhu,
Eng Siong Chng
Abstract:
Audio-visual speech recognition (AVSR) has achieved remarkable success in improving the noise robustness of speech recognition. Mainstream methods focus on fusing audio and visual inputs to obtain modality-invariant representations. However, such representations are prone to over-reliance on the audio modality, as it is much easier to recognize than the video modality in clean conditions. As a result, the AVSR model underestimates the importance of the visual stream in the face of noise corruption. To this end, we leverage visual modality-specific representations to provide stable complementary information for the AVSR task. Specifically, we propose a reinforcement learning (RL) based framework called MSRL, where the agent dynamically harmonizes modality-invariant and modality-specific representations in the auto-regressive decoding process. We customize a reward function directly related to task-specific metrics (i.e., word error rate), which encourages MSRL to effectively explore the optimal integration strategy. Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art performance in both clean and various noisy conditions. Furthermore, we demonstrate that the MSRL system generalizes better than other baselines when the test set contains unseen noises.
Submitted 2 February, 2023; v1 submitted 10 December, 2022;
originally announced December 2022.
-
EASpace: Enhanced Action Space for Policy Transfer
Authors:
Zheng Zhang,
Qingrui Zhang,
Bo Zhu,
Xiaohan Wang,
Tianjiang Hu
Abstract:
Formulating expert policies as macro actions promises to alleviate the long-horizon issue via structured exploration and efficient credit assignment. However, traditional option-based multi-policy transfer methods suffer from inefficient exploration of macro action's length and insufficient exploitation of useful long-duration macro actions. In this paper, a novel algorithm named EASpace (Enhanced Action Space) is proposed, which formulates macro actions in an alternative form to accelerate the learning process using multiple available sub-optimal expert policies. Specifically, EASpace formulates each expert policy into multiple macro actions with different execution times. All the macro actions are then integrated into the primitive action space directly. An intrinsic reward, which is proportional to the execution time of macro actions, is introduced to encourage the exploitation of useful macro actions. The corresponding learning rule that is similar to Intra-option Q-learning is employed to improve the data efficiency. Theoretical analysis is presented to show the convergence of the proposed learning rule. The efficiency of EASpace is illustrated by a grid-based game and a multi-agent pursuit problem. The proposed algorithm is also implemented in physical systems to validate its effectiveness.
Submitted 24 July, 2023; v1 submitted 7 December, 2022;
originally announced December 2022.
-
Control Lyapunov-Barrier Function Based Model Predictive Control for Stochastic Nonlinear Affine Systems
Authors:
Weijiang Zheng,
Bing Zhu
Abstract:
A stochastic model predictive control (MPC) framework is presented in this paper for nonlinear affine systems with stability and feasibility guarantees. We first introduce the concept of a stochastic control Lyapunov-barrier function (CLBF) and provide a method to construct a CLBF by combining an unconstrained control Lyapunov function (CLF) and control barrier functions. The unconstrained CLF is obtained from its corresponding semi-linear system through dynamic feedback linearization. Based on the constructed CLBF, we utilize a sampled-data MPC framework to deal with state and input constraints, and to analyze the stability of closed-loop systems. Moreover, event-triggering mechanisms are integrated into the MPC framework to improve performance during sampling intervals. The proposed CLBF-based stochastic MPC is validated via an obstacle avoidance example.
Submitted 26 June, 2023; v1 submitted 11 November, 2022;
originally announced November 2022.
-
Dead-beat model predictive control for discrete-time linear systems
Authors:
Bing Zhu
Abstract:
In this paper, model predictive control (MPC) strategies are proposed for dead-beat control of linear systems with and without state and control constraints. In unconstrained MPC, deadbeat performance can be guaranteed by setting the control horizon to the system dimension and adding a terminal equality constraint. It is proved that the unconstrained deadbeat MPC is equivalent to linear deadbeat control. The proposed constrained deadbeat MPC is designed by setting the control horizon equal to the system dimension and penalizing only the terminal cost. The recursive feasibility and deadbeat performance are proved theoretically.
Submitted 30 August, 2022;
originally announced August 2022.
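The horizon-equals-dimension idea in this abstract can be illustrated with a minimal sketch. It assumes a single-input, controllable discrete-time system $x_{k+1} = A x_k + B u_k$ with no state or input constraints; the matrices, the initial state, and the function names `deadbeat_inputs` and `simulate` are hypothetical and are not taken from the paper. Under these assumptions, a horizon of $n$ steps with the terminal equality constraint $x_n = 0$ leaves exactly one feasible input sequence, which solves a square linear system.

```python
import numpy as np

def deadbeat_inputs(A, B, x0):
    """Inputs u_0..u_{n-1} that drive x0 to the origin in n steps.

    Assumes a single-input controllable pair (A, B), so the matrix
    [A^{n-1}B, ..., AB, B] is square and invertible.
    """
    n = A.shape[0]
    # x_n = A^n x0 + sum_k A^{n-1-k} B u_k, so stack the columns A^{n-1-k} B.
    C = np.hstack([np.linalg.matrix_power(A, n - 1 - k) @ B for k in range(n)])
    # Terminal equality constraint x_n = 0.
    rhs = -np.linalg.matrix_power(A, n) @ x0
    return np.linalg.solve(C, rhs)

def simulate(A, B, x0, u):
    """Roll the system forward under the input sequence u."""
    x = x0.copy()
    for uk in u:
        x = A @ x + B[:, 0] * uk
    return x

# Hypothetical double-integrator-like example (illustrative values only).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.5], [1.0]])
x0 = np.array([2.0, -1.0])

u = deadbeat_inputs(A, B, x0)
x_final = simulate(A, B, x0, u)
print("inputs:", u)
print("state after n steps:", x_final)
```

The sketch covers only the unconstrained case; the constrained design described in the abstract would require an optimization solver instead of a single linear solve.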
-
Sampling Gaussian Stationary Random Fields: A Stochastic Realization Approach
Authors:
Bin Zhu,
Jiahao Liu,
Zhengshou Lai,
Tao Qian
Abstract:
Generating large-scale samples of stationary random fields is of great importance in fields such as geomaterial modeling and uncertainty quantification. Traditional methodologies based on covariance matrix decomposition are computationally expensive, a difficulty that becomes even more serious when the dimension of the random field is large. This paper proposes an efficient stochastic realization approach for sampling Gaussian stationary random fields from a systems and control point of view. Specifically, we take the exponential and Gaussian covariance functions as examples and make a decoupling assumption when there are multiple dimensions. Then a rational spectral density is constructed in each dimension using techniques from covariance extension, and the corresponding autoregressive moving-average (ARMA) model is obtained via spectral factorization. As a result, samples of the random field with a specific covariance function can be generated very efficiently in the space domain by implementing the ARMA recursion using a white noise input. Such a procedure is computationally cheap because the constructed ARMA model has a low order. Furthermore, the same method is integrated into multiscale simulations where interpolations of the generated samples are achieved when one zooms into finer scales. Both theoretical analysis and simulation results show that our approach performs favorably compared with covariance matrix decomposition methods.
Submitted 22 August, 2022;
originally announced August 2022.
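To make the "ARMA recursion driven by white noise" step concrete, here is a minimal one-dimensional sketch for the simplest special case. It is not the paper's covariance-extension construction: it only uses the standard fact that a first-order autoregression with coefficient $a = e^{-\Delta/\ell}$ has covariance $\sigma^2 e^{-|\tau|/\ell}$ at the grid points. The function name `ar1_exponential_field` and all parameter values are hypothetical.

```python
import numpy as np

def ar1_exponential_field(n, dx, corr_len, sigma=1.0, seed=0):
    """Sample n grid points of a zero-mean Gaussian field with
    exponential covariance sigma^2 * exp(-|tau| / corr_len).

    Uses the AR(1) recursion x[k] = a * x[k-1] + b * w[k], where w is
    standard white noise, a = exp(-dx / corr_len), and b is chosen so
    that the stationary variance stays equal to sigma^2.
    """
    rng = np.random.default_rng(seed)
    a = np.exp(-dx / corr_len)
    b = sigma * np.sqrt(1.0 - a**2)
    w = rng.standard_normal(n)
    x = np.empty(n)
    x[0] = sigma * w[0]  # start from the stationary distribution
    for k in range(1, n):
        x[k] = a * x[k - 1] + b * w[k]
    return x

# Illustrative values only: grid spacing 0.1, correlation length 1.0.
samples = ar1_exponential_field(n=200_000, dx=0.1, corr_len=1.0)
lag1 = np.corrcoef(samples[:-1], samples[1:])[0, 1]
print("empirical lag-1 correlation:", lag1)
print("target exp(-dx / corr_len): ", np.exp(-0.1))
```

The cost is linear in the number of grid points, in contrast to factorizing an $n \times n$ covariance matrix.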
-
GNN-PMB: A Simple but Effective Online 3D Multi-Object Tracker without Bells and Whistles
Authors:
Jianan Liu,
Liping Bai,
Yuxuan Xia,
Tao Huang,
Bing Zhu,
Qing-Long Han
Abstract:
Multi-object tracking (MOT) is among the crucial applications in modern advanced driver assistance systems (ADAS) and autonomous driving (AD) systems. The global nearest neighbor (GNN) filter, as the earliest random vector-based Bayesian tracking framework, has been adopted in most state-of-the-art trackers in the automotive industry. The development of random finite set (RFS) theory facilitates a mathematically rigorous treatment of the MOT problem, and different variants of RFS-based Bayesian filters have since been proposed. However, their effectiveness in real ADAS and AD applications is still an open problem. In this paper, it is demonstrated that the latest RFS-based Bayesian tracking framework could be superior to the typical random vector-based Bayesian tracking framework via a systematic comparative study of both traditional random vector-based Bayesian filters with rule-based heuristic track maintenance and RFS-based Bayesian filters on the nuScenes validation dataset. An RFS-based tracker, namely the Poisson multi-Bernoulli filter using the global nearest neighbor (GNN-PMB), is proposed for LiDAR-based MOT tasks. This GNN-PMB tracker is simple to use, and it achieves competitive results on the nuScenes dataset. Specifically, the proposed GNN-PMB tracker outperforms most state-of-the-art LiDAR-only trackers and LiDAR and camera fusion-based trackers, ranking $3^{rd}$ among all LiDAR-only trackers on the nuScenes 3D tracking challenge leaderboard at the time of submission.
Submitted 8 February, 2023; v1 submitted 21 June, 2022;
originally announced June 2022.
-
Brachial Plexus Nerve Trunk Segmentation Using Deep Learning: A Comparative Study with Doctors' Manual Segmentation
Authors:
Yu Wang,
Binbin Zhu,
Lingsi Kong,
Jianlin Wang,
Bin Gao,
Jianhua Wang,
Dingcheng Tian,
Yudong Yao
Abstract:
Ultrasound-guided nerve block anesthesia (UGNB) is a high-tech visual nerve block anesthesia method that allows the target nerve and its surrounding structures, the puncture needle's advancement, and the spread of local anesthetics to be observed in real time. The key in UGNB is nerve identification. With the help of deep learning methods, the automatic identification or segmentation of nerves can be realized, assisting doctors in completing nerve block anesthesia accurately and efficiently. Here, we establish a public dataset containing 320 ultrasound images of the brachial plexus (BP). Three experienced doctors jointly produce the BP segmentation ground truth and label brachial plexus trunks. We design a brachial plexus segmentation system (BPSegSys) based on deep learning. BPSegSys achieves experienced-doctor-level nerve identification performance in various experiments. We evaluate BPSegSys' performance in terms of intersection-over-union (IoU), a commonly used performance measure for segmentation experiments. Considering three dataset groups in our established public dataset, the IoU values of BPSegSys are 0.5238, 0.4715, and 0.5029, respectively, which exceed the corresponding values of 0.5205, 0.4704, and 0.4979 for experienced doctors. In addition, we show that BPSegSys can help doctors identify brachial plexus trunks more accurately, with an IoU improvement of up to 27%, which has significant clinical application value.
Submitted 17 May, 2022;
originally announced May 2022.
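For readers unfamiliar with the IoU measure used in this abstract, the following is a minimal sketch of how it can be computed for two binary segmentation masks. The function name `iou` and the toy masks are hypothetical; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the abstract.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Both masks are empty; treated as perfect agreement (assumed convention).
        return 1.0
    intersection = np.logical_and(pred, true).sum()
    return float(intersection / union)

# Toy 2x2 masks: one overlapping pixel, three pixels in the union.
predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
score = iou(predicted, ground_truth)
print("IoU:", score)
```

Dataset-level figures such as those quoted above would typically be averages of this per-image score, although the abstract does not specify the aggregation.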
-
NeuralTree: A 256-Channel 0.227-$μ$J/Class Versatile Neural Activity Classification and Closed-Loop Neuromodulation SoC
Authors:
Uisub Shin,
Cong Ding,
Bingzhao Zhu,
Yashwanth Vyza,
Alix Trouillet,
Emilie C. M. Revol,
Stéphanie P. Lacour,
Mahsa Shoaran
Abstract:
Closed-loop neural interfaces with on-chip machine learning can detect and suppress disease symptoms in neurological disorders or restore lost functions in paralyzed patients. While high-density neural recording can provide rich neural activity information for accurate disease-state detection, existing systems have low channel counts and poor scalability, which could limit their therapeutic efficacy. This work presents a highly scalable and versatile closed-loop neural interface SoC that can overcome these limitations. A 256-channel time-division multiplexed (TDM) front-end with a two-step fast-settling mixed-signal DC servo loop (DSL) is proposed to record high-spatial-resolution neural activity and perform channel-selective brain-state inference. A tree-structured neural network (NeuralTree) classification processor extracts a rich set of neural biomarkers in a patient- and disease-specific manner. Trained with an energy-aware learning algorithm, the NeuralTree classifier detects the symptoms of underlying disorders (e.g., epilepsy and movement disorders) at an optimal energy-accuracy tradeoff. A 16-channel high-voltage (HV) compliant neurostimulator closes the therapeutic loop by delivering charge-balanced biphasic current pulses to the brain. The proposed SoC was fabricated in 65-nm CMOS and achieved a 0.227-$μ$J/class energy efficiency in a compact area of 0.014mm$^2$/channel. The SoC was extensively verified on human electroencephalography (EEG) and intracranial EEG (iEEG) epilepsy datasets, obtaining 95.6%/94% sensitivity and 96.8%/96.9% specificity, respectively. In vivo neural recordings using soft $μ$ECoG arrays and multi-domain biomarker extraction were further performed on a rat model of epilepsy. In addition, for the first time in literature, on-chip classification of rest-state tremor in Parkinson's disease (PD) from human local field potentials (LFPs) was demonstrated.
Submitted 8 December, 2022; v1 submitted 12 May, 2022;
originally announced May 2022.
-
Fast and Arbitrary Beam Pattern Design for RIS-Assisted Terahertz Wireless Communication
Authors:
Jian Dang,
Zaichen Zhang,
Yewei Li,
Liang Wu,
Bingcheng Zhu,
Lei Wang
Abstract:
A reconfigurable intelligent surface (RIS) can assist terahertz wireless communication by restoring fragile line-of-sight links and facilitating beam steering. Arbitrary reflection beam patterns are desired to meet diverse requirements in different applications. This paper establishes a relationship between RIS beam pattern design and two-dimensional finite impulse response filter design and proposes a fast non-iterative algorithm to solve the problem. Simulations show that the proposed method outperforms the baseline method. Hence, it represents a promising solution for fast and arbitrary beam pattern design in RIS-assisted terahertz wireless communication.
Submitted 5 May, 2022;
originally announced May 2022.
-
Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
Authors:
Boqing Zhu,
Kele Xu,
Changjian Wang,
Zheng Qin,
Tao Sun,
Huaimin Wang,
Yuxing Peng
Abstract:
We present an approach to learn voice-face representations from talking face videos, without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, the positive pairs are constructed based on the natural correlation between audio clips and visual frames. However, this correlation might be weak or inaccurate in a large amount of real-world data, which introduces deviating positives into the contrastive paradigm. To address these issues, we propose cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods and resists the adverse effects of false negatives and deviating positives. On one hand, CMPC can learn intra-class invariance by constructing semantic-wise positives via unsupervised clustering in different modalities. On the other hand, by comparing the similarities of cross-modal instances with those of cross-modal prototypes, we dynamically recalibrate the unlearnable instances' contribution to the overall loss. Experiments show that the proposed approach outperforms state-of-the-art unsupervised methods on various voice-face association evaluation protocols. Additionally, in the low-shot supervision setting, our method also shows a significant improvement over previous instance-wise contrastive learning.
Submitted 26 May, 2022; v1 submitted 28 April, 2022;
originally announced April 2022.
-
Amplitude-Constrained Constellation and Reflection Pattern Designs for Directional Backscatter Communications Using Programmable Metasurface
Authors:
Wei Wang,
Bincheng Zhu,
Yongming Huang,
Wei Zhang
Abstract:
The large-scale reflector array of programmable metasurfaces is capable of increasing the power efficiency of backscatter communications via passive beamforming and thus has the potential to revolutionize the low-data-rate nature of backscatter communications. In this paper, we propose to design the power-efficient higher-order constellation and reflection pattern under the amplitude constraint brought by backscatter communications. For the constellation design, we adopt the amplitude and phase-shift keying (APSK) constellation and optimize the parameters of APSK such as ring number, ring radius, and inter-ring phase difference. Specifically, we derive closed-form solutions to the optimal ring radius and inter-ring phase difference for an arbitrary modulation order in the decomposed subproblems. For the reflection pattern design, we propose to optimize the passive beamforming vector by solving a multi-objective optimization problem that maximizes reflection power and guarantees beam homogenization within the angle range of interest. To solve the problem, we propose a constant-modulus power iteration method, which is proven to be monotonically increasing, to maximize the objective function in each iteration. Numerical results show that the proposed APSK constellation design and reflection pattern design outperform the existing modulation and beam pattern designs in programmable metasurface enabled backscatter communications.
Submitted 30 March, 2023; v1 submitted 8 March, 2022;
originally announced March 2022.
-
S3T: Self-Supervised Pre-training with Swin Transformer for Music Classification
Authors:
Hang Zhao,
Chen Zhang,
Belei Zhu,
Zejun Ma,
Kejun Zhang
Abstract:
In this paper, we propose S3T, a self-supervised pre-training method with Swin Transformer for music classification, aiming to learn meaningful music representations from massive easily accessible unlabeled music data. S3T introduces a momentum-based paradigm, MoCo, with Swin Transformer as its feature extractor to the music time-frequency domain. For better music representation learning, S3T contributes a music data augmentation pipeline and two specially designed pre-processors. To our knowledge, S3T is the first method combining the Swin Transformer with a self-supervised learning method for music classification. We evaluate S3T on music genre classification and music tagging tasks with linear classifiers trained on learned representations. Experimental results show that S3T outperforms the previous self-supervised method (CLMR) by 12.5 percent top-1 accuracy and 4.8 percent PR-AUC on the two tasks respectively, and also surpasses the task-specific state-of-the-art supervised methods. Besides, S3T shows advances in label efficiency: using only 10% labeled data, it exceeds CLMR with 100% labeled data on both tasks.
Submitted 21 February, 2022;
originally announced February 2022.
-
On Real-time Image Reconstruction with Neural Networks for MRI-guided Radiotherapy
Authors:
David E. J. Waddington,
Nicholas Hindley,
Neha Koonjoo,
Christopher Chiu,
Tess Reynolds,
Paul Z. Y. Liu,
Bo Zhu,
Danyal Bhutto,
Chiara Paganelli,
Paul J. Keall,
Matthew S. Rosen
Abstract:
MRI-guidance techniques that dynamically adapt radiation beams to follow tumor motion in real time will lead to more accurate cancer treatments and reduced collateral healthy tissue damage. The gold standard for reconstruction of undersampled MR data is compressed sensing (CS), which is computationally slow and limits the rate at which images can be available for real-time adaptation. Here, we demonstrate the use of automated transform by manifold approximation (AUTOMAP), a generalized framework that maps raw MR signal to the target image domain, to rapidly reconstruct images from undersampled radial k-space data. The AUTOMAP neural network was trained to reconstruct images from a golden-angle radial acquisition, a benchmark for motion-sensitive imaging, on lung cancer patient data and generic images from ImageNet. Model training was subsequently augmented with motion-encoded k-space data derived from videos in the YouTube-8M dataset to encourage motion-robust reconstruction. We find that AUTOMAP-reconstructed radial k-space has equivalent accuracy to CS but with much shorter processing times after initial fine-tuning on retrospectively acquired lung cancer patient data. Validation of motion-trained models with a virtual dynamic lung tumor phantom showed that the generalized motion properties learned from YouTube lead to improved target tracking accuracy. Our work shows that AUTOMAP can achieve real-time, accurate reconstruction of radial data. These findings imply that neural-network-based reconstruction is potentially superior to existing approaches for real-time image guidance applications.
Submitted 18 May, 2022; v1 submitted 9 February, 2022;
originally announced February 2022.
-
Robust Estimation for Nonparametric Families via Generative Adversarial Networks
Authors:
Banghua Zhu,
Jiantao Jiao,
Michael I. Jordan
Abstract:
We provide a general framework for designing Generative Adversarial Networks (GANs) to solve high dimensional robust statistics problems, which aim at estimating unknown parameters of the true distribution given adversarially corrupted samples. Prior work focuses on the problem of robust mean and covariance estimation when the true distribution lies in the family of Gaussian distributions or elliptical distributions, and analyzes depth or scoring rule based GAN losses for the problem. Our work extends these to robust mean estimation, second moment estimation, and robust linear regression when the true distribution only has bounded Orlicz norms, which includes the broad family of sub-Gaussian, sub-Exponential and bounded moment distributions. We also provide a different set of sufficient conditions for the GAN loss to work: we only require its induced distance function to be a cumulative density function of some light-tailed distribution, which is easily satisfied by neural networks with sigmoid activation. In terms of techniques, our proposed GAN losses can be viewed as a smoothed and generalized Kolmogorov-Smirnov distance, which overcomes the computational intractability of the original Kolmogorov-Smirnov distance used in the prior work.
Submitted 2 February, 2022;
originally announced February 2022.
-
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
Authors:
Ke Chen,
Xingjian Du,
Bilei Zhu,
Zejun Ma,
Taylor Berg-Kirkpatrick,
Shlomo Dubnov
Abstract:
Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memory and long training times, while relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class featuremaps, thus enabling the model to perform audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
Submitted 1 February, 2022;
originally announced February 2022.
-
Performance Analysis of Hybrid RF-Reconfigurable Intelligent Surfaces Assisted FSO Communication
Authors:
Haibo Wang,
Zaichen Zhang,
Bingcheng Zhu,
Yidi Zhang
Abstract:
Optical reconfigurable intelligent surface (ORIS) is an emerging technology that can achieve reconfigurable optical propagation environments by precisely adjusting signal's reflection and shape through a large number of passive reflecting elements. In this paper, we investigate the performance of an ORIS-assisted dual-hop hybrid radio frequency (RF) and free space optics (FSO) communication system. By jointly considering the physical models of ORIS, RF channel, atmospheric turbulence, and pointing error, the closed-form solutions of the system's precise outage probability, asymptotic outage probability and BER have been derived. It is shown through numerical results that the derivation results are accurate and the RF-FSO links with ORISs show a slightly worse performance than the traditional RF-FSO links. Based on theoretical analysis and simulation results, the system design and effect of each parameter have been discussed.
Submitted 21 January, 2022;
originally announced January 2022.
-
Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data
Authors:
Ke Chen,
Xingjian Du,
Bilei Zhu,
Zejun Ma,
Taylor Berg-Kirkpatrick,
Shlomo Dubnov
Abstract:
Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18, while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types that are held-out from training. The model achieves comparable Source-to-Distortion Ratio (SDR) performance to current supervised models in both cases.
Submitted 12 February, 2022; v1 submitted 15 December, 2021;
originally announced December 2021.
-
Joint Cluster Head Selection and Trajectory Planning in UAV-Aided IoT Networks by Reinforcement Learning with Sequential Model
Authors:
Botao Zhu,
Ebrahim Bedeer,
Ha H. Nguyen,
Robert Barton,
Jerome Henry
Abstract:
Employing unmanned aerial vehicles (UAVs) has attracted growing interest and emerged as the state-of-the-art technology for data collection in Internet-of-Things (IoT) networks. In this paper, with the objective of minimizing the total energy consumption of the UAV-IoT system, we formulate the problem of jointly designing the UAV's trajectory and selecting cluster heads in the IoT network as a constrained combinatorial optimization problem, which is classified as NP-hard and challenging to solve. We propose a novel deep reinforcement learning (DRL) with a sequential model strategy that can effectively learn the policy represented by a sequence-to-sequence neural network for the UAV's trajectory design in an unsupervised manner. Through extensive simulations, the obtained results show that the proposed DRL method can find the UAV's trajectory that requires much less energy consumption when compared to other baseline algorithms and achieves close-to-optimal performance. In addition, simulation results show that the model trained by our proposed DRL algorithm has an excellent generalization ability to larger problem sizes without the need to retrain the model.
Submitted 1 December, 2021;
originally announced December 2021.
-
Deep Instance Segmentation with Automotive Radar Detection Points
Authors:
Jianan Liu,
Weiyi Xiong,
Liping Bai,
Yuxuan Xia,
Tao Huang,
Wanli Ouyang,
Bing Zhu
Abstract:
Automotive radar provides reliable environmental perception in all-weather conditions with affordable cost, but it hardly supplies semantic and geometry information due to the sparsity of radar detection points. With the development of automotive radar technologies in recent years, instance segmentation becomes possible by using automotive radar. Its data contain contexts such as radar cross section and micro-Doppler effects, and sometimes can provide detection when the field of view is obscured. The outcome from instance segmentation could be potentially used as the input of trackers for tracking targets. The existing methods often utilize a clustering-based classification framework, which fits the need of real-time processing but has limited performance due to minimum information provided by sparse radar detection points. In this paper, we propose an efficient method based on clustering of estimated semantic information to achieve instance segmentation for the sparse radar detection points. In addition, we show that the performance of the proposed approach can be further enhanced by incorporating the visual multi-layer perceptron. The effectiveness of the proposed method is verified by experimental results on the popular RadarScenes dataset, achieving 89.53% mean coverage and 86.97% mean average precision with the IoU threshold of 0.5, which is superior to other approaches in the literature. More significantly, the consumed memory is around 1MB, and the inference time is less than 40ms, indicating that our proposed algorithm is storage and time efficient. These two criteria ensure the practicality of the proposed method in real-world systems.
Submitted 5 February, 2023; v1 submitted 4 October, 2021;
originally announced October 2021.
-
A Fast Robust Numerical Continuation Solver to a Two-Dimensional Spectral Estimation Problem
Authors:
Bin Zhu,
Jiahao Liu
Abstract:
This paper presents a fast algorithm to solve a spectral estimation problem for two-dimensional random fields. The latter is formulated as a convex optimization problem with the Itakura-Saito pseudodistance as the objective function subject to the constraints of moment equations. We exploit the structure of the Hessian of the dual objective function in order to make possible a fast Newton solver. Then we incorporate the Newton solver into a predictor-corrector numerical continuation method which is able to produce a parametrized family of solutions to the moment equations. We have performed two sets of numerical simulations to test our algorithm and spectral estimator. The simulations on the frequency estimation problem show that our spectral estimator outperforms the classical windowed periodograms in the case of two hidden frequencies and has a higher resolution. The other set of simulations on system identification indicates that the numerical continuation method is more robust than Newton's method alone in ill-conditioned instances.
Submitted 30 September, 2021;
originally announced September 2021.
-
Closed-Loop Neural Prostheses with On-Chip Intelligence: A Review and A Low-Latency Machine Learning Model for Brain State Detection
Authors:
Bingzhao Zhu,
Uisub Shin,
Mahsa Shoaran
Abstract:
The application of closed-loop approaches in systems neuroscience and therapeutic stimulation holds great promise for revolutionizing our understanding of the brain and for developing novel neuromodulation therapies to restore lost functions. Neural prostheses capable of multi-channel neural recording, on-site signal processing, rapid symptom detection, and closed-loop stimulation are critical to enabling such novel treatments. However, the existing closed-loop neuromodulation devices are too simplistic and lack sufficient on-chip processing and intelligence. In this paper, we first discuss both commercial and investigational closed-loop neuromodulation devices for brain disorders. Next, we review state-of-the-art neural prostheses with on-chip machine learning, focusing on application-specific integrated circuits (ASIC). System requirements, performance and hardware comparisons, design trade-offs, and hardware optimization techniques are discussed. To facilitate a fair comparison and guide design choices among various on-chip classifiers, we propose a new energy-area (E-A) efficiency figure of merit that evaluates hardware efficiency and multi-channel scalability. Finally, we present several techniques to improve the key design metrics of tree-based on-chip classifiers, both in the context of ensemble methods and oblique structures.
Submitted 13 September, 2021;
originally announced September 2021.
-
A Novel Method to Estimate the Coordinates of LEDs in Wireless Optical Positioning Systems
Authors:
Kehan Zhang,
Zaichen Zhang,
Bingcheng Zhu
Abstract:
Traditional visible light positioning (VLP) systems estimate receivers' coordinates based on the known light-emitting diode (LED) coordinates. However, the LED coordinates are not always known accurately. Because of the structural changes of the buildings due to temperature, humidity or material aging, even measured by highly accurate laser range finders, the LED coordinates may change unpredictably. In this paper, we propose an easy and low-cost method to update the position information of the LEDs. We use two optical angle-of-arrival (AOA) estimators to detect the beam directions of the LEDs. Each AOA estimator has four differently oriented photodiodes (PDs). Considering the additive noises of the PDs, we derive the closed-form error expression for the proposed LED coordinates estimator. Both analytical and Monte Carlo experimental results show that the layout of the AOA estimators could affect the estimation error. These results may provide intuitive insights for the design of the optical indoor positioning systems.
Submitted 8 September, 2021;
originally announced September 2021.
-
Outage Analysis and Beamwidth Optimization for Positioning-Assisted Beamforming
Authors:
Bingcheng Zhu,
Zaichen Zhang,
Julian Cheng
Abstract:
Conventional beamforming is based on channel estimation, which can be computationally intensive and inaccurate when the antenna array is large. In this work, we study the outage probability of positioning-assisted beamforming systems. Closed-form outage probability bounds are derived by considering positioning error, link distance and beamwidth. Based on the analytical result, we show that the beamwidth should be optimized with respect to the link distance and the transmit power, and such optimization significantly suppresses the outage probability.
Submitted 9 April, 2022; v1 submitted 1 September, 2021;
originally announced September 2021.
-
UAV Trajectory Planning in Wireless Sensor Networks for Energy Consumption Minimization by Deep Reinforcement Learning
Authors:
Botao Zhu,
Ebrahim Bedeer,
Ha H. Nguyen,
Robert Barton,
Jerome Henry
Abstract:
Unmanned aerial vehicles (UAVs) have emerged as a promising candidate solution for data collection of large-scale wireless sensor networks (WSNs). In this paper, we investigate a UAV-aided WSN, where cluster heads (CHs) receive data from their member nodes, and a UAV is dispatched to collect data from CHs along the planned trajectory. We aim to minimize the total energy consumption of the UAV-WSN system in a complete round of data collection. Toward this end, we formulate the energy consumption minimization problem as a constrained combinatorial optimization problem by jointly selecting CHs from nodes within clusters and planning the UAV's visiting order to the selected CHs. The formulated energy consumption minimization problem is NP-hard, and hence, hard to solve optimally. In order to tackle this challenge, we propose a novel deep reinforcement learning (DRL) technique, pointer network-A* (Ptr-A*), which can efficiently learn from experiences the UAV trajectory policy for minimizing the energy consumption. The UAV's start point and the WSN with a set of pre-determined clusters are fed into the Ptr-A*, and the Ptr-A* outputs a group of CHs and the visiting order to these CHs, i.e., the UAV's trajectory. The parameters of the Ptr-A* are trained on small-scale clusters problem instances for faster training by using the actor-critic algorithm in an unsupervised manner. At inference, three search strategies are also proposed to improve the quality of solutions. Simulation results show that the trained models based on 20-clusters and 40-clusters have a good generalization ability to solve the UAV's trajectory planning problem in WSNs with different numbers of clusters, without the need to retrain the models. Furthermore, the results show that our proposed DRL algorithm outperforms two baseline techniques.
Submitted 31 July, 2021;
originally announced August 2021.
-
Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams
Authors:
Yuanbo Hou,
Zhesong Yu,
Xia Liang,
Xingjian Du,
Bilei Zhu,
Zejun Ma,
Dick Botteldooren
Abstract:
Many previous audio-visual voice-related works focus on speech, ignoring the singing voice in the growing number of musical video streams on the Internet. For processing diverse musical video data, voice activity detection is a necessary step. This paper attempts to detect the speech and singing voices of target performers in musical video streams using audiovisual information. To integrate information of audio and visual modalities, a multi-branch network is proposed to learn audio and image representations, and the representations are fused by attention based on semantic similarity to shape the acoustic representations through the probability of anchor vocalization. Experiments show the proposed audio-visual multi-branch network far outperforms the audio-only model in challenging acoustic environments, indicating the cross-modal information fusion based on semantic correlation is sensible and successful.
Submitted 21 June, 2021;
originally announced June 2021.